tokenizers

Fast, Consistent Tokenization of Natural Language Text

v0.3.0 · Dec 22, 2022 · MIT + file LICENSE

Description

Convert natural language text into tokens. Includes tokenizers for shingled n-grams, skip n-grams, words, word stems, sentences, paragraphs, characters, shingled characters, lines, Penn Treebank, regular expressions, as well as functions for counting characters, words, and sentences, and a function for splitting longer texts into separate documents, each with the same number of words. The tokenizers have a consistent interface, and the package is built on the 'stringi' and 'Rcpp' packages for fast yet correct tokenization in 'UTF-8'.
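
As a rough illustration of the consistent interface described above, the sketch below runs a few of the tokenizers and counting functions on one short string. It assumes the package is installed from CRAN; the sample sentence and variable names are made up for this example, and only default arguments (plus `n` for n-grams) are used.

```r
# Minimal sketch: assumes install.packages("tokenizers") has been run.
library(tokenizers)

# Illustrative input; any character vector works, one element per document.
text <- "The quick brown fox jumps. It lands softly!"

# Each tokenizer takes a character vector and, by default, returns a list
# with one character vector of tokens per input document.
words     <- tokenize_words(text)          # lowercased, punctuation stripped
bigrams   <- tokenize_ngrams(text, n = 2)  # shingled word 2-grams
sentences <- tokenize_sentences(text)      # split on sentence boundaries

# Counting helpers return one count per input document.
n_words     <- count_words(text)
n_sentences <- count_sentences(text)

print(words[[1]])
print(sentences[[1]])
```

Because every tokenizer returns the same list shape, the results can be passed to `lapply()` or `vapply()` without special-casing which tokenizer produced them.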

Downloads

40.7K · Last 30 days · 577th

115.2K · Last 90 days

516.4K · Last year

Trend: +6.3% (30d vs prior 30d)

CRAN Check Status

2 NOTE
12 OK
Flavor Status
r-devel-linux-x86_64-debian-clang NOTE
r-devel-linux-x86_64-debian-gcc NOTE
r-devel-linux-x86_64-fedora-clang OK
r-devel-linux-x86_64-fedora-gcc OK
r-devel-macos-arm64 OK
r-devel-windows-x86_64 OK
r-oldrel-macos-arm64 OK
r-oldrel-macos-x86_64 OK
r-oldrel-windows-x86_64 OK
r-patched-linux-x86_64 OK
r-release-linux-x86_64 OK
r-release-macos-arm64 OK
r-release-macos-x86_64 OK
r-release-windows-x86_64 OK
Check details (2 non-OK)
NOTE r-devel-linux-x86_64-debian-clang, r-devel-linux-x86_64-debian-gcc (identical on both flavors)

CRAN incoming feasibility

Maintainer: ‘Lincoln Mullen <lincoln@lincolnmullen.com>’

Package CITATION file contains call(s) to old-style personList() or
as.personList().  Please use c() on person objects instead.
Package CITATION file contains call(s) to old-style citEntry().  Please
use bibentry() instead.
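
The NOTE asks for two mechanical changes in the CITATION file: build the author list with c() on person objects, and create the entry with bibentry() rather than citEntry(). The sketch below shows what such an entry might look like. It is not the package's actual CITATION file: the author list is reduced to the maintainer named in the check output, and the title, year, and note fields are assembled from details on this page purely for illustration.

```r
# Hypothetical modernised CITATION entry; field values are illustrative
# and the real file may list more authors and different fields.
library(utils)

# Old style: personList(person(...), ...); new style: c() on person objects.
authors <- c(
  person("Lincoln", "Mullen", email = "lincoln@lincolnmullen.com")
)

# Old style: citEntry(entry = "Manual", ...); new style: bibentry(bibtype = ...).
entry <- bibentry(
  bibtype = "Manual",
  title   = "tokenizers: Fast, Consistent Tokenization of Natural Language Text",
  author  = authors,
  year    = "2022",
  note    = "R package version 0.3.0"
)

print(entry, style = "text")
```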

Check History

NOTE 12 OK · 2 NOTE · 0 WARNING · 0 ERROR · 0 FAILURE Mar 10, 2026

Reverse Dependencies (15)

Dependency Network

Dependencies: stringi, Rcpp, SnowballC

Reverse dependencies: DramaAnalysis, WhatsR, blocking, covfefe, deeplr, pdfsearch, proustr, rslp, textrecipes, tidypmc, tidytext, wactor, edgarWebR, sumup, torchdatasets

Version History

new 0.3.0 Mar 10, 2026
updated 0.3.0 ← 0.2.3 Dec 21, 2022
updated 0.2.3 ← 0.2.1 Sep 22, 2022
updated 0.2.1 ← 0.2.0 Mar 28, 2018
updated 0.2.0 ← 0.1.4 Mar 20, 2018
updated 0.1.4 ← 0.1.3 Aug 28, 2016
updated 0.1.3 ← 0.1.2 Aug 17, 2016
updated 0.1.2 ← 0.1.1 Apr 13, 2016
updated 0.1.1 ← 0.1.0 Apr 3, 2016
new 0.1.0 Apr 1, 2016