sentencepiece

Text Tokenization using Byte Pair Encoding and Unigram Modelling

v0.2.5 · Feb 9, 2026 · MPL-2.0

Description

Unsupervised text tokenizer that performs byte pair encoding and unigram modelling. Wraps the 'sentencepiece' library <https://github.com/google/sentencepiece>, which provides a language-independent tokenizer that splits text into words and smaller subword units. The techniques are explained in the paper "SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing" by Taku Kudo and John Richardson (2018) <doi:10.18653/v1/D18-2012>. Also provides straightforward access to pretrained byte pair encoding models and subword embeddings trained on Wikipedia using 'word2vec', as described in "BPEmb: Tokenization-free Pre-trained Subword Embeddings in 275 Languages" by Benjamin Heinzerling and Michael Strube (2018) <http://www.lrec-conf.org/proceedings/lrec2018/pdf/1049.pdf>.
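
To give a feel for what byte pair encoding does, here is a minimal toy sketch in Python. It is not the package's R interface and not how the 'sentencepiece' library is implemented. The function names (`learn_bpe`, `merge_pair`, `encode`) and the tiny corpus are hypothetical, chosen only for illustration. The sketch assumes the input is already split into words, whereas SentencePiece works on raw text, and it does not cover unigram modelling at all.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words (toy version)."""
    # Each distinct word starts as a tuple of single characters, with its frequency.
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by how often the word occurs.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Pick the most frequent pair; ties are broken by the pair itself
        # so the result is deterministic.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def encode(word, merges):
    """Split `word` into subword units by applying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


corpus = ["low", "low", "lower", "lowest"]  # hypothetical toy corpus
merges = learn_bpe(corpus, num_merges=2)
print(merges)                     # [('o', 'w'), ('l', 'ow')]
print(encode("lowest", merges))   # ['low', 'e', 's', 't']
```

Frequent character sequences such as `low` end up as single units, while rarer endings stay split into smaller pieces, which is the sense in which the tokenizer produces "words and smaller subword units".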

Downloads

Last 30 days: 355 (10792nd)
Last 90 days: 1.6K
Last year: 15.8K
Trend: -33.2% (30d vs prior 30d)

CRAN Check Status

3 NOTE
11 OK
Flavor Status
r-devel-linux-x86_64-debian-clang OK
r-devel-linux-x86_64-debian-gcc OK
r-devel-linux-x86_64-fedora-clang OK
r-devel-linux-x86_64-fedora-gcc OK
r-devel-macos-arm64 OK
r-devel-windows-x86_64 OK
r-oldrel-macos-arm64 NOTE
r-oldrel-macos-x86_64 NOTE
r-oldrel-windows-x86_64 NOTE
r-patched-linux-x86_64 OK
r-release-linux-x86_64 OK
r-release-macos-arm64 OK
r-release-macos-x86_64 OK
r-release-windows-x86_64 OK
Check details (3 non-OK)
NOTE r-oldrel-macos-arm64

installed package size

installed size is 21.6Mb
  sub-directories of 1Mb or more:
    libs    19.9Mb
    models   1.6Mb
NOTE r-oldrel-macos-x86_64

installed package size

installed size is 23.1Mb
  sub-directories of 1Mb or more:
    libs    21.3Mb
    models   1.6Mb
NOTE r-oldrel-windows-x86_64

installed package size

installed size is  5.0Mb
  sub-directories of 1Mb or more:
    libs     3.3Mb
    models   1.6Mb

Check History

NOTE 11 OK · 3 NOTE · 0 WARNING · 0 ERROR · 0 FAILURE Mar 10, 2026

Reverse Dependencies (1)

suggests: textrecipes

Dependency Network

Dependencies: Rcpp
Reverse dependencies: textrecipes

Version History

new 0.2.5 Mar 10, 2026
updated 0.2.5 ← 0.2.4 Feb 8, 2026
updated 0.2.4 ← 0.2.3 Nov 26, 2025
updated 0.2.3 ← 0.2.2 Nov 12, 2022
updated 0.2.2 ← 0.2.1 Nov 8, 2022
updated 0.2.1 ← 0.2 Dec 20, 2021
updated 0.2 ← 0.1.2 Dec 14, 2021
updated 0.1.2 ← 0.1.1 Jun 7, 2020
new 0.1.1 Jun 3, 2020