wordpiece

R Implementation of Wordpiece Tokenization

v2.1.3 · Mar 3, 2022 · Apache License (>= 2)

Description

Apply 'Wordpiece' (<arXiv:1609.08144>) tokenization to input text, given an appropriate vocabulary. The 'BERT' (<arXiv:1810.04805>) tokenization conventions are used by default.
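
As a rough illustration of what Wordpiece tokenization does under the BERT conventions, here is a minimal Python sketch of greedy longest-match-first splitting of a single word. It is not the package's R code. The function name `greedy_wordpiece` and the toy vocabulary are hypothetical; the `##` continuation prefix and the `[UNK]` fallback token are the usual BERT conventions.

```python
def greedy_wordpiece(word, vocab, unk_token="[UNK]", prefix="##"):
    """Split one word into wordpieces, always taking the longest match first."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate from the right until it is found in the vocabulary.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                # Pieces that do not begin the word carry the continuation prefix.
                candidate = prefix + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece fits at this position: the whole word maps to the unknown token.
            return [unk_token]
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary; a real one holds tens of thousands of entries.
toy_vocab = {"un", "##aff", "##able", "token", "##ize", "##r", "[UNK]"}

print(greedy_wordpiece("unaffable", toy_vocab))  # ['un', '##aff', '##able']
print(greedy_wordpiece("tokenizer", toy_vocab))  # ['token', '##ize', '##r']
print(greedy_wordpiece("xyz", toy_vocab))        # ['[UNK]']
```

The sketch skips the steps that come before this one in a full tokenizer, such as lowercasing and splitting text on whitespace and punctuation, which is why an "appropriate vocabulary" alone is not the whole pipeline.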

Downloads

CRAN

Last 30 days: 335 (13322nd)
Last 90 days: 844
Last year: 14.8K

Trend: +42% (30d vs prior 30d)

r2u CRAN

Last 30 days: 1
Last 90 days: 24
Last year: 108

Trend: -88.9% (30d vs prior 30d)

autoCRAN

Last 7 days: 1
Last 30 days: 8
All-time: 0

autoCRAN-only: this name is served only by autoCRAN, so the count is exact.

CRAN Check Status

13 NOTE
Flavor Status
r-devel-linux-x86_64-debian-clang NOTE
r-devel-linux-x86_64-debian-gcc NOTE
r-devel-linux-x86_64-fedora-clang NOTE
r-devel-linux-x86_64-fedora-gcc NOTE
r-devel-windows-x86_64 NOTE
r-oldrel-macos-arm64 NOTE
r-oldrel-macos-x86_64 NOTE
r-oldrel-windows-x86_64 NOTE
r-patched-linux-x86_64 NOTE
r-release-linux-x86_64 NOTE
r-release-macos-arm64 NOTE
r-release-macos-x86_64 NOTE
r-release-windows-x86_64 NOTE
Check details (15 non-OK)
NOTE r-devel-linux-x86_64-debian-clang

CRAN incoming feasibility

Maintainer: ‘Jonathan Bratt <jonathan.bratt@macmillan.com>’

The Description field contains
  Apply 'Wordpiece' (<arXiv:1609.08144>) tokenization to input text,
  given an appropriate vocabulary. The 'BERT' (<arXiv:1810.04805>)
Please refer to arXiv e-prints via their arXiv DOI <doi:10.48550/arXiv.YYMM.NNNNN>.
NOTE r-devel-linux-x86_64-debian-clang

Rd files

checkRd: (-1) wordpiece_cache_dir.Rd:16-17: Lost braces in \itemize; meant \describe ?
checkRd: (-1) wordpiece_cache_dir.Rd:18-19: Lost braces in \itemize; meant \describe ?
checkRd: (-1) wordpiece_cache_dir.Rd:20-21: Lost braces in \itemize; meant \describe ?
NOTE r-devel-linux-x86_64-debian-gcc

CRAN incoming feasibility

Maintainer: ‘Jonathan Bratt <jonathan.bratt@macmillan.com>’

The Description field contains
  Apply 'Wordpiece' (<arXiv:1609.08144>) tokenization to input text,
  given an appropriate vocabulary. The 'BERT' (<arXiv:1810.04805>)
Please refer to arXiv e-prints via their arXiv DOI <doi:10.48550/arXiv.YYMM.NNNNN>.
NOTE r-devel-linux-x86_64-debian-gcc

Rd files

checkRd: (-1) wordpiece_cache_dir.Rd:16-17: Lost braces in \itemize; meant \describe ?
checkRd: (-1) wordpiece_cache_dir.Rd:18-19: Lost braces in \itemize; meant \describe ?
checkRd: (-1) wordpiece_cache_dir.Rd:20-21: Lost braces in \itemize; meant \describe ?
NOTE r-devel-linux-x86_64-fedora-clang

Rd files

checkRd: (-1) wordpiece_cache_dir.Rd:16-17: Lost braces in \itemize; meant \describe ?
checkRd: (-1) wordpiece_cache_dir.Rd:18-19: Lost braces in \itemize; meant \describe ?
checkRd: (-1) wordpiece_cache_dir.Rd:20-21: Lost braces in \itemize; meant \describe ?
NOTE r-devel-linux-x86_64-fedora-gcc

Rd files

checkRd: (-1) wordpiece_cache_dir.Rd:16-17: Lost braces in \itemize; meant \describe ?
checkRd: (-1) wordpiece_cache_dir.Rd:18-19: Lost braces in \itemize; meant \describe ?
checkRd: (-1) wordpiece_cache_dir.Rd:20-21: Lost braces in \itemize; meant \describe ?
NOTE r-devel-windows-x86_64

Rd files

checkRd: (-1) wordpiece_cache_dir.Rd:16-17: Lost braces in \itemize; meant \describe ?
checkRd: (-1) wordpiece_cache_dir.Rd:18-19: Lost braces in \itemize; meant \describe ?
checkRd: (-1) wordpiece_cache_dir.Rd:20-21: Lost braces in \itemize; meant \describe ?
NOTE r-oldrel-macos-arm64

Rd files

checkRd: (-1) wordpiece_cache_dir.Rd:16-17: Lost braces in \itemize; meant \describe ?
checkRd: (-1) wordpiece_cache_dir.Rd:18-19: Lost braces in \itemize; meant \describe ?
checkRd: (-1) wordpiece_cache_dir.Rd:20-21: Lost braces in \itemize; meant \describe ?
NOTE r-oldrel-macos-x86_64

Rd files

checkRd: (-1) wordpiece_cache_dir.Rd:16-17: Lost braces in \itemize; meant \describe ?
checkRd: (-1) wordpiece_cache_dir.Rd:18-19: Lost braces in \itemize; meant \describe ?
checkRd: (-1) wordpiece_cache_dir.Rd:20-21: Lost braces in \itemize; meant \describe ?
NOTE r-oldrel-windows-x86_64

Rd files

checkRd: (-1) wordpiece_cache_dir.Rd:16-17: Lost braces in \itemize; meant \describe ?
checkRd: (-1) wordpiece_cache_dir.Rd:18-19: Lost braces in \itemize; meant \describe ?
checkRd: (-1) wordpiece_cache_dir.Rd:20-21: Lost braces in \itemize; meant \describe ?
NOTE r-patched-linux-x86_64

Rd files

checkRd: (-1) wordpiece_cache_dir.Rd:16-17: Lost braces in \itemize; meant \describe ?
checkRd: (-1) wordpiece_cache_dir.Rd:18-19: Lost braces in \itemize; meant \describe ?
checkRd: (-1) wordpiece_cache_dir.Rd:20-21: Lost braces in \itemize; meant \describe ?
NOTE r-release-linux-x86_64

Rd files

checkRd: (-1) wordpiece_cache_dir.Rd:16-17: Lost braces in \itemize; meant \describe ?
checkRd: (-1) wordpiece_cache_dir.Rd:18-19: Lost braces in \itemize; meant \describe ?
checkRd: (-1) wordpiece_cache_dir.Rd:20-21: Lost braces in \itemize; meant \describe ?
NOTE r-release-macos-arm64

Rd files

checkRd: (-1) wordpiece_cache_dir.Rd:16-17: Lost braces in \itemize; meant \describe ?
checkRd: (-1) wordpiece_cache_dir.Rd:18-19: Lost braces in \itemize; meant \describe ?
checkRd: (-1) wordpiece_cache_dir.Rd:20-21: Lost braces in \itemize; meant \describe ?
NOTE r-release-macos-x86_64

Rd files

checkRd: (-1) wordpiece_cache_dir.Rd:16-17: Lost braces in \itemize; meant \describe ?
checkRd: (-1) wordpiece_cache_dir.Rd:18-19: Lost braces in \itemize; meant \describe ?
checkRd: (-1) wordpiece_cache_dir.Rd:20-21: Lost braces in \itemize; meant \describe ?
NOTE r-release-windows-x86_64

Rd files

checkRd: (-1) wordpiece_cache_dir.Rd:16-17: Lost braces in \itemize; meant \describe ?
checkRd: (-1) wordpiece_cache_dir.Rd:18-19: Lost braces in \itemize; meant \describe ?
checkRd: (-1) wordpiece_cache_dir.Rd:20-21: Lost braces in \itemize; meant \describe ?
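
The CRAN incoming feasibility NOTE in the check details above asks for arXiv e-prints to be cited via their arXiv DOI. A minimal sketch of that rewrite, in Python rather than R, assuming identifiers follow the `YYMM.NNNNN` pattern shown in the NOTE. `arxiv_to_doi` is a hypothetical helper, not part of the package or of CRAN's tooling.

```python
import re

# Matches references such as <arXiv:1609.08144> (four digits, a dot, four or five digits).
ARXIV_REF = re.compile(r"<arXiv:(\d{4}\.\d{4,5})>")


def arxiv_to_doi(text):
    """Rewrite <arXiv:YYMM.NNNNN> references as <doi:10.48550/arXiv.YYMM.NNNNN>."""
    return ARXIV_REF.sub(r"<doi:10.48550/arXiv.\1>", text)


description = (
    "Apply 'Wordpiece' (<arXiv:1609.08144>) tokenization to input text, "
    "given an appropriate vocabulary. The 'BERT' (<arXiv:1810.04805>) "
    "tokenization conventions are used by default."
)

print(arxiv_to_doi(description))
```

Applied to the Description text quoted in the NOTE, this yields `<doi:10.48550/arXiv.1609.08144>` and `<doi:10.48550/arXiv.1810.04805>` in place of the two arXiv references.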

Check History

NOTE 0 OK · 14 NOTE · 0 WARNING · 0 ERROR · 0 FAILURE Mar 10, 2026
NOTE r-devel-linux-x86_64-debian-clang

CRAN incoming feasibility

Maintainer: ‘Jonathan Bratt <jonathan.bratt@macmillan.com>’

The Description field contains
  Apply 'Wordpiece' (<arXiv:1609.08144>) tokenization to input text,
  given an appropriate vocabulary. The 'BERT' (<arXiv:1810.04805>)
Please refer to arXiv e-prints via their arXiv DOI <doi:10.48550/arXiv.YYMM.NNNNN>.
NOTE r-devel-linux-x86_64-debian-gcc

CRAN incoming feasibility

Maintainer: ‘Jonathan Bratt <jonathan.bratt@macmillan.com>’

The Description field contains
  Apply 'Wordpiece' (<arXiv:1609.08144>) tokenization to input text,
  given an appropriate vocabulary. The 'BERT' (<arXiv:1810.04805>)
Please refer to arXiv e-prints via their arXiv DOI <doi:10.48550/arXiv.YYMM.NNNNN>.
NOTE r-devel-linux-x86_64-fedora-clang

Rd files

checkRd: (-1) wordpiece_cache_dir.Rd:16-17: Lost braces in \itemize; meant \describe ?
checkRd: (-1) wordpiece_cache_dir.Rd:18-19: Lost braces in \itemize; meant \describe ?
checkRd: (-1) wordpiece_cache_dir.Rd:20-21: Lost braces in \itemize; meant \describe ?
NOTE r-devel-linux-x86_64-fedora-gcc

Rd files

checkRd: (-1) wordpiece_cache_dir.Rd:16-17: Lost braces in \itemize; meant \describe ?
checkRd: (-1) wordpiece_cache_dir.Rd:18-19: Lost braces in \itemize; meant \describe ?
checkRd: (-1) wordpiece_cache_dir.Rd:20-21: Lost braces in \itemize; meant \describe ?
NOTE r-devel-macos-arm64

Rd files

checkRd: (-1) wordpiece_cache_dir.Rd:16-17: Lost braces in \itemize; meant \describe ?
checkRd: (-1) wordpiece_cache_dir.Rd:18-19: Lost braces in \itemize; meant \describe ?
checkRd: (-1) wordpiece_cache_dir.Rd:20-21: Lost braces in \itemize; meant \describe ?
NOTE r-devel-windows-x86_64

Rd files

checkRd: (-1) wordpiece_cache_dir.Rd:16-17: Lost braces in \itemize; meant \describe ?
checkRd: (-1) wordpiece_cache_dir.Rd:18-19: Lost braces in \itemize; meant \describe ?
checkRd: (-1) wordpiece_cache_dir.Rd:20-21: Lost braces in \itemize; meant \describe ?
NOTE r-patched-linux-x86_64

Rd files

checkRd: (-1) wordpiece_cache_dir.Rd:16-17: Lost braces in \itemize; meant \describe ?
checkRd: (-1) wordpiece_cache_dir.Rd:18-19: Lost braces in \itemize; meant \describe ?
checkRd: (-1) wordpiece_cache_dir.Rd:20-21: Lost braces in \itemize; meant \describe ?
NOTE r-release-linux-x86_64

Rd files

checkRd: (-1) wordpiece_cache_dir.Rd:16-17: Lost braces in \itemize; meant \describe ?
checkRd: (-1) wordpiece_cache_dir.Rd:18-19: Lost braces in \itemize; meant \describe ?
checkRd: (-1) wordpiece_cache_dir.Rd:20-21: Lost braces in \itemize; meant \describe ?
NOTE r-release-macos-arm64

Rd files

checkRd: (-1) wordpiece_cache_dir.Rd:16-17: Lost braces in \itemize; meant \describe ?
checkRd: (-1) wordpiece_cache_dir.Rd:18-19: Lost braces in \itemize; meant \describe ?
checkRd: (-1) wordpiece_cache_dir.Rd:20-21: Lost braces in \itemize; meant \describe ?
NOTE r-release-macos-x86_64

Rd files

checkRd: (-1) wordpiece_cache_dir.Rd:16-17: Lost braces in \itemize; meant \describe ?
checkRd: (-1) wordpiece_cache_dir.Rd:18-19: Lost braces in \itemize; meant \describe ?
checkRd: (-1) wordpiece_cache_dir.Rd:20-21: Lost braces in \itemize; meant \describe ?
NOTE r-release-windows-x86_64

Rd files

checkRd: (-1) wordpiece_cache_dir.Rd:16-17: Lost braces in \itemize; meant \describe ?
checkRd: (-1) wordpiece_cache_dir.Rd:18-19: Lost braces in \itemize; meant \describe ?
checkRd: (-1) wordpiece_cache_dir.Rd:20-21: Lost braces in \itemize; meant \describe ?
NOTE r-oldrel-macos-arm64

Rd files

checkRd: (-1) wordpiece_cache_dir.Rd:16-17: Lost braces in \itemize; meant \describe ?
checkRd: (-1) wordpiece_cache_dir.Rd:18-19: Lost braces in \itemize; meant \describe ?
checkRd: (-1) wordpiece_cache_dir.Rd:20-21: Lost braces in \itemize; meant \describe ?
NOTE r-oldrel-macos-x86_64

Rd files

checkRd: (-1) wordpiece_cache_dir.Rd:16-17: Lost braces in \itemize; meant \describe ?
checkRd: (-1) wordpiece_cache_dir.Rd:18-19: Lost braces in \itemize; meant \describe ?
checkRd: (-1) wordpiece_cache_dir.Rd:20-21: Lost braces in \itemize; meant \describe ?
NOTE r-oldrel-windows-x86_64

Rd files

checkRd: (-1) wordpiece_cache_dir.Rd:16-17: Lost braces in \itemize; meant \describe ?
checkRd: (-1) wordpiece_cache_dir.Rd:18-19: Lost braces in \itemize; meant \describe ?
checkRd: (-1) wordpiece_cache_dir.Rd:20-21: Lost braces in \itemize; meant \describe ?

Code

Structure

Lines of code

1,215

Files

40

Compiled share

0%

Has compiled src

No

Language breakdown

R: 512 (42.1%)
Tests: 167 (13.7%)
Docs: 394 (32.4%)
Vignettes: 142 (11.7%)
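
The page does not say how these figures are derived. As a sanity check, the short Python sketch below assumes that each percentage is a category's share of the total line count, and that the test-to-code ratio reported under Testing & CI is test lines divided by R lines. Both assumptions reproduce the numbers shown on this page.

```python
# Line counts per category, copied from the language breakdown.
loc = {"R": 512, "Tests": 167, "Docs": 394, "Vignettes": 142}

total = sum(loc.values())

# Assumed definition: share of total lines, rounded to one decimal place.
shares = {name: round(100 * count / total, 1) for name, count in loc.items()}

# Assumed definition: test lines divided by R source lines.
test_to_code = round(loc["Tests"] / loc["R"], 2)

print(total)         # 1215
print(shares)        # {'R': 42.1, 'Tests': 13.7, 'Docs': 32.4, 'Vignettes': 11.7}
print(test_to_code)  # 0.33
```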

API

Exported functions

7

Internal functions

19

Recent export changes

v2.0.1: +4 (prepare_vocab, set_wordpiece_cache_dir, wordpiece_cache_dir, +1 more), −1 (get_cache_dir)
v1.0.2: +4 (get_cache_dir, load_or_retrieve_vocab, load_vocab, +1 more)

Testing & CI

Has tests

Yes

Test-to-code ratio

0.33

testthat edition

3

CI present

No

CI type

None

PR gated

No

Docs

Return-value doc rate

85.7%

\dontrun example ratio

0%

Roxygen coverage

100%

Has pkgdown

No

NEWS present

Yes

Health & Security signals

Informational signals; not verdicts.

on.exit coverage

0%

Unsafe pattern score

0

Dep constraint coverage

85.7%

Secret pattern count

0

Bundled 3rd-party code

2 items

Portability & License

Min R version

3.3.0

System requirements

C++ standard

License

Apache License (>= 2)

License flags

SPDX valid, OSI approved

History

Versions

3

First release

2021-02-11

Latest release

2022-03-03

Avg cadence

193 days

Cold removal rate

100%

Dep drift

9

LOC over versions

v1.0.2: 1,424 LOC
v2.0.1: 970 LOC
v2.1.3: 1,215 LOC

Per-file churn detail lives in the source pipeline: https://github.com/r-observatory/cran-code-metrics.

Reverse Dependencies (1)

suggests

Dependency Network

Dependencies: dlr, fastmatch, memoise, piecemaker, rlang, stringi, wordpiece.data
Reverse dependencies: textrecipes

Version History

4 tracked
new 2.1.3 Mar 10, 2026
updated 2.1.3 ← 2.0.1 Mar 2, 2022
updated 2.0.1 ← 1.0.2 Oct 17, 2021
new 1.0.2 Feb 10, 2021