This repository has been archived by the owner on May 17, 2023. It is now read-only.
---
output: github_document
---
<!-- README.md is generated from README.Rmd. Please edit that file -->
```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README-",
  out.width = "100%"
)
```
# HuggingFace tokenizers from R
[![R build status](https://github.com/mlverse/hftokenizers/workflows/R-CMD-check/badge.svg)](https://github.com/mlverse/hftokenizers/actions)
> This is an experimental project binding the HuggingFace [tokenizers](https://github.com/huggingface/tokenizers) Rust library to R using the [extendr](https://github.com/extendr/extendr) project. Do **not** use it for anything meaningful yet.
## Installation
This repository uses the [helloextendr template](https://github.com/extendr/helloextendr).
Before you can install this package, you need to install a working Rust toolchain. We recommend using [rustup](https://rustup.rs/).
On Windows, you'll also have to add the `i686-pc-windows-gnu` and `x86_64-pc-windows-gnu` targets:

    rustup target add x86_64-pc-windows-gnu
    rustup target add i686-pc-windows-gnu

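
If you are unsure whether the toolchain is visible to R, a quick check along these lines may help. This is only a sketch using base R: `has_tool()` is a hypothetical helper defined here, not part of `hftokenizers`, and the check assumes that `rustc`, `cargo` and `rustup` are expected to be on the `PATH` of the R session.

```r
# Hypothetical helper (not part of hftokenizers): TRUE if `name` is on the PATH.
has_tool <- function(name) {
  unname(nzchar(Sys.which(name)))
}

for (tool in c("rustc", "cargo")) {
  if (has_tool(tool)) {
    # Print the installed version, i.e. the output of `<tool> --version`.
    system2(tool, "--version")
  } else {
    message(tool, " was not found on the PATH; install Rust via rustup first.")
  }
}

# On Windows, list the installed targets to confirm the two GNU targets are present.
if (.Platform$OS.type == "windows" && has_tool("rustup")) {
  system2("rustup", c("target", "list", "--installed"))
}
```

If a tool is reported as missing right after installing Rust, restarting the R session usually picks up the updated `PATH`.
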
Once Rust is working, you can install this package via:
``` {.r}
remotes::install_github("mlverse/hftokenizers")
```
## Small example
Here's a quick demo of what you can do with `hftokenizers`:
```{r}
library(hftokenizers)
download.file(
  "https://raw.githubusercontent.com/mlverse/hftokenizers/main/tests/testthat/assets/small.txt",
  "small.txt"
)
tokenizer$
  new(models_bpe$new())$
  train(normalizePath("small.txt"))$
  encode(c("hello world"))$
  ids
```
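
The chained call above can also be written one step at a time, which makes the intermediate objects easier to inspect. The sketch below uses only the calls from the example; the variable names `tok` and `enc` are illustrative, and it assumes `small.txt` has already been downloaded as shown above. Since the package is experimental, treat it as a reading aid rather than a stable API.

```r
library(hftokenizers)

# Same calls as the chained example, split into separate steps.
tok <- tokenizer$new(models_bpe$new())        # tokenizer wrapping an untrained BPE model
tok <- tok$train(normalizePath("small.txt"))  # train() returns an object that supports encode()
enc <- tok$encode(c("hello world"))           # encoding of the input text
enc$ids                                       # token ids, as in the chained version
```
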