TinySegmenter

TinySegmenter.jl is a Julia port of TinySegmenter, an extremely compact Japanese tokenizer originally written in JavaScript by Mr. Taku Kudo.

Usage

```julia
using TinySegmenter

join(tokenize("私の名前は中野です"), " | ")
# "私 | の | 名前 | は | 中野 | です"
```

tokenize returns an array of substrings of the input string, so each token also records its location in the original text. (Each substring is represented by Julia's SubString type.)
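Because the tokens are SubString values rather than copies, their positions in the original text can be recovered. The following is a minimal sketch, not part of the package's documented API: it assumes the `offset` field of `SubString`, which is an internal detail of Julia's Base and may change between Julia versions. The variable names are illustrative.

```julia
using TinySegmenter

text = "私の名前は中野です"
tokens = tokenize(text)

for t in tokens
    # `t.offset` is an internal SubString field (an assumption of this sketch):
    # the number of bytes that precede the token in the parent string.
    start = t.offset + firstindex(t)  # index of the token's first character in `text`
    stop  = t.offset + lastindex(t)   # index of the token's last character in `text`
    println(t, " spans text[", start, ":", stop, "]")
end

# Convert to independent String values if the tokens should not
# keep a reference to the original text.
owned = String.(tokens)
```

Keeping the tokens as SubString values avoids copying the text; converting them with `String.(tokens)` is only needed when the tokens must outlive the original string or be passed to code that requires `String`.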

Benchmarks

The following are times in seconds for a benchmark (see benchmark/README.md) of TinySegmenter implementations in different languages tokenizing a large (243kB) Japanese text:

| Ruby | C++ | Perl | JavaScript (Node.js) | Go | Python | Julia |
|---|---|---|---|---|---|---|
| 132.98 | 48 | 134 | 105.31 | 10.50 | 111.85 | 11.70 |

The benchmark was performed on the following machine:

Intel Core i5-3210M CPU at 2.50GHz
8GB RAM (1600MHz DDR3)
MacBook Pro (Retina, 13-inch, Late 2012), MacOS 10.11 ("El Capitan")

The benchmark text was The Time Machine by H.G. Wells, translated to Japanese by Hiroo Yamagata under the CC BY-SA 2.0 License. We also use the same text for validation (in the test directory).
