Comparison of Japanese Sentence Segmentation Tools

Requirements

Python: 3.10

Installation

pipenv install

Run

pipenv run python run.py

Benchmark

Data¹:
- wikipedia: 15 documents sampled from Japanese Wikipedia.
- cc100: 15 documents sampled from CC-100 (web text).
- emoji
  - Example input: "もちろん大丈夫です👍よろしくお願いします。"
  - Expected output: ["もちろん大丈夫です👍", "よろしくお願いします。"]
- kaomoji
  - Example input: "いいですよ^^よろしくお願いします。"
  - Expected output: ["いいですよ^^", "よろしくお願いします。"]
- named_entity
  - Example input: "モーニング娘。は日本のアイドルグループです。"
  - Expected output: ["モーニング娘。は日本のアイドルグループです。"]
- new_line
  - Example input: "時間は現在調整中ですので決まり次第\nご連絡差し上げます。"
  - Expected output: ["時間は現在調整中ですので決まり次第\nご連絡差し上げます。"]

Evaluation metric: micro-averaged F1, reported as a percentage.
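
The README does not say how a match is counted, so the following is only a minimal sketch of micro-averaged F1 under stated assumptions: a predicted sentence counts as correct only when its character span exactly equals a gold span, and the sentences of each document concatenate back to the original text. The helper names `sentence_spans` and `micro_f1` are hypothetical and are not taken from `run.py`.

```python
from typing import Iterable, List, Set, Tuple

Span = Tuple[int, int]


def sentence_spans(sentences: List[str]) -> Set[Span]:
    """Map a list of sentences to (start, end) character spans.

    Assumes the sentences concatenate back to the original text.
    """
    spans: Set[Span] = set()
    start = 0
    for sentence in sentences:
        end = start + len(sentence)
        spans.add((start, end))
        start = end
    return spans


def micro_f1(documents: Iterable[Tuple[List[str], List[str]]]) -> float:
    """Micro-averaged F1 over (gold, predicted) sentence lists.

    Counts are pooled over all documents before F1 is computed,
    which is what distinguishes micro from macro averaging.
    """
    tp = fp = fn = 0
    for gold, predicted in documents:
        gold_spans = sentence_spans(gold)
        predicted_spans = sentence_spans(predicted)
        tp += len(gold_spans & predicted_spans)
        fp += len(predicted_spans - gold_spans)
        fn += len(gold_spans - predicted_spans)
    denominator = 2 * tp + fp + fn
    # F1 = 2TP / (2TP + FP + FN); defined as 0.0 when nothing was counted.
    return 2 * tp / denominator if denominator else 0.0


docs = [
    # Missed boundary after the emoji: 0 TP, 1 FP, 2 FN.
    (
        ["もちろん大丈夫です👍", "よろしくお願いします。"],
        ["もちろん大丈夫です👍よろしくお願いします。"],
    ),
    # Correctly left unsplit: 1 TP.
    (
        ["モーニング娘。は日本のアイドルグループです。"],
        ["モーニング娘。は日本のアイドルグループです。"],
    ),
]
print(f"{100 * micro_f1(docs):.1f}")
```

Under exact-span matching, a single missed boundary costs one false positive and two false negatives, so a tool that misses every boundary in a category scores 0.0 there.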

Tool	Method	wikipedia	cc100	emoji	kaomoji	named_entity	new_line
pysbd	Rule-based	100.0	85.5	0.0	0.0	0.0	44.4
rhoknp	Rule-based	100.0	88.4	0.0	0.0	0.0	44.4
kuzukiri	Rule-based	100.0	85.3	0.0	0.0	72.7	44.4
hasami	Rule-based	94.8	86.2	0.0	0.0	72.7	44.4
sengiri	Rule-based	55.7	68.1	12.9	0.0	56.0	44.4
bunkai	Rule-based + Model-based	93.7	83.7	100.0	66.7	0.0	100.0
ginza (ja_ginza_electra)	Model-based	95.7	85.7	66.7	84.2	75.0	70.0
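
To illustrate why the emoji and named_entity categories are hard, here is a hypothetical punctuation-only baseline. It is not how any of the tools listed above is implemented; `naive_split` is an illustrative name, and the rule (break after every 。, ！ or ？ and nowhere else) is an assumption made for this sketch.

```python
import re
from typing import List

# Zero-width match right after a sentence-final punctuation mark.
# Splitting on empty matches requires Python 3.7 or later.
_AFTER_TERMINATOR = re.compile(r"(?<=[。！？])")


def naive_split(text: str) -> List[str]:
    """Split after every 。, ！ or ？ and drop empty pieces."""
    return [piece for piece in _AFTER_TERMINATOR.split(text) if piece]


# named_entity: the 。 inside the group name triggers a wrong split.
print(naive_split("モーニング娘。は日本のアイドルグループです。"))

# emoji: there is no punctuation after 👍, so the boundary is missed.
print(naive_split("もちろん大丈夫です👍よろしくお願いします。"))
```

On the two example inputs from the Data section, this baseline over-splits the named_entity case into two pieces and returns the emoji case as a single sentence, so it matches neither expected output.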

¹ Annotation was done by the repository owner.