
differe94nt/Jseg

 
 

Jseg

A modified version of the Jieba segmenter

Synopsis

  • Equipped with emoticon detection
      • Emoticons are kept as single tokens instead of being split into sequences of meaningless punctuation marks.

  • Trained on the Sinica Corpus
      • Results are more accurate on Traditional Chinese text (F1-score = 0.91).

  • Uses a Brill tagger
      • The tagger is trained on the Sinica Treebank, which raises the accuracy of POS tagging.
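For readers unfamiliar with how an F1-score applies to word segmentation, here is a minimal sketch in Python 3. It assumes the common span-matching definition, where a predicted word counts as correct only if both of its boundaries match the gold standard. The README does not say how the 0.91 figure was computed, so this shows the usual metric and not necessarily the evaluation behind that number. The function names `to_spans` and `segmentation_f1` are hypothetical and are not part of Jseg.

```python
def to_spans(words):
    """Convert a list of words into a set of (start, end) character spans."""
    spans = set()
    start = 0
    for word in words:
        end = start + len(word)
        spans.add((start, end))
        start = end
    return spans


def segmentation_f1(gold_words, predicted_words):
    """Return (precision, recall, F1) for one segmented sentence."""
    gold = to_spans(gold_words)
    predicted = to_spans(predicted_words)
    # A predicted word is correct only if both boundaries match a gold word.
    correct = len(gold & predicted)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy example with a made-up gold standard (not Sinica Corpus data).
gold = ["台灣", "大學", "語言學", "研究所"]
predicted = ["台灣大學", "語言學", "研究所"]
precision, recall, f1 = segmentation_f1(gold, predicted)
print(precision, recall, f1)
```

In this toy case two of the three predicted words match gold words, so precision is 2/3, recall is 2/4, and F1 is their harmonic mean, 4/7.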

Usage

Import the segmenter:

```
from Jseg import jieba
```

Here's a sample text:

```
sample = '''台灣大學語言學研究所LOPE實驗室超強 Taco門神超罩 Amber 和 Emily 是雙胞胎 Yvonne 不是小老鼠 期末要爆炸啦! ◢▆▅▄▃崩╰(〒皿〒)╯潰▃▄▅▇◣ '''
```

Segment it:

```
result = jieba.seg(sample)
```

Print the segmented text with POS tags:

```
print(result.text())
```

And the result:

```
台灣/Nca 大學/Ncb 語言學/Nad 研究所/Ncb LOPE/FW 實驗室/Ncb 超強/VH11 Taco/FW 門神/Nad 超罩/VH14 Amber/FW 和/Caa Emily/FW 是/V_11 雙胞胎/DM Yvonne/FW 不/Dc 是/V_11 小老鼠/Nab 期末/Ng 要/Dbab 爆炸/VH11 啦/Tc !/PUNCTUATION ◢▆▅▄▃崩╰(〒皿〒)╯潰▃▄▅▇◣/EMOTICON
```

You can also print the result with colored POS tags:

```
print(result.text(mode='color'))
```

Print the result without POS tags:

```
print(result.nopos())
```

Result:

```
台灣 大學 語言學 研究所 LOPE 實驗室 超強 Taco 門神 超罩 Amber 和 Emily 是 雙胞胎 Yvonne 不 是 小老鼠 期末 要 爆炸 啦 ! ◢▆▅▄▃崩╰(〒皿〒)╯潰▃▄▅▇◣
```
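If you need the tagged output as structured data, the `word/TAG` string shown above can be split back into pairs. The following is a minimal sketch in Python 3 that works on the printed text only and does not call Jseg. It assumes tokens are separated by whitespace and that the tag follows the last `/` in each token. The helper name `parse_tagged` is hypothetical and is not part of the library.

```python
def parse_tagged(text):
    """Split 'word/TAG word/TAG ...' into a list of (word, tag) tuples."""
    pairs = []
    for token in text.split():
        # Split on the last slash so a slash inside the word stays in the word.
        word, sep, tag = token.rpartition("/")
        if not sep:
            # No slash found: keep the token as a word without a tag.
            word, tag = token, None
        pairs.append((word, tag))
    return pairs


# A shortened copy of the tagged output shown in this README.
tagged = "台灣/Nca 大學/Ncb 啦/Tc !/PUNCTUATION ◢▆▅▄▃崩╰(〒皿〒)╯潰▃▄▅▇◣/EMOTICON"
pairs = parse_tagged(tagged)

# Example: pull out everything tagged as an emoticon.
emoticons = [word for word, tag in pairs if tag == "EMOTICON"]
print(emoticons)
```

Splitting on the last slash is a deliberate choice here, because an emoticon or foreign word could itself contain a `/`.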
