A collection of Thai Natural Language Processing (NLP) software libraries, dictionaries, and corpora. Pull requests are always welcome.
Library | Description | Programming Languages | Features | License | Author & Link |
---|---|---|---|---|---|
JTCC | Thai Character Cluster | Java | | GPL-3.0 | Wittawat |
TCC | Thai Character Cluster | Python | | Apache 2.0 | Wannaphong |
Library | Description | Programming Languages | Features | License | Author & Link |
---|---|---|---|---|---|
sentiment_analysis_thai | | | | | JagerV3 |
Library | Description | Programming Languages | Features | License | Author & Link |
---|---|---|---|---|---|
LK82 + Udom83 | Thai Soundex | Python | | | Korakot |
Library | Description | Programming Languages | Features | License | Author & Link |
---|---|---|---|---|---|
Swath | SWATH (Smart Word Analysis for THai) is a word segmentation tool for Thai | C | Longest Matching, Maximal Matching and Part-of-Speech Bigram. | GPL | CMU |
Lexto | Lexto: Thai Lexeme Tokenizer | Java | | LGPL | NECTEC |
| | | Python 2 | | LGPL | Python2 Wrapper |
| | | Python 3 | | LGPL | Python3 Wrapper |
Wordcut | Thai word breaker for Node.js | JavaScript, Node.js | | LGPL-3.0 | veer66, github |
wordcutpy | A simple Thai word tokenizer written in a single Python file | Python 3 | | LGPL-3.0 | veer66, github |
CutKum | Thai word segmentation with deep learning in TensorFlow. RNN. | Python | 93% F-measure. | MIT | Pucktada, github |
Thai Language Toolkit (tltk) | Based on a paper by Wirote Aroonmanakun in 2002. Word segmentation is based on a maximum collocation approach. Syllable segmentation is based on trigram statistics. (Dataset is included) | Python | 97.86% F-measure. (Tested on a different test set, so it is not directly comparable with the other models.) | GPLv3 | awirote, the Python Package Index |
DeepCut | A Thai word tokenization library using Deep Neural Network. CNN. | Python | 98.8% F-measure. | MIT | rkcosmos, github |
SynThai | Thai Word Segmentation and Part-of-Speech Tagging with Deep Learning. RNN. LSTM. | Python | 99.2% F-measure | MIT | KenjiroAI, github |
CutThai | Thai word segmentation written in CoffeeScript | CoffeeScript | | MIT | Pureexe/cutthai, GitHub |
Multi-Candidate-Word-Segmentation | Multi Candidate Word Segmentation for Thai language | Python, RNN, LSTM | 97.0% F-measure (Word Level), 98.95% F-measure (Boundary Level) | MIT | Paper, earthy123/Multi-Candidate-Word-Segmentation |
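
The "Longest Matching" strategy listed for Swath above can be illustrated with a minimal sketch. This is not Swath's implementation: the function name `longest_matching` and the toy dictionary are assumptions made purely for illustration. At each position the segmenter greedily takes the longest dictionary entry and falls back to a single character when nothing matches.

```python
def longest_matching(text, dictionary, max_len=None):
    """Greedy left-to-right segmentation using the longest dictionary match."""
    if max_len is None:
        # Longest entry bounds how far ahead we need to look.
        max_len = max((len(word) for word in dictionary), default=1)
    tokens = []
    i = 0
    while i < len(text):
        match = None
        # Try the longest candidate first, then shrink the window.
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in dictionary:
                match = text[i:j]
                break
        if match is None:
            # Unknown character: emit it as a one-character token.
            match = text[i]
        tokens.append(match)
        i += len(match)
    return tokens


# Hypothetical toy dictionary, not taken from any library listed here.
toy_dictionary = {"ตา", "ตาก", "กลม", "ลม"}
print(longest_matching("ตากลม", toy_dictionary))  # ['ตาก', 'ลม']
```

The example input is ambiguous ("ตา|กลม" versus "ตาก|ลม"). A purely greedy longest match always commits to the second reading, which is one reason the statistical and neural segmenters in the table exist.
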
Library | Description | Programming Languages | Features | License | Author & Link |
---|---|---|---|---|---|
Jitar+NAiST | A simple Trigram HMM part-of-speech tagger | Java | | | Ver66, Jitar + NAiST, 1 + NAiST, 2 |
SynThai | Thai Word Segmentation and Part-of-Speech Tagging with Deep Learning. RNN. LSTM. | Python | 0.9163 F-measure. RNN. LSTM | MIT | KenjiroAI, github |
Chart-POS | Thai POS Tagger | C | | All rights reserved | AIAT, KINDML, Thanaruk T. ([email protected]), Thodsaporn C., Demo at iApp |
Library | Description | Programming Languages | Features | License | Author & Link |
---|---|---|---|---|---|
Named Entity Tagging (Thai NEST) | Thai Named Entity tagging Specification and Tools | | | GPL | KINDML, SIIT, AIAT |
ThaiNER | Thai Named Entity Recognition for PyThaiNLP | Python | | Apache 2.0 (code) & CC BY 3.0 (Dataset) | ThaiNER |
Library | Description | Programming Languages | Features | License | Author & Link |
---|---|---|---|---|---|
News Structure Tagging Program | Thai News Structure Tagging Program | | Metadata tagging, Structure tagging, Automatic News Title Generation | GPL | AIAT |
Library | Description | Programming Languages | Features | License | Author & Link |
---|---|---|---|---|---|
Chart-parser | Extract Syntactic Structure from POS Tagged Sentence. | C | | All rights reserved | AIAT, KINDML, Thanaruk T. ([email protected]), Thodsaporn C., Demo at iApp |
Grammar Processing | Labelled Brackets -> Context Free Grammars (CFGs) | Python | Transform and compute probability | | Thodsaporn C. |
Library | Description | Programming Languages | Features | License | Author & Link |
---|---|---|---|---|---|
kobkrit-word-embedding | TensorFlow implementation of Thai word embedding | Python | Source code, Example, Word distance graph | LGPL | Kobkrit V. |
Service | Description | License | Author & Link |
---|---|---|---|
Thai Machine Comprehension (ThaiMC) | Bidirectional Attention Flow | Copyright (As the service) | iApp-AI |
Service | Description | License | Author & Link |
---|---|---|---|
Thai Emotification | LSTM | GNU General Public License v3.0 | Demo at iApp-AI and Source, Github |
Library | Description | Size | Features | License | Link |
---|---|---|---|---|---|
Transliteration Corpus | | 31K pairs | Thai-Eng Translation Pair | CC BY-NC-SA 3.0 TH | NECTEC |
LEXiTRON | Thai<->English Dictionary | | TH->EN, EN->TH | LEXiTRON License | NECTEC |
Yaitron | LEXiTRON in machine readable format (XML) | | TH->EN, EN->TH | LEXiTRON License | Veer66 Schema, Data & Conversion Code |
Library | Description | Size | Features | License | Link |
---|---|---|---|---|---|
ORCHID | | 30K sent. | Word Seg., POS Tagged. | CC BY-NC-SA 3.0 TH | NECTEC |
THAI-NEST | Thai-NEST: Thai Named Entity tagging Specification and Tools | 45K+ Named Entity Tokens | Named Entity Tagged | GNU Lesser General Public License 2.1 | KINDML |
InterBEST 2009/2010 | | 5M words | Word Seg. | CC BY-NC-SA 3.0 TH | NECTEC |
Thai Wikipedia | Formal Articles | 1.49GB (~213.1 MB compressed) | XML | GFDL | WIKIPEDIA |
TNC Top-5000 Words | Word frequency | 5,000 words | Frequency of Thai words in various genres, EXCEL | All rights reserved | CHULA |
Click Bait Sentences | Thai Click Bait Sentences | 330 sent. (90.7KB) | | MIT | Wannaphongcom |
Thai Sentimental Word List | Thai Sentimental Word List | 52KB | Words separated into Adj, V | MIT | Wannaphongcom |
Prime Minister 29 | Prime Minister 29's Speech Sentences | 338KB | Word segmented, Named Entity tagged | MIT | Wannaphongcom |
Thai named entity corpora | Named entity corpora by Wirote Aroonmanakun's students | 266KB-1.5MB | Syllable seg., word seg., Named Entity tagged | GPLv3 (unconfirmed; tltk uses this license) | นัชชา ถิระสาโรช Data, ศศิวิมล กาลันสีมา Data, ณัฐดาพร เลิศชีวะ Data |
Thai WordNet | THE CONSTRUCTION OF THAI WORDNET OF 1ST ORDER ENTITY COMMON BASE CONCEPTS USING A BI-DIRECTIONAL TRANSLATION METHOD AND WITH DICTIONARIES OF DIFFERENT COMPILATIONAL APPROACHES (ธนนท์ หลีน้อย); THE CONSTRUCTION OF THAI WORDNET OF 2ND ORDER ENTITY COMMON BASE CONCEPTS USING A BI-DIRECTIONAL TRANSLATION METHOD: A STUDY OF THE DIVERSITY OF MEANINGS AFFECTING TRANSLATIONAL ACCURACY (ปริศนา อัครพุทธิพร) | | WordNet | N/A | ธนนท์ หลีน้อย 2008, ปริศนา อัครพุทธิพร Data 2008 |
Toxicity in Thai Tweet Corpus | Tokyo Metropolitan University Natural Language Processing Group | | Each tweet is labeled as toxic or non-toxic | CC BY-NC 4.0 | tmu-nlp |
Library | Description | Size | Features | License | Link |
---|---|---|---|---|---|
Thai National Corpus 2 | | 32M words | Query text by genre, domain | All rights reserved | CHULA |
Thai Medical Document | | 3,594 docs | Document and dynamic keyword map | All rights reserved | KINDML, SIIT |
Southeast Asian Languages Library | Thai News, Web Text, Pop Music, Literature, Toponyms | 20M chars | Phrase around a search text | | SEALang |
HSE Thai Corpus | Modern texts written in Thai language (mostly news websites) | 50M tokens | Query by word form, lexeme, translation, grammatical attributes, lexical attributes | | HSE School of Linguistics |
Library | Description | Size | Features | License | Link |
---|---|---|---|---|---|
TALPCo | TUFS Asian Language Parallel Corpus | 1,327 sent. | Open parallel corpus consisting of Japanese sentences and their translations into Burmese (Myanmar; the official language of the Republic of the Union of Myanmar), Malay (the national language of Malaysia, Singapore and Brunei), Indonesian, Thai, Vietnamese and English | CC BY 4.0 | TALPCo |
Pre-trained Model | Description | Size | Dimensions | License | Link |
---|---|---|---|---|---|
fastText | Skip-Gram model trained on Wikipedia using fastText | | 300 | CC BY-SA 3.0 | Facebook + Bin & Text + Text Only |
thai2fit | ULMFit on Wikipedia. Perplexity of 46.80959 with 60,002 embeddings. | 70MB | 300 | MIT | thai2vec / pyThaiNLP |
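
The perplexity quoted for thai2fit above is conventionally the exponential of the average negative log-likelihood per token. The sketch below only illustrates that definition and is not code from thai2fit: the function name `perplexity` and the probabilities are made up for illustration.

```python
import math


def perplexity(token_probs):
    """Perplexity = exp(mean negative log-probability of the observed tokens)."""
    mean_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(mean_nll)


# Hypothetical model that assigns probability 0.25 to every observed token:
# it is as uncertain as a uniform choice among 4 options, so perplexity is about 4.
print(perplexity([0.25, 0.25, 0.25, 0.25]))
```

Lower values mean the language model is, on average, less "surprised" by the held-out text.
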
model | micro_f1_public | micro_f1_private |
---|---|---|
ULMFit | 0.59313 | 0.60322 |
fastText | 0.5145 | 0.5109 |
LinearSVC | 0.5022 | 0.4976 |
Kaggle Score | 0.59139 | 0.58139 |
BERT | 0.56612 | 0.57057 |
prachathai-67k: body_text
Model | Macro-accuracy | Macro-F1 |
---|---|---|
fastText | 0.9302 | 0.5529 |
LinearSVC | 0.513277 | 0.552801 |
ULMFit | 0.948737 | 0.744875 |
Model | Public Accuracy | Private Accuracy |
---|---|---|
Logistic Regression | 0.72781 | 0.7499 |
FastText | 0.63144 | 0.6131 |
ULMFit | 0.71259 | 0.74194 |
ULMFit Semi-supervised | 0.73119 | 0.75859 |
ULMFit Semi-supervised Repeated One Time | 0.73372 | 0.75968 |
truevoice-intent: destination
model | accuracy | micro-F1 |
---|---|---|
fastText | 0.384116 | 0.384116 |
LinearSVC | 0.807876 | 0.327565 |
ULMFit | 0.834981 | 0.834981 |
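
The benchmark tables above report accuracy together with micro- and macro-averaged F1. As a rough guide to how these metrics relate, here is a minimal pure-Python sketch for single-label classification. The helper `f1_scores` and the toy labels are hypothetical and are not taken from the benchmark code. It assumes every example has exactly one true label and one predicted label, in which case micro-F1 over all classes coincides with accuracy, while macro-F1 averages per-class F1 and so penalizes poor performance on rare classes.

```python
from collections import Counter


def f1_scores(y_true, y_pred):
    """Return (accuracy, micro-F1, macro-F1) for single-label predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative

    def f1(tp_, fp_, fn_):
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    # Micro: pool counts over all classes, then compute one F1.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    # Macro: compute F1 per class, then take the unweighted mean.
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels)
    accuracy = sum(tp.values()) / len(y_true)
    return accuracy, micro, macro


# Toy, imbalanced example: the classifier always predicts "a".
acc, micro, macro = f1_scores(["a", "a", "a", "b"], ["a", "a", "a", "a"])
print(acc, micro, round(macro, 4))  # 0.75 0.75 0.4286
```
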