Thai Natural Language Processing (Thai NLP) Resource

A collection of Thai natural language processing (NLP) software libraries, dictionaries, and corpora. Pull requests are always welcome.

Thai NLP Libraries/Services

Thai Character Cluster

Library	Description	Programming Languages	Features	License	Author & Link
JTCC	Thai Character Cluster	Java		GPL-3.0	Wittawat
TCC	Thai Character Cluster	Python		Apache 2.0	Wannaphong

Thai Sentiment Analysis

Library	Description	Programming Languages	Features	License	Author & Link
sentiment_analysis_thai					JagerV3

Thai Soundex

Library	Description	Programming Languages	Features	License	Author & Link
LK82 + Udom83	Thai Soundex	Python			Korakot

Word Segmentation

Library	Description	Programming Languages	Features	License	Author & Link
Swath	SWATH (Smart Word Analysis for THai) is a word segmenter for Thai	C	Longest Matching, Maximal Matching, and Part-of-Speech Bigram.	GPL	CMU
Lexto	Lexto: Thai Lexeme Tokenizer	Java		LGPL	NECTEC
		Python 2		LGPL	Python2 Wrapper
		Python 3		LGPL	Python3 Wrapper
Wordcut	Thai word breaker for Node.js	JavaScript, Node.js		LGPL-3.0	veer66, github
wordcutpy	A simple Thai word tokenizer written in 1 Python file	Python 3		LGPL-3.0	veer66, github
CutKum	Thai word segmentation with deep learning in TensorFlow. RNN.	Python	93% F-measure.	MIT	Pucktada, github
Thai Language Toolkit (tltk)	Based on a 2002 paper by Wirote Aroonmanakun. Word segmentation uses a maximum collocation approach. Syllable segmentation uses trigram statistics. (Dataset is included.)	Python	97.86% F-measure. (Tested on a different test set, so not directly comparable with the other models.)	GPLv3	awirote, the Python Package Index
DeepCut	A Thai word tokenization library using Deep Neural Network. CNN.	Python	98.8% F-measure.	MIT	rkcosmos, github
SynThai	Thai Word Segmentation and Part-of-Speech Tagging with Deep Learning. RNN. LSTM.	Python	99.2% F-measure	MIT	KenjiroAI, github
CutThai	Thai word segmentation written in CoffeeScript	CoffeeScript		MIT	Pureexe/cutthai Github
Multi-Candidate-Word-Segmentation	Multi Candidate Word Segmentation for Thai language. RNN. LSTM.	Python	97.0% F-measure (Word Level), 98.95% F-measure (Boundary Level)	MIT	Paper, earthy123/Multi-Candidate-Word-Segmentation
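
Several entries above list dictionary-based longest matching among their techniques. As a rough illustration of the idea only (this is not the code of Swath or any other library listed here), the sketch below greedily takes the longest dictionary word at each position. The function name `longest_matching` and the tiny dictionary are hypothetical, chosen for the example.

```python
def longest_matching(text, dictionary):
    """Greedy left-to-right segmentation: at each position, take the
    longest dictionary word that starts there. Characters that start
    no dictionary word are emitted as single-character tokens."""
    if not dictionary:
        return list(text)
    max_len = max(len(word) for word in dictionary)
    tokens = []
    i = 0
    while i < len(text):
        match = None
        # Try the longest candidate first, then shrink.
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if candidate in dictionary:
                match = candidate
                break
        if match is None:
            match = text[i]  # unknown character: fall back to one code point
        tokens.append(match)
        i += len(match)
    return tokens


# Toy dictionary (an assumption for this example, not a real lexicon).
toy_dictionary = {"ฉัน", "กิน", "ข้าว"}
print(longest_matching("ฉันกินข้าว", toy_dictionary))
```

Greedy matching is simple and fast, but it can pick a long word that forces a bad split later in the sentence, which is one reason the table also lists maximal matching and statistical or neural approaches.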

Part of Speech Tagging (POS Tagging)

Library	Description	Programming Languages	Features	License	Author & Link
Jitar+NAiST	A simple Trigram HMM part-of-speech tagger	Java			Veer66, Jitar + NAiST, 1 + NAiST, 2
SynThai	Thai Word Segmentation and Part-of-Speech Tagging with Deep Learning. RNN. LSTM.	Python	0.9163 F-measure. RNN. LSTM	MIT	KenjiroAI, github
Chart-POS	Thai POS Tagger	C		All rights reserved	AIAT, KINDML, Thanaruk T., Thodsaporn C., Demo at iApp

Named Entity Recognition

Library	Description	Programming Languages	Features	License	Author & Link
Named Entity Tagging (Thai NEST)	Thai Named Entity tagging Specification and Tools			GPL	KINDML, SIIT, AIAT
ThaiNER	Thai Named Entity Recognition for PyThaiNLP	Python		Apache 2.0 (code) & CC BY 3.0 (Dataset)	ThaiNER

News Structure Tagging

Library	Description	Programming Languages	Features	License	Author & Link
News Structure Tagging Program	Thai News Structure Tagging Program		Metadata tagging, Structure tagging, Automatic News Title Generation	GPL	AIAT

Syntactic Parsing & Tools

Library	Description	Programming Languages	Features	License	Author & Link
Chart-parser	Extract syntactic structure from a POS-tagged sentence.	C		All rights reserved	AIAT, KINDML, Thanaruk T., Thodsaporn C., Demo at iApp
Grammar Processing	Labelled Brackets -> Context Free Grammars (CFGs)	Python	Transform and compute probability		Thodsaporn C.

Thai Word Embedding

Library	Description	Programming Languages	Features	License	Author & Link
kobkrit-word-embedding	TensorFlow implementation of Thai word embedding	Python	Source code, Example, Word distance graph	LGPL	Kobkrit V.

Thai Question Answering (Machine Comprehension)

Service	Description	License	Author & Link
Thai Machine Comprehension (ThaiMC)	Bidirectional Attention Flow	Copyright (As the service)	iApp-AI

Thai Emojification

Service	Description	License	Author & Link
Thai Emojification	LSTM	GNU General Public License v3.0	Demo at iApp-AI and Source, Github

Dictionaries / Translation Pairs

Library	Description	Size	Features	License	Link
Transliteration Corpus		31K pairs	Thai-Eng Translation Pair	CC BY-NC-SA 3.0 TH	NECTEC
LEXiTRON	Thai<->English Dictionary		TH->EN, EN->TH	LEXiTRON License	NECTEC
Yaitron	LEXiTRON in machine readable format (XML)		TH->EN, EN->TH	LEXiTRON License	Veer66 Schema, Data & Conversion Code

Downloadable Text Corpus

Library	Description	Size	Features	License	Link
ORCHID		30K sent.	Word Seg., POS Tagged.	CC BY-NC-SA 3.0 TH	NECTEC
THAI-NEST	Thai-NEST: Thai Named Entity tagging Specification and Tools	45K+ named entity tokens	Named Entity tagged	GNU Lesser General Public 2.1	KINDML
InterBEST 2009/2010		5M words	Word Seg.	CC BY-NC-SA 3.0 TH	NECTEC
Thai Wikipedia	Formal Articles	1.49GB (~213.1 MB compressed)	XML	GFDL	WIKIPEDIA
TNC Top-5000 Words	Word frequency	5,000 words	Frequency of Thai words in various genres, EXCEL	All rights reserved	CHULA
Click Bait Sentences	Thai Click Bait Sentence	330 sent. (90.7KB)		MIT	Wannaphongcom
Thai Sentimental Word List	Thai Sentimental Words List	52KB	Separated words as Adj, V	MIT	Wannaphongcom
Prime Minister 29	Prime Minister 29's Speech Sentences	338KB	Word segmented, Named Entity tagged	MIT	Wannaphongcom
Thai named entity corpora	Named entity corpora by Wirote Aroonmanakun's students	266KB-1.5MB	Syllable seg., word seg., Named Entity tagged	GPLv3 (unconfirmed; tltk uses this license)	นัชชา ถิระสาโรช Data ศศิวิมล กาลันสีมา Data ณัฐดาพร เลิศชีวะ Data
Thai WordNet	THE CONSTRUCTION OF THAI WORDNET OF 1ST ORDER ENTITY COMMON BASE CONCEPTS USING A BI-DIRECTIONAL TRANSLATION METHOD AND WITH DICTIONARIES OF DIFFERENT COMPILATIONAL APPROACHES (ธนนท์ หลีน้อย); THE CONSTRUCTION OF THAI WORDNET OF 2ND ORDER ENTITY COMMON BASE CONCEPTS USING A BI-DIRECTIONAL TRANSLATION METHOD: A STUDY OF THE DIVERSITY OF MEANINGS AFFECTING TRANSLATIONAL ACCURACY (ปริศนา อัครพุทธิพร)		WordNet	N/A	ธนนท์ หลีน้อย 2008 ปริศนา อัครพุทธิพร Data 2008
Toxicity in Thai Tweet Corpus	Tokyo Metropolitan University Natural Language Processing Group		Each tweet is labeled as toxic or non-toxic	CC BY-NC 4.0	tmu-nlp

Web Query Text Corpus

Library	Description	Size	Features	License	Link
Thai National Corpus 2		32M words	Query text by genre, domain	All rights reserved	CHULA
Thai Medical Document		3,594 docs	Document and dynamic keyword map	All rights reserved	KINDML, SIIT
Southeast Asian Languages Library	Thai News, Web Text, Pop Music, Literature, Toponyms	20M chars	Phrase around a search text		SEALang
HSE Thai Corpus	Modern texts written in Thai (mostly from news websites)	50M tokens	Query by word form, lexeme, translation, grammatical attributes, lexical attributes		HSE School of Linguistics

Parallel Corpus

Library	Description	Size	Features	License	Link
TALPCo	TUFS Asian Language Parallel Corpus	1,327 sent.	Open parallel corpus consisting of Japanese sentences and their translations into Burmese (Myanmar; the official language of the Republic of the Union of Myanmar), Malay (the national language of Malaysia, Singapore and Brunei), Indonesian, Thai, Vietnamese and English	CC BY 4.0	TALPCo

Pre-trained Word Vectors

Pre-trained Model	Description	Size	Dimensions	License	Link
fastText	Skip-Gram model trained on Wikipedia using fastText		300	CC BY-SA 3.0	Facebook + Bin & Text + Text Only
thai2fit	ULMFit on Wikipedia. Perplexity of 46.80959 with 60,002 embeddings.	70MB	300	MIT	thai2vec / pyThaiNLP

Thai Text Classification Benchmarks

wongnai-corpus

Model	micro_f1_public	micro_f1_private
ULMFit	0.59313	0.60322
fastText	0.5145	0.5109
LinearSVC	0.5022	0.4976
Kaggle Score	0.59139	0.58139
BERT	0.56612	0.57057

prachathai-67k: body_text

Model	Macro-accuracy	Macro-F1
fastText	0.9302	0.5529
LinearSVC	0.513277	0.552801
ULMFit	0.948737	0.744875

wisesight-sentiment

Model	Public Accuracy	Private Accuracy
Logistic Regression	0.72781	0.7499
fastText	0.63144	0.6131
ULMFit	0.71259	0.74194
ULMFit Semi-supervised	0.73119	0.75859
ULMFit Semi-supervised Repeated One Time	0.73372	0.75968

truevoice-intent: destination

Model	Accuracy	Micro-F1
fastText	0.384116	0.384116
LinearSVC	0.807876	0.327565
ULMFit	0.834981	0.834981
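
The benchmark tables above report micro- and macro-averaged F1. The scoring scripts behind those numbers are not shown here, so the following is only a sketch of how the two averages are conventionally defined, assuming each example has exactly one true label and one predicted label. The function name `f1_scores` and the toy labels are hypothetical.

```python
from collections import Counter


def f1_scores(y_true, y_pred):
    """Return (micro_f1, macro_f1) for single-label classification."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative

    def f1(tp_count, fp_count, fn_count):
        denominator = 2 * tp_count + fp_count + fn_count
        return 2 * tp_count / denominator if denominator else 0.0

    # Micro: pool the counts over all classes, then compute one F1.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    # Macro: compute F1 per class, then take the unweighted mean.
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels)
    return micro, macro


# Toy labels (an assumption for this example, not benchmark data).
micro, macro = f1_scores(["a", "a", "a", "b"], ["a", "a", "b", "b"])
print(micro, macro)
```

Under this single-label assumption, every wrong prediction counts once as a false positive and once as a false negative, so micro-F1 coincides with plain accuracy. Macro-F1 weights every class equally, so it drops when rare classes are predicted poorly.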

Not found? Try another Thai NLP awesome list or resource (like this one)

https://resources.aiat.or.th/

Acknowledgements

Arthit, for suggestions on license wording.
C4N
Veer66
Bi89
Tchayintr
PureEXE
Cstorm125
Wannaphongcom
Ekapolc
