CMKT is a wrapper library for efficient processing of code-mixed text.
git clone https://github.com/lingo-iitgn/CMKT.git
cd CMKT
pip install -r "requirements.txt"
Documentation. This page will be updated with more details soon.
How to use this library:
Refer to the demo files for toolkit usage, or to the detailed Google Colab notebook.
Four modules are available:
- Data Acquisition Module
- Preprocessing Module
- Tasks Module
- Metrics Module
This module loads and downloads datasets in various formats from external and local resources. It also offers a curated collection of 15 datasets for NLP tasks on Hindi-English code-mixed text. Supported dataset file formats: pickle, json, txt, csv, conll.
Datasets are available in the cmkt datahub for the following tasks. Use the specified task names to search for datasets in the cmkt datahub:
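As a rough illustration of what loading the supported formats involves, here is a minimal standard-library sketch. It is not the cmkt API: `load_dataset_file` and `read_conll` are hypothetical helper names, and the CoNLL reader assumes one token per line with the token in the first column, the tag in the last column, and blank lines between sentences.

```python
import csv
import json
import pickle
from pathlib import Path


def read_conll(path):
    """Parse a CoNLL-style file into sentences of (token, tag) pairs.

    Assumption: token in the first column, tag in the last column,
    sentences separated by blank lines.
    """
    sentences, current = [], []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:  # a blank line ends the current sentence
                if current:
                    sentences.append(current)
                    current = []
                continue
            parts = line.split()
            current.append((parts[0], parts[-1]))
    if current:
        sentences.append(current)
    return sentences


def load_dataset_file(path):
    """Dispatch on the file extension (hypothetical helper, not the cmkt API)."""
    suffix = Path(path).suffix.lower()
    if suffix in (".pickle", ".pkl"):
        with open(path, "rb") as handle:
            return pickle.load(handle)
    if suffix == ".json":
        with open(path, encoding="utf-8") as handle:
            return json.load(handle)
    if suffix == ".csv":
        with open(path, encoding="utf-8", newline="") as handle:
            return list(csv.DictReader(handle))
    if suffix == ".txt":
        with open(path, encoding="utf-8") as handle:
            return [line.rstrip("\n") for line in handle]
    if suffix == ".conll":
        return read_conll(path)
    raise ValueError(f"Unsupported dataset format: {suffix}")
```

Only unpickle files from sources you trust, since `pickle.load` can execute arbitrary code.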
["lid", "ner", "pos", "machine translation", "sentiment analysis", "hate speech detection", "irony detection", "humor detection", "sarcasm detection"]
The Text Preprocessing Module provides tokenization and stemming designed specifically for code-mixed text, so that the data can be prepared for downstream tasks such as NLP analysis and model training.
Tokenization in cmkt: Breaking Text into Meaningful Units.
The text preprocessing module includes tokenization techniques at the sentence, word, and subword levels, along with stemming methods for English, Hindi, and Hindi-English mixed-script text. The following tokenizers are available in cmkt:
- Word Tokenizer
- Sentence Tokenizer
- SentencePiece Tokenizer
The tokenizers are currently available for English, Hindi, and English-Hindi mixed-script text.
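To make the idea of sentence- and word-level tokenization for mixed-script text concrete, here is a small regex-based sketch. It is not how cmkt implements its tokenizers: the function names `sentence_tokenize` and `word_tokenize` are assumptions chosen for illustration, and the rules are deliberately simple (sentences end at `.`, `!`, `?` or the Devanagari danda; a word is a run of Latin or a run of Devanagari characters). Subword tokenization is left out because it needs a trained SentencePiece model.

```python
import re

# Sentence boundary: whitespace that follows ., !, ? or the Devanagari danda (U+0964).
SENTENCE_END = re.compile(r"(?<=[.!?\u0964])\s+")

# A token is a run of Latin letters/digits (with an optional apostrophe part),
# a run of Devanagari characters (excluding danda and double danda),
# or any other single non-space character such as punctuation.
WORD = re.compile(
    r"[A-Za-z0-9]+(?:'[A-Za-z]+)?|[\u0900-\u0963\u0966-\u097F]+|\S"
)


def sentence_tokenize(text):
    """Split text into sentences at simple end-of-sentence punctuation."""
    return [s for s in SENTENCE_END.split(text.strip()) if s]


def word_tokenize(text):
    """Split text into Latin-script words, Devanagari words, and punctuation."""
    return WORD.findall(text)


if __name__ == "__main__":
    sample = "Yeh movie बहुत अच्छी thi. Must watch!"
    for sentence in sentence_tokenize(sample):
        print(word_tokenize(sentence))
```

Keeping the two scripts in separate alternatives of the pattern means a token never mixes Latin and Devanagari characters, which makes later per-token language identification simpler.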
Stemming in CMKT: Reducing Words to their Base Form
Stemming reduces words to their base or root form and is an essential part of code-mixed text processing. CMKT provides a range of stemmers designed for different languages and language combinations.
The following stemmers are available in cmkt:
- English Stemmer
- Hindi Stemmer
- Hindi-English mixed Stemmer
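The sketch below shows one simple way a mixed-script stemmer can work: detect the script of each token and strip the longest matching suffix from a per-language list. This is a toy, not the algorithm cmkt uses. The suffix lists are tiny illustrative samples, the names `strip_suffix`, `stem_token`, and `stem_mixed` are hypothetical, and romanized Hindi is treated as English because only the script is checked.

```python
import re

# Any character from the Devanagari Unicode block marks a token as Hindi.
DEVANAGARI = re.compile(r"[\u0900-\u097F]")

# Toy suffix lists, ordered longest first. Illustrative only.
ENGLISH_SUFFIXES = ("ing", "ed", "es", "s")
HINDI_SUFFIXES = ("ियों", "ाओं", "ों", "ें", "ी", "े", "ा")


def strip_suffix(word, suffixes, min_stem=2):
    """Remove the first matching suffix, keeping at least min_stem characters."""
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word


def stem_token(token):
    """Pick the suffix list by script, then strip a suffix."""
    if DEVANAGARI.search(token):
        return strip_suffix(token, HINDI_SUFFIXES)
    return strip_suffix(token.lower(), ENGLISH_SUFFIXES)


def stem_mixed(tokens):
    """Stem every token of a Hindi-English mixed-script token list."""
    return [stem_token(token) for token in tokens]


if __name__ == "__main__":
    print(stem_mixed(["Playing", "किताबें", "songs"]))
```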
This module provides elementary NLP tasks such as NER, POS, and LID for code-mixed text. It also provides functions to search for the tasks and models available in cmkt. The hierarchy of the task module is defined below.
Task types available in cmkt: "syntactic", "semantic", and "generational"
TaskToolkit (Language specific)
syntactic tasks
- lid
- ner
- pos
semantic tasks
- sentiment analysis
- hate speech detection
- humor detection
generational tasks
- machine translation
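The hierarchy above can be pictured as a mapping from task type to task names. The snippet below is only a plain-Python picture of that structure, not the cmkt search functions themselves; `TASK_HIERARCHY`, `tasks_of_type`, and `type_of_task` are hypothetical names.

```python
# Task types and task names exactly as listed in the hierarchy above.
TASK_HIERARCHY = {
    "syntactic": ["lid", "ner", "pos"],
    "semantic": ["sentiment analysis", "hate speech detection", "humor detection"],
    "generational": ["machine translation"],
}


def tasks_of_type(task_type):
    """Return the task names under a task type (raises KeyError if unknown)."""
    return list(TASK_HIERARCHY[task_type])


def type_of_task(task_name):
    """Return the task type that contains task_name, or None if not found."""
    for task_type, tasks in TASK_HIERARCHY.items():
        if task_name in tasks:
            return task_type
    return None


if __name__ == "__main__":
    print(tasks_of_type("syntactic"))
    print(type_of_task("machine translation"))
```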
The metrics module provides a comprehensive range of evaluation metrics, serving diverse needs such as quantifying code-mixed text and assessing the performance of NLP tasks such as classification and machine translation.
Available code-mixed metrics:
- cmi (Code-Mixing Index)
- m-index (Multilingual Index)
- i-index (Integration Index)
- burstiness
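For readers unfamiliar with these metrics, the sketch below computes them from a list of per-token language tags using their commonly cited definitions. It is not the cmkt implementation, and the function names and the `"univ"` tag for language-independent tokens are assumptions. The definitions used are: CMI = 100 × (1 − max_i(w_i) / (n − u)), where n is the number of tokens, u the number of language-independent tokens, and w_i the token count of language i; M-index = (1 − Σ p_j²) / ((k − 1) Σ p_j²) over k languages with token proportions p_j; I-index is the fraction of adjacent token pairs where the language switches; burstiness = (σ − μ) / (σ + μ) over the lengths of monolingual spans, here with the population standard deviation.

```python
from collections import Counter
from itertools import groupby
from statistics import mean, pstdev

NEUTRAL = "univ"  # assumed tag for language-independent tokens (punctuation, names, ...)


def language_tags(tags):
    """Drop language-independent tokens, keeping only language-tagged ones."""
    return [tag for tag in tags if tag != NEUTRAL]


def cmi(tags):
    """Code-Mixing Index in [0, 100); 0 means monolingual."""
    langs = language_tags(tags)
    if not langs:
        return 0.0
    dominant = max(Counter(langs).values())
    return 100.0 * (1 - dominant / len(langs))


def m_index(tags):
    """Multilingual Index in [0, 1]; 1 means all languages equally represented."""
    langs = language_tags(tags)
    counts = Counter(langs)
    k = len(counts)
    if k < 2:
        return 0.0
    sum_sq = sum((count / len(langs)) ** 2 for count in counts.values())
    return (1 - sum_sq) / ((k - 1) * sum_sq)


def i_index(tags):
    """Integration Index: share of adjacent token pairs that switch language."""
    langs = language_tags(tags)
    if len(langs) < 2:
        return 0.0
    switches = sum(a != b for a, b in zip(langs, langs[1:]))
    return switches / (len(langs) - 1)


def burstiness(tags):
    """Burstiness of monolingual span lengths, in [-1, 1]."""
    spans = [len(list(run)) for _, run in groupby(language_tags(tags))]
    if not spans:
        return 0.0
    mu, sigma = mean(spans), pstdev(spans)
    return (sigma - mu) / (sigma + mu)


if __name__ == "__main__":
    tags = ["hi", "hi", "en", "univ", "en", "hi"]
    print(cmi(tags), m_index(tags), i_index(tags), burstiness(tags))
```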
Other common metrics available: accuracy, precision, recall, F-measure, BLEU score, ROUGE score, BERTScore, Pearson correlation, Spearman correlation