| title | emoji | colorFrom | colorTo | sdk | sdk_version | app_file | pinned | license | short_description |
|---|---|---|---|---|---|---|---|---|---|
| Turkic Transliteration Demo | 🌖 | green | green | gradio | 5.29.0 | app.py | false | apache-2.0 | Transliteration of Kazakh & Kyrgyz into Latin and IPA |
turkic_transliterate
Deterministic Latin and IPA transliteration for Kazakh and Kyrgyz, plus helper utilities for tokenizer training and Russian-token filtering.
Quick install
- Install Miniconda or Anaconda (recommended).
- Clone the repo and create the environment: conda env create -f env.yml
- Activate the environment: conda activate turkic
- Run the verification tests: python -m pytest (all tests should pass)
Python compatibility
- Works on CPython 3.10 and 3.11.
- CPython 3.12+ is supported everywhere except on Windows until official PyICU wheels are available; see “Windows & PyICU” below.
Package names
- Runtime import path: turkic_translit
- Distributable name on PyPI: turkic_transliterate
- Command-line entry point: turkic-translit
For the simplest developer setup experience, run the setup script:
python scripts/setup_dev.py
This script will:
- Install the package with all development dependencies
- Set up PyICU on Windows automatically
- Verify that development tools are working properly
Alternatively, install with pip:
pip install -e .[dev,ui] # add ,winlid on Windows if you need fasttext-wheel
If you have GNU Make installed, you can use the Makefile for common tasks:
make lint # Run linting (ruff, black, mypy)
make format # Auto-format code
make test # Run tests
make web # Launch the web UI
make help # Show all available commands
Option 1: Install GNU Make using Chocolatey (Recommended)
Install GNU Make using Chocolatey (requires admin privileges):
# In an Admin PowerShell window
choco install make
After installation, you can use the same make commands as on Linux/macOS.
Option 2: Use the PowerShell Script Alternative
If you prefer not to install Chocolatey or GNU Make, use the PowerShell script:
./scripts/run.ps1 lint # Run linting
./scripts/run.ps1 format # Auto-format code
./scripts/run.ps1 test # Run tests
./scripts/run.ps1 web # Launch the web UI
./scripts/run.ps1 help # Show all available commands
Optional extras
- dev → black, ruff, pytest
- ui → gradio web demo
- winlid (Windows only) → fasttext-wheel for language ID
Windows & PyICU
Important: Due to PyPI rules, the correct PyICU wheel for Windows cannot be installed automatically during pip install. After installing this package with pip, Windows users must run the helper script to install the appropriate PyICU wheel:
turkic-pyicu-install
This script will download and install the correct PyICU wheel from Christoph Gohlke’s repository based on your Python version. See the script for details.
Command-line usage
turkic-translit --lang kk --in text.txt --out_latin kk_lat.txt --ipa --out_ipa kk_ipa.txt --arabic --log-level debug
- --lang kk or ky
- --ipa emit IPA alongside Latin
- --arabic also transliterate embedded Arabic script
- --benchmark print throughput statistics
- --log-level debug | info | warning | error | critical (default: info)
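When the CLI is driven from a script, the argument list can be assembled programmatically. The sketch below uses only the flags listed in this section. The helper names `build_translit_command` and `run_translit` are hypothetical and not part of the package, and the sketch assumes `turkic-translit` is on the PATH.

```python
import subprocess
from typing import List, Optional


def build_translit_command(
    lang: str,
    in_path: str,
    out_latin: str,
    out_ipa: Optional[str] = None,
    arabic: bool = False,
    log_level: str = "info",
) -> List[str]:
    """Assemble a turkic-translit argument list (hypothetical helper)."""
    if lang not in ("kk", "ky"):
        raise ValueError("lang must be 'kk' or 'ky'")
    cmd = ["turkic-translit", "--lang", lang, "--in", in_path, "--out_latin", out_latin]
    if out_ipa is not None:
        # --ipa asks for IPA output; --out_ipa names the file it goes to.
        cmd += ["--ipa", "--out_ipa", out_ipa]
    if arabic:
        cmd.append("--arabic")
    cmd += ["--log-level", log_level]
    return cmd


def run_translit(cmd: List[str]) -> None:
    """Run the command; raises CalledProcessError on a non-zero exit."""
    subprocess.run(cmd, check=True)
```

Passing the arguments as a list, without a shell string, avoids quoting problems with file names that contain spaces.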
Logging
Central logging supports structured JSON with correlation IDs and stack traces. Control verbosity with TURKIC_LOG_LEVEL (DEBUG, INFO, WARNING, ERROR). Format via TURKIC_LOG_FORMAT=json|rich (default json). Entry points configure logging; libraries can call turkic_translit.logging_config.setup() to adopt the same config.
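A library or notebook that wants the same logging configuration might set the two environment variables and then call the documented `setup()` function. This is a minimal sketch: the wrapper name `configure_turkic_logging` is hypothetical, and it assumes `setup()` takes no required arguments and reads the environment when it is called.

```python
import os


def configure_turkic_logging(level: str = "DEBUG", fmt: str = "json") -> bool:
    """Set the documented env vars, then call the package's setup() if available.

    Returns True if turkic_translit was importable and setup() was called.
    """
    if fmt not in ("json", "rich"):
        raise ValueError("fmt must be 'json' or 'rich'")
    os.environ["TURKIC_LOG_LEVEL"] = level.upper()
    os.environ["TURKIC_LOG_FORMAT"] = fmt
    try:
        from turkic_translit import logging_config  # module path named above
    except ImportError:
        return False  # package not installed; the env vars are still set
    logging_config.setup()  # assumption: no required arguments
    return True
```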
Error service
Optional Sentry integration via TURKIC_SENTRY_DSN (and TURKIC_ENV, TURKIC_SENTRY_TRACES). Install with pip install turkic-translit[sentry]. Correlation IDs are generated per request/command; you can also set a fixed one using TURKIC_CORRELATION_ID.
The project is organized into the following directories:
- src/turkic_translit/ - Core source code for the package
- examples/ - Example scripts showing how to use the package
- examples/web/ - Web interface for demonstrating transliteration features
- data/ - Sample data files and language resources
- docs/ - Documentation and reference materials
- scripts/ - Utility scripts for development and release
- scripts/release/ - Scripts for building and publishing packages
- vendor/pyicu/ - Pre-built PyICU wheels for Windows
- tests/ - Test suite for the package
This package uses the FastText language identification model (lid.176.bin) for Russian token filtering and language detection. The model file is not included in the repository or pip package due to its large size.
Automatic Download:
- When you use features that require language identification (such as Russian token filtering or the Gradio web demo), the package will automatically download lid.176.bin from the official Facebook AI public link if it is not already present.
- The file will be saved in the package directory on first use.
No manual action is needed. This ensures compatibility with pip installs, Hugging Face Spaces, and other cloud environments.
If you need to download the model manually, you can do so from: https://dl.fbaipublicfiles.com/fasttext/supervised-models/lid.176.bin
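The download-if-missing behaviour described above can be sketched with the standard library alone. This is an illustration, not the package's own code: the function name `ensure_lid_model` and the choice of target path are assumptions, and only the URL comes from this README.

```python
import urllib.request
from pathlib import Path

LID_URL = "https://dl.fbaipublicfiles.com/fasttext/supervised-models/lid.176.bin"


def ensure_lid_model(target: Path, url: str = LID_URL) -> Path:
    """Return `target`, downloading the model first if the file is missing."""
    if target.exists() and target.stat().st_size > 0:
        return target  # already present: no network access needed
    target.parent.mkdir(parents=True, exist_ok=True)
    tmp = target.with_name(target.name + ".part")
    urllib.request.urlretrieve(url, tmp)  # large file; may take a while
    tmp.replace(target)  # rename only after the download has completed
    return target
```

Downloading to a temporary `.part` file and renaming it afterwards means an interrupted download is not mistaken for a complete model on the next run.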
Use the main entry point script to run examples:
python turkic_tools.py [command]
Available commands:
- web - Launch the Gradio web interface for real-time transliteration
- demo - Run the simple CLI demo
- full-demo - Run the comprehensive demo with multiple languages
- help - Display available commands
Tokenizer training example
turkic-build-spm --input corpora/kk_lat.txt,corpora/ky_lat.txt --model_prefix spm/turkic12k --vocab_size 12000
Filtering Russian tokens from Uzbek
cat uz_raw.txt | turkic-filter-russian --mode drop > uz_clean.txt
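To make the idea of dropping flagged tokens concrete, here is a rough sketch of a token-level filter. It does not reproduce the tool's actual behaviour, which this README does not specify: whitespace tokenization is an assumption, the function name `drop_flagged_tokens` is hypothetical, and the language decision is injected as a plain predicate where the real tool relies on the FastText model.

```python
from typing import Callable, Iterable, Iterator


def drop_flagged_tokens(
    lines: Iterable[str], is_russian: Callable[[str], bool]
) -> Iterator[str]:
    """Yield each line with the tokens flagged by `is_russian` removed."""
    for line in lines:
        # Assumption: tokens are whitespace-separated.
        kept = [tok for tok in line.split() if not is_russian(tok)]
        yield " ".join(kept)


# Toy predicate for illustration only; a real one would query a language-ID model.
toy_russian_words = {"привет"}
cleaned = list(drop_flagged_tokens(["salom привет dunyo"], toy_russian_words.__contains__))
```

Keeping the predicate separate from the filtering loop lets the same loop be exercised in tests without loading a model.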
Developer checklist
black .
ruff check .
pytest -q
All code is UTF-8-only; on Windows a BOM is written when piping to files to avoid encoding issues.
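For readers unfamiliar with the BOM, Python's built-in `utf-8-sig` codec shows the effect: it prepends the UTF-8 byte order mark on write and strips it on read. The sketch below only illustrates the file format. How the package itself decides when to write a BOM is not shown here, and the helper names are hypothetical.

```python
import codecs
from pathlib import Path


def write_with_bom(path: Path, text: str) -> None:
    """Write text as UTF-8 with a leading byte order mark."""
    path.write_text(text, encoding="utf-8-sig")


def read_any_utf8(path: Path) -> str:
    """Read UTF-8 text, dropping a leading BOM if one is present."""
    return path.read_text(encoding="utf-8-sig")


def has_bom(path: Path) -> bool:
    """Report whether the file starts with the UTF-8 BOM bytes."""
    return path.read_bytes().startswith(codecs.BOM_UTF8)
```

Reading with `utf-8-sig` is safe for files both with and without a BOM, so it is a reasonable default when consuming output produced on Windows.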
License Apache-2.0
pip install mypy
mypy --strict .
The included mypy.ini restricts analysis to the src/ tree and skips build/, dist/, virtual-env and egg directories so duplicate-module errors do not occur even if you build wheels locally.