An LLM project for Android keyboards.
train
: training & data processing
parse
: parsing raw files
data
: dataset for training and test
tokenizer
: tokenizer model for LLM (e.g. Llama2 or custom)
script
: .sh files
out
: output model and log files
inference
: C code for inference
evaluate
: Python for eval
Manually download your own TXT file and put it under ./data/.
sh script/dataset_split.sh
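The contents of script/dataset_split.sh are not shown here. As a rough illustration only, a line-level train/test split of a TXT corpus might look like the sketch below. The function name, the 90/10 ratio, and the fixed seed are assumptions, not the project's actual settings.

```python
import random


def split_lines(lines, test_ratio=0.1, seed=42):
    """Shuffle non-empty lines and split them into (train, test) lists.

    test_ratio and seed are illustrative defaults, not project settings.
    """
    cleaned = [ln.strip() for ln in lines if ln.strip()]
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    rng.shuffle(cleaned)
    n_test = int(len(cleaned) * test_ratio)
    return cleaned[n_test:], cleaned[:n_test]


if __name__ == "__main__":
    corpus = [f"sentence {i}" for i in range(10)]
    train, test = split_lines(corpus)
    print(len(train), len(test))  # 9 1
```

Shuffling before the cut keeps the test set from being just the tail of the file, which matters when the raw TXT is ordered by topic or source.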
Caution
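Make sure that the tokenizer encoding is identical in training and inference at all times, especially when customizing for low-resource languages.
Training uses the Python sentencepiece library; inference is implemented in C.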
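One way to guard against a mismatch is to encode the same sentences on both sides and compare the token IDs. The sketch below assumes the C program can print one line of space-separated IDs per input sentence (a hypothetical output format) and that the Python IDs come from sentencepiece. Neither detail is confirmed by this repository.

```python
def parse_id_line(line):
    """Parse a line such as '12 7 305' into a list of ints."""
    return [int(tok) for tok in line.split()]


def first_mismatch(py_ids, c_ids):
    """Return the index of the first differing token ID, or None if equal."""
    for i, (a, b) in enumerate(zip(py_ids, c_ids)):
        if a != b:
            return i
    if len(py_ids) != len(c_ids):
        return min(len(py_ids), len(c_ids))  # one side has extra tokens
    return None


if __name__ == "__main__":
    # Python-side IDs would come from sentencepiece, e.g.
    #   sp = sentencepiece.SentencePieceProcessor(model_file=...)
    #   py_ids = sp.encode(text)
    # Hard-coded here so the sketch runs without a model file.
    py_ids = [1, 15, 42, 7]
    c_ids = parse_id_line("1 15 43 7")  # hypothetical C output
    print(first_mismatch(py_ids, c_ids))  # 2
```

Running such a check over a few hundred sentences in the target language after every vocabulary change is cheap compared with debugging garbled predictions on the device.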
sh script/train_vocab.sh
sh script/pretokenize.sh
Important
Pay attention to your device type (CPU/GPU/MPS). If you train on a GPU, ensure that it supports torch.compile
and that your PyTorch version is 2.0 or later.
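A quick pre-flight check before training could look like the sketch below. The helper names are hypothetical, and treating any 2.x release as acceptable is an assumption. The PyTorch import is guarded so the script still runs where PyTorch is missing.

```python
def pick_device(cuda_ok, mps_ok):
    """Prefer CUDA, then Apple MPS, then CPU."""
    if cuda_ok:
        return "cuda"
    if mps_ok:
        return "mps"
    return "cpu"


def supports_compile(version):
    """True if a version string like '2.1.0+cu118' has major version >= 2."""
    major = int(version.split(".")[0])
    return major >= 2


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        print("PyTorch is not installed")
    else:
        # torch.backends.mps is absent on very old releases, hence getattr.
        mps = getattr(torch.backends, "mps", None)
        device = pick_device(
            torch.cuda.is_available(),
            mps is not None and mps.is_available(),
        )
        print(device, supports_compile(torch.__version__))
```

A passing version check does not guarantee that torch.compile works on a given GPU, so a short trial run is still worthwhile.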
sh script/train.sh
Important
Set the Top-K value according to your product requirements.
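For a keyboard, Top-K typically corresponds to the number of suggestions shown to the user. The sketch below illustrates the selection step on made-up scores; the function name and the example values are hypothetical and do not reflect how eval.py is written.

```python
import heapq


def top_k(scores, k):
    """Return the k highest-scoring (token, score) pairs, best first."""
    return heapq.nlargest(k, scores.items(), key=lambda kv: kv[1])


if __name__ == "__main__":
    # Hypothetical next-word scores; real values would come from the model.
    scores = {"you": 0.41, "the": 0.22, "to": 0.18, "a": 0.11, "it": 0.08}
    k = 3  # e.g. a keyboard with three suggestion slots
    for token, score in top_k(scores, k):
        print(token, score)
```

A larger K raises the chance that the intended word appears among the candidates but only helps if the UI can display that many, so the evaluation K should match the number of suggestion slots in the product.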
cd evaluate
python eval.py
cd inference
make run
cd ..
sh script/inference.sh