Tboard

An LLM project for Android keyboards.

Dirs

train: training & data processing
parse: parsing raw files
data: dataset for training and test
tokenizer: tokenizer model for LLM (e.g. Llama2 or custom)
script: .sh files
out: output model and log files
inference: C code for inference
evaluate: Python for eval

Usage

Download

Manually download your own TXT file and put it under ./data/.

Split data into Train and Test

sh script/dataset_split.sh
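
The contents of script/dataset_split.sh are not shown in this README. As a rough illustration of what a train/test split over a line-oriented TXT file could look like, here is a minimal Python sketch. The function name `split_lines`, the 10% test ratio, and the fixed seed are assumptions for illustration, not values taken from the repository.

```python
import random


def split_lines(lines, test_ratio=0.1, seed=42):
    """Shuffle non-empty lines and split them into (train, test) lists.

    Hypothetical helper: the ratio and seed are illustrative defaults.
    """
    # Drop blank lines and surrounding whitespace before splitting.
    cleaned = [ln.strip() for ln in lines if ln.strip()]
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    rng.shuffle(cleaned)
    n_test = int(len(cleaned) * test_ratio)
    return cleaned[n_test:], cleaned[:n_test]


if __name__ == "__main__":
    sample = [f"query {i}" for i in range(20)]
    train, test = split_lines(sample, test_ratio=0.1)
    print(len(train), len(test))
```

Fixing the seed means the same input file always produces the same split, which keeps evaluation numbers comparable between runs.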

Train custom vocabulary source

Caution

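Make sure that the tokenizer encoding process is identical in training and inference at all times, especially when customizing for low-resource languages.

Training is implemented with the Python sentencepiece package; inference is implemented in C.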

sh script/train_vocab.sh
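
One way to act on the caution above is to encode the same sample strings on both sides and compare the resulting token IDs. The sketch below is a hypothetical check, not part of the repository: `find_mismatches` and the two toy encoders are made up for illustration. In practice, `encode_train` would wrap the Python sentencepiece encoder and `encode_infer` would wrap whatever the C tokenizer outputs for the same text.

```python
def find_mismatches(samples, encode_train, encode_infer):
    """Return (text, train_ids, infer_ids) for every sample whose IDs differ."""
    mismatches = []
    for text in samples:
        train_ids = list(encode_train(text))
        infer_ids = list(encode_infer(text))
        if train_ids != infer_ids:
            mismatches.append((text, train_ids, infer_ids))
    return mismatches


if __name__ == "__main__":
    # Toy stand-ins for the two tokenizers (assumptions, not real encoders):
    # one keeps case, the other lowercases first, so mixed-case text diverges.
    def encode_keep_case(text):
        return [ord(ch) for ch in text]

    def encode_lowercase(text):
        return [ord(ch) for ch in text.lower()]

    for text, a, b in find_mismatches(["hello", "Hi"], encode_keep_case, encode_lowercase):
        print(f"mismatch for {text!r}: {a} vs {b}")
```

Running a check like this over a few hundred lines in the target language tends to surface differences in normalization or byte fallback before they show up as degraded predictions.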

Pretokenize queries in dataset

sh script/pretokenize.sh

Training LLM

Important

Pay attention to your device type (CPU/GPU/MPS). Ensure that your GPU supports torch.compile and that your PyTorch version is 2.0 or later.

sh script/train.sh
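
The device and version requirements above can be checked before launching training. The following is a minimal sketch under stated assumptions: the helper names are hypothetical, the CUDA > MPS > CPU priority is an assumed preference, and the version check treats 2.0 itself as sufficient because torch.compile was introduced in PyTorch 2.0. How script/train.sh actually selects the device is not shown in this README.

```python
def parse_major_minor(version):
    """Extract (major, minor) from a version string such as '2.1.0+cu121'."""
    parts = version.split("+")[0].split(".")
    return int(parts[0]), int(parts[1])


def supports_torch_compile(version):
    """True if the version is 2.0 or later (assumed requirement)."""
    return parse_major_minor(version) >= (2, 0)


def pick_device(cuda_available, mps_available):
    """Pick a device name; the priority order here is an assumption."""
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"


if __name__ == "__main__":
    # With PyTorch installed, the inputs could come from:
    #   torch.__version__
    #   torch.cuda.is_available()
    #   torch.backends.mps.is_available()
    print(supports_torch_compile("2.1.0+cu121"))
    print(pick_device(cuda_available=False, mps_available=True))
```

Keeping the decision logic free of PyTorch imports makes it easy to test on a machine without a GPU.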

Evaluation

Important

Set the TopK value according to your product requirements.

cd evaluate
python eval.py
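
The metric computed by eval.py is not described in this README. For a keyboard that shows K suggestions, a common choice is the top-K hit rate: the fraction of cases where the word the user actually typed appears among the first K candidates. The sketch below illustrates that idea only; `top_k_accuracy` and the sample data are hypothetical.

```python
def top_k_accuracy(ranked_predictions, targets, k):
    """Fraction of targets that appear among the first k ranked candidates."""
    if k <= 0:
        raise ValueError("k must be positive")
    if len(ranked_predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if not targets:
        return 0.0
    hits = sum(
        1
        for candidates, target in zip(ranked_predictions, targets)
        if target in candidates[:k]
    )
    return hits / len(targets)


if __name__ == "__main__":
    # Made-up candidate lists, best guess first, and the words actually typed.
    predictions = [
        ["the", "to", "that"],
        ["you", "your", "yes"],
        ["is", "in", "it"],
    ]
    typed = ["to", "yes", "on"]
    for k in (1, 2, 3):
        print(k, top_k_accuracy(predictions, typed, k))
```

A larger K always gives an equal or higher score, so K should match the number of suggestion slots the keyboard UI really displays.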

Inference

cd inference
make run
cd ..
sh script/inference.sh

Owen1u/Tboard
