Any input containing text after 私 fails to tokenize with the meta-llama/Meta-Llama-3-8B tokenizer.


This doesn't match the behaviour of the Hugging Face tokenizer:

>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
>>> tokenizer(" 私 hello world", add_special_tokens=False).input_ids
[76771, 223, 24748, 1917]
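
Note that " hello" and " world" each map to a single id (24748 and 1917), so " 私" alone accounts for two ids (76771 and 223), which suggests the BPE is splitting the character's UTF-8 bytes mid-character. A quick way to inspect this in the same session (a sketch; the exact per-token decode strings are not shown here, but the id-223 token should decode to an incomplete byte sequence on its own):

>>> ids = tokenizer(" 私 hello world", add_special_tokens=False).input_ids
>>> [tokenizer.decode([i]) for i in ids]  # decode each id separately to see where the character boundary is broken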