Failure to tokenize after " 私" for Meta-Llama-3-8B

Any text following `" 私"` fails to be tokenized when using the `meta-llama/Meta-Llama-3-8B` tokenizer.

![Image](https://github.com/user-attachments/assets/73f16cf9-6942-47d4-a65c-9ec3f2df36cc)

![Image](https://github.com/user-attachments/assets/05cc2ef4-2333-428b-baa2-21de5e24f0ac)

This doesn't match the behaviour of the Hugging Face tokenizer:

![Image](https://github.com/user-attachments/assets/9e5df98b-ef16-4d13-9e41-64330518fbdd)

```
>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
>>> tokenizer(" 私 hello world", add_special_tokens=False).input_ids
[76771, 223, 24748, 1917]
```
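
A possible explanation, which I have not confirmed: the Hugging Face output uses two ids (`76771`, `223`) for `" 私"`, so the character's UTF-8 bytes may be split across two byte-level tokens. If that is the case, neither token decodes to valid text on its own, which could trip up an implementation that decodes token by token. The sketch below only illustrates that byte-level effect in plain Python. The split point (`raw[:3]` / `raw[3:]`) and the helper `decodes_alone` are assumptions for illustration, not taken from either tokenizer.

```python
def decodes_alone(piece: bytes) -> bool:
    """Return True if the byte string is valid UTF-8 by itself."""
    try:
        piece.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


text = " 私"
raw = text.encode("utf-8")  # one space byte followed by the three bytes of 私

# Hypothetical split: assumes the first token covers the space plus two bytes
# of the character, and the second token covers the final byte.
head, tail = raw[:3], raw[3:]

print(raw, len(raw))
print("head decodes alone:", decodes_alone(head))
print("tail decodes alone:", decodes_alone(tail))
print("joined decodes to original:", (head + tail).decode("utf-8") == text)
```

If this is what is happening, the pieces only become valid text once their bytes are concatenated before decoding.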
