Failure to tokenize after " 私" for Meta-Llama-3-8B #32

@davidquarel

Description

For any tokens following " 私", the result fails to be tokenized using the tokenizer meta-llama/Meta-Llama-3-8B.

This doesn't match the behaviour of the Hugging Face tokenizer:

>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
>>> tokenizer(" 私 hello world", add_special_tokens=False).input_ids
[76771, 223, 24748, 1917]
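
The Hugging Face output has four ids for what looks like three pieces of text, which suggests " 私" is spread over two tokens. If so, at least one of those tokens would hold only part of the character's UTF-8 encoding. The sketch below illustrates that situation using plain Python bytes. The split point is a hypothetical choice for illustration only; it is not read from the actual vocabulary, and this is a guess at the kind of input involved, not a confirmed diagnosis of the bug.

```python
# " 私" is a space (1 byte) followed by 私 (3 bytes in UTF-8).
text = " 私"
raw = text.encode("utf-8")

# Hypothetical token boundary inside the multi-byte character.
# The real boundary used by the Llama 3 vocabulary is not verified here.
first, second = raw[:3], raw[3:]


def try_decode(chunk: bytes):
    """Return the decoded string, or None if the bytes are not valid UTF-8 on their own."""
    try:
        return chunk.decode("utf-8")
    except UnicodeDecodeError:
        return None


# Each piece alone is not valid UTF-8, but the concatenation is.
pieces_alone = [try_decode(first), try_decode(second)]
joined = try_decode(first + second)

print(len(raw), pieces_alone, repr(joined))
```

Code that decodes each token to a string on its own, rather than concatenating the bytes first, would hit exactly this case, so it may be worth checking whether that happens here.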
