
Tokenization-wise encoding #287

Merged · 5 commits · Sep 13, 2024
Conversation

@hynky1999 (Collaborator) commented Sep 3, 2024

What does this implement/fix? Explain your changes.

This PR adds a new way to do tokenization.

  • The new method tokenizes context and continuation separately, which fixes issues that arise when the context/continuation boundary is not preserved for some languages (e.g. Chinese). See the test for an example, and the illustrative sketch after this list.
  • The input parsing for models is really scary to me; it's very easy to make a mistake imo. I think we do have some shared interface (batch_size, add_bos_token, tokenizer...) for most of the inference models. It would be nice to refactor in the future and have something like TokensLightevalModel and TextLightevalModel (if we allow closed models in the future, we won't be the ones doing the batching/tokenization). I won't be doing that in this PR for sure.
  • I only add this new param to BaseModel and nanotron, as for the other models I haven't noticed that we use the tokenization params. To me it's a weird API; I wouldn't deal with it before the above happens.
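
As a rough illustration of the first point (not the PR's actual code, and the tokenizer name is just an example), tokenizing the concatenated string and splitting the ids afterwards can produce tokens that straddle the context/continuation boundary in languages without whitespace, whereas encoding the two parts separately keeps the boundary explicit:

```python
# Illustrative sketch only; lighteval's real implementation lives in the model code.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")  # example tokenizer

context, continuation = "你好", "世界"

# Old behaviour: tokenize the concatenation and split the token ids afterwards.
joint_ids = tokenizer(context + continuation, add_special_tokens=False)["input_ids"]

# New behaviour: tokenize each part on its own, so the boundary is explicit.
context_ids = tokenizer(context, add_special_tokens=False)["input_ids"]
continuation_ids = tokenizer(continuation, add_special_tokens=False)["input_ids"]

# For some tokenizers/languages these two encodings differ, which is exactly
# the mismatch that encoding context and continuation separately avoids.
print(joint_ids)
print(context_ids + continuation_ids)
```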

@NathanHB (Member) commented Sep 6, 2024

Thanks for the PR, looks great :)

> The input parsing for models is really scary to me; it's very easy to make a mistake imo. I think we do have some shared interface (batch_size, add_bos_token, tokenizer...) for most of the inference models. It would be nice to refactor in the future and have something like TokensLightevalModel and TextLightevalModel (if we allow closed models in the future, we won't be the ones doing the batching/tokenization). I won't be doing that in this PR for sure.

What do you mean by input parsing interface?

@hynky1999 (Collaborator, Author)

Pretty much this function:
https://github.com/huggingface/lighteval/blob/config_templates/src/lighteval/models/model_config.py#L288

I find it scary because when one adds new args or new model configs there is no reference config; pretty much everything is accessed through untyped dict access.
And we actually do have such configs, it's all those XConfig classes. But we can't directly parse into them because they are flattened, I think because the CLI interface was introduced first and the .yaml configs were added later.
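
To make the contrast concrete, here is a hypothetical sketch of the two styles (the names are illustrative, not lighteval's actual code):

```python
from dataclasses import dataclass

# Untyped style: every call site has to know the exact key names, and a typo
# only surfaces at runtime as a KeyError.
def create_model_untyped(config: dict):
    batch_size = config["batch_size"]             # fails late if the key is missing
    add_bos = config.get("add_bos_token", True)   # default duplicated at every call site
    return batch_size, add_bos

# Typed style: the dataclass is the single reference for keys, types and defaults.
@dataclass
class BaseModelConfig:
    pretrained: str
    batch_size: int = 1
    add_bos_token: bool = True

def create_model_typed(config: BaseModelConfig):
    return config.batch_size, config.add_bos_token  # checked by linters/IDEs, easy to document
```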

Having a hierarchical config as the reference is really nice because:

  1. It's super easy to document (you literally just reference the dataclass and don't need to update docs manually)
  2. You get nice error messages when you use an incorrect config (because whatever library you use does that for you)
  3. Enforcing required keys and defaults is much easier.
  4. We could actually merge the nanotron config and the accelerate workflow together, as nanotron would be yet another model config (a hypothetical sketch follows this list)
  5. You can share some common stuff between the configs (generation args, etc.)
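
A hypothetical shape of such a hierarchical config (class and field names are illustrative, not lighteval's real API) might look like this:

```python
from dataclasses import dataclass, field

@dataclass
class GenerationArgs:              # shared between all model configs (point 5)
    temperature: float = 0.0
    max_new_tokens: int = 256

@dataclass
class BaseModelConfig:
    pretrained: str
    batch_size: int = 1
    generation: GenerationArgs = field(default_factory=GenerationArgs)

@dataclass
class NanotronModelConfig:         # nanotron becomes yet another model config (point 4)
    checkpoint_path: str
    generation: GenerationArgs = field(default_factory=GenerationArgs)

# A config library (OmegaConf or similar) can parse a YAML file or CLI overrides
# directly into these classes, giving typed validation errors and documentation
# "for free" (points 1-3) instead of hand-rolled dict access.
```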

Now, the reason we use flattened configs is probably the CLI interface, right? I think it would make sense to either:

  1. If one uses the CLI, they can pass hierarchical args using dot syntax, e.g. 'model=pretrained, model.name=llama' (a small parsing sketch follows this list)
  2. Make the CLI interface minimal and parse it into the hierarchical config
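
One possible interpretation of the dot syntax in option 1 (a minimal hand-rolled sketch; a config library could do the same thing) is to fold dotted keys into a nested dict before parsing it into the dataclasses above:

```python
def parse_dot_args(args: list[str]) -> dict:
    """Turn dotted key=value CLI overrides into a nested dict."""
    config: dict = {}
    for arg in args:
        key, value = arg.split("=", 1)
        node = config
        *parents, leaf = key.split(".")
        for part in parents:
            node = node.setdefault(part, {})
        node[leaf] = value
    return config

print(parse_dot_args(["model.type=pretrained", "model.name=llama"]))
# {'model': {'type': 'pretrained', 'name': 'llama'}}
```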

It's not a feature, but something that would make maintenance much easier imo, so it doesn't have much priority.

@NathanHB (Member)

I see! I agree with you. I recently added vllm models and they have different configs than base models; our current system is hard to document and clunky for users. I like the idea of having model=pretrained, model.name=llama in the CLI, but it would make the CLI command much longer to type.

What do you mean by: "Make the CLI interface minimal and parse it into the hierarchical config"?

hynky1999 merged commit 5034a96 into main on Sep 13, 2024 (2 checks passed)
@hynky1999 (Collaborator, Author) commented Sep 13, 2024

> I see! I agree with you. I recently added vllm models and they have different configs than base models; our current system is hard to document and clunky for users. I like the idea of having model=pretrained, model.name=llama in the CLI, but it would make the CLI command much longer to type.
>
> What do you mean by: "Make the CLI interface minimal and parse it into the hierarchical config"?

  1. We would select a minimal usable interface for each model (e.g. for pretrained it could be just the name) and only parse that from the CLI; this makes it a bit easier to maintain (a small sketch follows this list)
  2. We change the flattened config to a hierarchical one, so that we can reuse stuff
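
A hypothetical illustration of option 1 (argument names and the config class are made up for the example): the CLI only exposes the minimal per-model fields, and typed defaults fill in everything else.

```python
import argparse
from dataclasses import dataclass

@dataclass
class BaseModelConfig:                      # illustrative, not lighteval's real class
    pretrained: str
    batch_size: int = 1
    add_bos_token: bool = True

parser = argparse.ArgumentParser()
parser.add_argument("--model-name", required=True)    # the minimal usable interface

args = parser.parse_args(["--model-name", "llama"])   # from sys.argv in practice
config = BaseModelConfig(pretrained=args.model_name)  # defaults fill in the rest
print(config)
```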

I like the first method more, though.
