
Tokenization-wise encoding #287

Merged · 5 commits · Sep 13, 2024
Conversation

@hynky1999 (Collaborator) commented Sep 3, 2024

What does this implement/fix? Explain your changes.

This PR adds a new way to do tokenization.

  • The new method tokenizes context and continuation separately, which fixes issues that arise when the context/continuation boundary is not preserved for some languages (e.g. Chinese). See the test for an example, and the illustrative sketch after this list.
  • The input parsing for models is really scary to me; it's very easy to make a mistake imo. I think we do have some shared interface (batch_size, add_bos_token, tokenizer...) for most of the inference models. It would be nice to refactor in the future and have something like TokensLightevalModel and TextLightevalModel (if we allow closed models in the future, we won't be the ones doing the batching/tokenization). I won't be doing that in this PR for sure.
  • I only add this new param to BaseModel and nanotron, as for the other models I haven't noticed that we use the tokenization params. To me it's a weird API; I wouldn't deal with it before the above happens.
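
As a rough illustration of the first point (not the PR's actual code, and the tokenizer name is just an example), tokenizing the concatenated string and splitting the ids afterwards can produce tokens that straddle the context/continuation boundary in languages without whitespace, whereas encoding the two parts separately keeps the boundary explicit:

```python
# Illustrative sketch only; lighteval's real implementation lives in the model code.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")  # example tokenizer

context, continuation = "你好", "世界"

# Old behaviour: tokenize the concatenation and split the token ids afterwards.
joint_ids = tokenizer(context + continuation, add_special_tokens=False)["input_ids"]

# New behaviour: tokenize each part on its own, so the boundary is explicit.
context_ids = tokenizer(context, add_special_tokens=False)["input_ids"]
continuation_ids = tokenizer(continuation, add_special_tokens=False)["input_ids"]

# For some tokenizers/languages these two encodings differ, which is exactly
# the mismatch that encoding context and continuation separately avoids.
print(joint_ids)
print(context_ids + continuation_ids)
```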

@NathanHB (Member) commented Sep 6, 2024

Thanks for the PR, looks great :)

> The input parsing for models is really scary to me; it's very easy to make a mistake imo. I think we do have some shared interface (batch_size, add_bos_token, tokenizer...) for most of the inference models. It would be nice to refactor in the future and have something like TokensLightevalModel and TextLightevalModel (if we allow closed models in the future, we won't be the ones doing the batching/tokenization). I won't be doing that in this PR for sure.

What do you mean by input parsing interface?

@hynky1999 (Collaborator, Author)

Pretty much this function:
https://github.com/huggingface/lighteval/blob/config_templates/src/lighteval/models/model_config.py#L288

I find it scary because when one adds new args or new model configs there is no reference config; pretty much everything is accessed through untyped dict access.
And we actually do have such configs, it's all those XConfig classes. But we can't directly parse into them because they are flattened, I think because the CLI interface was introduced first and the .yaml configs were added later.
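
To make the contrast concrete, here is a hypothetical sketch of the two styles (the names are illustrative, not lighteval's actual code):

```python
from dataclasses import dataclass

# Untyped style: every call site has to know the exact key names, and a typo
# only surfaces at runtime as a KeyError.
def create_model_untyped(config: dict):
    batch_size = config["batch_size"]             # fails late if the key is missing
    add_bos = config.get("add_bos_token", True)   # default duplicated at every call site
    return batch_size, add_bos

# Typed style: the dataclass is the single reference for keys, types and defaults.
@dataclass
class BaseModelConfig:
    pretrained: str
    batch_size: int = 1
    add_bos_token: bool = True

def create_model_typed(config: BaseModelConfig):
    return config.batch_size, config.add_bos_token  # checked by linters/IDEs, easy to document
```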

Having a hierarchical config as the reference is really nice because:

  1. It's super easy to document (you literally just reference the dataclass and don't need to update docs manually)
  2. You get nice error messages when you use an incorrect config (because whatever library you use does that for you)
  3. Enforcing required keys and defaults is much easier.
  4. We could actually merge the nanotron config and the accelerate workflow together, as nanotron would be yet another model config (a hypothetical sketch follows this list)
  5. You can share some common stuff between the configs (generation args, etc.)
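
A hypothetical shape of such a hierarchical config (class and field names are illustrative, not lighteval's real API) might look like this:

```python
from dataclasses import dataclass, field

@dataclass
class GenerationArgs:              # shared between all model configs (point 5)
    temperature: float = 0.0
    max_new_tokens: int = 256

@dataclass
class BaseModelConfig:
    pretrained: str
    batch_size: int = 1
    generation: GenerationArgs = field(default_factory=GenerationArgs)

@dataclass
class NanotronModelConfig:         # nanotron becomes yet another model config (point 4)
    checkpoint_path: str
    generation: GenerationArgs = field(default_factory=GenerationArgs)

# A config library (OmegaConf or similar) can parse a YAML file or CLI overrides
# directly into these classes, giving typed validation errors and documentation
# "for free" (points 1-3) instead of hand-rolled dict access.
```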

Now, the reason we use flattened configs is probably the CLI interface, right? I think it would make sense to either:

  1. If one uses the CLI, they can pass hierarchical args using dot syntax, e.g. 'model=pretrained, model.name=llama' (a small parsing sketch follows this list)
  2. Make the CLI interface minimal and parse it into the hierarchical config
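
One possible interpretation of the dot syntax in option 1 (a minimal hand-rolled sketch; a config library could do the same thing) is to fold dotted keys into a nested dict before parsing it into the dataclasses above:

```python
def parse_dot_args(args: list[str]) -> dict:
    """Turn dotted key=value CLI overrides into a nested dict."""
    config: dict = {}
    for arg in args:
        key, value = arg.split("=", 1)
        node = config
        *parents, leaf = key.split(".")
        for part in parents:
            node = node.setdefault(part, {})
        node[leaf] = value
    return config

print(parse_dot_args(["model.type=pretrained", "model.name=llama"]))
# {'model': {'type': 'pretrained', 'name': 'llama'}}
```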

It's not a feature, but something that would make maintenance much easier imo, so it doesn't have much priority.

@NathanHB (Member)

I see! I agree with you. I recently added vllm models and they have different configs than base models; our current system is hard to document and clunky for users. I like the idea of having model=pretrained, model.name=llama in the CLI, but it would make the CLI command much longer to type.

What do you mean by: "Make the CLI interface minimal and parse it into the hierarchical config"?

hynky1999 merged commit 5034a96 into main on Sep 13, 2024 (2 checks passed)
@hynky1999 (Collaborator, Author) commented Sep 13, 2024

> I see! I agree with you. I recently added vllm models and they have different configs than base models; our current system is hard to document and clunky for users. I like the idea of having model=pretrained, model.name=llama in the CLI, but it would make the CLI command much longer to type.
>
> What do you mean by: "Make the CLI interface minimal and parse it into the hierarchical config"?

  1. We would select a minimal usable interface for each model (e.g. for pretrained it could be just the name) and only parse that from the CLI; this makes it a bit easier to maintain (a small sketch follows this list)
  2. We change the flattened config to a hierarchical one, so that we can reuse stuff
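
A hypothetical illustration of option 1 (argument names and the config class are made up for the example): the CLI only exposes the minimal per-model fields, and typed defaults fill in everything else.

```python
import argparse
from dataclasses import dataclass

@dataclass
class BaseModelConfig:                      # illustrative, not lighteval's real class
    pretrained: str
    batch_size: int = 1
    add_bos_token: bool = True

parser = argparse.ArgumentParser()
parser.add_argument("--model-name", required=True)    # the minimal usable interface

args = parser.parse_args(["--model-name", "llama"])   # from sys.argv in practice
config = BaseModelConfig(pretrained=args.model_name)  # defaults fill in the rest
print(config)
```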

I like the first method more, though.
