data generate --model parameter used for local file path and where to point to remote teacher model endpoint (need two separate variables) #425
Comments
I believe this change introduced the regression:
Slack thread context, just in case anything is missed:

Hey, yep! I think what you added makes sense, but I think it implicitly broke the case of still using a remote teacher model endpoint. When using endpoint-url, the --model flag for SDG now gets used for two independent pieces that conflict with one another: one is to point to the remote model path (an HTTP path) on the remote model-serving server (vLLM); the other is to point to a local path on the machine (for RHEL AI, within the container) holding the tokenizer files. I just think that, if we are supporting endpoint-url, the path to the local tokenizer files needs to be exposed as a separate variable, something like --teacher-tokenization-files or similar. (I think one reason this might seem weird and wasn't hit is that you were running the teacher model and the SDG data process all on the same machine; but instructlab also supports pointing to remote teacher models, where the local machine running the SDG pipeline only runs the data aggregation/context chunking pieces. That's the path I see that has regressed.)
More Slack convo:

It won't enforce that you only use one, but the problem is that --model now has two independent meanings that conflict with each other: one is an HTTP path on a remote server; the other is a local file path on a machine. That is a problem: basically, in order to get it to work today, you have to get lucky that your remote path on your server can also be created on your local machine. I don't think that is something that is wanted. I think we need two variables in the CLI, one of which is used to specify the local tokenizer files you want to use for the context-aware chunking tokenization. I think the best way is maybe to show an example:

```
[root@ty-sdg-pdftry ~]# nohup /root/bin/ilab.sh data generate --taxonomy-path /root/RBC-Instructlab-Taxonomy --taxonomy-base empty --endpoint-url https://781d2e7c-us-east.lb.appdomain.cloud/v1 --model-family mixtral --sdg-scale-factor 30 --pipeline /root/sdg-config/pipelines/agentic/ --model /instructlab/models/mixtral-8x7b-instruct-v0-1 --output-dir /root/outputdir/ --server-ctx-size 32768 --tls-insecure &
```

Tangibly, when you use an endpoint URL, --model and --endpoint-url are utilized together to form the full path for chatting with the model (in the OpenAI client). Additionally, now that the local filesystem piece is utilized as well, it will assume that on the local host at /instructlab/models/mixtral-8x7b-instruct-v0-1 there are model files for the context-aware chunking. That is a problem, since you cannot just create /instructlab/models on a host (protected directory). So what I am saying is that, with the addition of utilizing the local path, there should be another variable that can be used instead (if it's not specified, we can default to the model path to keep the same behavior), and then we have the ability to properly configure both endpoints.
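To make the proposal concrete, a hypothetical invocation with a separate flag might look like the sketch below. The flag name --teacher-tokenization-files is only the name suggested above; it does not exist in ilab today, and the host paths are illustrative:

```sh
# Hypothetical sketch only: --teacher-tokenization-files is the proposed
# new flag, not an existing ilab option. Paths are illustrative.
ilab data generate \
  --taxonomy-path /root/RBC-Instructlab-Taxonomy \
  --taxonomy-base empty \
  --endpoint-url https://781d2e7c-us-east.lb.appdomain.cloud/v1 \
  --model /instructlab/models/mixtral-8x7b-instruct-v0-1 \
  --teacher-tokenization-files /root/tokenizers/mixtral-8x7b-instruct-v0-1 \
  --output-dir /root/outputdir/
```

With this split, --model would keep its remote meaning (the served model name sent to the endpoint), while the new flag names the local tokenizer files, defaulting to the --model path when unset to preserve current behavior.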
Thanks for raising this @relyt0925. I was not aware that SDG used the supplied model path to do anything when the endpoint URL has been specified. By following the code, I see that the endpoint URL is just used to construct an OpenAI client that can talk directly with what I'm assuming is the model hosted at the server described by the supplied endpoint. While I understand the problem, I'm not sure adding a new flag specifically to reference just the tokenizer files is necessarily the way to go. I would be more interested in understanding why the model path needs to take on a different meaning when the endpoint URL has entered the picture, and whether there is something we can do to prevent that from happening.
It sounds like we need to add some tests around remote endpoint support in SDG, and then figure out exactly what the different params do and how they interact with that. I think the bulk of this work will be in the instructlab/sdg repo: reconciling these parameters and determining how to proceed, whether we need to expose a new knob or whether some combination of existing params can be made to do the right thing. Is there an easy way to move this issue over there, or should we copy this into a new one that focuses just on the changes needed in the SDG repo to get this working again?
What's the urgency of this issue, to help with prioritization? Were you able to get things going with some workarounds for now, or is this blocking the ability to use InstructLab for your use case?
This one is non-blocking!
But I do believe this should ultimately be a unique, controllable variable. (We were able to proceed by doing some mount magic between the containerized filesystem and the host, basically ensuring the proper host path is mounted at the same file path that the remote model path refers to.)
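For concreteness, the mount workaround described above is presumably along these lines (a hedged sketch; the host path, image name, and mount target are illustrative, not the exact commands used):

```sh
# Hypothetical illustration of the bind-mount workaround: make the local
# tokenizer files visible inside the container at the exact path that
# --model names, so the chunker finds them on the "local" filesystem.
podman run --rm -it \
  -v /root/tokenizers/mixtral-8x7b-instruct-v0-1:/instructlab/models/mixtral-8x7b-instruct-v0-1:ro \
  <rhel-ai-image> \
  ilab data generate \
    --endpoint-url https://781d2e7c-us-east.lb.appdomain.cloud/v1 \
    --model /instructlab/models/mixtral-8x7b-instruct-v0-1
```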
I agree we need to separate out these variables, and will keep this on the radar to tackle. I'm glad you have a workaround for now, even if it sounds like it was quite a bit of work to implement.
Per usual: super appreciative of everyone's time and attention/help reviewing issues! All of instructlab has an amazing community around it :)
Describe the bug
With the addition of context-aware chunking, the --model parameter in data generate is used in two competing places, which makes it impossible to interact with a remote teacher endpoint anymore. The first place it is used is to point to local tokenizer files to enable context-aware tokenization, shown in the code path below:
https://github.com/instructlab/instructlab/blob/main/src/instructlab/data/generate_data.py#L93
https://github.com/instructlab/sdg/blob/main/src/instructlab/sdg/utils/chunkers.py#L226
https://github.com/instructlab/sdg/blob/main/src/instructlab/sdg/utils/chunkers.py#L151
https://github.com/instructlab/sdg/blob/main/src/instructlab/sdg/utils/taxonomy.py#L427
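Roughly, this code path treats the --model value as a local directory of tokenizer files. A simplified sketch of what it amounts to (assuming the Hugging Face transformers library used by the linked chunker code; this is not the verbatim SDG source):

```python
# Chunking-side use of --model: the value is treated as a local filesystem
# path holding tokenizer files for context-aware chunking.
from transformers import AutoTokenizer

model_path = "/instructlab/models/mixtral-8x7b-instruct-v0-1"  # from --model

# Raises an error if no tokenizer files exist at this local path -- exactly
# the failure reported when --model actually names a remote served model.
tokenizer = AutoTokenizer.from_pretrained(model_path)
```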
The second, independent path is that it is combined with the base endpoint-url to properly talk to a remote teacher model endpoint, shown in the pipeline context:
https://github.com/instructlab/instructlab/blob/main/src/instructlab/data/generate_data.py#L93
https://github.com/instructlab/sdg/blob/main/src/instructlab/sdg/generate_data.py#L305
https://github.com/instructlab/sdg/blob/main/src/instructlab/sdg/generate_data.py#L378
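Here the same --model value is instead passed as the model identifier on requests to the OpenAI-compatible server at --endpoint-url. A simplified sketch of what these lines amount to (not the verbatim SDG source):

```python
# Serving-side use of --model: the value is the model identifier sent to
# the OpenAI-compatible server at --endpoint-url; the local filesystem is
# never consulted on this path.
from openai import OpenAI

endpoint_url = "https://781d2e7c-us-east.lb.appdomain.cloud/v1"  # --endpoint-url
model_path = "/instructlab/models/mixtral-8x7b-instruct-v0-1"    # --model

# vLLM-style servers typically accept a placeholder API key.
client = OpenAI(base_url=endpoint_url, api_key="EMPTY")
response = client.chat.completions.create(
    model=model_path,  # resolved by the remote server, not the local disk
    messages=[{"role": "user", "content": "Hello"}],
)
print(response.choices[0].message.content)
```

These two sketches use the same --model string for incompatible purposes, which is the conflict this issue describes.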
Therefore, when running ilab data generate with endpoint-url on a PDF document, you will always see the following error:
There should be a separate variable/CLI parameter for passing in the local context-aware tokenizer files, distinct from the one used to point to the remote endpoint URL hosting the teacher model.
To Reproduce
Steps to reproduce the behavior:
Expected behavior
I should be able to run data generate against a remote teacher model endpoint on a taxonomy with a PDF file if I have the local tokenizer files. I should be able to point to the remote teacher model and the local files through separate variables.
Screenshots
Device Info (please complete the following information):
python --version: 3.11
ilab system info:
Additional context