data generate --model parameter is used both as a local file path and as the pointer to the remote teacher model endpoint (need two separate variables) #425

Open · relyt0925 opened this issue Nov 21, 2024 · 10 comments
Labels: bug (Something isn't working)

@relyt0925 (Contributor)

Describe the bug

With the addition of context-aware chunking, the --model parameter in data generate is used in two competing places, which makes it impossible to interact with a remote teacher endpoint. The first place it is used is as the path to a local tokenizer that enables context-aware chunking, shown in the code path below (a rough sketch follows these links):

https://github.com/instructlab/instructlab/blob/main/src/instructlab/data/generate_data.py#L93
https://github.com/instructlab/sdg/blob/main/src/instructlab/sdg/utils/chunkers.py#L226
https://github.com/instructlab/sdg/blob/main/src/instructlab/sdg/utils/chunkers.py#L151
https://github.com/instructlab/sdg/blob/main/src/instructlab/sdg/utils/taxonomy.py#L427
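For illustration only, the chunking path essentially needs something like the following (a minimal sketch, not the actual SDG code; the function and variable names are assumed):

# Minimal sketch of the local-tokenizer path (not the exact SDG code).
# model_path stands in for whatever was passed via --model; it must be a
# local directory containing tokenizer files for this call to succeed.
from transformers import AutoTokenizer

def load_chunking_tokenizer(model_path: str):
    # Raises if model_path does not exist locally or lacks tokenizer files,
    # which is the failure reported below when --model actually names a
    # model served on a remote endpoint.
    return AutoTokenizer.from_pretrained(model_path)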

The second, independent path is that it is combined with the base --endpoint-url to talk to the remote teacher model endpoint, shown in the pipeline context (a rough sketch follows these links):
https://github.com/instructlab/instructlab/blob/main/src/instructlab/data/generate_data.py#L93
https://github.com/instructlab/sdg/blob/main/src/instructlab/sdg/generate_data.py#L305
https://github.com/instructlab/sdg/blob/main/src/instructlab/sdg/generate_data.py#L378
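Roughly speaking, the remote path treats the same value as the model name sent to the server (a sketch assuming the standard openai Python client; the values shown mirror the command below and are illustrative):

# Sketch of the remote-teacher path: --endpoint-url becomes the client's
# base_url, and --model becomes the model name sent with each request.
from openai import OpenAI

client = OpenAI(
    base_url="https://781d2e7c-us-east.lb.appdomain.cloud/v1",  # --endpoint-url
    api_key="EMPTY",  # placeholder; a remote vLLM server may not check it
)
response = client.chat.completions.create(
    model="/instructlab/models/mixtral-8x7b-instruct-v0-1",  # --model, as registered on the server
    messages=[{"role": "user", "content": "..."}],
)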

Therefore, when running ilab data generate with --endpoint-url against a taxonomy containing a PDF document, you will always see the following error:

[root@ty-sdg-pdftry ~]# /root/bin/ilab.sh data generate --taxonomy-path /root/RBC-Instructlab-Taxonomy --taxonomy-base empty --endpoint-url https://781d2e7c-us-east.lb.appdomain.cloud/v1  --model-family mixtral --sdg-scale-factor 30 --pipeline /root/sdg-config/pipelines/agentic/ --model /instructlab/models/mixtral-8x7b-instruct-v0-1 --output-dir /root/outputdir/ --server-ctx-size 32768
INFO 2024-11-21 03:47:56,419 numexpr.utils:148: Note: NumExpr detected 24 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 16.
INFO 2024-11-21 03:47:56,419 numexpr.utils:161: NumExpr defaulting to 16 threads.
INFO 2024-11-21 03:47:58,664 datasets:59: PyTorch version 2.4.1 available.
INFO 2024-11-21 03:48:01,028 instructlab.data.generate_data:87: Generating synthetic data using '/root/sdg-config/pipelines/agentic/' pipeline, '/instructlab/models/mixtral-8x7b-instruct-v0-1' model, '/root/RBC-Instructlab-Taxonomy' taxonomy, against https://781d2e7c-us-east.lb.appdomain.cloud/v1 server
INFO 2024-11-21 03:48:02,556 instructlab.sdg.utils.taxonomy:147: Processing files...
INFO 2024-11-21 03:48:02,556 instructlab.sdg.utils.taxonomy:153: Pattern 'RBC_ILAB.pdf' matched 1 files.
INFO 2024-11-21 03:48:02,556 instructlab.sdg.utils.taxonomy:157: Processing file: /root/outputdir/documents-2024-11-21T03_48_01/RBC_ILAB.pdf
INFO 2024-11-21 03:48:02,556 instructlab.sdg.utils.taxonomy:172: Loading PDF document from /root/outputdir/documents-2024-11-21T03_48_01/RBC_ILAB.pdf
INFO 2024-11-21 03:48:02,566 instructlab.sdg.utils.taxonomy:182: PDF '/root/outputdir/documents-2024-11-21T03_48_01/RBC_ILAB.pdf' has 283 pages.
INFO 2024-11-21 03:51:00,399 instructlab.sdg.utils.taxonomy:218: Unloaded PDF document: /root/outputdir/documents-2024-11-21T03_48_01/RBC_ILAB.pdf
INFO 2024-11-21 03:51:36,400 instructlab.sdg.generate_data:408: Synthesizing new instructions. If you aren't satisfied with the generated instructions, interrupt training (Ctrl-C) and try adjusting your YAML files. Adding more examples may help.
ERROR 2024-11-21 03:51:38,716 instructlab.sdg.utils.chunkers:397: Failed to load tokenizer as no valid model was not found at /instructlab/models/mixtral-8x7b-instruct-v0-1. Please provide a path to a valid model format. For help on downloading models, run `ilab model download --help`.

There should be a separate variable/CLI parameter for passing in the local context-aware tokenizer files, distinct from the parameter used to point to the remote endpoint URL hosting the teacher model (one possible shape is sketched below).
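One possible shape, purely as a hypothetical sketch (the flag name and wiring are not part of the current CLI), is an extra option that falls back to --model when unset:

# Hypothetical CLI sketch: a separate flag for the local tokenizer path
# alongside the existing --model flag. The flag name is illustrative only.
import click

@click.command()
@click.option("--model", "model_path", help="Model name/path sent to the remote serving endpoint.")
@click.option(
    "--teacher-tokenizer-path",  # hypothetical new flag
    default=None,
    help="Local directory with tokenizer files for context-aware chunking.",
)
def generate(model_path, teacher_tokenizer_path):
    # Fall back to --model when the new flag is unset, preserving today's
    # single-machine behavior.
    tokenizer_path = teacher_tokenizer_path or model_path
    click.echo(f"remote model: {model_path}, local tokenizer: {tokenizer_path}")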

To Reproduce
Steps to reproduce the behavior:

  1. Run ilab data generate using --endpoint-url on a taxonomy that contains a PDF document (example shown above)
  2. It will error with "Failed to load tokenizer"

Expected behavior

I should be able to run data generate against a remote teacher model endpoint on a taxonomy with a PDF file, provided I have the local tokenizer files. I should be able to point to the remote teacher model and to the local files through separate variables.

Screenshots

Device Info (please complete the following information):

  • Hardware Specs: NVIDIA L40
  • OS Version: RHEL AI 1.2
  • Python Version: 3.11
  • InstructLab Version: (see ilab system info output below)
Platform:
  sys.version: 3.11.7 (main, Aug 23 2024, 00:00:00) [GCC 11.4.1 20231218 (Red Hat 11.4.1-3)]
  sys.platform: linux
  os.name: posix
  platform.release: 5.14.0-427.37.1.el9_4.x86_64
  platform.machine: x86_64
  platform.node: ty-sdg-pdftry
  platform.python_version: 3.11.7
  os-release.ID: rhel
  os-release.VERSION_ID: 9.4
  os-release.PRETTY_NAME: Red Hat Enterprise Linux 9.4 (Plow)
  memory.total: 117.91 GB
  memory.available: 116.30 GB
  memory.used: 0.58 GB

InstructLab:
  instructlab.version: 0.21.0
  instructlab-dolomite.version: 0.2.0
  instructlab-eval.version: 0.4.1
  instructlab-quantize.version: 0.1.0
  instructlab-schema.version: 0.4.1
  instructlab-sdg.version: 0.6.0
  instructlab-training.version: 0.6.1

Torch:
  torch.version: 2.4.1+cu121
  torch.backends.cpu.capability: AVX512
  torch.version.cuda: 12.1
  torch.version.hip: None
  torch.cuda.available: True
  torch.backends.cuda.is_built: True
  torch.backends.mps.is_built: False
  torch.backends.mps.is_available: False
  torch.cuda.bf16: True
  torch.cuda.current.device: 0
  torch.cuda.0.name: NVIDIA L40S
  torch.cuda.0.free: 44.1 GB
  torch.cuda.0.total: 44.5 GB
  torch.cuda.0.capability: 8.9 (see https://developer.nvidia.com/cuda-gpus#compute)

llama_cpp_python:
  llama_cpp_python.version: 0.2.79
  llama_cpp_python.supports_gpu_offload: True

Additional context

relyt0925 added the bug (Something isn't working) label on Nov 21, 2024
@relyt0925 (Contributor, Author)

I believe this change introduced the regression:
#364

@relyt0925 (Contributor, Author)

Slack thread context, just in case anything is missed:

Hey, yep! I think what you added makes sense, but I think it implicitly broke the ability to keep using a remote teacher model endpoint:
#425
I see why you load the teacher model locally, since you need the tokenizer files/content for the local piece. But I think that path needs a new CLI variable, or else the --endpoint-url functionality in SDG regresses (hopefully that issue makes sense).

When using --endpoint-url, the --model flag for SDG now gets used for two independent pieces that conflict with one another.

One is to point to the remote model path (an HTTP path on the remote model-serving server, vLLM); the other is to point to a local path on the machine (for RHEL AI, within the container) that holds the tokenizer files.

I just think that if we are supporting --endpoint-url, the path to the local tokenization files needs to be exposed as a separate variable, something like --teacher-tokenization-files.

(I think one reason this might seem odd and wasn't hit earlier is that you were running the teacher model and the SDG data process on the same machine; but InstructLab also supports pointing to remote teacher models, where the local machine running the SDG pipeline only runs the data aggregation/context chunking pieces. That's the path I see as having regressed.)

@relyt0925 (Contributor, Author)

More Slack conversation:

It won't force you to use only one, but the problem is that --model now has a dual meaning,

and the two independent meanings conflict with each other: one is an HTTP path on a remote server,

the other is a local file path on a machine.

That is a problem: basically, to get it to work today, you have to get lucky that the remote model path on your server can also be created on your local machine. I don't think that is desirable. I think we need two CLI variables, one of which specifies the local tokenization files to use for context-aware chunking.

I think the best way is maybe to show an example:

[root@ty-sdg-pdftry ~]# nohup /root/bin/ilab.sh data generate --taxonomy-path /root/RBC-Instructlab-Taxonomy --taxonomy-base empty --endpoint-url https://781d2e7c-us-east.lb.appdomain.cloud/v1 --model-family mixtral --sdg-scale-factor 30 --pipeline /root/sdg-config/pipelines/agentic/ --model /instructlab/models/mixtral-8x7b-instruct-v0-1 --output-dir /root/outputdir/ --server-ctx-size 32768 --tls-insecure &

Tangibly, when you use --endpoint-url, --model and --endpoint-url are basically used together to form the full path for chatting with the model

(in the OpenAI client).

And additionally, with the local-filesystem piece now being used as well, it will assume that model files for context-aware chunking exist on the local host at /instructlab/models/mixtral-8x7b-instruct-v0-1. That is a problem, since you cannot just create /instructlab/models on a host (it is a protected directory).

So what I am saying is that with the addition of the local path there should be another variable, like:
nohup /root/bin/ilab.sh data generate --taxonomy-path /root/RBC-Instructlab-Taxonomy --taxonomy-base empty --endpoint-url https://781d2e7c-us-east.lb.appdomain.cloud/v1 --model-family mixtral --sdg-scale-factor 30 --pipeline /root/sdg-config/pipelines/agentic/ --model /instructlab/models/mixtral-8x7b-instruct-v0-1 --context-aware-tokenization-files LOCAL_PATH_ON_HOST --output-dir /root/outputdir/ --server-ctx-size 32768 --tls-insecure &

so that the new variable can be used (if it's not specified, we can default to the model path to keep the same behavior), giving us the ability to properly configure both. (An illustrative sketch follows.)
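In code, that defaulting might look like this (an illustrative sketch, not existing SDG behavior; the names are assumed):

# Keep the two meanings separate: the remote model name goes to the OpenAI
# client, while the chunker gets a local tokenizer path that defaults to
# --model so existing single-machine setups keep working.
def resolve_paths(model: str, context_aware_tokenization_files: str | None):
    remote_model_name = model  # sent to the serving endpoint
    local_tokenizer_path = context_aware_tokenization_files or model  # used by the chunker
    return remote_model_name, local_tokenizer_path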

@jaideepr97 (Member)

Thanks for raising this @relyt0925

I was not aware that SDG used the supplied model path for anything when an endpoint URL has been specified. Following the code, I see that the endpoint URL is just used to construct an OpenAI client that talks directly with what I'm assuming is the model hosted at the server described by the supplied endpoint.

While I understand the problem, I'm not sure adding a new flag specifically to reference just the tokenizer files is necessarily the way to go. I would be more interested in understanding why the model path needs to take on a different meaning once an endpoint URL enters the picture, and whether there is something we can do to prevent that from happening.

cc @bbrowning @aakankshaduggal @khaledsulayman

@bbrowning (Contributor)

It sounds like we need to add some tests around remote endpoint support in SDG, and then figure out exactly what the different params do and how they interact. I think the bulk of this work will be in the instructlab/sdg repo: reconciling these parameters and determining how to proceed, whether we need to expose a new knob or whether some combination of existing params can be made to do the right thing. Is there an easy way to move this issue over there, or should we copy this into a new one that focuses just on the changes needed in the SDG repo to get this working again?

ktam3 transferred this issue from instructlab/instructlab on Dec 4, 2024
@bbrowning (Contributor)

What's the urgency of this issue, to help with prioritization? Were you able to get things going with some workarounds for now, or is this blocking the ability to use InstructLab for your use-case?

@relyt0925 (Contributor, Author)

This one is non-blocking!

@relyt0925 (Contributor, Author)

But I do believe this should ultimately be a separate, controllable variable. (We were able to proceed with some mount magic between the containerized filesystem and the host, basically ensuring the proper host path is mounted at the same file path the remote model path expects.)

@bbrowning (Contributor)

I agree we need to separate out these variables, and will keep this on the radar to tackle. I'm glad you have a workaround for now, even if it sounds like it was quite a bit of work to implement.

@relyt0925 (Contributor, Author)

Per usual, super appreciative of everyone's time and attention/help reviewing issues! All of InstructLab has an amazing community around it :)
