
support for concurrency in llm models #519

Closed
wants to merge 5 commits

Conversation


@dtrawins commented Jan 17, 2024

Allows running LLM generate operations in a multithreaded application.

Parallel execution support for stateful LLM models is implemented using a cloned model object that shares the OV model and compiled model with the original, avoiding duplicated memory consumption. Each cloned model object uses a single OpenVINO inference_request object, which keeps the execution context and state.

model = OVModelForCausalLM.from_pretrained(model_path, compile=True)  # done once
model_exec = model.clone()  # done in every new thread
outputs = model_exec.generate(**generate_kwargs)

For stable diffusion and seq2seq pipelines, which do not use stateful models, no changes are needed for multithreaded execution.
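
As an illustration, here is a minimal end-to-end sketch of the clone pattern used from multiple threads (the model path, prompts, and generation arguments below are assumed for the example and are not part of the PR):

from threading import Thread

from transformers import AutoTokenizer
from optimum.intel import OVModelForCausalLM

model_path = "my_ov_model"  # assumed: a local OpenVINO-exported model directory
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = OVModelForCausalLM.from_pretrained(model_path, compile=True)  # loaded and compiled once

def worker(prompt):
    model_exec = model.clone()  # per-thread clone sharing the OV model and compiled model
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model_exec.generate(**inputs, max_new_tokens=20)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))

threads = [Thread(target=worker, args=(p,)) for p in ["Hello,", "OpenVINO is"]]
for t in threads:
    t.start()
for t in threads:
    t.join()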

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@helena-intel
Collaborator

Thanks @dtrawins, I will test this. You can fix the style check by running make style after pip install .[quality] in the repository root.

@dtrawins marked this pull request as ready for review January 19, 2024 11:10
Collaborator

@echarlaix left a comment


Thanks a lot @dtrawins

logger.info(f"Compiling the encoder to {self._device} ...")
self.request = core.compile_model(self.model, self._device, self.ov_config)
Collaborator

We might want to keep support for at least a couple of minor releases, and add a warning to specify that this attribute will be deprecated.

Author

The request attribute was not exposed in any methods, so I would expect it to be an internal implementation detail of the class. It is now called compiled_model, which better describes the stored object. We could keep it as a duplicate, but I wonder if that could bring some confusion.

Collaborator

Yes, compiled_model is a better name, but people are used to .request now; I use it quite a lot and expect I'm not alone (for example, to get a compiled model property). It can also be used in integrations that build on optimum-intel. So I think moving to compiled_model is good, but ideally we should show a deprecation warning for .request, and if not, at least keep it as an alias for now.

Author

@helena-intel Would it be OK to create a new attribute infer_request to store the infer_request object and keep request as an alias for compiled_model? I wonder how we could add a deprecation notice for the request attribute when it is not used in any methods. A comment in the code? I guess it wasn't documented, since it is an internal implementation detail.
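
For illustration only, a minimal sketch of what a deprecated alias could look like, following the naming discussed above (this is not code from the PR):

import warnings

class DeprecatedRequestAliasSketch:
    # Sketch: compiled_model is the new attribute; request stays as a warned alias.
    def __init__(self, compiled_model):
        self.compiled_model = compiled_model

    @property
    def request(self):
        warnings.warn(
            "The `request` attribute is deprecated, use `compiled_model` instead.",
            DeprecationWarning,
            stacklevel=2,
        )
        return self.compiled_model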

@@ -197,8 +198,14 @@ def forward(
inputs["token_type_ids"] = token_type_ids

# Run inference
outputs = self.request(inputs)
logits = torch.from_numpy(outputs["logits"]).to(self.device) if not np_inputs else outputs["logits"]
infer_request = self.compiled_model.create_infer_request()
Collaborator

Do we need to create an inference request for each prediction? (From my understanding it's needed for stateful models at each generation step, but that might not be the case here.)

Author

Actually, new infer_requests are needed for stateless models. That ensures each prediction from each execution thread is independent; otherwise only one generate operation at a time would be possible on a model. Stateful LLM models take a different approach: each generate operation has a single infer request, so the state is preserved between iterations.
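
A small sketch of that stateless pattern (the model path and input batches are assumed; this is illustrative, not the PR's code):

from threading import Thread

from openvino.runtime import Core

core = Core()
compiled_model = core.compile_model("model.xml", "CPU")  # assumed model path; compiled once and shared

def predict(inputs):
    # A fresh infer request per call keeps each prediction independent,
    # so several threads can run inference on the same compiled model concurrently.
    infer_request = compiled_model.create_infer_request()
    return infer_request.infer(inputs)

threads = [Thread(target=predict, args=(batch,)) for batch in input_batches]  # input_batches assumed
for t in threads:
    t.start()
for t in threads:
    t.join()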

from transformers import AutoConfig, AutoTokenizer, set_seed

from optimum.intel import OVModelForCausalLM

Collaborator

If we want to add additional tests, could they be integrated into test_modeling.py?

Collaborator

I think this was meant more as an example for testing than as an automated test. I would suggest removing it for now, and later adding an example (in the examples folder). The code in this example is good for testing the potential of multiconcurrency in combination with larger batches, but it is not clear whether the performance benefit comes from the larger batch or from the multiconcurrency. I made some code that makes that clearer; I can clean it up and add it as an example later, if that's useful.

Author

Pytest tests will be added to validate the multithreading, so those scripts are to be dropped. I guess they could be added as examples, but that would probably be part of a separate PR.

@dtrawins commented Mar 1, 2024

This functionality is continued in #573
