
support for concurrency in llm models #519

Closed
wants to merge 5 commits

Conversation


@dtrawins commented Jan 17, 2024

Allows running LLM generate operations in a multithreaded application.

Parallel execution support for stateful LLM models is implemented using a cloned model object that shares the OV model and compiled model with the original, avoiding duplicated memory consumption. Each cloned model object uses a single OpenVINO inference_request object, which keeps the execution context and state.

model = OVModelForCausalLM.from_pretrained(model_path, compile=True)  # done once
model_exec = model.clone()  # done in every new thread
outputs = model_exec.generate(**generate_kwargs)

For stable diffusion and seq2seq pipelines, which do not use stateful models, no changes are needed for multithreaded execution.
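
As an illustration, here is a minimal end-to-end sketch of the clone pattern used from multiple threads (the model path, prompts, and generation arguments below are assumed for the example and are not part of the PR):

from threading import Thread

from transformers import AutoTokenizer
from optimum.intel import OVModelForCausalLM

model_path = "my_ov_model"  # assumed: a local OpenVINO-exported model directory
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = OVModelForCausalLM.from_pretrained(model_path, compile=True)  # loaded and compiled once

def worker(prompt):
    model_exec = model.clone()  # per-thread clone sharing the OV model and compiled model
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model_exec.generate(**inputs, max_new_tokens=20)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))

threads = [Thread(target=worker, args=(p,)) for p in ["Hello,", "OpenVINO is"]]
for t in threads:
    t.start()
for t in threads:
    t.join()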

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@helena-intel
Collaborator

Thanks @dtrawins, I will test this. You can fix the style check by running make style after pip install .[quality] in the repository root.

@dtrawins marked this pull request as ready for review January 19, 2024 11:10
Collaborator

@echarlaix left a comment


Thanks a lot @dtrawins

logger.info(f"Compiling the encoder to {self._device} ...")
self.request = core.compile_model(self.model, self._device, self.ov_config)
Collaborator

We might want to keep support for at least a couple of minor releases, and add a warning to specify that this attribute will be deprecated.

Author

The request attribute was not exposed in any methods, so I would expect it to be an internal implementation detail of the class. It is now called compiled_model, which better describes the stored object. We could keep it as a duplicate, but I wonder if that could bring some confusion.

Collaborator

Yes, compiled_model is a better name, but people are used to .request now; I use it quite a lot and expect I'm not alone (for example, to get a compiled model property). It can also be used in integrations that build on optimum-intel. So I think moving to compiled_model is good, but ideally we should show a deprecation warning for .request, and if not, at least keep it as an alias for now.

Author

@helena-intel Would it be OK to create a new attribute infer_request to store the infer_request object and keep request as an alias for compiled_model? I wonder how we could add a deprecation notice for the request attribute when it is not used in any methods. A comment in the code? I guess it wasn't documented, since it is an internal implementation detail.
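
For illustration only, a minimal sketch of what a deprecated alias could look like, following the naming discussed above (this is not code from the PR):

import warnings

class DeprecatedRequestAliasSketch:
    # Sketch: compiled_model is the new attribute; request stays as a warned alias.
    def __init__(self, compiled_model):
        self.compiled_model = compiled_model

    @property
    def request(self):
        warnings.warn(
            "The `request` attribute is deprecated, use `compiled_model` instead.",
            DeprecationWarning,
            stacklevel=2,
        )
        return self.compiled_model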

@@ -197,8 +198,14 @@ def forward(
inputs["token_type_ids"] = token_type_ids

# Run inference
outputs = self.request(inputs)
logits = torch.from_numpy(outputs["logits"]).to(self.device) if not np_inputs else outputs["logits"]
infer_request = self.compiled_model.create_infer_request()
Collaborator

Do we need to create an inference request for each prediction? (From my understanding it's needed for stateful models at each generation step, but that might not be the case here.)

Author

Actually, new infer_requests are needed for stateless models. That ensures each prediction from each execution thread is independent; otherwise only one generate operation at a time would be possible on a model. Stateful LLM models take a different approach: each generate operation has a single infer request, so the state is preserved between iterations.
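
A small sketch of that stateless pattern (the model path and input batches are assumed; this is illustrative, not the PR's code):

from threading import Thread

from openvino.runtime import Core

core = Core()
compiled_model = core.compile_model("model.xml", "CPU")  # assumed model path; compiled once and shared

def predict(inputs):
    # A fresh infer request per call keeps each prediction independent,
    # so several threads can run inference on the same compiled model concurrently.
    infer_request = compiled_model.create_infer_request()
    return infer_request.infer(inputs)

threads = [Thread(target=predict, args=(batch,)) for batch in input_batches]  # input_batches assumed
for t in threads:
    t.start()
for t in threads:
    t.join()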

from transformers import AutoConfig, AutoTokenizer, set_seed

from optimum.intel import OVModelForCausalLM

Collaborator

If we want to add additional tests, could they be integrated into test_modeling.py?

Collaborator

I think this was meant more as an example for testing than as an automated test. I would suggest removing it for now, and later adding an example (in the examples folder). The code in this example is good for testing the potential of multiconcurrency in combination with larger batches, but it is not clear whether the performance benefit comes from the larger batch or from the multiconcurrency. I made some code that makes that clearer; I can clean it up and add it as an example later, if that's useful.

Author

Pytest tests will be added to validate the multithreading, so those scripts are to be dropped. I guess they could be added as examples, but that would probably be part of a separate PR.

@dtrawins commented Mar 1, 2024

This functionality is continued in #573
