Device allocation issue when running constrained generation #1713
Unanswered · GitHubOfAndrew asked this question in Q&A
Replies: 1 comment
---
Hi @GitHubOfAndrew! We've fixed a bug related to device location in Outlines.
---
For context, I am facing the above issue when supplying text and/or image inputs to a multimodal LLM (Llama 3.2 11B Vision Instruct). I am running this on a Vertex AI Workbench in GCP, configured with a single NVIDIA A100 GPU, using `outlines==1.2.1`. Beyond this, I have looked at discussion #1708, which was seemingly an identical issue; however, the resolution reached there did not work for me (I got a CUDA-level error). It does not seem that they were using multimodal functionality, so I am opening my own issue (please let me know if this is not appropriate).
This is the code snippet that is giving me errors:
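(The snippet block itself failed to load on this page; below is a minimal sketch of roughly the shape of such a call, assuming outlines 1.x's `from_transformers` wrapper and its `outlines.inputs.Image` input type, with a placeholder schema and image path rather than the original code.)

```python
# Minimal sketch (assumed, not the original snippet): multimodal constrained
# generation with outlines 1.x wrapping Llama 3.2 11B Vision Instruct.
import torch
from PIL import Image as PILImage
from pydantic import BaseModel
from transformers import AutoProcessor, MllamaForConditionalGeneration

import outlines
from outlines.inputs import Image

MODEL_ID = "meta-llama/Llama-3.2-11B-Vision-Instruct"

# Wrap the HF model and processor in an outlines multimodal model.
model = outlines.from_transformers(
    MllamaForConditionalGeneration.from_pretrained(
        MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
    ),
    AutoProcessor.from_pretrained(MODEL_ID),
)

class Caption(BaseModel):
    # Placeholder output schema for JSON-constrained generation.
    description: str

image = PILImage.open("example.jpg")  # placeholder image

# Constrained (JSON-schema) generation over a text+image prompt; this is
# the call that raises the device error.
result = model(
    ["<|image|>Describe this image as JSON.", Image(image)],
    Caption,
    max_new_tokens=128,
)
print(result)
```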
The exception I receive is a device allocation error raised during generation.
This is not just an issue with images; text-only inputs to this model hit the same error. I think this may be a bug in the `TransformersMultiModal` class during the logit-masking step. I have verified that unconstrained inference and the rest of the pipeline work; only inference with constrained generation fails. I would appreciate any pointers.
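Concretely, the contrast between the paths looks like this (a sketch reusing the placeholder objects `model`, `Caption`, and `image` from above, not verified output):

```python
# Unconstrained generation over the same multimodal prompt succeeds:
unconstrained = model(
    ["<|image|>Describe this image.", Image(image)], max_new_tokens=128
)

# The same call with an output type attached (constrained generation) is
# the only path that fails, raising the device allocation error:
constrained = model(
    ["<|image|>Describe this image as JSON.", Image(image)],
    Caption,
    max_new_tokens=128,
)
```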
Edit: I've tried the simplest example in this documentation, and it also returns wonky outputs that are not valid JSON. Has the multimodal constrained generation capability of outlines been validated? Are there any notebooks or scripts that can reproduce consistently clean JSON?