I am using int8 weight-only quantization to quantize my Mamba model with the script below.
```python
import torch
from executorch.backends.xnnpack.partition.xnnpack_partitioner import XnnpackPartitioner
from executorch.exir import to_edge_transform_and_lower
from torchao.quantization import quantize_, Int8WeightOnlyConfig

# Quantize the model weights to int8 (weight-only)
quantize_(mamba, Int8WeightOnlyConfig())

# Export the quantized model and lower it for the XNNPACK backend
exported_ep = torch.export.export(mamba, dummy_inputs)
et_program = to_edge_transform_and_lower(
    exported_ep,
    partitioner=[XnnpackPartitioner()],
).to_executorch()

with open(f"{args.export}_quant.pte", "wb") as file:
    file.write(et_program.buffer)
```

With this script, the exported model size is reduced as expected. However, inference is extremely slow: it takes ~10 minutes to generate a single token. The generated tokens are still meaningful.
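Roughly, the timing looks like the sketch below. This is only a minimal sketch, not my exact inference script: it assumes the `.pte` is run through ExecuTorch's Python pybindings (`_load_for_executorch`, whose import path may differ between ExecuTorch versions), and `mamba_quant.pte` plus the input shape are placeholders.

```python
# Rough sketch of how one token step is timed (placeholders, not the exact
# inference script). Assumes the .pte is run via ExecuTorch's Python
# pybindings; the import path may differ between ExecuTorch versions.
import time

import torch
from executorch.extension.pybindings.portable_lib import _load_for_executorch

module = _load_for_executorch("mamba_quant.pte")  # placeholder file name

# Placeholder input: one forward pass corresponds to generating one token
example_inputs = [torch.zeros(1, 1, dtype=torch.long)]

module.forward(example_inputs)  # warm-up pass
start = time.perf_counter()
module.forward(example_inputs)
print(f"one token step: {time.perf_counter() - start:.2f} s")
```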
Are there any suggestions to improve my inference speed?
The inference results are below:
For FP16:
For Int8:
