Conversation

@scopophobic

FP8 layers were not detected by get_fp_layer_names, so the ignore_layers option was silently ignored for them. This PR:

  • Auto-detects FP8 layers
  • Includes them in not_to_quantized_layers
  • Ensures ignore_layers works for FP8 models

Signed-off-by: Adithyan Madhu [email protected]

logger.trace(f"Auto-detected FP8 layer to ignore : {n}")

if ignore_layers:
    ignore_list = ignore_layers.replace(" ", "").split(",")
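For context, a minimal self-contained sketch of the kind of detection this PR describes; the function name, the `FP8Linear` class-name check, and the merge with `ignore_layers` are assumptions based on the description and the quoted lines, not code copied from the repository:

```python
from typing import List

import torch


def collect_not_to_quantize_layers(model: torch.nn.Module, ignore_layers: str = "") -> List[str]:
    """Merge auto-detected FP8 layers with a user-supplied ignore_layers string (sketch only)."""
    not_to_quantized_layers: List[str] = []
    for n, m in model.named_modules():
        # Class-name heuristic mirroring the approach in this PR; brittle, as discussed below.
        if m.__class__.__name__ == "FP8Linear":
            not_to_quantized_layers.append(n)
    if ignore_layers:
        # Same parsing as the quoted snippet: strip spaces, split on commas.
        not_to_quantized_layers += ignore_layers.replace(" ", "").split(",")
    return not_to_quantized_layers
```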
Contributor

Hi @scopophobic, thanks for your interest in fixing this issue! I think there might be a bit of a misunderstanding.
We don't want to skip all FP8 layers. The idea is that we start with an FP8 model and want to requantize it to another format, such as W4A16. However, we don't want certain layers, such as those inside the attention module, to be quantized to W4A16.
The fix in #1286 is aligned with what we're aiming for.
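For illustration only (the helper name and the module-name patterns below are hypothetical, not this project's API), the intended flow is roughly: start from an FP8 checkpoint, requantize it to W4A16, and keep the attention submodules in their original format:

```python
from typing import List

import torch

# Assumed name patterns for attention submodules in common Hugging Face decoder models.
ATTENTION_KEYWORDS = ("self_attn", "attention")


def attention_layer_names(model: torch.nn.Module) -> List[str]:
    """Return module names to skip when requantizing to W4A16 (sketch only)."""
    return [
        name
        for name, _module in model.named_modules()
        if any(key in name for key in ATTENTION_KEYWORDS)
    ]
```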

@yiliu30 (Contributor), Jan 16, 2026

Hi @scopophobic, would you be interested in working on the remaining part of this issue? #1283 (comment)

@scopophobic (Author)

Hi @yiliu30, thanks a lot for the clarification; that cleared up a misunderstanding I had 👍

I now understand that the goal is not to skip all FP8 layers, but to start from an FP8 model and re-quantize it (e.g., to W4A16), while keeping specific submodules (like attention) from being quantized.

I’m definitely interested in working on the remaining part of #1283. My current thought is to make FP8 detection more robust by moving away from class-name checks (like "FP8Linear") and instead relying on explicit FP8 characteristics (e.g., presence of FP8 scale metadata used during dequantization). This would allow supporting multiple FP8 layer implementations without brittle heuristics.
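A rough sketch of the kind of check I have in mind (the helper name and the scale attribute names are illustrative; different FP8 implementations store their metadata differently):

```python
import torch


def looks_like_fp8_layer(module: torch.nn.Module) -> bool:
    """Detect FP8 layers from explicit signals rather than class names (sketch only)."""
    weight = getattr(module, "weight", None)
    # An FP8 weight dtype is the strongest signal.
    if isinstance(weight, torch.Tensor) and weight.dtype in (torch.float8_e4m3fn, torch.float8_e5m2):
        return True
    # Many FP8 linear implementations keep a dequantization scale alongside the weight.
    return any(hasattr(module, attr) for attr in ("weight_scale", "weight_scale_inv"))
```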

Does this approach sound aligned with what you had in mind for this issue?
