Conversation


Copilot AI commented Jan 16, 2026

Skip unnecessary weight initialization during FP8-to-Linear layer conversion. Weights are immediately overwritten with dequantized values, making initialization wasteful.

Changes

  • Import no_init_weights context manager from transformers.modeling_utils
  • Wrap torch.nn.Linear instantiation in skip_weights_initialize() context
  • Remove redundant initial layer creation

from transformers.modeling_utils import no_init_weights as skip_weights_initialize

def convert_fp8_layer_to_linear(layer, dtype=torch.bfloat16, device: str = "cpu"):
    from auto_round.schemes import QuantizationScheme

    # Build the Linear shell without running the default random initialization;
    # its weights are overwritten with dequantized values immediately below.
    with skip_weights_initialize():
        new_layer = torch.nn.Linear(layer.in_features, layer.out_features, bias=layer.bias is not None, dtype=dtype)
    # ... dequantize the original FP8 weights into dq_weight (elided) ...
    new_layer.weight.data.copy_(dq_weight.to(dtype=dtype))
    return new_layer

Reduces overhead in FP8 model conversion pipelines where this function is called repeatedly per layer.
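
For context, a conversion pass typically walks a model's modules and swaps each FP8 layer for a dense one. The sketch below is a hypothetical illustration, not code from this PR; the is_fp8_linear check and function name are stand-ins for whatever detection the real pipeline uses.

import torch

def convert_model_fp8_to_linear(model: torch.nn.Module, dtype=torch.bfloat16) -> torch.nn.Module:
    def is_fp8_linear(module: torch.nn.Module) -> bool:
        # Stand-in heuristic: real code would check the actual FP8 layer class.
        return hasattr(module, "weight_scale") or hasattr(module, "weight_scale_inv")

    # Replace matching children in place so the parent module keeps its structure.
    for parent in model.modules():
        for child_name, child in list(parent.named_children()):
            if is_fp8_linear(child):
                setattr(parent, child_name, convert_fp8_layer_to_linear(child, dtype=dtype))
    return model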

Original prompt

speedup fp8 linear convert by below method

from transformers.modeling_utils import no_init_weights as skip_weights_initialize

def convert_fp8_layer_to_linear(layer, dtype=torch.bfloat16, device: str = "cpu"):
    """ """
    from auto_round.schemes import QuantizationScheme

    new_layer = torch.nn.Linear(layer.in_features, layer.out_features, bias=layer.bias is not None, dtype=dtype)
    # if "indexer" in getattr(layer, "tmp_name", ""):
    #     return layer  # skip indexer layer
    with skip_weights_initialize():
        new_layer = torch.nn.Linear(layer.in_features, layer.out_features, bias=layer.bias is not None, dtype=dtype)


Copilot AI changed the title from "[WIP] Speed up FP8 linear convert method" to "Optimize FP8 layer conversion by skipping weight initialization" on Jan 16, 2026
Copilot AI requested a review from yiliu30 January 16, 2026 11:12
yiliu30 marked this pull request as ready for review January 19, 2026 03:20
Copilot AI review requested due to automatic review settings January 19, 2026 03:20

Copilot AI left a comment


Pull request overview

This PR optimizes the FP8-to-Linear layer conversion process by eliminating redundant weight initialization. Since weights are immediately overwritten with dequantized values after layer creation, the initial weight initialization is wasteful and adds unnecessary overhead during model conversion.

Changes:

  • Added import of no_init_weights context manager (aliased as skip_weights_initialize) from transformers
  • Wrapped torch.nn.Linear instantiation in skip_weights_initialize() context to bypass initial weight allocation
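
To make the avoided overhead concrete, here is a rough, self-contained timing sketch (not part of the PR; layer size and loop count are arbitrary) comparing Linear construction with and without the no_init_weights context. Note that skipping only takes effect on transformers versions where no_init_weights patches the torch.nn.init functions to no-ops.

import time

import torch
from transformers.modeling_utils import no_init_weights as skip_weights_initialize

def time_linear_creation(skip_init: bool, n: int = 50, features: int = 4096) -> float:
    start = time.perf_counter()
    for _ in range(n):
        if skip_init:
            # On recent transformers versions this turns the init calls inside
            # nn.Linear.reset_parameters into no-ops.
            with skip_weights_initialize():
                torch.nn.Linear(features, features, bias=False, dtype=torch.bfloat16)
        else:
            torch.nn.Linear(features, features, bias=False, dtype=torch.bfloat16)
    return time.perf_counter() - start

print(f"with default init : {time_linear_creation(False):.3f}s")
print(f"skipping init     : {time_linear_creation(True):.3f}s")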

def convert_fp8_layer_to_linear(layer, dtype=torch.bfloat16, device: str = "cpu"):
    """ """

Copilot AI commented Jan 19, 2026


Empty docstring should be removed or replaced with meaningful documentation explaining the function's purpose, parameters, and return value.

Suggested change

Replace the empty docstring:

    """ """

with:

    """
    Convert an FP8-quantized linear-like layer to a standard torch.nn.Linear layer
    in a higher-precision dtype by dequantizing its weights and copying metadata.

    This helper is intended for layers produced by AutoRound quantization, such as
    regular FP8 linear layers or `CompressedLinear` layers with an attached
    compressor. It reconstructs a dense Linear layer with dequantized weights and
    preserves relevant attributes from the original layer (e.g. QuantizationScheme
    fields, temporary names, and scale dtype).

    Args:
        layer: The source FP8-quantized layer instance to convert. It is expected
            to have `in_features`, `out_features`, an optional `bias`, and either
            a `compressor.decompress_module` method (for `CompressedLinear`) or
            FP8 weight/scale attributes (`weight`, `weight_scale` or
            `weight_scale_inv`, and `block_size`).
        dtype: The target floating-point dtype for the new Linear layer weights
            and bias. Defaults to torch.bfloat16.
        device (str): Device on which to place the source layer before
            dequantization (e.g. "cpu", "cuda").

    Returns:
        torch.nn.Linear: A new Linear layer with dequantized weights in the given
            dtype and copied bias and quantization-related attributes.
    """
