Optimize FP8 layer conversion by skipping weight initialization #1295
base: main
Conversation
Co-authored-by: yiliu30 <[email protected]>
Pull request overview
This PR optimizes the FP8-to-Linear layer conversion process by eliminating redundant weight initialization. Since weights are immediately overwritten with dequantized values after layer creation, the initial weight initialization is wasteful and adds unnecessary overhead during model conversion.
Changes:
- Added an import of the `no_init_weights` context manager (aliased as `skip_weights_initialize`) from transformers
- Wrapped the `torch.nn.Linear` instantiation in a `skip_weights_initialize()` context to bypass the initial weight allocation (a sketch of the pattern follows below)
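A minimal sketch of the pattern, assuming the helper's surrounding structure; only the names visible in the PR (the function signature, `no_init_weights`, `skip_weights_initialize`) are taken from the source, the rest is illustrative:

```python
import torch
from transformers.modeling_utils import no_init_weights as skip_weights_initialize


def convert_fp8_layer_to_linear(layer, dtype=torch.bfloat16, device: str = "cpu"):
    # Skip the default random initialization: the weights are overwritten with
    # dequantized values right after construction, so initializing them is wasted work.
    with skip_weights_initialize():
        new_layer = torch.nn.Linear(
            layer.in_features,
            layer.out_features,
            bias=layer.bias is not None,
            dtype=dtype,
        )
    # ... dequantize the FP8 weights and copy them into new_layer (omitted) ...
    return new_layer
```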
```python
def convert_fp8_layer_to_linear(layer, dtype=torch.bfloat16, device: str = "cpu"):
    """ """
```
Copilot AI · Jan 19, 2026
Empty docstring should be removed or replaced with meaningful documentation explaining the function's purpose, parameters, and return value.
| """ """ | |
| """ | |
| Convert an FP8-quantized linear-like layer to a standard torch.nn.Linear layer | |
| in a higher-precision dtype by dequantizing its weights and copying metadata. | |
| This helper is intended for layers produced by AutoRound quantization, such as | |
| regular FP8 linear layers or `CompressedLinear` layers with an attached | |
| compressor. It reconstructs a dense Linear layer with dequantized weights and | |
| preserves relevant attributes from the original layer (e.g. QuantizationScheme | |
| fields, temporary names, and scale dtype). | |
| Args: | |
| layer: The source FP8-quantized layer instance to convert. It is expected | |
| to have `in_features`, `out_features`, an optional `bias`, and either | |
| a `compressor.decompress_module` method (for `CompressedLinear`) or | |
| FP8 weight/scale attributes (`weight`, `weight_scale` or | |
| `weight_scale_inv`, and `block_size`). | |
| dtype: The target floating-point dtype for the new Linear layer weights | |
| and bias. Defaults to torch.bfloat16. | |
| device (str): Device on which to place the source layer before | |
| dequantization (e.g. "cpu", "cuda"). | |
| Returns: | |
| torch.nn.Linear: A new Linear layer with dequantized weights in the given | |
| dtype and copied bias and quantization-related attributes. | |
| """ |
Skip unnecessary weight initialization during FP8-to-Linear layer conversion. Weights are immediately overwritten with dequantized values, making initialization wasteful.
Changes
- Import the `no_init_weights` context manager from `transformers.modeling_utils`, aliased as `skip_weights_initialize`
- Wrap the `torch.nn.Linear` instantiation in the `skip_weights_initialize()` context

Reduces overhead in FP8 model conversion pipelines where this function is called repeatedly per layer (a usage sketch follows below).
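A hedged sketch of such a pipeline, calling the converter once per layer; `is_fp8_layer` is a placeholder predicate and the import path of `convert_fp8_layer_to_linear` is assumed, not taken from the repository:

```python
import torch

# from auto_round.utils import convert_fp8_layer_to_linear  # import path assumed


def is_fp8_layer(module):
    # Placeholder check; a real pipeline would match the project's FP8 layer classes.
    weight = getattr(module, "weight", None)
    return weight is not None and weight.dtype == torch.float8_e4m3fn


def convert_model_fp8_layers(model, dtype=torch.bfloat16, device="cpu"):
    # Walk the module tree and swap every FP8 layer for a dense Linear,
    # invoking the (now cheaper) conversion helper once per layer.
    for name, module in list(model.named_modules()):
        if not is_fp8_layer(module):
            continue
        parent_name, _, child_name = name.rpartition(".")
        parent = model.get_submodule(parent_name) if parent_name else model
        setattr(parent, child_name,
                convert_fp8_layer_to_linear(module, dtype=dtype, device=device))
    return model
```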