Optimize FP8 layer conversion by skipping weight initialization #1295
base: main
Conversation
Co-authored-by: yiliu30 <[email protected]>
Pull request overview
This PR optimizes the FP8-to-Linear layer conversion process by eliminating redundant weight initialization. Since weights are immediately overwritten with dequantized values after layer creation, the initial weight initialization is wasteful and adds unnecessary overhead during model conversion.
Changes:
- Added an import of the `no_init_weights` context manager (aliased as `skip_weights_initialize`) from transformers
- Wrapped the `torch.nn.Linear` instantiation in a `skip_weights_initialize()` context to bypass the initial weight allocation (a sketch of the pattern follows below)
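A minimal sketch of the pattern, assuming the helper's surrounding structure; only the names visible in the PR (the function signature, `no_init_weights`, `skip_weights_initialize`) are taken from the source, the rest is illustrative:

```python
import torch
from transformers.modeling_utils import no_init_weights as skip_weights_initialize


def convert_fp8_layer_to_linear(layer, dtype=torch.bfloat16, device: str = "cpu"):
    # Skip the default random initialization: the weights are overwritten with
    # dequantized values right after construction, so initializing them is wasted work.
    with skip_weights_initialize():
        new_layer = torch.nn.Linear(
            layer.in_features,
            layer.out_features,
            bias=layer.bias is not None,
            dtype=dtype,
        )
    # ... dequantize the FP8 weights and copy them into new_layer (omitted) ...
    return new_layer
```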
```python
def convert_fp8_layer_to_linear(layer, dtype=torch.bfloat16, device: str = "cpu"):
    """ """
```
Copilot AI · Jan 19, 2026
Empty docstring should be removed or replaced with meaningful documentation explaining the function's purpose, parameters, and return value.
| """ """ | |
| """ | |
| Convert an FP8-quantized linear-like layer to a standard torch.nn.Linear layer | |
| in a higher-precision dtype by dequantizing its weights and copying metadata. | |
| This helper is intended for layers produced by AutoRound quantization, such as | |
| regular FP8 linear layers or `CompressedLinear` layers with an attached | |
| compressor. It reconstructs a dense Linear layer with dequantized weights and | |
| preserves relevant attributes from the original layer (e.g. QuantizationScheme | |
| fields, temporary names, and scale dtype). | |
| Args: | |
| layer: The source FP8-quantized layer instance to convert. It is expected | |
| to have `in_features`, `out_features`, an optional `bias`, and either | |
| a `compressor.decompress_module` method (for `CompressedLinear`) or | |
| FP8 weight/scale attributes (`weight`, `weight_scale` or | |
| `weight_scale_inv`, and `block_size`). | |
| dtype: The target floating-point dtype for the new Linear layer weights | |
| and bias. Defaults to torch.bfloat16. | |
| device (str): Device on which to place the source layer before | |
| dequantization (e.g. "cpu", "cuda"). | |
| Returns: | |
| torch.nn.Linear: A new Linear layer with dequantized weights in the given | |
| dtype and copied bias and quantization-related attributes. | |
| """ |
Skip unnecessary weight initialization during FP8-to-Linear layer conversion. Weights are immediately overwritten with dequantized values, making initialization wasteful.
Changes
- Import the `no_init_weights` context manager from `transformers.modeling_utils`, aliased as `skip_weights_initialize`
- Wrap the `torch.nn.Linear` instantiation in the `skip_weights_initialize()` context

Reduces overhead in FP8 model conversion pipelines where this function is called repeatedly per layer (a usage sketch follows below).
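A hedged sketch of such a pipeline, calling the converter once per layer; `is_fp8_layer` is a placeholder predicate and the import path of `convert_fp8_layer_to_linear` is assumed, not taken from the repository:

```python
import torch

# from auto_round.utils import convert_fp8_layer_to_linear  # import path assumed


def is_fp8_layer(module):
    # Placeholder check; a real pipeline would match the project's FP8 layer classes.
    weight = getattr(module, "weight", None)
    return weight is not None and weight.dtype == torch.float8_e4m3fn


def convert_model_fp8_layers(model, dtype=torch.bfloat16, device="cpu"):
    # Walk the module tree and swap every FP8 layer for a dense Linear,
    # invoking the (now cheaper) conversion helper once per layer.
    for name, module in list(model.named_modules()):
        if not is_fp8_layer(module):
            continue
        parent_name, _, child_name = name.rpartition(".")
        parent = model.get_submodule(parent_name) if parent_name else model
        setattr(parent, child_name,
                convert_fp8_layer_to_linear(module, dtype=dtype, device=device))
    return model
```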