Reject non-finite fp16 loss_scale across config and ZeRO paths#7856

Draft
harshang03 wants to merge 1 commit into deepspeedai:master from harshang03:fix/issue-7852-loss-scale-validation
Conversation

@harshang03

Describe your changes

  • Added centralized loss-scale validators for finite/positive numeric constraints used by runtime configuration and optimizer paths.
  • Enforced fp16 config validation for loss_scale, loss_scale_window, hysteresis, and min_loss_scale so invalid values fail during config parsing.
  • Hardened LossScaleConfig and CreateLossScaler to reject invalid static/dynamic loss-scale arguments even if values are injected outside config parsing.
  • Added validation to ZeRO stage 1/2 and stage 3 loss-scale override/setter methods to block non-finite/non-positive runtime overrides.
  • Added targeted unit tests for config parsing, loss-scaler validation, and ZeRO override/setter guard behavior.
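The centralized validator described in the first bullet could look roughly like the following. This is a hedged sketch, not the actual implementation: the helper name `validate_loss_scale` and its signature are hypothetical, and note that in DeepSpeed's fp16 config a `loss_scale` of 0 conventionally selects dynamic loss scaling, so the config-parsing path would presumably special-case that value before applying a finite/positive check like this one.

```python
import math

def validate_loss_scale(value, name="loss_scale"):
    """Hypothetical sketch of a centralized finite/positive check.

    Rejects non-numeric values (including bools, which are an int
    subclass in Python), non-finite values (inf/nan), and
    non-positive values.
    """
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        raise ValueError(f"{name} must be a number, got {type(value).__name__}")
    if not math.isfinite(value) or value <= 0:
        raise ValueError(f"{name} must be finite and positive, got {value}")
    return value
```

A shared helper like this lets the config parser, `CreateLossScaler`, and the ZeRO setters raise the same error message from one place instead of duplicating the check.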

Screenshot or video (only for visual changes)

  • N/A

GitHub Issue Link (if applicable)

Testing Plan

  • Explanation of why no additional tests are needed:
    • The change is fully covered by focused unit tests at the config, scaler, and ZeRO runtime guard layers.
  • Unit Tests (JS and/or Python):
    • python -m pytest tests/unit/runtime/test_ds_config_model.py tests/unit/runtime/half_precision/test_loss_scale_validation.py
  • E2E Tests:
    • Not run (validation and guard logic change).
  • Any manual testing needed?:
    • No.

Contribution License Agreement
By submitting this pull request you agree that all contributions to this project are made under the Apache 2.0 license.

Reject non-finite or non-positive loss-scale values during fp16 config parsing, loss-scaler construction, and ZeRO override/setter flows to prevent inf-driven initialization that later produces NaN gradients.
