[training_utils] fix: fixed various bugs with tracking #5135
+112
−15
What does this PR do?
Fixes several bugs in metrics tracking for WandB, MLflow, and Trackio.
Checklist Before Starting
[training_utils] refactor: adapter pattern for tracking backends with bug fixes
Test
Manual Testing:
These changes affect integration with external tracking services (MLflow, WandB, Trackio) which require actual backend connections and cannot be easily mocked in CI.
Testing performed:
MLFLOW_RUN_ID environment variable
All backends initialize correctly and log metrics without errors. Run resumption works as expected for MLflow and WandB.
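For reviewers who want to reason about the resumption behavior without a live backend, the lookup order can be sketched in plain Python. This is a minimal sketch, not the code in this PR: `deterministic_run_id`, `resolve_mlflow_run_id`, and the `find_by_name` callback are hypothetical names, and hashing the project/experiment pair is only one assumed way to obtain a deterministic ID.

```python
import hashlib
import os
from typing import Callable, Mapping, Optional


def deterministic_run_id(project_name: str, experiment_name: str) -> str:
    """Derive a stable run ID so a restarted job maps to the same run.

    Hashing the project/experiment pair is an assumption of this sketch;
    the PR does not state how its deterministic IDs are built.
    """
    key = f"{project_name}/{experiment_name}".encode("utf-8")
    return hashlib.sha1(key).hexdigest()[:16]


def resolve_mlflow_run_id(
    run_name: str,
    find_by_name: Callable[[str], Optional[str]],
    env: Mapping[str, str] = os.environ,
) -> Optional[str]:
    """Return the run ID to resume, or None to start a fresh run.

    Order assumed here: an explicit MLFLOW_RUN_ID wins, then a search by
    run name (find_by_name is a hypothetical stand-in for that search).
    """
    run_id = env.get("MLFLOW_RUN_ID")
    if run_id:
        return run_id
    return find_by_name(run_name)
```

With a real backend, the first helper's result would typically be passed as `wandb.init(id=..., resume="allow")`, and a non-None result from the second as `mlflow.start_run(run_id=...)`. Injecting `env` and `find_by_name` keeps the ordering testable without network access.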
API and Usage Example
No API changes. All changes are internal refactoring. External usage remains the same:
Internal improvements:
Design & Code Changes
High-level Design:
Introduced adapter pattern to encapsulate backend-specific logic:
- _WandbLoggingAdapter: Handles WandB initialization with deterministic run IDs
  - resume="allow" for robust run resumption
  - training/global_step)
- _TrackioLoggingAdapter: Handles Trackio with numpy type conversion
- _MlflowLoggingAdapter: Handles MLflow run resumption
  - get_run_id_by_name() for finding existing runs
  - MLFLOW_RUN_ID env var and searches by run name
Specific Changes:
- Tracking.__init__() to use adapter classes instead of direct backend initialization
- _MlflowLoggingAdapter.get_run_id_by_name() static method
- print(f"[MLflow] Resuming run with ID: {run_id}")
Files Modified:
- verl/utils/tracking.py (~107 lines added, ~15 lines removed)
Checklist Before Submitting
ci-request Slack channel - Will do after PR creation
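As a reading aid for the adapter design described under "Design & Code Changes", the shape of the pattern can be sketched as follows. This is an illustrative sketch under stated assumptions, not the contents of verl/utils/tracking.py: `LoggingAdapter`, `InMemoryAdapter`, `Tracker`, and `to_builtin` are hypothetical names, and the conversion relies on the fact that numpy scalars and arrays both expose `tolist()`.

```python
from typing import Any, Dict, List, Optional, Protocol, Tuple


class LoggingAdapter(Protocol):
    """Interface every backend adapter is assumed to satisfy."""

    def log(self, data: Dict[str, Any], step: int) -> None: ...


def to_builtin(value: Any) -> Any:
    """Convert numpy-like values to plain Python types.

    Duck-typed on tolist() so the sketch runs without numpy installed.
    """
    tolist = getattr(value, "tolist", None)
    return tolist() if callable(tolist) else value


class InMemoryAdapter:
    """Hypothetical stand-in backend that records what it is asked to log."""

    def __init__(self) -> None:
        self.records: List[Tuple[int, Dict[str, Any]]] = []

    def log(self, data: Dict[str, Any], step: int) -> None:
        # Backend-specific handling (here: type conversion) stays in the adapter.
        clean = {key: to_builtin(val) for key, val in data.items()}
        self.records.append((step, clean))


class Tracker:
    """Fans each log call out to the selected adapters."""

    def __init__(self, adapters: Dict[str, LoggingAdapter]) -> None:
        self.adapters = adapters

    def log(self, data: Dict[str, Any], step: int,
            backend: Optional[List[str]] = None) -> None:
        for name, adapter in self.adapters.items():
            if backend is None or name in backend:
                adapter.log(data=data, step=step)
```

The point of the design is that the dispatcher stays free of backend-specific branches, so quirks such as type conversion or run resumption can be changed in one adapter without touching the others.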