[training_utils] fix: fixed various bugs with tracking #5135

guillemgt · 2026-01-30T13:12:51Z

What does this PR do?

Fixes several bugs in metrics tracking for the WandB, MLflow, and Trackio backends.

Checklist Before Starting

Search for similar PRs. Query: https://github.com/volcengine/verl/pulls?q=is%3Apr+tracking+adapter
Format the PR title as [training_utils] refactor: adapter pattern for tracking backends with bug fixes

Test

Manual Testing:
These changes affect integration with external tracking services (MLflow, WandB, Trackio) which require actual backend connections and cannot be easily mocked in CI.

Testing performed:

Tested MLflow run resumption with MLFLOW_RUN_ID environment variable
Tested WandB deterministic run ID generation and resumption
Confirmed Trackio adapter properly handles numpy types

All backends initialize correctly and log metrics without errors. Run resumption works as expected for MLflow and WandB.
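
To illustrate the numpy handling confirmed for Trackio, here is a minimal sketch of recursive sanitization. It is not the adapter's actual code: the name `sanitize` is hypothetical, and the sketch duck-types on `tolist()` (which numpy scalars and arrays both provide) so that it runs without numpy installed. `FakeScalar` is only a stand-in for a numpy scalar.

```python
from collections.abc import Mapping


def sanitize(value):
    """Recursively convert numpy-like values into plain Python objects."""
    # numpy scalars and arrays both expose tolist(), which returns
    # built-in Python scalars or (nested) lists of them.
    tolist = getattr(value, "tolist", None)
    if callable(tolist):
        return sanitize(tolist())
    if isinstance(value, Mapping):
        return {key: sanitize(item) for key, item in value.items()}
    if isinstance(value, (list, tuple, set)):
        return [sanitize(item) for item in value]
    return value


class FakeScalar:
    """Stand-in for a numpy scalar so the sketch runs without numpy."""

    def __init__(self, inner):
        self.inner = inner

    def tolist(self):
        return self.inner


metrics = {
    "loss": FakeScalar(0.5),
    "lengths": [FakeScalar(3), 4],
    "nested": {"acc": FakeScalar(1)},
}
clean = sanitize(metrics)
```

With real numpy values, the same function would accept `np.float32(0.5)` or `np.array([1, 2])` in place of `FakeScalar`, since both implement `tolist()`.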

API and Usage Example

No API changes. All changes are internal refactoring. External usage remains the same:

from verl.utils.tracking import Tracking

# Usage unchanged - initialization works as before
tracker = Tracking(
    project_name="my_project",
    experiment_name="my_experiment",
    default_backend=["wandb", "mlflow"],
    config=config
)

# Logging unchanged
tracker.log({"loss": 0.5}, step=100)

Internal improvements:

Run resumption is more robust (checks env vars and searches by name)

Design & Code Changes

High-level Design:

Introduced adapter pattern to encapsulate backend-specific logic:

_WandbLoggingAdapter: Handles WandB initialization with deterministic run IDs
- Previously, restarting a run raised an error saying that a given step had already been logged, which crashed training.
- Generates 16-char MD5 hash from project/experiment name for consistent run IDs
- Uses resume="allow" for robust run resumption
- Defines custom step metric (training/global_step)
_TrackioLoggingAdapter: Handles Trackio with numpy type conversion
- Recursively sanitizes data structures to convert numpy types
- Supports ValidationGenerationsLogger integration
_MlflowLoggingAdapter: Handles MLflow run resumption
- Implements get_run_id_by_name() for finding existing runs
- Checks both MLFLOW_RUN_ID env var and searches by run name
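
As a rough illustration of the deterministic WandB run ID described above, the sketch below derives a 16-character ID from an MD5 hash. The function name `deterministic_run_id` and the way the two names are joined (with a `/`) are assumptions made for this example, not necessarily what `_WandbLoggingAdapter` does.

```python
import hashlib


def deterministic_run_id(project_name: str, experiment_name: str, length: int = 16) -> str:
    """Derive a stable run ID from the project and experiment names."""
    # Assumption: the two names are joined with "/" before hashing.
    key = f"{project_name}/{experiment_name}".encode("utf-8")
    # MD5 serves only as a cheap, stable fingerprint here, not for security.
    return hashlib.md5(key).hexdigest()[:length]


run_id = deterministic_run_id("my_project", "my_experiment")
```

Because the ID depends only on the two names, a restarted job computes the same value, which can then be passed as `id=` to `wandb.init` together with `resume="allow"` so that the existing run is continued instead of a new one being created.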

Specific Changes:

Refactored Tracking.__init__() to use adapter classes instead of direct backend initialization
Added _MlflowLoggingAdapter.get_run_id_by_name() static method
Enhanced MLflow resumption: checks the MLFLOW_RUN_ID env var first, then falls back to searching by run name
Added logging when resuming runs: print(f"[MLflow] Resuming run with ID: {run_id}")
Updated ValidationGenerationsLogger to support trackio backend
Standardized method call style across adapters
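
The MLflow resolution order listed above (env var first, name search as fallback) can be sketched as follows. This is an illustration under assumptions, not the PR's code: `resolve_run_id` is a hypothetical helper, and `find_by_name` is an injected callable standing in for `get_run_id_by_name()` so that the example runs without an MLflow server.

```python
import os
from typing import Callable, Mapping, Optional


def resolve_run_id(
    run_name: str,
    find_by_name: Callable[[str], Optional[str]],
    env: Mapping[str, str] = os.environ,
) -> Optional[str]:
    """Return the ID of a run to resume, or None to start a fresh run."""
    # 1) An explicit MLFLOW_RUN_ID takes priority.
    run_id = env.get("MLFLOW_RUN_ID")
    if run_id:
        return run_id
    # 2) Otherwise fall back to a lookup by run name.
    return find_by_name(run_name)


# Hypothetical in-memory stand-in for the tracking server lookup.
known_runs = {"my_experiment": "abc123"}

from_env = resolve_run_id("my_experiment", known_runs.get, env={"MLFLOW_RUN_ID": "env456"})
from_name = resolve_run_id("my_experiment", known_runs.get, env={})
fresh = resolve_run_id("new_experiment", known_runs.get, env={})
```

Here `from_env` takes the value from the environment, `from_name` comes from the name lookup, and `fresh` is `None`, which signals that no existing run was found and a new one should be started.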

Files Modified:

verl/utils/tracking.py (~107 lines added, ~15 lines removed)

Checklist Before Submitting

Read the Contribute Guide
Apply pre-commit checks
Add / Update the documentation - Not applicable (internal refactoring, no user-facing changes)
Add unit or end-to-end test(s) to the CI workflow - Not feasible: requires external tracking service backends (MLflow server, WandB account, Trackio setup). Manual testing performed with actual backends.
Request CI in the ci-request Slack channel - Will do after PR creation

guillemgt added 2 commits January 30, 2026 13:58

[tracking] fix: several fixes for tracking

b97b122

[trainer] fix: fix earlier _MlflowLoggingAdapter bug

e5f7ba0

guillemgt changed the title ~~[training_utils] refactor: adapter pattern for tracking backends with bug fixes~~ [training_utils] fix: fixed various bugs with tracking Jan 30, 2026
