@guillemgt guillemgt commented Jan 30, 2026

What does this PR do?

Fixes several bugs in metrics tracking for the W&B, MLflow, and Trackio backends.

Checklist Before Starting

Test

Manual Testing:
These changes affect integration with external tracking services (MLflow, WandB, Trackio) which require actual backend connections and cannot be easily mocked in CI.

Testing performed:

  • Tested MLflow run resumption with MLFLOW_RUN_ID environment variable
  • Tested WandB deterministic run ID generation and resumption
  • Confirmed Trackio adapter properly handles numpy types

All backends initialize correctly and log metrics without errors. Run resumption works as expected for MLflow and WandB.
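
The numpy handling confirmed for the Trackio adapter can be illustrated with a rough sketch. This is not the code from the PR: the function name `sanitize` is hypothetical, and the sketch relies on duck typing (both numpy scalars and numpy arrays expose `tolist()`) instead of importing numpy, which is an assumption made to keep the example dependency-free.

```python
from typing import Any


def sanitize(value: Any) -> Any:
    """Recursively convert numpy-like values into plain Python types.

    Hypothetical helper for illustration; the adapter in the PR may differ.
    """
    if isinstance(value, dict):
        # Keep the keys, sanitize every value.
        return {key: sanitize(item) for key, item in value.items()}
    if isinstance(value, (list, tuple)):
        # Tuples are returned as lists so the result is JSON-friendly.
        return [sanitize(item) for item in value]
    if hasattr(value, "tolist"):
        # Assumption: anything with tolist() behaves like np.generic or
        # np.ndarray, whose tolist() returns Python scalars or nested lists.
        return sanitize(value.tolist())
    return value
```

With numpy installed, `sanitize({"loss": np.float32(0.5)})` would yield a dict holding a built-in `float`, which generic serializers can handle.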

API and Usage Example

No API changes. All changes are internal refactoring. External usage remains the same:

from verl.utils.tracking import Tracking

# Usage unchanged - initialization works as before
tracker = Tracking(
    project_name="my_project",
    experiment_name="my_experiment",
    default_backend=["wandb", "mlflow"],
    config=config
)

# Logging unchanged
tracker.log({"loss": 0.5}, step=100)

Internal improvements:

  • Run resumption is more robust (checks env vars and searches by name)

Design & Code Changes

High-level Design:

Introduced adapter pattern to encapsulate backend-specific logic:

  1. _WandbLoggingAdapter: Handles WandB initialization with deterministic run IDs

    • Previously, restarting a run raised an error saying that a given step had already been logged, which crashed training.
    • Generates a 16-character run ID from an MD5 hash of the project and experiment names, so the same experiment always maps to the same run
    • Uses resume="allow" for robust run resumption
    • Defines custom step metric (training/global_step)
  2. _TrackioLoggingAdapter: Handles Trackio with numpy type conversion

    • Recursively sanitizes data structures to convert numpy types
    • Supports ValidationGenerationsLogger integration
  3. _MlflowLoggingAdapter: Handles MLflow run resumption

    • Implements get_run_id_by_name() for finding existing runs
    • Checks both MLFLOW_RUN_ID env var and searches by run name
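
The deterministic W&B run ID described in item 1 could look roughly like the sketch below. The helper names `deterministic_run_id` and `wandb_init_kwargs`, the `project/experiment` key format, and the choice to truncate the hex digest to its first 16 characters are all assumptions for illustration; the PR's actual implementation may differ.

```python
import hashlib
from typing import Any, Dict, Optional


def deterministic_run_id(project_name: str, experiment_name: str, length: int = 16) -> str:
    """Derive a stable run ID from the project and experiment names."""
    # Assumed key format; any stable combination of the two names would work.
    key = f"{project_name}/{experiment_name}".encode("utf-8")
    # An MD5 hex digest has 32 characters; keep the first `length` of them.
    return hashlib.md5(key).hexdigest()[:length]


def wandb_init_kwargs(
    project_name: str,
    experiment_name: str,
    config: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """Build keyword arguments intended for wandb.init (hypothetical helper)."""
    return {
        "project": project_name,
        "name": experiment_name,
        "id": deterministic_run_id(project_name, experiment_name),
        # "allow" resumes the run if the ID exists and creates it otherwise.
        "resume": "allow",
        "config": config,
    }
```

Because the ID depends only on the two names, a restarted job would pass the same `id` to `wandb.init(**kwargs)` and, together with `resume="allow"`, continue the existing run instead of starting a new one.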

Specific Changes:

  • Refactored Tracking.__init__() to use adapter classes instead of direct backend initialization
  • Added _MlflowLoggingAdapter.get_run_id_by_name() static method
  • Enhanced MLflow resumption: check the MLFLOW_RUN_ID env var first, then fall back to searching by run name
  • Added logging when resuming runs: print(f"[MLflow] Resuming run with ID: {run_id}")
  • Updated ValidationGenerationsLogger to support trackio backend
  • Standardized method call style across adapters
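
The MLflow resumption order listed above (environment variable first, name search as a fallback) can be sketched as follows. The function `resolve_run_id` and its `search_by_name` parameter are hypothetical; in the PR the lookup is done by `_MlflowLoggingAdapter.get_run_id_by_name()`, whose body is not shown here, so the sketch takes the lookup as an injected callable instead of guessing at it.

```python
import os
from typing import Callable, Mapping, Optional


def resolve_run_id(
    run_name: str,
    search_by_name: Callable[[str], Optional[str]],
    env: Optional[Mapping[str, str]] = None,
) -> Optional[str]:
    """Return the ID of the run to resume, or None to start a fresh run."""
    env = os.environ if env is None else env

    # 1. An explicit MLFLOW_RUN_ID always wins.
    run_id = env.get("MLFLOW_RUN_ID")

    # 2. Otherwise fall back to looking up an existing run by its name.
    if not run_id:
        run_id = search_by_name(run_name)

    if run_id:
        # Same message format as the logging line added in the PR.
        print(f"[MLflow] Resuming run with ID: {run_id}")
    return run_id or None
```

A caller would then start the run with the returned ID when it is not `None`, and create a new run otherwise.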

Files Modified:

  • verl/utils/tracking.py (~107 lines added, ~15 lines removed)

Checklist Before Submitting

@guillemgt guillemgt changed the title [training_utils] refactor: adapter pattern for tracking backends with bug fixes [training_utils] fix: fixed various bugs with tracking Jan 30, 2026