Skip to content

Conversation

PytLab
Copy link
Collaborator

@PytLab PytLab commented May 5, 2025

Important

The Update branch button must only be pressed in very rare occassions.
An outdated branch is never blocking the merge of a PR.
Please reach out to the automation team before pressing that button.

What does this PR do ?

Add global callback group & metadata factory function for NeMo

Collection: [Note which collection this PR will affect]
Will keep updating

  • asr
  • multimodal
  • nlp

Changelog

  • New CallbackGroup and Callback ABC in NeMo/nemo/lightning/pytorch/callbacks/callback_group.py

Usage

  • You can potentially add a usage example below
# Add a code snippet demonstrating how to use this 

GitHub Actions CI

The Jenkins CI system has been replaced by GitHub Actions self-hosted runners.

The GitHub Actions CI will run automatically when the "Run CICD" label is added to the PR.
To re-run CI remove and add the label again.
To run CI on an untrusted fork, a NeMo user with write access must first click "Approve and run".

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

  • New Feature
  • Bugfix
  • Documentation

If you haven't finished some of the above items you can still open "Draft" PR.

Who can review?

Anyone in the community is free to review the PR once the checks have passed.
Contributor guidelines contains specific people who can review PRs to various areas.

Additional Information

  • Related to # (issue)

@PytLab PytLab self-assigned this May 5, 2025
@github-actions github-actions bot added core Changes to NeMo Core ASR NLP Multi Modal labels May 6, 2025
@PytLab PytLab requested a review from dimapihtar May 7, 2025 08:25
@github-actions github-actions bot removed the ASR label May 12, 2025
Copy link
Contributor

[🤖]: Hi @PytLab 👋,

We wanted to let you know that a CICD pipeline for this PR just finished successfully.

So it might be time to merge this PR or get some approvals.

//cc @chtruong814 @ko3n1g @pablo-garay @thomasdhc

@github-actions github-actions bot removed the Run CICD label Aug 29, 2025
@liquor233 liquor233 force-pushed the zshao/add_callback_group branch from 5c64172 to 2f09433 Compare August 29, 2025 03:51
TrainerContext.from_trainer(trainer).io_dump(
ckpt_to_dir(self.last_model_path) / "context", yaml_attrs=["model"]
)
try:
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@liquor233 is code change is not one-logger relevant. why are we changing original NeMo code in this PR?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The CICD get failed due to the checkpointing issue -- I have double checked and this is not related to OneLogger code change, this fix is needed if we want to pass the CICD pipeline.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need NeMo owner to help resolve this

@PytLab PytLab changed the title Add CallbackGroup & Metadata factory function OneLogger Integration Sep 2, 2025
# Minimal configuration - required fields only
init_config = {
# Required fields (from OneLoggerConfig) - no defaults
"application_name": "nemo-application",
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@liquor233 let's just use nemo ?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ack -- changed.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@liquor233 also need to update test case

world_size = int(os.environ.get('WORLD_SIZE', 1))
max_steps = getattr(trainer, 'max_steps', 1)
# Use hardcoded value for log_every_n_steps instead of getting from trainer
log_every_n_steps = getattr(trainer, 'log_every_n_steps', 10)
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need review this freq with NeMo owner to avoid possible high freq data posting

@liquor233
Copy link
Collaborator

Discussed with @PytLab , need to add the callbackgroup abstract class.

liquor233 and others added 4 commits September 2, 2025 23:49
#14628)

* chore(callbacks): restore generic CallbackGroup and route telemetry via group\n\n- Add BaseCallback and CallbackGroup with update_config and class init hook\n- Register OneLoggerAdapterCallback into group; merge config update into class\n- Replace direct OneLogger API usages with CallbackGroup across code\n- Ensure trainer attaches registered callbacks via group.update_config\n- Add nv-one-logger>=2.0.0 to base requirements\n\nSigned-off-by: Jiashang Hu <[email protected]>

Signed-off-by: Jiashang Hu <[email protected]>

* Apply isort and black reformatting

Signed-off-by: liquor233 <[email protected]>

* chore: renaming.

* chore: revert the change to install nv-one-logger

* chore: fix the linting issue

Signed-off-by: Jiashang Hu <[email protected]>

* Apply isort and black reformatting

Signed-off-by: liquor233 <[email protected]>

---------

Signed-off-by: Jiashang Hu <[email protected]>
Signed-off-by: liquor233 <[email protected]>
Co-authored-by: liquor233 <[email protected]>
Copy link

@stef1927 stef1927 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've left a couple of comments after a quick review as I've noticed that the code is not yet stable, i.e. the callback group is back now. Let me know once the code is stable, reviewed by the Nemo team, and ready for the Heimdall signoff.

Here are the Heimdall requirements that may be missing at the moment:

  • The exporters need to be configurable, for example with a command line argument or environment variable that the users can set.
  • Exporters may be different according to the rank number, for example rank 0 may log to file whereas all ranks send data to an open telemetry endpoint.
  • Exceptions need to be captured, and recorded as error events.
  • The end functions should be called even in the presence of exceptions.
  • We need to capture distributed initialization and model forward too.

Not an Heimdall requirement, but for you to consider:

  • Generally, if it's possible, you should try to avoid modifying all model and data loader implementations, and hook only in one place. You might need the help of the Nemo team to determine if this is possible for things like model initialization, data loading and checkpoints.

@@ -135,6 +136,9 @@ def train(

trainer.fit(model, data)

# Track app end for NeMo v2 recipe-based applications
CallbackGroup.get_instance().on_app_end()
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where is the matching on_app_start() called? Why not consider a function decorator to call both?

resume.setup(trainer, model)
CallbackGroup.get_instance().on_load_checkpoint_end()
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider introducing a context manager for cases similar to this, so that the end function is always called, and we can log any exception as an error event. For Heimdall error events are required.

one_logger_config = OneLoggerConfig(**init_config)
TrainingTelemetryProvider.instance().with_base_config(
one_logger_config
).with_export_config().configure_provider()
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How can users select different exporters?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Changes to NeMo Core
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants