OneLogger Integration #13437

PytLab · 2025-05-05T16:24:46Z

Important

The Update branch button must only be pressed on very rare occasions.
An outdated branch is never blocking the merge of a PR.
Please reach out to the automation team before pressing that button.

What does this PR do ?

Add global callback group & metadata factory function for NeMo

Collection: [Note which collection this PR will affect]
Will keep updating

asr
multimodal
nlp

Changelog

New CallbackGroup and Callback ABC in NeMo/nemo/lightning/pytorch/callbacks/callback_group.py

Usage

You can potentially add a usage example below

# Add a code snippet demonstrating how to use this
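A minimal sketch of the callback-group pattern named in the changelog, not the code from this PR. `CallbackGroup.get_instance()`, `on_app_end()` and `on_load_checkpoint_end()` appear in the diff hunks quoted in the review below. The `register` helper, `RecordingCallback`, and the `__getattr__`-based dispatch are assumptions made for illustration only.

```python
from abc import ABC


class Callback(ABC):
    """Base class with no-op lifecycle hooks; subclasses override what they need."""

    def on_app_start(self) -> None:
        pass

    def on_app_end(self) -> None:
        pass


class CallbackGroup:
    """Singleton that fans each hook call out to every registered callback."""

    _instance = None

    def __init__(self) -> None:
        self._callbacks = []

    @classmethod
    def get_instance(cls) -> "CallbackGroup":
        if cls._instance is None:
            cls._instance = cls()
        return cls._instance

    def register(self, callback: Callback) -> None:
        # Hypothetical helper: the PR may attach callbacks differently.
        self._callbacks.append(callback)

    def __getattr__(self, name: str):
        # Reached only for attributes not found normally, i.e. hook names.
        if not name.startswith("on_"):
            raise AttributeError(name)

        def dispatch(*args, **kwargs):
            for cb in self._callbacks:
                # Callbacks that do not define the hook are skipped silently.
                getattr(cb, name, lambda *a, **k: None)(*args, **kwargs)

        return dispatch


class RecordingCallback(Callback):
    """Illustrative callback that records which hooks fired."""

    def __init__(self) -> None:
        self.events = []

    def on_app_start(self) -> None:
        self.events.append("app_start")

    def on_app_end(self) -> None:
        self.events.append("app_end")


recorder = RecordingCallback()
group = CallbackGroup.get_instance()
group.register(recorder)
group.on_app_start()
group.on_app_end()
```

With this shape, call sites depend only on the group, so a telemetry backend can be added or removed without touching them.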

GitHub Actions CI

The Jenkins CI system has been replaced by GitHub Actions self-hosted runners.

The GitHub Actions CI will run automatically when the "Run CICD" label is added to the PR.
To re-run CI remove and add the label again.
To run CI on an untrusted fork, a NeMo user with write access must first click "Approve and run".

Before your PR is "Ready for review"

Pre checks:

Make sure you read and followed Contributor guidelines
Did you write any new necessary tests?
Did you add or update any necessary documentation?
Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
- Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

New Feature
Bugfix
Documentation

If you haven't finished some of the above items you can still open "Draft" PR.

Who can review?

Anyone in the community is free to review the PR once the checks have passed.
Contributor guidelines contains specific people who can review PRs to various areas.

Additional Information

Related to # (issue)

github-actions · 2025-08-28T15:00:58Z

[🤖]: Hi @PytLab 👋,

We wanted to let you know that a CICD pipeline for this PR just finished successfully.

So it might be time to merge this PR or get some approvals.

//cc @chtruong814 @ko3n1g @pablo-garay @thomasdhc

nemo/collections/llm/fn/mixin.py

nemo/utils/exp_manager.py

nemo/lightning/one_logger_callback.py

PytLab · 2025-09-01T15:04:09Z

nemo/lightning/pytorch/callbacks/model_checkpoint.py

-                TrainerContext.from_trainer(trainer).io_dump(
-                    ckpt_to_dir(self.last_model_path) / "context", yaml_attrs=["model"]
-                )
+                try:


@liquor233 this code change is not OneLogger-relevant. Why are we changing original NeMo code in this PR?

The CICD failed due to a checkpointing issue. I have double-checked and this is not related to the OneLogger code change; the fix is needed if we want to pass the CICD pipeline.

A NeMo owner needs to help resolve this.

nemo/lightning/one_logger_callback.py

nemo/lightning/pytorch/callbacks/model_checkpoint.py

nemo/utils/meta_info_manager.py

PytLab · 2025-09-02T01:58:04Z

nemo/utils/meta_info_manager.py

+    # Minimal configuration - required fields only
+    init_config = {
+        # Required fields (from OneLoggerConfig) - no defaults
+        "application_name": "nemo-application",


@liquor233 let's just use nemo ?

Ack -- changed.

@liquor233 the test case also needs to be updated.

nemo/lightning/one_logger_callback.py

nemo/utils/meta_info_manager.py

PytLab · 2025-09-02T04:13:34Z

nemo/utils/meta_info_manager.py

+    world_size = int(os.environ.get('WORLD_SIZE', 1))
+    max_steps = getattr(trainer, 'max_steps', 1)
+    # Use hardcoded value for log_every_n_steps instead of getting from trainer
+    log_every_n_steps = getattr(trainer, 'log_every_n_steps', 10)


We need to review this frequency with the NeMo owner to avoid posting data too frequently.

nemo/utils/meta_info_manager.py

liquor233 · 2025-09-02T23:49:17Z

Discussed with @PytLab; we need to add the CallbackGroup abstract class.

Signed-off-by: liquor233 <[email protected]>

#14628)

* chore(callbacks): restore generic CallbackGroup and route telemetry via group

  - Add BaseCallback and CallbackGroup with update_config and class init hook
  - Register OneLoggerAdapterCallback into group; merge config update into class
  - Replace direct OneLogger API usages with CallbackGroup across code
  - Ensure trainer attaches registered callbacks via group.update_config
  - Add nv-one-logger>=2.0.0 to base requirements

  Signed-off-by: Jiashang Hu <[email protected]>
  Signed-off-by: Jiashang Hu <[email protected]>
* Apply isort and black reformatting
  Signed-off-by: liquor233 <[email protected]>
* chore: renaming.
* chore: revert the change to install nv-one-logger
* chore: fix the linting issue
  Signed-off-by: Jiashang Hu <[email protected]>
* Apply isort and black reformatting
  Signed-off-by: liquor233 <[email protected]>

---------

Signed-off-by: Jiashang Hu <[email protected]>
Signed-off-by: liquor233 <[email protected]>
Co-authored-by: liquor233 <[email protected]>

stef1927

I've left a couple of comments after a quick review, as I've noticed that the code is not yet stable, i.e. the callback group is back now. Let me know once the code is stable, reviewed by the NeMo team, and ready for the Heimdall signoff.

Here are the Heimdall requirements that may be missing at the moment:

The exporters need to be configurable, for example with a command line argument or environment variable that the users can set.
Exporters may be different according to the rank number, for example rank 0 may log to file whereas all ranks send data to an open telemetry endpoint.
Exceptions need to be captured, and recorded as error events.
The end functions should be called even in the presence of exceptions.
We need to capture distributed initialization and model forward too.
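The first two requirements above (user-configurable exporters, with a different set per rank) could be sketched as follows. This is an illustration under stated assumptions and does not use the nv-one-logger API: the environment variable `NEMO_TELEMETRY_EXPORTERS`, the exporter names `file` and `otel`, and the function `select_exporters` are all hypothetical.

```python
import os


def select_exporters(rank: int, env=None) -> list:
    """Return the exporter names this rank should use.

    Reads a comma-separated list from a hypothetical environment
    variable and drops the file exporter on every rank except 0.
    """
    env = os.environ if env is None else env
    raw = env.get("NEMO_TELEMETRY_EXPORTERS", "otel")  # hypothetical variable and default
    requested = [name.strip() for name in raw.split(",") if name.strip()]

    selected = []
    for name in requested:
        if name == "file" and rank != 0:
            continue  # only rank 0 logs to file
        selected.append(name)
    return selected


# Example: rank 0 gets both exporters, every other rank only the remote one.
config = {"NEMO_TELEMETRY_EXPORTERS": "file, otel"}
rank0_exporters = select_exporters(0, config)
rank3_exporters = select_exporters(3, config)
```

Passing the environment as an argument keeps the selection logic testable without mutating `os.environ`.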

Not a Heimdall requirement, but for you to consider:

Generally, if possible, you should try to avoid modifying all model and data loader implementations, and hook in only one place. You might need the help of the NeMo team to determine if this is possible for things like model initialization, data loading and checkpoints.
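One way to hook in a single place is the `__init_subclass__` approach mentioned in a commit title on this PR ("use __init_subclass__ to cover all ModelPT subclasses"). The sketch below is an assumption about how such a hook could look, not the PR's code: `BaseModel`, `TinyModel` and the `EVENTS` list are hypothetical stand-ins for `ModelPT`, a concrete model, and a telemetry sink.

```python
import functools

EVENTS = []  # stand-in for a telemetry sink


class BaseModel:
    """Hypothetical base class; wraps each subclass __init__ exactly once."""

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        # Wrap only an __init__ defined directly on this subclass.
        if "__init__" not in cls.__dict__:
            return
        original_init = cls.__dict__["__init__"]

        @functools.wraps(original_init)
        def wrapped_init(self, *args, **kw):
            EVENTS.append(f"{cls.__name__}:init_start")
            try:
                original_init(self, *args, **kw)
            finally:
                # Runs even if the constructor raises.
                EVENTS.append(f"{cls.__name__}:init_end")

        cls.__init__ = wrapped_init


class TinyModel(BaseModel):
    def __init__(self, hidden: int):
        self.hidden = hidden


model = TinyModel(hidden=8)
```

A caveat of this sketch: when a subclass constructor calls `super().__init__()` on another wrapped class, the start and end events nest, so a real implementation would need to decide whether to record only the outermost pair.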

stef1927 · 2025-09-03T13:24:14Z

nemo/collections/llm/api.py

@@ -135,6 +136,9 @@ def train(

    trainer.fit(model, data)

+    # Track app end for NeMo v2 recipe-based applications
+    CallbackGroup.get_instance().on_app_end()


Where is the matching on_app_start() called? Why not consider a function decorator to call both?
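A sketch of the decorator idea, assuming the group exposes `on_app_start()` alongside the `on_app_end()` shown in the hunk. `_RecordingGroup`, `track_app` and the toy `train` function are hypothetical; in the PR the group would come from `CallbackGroup.get_instance()`.

```python
import functools


class _RecordingGroup:
    """Stand-in for the callback group; records which hooks were called."""

    def __init__(self) -> None:
        self.calls = []

    def on_app_start(self) -> None:
        self.calls.append("on_app_start")

    def on_app_end(self) -> None:
        self.calls.append("on_app_end")


GROUP = _RecordingGroup()


def track_app(fn):
    """Call on_app_start before fn and on_app_end after it, even on failure."""

    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        GROUP.on_app_start()
        try:
            return fn(*args, **kwargs)
        finally:
            GROUP.on_app_end()

    return wrapper


@track_app
def train(steps: int) -> int:
    # Toy body standing in for the real training entry point.
    if steps < 0:
        raise ValueError("steps must be non-negative")
    return steps


result = train(3)
```

Because the end hook sits in a `finally` block, start and end calls stay paired whether `train` returns or raises.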

stef1927 · 2025-09-03T13:25:50Z

nemo/collections/llm/api.py

        resume.setup(trainer, model)
+        CallbackGroup.get_instance().on_load_checkpoint_end()


Consider introducing a context manager for cases like this, so that the end function is always called and any exception can be logged as an error event. For Heimdall, error events are required.

stef1927 · 2025-09-03T13:39:06Z

nemo/lightning/one_logger_callback.py

+        one_logger_config = OneLoggerConfig(**init_config)
+        TrainingTelemetryProvider.instance().with_base_config(
+            one_logger_config
+        ).with_export_config().configure_provider()


How can users select different exporters?

nemo/lightning/one_logger_callback.py

PytLab and others added 2 commits May 5, 2025 23:58

feat: add callback group definition & callback ABC

642e360

Apply isort and black reformatting

1badf29

Signed-off-by: PytLab <[email protected]>

PytLab self-assigned this May 5, 2025

feat: insert callback functions of CallbackGroup

3bf3367

github-actions bot added core Changes to NeMo Core ASR NLP Multi Modal labels May 6, 2025

Apply isort and black reformatting

2b51e12

Signed-off-by: PytLab <[email protected]>

github-advanced-security bot found potential problems May 6, 2025

nemo/core/classes/common.py Fixed

nemo/lightning/pytorch/callbacks/callback_group.py Fixed

chore: PR test for jiashang

249dad3

PytLab requested a review from dimapihtar May 7, 2025 08:25

feat: use __init_subclass__ to cover all ModelPT subclasses

db2b15d

github-actions bot removed the ASR label May 12, 2025

Apply isort and black reformatting

d921d64

Signed-off-by: PytLab <[email protected]>

github-advanced-security bot found potential problems May 12, 2025

nemo/core/classes/modelPT.py Fixed

Saju Prasad and others added 2 commits May 11, 2025 22:48

feat: Adding metadata config manager poc

3e32f1a

Apply isort and black reformatting

e1074f6

Signed-off-by: sajup-oss <[email protected]>

github-advanced-security bot found potential problems May 12, 2025

nemo/utils/meta_info_manager.py Fixed

PytLab added the Run CICD label May 12, 2025

github-actions bot removed the Run CICD label May 12, 2025

liquor233 and others added 4 commits May 13, 2025 15:32

feat: revert test changes.

d79f4f1

Signed-off-by: liquor233 <[email protected]>

fix: Updating metadata attributes

263f7e9

fix: Merging changes

81cd1d9

Apply isort and black reformatting

4852936

Signed-off-by: sajup-oss <[email protected]>

github-advanced-security bot found potential problems May 21, 2025

nemo/utils/meta_info_manager.py Fixed

sajup-oss and others added 4 commits May 22, 2025 05:55

fix: Adding OneloggerCallback

48d6d87

fix: Reverting changes in examples/multimodal/speech_llm/modular_audi…

2ba6cc5

…o_gpt_train.py

fix: Merge branch 'zshao/add_callback_group' of github.com:NVIDIA/NeM…

c908b53

…o into zshao/add_callback_group

Apply isort and black reformatting

bd39d8f

Signed-off-by: sajup-oss <[email protected]>

ko3n1g temporarily deployed to test August 28, 2025 10:35 — with GitHub Actions Inactive

github-actions bot removed the Run CICD label Aug 28, 2025

liquor233 added the Run CICD label Aug 29, 2025

github-actions bot removed the Run CICD label Aug 29, 2025

liquor233 force-pushed the zshao/add_callback_group branch from 5c64172 to 2f09433 Compare August 29, 2025 03:51

PytLab commented Aug 29, 2025

nemo/collections/llm/fn/mixin.py Outdated

PytLab commented Aug 29, 2025

nemo/utils/exp_manager.py Outdated

liquor233 added 3 commits September 1, 2025 12:42

chore: renaming onelogger

88fb787

chore: fix some exception.

d8156dd

Merge branch 'main' into zshao/add_callback_group

a449cc6

PytLab commented Sep 1, 2025

nemo/lightning/one_logger_callback.py Outdated

nemo/lightning/one_logger_callback.py Outdated

chore: renaming.

951e143

liquor233 added the Run CICD label Sep 1, 2025

github-actions bot removed the Run CICD label Sep 1, 2025

PytLab commented Sep 1, 2025

nemo/lightning/one_logger_callback.py Outdated

PytLab commented Sep 1, 2025

nemo/lightning/one_logger_callback.py Outdated

nemo/lightning/one_logger_callback.py Outdated

PytLab commented Sep 1, 2025

PytLab changed the title ~~Add CallbackGroup & Metadata factory function~~ OneLogger Integration Sep 2, 2025

chore: resolve some comments.

3eea3b2

PytLab commented Sep 2, 2025

liquor233 added 2 commits September 2, 2025 17:14

chore: remove duplicate init.

129615e

chore: resolve some github comments.

d7085fd

liquor233 and others added 4 commits September 2, 2025 23:49

Apply isort and black reformatting

09d8347

Signed-off-by: liquor233 <[email protected]>

chore: fix the linting issue.

a9fc88b

Merge branch 'main' into zshao/add_callback_group

4dc1c91

stef1927 reviewed Sep 3, 2025

PytLab commented Sep 4, 2025

nemo/lightning/one_logger_callback.py
