-
Notifications
You must be signed in to change notification settings - Fork 3.1k
OneLogger Integration #13437
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
OneLogger Integration #13437
Conversation
Signed-off-by: PytLab <[email protected]>
Signed-off-by: PytLab <[email protected]>
Signed-off-by: sajup-oss <[email protected]>
Signed-off-by: liquor233 <[email protected]>
Signed-off-by: sajup-oss <[email protected]>
…o into zshao/add_callback_group
Signed-off-by: sajup-oss <[email protected]>
[🤖]: Hi @PytLab 👋, We wanted to let you know that a CICD pipeline for this PR just finished successfully. So it might be time to merge this PR or get some approvals. |
5c64172
to
2f09433
Compare
TrainerContext.from_trainer(trainer).io_dump( | ||
ckpt_to_dir(self.last_model_path) / "context", yaml_attrs=["model"] | ||
) | ||
try: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@liquor233 is code change is not one-logger relevant. why are we changing original NeMo code in this PR?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The CICD get failed due to the checkpointing issue -- I have double checked and this is not related to OneLogger code change, this fix is needed if we want to pass the CICD pipeline.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Need NeMo owner to help resolve this
nemo/utils/meta_info_manager.py
Outdated
# Minimal configuration - required fields only | ||
init_config = { | ||
# Required fields (from OneLoggerConfig) - no defaults | ||
"application_name": "nemo-application", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@liquor233 let's just use nemo
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ack -- changed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@liquor233 also need to update test case
world_size = int(os.environ.get('WORLD_SIZE', 1)) | ||
max_steps = getattr(trainer, 'max_steps', 1) | ||
# Use hardcoded value for log_every_n_steps instead of getting from trainer | ||
log_every_n_steps = getattr(trainer, 'log_every_n_steps', 10) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Need review this freq with NeMo owner to avoid possible high freq data posting
Discussed with @PytLab , need to add the callbackgroup abstract class. |
Signed-off-by: liquor233 <[email protected]>
#14628) * chore(callbacks): restore generic CallbackGroup and route telemetry via group\n\n- Add BaseCallback and CallbackGroup with update_config and class init hook\n- Register OneLoggerAdapterCallback into group; merge config update into class\n- Replace direct OneLogger API usages with CallbackGroup across code\n- Ensure trainer attaches registered callbacks via group.update_config\n- Add nv-one-logger>=2.0.0 to base requirements\n\nSigned-off-by: Jiashang Hu <[email protected]> Signed-off-by: Jiashang Hu <[email protected]> * Apply isort and black reformatting Signed-off-by: liquor233 <[email protected]> * chore: renaming. * chore: revert the change to install nv-one-logger * chore: fix the linting issue Signed-off-by: Jiashang Hu <[email protected]> * Apply isort and black reformatting Signed-off-by: liquor233 <[email protected]> --------- Signed-off-by: Jiashang Hu <[email protected]> Signed-off-by: liquor233 <[email protected]> Co-authored-by: liquor233 <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've left a couple of comments after a quick review as I've noticed that the code is not yet stable, i.e. the callback group is back now. Let me know once the code is stable, reviewed by the Nemo team, and ready for the Heimdall signoff.
Here are the Heimdall requirements that may be missing at the moment:
- The exporters need to be configurable, for example with a command line argument or environment variable that the users can set.
- Exporters may be different according to the rank number, for example rank 0 may log to file whereas all ranks send data to an open telemetry endpoint.
- Exceptions need to be captured, and recorded as error events.
- The end functions should be called even in the presence of exceptions.
- We need to capture distributed initialization and model forward too.
Not an Heimdall requirement, but for you to consider:
- Generally, if it's possible, you should try to avoid modifying all model and data loader implementations, and hook only in one place. You might need the help of the Nemo team to determine if this is possible for things like model initialization, data loading and checkpoints.
@@ -135,6 +136,9 @@ def train( | |||
|
|||
trainer.fit(model, data) | |||
|
|||
# Track app end for NeMo v2 recipe-based applications | |||
CallbackGroup.get_instance().on_app_end() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Where is the matching on_app_start()
called? Why not consider a function decorator to call both?
resume.setup(trainer, model) | ||
CallbackGroup.get_instance().on_load_checkpoint_end() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider introducing a context manager for cases similar to this, so that the end function is always called, and we can log any exception as an error event. For Heimdall error events are required.
one_logger_config = OneLoggerConfig(**init_config) | ||
TrainingTelemetryProvider.instance().with_base_config( | ||
one_logger_config | ||
).with_export_config().configure_provider() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How can users select different exporters?
Important
The
Update branch
button must only be pressed in very rare occassions.An outdated branch is never blocking the merge of a PR.
Please reach out to the automation team before pressing that button.
What does this PR do ?
Add global callback group & metadata factory function for NeMo
Collection: [Note which collection this PR will affect]
Will keep updating
Changelog
CallbackGroup
andCallback
ABC in NeMo/nemo/lightning/pytorch/callbacks/callback_group.pyUsage
# Add a code snippet demonstrating how to use this
GitHub Actions CI
The Jenkins CI system has been replaced by GitHub Actions self-hosted runners.
The GitHub Actions CI will run automatically when the "Run CICD" label is added to the PR.
To re-run CI remove and add the label again.
To run CI on an untrusted fork, a NeMo user with write access must first click "Approve and run".
Before your PR is "Ready for review"
Pre checks:
PR Type:
If you haven't finished some of the above items you can still open "Draft" PR.
Who can review?
Anyone in the community is free to review the PR once the checks have passed.
Contributor guidelines contains specific people who can review PRs to various areas.
Additional Information