Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

respect train/eval mode in traced network #1217

Open
wants to merge 20 commits into
base: main
Choose a base branch
from

Conversation

sebffischer
Copy link
Collaborator

@sebffischer sebffischer commented Dec 10, 2024

Further issues:

  • Reading a script function is currently loaded as a module, I don't think this should be the case
  • Probably we should use a separate CompilationUnit per compiled module

@sebffischer sebffischer mentioned this pull request Dec 10, 2024
3 tasks
@dfalbel dfalbel added the lantern Use this label if your PR affects lantern so it's built in the CI label Dec 11, 2024
@sebffischer
Copy link
Collaborator Author

Note that this maybe also fixes https://github.com/mlverse/torch/pull/633/files but I need to check again

@@ -27,7 +27,7 @@ jobs:
config:

- {os: macOS, r_version: release, version: cpu-intel, runner: macos-13}
- {os: macOS, r_version: release, version: cpu-m1, runner: macos-13}
- {os: macOS, r_version: release, version: cpu-m1, runner: macos-latest}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, I was looking at why we keep failing on some tests and it seems that github runners, even though running on arm64 images cannot run MPS, as this API can't be acccessed by VM's under macOS, so it requires real self-hosted machines.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks, fixed!

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think something might be broken with the custom macos runner, at least it's taking forever to start the job.

@sebffischer
Copy link
Collaborator Author

@dfalbel I think something is broken with the macOS runner. I don't think that this PR should behave differently on different operating systems, however.

@dfalbel
Copy link
Member

dfalbel commented Jan 6, 2025

The runner was missing a m1 label that I just added. It should run on m1 now for this PR

@sebffischer
Copy link
Collaborator Author

The runner was missing a m1 label that I just added. It should run on m1 now for this PR

Thanks, but I still think there is something off with the M1 runner: https://github.com/mlverse/torch/actions/runs/12633265519/job/35213843863

Copy link
Member

@dfalbel dfalbel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sebffischer Looks good! I added some comments, let me know what you think!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we really need this file?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This file already existed before and was used in test-script_module.R, I merely updated it

should_mangle = TRUE,
manage_memory = FALSE
)
mod$eval()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we should add an argument to jit_trace that would keep the old behavior. My two concerns are:

  1. Tracing runs the network twice, which could be problematic for some users.
  2. Duplicates the size of the graph, which might be undesidered. Maybe a user wants to just trace the forward method in eval mode to export for deployment.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, makes sense. But then I think calling $train() and $eval() should maybe result in an error, what do you think?

And should the default be to respect the train/eval-mode or not?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually i think there should maybe be no default to force the user to specify this

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I changed my mind again, I think the default TRUE is fine, but I am happy to change it as well.
In which cases do you think running the network twice is problematic?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a respect_mode argument that triggers the double/single tracing

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
lantern Use this label if your PR affects lantern so it's built in the CI
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants