Initial remote hermetic cuda toolchain #72
base: main
Conversation
…he local_cuda repository_rule and adding an attribute
@@ -2,5 +2,5 @@
cc_binary(
    name = "main",
    srcs = ["cublas.cpp"],
    deps = ["@local_cuda//:cublas"],
I haven't come up with a good way of avoiding knowing the resolved toolchain repo name here to find cublas.
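For illustration, one possible workaround (a sketch, not part of this PR; the stable wrapper repo and alias pattern are assumptions) would be to have the repository rule generate an alias under a fixed repo name, so user BUILD files never spell the resolved toolchain repo:

```starlark
# Hypothetical generated BUILD file in a stable wrapper repo, e.g. @cuda.
# The resolved toolchain repo name is written once, here, by the
# repository rule that generates this file; user code then depends only
# on the stable @cuda//:cublas label.
alias(
    name = "cublas",
    actual = "@cuda_cublas-linux-x86_64//:cublas",  # assumed resolved repo name
    visibility = ["//visibility:public"],
)
```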
@local_cuda//:cuda_runtime is also missing, so the basic CI in #66 failed. That should be resolved before this one.
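For reference, a minimal sketch of what the missing target might look like in the generated BUILD file (the paths assume the conventional CUDA toolkit layout and are illustrative, not taken from this PR):

```starlark
# Sketch of a @local_cuda//:cuda_runtime target; library and header
# locations are assumptions based on the usual toolkit layout.
cc_library(
    name = "cuda_runtime",
    srcs = glob(["cuda/lib64/libcudart.so*"]),
    hdrs = glob(["cuda/include/**/*.h"]),
    includes = ["cuda/include"],
    visibility = ["//visibility:public"],
)
```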
What's the status here? This would be a very welcome feature for more than one project I'm contributing to! So far the solution is a custom archive with the CUDA SDK and some local patch to …
The code in this PR works (although it has bit rotted a bit - the branch I'm actually using is remote_toolchain in my fork of the repo), but it breaks support for the local setup use case. I don't really have the time at the moment to make both work in a single repo, so some help on getting this working would be appreciated; there are likely some bits that can be broken out into separate PRs and landed independently to get us there in smaller steps, as this is a rather large change otherwise.
I think this should be split into multiple steps.
Ah yes, I remember now - NVIDIA/cccl#622 was the issue I raised so that I could effectively get CCCL into a bzlmod support repository, as a fully hermetic toolchain will require downloading these separately.
local_cuda is a name inherited from the tf_runtime impl; it should have been called local_cuda_toolkit, so that every component stated in the doc has a position (maybe overridable). The last step might not be as trivial as it seems. For example, in CI you might want to override all those links with your own server. If we fetch the json instead of checking it in directly, we don't need the download URLs to be stable. And if we let the user provide the json url, we don't even require that URL itself to be stable. (We need the json schema to be stable, though...)
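A sketch of what that user-overridable json source could look like (the rule and attribute names are hypothetical, not from this PR):

```starlark
# Hypothetical repository rule fragment: fetch the redistrib json at
# repository-rule time instead of checking it in, with the URL
# overridable so CI can point it at a mirror.
def _cuda_toolkit_impl(ctx):
    ctx.download(
        url = ctx.attr.redistrib_json_url,
        output = "redistrib.json",
    )
    data = json.decode(ctx.read("redistrib.json"))
    # ... use `data` to resolve and download each component archive ...

cuda_toolkit = repository_rule(
    implementation = _cuda_toolkit_impl,
    attrs = {
        "redistrib_json_url": attr.string(
            doc = "URL of the CUDA redistrib json; override to use a mirror.",
        ),
    },
)
```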
Unfortunately the json schema hasn't proven to be stable - it changed in the 12 series of releases. It's only the addition of some extra keys, but it was enough to break the logic I had in here.
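One way to be robust to that (a sketch; the field names match the published redistrib json as far as I can tell, but treat them as assumptions) is to read only the keys that are needed and skip anything unrecognized:

```starlark
# Hypothetical parsing helper: pick out per-component archive info for
# one platform and ignore unknown top-level keys (e.g. metadata fields
# added in later 12.x releases).
def _parse_redistrib(json_text, platform = "linux-x86_64"):
    data = json.decode(json_text)
    components = {}
    for name, entry in data.items():
        # Top-level metadata like "release_date" is a plain string;
        # only dict entries describe downloadable components.
        if type(entry) != "dict":
            continue
        info = entry.get(platform)
        if info == None:
            continue
        components[name] = struct(
            relative_path = info["relative_path"],
            sha256 = info["sha256"],
        )
    return components
```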
Thanks for the update, @jsharpe and @cloudhan, much appreciated! I will look at the mentioned branch in @jsharpe's fork. I don't care about supporting anything locally installed too much myself, but since that has been the only option for …
cuda_toolchain(
    name = "nvcc-local",
    compiler_executable = "external/cuda_nvcc-linux-x86_64/bin/nvcc",
)
This bit only works with spawn_strategy=local, right? Is there a way to make it work with sandboxing?
Nope, it works with sandboxing - in fact I use this with RBE.
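For context, a sketch of why the execroot-relative path can work sandboxed: the nvcc repository's files have to travel as declared inputs of the toolchain so they get staged into each action's execroot. The compiler_files attribute and filegroup label below are assumptions for illustration, not necessarily what this PR uses:

```starlark
# Hypothetical wiring: declaring the nvcc files as toolchain inputs so
# sandboxed/remote actions see external/cuda_nvcc-linux-x86_64/bin/nvcc
# inside their execroot.
cuda_toolchain(
    name = "nvcc-local",
    compiler_executable = "external/cuda_nvcc-linux-x86_64/bin/nvcc",
    compiler_files = "@cuda_nvcc-linux-x86_64//:compiler_files",  # assumed attribute and label
)
```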
Potentially superseded by PRs in #283
Depends on #66
The BUILD files generated for each of the downloaded repos are a bit hacky at the moment, and BUILD.remote_cuda is probably not fully updated (I only updated the bits I needed). CUDA 12 requires bringing in libcu++ as an external dep, otherwise nv/target is missing.
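For illustration, pulling in libcu++ could look something like this (a sketch; the repo name, version, and build file layout are assumptions, and libcu++ now lives in NVIDIA/cccl):

```starlark
# Hypothetical WORKSPACE snippet: fetch libcu++ so <nv/target> is
# available when building against the hermetic CUDA 12 toolchain.
load("@bazel_tools//tools/build_defs/repo:http.bzl", "http_archive")

http_archive(
    name = "libcudacxx",
    # Version is illustrative; pin a real tag and add a sha256 in practice.
    urls = ["https://github.com/NVIDIA/cccl/archive/refs/tags/v2.2.0.tar.gz"],
    strip_prefix = "cccl-2.2.0/libcudacxx",
    build_file_content = """
cc_library(
    name = "libcudacxx",
    hdrs = glob(["include/**"]),
    includes = ["include"],
    visibility = ["//visibility:public"],
)
""",
)
```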
I've not checked this for reproducibility, but anecdotally in the debugger I've seen source paths that are RBE worker dependent, so I suspect there are some reproducibility issues with the current setup.
The other thing likely missing is runfiles for runtime dependencies from the remote toolchain.