Initial remote hermetic cuda toolchain #72

Draft: wants to merge 10 commits into main
Conversation

@jsharpe (Member) commented Mar 20, 2023

Depends on #66

The BUILD files generated for each of the downloaded repos are a bit hacky at the moment, and BUILD.remote_cuda is probably not fully updated (I only updated the bits I needed).

CUDA 12 requires bringing in libcu++ as an external dependency; otherwise nv/target is missing.
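
A minimal sketch of what pulling in libcu++ could look like; the repo name, version, URL, and BUILD content below are illustrative assumptions, not what this PR actually does:

load("@bazel_tools//tools/build_defs/repo:http.bzl", "http_archive")

# Hypothetical fetch of NVIDIA's CCCL monorepo, which ships libcu++ (and with it
# the nv/target headers). Version and layout are assumptions.
http_archive(
    name = "libcudacxx",
    urls = ["https://github.com/NVIDIA/cccl/archive/refs/tags/v2.2.0.tar.gz"],
    strip_prefix = "cccl-2.2.0",
    build_file_content = """
cc_library(
    name = "libcudacxx",
    hdrs = glob(["libcudacxx/include/**"]),
    includes = ["libcudacxx/include"],
    visibility = ["//visibility:public"],
)
""",
)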

I've not checked this for reproducibility, but anecdotally I've seen source paths in the debugger that are RBE-worker dependent, so I suspect there are some reproducibility issues with the current setup.

The other thing likely missing is runfiles for runtime dependencies from the remote toolchain.

@@ -2,5 +2,5 @@
cc_binary(
    name = "main",
    srcs = ["cublas.cpp"],
    deps = ["@local_cuda//:cublas"],
jsharpe (Member Author) commented on the diff:

I haven't come up with a good way of avoiding knowing the resolved toolchain repo name here to find cublas.
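
One possible (untested) way to hide the resolved repo name would be an alias layer inside rules_cuda itself, so user BUILD files can depend on a stable label; the package, config_setting, and repo names below are all assumptions:

# Hypothetical BUILD file in rules_cuda providing a stable label for cublas
# that forwards to whichever toolchain repo was actually instantiated.
alias(
    name = "cublas",
    actual = select({
        ":remote_toolchain_enabled": "@remote_cuda//:cublas",  # assumed repo name
        "//conditions:default": "@local_cuda//:cublas",
    }),
    visibility = ["//visibility:public"],
)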

A Collaborator replied:

@local_cuda//:cuda_runtime is also missing, so the basic CI in #66 failed; that should be resolved before this one.

@ahans commented Jun 28, 2024

What's the status here? This would be a very welcome feature for more than one project I'm contributing to! So far the solution has been a custom archive with the CUDA SDK and a local patch to rules_cuda that downloads it first and then sets it up as if it were local. Having this work out of the box with rules_cuda, and also only downloading what is actually used, would of course be much nicer!
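
Roughly, that workaround amounts to something like the following (the repo name and URL are placeholders; the "set it up as if it were local" step is the part that needs a patch to rules_cuda and is not shown):

# Hypothetical fetch of a pre-packaged CUDA SDK archive, which a patched
# rules_cuda then treats as a local installation.
http_archive(
    name = "cuda_sdk",
    urls = ["https://internal.example.com/cuda-sdk-12.1.tar.gz"],  # placeholder
    strip_prefix = "cuda-12.1",
    # sha256 omitted in this sketch
)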

@jsharpe (Member Author) commented Jun 28, 2024

> What's the status here? This would be a very welcome feature for more than one project I'm contributing to! […]

The code in this PR works (although it has bitrotted a bit; the branch I'm actually using is remote_toolchain in my fork of the repo), but it breaks support for the local setup use case. I don't really have the time at the moment to make both work in a single repo, so some help getting this working would be appreciated; there are likely bits that can be broken out into separate PRs and landed independently to get us there in smaller steps, as this is otherwise a rather large change.

@cloudhan (Collaborator) commented:

I think this should be split into multiple steps:

  1. support composing multiple components (say, local_cccl, local_cublas, local_thrust, local_cub) into a unified local_cuda (see the sketch after this list)
    • this might allow reusing pip and conda installs
  2. support instantiating those local_* repos from local tarballs
  3. support downloading the tarballs
  4. support parsing the json at https://developer.download.nvidia.cn/compute/cuda/redist/
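
As a rough illustration of step 1 (every name here is hypothetical), the unified local_cuda repo could be little more than an alias layer over independently instantiated component repos:

# Hypothetical generated BUILD file for a unified @local_cuda that simply
# forwards each target to its per-component repo.
[
    alias(
        name = component,
        actual = "@local_%s//:%s" % (component, component),
        visibility = ["//visibility:public"],
    )
    for component in [
        "cccl",
        "cublas",
        "thrust",
        "cub",
    ]
]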

@jsharpe (Member Author) commented Jun 28, 2024

Ah yes, I remember now: NVIDIA/cccl#622 was the issue I raised so that I could effectively get CCCL into a bzlmod-compatible repository, since a fully hermetic toolchain will require downloading these pieces separately.
IMO the cccl / thrust / cub targets shouldn't be inside local_cuda; they're just libraries, and the fact that they can be provided by a local_cuda repo is incidental. Note that thrust in particular can be used independently of a CUDA install: it works just as well in an OpenMP context on the host.
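
For example, a host-only Thrust build might look roughly like this; the @thrust repo name is an assumption, but the THRUST_*_SYSTEM macros are Thrust's documented backend switches:

# Hypothetical host-only use of Thrust with the OpenMP device backend;
# no CUDA toolkit is involved at all.
cc_binary(
    name = "omp_sort",
    srcs = ["omp_sort.cpp"],
    copts = [
        "-fopenmp",
        "-DTHRUST_DEVICE_SYSTEM=THRUST_DEVICE_SYSTEM_OMP",
        "-DTHRUST_HOST_SYSTEM=THRUST_HOST_SYSTEM_CPP",
    ],
    linkopts = ["-fopenmp"],
    deps = ["@thrust"],
)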

@cloudhan (Collaborator) commented Jun 28, 2024

local_cuda is a name inherited from the tf_runtime implementation; it should have been called local_cuda_toolkit, so that every component stated in the doc would have a place (maybe overridable).

The last step might not be as trivial as it seems. For example, in CI you might want to override all of those links to point at your own server. If we fetch the json instead of checking it in directly, we don't need the URLs to be stable. And if we let the user provide the json URL, we don't even require that URL itself to be stable. (We do need the json schema to be stable, though...)
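
A minimal sketch of that idea, assuming a repository rule with a user-overridable URL attribute (the rule name, attribute name, and default URL are made up for illustration):

# Hypothetical repository rule: fetch the redistrib json from a URL the user
# can override (e.g. to point at a mirror in CI), then parse it to drive fetches.
def _cuda_redist_repo_impl(rctx):
    rctx.download(url = rctx.attr.redist_json_url, output = "redist.json")
    redist = json.decode(rctx.read("redist.json"))

    # ... use `redist` to generate per-component fetches and BUILD files ...
    rctx.file("BUILD.bazel", "")  # placeholder

cuda_redist_repo = repository_rule(
    implementation = _cuda_redist_repo_impl,
    attrs = {
        "redist_json_url": attr.string(
            default = "https://developer.download.nvidia.cn/compute/cuda/redist/redistrib_12.0.0.json",
        ),
    },
)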

@jsharpe (Member Author) commented Jun 28, 2024

Unfortunately the json schema hasn't proven to be stable: it changed within the 12.x series of releases. It was only the addition of some extra keys, but that was enough to break the logic I had in here.
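
One defensive option (a sketch, not what this PR does) is to read only the keys you need and ignore anything newer schema revisions add:

# Hypothetical tolerant parsing: pick out the known per-platform entries for a
# component and silently skip keys introduced by later schema revisions.
KNOWN_PLATFORMS = ["linux-x86_64", "linux-sbsa", "windows-x86_64"]

def select_archives(component):
    return {
        platform: component[platform]
        for platform in KNOWN_PLATFORMS
        if platform in component
    }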

@ahans commented Jun 28, 2024

Thanks for the update, @jsharpe and @cloudhan, much appreciated! I will look at the mentioned branch in @jsharpe's fork. I don't care too much about supporting anything locally installed myself, but since that has been the only option for rules_cuda, I understand that it shouldn't be taken away. I will see if I can help with anything, but no promises.


cuda_toolchain(
    name = "nvcc-local",
    compiler_executable = "external/cuda_nvcc-linux-x86_64/bin/nvcc",

A reviewer asked:

This bit only works with spawn_strategy=local, right? Is there a way to make it work with sandboxing?

jsharpe (Member Author) replied:

Nope, it works with sandboxing - in fact I use this with RBE.
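
What makes that work is that every file nvcc needs is declared as an input to the toolchain, so Bazel stages them into the sandbox (or uploads them to the remote worker), where they appear under external/cuda_nvcc-linux-x86_64/. Schematically, in the generated repo's BUILD file (the glob patterns are assumptions):

# Hypothetical filegroup declaring the hermetic nvcc files as toolchain inputs,
# which is what lets sandboxed and RBE actions see them.
filegroup(
    name = "compiler_files",
    srcs = glob([
        "bin/**",
        "nvvm/**",
        "include/**",
    ]),
    visibility = ["//visibility:public"],
)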

@cloudhan (Collaborator) commented:

Potentially superseded by PRs in #283
