[FEA]: Introduce Python module with CCCL headers #3201
Conversation
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.
python/cuda_cccl/setup.py
Outdated
```python
project_path = os.path.abspath(os.path.dirname(__file__))
cccl_path = os.path.abspath(os.path.join(project_path, "..", ".."))
cccl_headers = [["cub", "cub"], ["libcudacxx", "include"], ["thrust", "thrust"]]
ver = "0.1.2.8.0"
```
I think we need to use the CCCL version here, not CCCL Python modules' version. We should also not hard-code it, but instead read from CMakeLists which is the source of truth AFAIK, and for that setuptools might not be doing the job. @vyasr might have a simple example for how this can be done with scikit-build-core.
Ack. I added this as a bullet to the PR description.
Check out the dynamic metadata section, specifically the Regex tab.
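For reference, a minimal sketch of what the Regex provider looks like. The `input` path and pattern below are hypothetical stand-ins, not taken from this repo's CMakeLists:

```toml
# Hypothetical pyproject.toml fragment: derive the package version from a
# file in the repo (e.g. CMakeLists.txt) instead of hard-coding it.
[build-system]
requires = ["scikit-build-core"]
build-backend = "scikit_build_core.build"

[project]
name = "cuda-cccl"  # illustrative
dynamic = ["version"]

[tool.scikit-build.metadata.version]
provider = "scikit_build_core.metadata.regex"
input = "CMakeLists.txt"                              # assumed source of truth
regex = 'set\(CCCL_VERSION "(?P<value>[0-9.]+)"\)'    # assumed pattern
```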
You would need to rewrite everything here to use CMake instead of setuptools. Depending on what this module is trying to do that may or may not be beneficial. Do you need to run compilation of cuda_cccl/cooperative/parallel against CCCL headers? In that case it is almost certainly worthwhile, I wouldn't want to orchestrate that compilation using setuptools.
> Do you need to run compilation of cuda_cccl/cooperative/parallel against CCCL headers?

- `cuda_cccl` would just be `nvidia-cuda-cccl-cuXX` containing the headers but owned/maintained by the CCCL team for faster release cycles (think of it as `cccl` vs `cuda-cccl` on conda-forge)
- `cuda_cooperative` JIT compiles CCCL headers at run time, so for all purposes the headers can be thought of as shared libraries; no AOT compilation is needed
- `cuda_parallel` is the most interesting case, because it does need to build the CCCL C shared library and include it in the wheel, but I dunno if building it requires NVCC + CCCL headers, or GCC/MSVC alone is enough
> but I dunno if building it requires NVCC + CCCL headers, or GCC/MSVC alone is enough

Based on adding `-DCMAKE_VERBOSE_MAKEFILE=ON` and looking at the output of `pip install --verbose ./cuda_parallel[test]`, `nvcc` is required for compiling cccl/c/parallel/src/`for.cu` and `reduce.cu`:

```shell
cd /home/coder/cccl/python/cuda_parallel/build/temp.linux-x86_64-cpython-312/c/parallel && /usr/bin/sccache /usr/local/cuda/bin/nvcc -forward-unknown-to-host-compiler -ccbin=/usr/bin/g++ -DCCCL_C_EXPERIMENTAL=1 -DNVRTC_GET_TYPE_NAME=1 -D_CCCL_NO_SYSTEM_HEADER -Dcccl_c_parallel_EXPORTS --options-file CMakeFiles/cccl.c.parallel.dir/includes_CUDA.rsp -O3 -DNDEBUG -std=c++20 "--generate-code=arch=compute_52,code=[compute_52,sm_52]" -Xcompiler=-fPIC -Xcudafe=--display_error_number -Wno-deprecated-gpu-targets -Xcudafe=--promote_warnings -Wreorder -Xcompiler=-Werror -Xcompiler=-Wall -Xcompiler=-Wextra -Xcompiler=-Wreorder -Xcompiler=-Winit-self -Xcompiler=-Woverloaded-virtual -Xcompiler=-Wcast-qual -Xcompiler=-Wpointer-arith -Xcompiler=-Wvla -Xcompiler=-Wno-gnu-line-marker -Xcompiler=-Wno-gnu-zero-variadic-macro-arguments -Xcompiler=-Wno-unused-function -Xcompiler=-Wno-noexcept-type -MD -MT c/parallel/CMakeFiles/cccl.c.parallel.dir/src/for.cu.o -MF CMakeFiles/cccl.c.parallel.dir/src/for.cu.o.d -x cu -c /home/coder/cccl/c/parallel/src/for.cu -o CMakeFiles/cccl.c.parallel.dir/src/for.cu.o
cd /home/coder/cccl/python/cuda_parallel/build/temp.linux-x86_64-cpython-312/c/parallel && /usr/bin/sccache /usr/local/cuda/bin/nvcc -forward-unknown-to-host-compiler -ccbin=/usr/bin/g++ -DCCCL_C_EXPERIMENTAL=1 -DNVRTC_GET_TYPE_NAME=1 -D_CCCL_NO_SYSTEM_HEADER -Dcccl_c_parallel_EXPORTS --options-file CMakeFiles/cccl.c.parallel.dir/includes_CUDA.rsp -O3 -DNDEBUG -std=c++20 "--generate-code=arch=compute_52,code=[compute_52,sm_52]" -Xcompiler=-fPIC -Xcudafe=--display_error_number -Wno-deprecated-gpu-targets -Xcudafe=--promote_warnings -Wreorder -Xcompiler=-Werror -Xcompiler=-Wall -Xcompiler=-Wextra -Xcompiler=-Wreorder -Xcompiler=-Winit-self -Xcompiler=-Woverloaded-virtual -Xcompiler=-Wcast-qual -Xcompiler=-Wpointer-arith -Xcompiler=-Wvla -Xcompiler=-Wno-gnu-line-marker -Xcompiler=-Wno-gnu-zero-variadic-macro-arguments -Xcompiler=-Wno-unused-function -Xcompiler=-Wno-noexcept-type -MD -MT c/parallel/CMakeFiles/cccl.c.parallel.dir/src/reduce.cu.o -MF CMakeFiles/cccl.c.parallel.dir/src/reduce.cu.o.d -x cu -c /home/coder/cccl/c/parallel/src/reduce.cu -o CMakeFiles/cccl.c.parallel.dir/src/reduce.cu.o
```
I skimmed over the code and I am actually confused, because my impression is that the kernel compilation is still done at run time (JIT), and that the host logic can just be handled by a host compiler. @gevtushenko IIRC you built the prototype, any reason we have to use `.cu` files here and use NVCC to compile?
Commit 2913ae0 adopts the established _version.py handling.
tl;dr I would suggest that if you have to do any compilation whatsoever beyond pure Cython you switch away from setuptools, but if you don't have any compiled modules at build time then stick to setuptools or use another backend that isn't designed for compilation (hatchling would be a great choice).
> @gevtushenko IIRC you built the prototype, any reason we have to use `.cu` files here and use NVCC to compile?

In the offline call Georgii reminded me that there are some CUB structs that we need to pre-compile to pass around. Since generally CUB headers are not host-compilable, NVCC has to be used, but we don't generate any GPU-specific code.
Q: In what way is it not working?

It is getting a non-existent path here:

At HEAD, cuda_parallel/cuda/_include exists in the source directory (it is …
On August 30, 2014 @leofang wrote: …

Leo: Do you still recommend that we replace …? I'm asking because that'll take this PR in a very different direction (I think).
Logging an observation (JIC it's useful to reference this later).

With CCCL HEAD (I have @ d6253b5). TL;DR: @gevtushenko could this explain your "only works 50% of the time" experience? Current working directory is …

The output is: …

Similarly for cuda_parallel: … Same output as above.

Now with this PR (@ daab580). TL;DR: Same problem (this had me really confused TBH).

Output: …
Small summary: …
Commit ef9d5f4 ("… cuda._include to find the include path.") makes the … I wouldn't be surprised if this isn't the right way of doing it, but it does work in one pass.

Commit bc116dc ("… (they are equivalent to the default functions)") fixes the …
It turns out what I discovered the hard way was actually a known issue: Lines 23 to 27 in d6253b5
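(For the record, a minimal repro of the general pitfall that seems to be at play here. That this is the actual known issue linked above is an assumption on my part, and `cuda_demo` is a made-up package name for illustration.)

```python
# When the current working directory contains a package directory with the
# same name as an installed package, "python -c" resolves the import to the
# local copy (sys.path[0] is the CWD), which may lack built artifacts such
# as the copied headers.
import os
import subprocess
import sys
import tempfile

tmp = tempfile.mkdtemp()
os.makedirs(os.path.join(tmp, "cuda_demo"))  # hypothetical package name
open(os.path.join(tmp, "cuda_demo", "__init__.py"), "w").close()

result = subprocess.run(
    [sys.executable, "-c", "import cuda_demo; print(cuda_demo.__file__)"],
    cwd=tmp,
    capture_output=True,
    text=True,
)
# cuda_demo is not installed anywhere, so a successful import proves the
# local (CWD) copy was picked up.
shadowed = result.returncode == 0 and "cuda_demo" in result.stdout
```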
/ok to test
🟩 CI finished in 58m 34s: Pass: 100%/176 | Total: 1d 00h | Avg: 8m 22s | Max: 44m 12s | Hits: 99%/22510
|     | Project |
|-----|---------|
| +/- | CCCL Infrastructure |
|     | libcu++ |
|     | CUB |
|     | Thrust |
|     | CUDA Experimental |
| +/- | python |
|     | CCCL C Parallel Library |
|     | Catch2Helper |

Modifications in project or dependencies?

|     | Project |
|-----|---------|
| +/- | CCCL Infrastructure |
| +/- | libcu++ |
| +/- | CUB |
| +/- | Thrust |
| +/- | CUDA Experimental |
| +/- | python |
| +/- | CCCL C Parallel Library |
| +/- | Catch2Helper |

🏃 Runner counts (total jobs: 176)

| #   | Runner |
|-----|--------|
| 125 | linux-amd64-cpu16 |
| 25  | linux-amd64-gpu-v100-latest-1 |
| 15  | windows-amd64-cpu16 |
| 10  | linux-arm64-cpu16 |
| 1   | linux-amd64-gpu-h100-latest-1-testing |
Don't sweat making major changes. There are technical and cultural reasons that I suggested …

However, it does take some effort to rewrite the build system. Right now our custom …
ci/update_version.sh
Outdated
```diff
@@ -110,6 +111,7 @@ update_file "$CUDAX_CMAKE_VERSION_FILE" "set(cudax_VERSION_MAJOR \([0-9]\+\))" "
 update_file "$CUDAX_CMAKE_VERSION_FILE" "set(cudax_VERSION_MINOR \([0-9]\+\))" "set(cudax_VERSION_MINOR $minor)"
 update_file "$CUDAX_CMAKE_VERSION_FILE" "set(cudax_VERSION_PATCH \([0-9]\+\))" "set(cudax_VERSION_PATCH $patch)"
+
+update_file "$CUDA_CCCL_VERSION_FILE" "^__version__ = \"\([0-9.]\+\)\"" "__version__ = \"$pymajor.$pyminor.$major.$minor.$patch\""
```
I think this is wrong; we need the headers to track the C++ version.

```diff
-update_file "$CUDA_CCCL_VERSION_FILE" "^__version__ = \"\([0-9.]\+\)\"" "__version__ = \"$pymajor.$pyminor.$major.$minor.$patch\""
+update_file "$CUDA_CCCL_VERSION_FILE" "^__version__ = \"\([0-9.]\+\)\"" "__version__ = \"$major.$minor.$patch\""
```
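For concreteness, a sketch of the substitution the suggested line performs, assuming `update_file` wraps `sed -i` (a temp file stands in for `$CUDA_CCCL_VERSION_FILE` here):

```shell
# Simulate the suggested update_file call on a throwaway _version.py.
VERSION_FILE="$(mktemp)"
echo '__version__ = "0.1.2.7.0"' > "$VERSION_FILE"

major=2 minor=8 patch=0
# GNU sed syntax; BSD sed would need `sed -i ''`.
sed -i "s/^__version__ = \"[0-9.]\+\"/__version__ = \"$major.$minor.$patch\"/" "$VERSION_FILE"

cat "$VERSION_FILE"
```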
Done: commit df943c0
python/cuda_cccl/.gitignore
Outdated
Q: Is it possible that we consolidate `.gitignore` files at the root directory and not have independent ones per sub dir...?
I created #3212 to look into this later.
python/cuda_cccl/setup.py
Outdated
```python
__version__ = None
with open(os.path.join(project_path, "cuda", "cccl", "_version.py")) as f:
    exec(f.read())
assert __version__ is not None
ver = __version__
del __version__
```
nit: These can be moved to pyproject.toml cleanly, see, e.g. how we do it in cuda.core:
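For the record, the setuptools equivalent looks roughly like this (the module path is taken from the snippet above; the pattern itself is standard setuptools dynamic metadata):

```toml
[project]
name = "cuda-cccl"  # illustrative
dynamic = ["version"]

[tool.setuptools.dynamic]
version = {attr = "cuda.cccl._version.__version__"}
```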
Done: commit 40c8389
python/cuda_cccl/setup.py
Outdated
```python
with open("README.md") as f:
    long_description = f.read()
```
nit: this can be moved to pyproject.toml too, ex:
https://github.com/NVIDIA/cuda-python/blob/33b7366e308201f3bca8206ae331e399ac1b3379/cuda_core/pyproject.toml#L65
(in pyproject.toml, `readme` is the new preferred name over `long_description`)
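The corresponding pyproject.toml fragment would look roughly like this (setuptools also accepts a plain string `readme = "README.md"` when the content type can be inferred):

```toml
[project]
readme = {file = "README.md", content-type = "text/markdown"}
```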
Done: commit 40c8389
python/cuda_cccl/setup.py
Outdated
```python
def copy_cccl_headers_to_cuda_include():
    inc_path = os.path.join(project_path, "cuda", "_include")
```
Important: Can we please establish the right layout before merging? There is a new wheel layout being communicated internally. This should be something like
```diff
-    inc_path = os.path.join(project_path, "cuda", "_include")
+    inc_path = os.path.join(project_path, "cccl", "include")
```
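A self-contained sketch of the copy step being discussed, using the `cccl_headers` mapping from setup.py. The throwaway directory tree below is fabricated for the demo; the real setup.py derives `cccl_path` from `__file__`:

```python
import os
import shutil
import tempfile

# Mapping taken from setup.py: (project dir in the CCCL repo, header subdir).
CCCL_HEADERS = [["cub", "cub"], ["libcudacxx", "include"], ["thrust", "thrust"]]

def copy_cccl_headers(cccl_path, inc_path):
    """Copy each project's headers into inc_path (e.g. cuda/cccl/include)."""
    for proj_dir, header_dir in CCCL_HEADERS:
        src = os.path.join(cccl_path, proj_dir, header_dir)
        dst = os.path.join(inc_path, proj_dir)
        shutil.copytree(src, dst, dirs_exist_ok=True)

# Demo against a throwaway tree standing in for a CCCL checkout.
root = tempfile.mkdtemp()
for proj_dir, header_dir in CCCL_HEADERS:
    os.makedirs(os.path.join(root, proj_dir, header_dir))
open(os.path.join(root, "cub", "cub", "cub.cuh"), "w").close()

inc = os.path.join(root, "pkg", "cuda", "cccl", "include")
copy_cccl_headers(root, inc)
copied = os.path.isfile(os.path.join(inc, "cub", "cub.cuh"))
```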
Done: commit e3c7867
I'm not sure about two aspects:

1. Did you mean to suggest `site-packages/cccl/include/` as the install directory? I decided to make this `site-packages/cuda/cccl/include/`. It seemed odd to me to have it outside the cuda subdir.
2. By accident I discovered that all cuda.cooperative unit tests pass without the CCCL headers. @gevtushenko for comment.
> By accident I discovered that all cuda.cooperative unit tests pass without the CCCL headers. @gevtushenko for comment.

@rwgk we add the path to CUDA Toolkit (CTK) headers here:

```python
cuda_include_path = os.path.join(get_cuda_path(), "include")
```

CTK provides CUB headers as well. Tests likely pass because they use the CTK version of CUB. We should add a static assert somewhere in cuda.cooperative to check that we use the version of CUB that cuda.cooperative was "built" with.
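A sketch of what such a guard could look like, parsing the `CUB_VERSION` macro from the headers on the include path (the header text and expected value below are fabricated for the demo; the real check would read `cub/version.cuh`):

```python
import re

def check_cub_version(version_header_text, expected):
    """Raise if the CUB_VERSION macro in the header text differs from expected."""
    m = re.search(r"#define\s+CUB_VERSION\s+(\d+)", version_header_text)
    if m is None:
        raise RuntimeError("CUB_VERSION not found in version header")
    found = int(m.group(1))
    if found != expected:
        raise RuntimeError(
            f"CUB version mismatch: cuda.cooperative was built against "
            f"{expected}, but the headers on the include path define {found}"
        )
    return found

# Fabricated header snippet for the demo.
header = '#define CUB_VERSION 200800\n'
matched = check_cub_version(header, 200800)  # passes

mismatch_raised = False
try:
    check_cub_version(header, 200700)  # CTK's older CUB would be caught here
except RuntimeError:
    mismatch_raised = True
```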
…sted under NVIDIA#3201 (comment)) Command used: ci/update_version.sh 2 8 0
Trigger for this change: * NVIDIA#3201 (comment) * NVIDIA#3201 (comment)
Trigger for this change: * NVIDIA#3201 (comment) Unexpected accidental discovery: cuda.cooperative unit tests pass without CCCL headers entirely.
🟩 CI finished in 2h 05m: Pass: 100%/176 | Total: 3d 07h | Avg: 27m 15s | Max: 1h 15m | Hits: 57%/22530
🟩 CI finished in 1h 41m: Pass: 100%/176 | Total: 3d 17h | Avg: 30m 27s | Max: 1h 17m | Hits: 31%/22530
🟨 CI finished in 51m 02s: Pass: 99%/176 | Total: 1d 05h | Avg: 9m 59s | Max: 41m 53s | Hits: 82%/22530
🟩 CI finished in 1h 08m: Pass: 100%/176 | Total: 1d 05h | Avg: 10m 09s | Max: 47m 06s | Hits: 83%/22530
… to resolve Build and Verify Docs failure.)
🟩 CI finished in 1h 18m: Pass: 100%/176 | Total: 1d 04h | Avg: 9m 52s | Max: 52m 11s | Hits: 87%/22530
Description
closes #2281