Use programmatic dependent launch in CUB merge sort #3114
Conversation
b863f58 to bd24cd7 (compare)
/ok to test

🟨 CI finished in 2h 25m: Pass: 85%/94 | Total: 2d 14h | Avg: 39m 46s | Max: 1h 23m | Hits: 47%/9706

| | Project |
|---|---|
| | CCCL Infrastructure |
| | libcu++ |
| +/- | CUB |
| +/- | Thrust |
| | CUDA Experimental |
| | python |
| | CCCL C Parallel Library |
| | Catch2Helper |

Modifications in project or dependencies?

| | Project |
|---|---|
| | CCCL Infrastructure |
| | libcu++ |
| +/- | CUB |
| +/- | Thrust |
| | CUDA Experimental |
| +/- | python |
| +/- | CCCL C Parallel Library |
| +/- | Catch2Helper |

🏃 Runner counts (total jobs: 94)

| # | Runner |
|---|---|
| 70 | linux-amd64-cpu16 |
| 11 | linux-amd64-gpu-v100-latest-1 |
| 9 | windows-amd64-cpu16 |
| 4 | linux-arm64-cpu16 |
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.
/ok to test

🟩 CI finished in 1h 37m: Pass: 100%/94 | Total: 2d 13h | Avg: 39m 17s | Max: 1h 08m | Hits: 62%/12324

| | Project |
|---|---|
| | CCCL Infrastructure |
| | libcu++ |
| +/- | CUB |
| +/- | Thrust |
| | CUDA Experimental |
| | python |
| | CCCL C Parallel Library |
| | Catch2Helper |

Modifications in project or dependencies?

| | Project |
|---|---|
| | CCCL Infrastructure |
| | libcu++ |
| +/- | CUB |
| +/- | Thrust |
| | CUDA Experimental |
| +/- | python |
| +/- | CCCL C Parallel Library |
| +/- | Catch2Helper |

🏃 Runner counts (total jobs: 94)

| # | Runner |
|---|---|
| 70 | linux-amd64-cpu16 |
| 11 | linux-amd64-gpu-v100-latest-1 |
| 9 | windows-amd64-cpu16 |
| 4 | linux-arm64-cpu16 |
```cpp
// True when programmatic dependent launch is available, otherwise false.
#define _THRUST_HAS_PDL _CCCL_CUDACC_AT_LEAST(11, 8)

#if _THRUST_HAS_PDL
// Waits for the previous kernel to complete (when it reaches its final membar).
// Should be put before the first global memory access in a kernel.
#  define _THRUST_PDL_GRID_DEPENDENCY_SYNC() NV_IF_TARGET(NV_PROVIDES_SM_90, cudaGridDependencySynchronize();)
// Allows the subsequent kernel in the same stream to launch. Can be put anywhere in a kernel.
// Heuristic (ahendriksen): put it after the last load.
#  define _THRUST_PDL_TRIGGER_NEXT_LAUNCH() NV_IF_TARGET(NV_PROVIDES_SM_90, cudaTriggerProgrammaticLaunchCompletion();)
#else
#  define _THRUST_PDL_GRID_DEPENDENCY_SYNC()
#  define _THRUST_PDL_TRIGGER_NEXT_LAUNCH()
#endif // _THRUST_HAS_PDL
```
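The empty `#else` branch is what keeps call sites portable: when PDL is unavailable, both macros expand to nothing and kernels behave as ordinary launches. A minimal host-compilable sketch of the same gating pattern (the `DEMO_` names and the logging string are illustrative stand-ins, not CCCL code):

```cpp
#include <string>

std::string g_log;  // records what the macros expand to in this build

// Stand-in for the real availability check (_CCCL_CUDACC_AT_LEAST(11, 8)),
// keyed off an ordinary macro so the pattern compiles on the host.
#define DEMO_HAS_PDL 0

#if DEMO_HAS_PDL
#  define DEMO_PDL_GRID_DEPENDENCY_SYNC() g_log += "sync;"
#  define DEMO_PDL_TRIGGER_NEXT_LAUNCH() g_log += "trigger;"
#else
// Feature unavailable: the macros expand to nothing, so every call site
// compiles unchanged.
#  define DEMO_PDL_GRID_DEPENDENCY_SYNC()
#  define DEMO_PDL_TRIGGER_NEXT_LAUNCH()
#endif

// Host-side stand-in for a kernel body that uses the macros.
std::string kernel_body()
{
  g_log.clear();
  DEMO_PDL_GRID_DEPENDENCY_SYNC();  // before the first global memory access
  g_log += "work;";
  DEMO_PDL_TRIGGER_NEXT_LAUNCH();   // after the last load
  return g_log;
}
```

With `DEMO_HAS_PDL` set to 0, `kernel_body()` records only the real work; flipping it to 1 would add the sync and trigger entries without touching the call site.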
Those macros are only used within Thrust for the moment, but they are not specific to Thrust. Should we rather move them into CCCL and name them `_CCCL_PDL_MEOW`?
I moved them to `cuda/std/__cccl/cuda_capabilities.h`, as discussed offline.
* Extend triple_chevron_launch to handle PDL
* Flag benchmark as synchronizing
* Add launch control APIs to merge sort kernels
🟩 CI finished in 1h 34m: Pass: 100%/168 | Total: 3d 01h | Avg: 26m 08s | Max: 1h 11m | Hits: 71%/22398

| | Project |
|---|---|
| | CCCL Infrastructure |
| +/- | libcu++ |
| +/- | CUB |
| +/- | Thrust |
| | CUDA Experimental |
| | python |
| | CCCL C Parallel Library |
| | Catch2Helper |

Modifications in project or dependencies?

| | Project |
|---|---|
| | CCCL Infrastructure |
| +/- | libcu++ |
| +/- | CUB |
| +/- | Thrust |
| +/- | CUDA Experimental |
| +/- | python |
| +/- | CCCL C Parallel Library |
| +/- | Catch2Helper |

🏃 Runner counts (total jobs: 168)

| # | Runner |
|---|---|
| 124 | linux-amd64-cpu16 |
| 19 | linux-amd64-gpu-v100-latest-1 |
| 15 | windows-amd64-cpu16 |
| 10 | linux-arm64-cpu16 |
```diff
@@ -127,7 +127,7 @@ void keys(nvbench::state& state, nvbench::type_list<T, OffsetT>)
   thrust::device_vector<nvbench::uint8_t> temp(temp_size);
   auto* temp_storage = thrust::raw_pointer_cast(temp.data());

-  state.exec(nvbench::exec_tag::no_batch, [&](nvbench::launch& launch) {
+  state.exec(nvbench::exec_tag::sync, [&](nvbench::launch& launch) {
```
Question: why is this change needed? Did merge become synchronous?
Yes. Merge sort now has calls to `cudaGridDependencySynchronize`, which cause the benchmark to crash if I were to use `no_batch`.
For some reason, I made PDL available with CTK 11.8 in #3114, but it seems the feature is only available starting with CTK 12.0.
The kernel already contains a call to `_CCCL_PDL_GRID_DEPENDENCY_SYNC`, but PDL was not enabled when launching it. This was missed in NVIDIA#3114.
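Enabling the device-side calls alone is not enough; the launch itself must opt into PDL. A minimal sketch of the launch-side opt-in via the CUDA runtime's programmatic-stream-serialization attribute (the kernel and all names here are illustrative, not the actual Thrust `triple_chevron` launcher; assumes CUDA 12 and an SM 90 device):

```cuda
#include <cuda_runtime.h>

// Illustrative kernel: wait on the predecessor before touching global memory,
// then signal so the next kernel in the stream may start early.
__global__ void stage(int* data)
{
  cudaGridDependencySynchronize();            // wait for the previous kernel's stores
  data[threadIdx.x] += 1;
  cudaTriggerProgrammaticLaunchCompletion();  // allow the next launch to begin
}

int main()
{
  int* d = nullptr;
  cudaMalloc(&d, 256 * sizeof(int));

  // Opt the launch into programmatic dependent launch.
  cudaLaunchConfig_t config{};
  config.gridDim  = dim3(1);
  config.blockDim = dim3(256);

  cudaLaunchAttribute attr{};
  attr.id = cudaLaunchAttributeProgrammaticStreamSerialization;
  attr.val.programmaticStreamSerialization.allowed = 1;
  config.attrs    = &attr;
  config.numAttrs = 1;

  // Two back-to-back launches in the same stream: the second may begin
  // launching as soon as the first triggers completion.
  cudaLaunchKernelEx(&config, stage, d);
  cudaLaunchKernelEx(&config, stage, d);
  cudaDeviceSynchronize();
  cudaFree(d);
  return 0;
}
```

Without the attribute, `cudaGridDependencySynchronize()` in the kernel has nothing to wait on and the launches serialize as usual, which is the gap this comment describes.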
This PR explores the use of programmatic dependent launch (PDL) for `cub::MergeSort`.

nsys trace for `cub.bench.merge_sort.keys.base -d 0 --stopping-criterion entropy --profile -a 'T{ct}=I8' -a 'OffsetT{ct}=I32' -a 'Elements{io}[pow2]=20' -a 'Entropy=1.000'`
Before:
After:
We can see the tighter execution of kernels back-to-back.
Benchmark on H200
Addresses part of #3115