Use programmatic dependent launch in CUB merge sort #3114

Merged 2 commits on Dec 11, 2024

@bernhardmgruber commented Dec 10, 2024

This PR explores the use of programmatic dependent launch (PDL) for cub::MergeSort.

nsys trace for `cub.bench.merge_sort.keys.base -d 0 --stopping-criterion entropy --profile -a 'T{ct}=I8' -a 'OffsetT{ct}=I32' -a 'Elements{io}[pow2]=20' -a 'Entropy=1.000'`:

Before: [image: nsys trace]

After: [image: nsys trace]

With PDL, the kernels execute back-to-back with smaller gaps between them.
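For readers unfamiliar with PDL, here is a minimal sketch of the general mechanism as exposed by the CUDA runtime. It is not the code from this PR: the kernels `fill` and `twice` and all sizes are hypothetical, and the sketch assumes a toolkit that provides `cudaLaunchKernelEx` and a GPU with compute capability 9.0 or newer (compile for `sm_90`). The dependent kernel is launched with the programmatic stream serialization attribute, so it may start while the preceding kernel is still finishing, and it calls `cudaGridDependencySynchronize()` before touching data the preceding kernel produced.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Hypothetical first kernel: fills the buffer with its indices.
__global__ void fill(int* data, int n)
{
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) { data[i] = i; }
#if __CUDA_ARCH__ >= 900
  // Tell the runtime that the dependent kernel may begin launching.
  cudaTriggerProgrammaticLaunchCompletion();
#endif
}

// Hypothetical second kernel: doubles what the first kernel wrote.
__global__ void twice(int* data, int n)
{
  // Work that does not depend on the previous kernel can go here.
  int i = blockIdx.x * blockDim.x + threadIdx.x;
#if __CUDA_ARCH__ >= 900
  // Block until the preceding kernel has completed and its writes are visible.
  cudaGridDependencySynchronize();
#endif
  if (i < n) { data[i] *= 2; }
}

int main()
{
  const int n = 1 << 20, threads = 256, blocks = (n + threads - 1) / threads;
  int* d = nullptr;
  cudaMalloc(&d, n * sizeof(int));
  cudaStream_t stream;
  cudaStreamCreate(&stream);

  // Ordinary launch of the first kernel.
  fill<<<blocks, threads, 0, stream>>>(d, n);

  // Launch the second kernel with the PDL attribute set.
  cudaLaunchAttribute attr[1];
  attr[0].id = cudaLaunchAttributeProgrammaticStreamSerialization;
  attr[0].val.programmaticStreamSerializationAllowed = 1;

  cudaLaunchConfig_t cfg = {};
  cfg.gridDim          = dim3(blocks);
  cfg.blockDim         = dim3(threads);
  cfg.dynamicSmemBytes = 0;
  cfg.stream           = stream;
  cfg.attrs            = attr;
  cfg.numAttrs         = 1;
  cudaError_t err = cudaLaunchKernelEx(&cfg, twice, d, n);
  if (err != cudaSuccess) { std::printf("launch failed: %s\n", cudaGetErrorString(err)); }

  int last = 0;
  cudaMemcpyAsync(&last, d + n - 1, sizeof(int), cudaMemcpyDeviceToHost, stream);
  cudaStreamSynchronize(stream);
  std::printf("last element: %d\n", last);

  cudaStreamDestroy(stream);
  cudaFree(d);
  return 0;
}
```

The gain comes from overlapping the launch latency and prologue of the second kernel with the tail of the first, which would explain why the effect is largest for small inputs where launch overhead dominates.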

Benchmark on H200
## [0] NVIDIA H200 NVL

|  T{ct}  |  OffsetT{ct}  |  Elements{io}  |  Entropy  |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |        Diff |   %Diff |  Status  |
|---------|---------------|----------------|-----------|------------|-------------|------------|-------------|-------------|---------|----------|
|   I8    |      I32      |      2^16      |     1     |  56.637 us |       0.90% |  47.600 us |       0.85% |   -9.037 us | -15.96% |   FAST   |
|   I8    |      I32      |      2^20      |     1     | 125.550 us |       0.53% | 103.864 us |       0.48% |  -21.686 us | -17.27% |   FAST   |
|   I8    |      I32      |      2^24      |     1     | 754.473 us |       0.34% | 715.045 us |       0.28% |  -39.428 us |  -5.23% |   FAST   |
|   I8    |      I32      |      2^28      |     1     |  12.093 ms |       0.11% |  12.060 ms |       0.12% |  -32.455 us |  -0.27% |   FAST   |
|   I8    |      I32      |      2^16      |   0.201   |  57.289 us |       0.90% |  46.955 us |       0.75% |  -10.334 us | -18.04% |   FAST   |
|   I8    |      I32      |      2^20      |   0.201   | 121.903 us |       0.53% | 101.537 us |       0.43% |  -20.366 us | -16.71% |   FAST   |
|   I8    |      I32      |      2^24      |   0.201   | 711.011 us |       0.39% | 674.561 us |       0.31% |  -36.451 us |  -5.13% |   FAST   |
|   I8    |      I32      |      2^28      |   0.201   |  11.337 ms |       0.14% |  11.276 ms |       0.13% |  -60.914 us |  -0.54% |   FAST   |
|   I8    |      I64      |      2^16      |     1     |  60.455 us |       0.82% |  50.153 us |       1.56% |  -10.302 us | -17.04% |   FAST   |
|   I8    |      I64      |      2^20      |     1     | 129.027 us |       0.47% | 109.501 us |       0.37% |  -19.526 us | -15.13% |   FAST   |
|   I8    |      I64      |      2^24      |     1     | 766.632 us |       0.38% | 729.845 us |       0.25% |  -36.786 us |  -4.80% |   FAST   |
|   I8    |      I64      |      2^28      |     1     |  12.212 ms |       0.10% |  12.159 ms |       0.10% |  -52.046 us |  -0.43% |   FAST   |
|   I8    |      I64      |      2^16      |   0.201   |  59.361 us |       0.91% |  49.300 us |       0.69% |  -10.062 us | -16.95% |   FAST   |
|   I8    |      I64      |      2^20      |   0.201   | 125.903 us |       0.50% | 106.021 us |       0.44% |  -19.882 us | -15.79% |   FAST   |
|   I8    |      I64      |      2^24      |   0.201   | 723.476 us |       0.46% | 688.062 us |       0.35% |  -35.414 us |  -4.89% |   FAST   |
|   I8    |      I64      |      2^28      |   0.201   |  11.447 ms |       0.10% |  11.383 ms |       0.08% |  -63.780 us |  -0.56% |   FAST   |
|   I16   |      I32      |      2^16      |     1     |  62.244 us |       0.82% |  52.173 us |       0.90% |  -10.071 us | -16.18% |   FAST   |
|   I16   |      I32      |      2^20      |     1     | 133.690 us |       0.59% | 114.302 us |       0.45% |  -19.388 us | -14.50% |   FAST   |
|   I16   |      I32      |      2^24      |     1     | 882.365 us |       0.30% | 845.034 us |       0.21% |  -37.331 us |  -4.23% |   FAST   |
|   I16   |      I32      |      2^28      |     1     |  13.294 ms |       0.14% |  13.330 ms |       0.14% |   36.765 us |   0.28% |   SLOW   |
|   I16   |      I32      |      2^16      |   0.201   |  61.778 us |       0.86% |  51.693 us |       1.13% |  -10.084 us | -16.32% |   FAST   |
|   I16   |      I32      |      2^20      |   0.201   | 130.022 us |       0.55% | 110.515 us |       0.48% |  -19.507 us | -15.00% |   FAST   |
|   I16   |      I32      |      2^24      |   0.201   | 816.256 us |       0.23% | 779.617 us |       0.29% |  -36.639 us |  -4.49% |   FAST   |
|   I16   |      I32      |      2^28      |   0.201   |  11.769 ms |       0.15% |  11.820 ms |       0.15% |   50.417 us |   0.43% |   SLOW   |
|   I16   |      I64      |      2^16      |     1     |  62.579 us |       0.80% |  52.677 us |       0.87% |   -9.902 us | -15.82% |   FAST   |
|   I16   |      I64      |      2^20      |     1     | 134.663 us |       0.52% | 115.245 us |       0.41% |  -19.418 us | -14.42% |   FAST   |
|   I16   |      I64      |      2^24      |     1     | 891.979 us |       0.36% | 851.583 us |       0.22% |  -40.396 us |  -4.53% |   FAST   |
|   I16   |      I64      |      2^28      |     1     |  13.473 ms |       0.14% |  13.422 ms |       0.14% |  -50.429 us |  -0.37% |   FAST   |
|   I16   |      I64      |      2^16      |   0.201   |  62.149 us |       0.83% |  52.211 us |       0.73% |   -9.938 us | -15.99% |   FAST   |
|   I16   |      I64      |      2^20      |   0.201   | 132.269 us |       0.53% | 111.969 us |       0.63% |  -20.301 us | -15.35% |   FAST   |
|   I16   |      I64      |      2^24      |   0.201   | 827.477 us |       0.31% | 784.111 us |       0.32% |  -43.366 us |  -5.24% |   FAST   |
|   I16   |      I64      |      2^28      |   0.201   |  11.945 ms |       0.15% |  11.897 ms |       0.16% |  -48.360 us |  -0.40% |   FAST   |
|   I32   |      I32      |      2^16      |     1     |  60.291 us |       0.83% |  50.517 us |       0.64% |   -9.774 us | -16.21% |   FAST   |
|   I32   |      I32      |      2^20      |     1     | 130.722 us |       0.51% | 112.443 us |       0.57% |  -18.279 us | -13.98% |   FAST   |
|   I32   |      I32      |      2^24      |     1     | 901.861 us |       0.25% | 860.995 us |       0.37% |  -40.866 us |  -4.53% |   FAST   |
|   I32   |      I32      |      2^28      |     1     |  14.105 ms |       0.23% |  14.129 ms |       0.25% |   24.298 us |   0.17% |   SAME   |
|   I32   |      I32      |      2^16      |   0.201   |  61.565 us |       0.95% |  51.346 us |       0.95% |  -10.219 us | -16.60% |   FAST   |
|   I32   |      I32      |      2^20      |   0.201   | 130.747 us |       0.54% | 111.227 us |       0.43% |  -19.520 us | -14.93% |   FAST   |
|   I32   |      I32      |      2^24      |   0.201   | 850.385 us |       0.32% | 810.549 us |       0.27% |  -39.836 us |  -4.68% |   FAST   |
|   I32   |      I32      |      2^28      |   0.201   |  12.518 ms |       0.06% |  12.496 ms |       0.07% |  -22.119 us |  -0.18% |   FAST   |
|   I32   |      I64      |      2^16      |     1     |  61.368 us |       0.87% |  50.950 us |       0.62% |  -10.418 us | -16.98% |   FAST   |
|   I32   |      I64      |      2^20      |     1     | 132.968 us |       0.45% | 113.945 us |       0.39% |  -19.023 us | -14.31% |   FAST   |
|   I32   |      I64      |      2^24      |     1     | 907.791 us |       0.23% | 866.074 us |       0.23% |  -41.717 us |  -4.60% |   FAST   |
|   I32   |      I64      |      2^28      |     1     |  14.245 ms |       0.29% |  14.282 ms |       0.58% |   36.830 us |   0.26% |   SAME   |
|   I32   |      I64      |      2^16      |   0.201   |  62.014 us |       1.08% |  53.112 us |       2.21% |   -8.902 us | -14.35% |   FAST   |
|   I32   |      I64      |      2^20      |   0.201   | 131.672 us |       0.44% | 112.474 us |       0.64% |  -19.198 us | -14.58% |   FAST   |
|   I32   |      I64      |      2^24      |   0.201   | 856.974 us |       0.23% | 815.076 us |       0.27% |  -41.898 us |  -4.89% |   FAST   |
|   I32   |      I64      |      2^28      |   0.201   |  12.553 ms |       0.07% |  12.570 ms |       0.50% |   16.765 us |   0.13% |   SLOW   |
|   I64   |      I32      |      2^16      |     1     |  71.689 us |       0.88% |  59.153 us |       0.61% |  -12.536 us | -17.49% |   FAST   |
|   I64   |      I32      |      2^20      |     1     | 184.195 us |       0.44% | 163.066 us |       0.34% |  -21.129 us | -11.47% |   FAST   |
|   I64   |      I32      |      2^24      |     1     |   1.846 ms |       0.23% |   1.808 ms |       0.41% |  -37.939 us |  -2.06% |   FAST   |
|   I64   |      I32      |      2^28      |     1     |  31.919 ms |       0.13% |  32.052 ms |       0.25% |  133.313 us |   0.42% |   SLOW   |
|   I64   |      I32      |      2^16      |   0.201   |  71.621 us |       0.90% |  59.835 us |       0.98% |  -11.786 us | -16.46% |   FAST   |
|   I64   |      I32      |      2^20      |   0.201   | 191.616 us |       0.44% | 172.121 us |       0.50% |  -19.495 us | -10.17% |   FAST   |
|   I64   |      I32      |      2^24      |   0.201   |   1.935 ms |       0.17% |   1.896 ms |       0.31% |  -38.850 us |  -2.01% |   FAST   |
|   I64   |      I32      |      2^28      |   0.201   |  32.532 ms |       0.11% |  32.580 ms |       0.10% |   47.636 us |   0.15% |   SLOW   |
|   I64   |      I64      |      2^16      |     1     |  71.718 us |       0.89% |  59.715 us |       0.68% |  -12.003 us | -16.74% |   FAST   |
|   I64   |      I64      |      2^20      |     1     | 184.457 us |       0.51% | 164.217 us |       0.32% |  -20.240 us | -10.97% |   FAST   |
|   I64   |      I64      |      2^24      |     1     |   1.859 ms |       0.22% |   1.813 ms |       0.32% |  -45.745 us |  -2.46% |   FAST   |
|   I64   |      I64      |      2^28      |     1     |  31.864 ms |       0.22% |  31.931 ms |       0.16% |   66.442 us |   0.21% |   SLOW   |
|   I64   |      I64      |      2^16      |   0.201   |  72.281 us |       0.83% |  60.839 us |       0.61% |  -11.443 us | -15.83% |   FAST   |
|   I64   |      I64      |      2^20      |   0.201   | 195.081 us |       0.43% | 174.550 us |       0.40% |  -20.530 us | -10.52% |   FAST   |
|   I64   |      I64      |      2^24      |   0.201   |   1.943 ms |       0.31% |   1.904 ms |       0.26% |  -38.869 us |  -2.00% |   FAST   |
|   I64   |      I64      |      2^28      |   0.201   |  32.660 ms |       0.09% |  32.707 ms |       0.10% |   46.363 us |   0.14% |   SLOW   |
|  I128   |      I32      |      2^16      |     1     |  83.964 us |       0.90% |  70.504 us |       0.92% |  -13.460 us | -16.03% |   FAST   |
|  I128   |      I32      |      2^20      |     1     | 306.784 us |       0.94% | 281.616 us |       0.43% |  -25.167 us |  -8.20% |   FAST   |
|  I128   |      I32      |      2^24      |     1     |   3.924 ms |       0.21% |   3.880 ms |       0.22% |  -44.186 us |  -1.13% |   FAST   |
|  I128   |      I32      |      2^28      |     1     |  71.054 ms |       0.05% |  70.978 ms |       0.06% |  -76.006 us |  -0.11% |   FAST   |
|  I128   |      I32      |      2^16      |   0.201   |  84.151 us |       0.90% |  70.598 us |       1.15% |  -13.554 us | -16.11% |   FAST   |
|  I128   |      I32      |      2^20      |   0.201   | 311.469 us |       0.81% | 288.039 us |       0.49% |  -23.429 us |  -7.52% |   FAST   |
|  I128   |      I32      |      2^24      |   0.201   |   3.827 ms |       0.11% |   3.782 ms |       0.14% |  -45.344 us |  -1.18% |   FAST   |
|  I128   |      I32      |      2^28      |   0.201   |  67.811 ms |       0.05% |  67.806 ms |       0.06% |   -4.941 us |  -0.01% |   SAME   |
|  I128   |      I64      |      2^16      |     1     |  86.823 us |       0.92% |  70.720 us |       0.50% |  -16.103 us | -18.55% |   FAST   |
|  I128   |      I64      |      2^20      |     1     | 309.301 us |       0.59% | 282.948 us |       0.49% |  -26.353 us |  -8.52% |   FAST   |
|  I128   |      I64      |      2^24      |     1     |   3.950 ms |       0.22% |   3.898 ms |       0.23% |  -51.825 us |  -1.31% |   FAST   |
|  I128   |      I64      |      2^28      |     1     |  71.455 ms |       0.06% |  71.331 ms |       0.05% | -124.015 us |  -0.17% |   FAST   |
|  I128   |      I64      |      2^16      |   0.201   |  87.248 us |       0.89% |  70.819 us |       1.39% |  -16.429 us | -18.83% |   FAST   |
|  I128   |      I64      |      2^20      |   0.201   | 315.809 us |       0.71% | 289.691 us |       0.49% |  -26.118 us |  -8.27% |   FAST   |
|  I128   |      I64      |      2^24      |   0.201   |   3.849 ms |       0.12% |   3.796 ms |       0.12% |  -52.558 us |  -1.37% |   FAST   |
|  I128   |      I64      |      2^28      |   0.201   |  68.132 ms |       0.05% |  68.100 ms |       0.06% |  -32.555 us |  -0.05% |   SAME   |
|   F32   |      I32      |      2^16      |     1     |  61.164 us |       1.04% |  50.853 us |       0.71% |  -10.311 us | -16.86% |   FAST   |
|   F32   |      I32      |      2^20      |     1     | 132.196 us |       0.70% | 113.560 us |       0.50% |  -18.636 us | -14.10% |   FAST   |
|   F32   |      I32      |      2^24      |     1     | 902.337 us |       0.23% | 860.063 us |       0.30% |  -42.274 us |  -4.68% |   FAST   |
|   F32   |      I32      |      2^28      |     1     |  13.858 ms |       0.31% |  13.890 ms |       0.32% |   32.357 us |   0.23% |   SAME   |
|   F32   |      I32      |      2^16      |   0.201   |  60.981 us |       0.95% |  50.940 us |       0.89% |  -10.040 us | -16.46% |   FAST   |
|   F32   |      I32      |      2^20      |   0.201   | 131.584 us |       0.63% | 111.775 us |       0.45% |  -19.809 us | -15.05% |   FAST   |
|   F32   |      I32      |      2^24      |   0.201   | 851.726 us |       0.35% | 811.535 us |       0.29% |  -40.191 us |  -4.72% |   FAST   |
|   F32   |      I32      |      2^28      |   0.201   |  12.512 ms |       0.04% |  12.505 ms |       0.05% |   -7.101 us |  -0.06% |   FAST   |
|   F32   |      I64      |      2^16      |     1     |  61.315 us |       0.79% |  51.074 us |       0.81% |  -10.241 us | -16.70% |   FAST   |
|   F32   |      I64      |      2^20      |     1     | 133.544 us |       0.52% | 114.561 us |       0.48% |  -18.983 us | -14.21% |   FAST   |
|   F32   |      I64      |      2^24      |     1     | 905.479 us |       0.26% | 864.987 us |       0.20% |  -40.493 us |  -4.47% |   FAST   |
|   F32   |      I64      |      2^28      |     1     |  14.257 ms |       1.98% |  14.536 ms |       1.87% |  278.870 us |   1.96% |   SLOW   |
|   F32   |      I64      |      2^16      |   0.201   |  63.155 us |       2.20% |  53.386 us |       2.33% |   -9.769 us | -15.47% |   FAST   |
|   F32   |      I64      |      2^20      |   0.201   | 131.299 us |       0.64% | 113.528 us |       0.51% |  -17.771 us | -13.54% |   FAST   |
|   F32   |      I64      |      2^24      |   0.201   | 853.180 us |       0.31% | 817.087 us |       0.25% |  -36.093 us |  -4.23% |   FAST   |
|   F32   |      I64      |      2^28      |   0.201   |  12.694 ms |       0.59% |  12.739 ms |       0.50% |   44.592 us |   0.35% |   SAME   |
|   F64   |      I32      |      2^16      |     1     |  70.188 us |       0.86% |  58.910 us |       0.86% |  -11.278 us | -16.07% |   FAST   |
|   F64   |      I32      |      2^20      |     1     | 183.386 us |       0.44% | 162.924 us |       0.40% |  -20.462 us | -11.16% |   FAST   |
|   F64   |      I32      |      2^24      |     1     |   1.846 ms |       0.32% |   1.803 ms |       0.36% |  -43.783 us |  -2.37% |   FAST   |
|   F64   |      I32      |      2^28      |     1     |  32.090 ms |       1.01% |  32.387 ms |       1.54% |  297.135 us |   0.93% |   SAME   |
|   F64   |      I32      |      2^16      |   0.201   |  71.093 us |       1.29% |  61.092 us |       2.54% |  -10.001 us | -14.07% |   FAST   |
|   F64   |      I32      |      2^20      |   0.201   | 190.424 us |       0.47% | 170.151 us |       0.36% |  -20.273 us | -10.65% |   FAST   |
|   F64   |      I32      |      2^24      |   0.201   |   1.919 ms |       0.32% |   1.879 ms |       0.27% |  -39.901 us |  -2.08% |   FAST   |
|   F64   |      I32      |      2^28      |   0.201   |  32.452 ms |       0.09% |  32.492 ms |       0.10% |   39.660 us |   0.12% |   SLOW   |
|   F64   |      I64      |      2^16      |     1     |  70.881 us |       0.71% |  59.049 us |       0.65% |  -11.832 us | -16.69% |   FAST   |
|   F64   |      I64      |      2^20      |     1     | 183.486 us |       0.37% | 163.994 us |       0.32% |  -19.491 us | -10.62% |   FAST   |
|   F64   |      I64      |      2^24      |     1     |   1.850 ms |       0.18% |   1.813 ms |       0.35% |  -37.013 us |  -2.00% |   FAST   |
|   F64   |      I64      |      2^28      |     1     |  31.957 ms |       0.11% |  32.032 ms |       0.11% |   74.882 us |   0.23% |   SLOW   |
|   F64   |      I64      |      2^16      |   0.201   |  71.335 us |       0.69% |  60.037 us |       0.58% |  -11.298 us | -15.84% |   FAST   |
|   F64   |      I64      |      2^20      |   0.201   | 192.146 us |       0.42% | 173.485 us |       0.42% |  -18.661 us |  -9.71% |   FAST   |
|   F64   |      I64      |      2^24      |   0.201   |   1.930 ms |       0.30% |   1.890 ms |       0.25% |  -39.434 us |  -2.04% |   FAST   |
|   F64   |      I64      |      2^28      |   0.201   |  32.608 ms |       0.09% |  32.687 ms |       0.10% |   79.401 us |   0.24% |   SLOW   |
|   C64   |      I32      |      2^16      |     1     | 207.577 us |       0.58% | 196.390 us |       0.50% |  -11.187 us |  -5.39% |   FAST   |
|   C64   |      I32      |      2^20      |     1     | 447.000 us |       0.42% | 427.794 us |       0.37% |  -19.206 us |  -4.30% |   FAST   |
|   C64   |      I32      |      2^24      |     1     |   5.284 ms |       0.17% |   5.283 ms |       0.14% |   -1.195 us |  -0.02% |   SAME   |
|   C64   |      I32      |      2^28      |     1     | 101.346 ms |       0.03% | 102.088 ms |       0.04% |  742.288 us |   0.73% |   SLOW   |
|   C64   |      I32      |      2^16      |   0.201   | 320.975 us |       0.41% | 310.896 us |       0.43% |  -10.079 us |  -3.14% |   FAST   |
|   C64   |      I32      |      2^20      |   0.201   | 700.654 us |       0.53% | 684.135 us |       0.56% |  -16.519 us |  -2.36% |   FAST   |
|   C64   |      I32      |      2^24      |   0.201   |  10.557 ms |       0.28% |  10.542 ms |       0.27% |  -15.063 us |  -0.14% |   SAME   |
|   C64   |      I32      |      2^28      |   0.201   | 173.781 ms |       0.06% | 174.236 ms |       0.05% |  455.522 us |   0.26% |   SLOW   |
|   C64   |      I64      |      2^16      |     1     | 209.579 us |       0.55% | 196.378 us |       0.51% |  -13.200 us |  -6.30% |   FAST   |
|   C64   |      I64      |      2^20      |     1     | 452.551 us |       0.41% | 430.664 us |       0.43% |  -21.887 us |  -4.84% |   FAST   |
|   C64   |      I64      |      2^24      |     1     |   5.374 ms |       0.13% |   5.321 ms |       0.14% |  -52.959 us |  -0.99% |   FAST   |
|   C64   |      I64      |      2^28      |     1     | 103.592 ms |       0.10% | 102.974 ms |       0.12% | -617.989 us |  -0.60% |   FAST   |
|   C64   |      I64      |      2^16      |   0.201   | 324.018 us |       0.51% | 311.103 us |       0.45% |  -12.915 us |  -3.99% |   FAST   |
|   C64   |      I64      |      2^20      |   0.201   | 706.347 us |       0.55% | 685.960 us |       0.55% |  -20.387 us |  -2.89% |   FAST   |
|   C64   |      I64      |      2^24      |   0.201   |  10.639 ms |       0.28% |  10.685 ms |       0.28% |   46.559 us |   0.44% |   SLOW   |
|   C64   |      I64      |      2^28      |   0.201   | 175.164 ms |       0.06% | 176.671 ms |       0.05% |    1.508 ms |   0.86% |   SLOW   |

We see almost exclusively improvements (up to about 18%), especially for small problem sizes, with a maximum regression of about 2%.

Addresses part of #3115

@bernhardmgruber changed the title from "Use fast dependent launch in CUB merge sort" to "Use programmatic dependent launch in CUB merge sort" on Dec 10, 2024
@bernhardmgruber force-pushed the fdl branch 3 times, most recently from b863f58 to bd24cd7, on December 10, 2024 18:07
@NVIDIA deleted a comment from copy-pr-bot bot on Dec 10, 2024
@bernhardmgruber
Contributor Author

/ok to test


🟨 CI finished in 2h 25m: Pass: 85%/94 | Total: 2d 14h | Avg: 39m 46s | Max: 1h 23m | Hits: 47%/9706
  • 🟨 thrust: Pass: 84%/46 | Total: 23h 53m | Avg: 31m 10s | Max: 1h 15m | Hits: 53%/7408

    🔍 cpu: amd64 🔍
      🔍 amd64              Pass:  84%/44  | Total: 22h 47m | Avg: 31m 05s | Max:  1h 15m | Hits:  53%/7408  
      🟩 arm64              Pass: 100%/2   | Total:  1h 06m | Avg: 33m 03s | Max: 35m 29s
    🚨 ctk: 11.1 🚨
      🔥 11.1               Pass:   0%/7   | Total:  1h 08m | Avg:  9m 44s | Max: 48m 16s
      🟩 12.5               Pass: 100%/2   | Total:  2h 17m | Avg:  1h 08m | Max:  1h 13m
      🟩 12.6               Pass: 100%/37  | Total: 20h 27m | Avg: 33m 11s | Max:  1h 15m | Hits:  53%/7408  
    🚨 cudacxx: nvcc11.1 🚨
      🟩 ClangCUDA18        Pass: 100%/2   | Total: 57m 43s | Avg: 28m 51s | Max: 29m 05s
      🔥 nvcc11.1           Pass:   0%/7   | Total:  1h 08m | Avg:  9m 44s | Max: 48m 16s
      🟩 nvcc12.5           Pass: 100%/2   | Total:  2h 17m | Avg:  1h 08m | Max:  1h 13m
      🟩 nvcc12.6           Pass: 100%/35  | Total: 19h 30m | Avg: 33m 26s | Max:  1h 15m | Hits:  53%/7408  
    🔍 cudacxx_family: nvcc 🔍
      🟩 ClangCUDA          Pass: 100%/2   | Total: 57m 43s | Avg: 28m 51s | Max: 29m 05s
      🔍 nvcc               Pass:  84%/44  | Total: 22h 56m | Avg: 31m 16s | Max:  1h 15m | Hits:  53%/7408  
    🔍 jobs: Build 🔍
      🔍 Build              Pass:  82%/40  | Total: 22h 33m | Avg: 33m 49s | Max:  1h 15m | Hits:  38%/5556  
      🟩 TestCPU            Pass: 100%/3   | Total: 39m 09s | Avg: 13m 03s | Max: 23m 14s | Hits:  99%/1852  
      🟩 TestGPU            Pass: 100%/3   | Total: 41m 40s | Avg: 13m 53s | Max: 16m 52s
    🟨 cxx
      🟨 Clang9             Pass:  50%/4   | Total:  1h 04m | Avg: 16m 14s | Max: 31m 15s
      🟩 Clang10            Pass: 100%/1   | Total: 38m 03s | Avg: 38m 03s | Max: 38m 03s
      🟩 Clang11            Pass: 100%/1   | Total: 33m 45s | Avg: 33m 45s | Max: 33m 45s
      🟩 Clang12            Pass: 100%/1   | Total: 31m 50s | Avg: 31m 50s | Max: 31m 50s
      🟩 Clang13            Pass: 100%/1   | Total: 32m 55s | Avg: 32m 55s | Max: 32m 55s
      🟩 Clang14            Pass: 100%/1   | Total: 31m 38s | Avg: 31m 38s | Max: 31m 38s
      🟩 Clang15            Pass: 100%/1   | Total: 36m 52s | Avg: 36m 52s | Max: 36m 52s
      🟩 Clang16            Pass: 100%/1   | Total: 35m 51s | Avg: 35m 51s | Max: 35m 51s
      🟩 Clang17            Pass: 100%/1   | Total: 34m 38s | Avg: 34m 38s | Max: 34m 38s
      🟩 Clang18            Pass: 100%/7   | Total:  2h 58m | Avg: 25m 30s | Max: 33m 32s
      🟥 GCC6               Pass:   0%/2   | Total:  6m 28s | Avg:  3m 14s | Max:  3m 17s
      🟩 GCC7               Pass: 100%/2   | Total: 56m 56s | Avg: 28m 28s | Max: 32m 03s
      🟩 GCC8               Pass: 100%/1   | Total: 32m 34s | Avg: 32m 34s | Max: 32m 34s
      🟨 GCC9               Pass:  33%/3   | Total: 40m 39s | Avg: 13m 33s | Max: 34m 11s
      🟩 GCC10              Pass: 100%/1   | Total: 38m 22s | Avg: 38m 22s | Max: 38m 22s
      🟩 GCC11              Pass: 100%/1   | Total: 36m 30s | Avg: 36m 30s | Max: 36m 30s
      🟩 GCC12              Pass: 100%/1   | Total: 36m 54s | Avg: 36m 54s | Max: 36m 54s
      🟩 GCC13              Pass: 100%/8   | Total:  3h 17m | Avg: 24m 37s | Max: 38m 15s
      🟩 Intel2023.2.0      Pass: 100%/1   | Total: 45m 59s | Avg: 45m 59s | Max: 45m 59s
      🟥 MSVC14.16          Pass:   0%/1   | Total: 48m 16s | Avg: 48m 16s | Max: 48m 16s
      🟩 MSVC14.29          Pass: 100%/1   | Total:  1h 10m | Avg:  1h 10m | Max:  1h 10m | Hits:  39%/1852  
      🟩 MSVC14.39          Pass: 100%/3   | Total:  2h 46m | Avg: 55m 33s | Max:  1h 15m | Hits:  58%/5556  
      🟩 NVHPC24.7          Pass: 100%/2   | Total:  2h 17m | Avg:  1h 08m | Max:  1h 13m
    🟨 cxx_family
      🟨 Clang              Pass:  89%/19  | Total:  8h 39m | Avg: 27m 18s | Max: 38m 03s
      🟨 GCC                Pass:  78%/19  | Total:  7h 25m | Avg: 23m 26s | Max: 38m 22s
      🟩 Intel              Pass: 100%/1   | Total: 45m 59s | Avg: 45m 59s | Max: 45m 59s
      🟨 MSVC               Pass:  80%/5   | Total:  4h 45m | Avg: 57m 08s | Max:  1h 15m | Hits:  53%/7408  
      🟩 NVHPC              Pass: 100%/2   | Total:  2h 17m | Avg:  1h 08m | Max:  1h 13m
    🟨 std
      🟨 11                 Pass:  40%/5   | Total:  1h 01m | Avg: 12m 14s | Max: 26m 44s
      🟨 14                 Pass:  50%/4   | Total:  1h 54m | Avg: 28m 42s | Max: 48m 16s
      🟨 17                 Pass:  83%/12  | Total:  7h 38m | Avg: 38m 12s | Max:  1h 10m | Hits:  40%/3704  
      🟩 20                 Pass: 100%/23  | Total: 12h 34m | Avg: 32m 47s | Max:  1h 15m | Hits:  66%/3704  
    🟨 gpu
      🟨 v100               Pass:  84%/46  | Total: 23h 53m | Avg: 31m 10s | Max:  1h 15m | Hits:  53%/7408  
    🟩 cmake_options
      🟩 -DTHRUST_DISPATCH_TYPE=Force32bit Pass: 100%/2   | Total: 45m 01s | Avg: 22m 30s | Max: 33m 13s
    🟩 sm
      🟩 90a                Pass: 100%/1   | Total: 20m 42s | Avg: 20m 42s | Max: 20m 42s
    
  • 🟨 cub: Pass: 84%/45 | Total: 1d 13h | Avg: 50m 21s | Max: 1h 23m | Hits: 29%/2298

    🔍 cpu: amd64 🔍
      🔍 amd64              Pass:  83%/43  | Total:  1d 11h | Avg: 49m 53s | Max:  1h 23m | Hits:  29%/2298  
      🟩 arm64              Pass: 100%/2   | Total:  2h 00m | Avg:  1h 00m | Max:  1h 01m
    🚨 ctk: 11.1 🚨
      🔥 11.1               Pass:   0%/7   | Total:  3h 37m | Avg: 31m 01s | Max: 52m 13s
      🟩 12.5               Pass: 100%/2   | Total:  2h 22m | Avg:  1h 11m | Max:  1h 15m
      🟩 12.6               Pass: 100%/36  | Total:  1d 07h | Avg: 52m 57s | Max:  1h 23m | Hits:  29%/2298  
    🚨 cudacxx: nvcc11.1 🚨
      🟩 ClangCUDA18        Pass: 100%/2   | Total:  1h 55m | Avg: 57m 59s | Max: 59m 59s
      🔥 nvcc11.1           Pass:   0%/7   | Total:  3h 37m | Avg: 31m 01s | Max: 52m 13s
      🟩 nvcc12.5           Pass: 100%/2   | Total:  2h 22m | Avg:  1h 11m | Max:  1h 15m
      🟩 nvcc12.6           Pass: 100%/34  | Total:  1d 05h | Avg: 52m 40s | Max:  1h 23m | Hits:  29%/2298  
    🔍 cudacxx_family: nvcc 🔍
      🟩 ClangCUDA          Pass: 100%/2   | Total:  1h 55m | Avg: 57m 59s | Max: 59m 59s
      🔍 nvcc               Pass:  83%/43  | Total:  1d 11h | Avg: 50m 00s | Max:  1h 23m | Hits:  29%/2298  
    🔍 jobs: Build 🔍
      🔍 Build              Pass:  82%/39  | Total:  1d 10h | Avg: 53m 32s | Max:  1h 15m | Hits:  29%/2298  
      🟩 DeviceLaunch       Pass: 100%/1   | Total: 25m 17s | Avg: 25m 17s | Max: 25m 17s
      🟩 GraphCapture       Pass: 100%/1   | Total: 13m 57s | Avg: 13m 57s | Max: 13m 57s
      🟩 HostLaunch         Pass: 100%/2   | Total: 35m 04s | Avg: 17m 32s | Max: 18m 17s
      🟩 TestGPU            Pass: 100%/2   | Total:  1h 43m | Avg: 51m 44s | Max:  1h 23m
    🟨 cxx
      🟨 Clang9             Pass:  50%/4   | Total:  2h 49m | Avg: 42m 21s | Max:  1h 00m
      🟩 Clang10            Pass: 100%/1   | Total: 54m 15s | Avg: 54m 15s | Max: 54m 15s
      🟩 Clang11            Pass: 100%/1   | Total: 59m 23s | Avg: 59m 23s | Max: 59m 23s
      🟩 Clang12            Pass: 100%/1   | Total:  1h 00m | Avg:  1h 00m | Max:  1h 00m
      🟩 Clang13            Pass: 100%/1   | Total: 59m 57s | Avg: 59m 57s | Max: 59m 57s
      🟩 Clang14            Pass: 100%/1   | Total: 56m 00s | Avg: 56m 00s | Max: 56m 00s
      🟩 Clang15            Pass: 100%/1   | Total: 54m 02s | Avg: 54m 02s | Max: 54m 02s
      🟩 Clang16            Pass: 100%/1   | Total: 57m 46s | Avg: 57m 46s | Max: 57m 46s
      🟩 Clang17            Pass: 100%/1   | Total: 54m 34s | Avg: 54m 34s | Max: 54m 34s
      🟩 Clang18            Pass: 100%/7   | Total:  5h 27m | Avg: 46m 50s | Max: 59m 59s
      🟥 GCC6               Pass:   0%/2   | Total: 56m 02s | Avg: 28m 01s | Max: 28m 49s
      🟩 GCC7               Pass: 100%/2   | Total:  1h 46m | Avg: 53m 15s | Max: 53m 53s
      🟩 GCC8               Pass: 100%/1   | Total: 52m 58s | Avg: 52m 58s | Max: 52m 58s
      🟨 GCC9               Pass:  33%/3   | Total:  1h 52m | Avg: 37m 37s | Max: 57m 46s
      🟩 GCC10              Pass: 100%/1   | Total:  1h 01m | Avg:  1h 01m | Max:  1h 01m
      🟩 GCC11              Pass: 100%/1   | Total: 55m 43s | Avg: 55m 43s | Max: 55m 43s
      🟩 GCC12              Pass: 100%/1   | Total:  1h 01m | Avg:  1h 01m | Max:  1h 01m
      🟩 GCC13              Pass: 100%/8   | Total:  5h 51m | Avg: 43m 52s | Max:  1h 23m
      🟩 Intel2023.2.0      Pass: 100%/1   | Total:  1h 04m | Avg:  1h 04m | Max:  1h 04m
      🟥 MSVC14.16          Pass:   0%/1   | Total: 52m 13s | Avg: 52m 13s | Max: 52m 13s
      🟩 MSVC14.29          Pass: 100%/1   | Total:  1h 04m | Avg:  1h 04m | Max:  1h 04m | Hits:  42%/766   
      🟩 MSVC14.39          Pass: 100%/2   | Total:  2h 10m | Avg:  1h 05m | Max:  1h 07m | Hits:  22%/1532  
      🟩 NVHPC24.7          Pass: 100%/2   | Total:  2h 22m | Avg:  1h 11m | Max:  1h 15m
    🟨 cxx_family
      🟨 Clang              Pass:  89%/19  | Total: 15h 53m | Avg: 50m 12s | Max:  1h 00m
      🟨 GCC                Pass:  78%/19  | Total: 14h 18m | Avg: 45m 10s | Max:  1h 23m
      🟩 Intel              Pass: 100%/1   | Total:  1h 04m | Avg:  1h 04m | Max:  1h 04m
      🟨 MSVC               Pass:  75%/4   | Total:  4h 07m | Avg:  1h 01m | Max:  1h 07m | Hits:  29%/2298  
      🟩 NVHPC              Pass: 100%/2   | Total:  2h 22m | Avg:  1h 11m | Max:  1h 15m
    🟨 std
      🟨 11                 Pass:  40%/5   | Total:  3h 13m | Avg: 38m 39s | Max: 55m 25s
      🟨 14                 Pass:  50%/4   | Total:  3h 12m | Avg: 48m 04s | Max:  1h 00m
      🟨 17                 Pass:  83%/12  | Total: 11h 00m | Avg: 55m 03s | Max:  1h 15m | Hits:  42%/1532  
      🟩 20                 Pass: 100%/24  | Total: 20h 19m | Avg: 50m 49s | Max:  1h 23m | Hits:   3%/766   
    🟨 gpu
      🟨 v100               Pass:  84%/45  | Total:  1d 13h | Avg: 50m 21s | Max:  1h 23m | Hits:  29%/2298  
    🟩 sm
      🟩 90a                Pass: 100%/1   | Total: 28m 20s | Avg: 28m 20s | Max: 28m 20s
    
  • 🟩 cccl_c_parallel: Pass: 100%/2 | Total: 9m 02s | Avg: 4m 31s | Max: 6m 46s

    🟩 cpu
      🟩 amd64              Pass: 100%/2   | Total:  9m 02s | Avg:  4m 31s | Max:  6m 46s
    🟩 ctk
      🟩 12.6               Pass: 100%/2   | Total:  9m 02s | Avg:  4m 31s | Max:  6m 46s
    🟩 cudacxx
      🟩 nvcc12.6           Pass: 100%/2   | Total:  9m 02s | Avg:  4m 31s | Max:  6m 46s
    🟩 cudacxx_family
      🟩 nvcc               Pass: 100%/2   | Total:  9m 02s | Avg:  4m 31s | Max:  6m 46s
    🟩 cxx
      🟩 GCC13              Pass: 100%/2   | Total:  9m 02s | Avg:  4m 31s | Max:  6m 46s
    🟩 cxx_family
      🟩 GCC                Pass: 100%/2   | Total:  9m 02s | Avg:  4m 31s | Max:  6m 46s
    🟩 gpu
      🟩 v100               Pass: 100%/2   | Total:  9m 02s | Avg:  4m 31s | Max:  6m 46s
    🟩 jobs
      🟩 Build              Pass: 100%/1   | Total:  2m 16s | Avg:  2m 16s | Max:  2m 16s
      🟩 Test               Pass: 100%/1   | Total:  6m 46s | Avg:  6m 46s | Max:  6m 46s
    
  • 🟩 python: Pass: 100%/1 | Total: 29m 11s | Avg: 29m 11s | Max: 29m 11s

    🟩 cpu
      🟩 amd64              Pass: 100%/1   | Total: 29m 11s | Avg: 29m 11s | Max: 29m 11s
    🟩 ctk
      🟩 12.6               Pass: 100%/1   | Total: 29m 11s | Avg: 29m 11s | Max: 29m 11s
    🟩 cudacxx
      🟩 nvcc12.6           Pass: 100%/1   | Total: 29m 11s | Avg: 29m 11s | Max: 29m 11s
    🟩 cudacxx_family
      🟩 nvcc               Pass: 100%/1   | Total: 29m 11s | Avg: 29m 11s | Max: 29m 11s
    🟩 cxx
      🟩 GCC13              Pass: 100%/1   | Total: 29m 11s | Avg: 29m 11s | Max: 29m 11s
    🟩 cxx_family
      🟩 GCC                Pass: 100%/1   | Total: 29m 11s | Avg: 29m 11s | Max: 29m 11s
    🟩 gpu
      🟩 v100               Pass: 100%/1   | Total: 29m 11s | Avg: 29m 11s | Max: 29m 11s
    🟩 jobs
      🟩 Test               Pass: 100%/1   | Total: 29m 11s | Avg: 29m 11s | Max: 29m 11s
    

👃 Inspect Changes

Modifications in project?

Project
CCCL Infrastructure
libcu++
+/- CUB
+/- Thrust
CUDA Experimental
python
CCCL C Parallel Library
Catch2Helper

Modifications in project or dependencies?

Project
CCCL Infrastructure
libcu++
+/- CUB
+/- Thrust
CUDA Experimental
+/- python
+/- CCCL C Parallel Library
+/- Catch2Helper

🏃‍ Runner counts (total jobs: 94)

# Runner
70 linux-amd64-cpu16
11 linux-amd64-gpu-v100-latest-1
9 windows-amd64-cpu16
4 linux-arm64-cpu16


copy-pr-bot bot commented Dec 11, 2024

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@bernhardmgruber
Contributor Author

/ok to test

Contributor

🟩 CI finished in 1h 37m: Pass: 100%/94 | Total: 2d 13h | Avg: 39m 17s | Max: 1h 08m | Hits: 62%/12324
  • 🟩 thrust: Pass: 100%/46 | Total: 23h 48m | Avg: 31m 02s | Max: 1h 08m | Hits: 69%/9260

    🟩 cmake_options
      🟩 -DTHRUST_DISPATCH_TYPE=Force32bit Pass: 100%/2   | Total: 36m 50s | Avg: 18m 25s | Max: 25m 11s
    🟩 cpu
      🟩 amd64              Pass: 100%/44  | Total: 22h 51m | Avg: 31m 10s | Max:  1h 08m | Hits:  69%/9260  
      🟩 arm64              Pass: 100%/2   | Total: 56m 19s | Avg: 28m 09s | Max: 30m 31s
    🟩 ctk
      🟩 11.1               Pass: 100%/7   | Total:  3h 36m | Avg: 30m 55s | Max: 57m 14s | Hits:  62%/1852  
      🟩 12.5               Pass: 100%/2   | Total:  1h 52m | Avg: 56m 15s | Max:  1h 01m
      🟩 12.6               Pass: 100%/37  | Total: 18h 19m | Avg: 29m 42s | Max:  1h 08m | Hits:  71%/7408  
    🟩 cudacxx
      🟩 ClangCUDA18        Pass: 100%/2   | Total: 48m 20s | Avg: 24m 10s | Max: 24m 35s
      🟩 nvcc11.1           Pass: 100%/7   | Total:  3h 36m | Avg: 30m 55s | Max: 57m 14s | Hits:  62%/1852  
      🟩 nvcc12.5           Pass: 100%/2   | Total:  1h 52m | Avg: 56m 15s | Max:  1h 01m
      🟩 nvcc12.6           Pass: 100%/35  | Total: 17h 31m | Avg: 30m 01s | Max:  1h 08m | Hits:  71%/7408  
    🟩 cudacxx_family
      🟩 ClangCUDA          Pass: 100%/2   | Total: 48m 20s | Avg: 24m 10s | Max: 24m 35s
      🟩 nvcc               Pass: 100%/44  | Total: 22h 59m | Avg: 31m 21s | Max:  1h 08m | Hits:  69%/9260  
    🟩 cxx
      🟩 Clang9             Pass: 100%/4   | Total:  1h 53m | Avg: 28m 21s | Max: 35m 24s
      🟩 Clang10            Pass: 100%/1   | Total: 35m 44s | Avg: 35m 44s | Max: 35m 44s
      🟩 Clang11            Pass: 100%/1   | Total: 30m 29s | Avg: 30m 29s | Max: 30m 29s
      🟩 Clang12            Pass: 100%/1   | Total: 33m 30s | Avg: 33m 30s | Max: 33m 30s
      🟩 Clang13            Pass: 100%/1   | Total: 30m 42s | Avg: 30m 42s | Max: 30m 42s
      🟩 Clang14            Pass: 100%/1   | Total: 29m 33s | Avg: 29m 33s | Max: 29m 33s
      🟩 Clang15            Pass: 100%/1   | Total: 32m 26s | Avg: 32m 26s | Max: 32m 26s
      🟩 Clang16            Pass: 100%/1   | Total: 31m 22s | Avg: 31m 22s | Max: 31m 22s
      🟩 Clang17            Pass: 100%/1   | Total: 31m 43s | Avg: 31m 43s | Max: 31m 43s
      🟩 Clang18            Pass: 100%/7   | Total:  2h 41m | Avg: 23m 04s | Max: 31m 22s
      🟩 GCC6               Pass: 100%/2   | Total: 52m 21s | Avg: 26m 10s | Max: 30m 52s
      🟩 GCC7               Pass: 100%/2   | Total: 50m 08s | Avg: 25m 04s | Max: 28m 15s
      🟩 GCC8               Pass: 100%/1   | Total: 30m 31s | Avg: 30m 31s | Max: 30m 31s
      🟩 GCC9               Pass: 100%/3   | Total:  1h 24m | Avg: 28m 05s | Max: 30m 56s
      🟩 GCC10              Pass: 100%/1   | Total: 28m 57s | Avg: 28m 57s | Max: 28m 57s
      🟩 GCC11              Pass: 100%/1   | Total: 33m 00s | Avg: 33m 00s | Max: 33m 00s
      🟩 GCC12              Pass: 100%/1   | Total: 30m 08s | Avg: 30m 08s | Max: 30m 08s
      🟩 GCC13              Pass: 100%/8   | Total:  2h 47m | Avg: 20m 56s | Max: 32m 36s
      🟩 Intel2023.2.0      Pass: 100%/1   | Total: 40m 55s | Avg: 40m 55s | Max: 40m 55s
      🟩 MSVC14.16          Pass: 100%/1   | Total: 57m 14s | Avg: 57m 14s | Max: 57m 14s | Hits:  62%/1852  
      🟩 MSVC14.29          Pass: 100%/1   | Total: 57m 58s | Avg: 57m 58s | Max: 57m 58s | Hits:  62%/1852  
      🟩 MSVC14.39          Pass: 100%/3   | Total:  2h 32m | Avg: 50m 46s | Max:  1h 08m | Hits:  74%/5556  
      🟩 NVHPC24.7          Pass: 100%/2   | Total:  1h 52m | Avg: 56m 15s | Max:  1h 01m
    🟩 cxx_family
      🟩 Clang              Pass: 100%/19  | Total:  8h 50m | Avg: 27m 54s | Max: 35m 44s
      🟩 GCC                Pass: 100%/19  | Total:  7h 56m | Avg: 25m 06s | Max: 33m 00s
      🟩 Intel              Pass: 100%/1   | Total: 40m 55s | Avg: 40m 55s | Max: 40m 55s
      🟩 MSVC               Pass: 100%/5   | Total:  4h 27m | Avg: 53m 30s | Max:  1h 08m | Hits:  69%/9260  
      🟩 NVHPC              Pass: 100%/2   | Total:  1h 52m | Avg: 56m 15s | Max:  1h 01m
    🟩 gpu
      🟩 v100               Pass: 100%/46  | Total: 23h 48m | Avg: 31m 02s | Max:  1h 08m | Hits:  69%/9260  
    🟩 jobs
      🟩 Build              Pass: 100%/40  | Total: 22h 27m | Avg: 33m 41s | Max:  1h 08m | Hits:  62%/7408  
      🟩 TestCPU            Pass: 100%/3   | Total: 38m 56s | Avg: 12m 58s | Max: 24m 15s | Hits:  99%/1852  
      🟩 TestGPU            Pass: 100%/3   | Total: 42m 00s | Avg: 14m 00s | Max: 18m 18s
    🟩 sm
      🟩 90a                Pass: 100%/1   | Total: 18m 35s | Avg: 18m 35s | Max: 18m 35s
    🟩 std
      🟩 11                 Pass: 100%/5   | Total:  1h 54m | Avg: 22m 56s | Max: 25m 08s
      🟩 14                 Pass: 100%/4   | Total:  2h 31m | Avg: 37m 56s | Max: 57m 14s | Hits:  62%/1852  
      🟩 17                 Pass: 100%/12  | Total:  7h 29m | Avg: 37m 27s | Max: 59m 12s | Hits:  62%/3704  
      🟩 20                 Pass: 100%/23  | Total: 11h 15m | Avg: 29m 22s | Max:  1h 08m | Hits:  80%/3704  
    
  • 🟩 cub: Pass: 100%/45 | Total: 1d 13h | Avg: 49m 26s | Max: 1h 07m | Hits: 42%/3064

    🟩 cpu
      🟩 amd64              Pass: 100%/43  | Total:  1d 11h | Avg: 49m 08s | Max:  1h 07m | Hits:  42%/3064  
      🟩 arm64              Pass: 100%/2   | Total:  1h 51m | Avg: 55m 51s | Max: 57m 58s
    🟩 ctk
      🟩 11.1               Pass: 100%/7   | Total:  5h 30m | Avg: 47m 13s | Max: 56m 55s | Hits:  42%/766   
      🟩 12.5               Pass: 100%/2   | Total:  2h 02m | Avg:  1h 01m | Max:  1h 01m
      🟩 12.6               Pass: 100%/36  | Total:  1d 05h | Avg: 49m 11s | Max:  1h 07m | Hits:  42%/2298  
    🟩 cudacxx
      🟩 ClangCUDA18        Pass: 100%/2   | Total:  1h 54m | Avg: 57m 17s | Max: 57m 27s
      🟩 nvcc11.1           Pass: 100%/7   | Total:  5h 30m | Avg: 47m 13s | Max: 56m 55s | Hits:  42%/766   
      🟩 nvcc12.5           Pass: 100%/2   | Total:  2h 02m | Avg:  1h 01m | Max:  1h 01m
      🟩 nvcc12.6           Pass: 100%/34  | Total:  1d 03h | Avg: 48m 43s | Max:  1h 07m | Hits:  42%/2298  
    🟩 cudacxx_family
      🟩 ClangCUDA          Pass: 100%/2   | Total:  1h 54m | Avg: 57m 17s | Max: 57m 27s
      🟩 nvcc               Pass: 100%/43  | Total:  1d 11h | Avg: 49m 04s | Max:  1h 07m | Hits:  42%/3064  
    🟩 cxx
      🟩 Clang9             Pass: 100%/4   | Total:  3h 14m | Avg: 48m 39s | Max: 54m 25s
      🟩 Clang10            Pass: 100%/1   | Total: 54m 20s | Avg: 54m 20s | Max: 54m 20s
      🟩 Clang11            Pass: 100%/1   | Total: 51m 28s | Avg: 51m 28s | Max: 51m 28s
      🟩 Clang12            Pass: 100%/1   | Total: 57m 04s | Avg: 57m 04s | Max: 57m 04s
      🟩 Clang13            Pass: 100%/1   | Total: 57m 41s | Avg: 57m 41s | Max: 57m 41s
      🟩 Clang14            Pass: 100%/1   | Total: 56m 28s | Avg: 56m 28s | Max: 56m 28s
      🟩 Clang15            Pass: 100%/1   | Total: 51m 00s | Avg: 51m 00s | Max: 51m 00s
      🟩 Clang16            Pass: 100%/1   | Total: 53m 31s | Avg: 53m 31s | Max: 53m 31s
      🟩 Clang17            Pass: 100%/1   | Total: 52m 29s | Avg: 52m 29s | Max: 52m 29s
      🟩 Clang18            Pass: 100%/7   | Total:  5h 15m | Avg: 45m 04s | Max: 57m 27s
      🟩 GCC6               Pass: 100%/2   | Total:  1h 31m | Avg: 45m 42s | Max: 47m 24s
      🟩 GCC7               Pass: 100%/2   | Total:  1h 49m | Avg: 54m 53s | Max: 55m 59s
      🟩 GCC8               Pass: 100%/1   | Total: 52m 04s | Avg: 52m 04s | Max: 52m 04s
      🟩 GCC9               Pass: 100%/3   | Total:  2h 27m | Avg: 49m 00s | Max: 53m 33s
      🟩 GCC10              Pass: 100%/1   | Total: 54m 13s | Avg: 54m 13s | Max: 54m 13s
      🟩 GCC11              Pass: 100%/1   | Total: 55m 37s | Avg: 55m 37s | Max: 55m 37s
      🟩 GCC12              Pass: 100%/1   | Total: 57m 35s | Avg: 57m 35s | Max: 57m 35s
      🟩 GCC13              Pass: 100%/8   | Total:  4h 34m | Avg: 34m 20s | Max: 57m 58s
      🟩 Intel2023.2.0      Pass: 100%/1   | Total: 57m 29s | Avg: 57m 29s | Max: 57m 29s
      🟩 MSVC14.16          Pass: 100%/1   | Total: 56m 55s | Avg: 56m 55s | Max: 56m 55s | Hits:  42%/766   
      🟩 MSVC14.29          Pass: 100%/1   | Total:  1h 06m | Avg:  1h 06m | Max:  1h 06m | Hits:  42%/766   
      🟩 MSVC14.39          Pass: 100%/2   | Total:  2h 13m | Avg:  1h 06m | Max:  1h 07m | Hits:  41%/1532  
      🟩 NVHPC24.7          Pass: 100%/2   | Total:  2h 02m | Avg:  1h 01m | Max:  1h 01m
    🟩 cxx_family
      🟩 Clang              Pass: 100%/19  | Total: 15h 44m | Avg: 49m 41s | Max: 57m 41s
      🟩 GCC                Pass: 100%/19  | Total: 14h 02m | Avg: 44m 20s | Max: 57m 58s
      🟩 Intel              Pass: 100%/1   | Total: 57m 29s | Avg: 57m 29s | Max: 57m 29s
      🟩 MSVC               Pass: 100%/4   | Total:  4h 17m | Avg:  1h 04m | Max:  1h 07m | Hits:  42%/3064  
      🟩 NVHPC              Pass: 100%/2   | Total:  2h 02m | Avg:  1h 01m | Max:  1h 01m
    🟩 gpu
      🟩 v100               Pass: 100%/45  | Total:  1d 13h | Avg: 49m 26s | Max:  1h 07m | Hits:  42%/3064  
    🟩 jobs
      🟩 Build              Pass: 100%/39  | Total:  1d 10h | Avg: 53m 49s | Max:  1h 07m | Hits:  42%/3064  
      🟩 DeviceLaunch       Pass: 100%/1   | Total: 17m 37s | Avg: 17m 37s | Max: 17m 37s
      🟩 GraphCapture       Pass: 100%/1   | Total: 14m 30s | Avg: 14m 30s | Max: 14m 30s
      🟩 HostLaunch         Pass: 100%/2   | Total: 45m 36s | Avg: 22m 48s | Max: 26m 07s
      🟩 TestGPU            Pass: 100%/2   | Total: 47m 32s | Avg: 23m 46s | Max: 27m 20s
    🟩 sm
      🟩 90a                Pass: 100%/1   | Total: 22m 33s | Avg: 22m 33s | Max: 22m 33s
    🟩 std
      🟩 11                 Pass: 100%/5   | Total:  4h 03m | Avg: 48m 42s | Max: 55m 59s
      🟩 14                 Pass: 100%/4   | Total:  3h 29m | Avg: 52m 23s | Max: 56m 55s | Hits:  42%/766   
      🟩 17                 Pass: 100%/12  | Total: 11h 08m | Avg: 55m 43s | Max:  1h 07m | Hits:  42%/1532  
      🟩 20                 Pass: 100%/24  | Total: 18h 22m | Avg: 45m 57s | Max:  1h 06m | Hits:  41%/766   
    
  • 🟩 cccl_c_parallel: Pass: 100%/2 | Total: 10m 10s | Avg: 5m 05s | Max: 8m 06s

    🟩 cpu
      🟩 amd64              Pass: 100%/2   | Total: 10m 10s | Avg:  5m 05s | Max:  8m 06s
    🟩 ctk
      🟩 12.6               Pass: 100%/2   | Total: 10m 10s | Avg:  5m 05s | Max:  8m 06s
    🟩 cudacxx
      🟩 nvcc12.6           Pass: 100%/2   | Total: 10m 10s | Avg:  5m 05s | Max:  8m 06s
    🟩 cudacxx_family
      🟩 nvcc               Pass: 100%/2   | Total: 10m 10s | Avg:  5m 05s | Max:  8m 06s
    🟩 cxx
      🟩 GCC13              Pass: 100%/2   | Total: 10m 10s | Avg:  5m 05s | Max:  8m 06s
    🟩 cxx_family
      🟩 GCC                Pass: 100%/2   | Total: 10m 10s | Avg:  5m 05s | Max:  8m 06s
    🟩 gpu
      🟩 v100               Pass: 100%/2   | Total: 10m 10s | Avg:  5m 05s | Max:  8m 06s
    🟩 jobs
      🟩 Build              Pass: 100%/1   | Total:  2m 04s | Avg:  2m 04s | Max:  2m 04s
      🟩 Test               Pass: 100%/1   | Total:  8m 06s | Avg:  8m 06s | Max:  8m 06s
    
  • 🟩 python: Pass: 100%/1 | Total: 31m 00s | Avg: 31m 00s | Max: 31m 00s

    🟩 cpu
      🟩 amd64              Pass: 100%/1   | Total: 31m 00s | Avg: 31m 00s | Max: 31m 00s
    🟩 ctk
      🟩 12.6               Pass: 100%/1   | Total: 31m 00s | Avg: 31m 00s | Max: 31m 00s
    🟩 cudacxx
      🟩 nvcc12.6           Pass: 100%/1   | Total: 31m 00s | Avg: 31m 00s | Max: 31m 00s
    🟩 cudacxx_family
      🟩 nvcc               Pass: 100%/1   | Total: 31m 00s | Avg: 31m 00s | Max: 31m 00s
    🟩 cxx
      🟩 GCC13              Pass: 100%/1   | Total: 31m 00s | Avg: 31m 00s | Max: 31m 00s
    🟩 cxx_family
      🟩 GCC                Pass: 100%/1   | Total: 31m 00s | Avg: 31m 00s | Max: 31m 00s
    🟩 gpu
      🟩 v100               Pass: 100%/1   | Total: 31m 00s | Avg: 31m 00s | Max: 31m 00s
    🟩 jobs
      🟩 Test               Pass: 100%/1   | Total: 31m 00s | Avg: 31m 00s | Max: 31m 00s
    


@bernhardmgruber bernhardmgruber marked this pull request as ready for review December 11, 2024 11:14
@bernhardmgruber bernhardmgruber requested review from a team as code owners December 11, 2024 11:14
cub/benchmarks/bench/merge_sort/keys.cu
// True, when programmatic dependent launch is available, otherwise false.
#define _THRUST_HAS_PDL _CCCL_CUDACC_AT_LEAST(11, 8)
#if _THRUST_HAS_PDL
// Waits for the previous kernel to complete (when it reaches its final membar). Should be put before the first global
// memory access in a kernel.
# define _THRUST_PDL_GRID_DEPENDENCY_SYNC() NV_IF_TARGET(NV_PROVIDES_SM_90, cudaGridDependencySynchronize();)
// Allows the subsequent kernel in the same stream to launch. Can be put anywhere in a kernel.
// Heuristic(ahendriksen): put it after the last load.
# define _THRUST_PDL_TRIGGER_NEXT_LAUNCH() NV_IF_TARGET(NV_PROVIDES_SM_90, cudaTriggerProgrammaticLaunchCompletion();)
#else
# define _THRUST_PDL_GRID_DEPENDENCY_SYNC()
# define _THRUST_PDL_TRIGGER_NEXT_LAUNCH()
#endif // _THRUST_HAS_PDL

Collaborator

These macros are only used within Thrust for the moment, but they are not specific to Thrust.

Should we rather move them into CCCL and name them _CCCL_PDL_MEOW?

Contributor Author

I moved them to cuda/std/__cccl/cuda_capabilities.h as discussed offline.

* Extend triple_chevron_launch to handle PDL
* Flag benchmark as synchronizing
* Add launch control APIs to merge sort kernels
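
For readers unfamiliar with PDL, the following is a minimal standalone sketch of the two halves described in the commits above: device-side launch control calls inside the kernels, and a host-side launch that opts into programmatic stream serialization. It is not the code from this PR. The kernels `producer` and `consumer` and the helper `run_pdl_demo` are hypothetical names, the sketch assumes CUDA 11.8 or newer, and it only sets the launch attribute when the device reports compute capability 9.0 or higher, which is an assumption made here for safety rather than something the PR states.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical producer kernel (not from the PR).
__global__ void producer(int* data, int n)
{
  const int i = blockIdx.x * blockDim.x + threadIdx.x;
#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ >= 900
  // This kernel performs no global loads, so the next kernel in the stream
  // may start launching right away.
  cudaTriggerProgrammaticLaunchCompletion();
#endif
  if (i < n)
  {
    data[i] = i;
  }
}

// Hypothetical consumer kernel (not from the PR).
__global__ void consumer(const int* in, int* out, int n)
{
  // Index computation touches no global memory, so it can overlap with the producer.
  const int i = blockIdx.x * blockDim.x + threadIdx.x;
#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ >= 900
  // Wait for the producer to complete before the first global memory access.
  cudaGridDependencySynchronize();
#endif
  if (i < n)
  {
    out[i] = 2 * in[i];
  }
}

// Returns 0 on success, nonzero on any CUDA error or wrong result.
int run_pdl_demo()
{
  const int n       = 1 << 12;
  const int threads = 256;
  const int blocks  = (n + threads - 1) / threads;

  int device = 0;
  int major  = 0;
  if (cudaGetDevice(&device) != cudaSuccess) return 1;
  if (cudaDeviceGetAttribute(&major, cudaDevAttrComputeCapabilityMajor, device) != cudaSuccess) return 1;

  int* in  = nullptr;
  int* out = nullptr;
  cudaStream_t stream;
  if (cudaMalloc(&in, n * sizeof(int)) != cudaSuccess) return 1;
  if (cudaMalloc(&out, n * sizeof(int)) != cudaSuccess) return 1;
  if (cudaStreamCreate(&stream) != cudaSuccess) return 1;

  producer<<<blocks, threads, 0, stream>>>(in, n);

  // Opt the dependent kernel into programmatic stream serialization.
  cudaLaunchAttribute attr{};
  attr.id                                         = cudaLaunchAttributeProgrammaticStreamSerialization;
  attr.val.programmaticStreamSerializationAllowed = 1;

  cudaLaunchConfig_t config{};
  config.gridDim          = dim3(blocks);
  config.blockDim         = dim3(threads);
  config.dynamicSmemBytes = 0;
  config.stream           = stream;
  config.attrs            = &attr;
  config.numAttrs         = (major >= 9) ? 1 : 0; // assumption: only request PDL on SM90+

  if (cudaLaunchKernelEx(&config, consumer, static_cast<const int*>(in), out, n) != cudaSuccess) return 1;

  std::vector<int> host(n);
  if (cudaMemcpyAsync(host.data(), out, n * sizeof(int), cudaMemcpyDeviceToHost, stream) != cudaSuccess) return 1;
  if (cudaStreamSynchronize(stream) != cudaSuccess) return 1;

  cudaFree(in);
  cudaFree(out);
  cudaStreamDestroy(stream);

  for (int i = 0; i < n; ++i)
  {
    if (host[i] != 2 * i) return 1;
  }
  return 0;
}
```

Calling `run_pdl_demo()` from `main` and checking for a zero return value is enough to exercise the sketch. On devices or build targets below SM90 the device-side calls are compiled out and the attribute is not set, so the two kernels simply run back-to-back in stream order.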
Contributor

🟩 CI finished in 1h 34m: Pass: 100%/168 | Total: 3d 01h | Avg: 26m 08s | Max: 1h 11m | Hits: 71%/22398
  • 🟩 libcudacxx: Pass: 100%/48 | Total: 11h 17m | Avg: 14m 06s | Max: 34m 41s | Hits: 81%/9762

    🟩 cpu
      🟩 amd64              Pass: 100%/46  | Total: 10h 34m | Avg: 13m 47s | Max: 34m 41s | Hits:  81%/9762  
      🟩 arm64              Pass: 100%/2   | Total: 43m 00s | Avg: 21m 30s | Max: 22m 41s
    🟩 ctk
      🟩 11.1               Pass: 100%/7   | Total:  1h 29m | Avg: 12m 45s | Max: 23m 09s | Hits:  97%/2226  
      🟩 12.5               Pass: 100%/2   | Total: 42m 45s | Avg: 21m 22s | Max: 33m 59s
      🟩 12.6               Pass: 100%/39  | Total:  9h 05m | Avg: 13m 58s | Max: 34m 41s | Hits:  76%/7536  
    🟩 cudacxx
      🟩 ClangCUDA18        Pass: 100%/4   | Total:  1h 01m | Avg: 15m 26s | Max: 18m 52s
      🟩 nvcc11.1           Pass: 100%/7   | Total:  1h 29m | Avg: 12m 45s | Max: 23m 09s | Hits:  97%/2226  
      🟩 nvcc12.5           Pass: 100%/2   | Total: 42m 45s | Avg: 21m 22s | Max: 33m 59s
      🟩 nvcc12.6           Pass: 100%/35  | Total:  8h 03m | Avg: 13m 48s | Max: 34m 41s | Hits:  76%/7536  
    🟩 cudacxx_family
      🟩 ClangCUDA          Pass: 100%/4   | Total:  1h 01m | Avg: 15m 26s | Max: 18m 52s
      🟩 nvcc               Pass: 100%/44  | Total: 10h 15m | Avg: 13m 59s | Max: 34m 41s | Hits:  81%/9762  
    🟩 cxx
      🟩 Clang9             Pass: 100%/4   | Total: 47m 40s | Avg: 11m 55s | Max: 23m 09s
      🟩 Clang10            Pass: 100%/1   | Total:  5m 16s | Avg:  5m 16s | Max:  5m 16s
      🟩 Clang11            Pass: 100%/1   | Total: 20m 02s | Avg: 20m 02s | Max: 20m 02s
      🟩 Clang12            Pass: 100%/1   | Total:  4m 36s | Avg:  4m 36s | Max:  4m 36s
      🟩 Clang13            Pass: 100%/1   | Total:  4m 24s | Avg:  4m 24s | Max:  4m 24s
      🟩 Clang14            Pass: 100%/1   | Total:  3m 58s | Avg:  3m 58s | Max:  3m 58s
      🟩 Clang15            Pass: 100%/1   | Total:  4m 09s | Avg:  4m 09s | Max:  4m 09s
      🟩 Clang16            Pass: 100%/1   | Total:  4m 31s | Avg:  4m 31s | Max:  4m 31s
      🟩 Clang17            Pass: 100%/1   | Total: 22m 28s | Avg: 22m 28s | Max: 22m 28s
      🟩 Clang18            Pass: 100%/8   | Total:  2h 27m | Avg: 18m 26s | Max: 23m 06s
      🟩 GCC6               Pass: 100%/2   | Total: 14m 37s | Avg:  7m 18s | Max: 11m 56s
      🟩 GCC7               Pass: 100%/2   | Total: 20m 14s | Avg: 10m 07s | Max: 16m 37s
      🟩 GCC8               Pass: 100%/1   | Total:  3m 41s | Avg:  3m 41s | Max:  3m 41s
      🟩 GCC9               Pass: 100%/3   | Total: 39m 30s | Avg: 13m 10s | Max: 21m 24s
      🟩 GCC10              Pass: 100%/1   | Total:  4m 26s | Avg:  4m 26s | Max:  4m 26s
      🟩 GCC11              Pass: 100%/1   | Total: 21m 16s | Avg: 21m 16s | Max: 21m 16s
      🟩 GCC12              Pass: 100%/1   | Total: 21m 00s | Avg: 21m 00s | Max: 21m 00s
      🟩 GCC13              Pass: 100%/10  | Total:  2h 32m | Avg: 15m 13s | Max: 30m 49s
      🟩 Intel2023.2.0      Pass: 100%/1   | Total:  5m 20s | Avg:  5m 20s | Max:  5m 20s
      🟩 MSVC14.16          Pass: 100%/1   | Total: 18m 22s | Avg: 18m 22s | Max: 18m 22s | Hits:  97%/2226  
      🟩 MSVC14.29          Pass: 100%/1   | Total: 34m 41s | Avg: 34m 41s | Max: 34m 41s | Hits:  31%/2463  
      🟩 MSVC14.39          Pass: 100%/2   | Total: 34m 17s | Avg: 17m 08s | Max: 19m 23s | Hits:  98%/5073  
      🟩 NVHPC24.7          Pass: 100%/2   | Total: 42m 45s | Avg: 21m 22s | Max: 33m 59s
    🟩 cxx_family
      🟩 Clang              Pass: 100%/20  | Total:  4h 24m | Avg: 13m 13s | Max: 23m 09s
      🟩 GCC                Pass: 100%/21  | Total:  4h 37m | Avg: 13m 11s | Max: 30m 49s
      🟩 Intel              Pass: 100%/1   | Total:  5m 20s | Avg:  5m 20s | Max:  5m 20s
      🟩 MSVC               Pass: 100%/4   | Total:  1h 27m | Avg: 21m 50s | Max: 34m 41s | Hits:  81%/9762  
      🟩 NVHPC              Pass: 100%/2   | Total: 42m 45s | Avg: 21m 22s | Max: 33m 59s
    🟩 gpu
      🟩 v100               Pass: 100%/48  | Total: 11h 17m | Avg: 14m 06s | Max: 34m 41s | Hits:  81%/9762  
    🟩 jobs
      🟩 Build              Pass: 100%/41  | Total:  8h 57m | Avg: 13m 07s | Max: 34m 41s | Hits:  81%/9762  
      🟩 NVRTC              Pass: 100%/4   | Total:  1h 37m | Avg: 24m 16s | Max: 30m 49s
      🟩 Test               Pass: 100%/2   | Total: 40m 14s | Avg: 20m 07s | Max: 20m 52s
      🟩 VerifyCodegen      Pass: 100%/1   | Total:  1m 55s | Avg:  1m 55s | Max:  1m 55s
    🟩 sm
      🟩 90                 Pass: 100%/1   | Total: 11m 42s | Avg: 11m 42s | Max: 11m 42s
      🟩 90a                Pass: 100%/2   | Total: 16m 50s | Avg:  8m 25s | Max: 13m 08s
    🟩 std
      🟩 11                 Pass: 100%/6   | Total: 53m 07s | Avg:  8m 51s | Max: 23m 09s
      🟩 14                 Pass: 100%/5   | Total:  1h 11m | Avg: 14m 14s | Max: 19m 13s | Hits:  97%/2226  
      🟩 17                 Pass: 100%/13  | Total:  3h 23m | Avg: 15m 37s | Max: 34m 41s | Hits:  65%/4926  
      🟩 20                 Pass: 100%/23  | Total:  5h 47m | Avg: 15m 07s | Max: 33m 59s | Hits:  98%/2610  
    
  • 🟩 thrust: Pass: 100%/46 | Total: 23h 40m | Avg: 30m 52s | Max: 1h 03m | Hits: 69%/9260

    🟩 cmake_options
      🟩 -DTHRUST_DISPATCH_TYPE=Force32bit Pass: 100%/2   | Total: 45m 45s | Avg: 22m 52s | Max: 32m 18s
    🟩 cpu
      🟩 amd64              Pass: 100%/44  | Total: 22h 32m | Avg: 30m 43s | Max:  1h 03m | Hits:  69%/9260  
      🟩 arm64              Pass: 100%/2   | Total:  1h 08m | Avg: 34m 06s | Max: 38m 27s
    🟩 ctk
      🟩 11.1               Pass: 100%/7   | Total:  2h 10m | Avg: 18m 39s | Max:  1h 01m | Hits:  62%/1852  
      🟩 12.5               Pass: 100%/2   | Total:  1h 42m | Avg: 51m 26s | Max: 53m 18s
      🟩 12.6               Pass: 100%/37  | Total: 19h 46m | Avg: 32m 04s | Max:  1h 03m | Hits:  71%/7408  
    🟩 cudacxx
      🟩 ClangCUDA18        Pass: 100%/2   | Total:  1h 01m | Avg: 30m 59s | Max: 31m 25s
      🟩 nvcc11.1           Pass: 100%/7   | Total:  2h 10m | Avg: 18m 39s | Max:  1h 01m | Hits:  62%/1852  
      🟩 nvcc12.5           Pass: 100%/2   | Total:  1h 42m | Avg: 51m 26s | Max: 53m 18s
      🟩 nvcc12.6           Pass: 100%/35  | Total: 18h 44m | Avg: 32m 08s | Max:  1h 03m | Hits:  71%/7408  
    🟩 cudacxx_family
      🟩 ClangCUDA          Pass: 100%/2   | Total:  1h 01m | Avg: 30m 59s | Max: 31m 25s
      🟩 nvcc               Pass: 100%/44  | Total: 22h 38m | Avg: 30m 52s | Max:  1h 03m | Hits:  69%/9260  
    🟩 cxx
      🟩 Clang9             Pass: 100%/4   | Total:  1h 53m | Avg: 28m 19s | Max: 34m 50s
      🟩 Clang10            Pass: 100%/1   | Total: 35m 10s | Avg: 35m 10s | Max: 35m 10s
      🟩 Clang11            Pass: 100%/1   | Total: 31m 39s | Avg: 31m 39s | Max: 31m 39s
      🟩 Clang12            Pass: 100%/1   | Total: 30m 29s | Avg: 30m 29s | Max: 30m 29s
      🟩 Clang13            Pass: 100%/1   | Total: 30m 23s | Avg: 30m 23s | Max: 30m 23s
      🟩 Clang14            Pass: 100%/1   | Total: 35m 04s | Avg: 35m 04s | Max: 35m 04s
      🟩 Clang15            Pass: 100%/1   | Total: 36m 25s | Avg: 36m 25s | Max: 36m 25s
      🟩 Clang16            Pass: 100%/1   | Total: 34m 37s | Avg: 34m 37s | Max: 34m 37s
      🟩 Clang17            Pass: 100%/1   | Total: 36m 04s | Avg: 36m 04s | Max: 36m 04s
      🟩 Clang18            Pass: 100%/7   | Total:  2h 58m | Avg: 25m 31s | Max: 33m 27s
      🟩 GCC6               Pass: 100%/2   | Total:  7m 45s | Avg:  3m 52s | Max:  4m 00s
      🟩 GCC7               Pass: 100%/2   | Total:  1h 00m | Avg: 30m 17s | Max: 34m 38s
      🟩 GCC8               Pass: 100%/1   | Total: 36m 38s | Avg: 36m 38s | Max: 36m 38s
      🟩 GCC9               Pass: 100%/3   | Total: 42m 56s | Avg: 14m 18s | Max: 34m 51s
      🟩 GCC10              Pass: 100%/1   | Total: 33m 02s | Avg: 33m 02s | Max: 33m 02s
      🟩 GCC11              Pass: 100%/1   | Total: 36m 36s | Avg: 36m 36s | Max: 36m 36s
      🟩 GCC12              Pass: 100%/1   | Total: 34m 49s | Avg: 34m 49s | Max: 34m 49s
      🟩 GCC13              Pass: 100%/8   | Total:  3h 20m | Avg: 25m 06s | Max: 38m 27s
      🟩 Intel2023.2.0      Pass: 100%/1   | Total: 45m 06s | Avg: 45m 06s | Max: 45m 06s
      🟩 MSVC14.16          Pass: 100%/1   | Total:  1h 01m | Avg:  1h 01m | Max:  1h 01m | Hits:  62%/1852  
      🟩 MSVC14.29          Pass: 100%/1   | Total: 54m 06s | Avg: 54m 06s | Max: 54m 06s | Hits:  62%/1852  
      🟩 MSVC14.39          Pass: 100%/3   | Total:  2h 21m | Avg: 47m 08s | Max:  1h 03m | Hits:  74%/5556  
      🟩 NVHPC24.7          Pass: 100%/2   | Total:  1h 42m | Avg: 51m 26s | Max: 53m 18s
    🟩 cxx_family
      🟩 Clang              Pass: 100%/19  | Total:  9h 21m | Avg: 29m 34s | Max: 36m 25s
      🟩 GCC                Pass: 100%/19  | Total:  7h 33m | Avg: 23m 51s | Max: 38m 27s
      🟩 Intel              Pass: 100%/1   | Total: 45m 06s | Avg: 45m 06s | Max: 45m 06s
      🟩 MSVC               Pass: 100%/5   | Total:  4h 17m | Avg: 51m 26s | Max:  1h 03m | Hits:  69%/9260  
      🟩 NVHPC              Pass: 100%/2   | Total:  1h 42m | Avg: 51m 26s | Max: 53m 18s
    🟩 gpu
      🟩 v100               Pass: 100%/46  | Total: 23h 40m | Avg: 30m 52s | Max:  1h 03m | Hits:  69%/9260  
    🟩 jobs
      🟩 Build              Pass: 100%/40  | Total: 22h 17m | Avg: 33m 26s | Max:  1h 03m | Hits:  62%/7408  
      🟩 TestCPU            Pass: 100%/3   | Total: 40m 38s | Avg: 13m 32s | Max: 24m 35s | Hits:  99%/1852  
      🟩 TestGPU            Pass: 100%/3   | Total: 42m 09s | Avg: 14m 03s | Max: 14m 36s
    🟩 sm
      🟩 90a                Pass: 100%/1   | Total: 23m 46s | Avg: 23m 46s | Max: 23m 46s
    🟩 std
      🟩 11                 Pass: 100%/5   | Total:  1h 23m | Avg: 16m 39s | Max: 25m 56s
      🟩 14                 Pass: 100%/4   | Total:  2h 15m | Avg: 33m 47s | Max:  1h 01m | Hits:  62%/1852  
      🟩 17                 Pass: 100%/12  | Total:  7h 18m | Avg: 36m 31s | Max: 54m 06s | Hits:  62%/3704  
      🟩 20                 Pass: 100%/23  | Total: 11h 57m | Avg: 31m 12s | Max:  1h 03m | Hits:  80%/3704  
    
  • 🟩 cub: Pass: 100%/45 | Total: 1d 11h | Avg: 46m 54s | Max: 1h 11m | Hits: 42%/3064

    🟩 cpu
      🟩 amd64              Pass: 100%/43  | Total:  1d 09h | Avg: 46m 28s | Max:  1h 11m | Hits:  42%/3064  
      🟩 arm64              Pass: 100%/2   | Total:  1h 52m | Avg: 56m 27s | Max: 56m 45s
    🟩 ctk
      🟩 11.1               Pass: 100%/7   | Total:  2h 48m | Avg: 24m 07s | Max:  1h 00m | Hits:  42%/766   
      🟩 12.5               Pass: 100%/2   | Total:  2h 06m | Avg:  1h 03m | Max:  1h 05m
      🟩 12.6               Pass: 100%/36  | Total:  1d 06h | Avg: 50m 25s | Max:  1h 11m | Hits:  42%/2298  
    🟩 cudacxx
      🟩 ClangCUDA18        Pass: 100%/2   | Total:  2h 04m | Avg:  1h 02m | Max:  1h 04m
      🟩 nvcc11.1           Pass: 100%/7   | Total:  2h 48m | Avg: 24m 07s | Max:  1h 00m | Hits:  42%/766   
      🟩 nvcc12.5           Pass: 100%/2   | Total:  2h 06m | Avg:  1h 03m | Max:  1h 05m
      🟩 nvcc12.6           Pass: 100%/34  | Total:  1d 04h | Avg: 49m 43s | Max:  1h 11m | Hits:  42%/2298  
    🟩 cudacxx_family
      🟩 ClangCUDA          Pass: 100%/2   | Total:  2h 04m | Avg:  1h 02m | Max:  1h 04m
      🟩 nvcc               Pass: 100%/43  | Total:  1d 09h | Avg: 46m 11s | Max:  1h 11m | Hits:  42%/3064  
    🟩 cxx
      🟩 Clang9             Pass: 100%/4   | Total:  3h 25m | Avg: 51m 28s | Max:  1h 00m
      🟩 Clang10            Pass: 100%/1   | Total: 54m 01s | Avg: 54m 01s | Max: 54m 01s
      🟩 Clang11            Pass: 100%/1   | Total: 58m 29s | Avg: 58m 29s | Max: 58m 29s
      🟩 Clang12            Pass: 100%/1   | Total: 52m 04s | Avg: 52m 04s | Max: 52m 04s
      🟩 Clang13            Pass: 100%/1   | Total: 51m 23s | Avg: 51m 23s | Max: 51m 23s
      🟩 Clang14            Pass: 100%/1   | Total: 51m 40s | Avg: 51m 40s | Max: 51m 40s
      🟩 Clang15            Pass: 100%/1   | Total: 59m 20s | Avg: 59m 20s | Max: 59m 20s
      🟩 Clang16            Pass: 100%/1   | Total: 57m 38s | Avg: 57m 38s | Max: 57m 38s
      🟩 Clang17            Pass: 100%/1   | Total: 52m 12s | Avg: 52m 12s | Max: 52m 12s
      🟩 Clang18            Pass: 100%/7   | Total:  5h 27m | Avg: 46m 43s | Max:  1h 04m
      🟩 GCC6               Pass: 100%/2   | Total:  8m 20s | Avg:  4m 10s | Max:  4m 26s
      🟩 GCC7               Pass: 100%/2   | Total:  1h 53m | Avg: 56m 58s | Max: 59m 46s
      🟩 GCC8               Pass: 100%/1   | Total: 51m 45s | Avg: 51m 45s | Max: 51m 45s
      🟩 GCC9               Pass: 100%/3   | Total:  1h 02m | Avg: 20m 40s | Max: 52m 36s
      🟩 GCC10              Pass: 100%/1   | Total: 59m 59s | Avg: 59m 59s | Max: 59m 59s
      🟩 GCC11              Pass: 100%/1   | Total: 59m 36s | Avg: 59m 36s | Max: 59m 36s
      🟩 GCC12              Pass: 100%/1   | Total:  1h 00m | Avg:  1h 00m | Max:  1h 00m
      🟩 GCC13              Pass: 100%/8   | Total:  4h 45m | Avg: 35m 41s | Max:  1h 00m
      🟩 Intel2023.2.0      Pass: 100%/1   | Total:  1h 01m | Avg:  1h 01m | Max:  1h 01m
      🟩 MSVC14.16          Pass: 100%/1   | Total:  1h 00m | Avg:  1h 00m | Max:  1h 00m | Hits:  42%/766   
      🟩 MSVC14.29          Pass: 100%/1   | Total: 59m 32s | Avg: 59m 32s | Max: 59m 32s | Hits:  42%/766   
      🟩 MSVC14.39          Pass: 100%/2   | Total:  2h 11m | Avg:  1h 05m | Max:  1h 11m | Hits:  42%/1532  
      🟩 NVHPC24.7          Pass: 100%/2   | Total:  2h 06m | Avg:  1h 03m | Max:  1h 05m
    🟩 cxx_family
      🟩 Clang              Pass: 100%/19  | Total: 16h 09m | Avg: 51m 02s | Max:  1h 04m
      🟩 GCC                Pass: 100%/19  | Total: 11h 42m | Avg: 36m 56s | Max:  1h 00m
      🟩 Intel              Pass: 100%/1   | Total:  1h 01m | Avg:  1h 01m | Max:  1h 01m
      🟩 MSVC               Pass: 100%/4   | Total:  4h 11m | Avg:  1h 02m | Max:  1h 11m | Hits:  42%/3064  
      🟩 NVHPC              Pass: 100%/2   | Total:  2h 06m | Avg:  1h 03m | Max:  1h 05m
    🟩 gpu
      🟩 v100               Pass: 100%/45  | Total:  1d 11h | Avg: 46m 54s | Max:  1h 11m | Hits:  42%/3064  
    🟩 jobs
      🟩 Build              Pass: 100%/39  | Total:  1d 09h | Avg: 50m 49s | Max:  1h 11m | Hits:  42%/3064  
      🟩 DeviceLaunch       Pass: 100%/1   | Total: 16m 33s | Avg: 16m 33s | Max: 16m 33s
      🟩 GraphCapture       Pass: 100%/1   | Total: 22m 48s | Avg: 22m 48s | Max: 22m 48s
      🟩 HostLaunch         Pass: 100%/2   | Total: 47m 27s | Avg: 23m 43s | Max: 23m 44s
      🟩 TestGPU            Pass: 100%/2   | Total: 42m 23s | Avg: 21m 11s | Max: 23m 11s
    🟩 sm
      🟩 90a                Pass: 100%/1   | Total: 26m 15s | Avg: 26m 15s | Max: 26m 15s
    🟩 std
      🟩 11                 Pass: 100%/5   | Total:  2h 48m | Avg: 33m 41s | Max: 59m 46s
      🟩 14                 Pass: 100%/4   | Total:  2h 59m | Avg: 44m 52s | Max:  1h 00m | Hits:  42%/766   
      🟩 17                 Pass: 100%/12  | Total: 10h 18m | Avg: 51m 31s | Max:  1h 01m | Hits:  42%/1532  
      🟩 20                 Pass: 100%/24  | Total: 19h 05m | Avg: 47m 42s | Max:  1h 11m | Hits:  41%/766   
    
  • 🟩 cudax: Pass: 100%/26 | Total: 2h 25m | Avg: 5m 34s | Max: 16m 58s | Hits: 89%/312

    🟩 cpu
      🟩 amd64              Pass: 100%/22  | Total:  2h 10m | Avg:  5m 55s | Max: 16m 58s | Hits:  89%/312   
      🟩 arm64              Pass: 100%/4   | Total: 14m 43s | Avg:  3m 40s | Max:  3m 55s
    🟩 ctk
      🟩 12.0               Pass: 100%/3   | Total: 20m 43s | Avg:  6m 54s | Max: 13m 36s | Hits:  89%/156   
      🟩 12.5               Pass: 100%/2   | Total: 12m 03s | Avg:  6m 01s | Max:  6m 12s
      🟩 12.6               Pass: 100%/21  | Total:  1h 52m | Avg:  5m 20s | Max: 16m 58s | Hits:  89%/156   
    🟩 cudacxx
      🟩 nvcc12.0           Pass: 100%/3   | Total: 20m 43s | Avg:  6m 54s | Max: 13m 36s | Hits:  89%/156   
      🟩 nvcc12.5           Pass: 100%/2   | Total: 12m 03s | Avg:  6m 01s | Max:  6m 12s
      🟩 nvcc12.6           Pass: 100%/21  | Total:  1h 52m | Avg:  5m 20s | Max: 16m 58s | Hits:  89%/156   
    🟩 cudacxx_family
      🟩 nvcc               Pass: 100%/26  | Total:  2h 25m | Avg:  5m 34s | Max: 16m 58s | Hits:  89%/312   
    🟩 cxx
      🟩 Clang9             Pass: 100%/1   | Total:  3m 46s | Avg:  3m 46s | Max:  3m 46s
      🟩 Clang10            Pass: 100%/1   | Total:  4m 38s | Avg:  4m 38s | Max:  4m 38s
      🟩 Clang11            Pass: 100%/1   | Total:  4m 11s | Avg:  4m 11s | Max:  4m 11s
      🟩 Clang12            Pass: 100%/1   | Total:  3m 49s | Avg:  3m 49s | Max:  3m 49s
      🟩 Clang13            Pass: 100%/1   | Total:  4m 07s | Avg:  4m 07s | Max:  4m 07s
      🟩 Clang14            Pass: 100%/1   | Total:  3m 52s | Avg:  3m 52s | Max:  3m 52s
      🟩 Clang15            Pass: 100%/1   | Total:  3m 58s | Avg:  3m 58s | Max:  3m 58s
      🟩 Clang16            Pass: 100%/1   | Total:  3m 46s | Avg:  3m 46s | Max:  3m 46s
      🟩 Clang17            Pass: 100%/1   | Total:  3m 49s | Avg:  3m 49s | Max:  3m 49s
      🟩 Clang18            Pass: 100%/4   | Total: 27m 30s | Avg:  6m 52s | Max: 16m 02s
      🟩 GCC9               Pass: 100%/1   | Total:  3m 21s | Avg:  3m 21s | Max:  3m 21s
      🟩 GCC10              Pass: 100%/1   | Total:  3m 46s | Avg:  3m 46s | Max:  3m 46s
      🟩 GCC11              Pass: 100%/1   | Total:  3m 51s | Avg:  3m 51s | Max:  3m 51s
      🟩 GCC12              Pass: 100%/2   | Total: 20m 59s | Avg: 10m 29s | Max: 16m 58s
      🟩 GCC13              Pass: 100%/4   | Total: 13m 59s | Avg:  3m 29s | Max:  3m 55s
      🟩 MSVC14.36          Pass: 100%/1   | Total: 13m 36s | Avg: 13m 36s | Max: 13m 36s | Hits:  89%/156   
      🟩 MSVC14.39          Pass: 100%/1   | Total:  9m 59s | Avg:  9m 59s | Max:  9m 59s | Hits:  89%/156   
      🟩 NVHPC24.7          Pass: 100%/2   | Total: 12m 03s | Avg:  6m 01s | Max:  6m 12s
    🟩 cxx_family
      🟩 Clang              Pass: 100%/13  | Total:  1h 03m | Avg:  4m 52s | Max: 16m 02s
      🟩 GCC                Pass: 100%/9   | Total: 45m 56s | Avg:  5m 06s | Max: 16m 58s
      🟩 MSVC               Pass: 100%/2   | Total: 23m 35s | Avg: 11m 47s | Max: 13m 36s | Hits:  89%/312   
      🟩 NVHPC              Pass: 100%/2   | Total: 12m 03s | Avg:  6m 01s | Max:  6m 12s
    🟩 gpu
      🟩 v100               Pass: 100%/26  | Total:  2h 25m | Avg:  5m 34s | Max: 16m 58s | Hits:  89%/312   
    🟩 jobs
      🟩 Build              Pass: 100%/24  | Total:  1h 52m | Avg:  4m 40s | Max: 13m 36s | Hits:  89%/312   
      🟩 Test               Pass: 100%/2   | Total: 33m 00s | Avg: 16m 30s | Max: 16m 58s
    🟩 sm
      🟩 90                 Pass: 100%/1   | Total:  3m 25s | Avg:  3m 25s | Max:  3m 25s
      🟩 90a                Pass: 100%/1   | Total:  3m 10s | Avg:  3m 10s | Max:  3m 10s
    🟩 std
      🟩 17                 Pass: 100%/6   | Total: 23m 34s | Avg:  3m 55s | Max:  5m 51s
      🟩 20                 Pass: 100%/20  | Total:  2h 01m | Avg:  6m 04s | Max: 16m 58s | Hits:  89%/312   
    
  • 🟩 cccl_c_parallel: Pass: 100%/2 | Total: 10m 27s | Avg: 5m 13s | Max: 8m 05s

    🟩 cpu
      🟩 amd64              Pass: 100%/2   | Total: 10m 27s | Avg:  5m 13s | Max:  8m 05s
    🟩 ctk
      🟩 12.6               Pass: 100%/2   | Total: 10m 27s | Avg:  5m 13s | Max:  8m 05s
    🟩 cudacxx
      🟩 nvcc12.6           Pass: 100%/2   | Total: 10m 27s | Avg:  5m 13s | Max:  8m 05s
    🟩 cudacxx_family
      🟩 nvcc               Pass: 100%/2   | Total: 10m 27s | Avg:  5m 13s | Max:  8m 05s
    🟩 cxx
      🟩 GCC13              Pass: 100%/2   | Total: 10m 27s | Avg:  5m 13s | Max:  8m 05s
    🟩 cxx_family
      🟩 GCC                Pass: 100%/2   | Total: 10m 27s | Avg:  5m 13s | Max:  8m 05s
    🟩 gpu
      🟩 v100               Pass: 100%/2   | Total: 10m 27s | Avg:  5m 13s | Max:  8m 05s
    🟩 jobs
      🟩 Build              Pass: 100%/1   | Total:  2m 22s | Avg:  2m 22s | Max:  2m 22s
      🟩 Test               Pass: 100%/1   | Total:  8m 05s | Avg:  8m 05s | Max:  8m 05s
    
  • 🟩 python: Pass: 100%/1 | Total: 27m 53s | Avg: 27m 53s | Max: 27m 53s

    🟩 cpu
      🟩 amd64              Pass: 100%/1   | Total: 27m 53s | Avg: 27m 53s | Max: 27m 53s
    🟩 ctk
      🟩 12.6               Pass: 100%/1   | Total: 27m 53s | Avg: 27m 53s | Max: 27m 53s
    🟩 cudacxx
      🟩 nvcc12.6           Pass: 100%/1   | Total: 27m 53s | Avg: 27m 53s | Max: 27m 53s
    🟩 cudacxx_family
      🟩 nvcc               Pass: 100%/1   | Total: 27m 53s | Avg: 27m 53s | Max: 27m 53s
    🟩 cxx
      🟩 GCC13              Pass: 100%/1   | Total: 27m 53s | Avg: 27m 53s | Max: 27m 53s
    🟩 cxx_family
      🟩 GCC                Pass: 100%/1   | Total: 27m 53s | Avg: 27m 53s | Max: 27m 53s
    🟩 gpu
      🟩 v100               Pass: 100%/1   | Total: 27m 53s | Avg: 27m 53s | Max: 27m 53s
    🟩 jobs
      🟩 Test               Pass: 100%/1   | Total: 27m 53s | Avg: 27m 53s | Max: 27m 53s
    

👃 Inspect Changes

Modifications in project?

|     | Project                 |
|-----|-------------------------|
|     | CCCL Infrastructure     |
| +/- | libcu++                 |
| +/- | CUB                     |
| +/- | Thrust                  |
|     | CUDA Experimental       |
|     | python                  |
|     | CCCL C Parallel Library |
|     | Catch2Helper            |

Modifications in project or dependencies?

|     | Project                 |
|-----|-------------------------|
|     | CCCL Infrastructure     |
| +/- | libcu++                 |
| +/- | CUB                     |
| +/- | Thrust                  |
| +/- | CUDA Experimental       |
| +/- | python                  |
| +/- | CCCL C Parallel Library |
| +/- | Catch2Helper            |

🏃‍ Runner counts (total jobs: 168)

| #   | Runner                        |
|-----|-------------------------------|
| 124 | linux-amd64-cpu16             |
| 19  | linux-amd64-gpu-v100-latest-1 |
| 15  | windows-amd64-cpu16           |
| 10  | linux-arm64-cpu16             |

@bernhardmgruber bernhardmgruber merged commit 53f69a4 into NVIDIA:main Dec 11, 2024
185 checks passed
@bernhardmgruber bernhardmgruber deleted the fdl branch December 11, 2024 16:14
@@ -127,7 +127,7 @@ void keys(nvbench::state& state, nvbench::type_list<T, OffsetT>)
thrust::device_vector<nvbench::uint8_t> temp(temp_size);
auto* temp_storage = thrust::raw_pointer_cast(temp.data());

- state.exec(nvbench::exec_tag::no_batch, [&](nvbench::launch& launch) {
+ state.exec(nvbench::exec_tag::sync, [&](nvbench::launch& launch) {
Collaborator

question: why is this change needed? Did merge sort become synchronous?

Contributor Author

Yes. Merge sort now calls cudaGridDependencySynchronize, which causes the benchmark to crash if I use no_batch.
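
For readers unfamiliar with PDL, here is a minimal sketch of the pattern discussed above, written against the plain CUDA runtime API rather than CUB's internal launcher. The kernel names `producer` and `consumer` and the buffer sizes are hypothetical and are not taken from this PR. The sketch assumes CUDA 12.0 or newer and a GPU with compute capability 9.0 (for example, compiled with `nvcc -arch=sm_90`). Two things have to happen together: the dependent kernel calls `cudaGridDependencySynchronize()` before reading the previous kernel's output, and its launch sets the `cudaLaunchAttributeProgrammaticStreamSerialization` attribute.

```cuda
#include <cuda_runtime.h>

#include <cstdio>

// Hypothetical first kernel: produces the data the second kernel reads.
__global__ void producer(int* data, int n)
{
  const int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n)
  {
    data[i] = i;
  }
}

// Hypothetical second kernel: may begin launching before `producer` has
// finished, so it must wait before touching `in`.
__global__ void consumer(const int* in, int* out, int n)
{
  const int i = blockIdx.x * blockDim.x + threadIdx.x;
#if __CUDA_ARCH__ >= 900
  // Blocks until the preceding grid has completed and its writes are visible.
  cudaGridDependencySynchronize();
#endif
  if (i < n)
  {
    out[i] = in[i] * 2;
  }
}

int main()
{
  const int n     = 1 << 20;
  const int block = 256;
  const int grid  = (n + block - 1) / block;

  int* a = nullptr;
  int* b = nullptr;
  cudaMalloc(&a, n * sizeof(int));
  cudaMalloc(&b, n * sizeof(int));

  cudaStream_t stream;
  cudaStreamCreate(&stream);

  // Ordinary launch of the first kernel.
  producer<<<grid, block, 0, stream>>>(a, n);

  // Opt the second launch into programmatic dependent launch.
  cudaLaunchAttribute attr[1];
  attr[0].id                                         = cudaLaunchAttributeProgrammaticStreamSerialization;
  attr[0].val.programmaticStreamSerializationAllowed = 1;

  cudaLaunchConfig_t config = {};
  config.gridDim            = dim3(grid);
  config.blockDim           = dim3(block);
  config.dynamicSmemBytes   = 0;
  config.stream             = stream;
  config.attrs              = attr;
  config.numAttrs           = 1;

  cudaError_t err = cudaLaunchKernelEx(&config, consumer, static_cast<const int*>(a), b, n);
  if (err != cudaSuccess)
  {
    std::printf("launch failed: %s\n", cudaGetErrorString(err));
    return 1;
  }

  err = cudaStreamSynchronize(stream);
  if (err != cudaSuccess)
  {
    std::printf("execution failed: %s\n", cudaGetErrorString(err));
    return 1;
  }

  // Spot-check the last element on the host.
  int last = 0;
  cudaMemcpy(&last, b + (n - 1), sizeof(int), cudaMemcpyDeviceToHost);
  std::printf("out[n-1] = %d (expected %d)\n", last, 2 * (n - 1));

  cudaStreamDestroy(stream);
  cudaFree(a);
  cudaFree(b);
  return 0;
}
```

In this sketch the first kernel never signals early completion, so the only overlap gained is the launch latency of the second kernel, which is the kind of back-to-back tightening visible in the nsys traces in the PR description.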

bernhardmgruber added a commit to bernhardmgruber/cccl that referenced this pull request Dec 12, 2024
For some reason, I made PDL available with CTK 11.8 in NVIDIA#3114, but it seems the feature is only available starting with CTK 12.0.
bernhardmgruber added a commit that referenced this pull request Dec 12, 2024
For some reason, I made PDL available with CTK 11.8 in #3114, but it seems the feature is only available starting with CTK 12.0.
bernhardmgruber added a commit to bernhardmgruber/cccl that referenced this pull request Dec 19, 2024
The kernel already contains a call to _CCCL_PDL_GRID_DEPENDENCY_SYNC,
but PDL was not enabled when launching it. This was missed in NVIDIA#3114.
Labels
None yet
Projects
Archived in project
3 participants