Hi ROCm experts,
I wonder how the magic number 3000 was chosen here.
When benchmarking a matrix of shape [m, n] with m < 3000, segmented_radix_sort_impl() never takes the do_partitioning path, which appears to contain more fine-grained kernels selected by segment count.
On the other hand, CUDA's CUB can do a segmented sort per row. Should we expect a performance gap here?
Thanks for the guidance.
Hi, these values are determined by our autotuning system. We run it on a set of GPUs, where it compiles and benchmarks the algorithms over a range of parameters. A developer-oriented explanation is given here.
If you believe that there is a performance issue there, you have a few options:
You can pass a custom config for that particular operation where you manually set the values.
You can also add a benchmark case in the benchmark for segmented radix sort for your dimensions, and run the tuning yourself.
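To make the threshold behaviour discussed in this thread concrete, here is a minimal host-only sketch of the dispatch decision. It is not rocPRIM's actual code or API: `sort_config_sketch`, `sort_path`, and `choose_path` are hypothetical names made up for illustration, the default of 3000 is simply the value quoted in the question, and the `>=` comparison is an assumption based on the question's statement that m < 3000 never reaches the partitioned path.

```cpp
#include <cstddef>

// Hypothetical stand-in for a tuning config; NOT rocPRIM's real config type.
struct sort_config_sketch
{
    // Minimum number of segments (rows, m) before the partitioned path is used.
    // 3000 is the value quoted in the question; real tuned values may differ per GPU.
    std::size_t partitioning_threshold = 3000;
};

// The two paths the question distinguishes (names are illustrative only).
enum class sort_path
{
    single_kernel, // all segments handled by one generic kernel
    partitioned    // segments are first grouped, then sorted by specialised kernels
};

// Assumed decision rule: partition only when the segment count reaches the threshold.
constexpr sort_path choose_path(std::size_t segment_count, const sort_config_sketch& cfg)
{
    return segment_count >= cfg.partitioning_threshold ? sort_path::partitioned
                                                       : sort_path::single_kernel;
}
```

With the default threshold, `choose_path(2000, sort_config_sketch{})` selects `sort_path::single_kernel`. Lowering `partitioning_threshold` to 1000 in a custom config makes the same call select `sort_path::partitioned`, which is the kind of effect a manually set config or a re-run of the tuning for your own dimensions would have.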