
[QST] When to expect speedups for RF on CUDA? #6191

Open
adam2392 opened this issue Dec 19, 2024 · 2 comments
Labels: ? - Needs Triage (need team to review and classify), question (further information is requested)

Comments

adam2392 commented Dec 19, 2024

What is your question?

Hi, thanks for the package!

I am browsing https://docs.rapids.ai/api/cuml/stable/api/#random-forest and am wondering when one might expect speedups for the RF model over, say, the equivalent model in scikit-learn.

I am trying to understand where the GPU parallelization comes into play.

  1. Is it that each core in the GPU trains a separate tree?
  2. Is it that the determination of the best split at each node is parallelized on the GPU?
  3. Is there other parallelization that occurs on the GPU?
  4. When/what data is transferred from GPU to CPU? I imagine if this is done per split node in the tree, then it would have communication overhead.
  5. I also noticed that max_depth is constrained. Is this in part due to the GPU implementation?

I am also trying to understand whether there are known limitations and possible performance bottlenecks. Are there any docs or links to benchmark experiments that can help a user understand this better?

adam2392 added the ? - Needs Triage and question labels on Dec 19, 2024

wphicks (Contributor) commented Dec 20, 2024

My specialty is more on the inference side of this question than on training, but let me see how much I can answer for you until the folks who wrote more of the training code are back in the office.

  1. Each tree is trained on a separate CUDA stream taken from a pool of fixed size. That doesn't necessarily map to "cores," but it does mean that work on those trees can be handled in parallel. This is the highest level of parallelism in the training algorithm, but further parallelism is achieved at a more granular level (see the parameter sketch after this list).
  2. Which brings us to your second question. Yes, for each tree, we parallelize the computation of the splits at each node over CUDA threads. Each thread handles multiple samples from the training data, up to some maximum value, and works on those samples in parallel with other threads.
  3. Yes, there is additional parallelism at multiple steps of the training algorithm. For a detailed understanding, I'd recommend starting here and checking out the kernels launched in that method.
  4. Mostly, training data is transferred from CPU to GPU at the beginning of the process and then accessed from global device memory. In principle, training could be batched in such a way that we transfer a batch to device, perform all necessary access and then move on to the next batch, but we don't currently implement that. There are some additional details around where we use host memory internally that others can answer better than I.
  5. You may need to wait for other folks to give you a more complete answer here. My understanding is that this is because we allocate space for the maximum potential number of nodes and do not want to have to bring the entire parallel training process to a halt if we run out of room and need a reallocation.
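
To make points 1, 2, and 5 a bit more concrete, here is a rough sketch of how those knobs surface in the Python API (parameter names are to the best of my recollection, so please double-check them against the docs for your installed cuML version):

```python
# Rough sketch of the cuML RF constructor parameters that map onto the
# parallelism levels above; double-check names and defaults for your version.
from cuml.ensemble import RandomForestClassifier as cuRF

model = cuRF(
    n_estimators=100,
    n_streams=4,    # point 1: trees are built concurrently on this many CUDA streams
    n_bins=128,     # point 2: candidate split values per feature after quantization
    max_depth=16,   # point 5: bounded up front so node storage can be preallocated
)
```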

In general, RandomForest follows most other ML algorithms in terms of its GPU acceleration characteristics. The larger the dataset or the larger the model, the greater the benefit GPUs tend to offer. The exact cutoff is hardware dependent, but you can see some example benchmarks here.
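
If you want a feel for where that cutoff lands on your own hardware, a quick-and-dirty timing loop like the one below (just a sketch, not one of the linked benchmarks) is usually enough to find the crossover point:

```python
# Quick-and-dirty timing sketch (not one of the linked benchmarks): compare
# wall-clock fit time for sklearn vs cuML RF as the dataset grows. The exact
# crossover point depends on your CPU and GPU.
import time
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier as skRF
from cuml.ensemble import RandomForestClassifier as cuRF

for n_samples in (10_000, 100_000, 1_000_000):
    X, y = make_classification(n_samples=n_samples, n_features=50, random_state=0)
    X, y = X.astype(np.float32), y.astype(np.int32)

    t0 = time.perf_counter()
    skRF(n_estimators=100, n_jobs=-1).fit(X, y)
    cpu_s = time.perf_counter() - t0

    t0 = time.perf_counter()
    cuRF(n_estimators=100).fit(X, y)
    gpu_s = time.perf_counter() - t0

    print(f"n_samples={n_samples:>9,}  sklearn={cpu_s:6.1f}s  cuml={gpu_s:6.1f}s")
```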

After the holidays, the folks primarily responsible for the training code can give you much more detailed answers, and if you have questions about inference in the meantime, I can answer that in as much detail as you like. Hope this at least gives you a start on what you need!

adam2392 (Author) commented

Happy new year! Thank you for this detailed response. Looking forward to additional responses from the training team!

Some follow-up questions after reading through the benchmarks you linked. No rush in answering if the training team is still out of the office.

[max_depth] 5. You may need to wait for other folks to give you a more complete answer here. My understanding is that this is because we allocate space for the maximum potential number of nodes and do not want to have to bring the entire parallel training process to a halt if we run out of room and need a reallocation.

Is there any heuristic to suggest what one can set max_depth to be before it is not possible to fit on a GPU anymore?
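
For my own back-of-the-envelope intuition (this is just my assumption about how the preallocation might scale, not something taken from the cuML source), I have been reasoning about it roughly like this:

```python
# Back-of-the-envelope sketch (my assumption, not cuML's actual allocation
# logic): if node storage is reserved for the worst case, a binary tree of
# depth d has at most 2**(d + 1) - 1 nodes, but it is also bounded by the
# number of training samples (roughly 2 * n_samples - 1 nodes when every
# leaf ends up holding a single sample).
def max_nodes(depth: int, n_samples: int) -> int:
    return min(2 ** (depth + 1) - 1, 2 * n_samples - 1)

for d in (8, 16, 24, 32):
    print(f"max_depth={d:2d}: up to {max_nodes(d, n_samples=1_000_000):,} nodes")
```

Is that roughly the right mental model for when a given max_depth stops being practical on a GPU?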

Additionally, I understand there is no loss of accuracy on these benchmarks at the chosen max_depth compared to sklearn, but I think a fairer comparison would be against a sklearn RF trained to purity (i.e. with max_depth unconstrained).

  1. There also seems to be the issue of binning and quantization of features:

From what I understand, this reduces the search space for a split value from n_samples_in_split_node to n_bins values at each split. Moreover, if one precomputes the quantiles, that speeds things up even more. Were there ablation experiments done to disentangle the speedup that comes from this binning/quantization strategy vs. parallelization over the GPU?

I.e., if one ran the cuML RF algorithm on a CPU with the binning and/or quantization strategy, would there still be a speedup compared to sklearn? I'm asking because one could imagine that a significant amount of the training speedup comes from this algorithmic change.
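
For concreteness, the algorithmic change I'm referring to is roughly the following (my own toy sketch of quantile binning, not cuML's actual implementation):

```python
# Toy sketch of the algorithmic change I mean (not cuML's implementation):
# quantile-binning a feature reduces the candidate split thresholds at a node
# from roughly one per unique sample value to at most n_bins - 1 bin edges.
import numpy as np

rng = np.random.default_rng(0)
feature = rng.normal(size=100_000)   # one feature column at a split node
n_bins = 128

# An exhaustive search considers ~n_samples candidate thresholds.
exhaustive_candidates = np.unique(feature).size

# Histogram/quantile binning only considers the interior bin edges.
edges = np.quantile(feature, np.linspace(0, 1, n_bins + 1)[1:-1])
binned_candidates = np.unique(edges).size

print(f"exhaustive: {exhaustive_candidates:,} candidate thresholds")
print(f"binned:     {binned_candidates:,} candidate thresholds")
```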
