Compute time status before starting experiments on 2023-10-06:
| Project | Resources | Granted | Used | Percentage |
| --- | --- | --- | --- | --- |
| hk-project-test-aihero2 | cpu | 1000000 | 66639 | 6.66% |
The goal is to find out the maximum problem size we can fit on a single compute node and to get a feeling for the numbers and runtimes involved.
Consider a synthetic classification dataset generated with `sklearn`'s `make_classification`. With `sklearn`'s `RandomForestClassifier`, the default number of features considered when looking for the best split is $u=\sqrt{m}$ for classification (i.e., `max_features="sqrt"`).
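As a minimal sketch of this setup (with deliberately small, illustrative sizes and arbitrary random seeds rather than the grid values used below):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Illustrative (small) sizes; the actual experiments use n up to 1e9 and m up to 1e5.
n_samples = 10_000   # n
n_features = 100     # m
n_trees = 100        # t

# Generate a synthetic classification dataset.
X, y = make_classification(n_samples=n_samples, n_features=n_features, random_state=0)

# With max_features="sqrt", each split considers only sqrt(m) candidate features,
# which matches the default behavior for classification.
clf = RandomForestClassifier(n_estimators=n_trees, max_features="sqrt", random_state=0)
clf.fit(X, y)
print(clf.score(X, y))
```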
From above, we know that the overall time complexity for training a random forest is $\mathcal{O}\left(t \cdot \sqrt{m} \cdot n\log n\right)$, composed of the following contributions:

Model-related:

- Number of trees $t$ $\rightarrow \mathcal{O}\left(t\right)$

Data-related:

- Number of samples $n$ $\rightarrow \mathcal{O}\left(n\log n\right)$
- Number of features $m$ $\rightarrow \mathcal{O}\left(u\right)=\mathcal{O}\left(\sqrt{m}\right)$ (features considered per split)
We do a grid search of the following search space:
- $t \in \left[10^2, 10^3, 10^4, 10^5\right]$
- $n \in \left[10^6, 10^7, 10^8, 10^9\right]$
- $m \in \left[10^2, 10^3, 10^4, 10^5\right]$

This makes 64 experiments overall for the single-node capacity series.
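A small sketch of how these 64 configurations could be enumerated, together with the theoretical relative training cost $t \cdot \sqrt{m} \cdot n \log n$ from above (the `relative_cost` helper is only for illustration, not part of the experiment code):

```python
import math
from itertools import product

# Search space from above: 4 values each for t, n, and m -> 4 * 4 * 4 = 64 configurations.
trees    = [10**2, 10**3, 10**4, 10**5]   # t
samples  = [10**6, 10**7, 10**8, 10**9]   # n
features = [10**2, 10**3, 10**4, 10**5]   # m

configs = list(product(trees, samples, features))
print(len(configs))  # 64

# Theoretical relative training cost per configuration: t * sqrt(m) * n * log(n).
def relative_cost(t, n, m):
    return t * math.sqrt(m) * n * math.log(n)

cheapest = min(configs, key=lambda c: relative_cost(*c))
most_expensive = max(configs, key=lambda c: relative_cost(*c))
print(cheapest, most_expensive)
```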
Compute time status after single-node experiments on 2023-10-17 (not sure whether another person also uses this compute time contingent):
| Project | Resources | Granted | Used | Percentage |
| --- | --- | --- | --- | --- |
| hk-project-test-aihero2 | cpu | 1000000 | 135795 | 27.16% |
This plot shows the state of each job, i.e., `COMPLETED` (green), `TIMEOUT` (orange), or `FAILED` (red), depending on the job's corresponding number of samples, number of features, and number of trees.
Data points for runs on the same dataset, i.e., the same number of samples and features but different numbers of trees, overlap: the marker size grows with the number of trees, so the runs with the smallest number of trees are drawn innermost and on top.
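The plot itself is not reproduced here; a hedged sketch of how such a visualization could be generated (the `results` records and the marker-size rule are placeholders, not the actual data):

```python
import matplotlib.pyplot as plt

# Hypothetical job records: (n_samples, n_features, n_trees, SLURM job state).
# These are placeholders, not the actual experiment results.
results = [
    (10**6, 10**4, 10**3, "COMPLETED"),
    (10**7, 10**3, 10**5, "TIMEOUT"),
    (10**9, 10**5, 10**2, "FAILED"),
]
state_colors = {"COMPLETED": "green", "TIMEOUT": "orange", "FAILED": "red"}

fig, ax = plt.subplots()
# Draw runs with many trees first (largest markers) so runs with fewer trees sit on top.
for n, m, t, state in sorted(results, key=lambda r: -r[2]):
    ax.scatter(n, m, s=60 * len(str(t)), color=state_colors[state], alpha=0.7)

ax.set_xscale("log")
ax.set_yscale("log")
ax.set_xlabel("Number of samples $n$")
ax.set_ylabel("Number of features $m$")
fig.savefig("job_states.png")
```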
Runs using datasets too large to fit into a single node's memory ran into a `numpy.core._exceptions.MemoryError` during dataset creation, i.e., "Unable to allocate x TiB for an array with shape (n, m) and data type float64". These are all the runs shown in red.
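A back-of-the-envelope estimate of the dense `float64` dataset footprint makes clear which grid points cannot fit on one node; the helper below is only a rough sketch and ignores the additional temporaries allocated by `make_classification`:

```python
# Rough memory footprint of the dense float64 feature matrix alone:
# n * m * 8 bytes (labels and temporary copies come on top).
def dataset_size_gib(n, m, bytes_per_element=8):
    return n * m * bytes_per_element / 2**30

for n, m in [(10**6, 10**4), (10**7, 10**3), (10**8, 10**4), (10**9, 10**5)]:
    print(f"n={n:.0e}, m={m:.0e}: {dataset_size_gib(n, m):,.1f} GiB")
```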
Runs with job state `TIMEOUT` ran into the single-node wall-time limit of 3 days. These configurations are in principle trainable on a single node: they do not suffer from insufficient memory but are canceled due to the limited wall time.
This can be mitigated by training the forest in a tree-parallel manner on multiple nodes.
Note that this does not solve the memory problem during dataset creation for datasets that exceed a single node's memory.
In the following experiments, we thus only consider datasets that still fit into the memory of a single node.
As we are primarily interested in datasets with many features, we focus on the trainable configurations with the largest number of features.
As mentioned before, we focus on the following two datasets:
$\lbrace n, m\rbrace=\lbrace 10^6, 10^4\rbrace$ $\lbrace n, m\rbrace=\lbrace 10^7, 10^3\rbrace$
Considering the accuracy, the
The goal is to study strong and weak scaling of our distributed random forest implementation as well as the accuracy as a function of the number of trees $t$.
Consider the two datasets identified in the single-node capacity experiments, i.e.,
$\lbrace n, m, t\rbrace=\lbrace 10^6, 10^4, 10^3\rbrace$ $\lbrace n, m, t\rbrace=\lbrace 10^7, 10^3, 10^2\rbrace$
We scale the problem size by varying the number of trees $t$.
For strong scaling, we aim at solving a problem of fixed size faster by ramping up the compute resources used to solve the
problem.
For the distributed RF, this means we take the base problems and solve them using increasingly more compute nodes.
Thus, the global forest is distributed over more compute nodes, decreasing the node-local number of trees.
The quantity of interest is the speed-up $S(p) = T(1)/T(p)$, where $T(p)$ denotes the time to solution on $p$ nodes.
Ideally, the speed-up increases linearly with the number of processors, i.e., $S(p) = p$.
We start from each base problem and solve it using 1, 2, 4, 8, 16, 32, and 64 nodes. Each node should see the complete global dataset (no chunking). The accuracy should stay constant.
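A hedged sketch of one way such a tree-parallel distributed RF could look, here with `mpi4py` and deliberately small illustrative dataset sizes (this is an assumption about the general approach, not necessarily the exact implementation used in these experiments; it also assumes the global number of trees is divisible by the number of ranks):

```python
from mpi4py import MPI
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

n_global_trees = 1024            # global t, assumed divisible by the number of ranks
local_trees = n_global_trees // size  # node-local share of the global forest

# Every rank generates (i.e., "sees") the complete global dataset; no chunking.
X, y = make_classification(n_samples=10_000, n_features=100, random_state=0)

# Different seed per rank so that the node-local sub-forests differ.
clf = RandomForestClassifier(n_estimators=local_trees, random_state=rank)
clf.fit(X, y)

# Combine the sub-forests: gather the local class predictions and majority-vote.
local_pred = clf.predict(X)
all_preds = np.stack(comm.allgather(local_pred))
global_pred = np.apply_along_axis(
    lambda votes: np.bincount(votes).argmax(), 0, all_preds
)
if rank == 0:
    print("global accuracy:", (global_pred == y).mean())
```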
For weak scaling, both problem size and compute resources used to solve the problem are ramped up proportionally to each other. For the distributed RF, this means that we double the number of trees as we double the number of nodes. We aim at solving the problem in a constant amount of time over the complete series of experiments. The quantity of interest is the efficiency $E(p) = T(1)/T(p)$, which ideally stays at 1; it measures how well the additional processors are used in the parallel case.
We start from each base problem and increase the number of trees accordingly as we increase the number of nodes. Again, we consider 1, 2, 4, 8, 16, 32, and 64 nodes. Each node should see the complete global dataset (no chunking). The accuracy can change as we scale the problem via the number of trees.
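For evaluating both series, the scaling metrics can be computed directly from the measured runtimes; the numbers below are placeholders, not measured results:

```python
# Hypothetical measured runtimes T(p) in hours for p nodes (placeholders only).
runtimes_strong = {1: 44.0, 2: 22.5, 4: 11.8}   # fixed global problem size
runtimes_weak   = {1: 44.0, 2: 45.1, 4: 46.9}   # problem size grows with p

def speedup(times, p):
    """Strong scaling: S(p) = T(1) / T(p); ideally S(p) = p."""
    return times[1] / times[p]

def strong_efficiency(times, p):
    """Strong-scaling efficiency: E(p) = S(p) / p; ideally 1."""
    return speedup(times, p) / p

def weak_efficiency(times, p):
    """Weak-scaling efficiency: E(p) = T(1) / T(p); ideally 1."""
    return times[1] / times[p]

for p in (1, 2, 4):
    print(p, speedup(runtimes_strong, p), weak_efficiency(runtimes_weak, p))
```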
A related goal is to study how the accuracy behaves as a function of the number of trees $t$.
Before we perform this series of experiments, we try to find the maximum number of trees for our base problems defined above that still train within the wall-time limit of HoreKa's single node queue, which is 3 d = 72 h.
- $\lbrace n, m, t\rbrace=\lbrace 10^6, 10^4, 10^3\rbrace$: training time 44 h => use $1.5\times t = 1500 \approx 1472 = 23\cdot 64$ trees.
- $\lbrace n, m, t\rbrace=\lbrace 10^7, 10^3, 10^2\rbrace$: training time 16 h => use $4.5\times t = 450 \approx 448 = 7\cdot 64$ trees.
As the training time should scale linearly with the number of trees, there is still room for increasing the base problem
size to harness the full 3 d wall-clock time. We use multiples of 64 for the number of trees so that the global forest can be distributed evenly over up to 64 nodes.
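This extrapolation can be written down explicitly; a small sketch assuming perfectly linear scaling of the training time with $t$ and rounding down to multiples of 64, which reproduces the tree counts of the final setup below:

```python
# Extrapolate how many trees fit into the 72 h wall-time limit, assuming the training
# time scales linearly with the number of trees, and round down to a multiple of 64
# so the forest can be split evenly across up to 64 nodes.
WALL_TIME_H = 72

def max_trees(base_trees, base_time_h, wall_time_h=WALL_TIME_H, multiple=64):
    limit = int(base_trees * wall_time_h / base_time_h)
    return (limit // multiple) * multiple

print(max_trees(1000, 44))  # base problem {1e6, 1e4, 1e3}, 44 h -> 1600
print(max_trees(100, 16))   # base problem {1e7, 1e3, 1e2}, 16 h -> 448
```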
In addition to the training time, we explicitly measure the dataset creation time in this series of experiments.
Final setup of single-node baseline experiments:
- $\lbrace n, m, t\rbrace=\lbrace 10^6, 10^4, 1600\rbrace$: training time 69.57 h
- $\lbrace n, m, t\rbrace=\lbrace 10^7, 10^3, 448\rbrace$: training time 68.97 h