Open
Labels: bug, gpu (OpenCL) (issue is related to the OpenCL-based GPU variant)
Description
Hi, I found a bug when training with a large X_train.
lgb-gpu version: 3.3.2
CUDA=11.1
CentOS
ram=2TB
GPU=A100-40G
When X_train is larger than about (18,000,000, 1000), i.e. 18 million rows by 1000 columns, lgb-gpu fails with this error:
[LightGBM] [Fatal] Check failed: (best_split_info.left_count) > (0) at LightGBM/src/treelearner/serial_tree_learner.cpp, line 686
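For anyone trying to reproduce this, here is a minimal sketch of the kind of script that should exercise the same code path. It is not the script from the report: the random data, the `regression` objective, the round count, and the helper names `gpu_params` and `run_repro` are all assumptions made for illustration. Only `device_type="gpu"` and the data shape come from the description above.

```python
def gpu_params(max_bin: int = 255) -> dict:
    """Training parameters for the OpenCL GPU learner.

    The objective is an assumption; the report does not state which one was used.
    """
    return {
        "objective": "regression",  # assumed, not given in the report
        "device_type": "gpu",       # OpenCL-based GPU variant
        "max_bin": max_bin,
        "verbose": 1,
    }


def run_repro(n_rows: int, n_features: int = 1000, num_rounds: int = 10):
    """Train on random dense data of the given shape.

    Imports are local so the helper above can be used without a GPU build.
    A float32 matrix of 18,000,000 x 1000 needs roughly 72 GB of host RAM.
    """
    import numpy as np
    import lightgbm as lgb

    rng = np.random.default_rng(0)
    X = rng.random((n_rows, n_features), dtype=np.float32)
    y = rng.random(n_rows, dtype=np.float32)
    train_set = lgb.Dataset(X, label=y)
    return lgb.train(gpu_params(), train_set, num_boost_round=num_rounds)
```

Calling `run_repro(n_rows)` with increasing row counts up to about 18,000,000 would show at what size the check starts to fail, assuming random data triggers it the same way the real data does.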
With LightGBM 3.2.1, I hit the same problem as #4480: once GPU memory usage exceeds about 8.3 GB, training fails with "Memory Object Allocation Failure".
With LightGBM 3.3.2, training cannot use more than about 17 GB of GPU memory, even though the GPU has 40 GB. The failure appears to occur in the tree split step, and only when more than about 17 GB of GPU memory is in use.
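As a rough sanity check on the 17 GB figure, the arithmetic below estimates the size of the binned feature matrix for the reported shape. It assumes one byte per feature value (which would hold for `max_bin` up to 255 with dense storage); that storage layout is an assumption, not something confirmed from the LightGBM source, and the function name is made up for illustration.

```python
def estimated_bin_bytes(n_rows: int, n_features: int, bytes_per_bin: int = 1) -> int:
    """Rough size in bytes of a dense binned feature matrix.

    bytes_per_bin=1 is an assumption (one byte per feature value).
    """
    return n_rows * n_features * bytes_per_bin


n_rows = 18_000_000  # row count from the report
n_features = 1000    # column count from the report

size = estimated_bin_bytes(n_rows, n_features)
print(f"{size / 1e9:.1f} GB ({size / 2**30:.2f} GiB)")
```

Under that assumption the data alone comes to about 18 GB (roughly 16.8 GiB), which is in the same range as the point where the failure starts. This only shows the numbers are consistent; it does not establish the cause.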
A colleague has the same problem with LightGBM 3.3.1.