xgboost4j-spark-gpu train failed on multiple gpu node with EXCLUSIVE_PROCESS mode #11119

Closed
yinqingh opened this issue Dec 19, 2024 · 5 comments

Comments

@yinqingh

xgboost4j-spark-gpu training fails on a multi-GPU node with EXCLUSIVE_PROCESS mode

Environment

  • OS: Ubuntu 22.04.2 LTS on OCI
  • Spark version: 3.5.0
  • XGBoost4j-spark: xgboost4j-spark-gpu_2.12-3.0.0-SNAPSHOT.jar
  • rapids-4-spark: rapids-4-spark_2.12-24.12.0-SNAPSHOT-cuda12.jar
  • GPU: 4 × L40S

Failure logs

  1. Failed with all GPUs in EXCLUSIVE_PROCESS mode:
cudaErrorDevicesUnavailable: CUDA-capable device(s) is/are busy or unavailable
24/12/19 03:23:24 WARN TaskSetManager: Lost task 3.0 in stage 0.0 (TID 2) (l40s.compute.sparkdev.oraclevcn.com executor 1): ml.dmlc.xgboost4j.java.XGBoostError: [03:23:24] /workspace/jvm-packages/xgboost4j/src/native/xgboost4j-gpu.cu:331: [03:23:24] /workspace/src/common/device_vector.cu:23: Memory allocation error on worker 3: [03:23:24] /workspace/src/common/common.cu:16: /workspace/src/common/device_vector.cuh: 290: cudaErrorDevicesUnavailable: CUDA-capable device(s) is/are busy or unavailable
Stack trace:
  [bt] (0) /tmp/libxgboost4j8204609160539679832.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x6c) [0x7fdbd6f35d2c]
  [bt] (1) /tmp/libxgboost4j8204609160539679832.so(dh::ThrowOnCudaError(cudaError, char const*, int)+0x4c6) [0x7fdbd7636fb6]
  [bt] (2) /tmp/libxgboost4j8204609160539679832.so(thrust::THRUST_200601_500_600_700_800_900_NS::detail::vector_base<float, dh::detail::XGBDefaultDeviceAllocatorImpl<float> >::append(unsigned long)+0x15e) [0x7fdbd766edbe]
  [bt] (3) /tmp/libxgboost4j8204609160539679832.so(void xgboost::jni::CopyMetaInfo<float>(xgboost::Json*, thrust::THRUST_200601_500_600_700_800_900_NS::device_vector<float, dh::detail::XGBDefaultDeviceAllocatorImpl<float> >*, CUstream_st*)+0x31b) [0x7fdbd7baef9b]
  [bt] (4) /tmp/libxgboost4j8204609160539679832.so(xgboost::jni::DataIteratorProxy::StageMetaInfo(xgboost::Json)+0x231) [0x7fdbd7bb0621]
  [bt] (5) /tmp/libxgboost4j8204609160539679832.so(xgboost::jni::DataIteratorProxy::StageData(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >)+0xb0) [0x7fdbd7bb0ee0]
  [bt] (6) /tmp/libxgboost4j8204609160539679832.so(xgboost::jni::DataIteratorProxy::PullIterFromJVM()+0x195) [0x7fdbd7bb15d5]
  [bt] (7) /tmp/libxgboost4j8204609160539679832.so(xgboost::jni::(anonymous namespace)::Next(void*)+0x67) [0x7fdbd7ba99f7]
  [bt] (8) /tmp/libxgboost4j8204609160539679832.so(xgboost::data::IterativeDMatrix::IterativeDMatrix(void*, void*, std::shared_ptr<xgboost::DMatrix>, void (*)(void*), int (*)(void*), float, int, int, long)+0x22f) [0x7fdbd7258c5f]


- Free memory: 26.6166GB
- Requested memory: 17.3242KB

Stack trace:
  [bt] (0) /tmp/libxgboost4j8204609160539679832.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x6c) [0x7fdbd6f35d2c]
  [bt] (1) /tmp/libxgboost4j8204609160539679832.so(dh::detail::ThrowOOMError(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, unsigned long)+0x493) [0x7fdbd7637cd3]
  [bt] (2) /tmp/libxgboost4j8204609160539679832.so(thrust::THRUST_200601_500_600_700_800_900_NS::detail::vector_base<float, dh::detail::XGBDefaultDeviceAllocatorImpl<float> >::append(unsigned long)+0x2a7) [0x7fdbd766ef07]
  [bt] (3) /tmp/libxgboost4j8204609160539679832.so(void xgboost::jni::CopyMetaInfo<float>(xgboost::Json*, thrust::THRUST_200601_500_600_700_800_900_NS::device_vector<float, dh::detail::XGBDefaultDeviceAllocatorImpl<float> >*, CUstream_st*)+0x31b) [0x7fdbd7baef9b]
  [bt] (4) /tmp/libxgboost4j8204609160539679832.so(xgboost::jni::DataIteratorProxy::StageMetaInfo(xgboost::Json)+0x231) [0x7fdbd7bb0621]
  [bt] (5) /tmp/libxgboost4j8204609160539679832.so(xgboost::jni::DataIteratorProxy::StageData(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >)+0xb0) [0x7fdbd7bb0ee0]
  [bt] (6) /tmp/libxgboost4j8204609160539679832.so(xgboost::jni::DataIteratorProxy::PullIterFromJVM()+0x195) [0x7fdbd7bb15d5]
  [bt] (7) /tmp/libxgboost4j8204609160539679832.so(xgboost::jni::(anonymous namespace)::Next(void*)+0x67) [0x7fdbd7ba99f7]
  [bt] (8) /tmp/libxgboost4j8204609160539679832.so(xgboost::data::IterativeDMatrix::IterativeDMatrix(void*, void*, std::shared_ptr<xgboost::DMatrix>, void (*)(void*), int (*)(void*), float, int, int, long)+0x22f) [0x7fdbd7258c5f]


Stack trace:
  [bt] (0) /tmp/libxgboost4j8204609160539679832.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x6c) [0x7fdbd6f35d2c]
  [bt] (1) /tmp/libxgboost4j8204609160539679832.so(+0x3f5ea3) [0x7fdbd6df5ea3]
  [bt] (2) /tmp/libxgboost4j8204609160539679832.so(xgboost::data::IterativeDMatrix::IterativeDMatrix(void*, void*, std::shared_ptr<xgboost::DMatrix>, void (*)(void*), int (*)(void*), float, int, int, long)+0x22f) [0x7fdbd7258c5f]
  [bt] (3) /tmp/libxgboost4j8204609160539679832.so(xgboost::DMatrix* xgboost::DMatrix::Create<void*, void*, void (void*), int (void*)>(void*, void*, std::shared_ptr<xgboost::DMatrix>, void (*)(void*), int (*)(void*), float, int, int, long)+0x81) [0x7fdbd71e6871]
  [bt] (4) /tmp/libxgboost4j8204609160539679832.so(XGQuantileDMatrixCreateFromCallback+0x3af) [0x7fdbd6e3645f]
  [bt] (5) /tmp/libxgboost4j8204609160539679832.so(XGQuantileDMatrixCreateFromCallbackImpl+0x2bb) [0x7fdbd7ba977b]
  [bt] (6) /tmp/libxgboost4j8204609160539679832.so(Java_ml_dmlc_xgboost4j_java_XGBoostJNI_XGQuantileDMatrixCreateFromCallback+0x93) [0x7fdbd7b9abe3]
  [bt] (7) [0x7ff0350183e7]


	at ml.dmlc.xgboost4j.java.XGBoostJNI.checkCall(XGBoostJNI.java:48)
	at ml.dmlc.xgboost4j.java.QuantileDMatrix.<init>(QuantileDMatrix.java:69)
	at ml.dmlc.xgboost4j.java.QuantileDMatrix.<init>(QuantileDMatrix.java:38)
	at ml.dmlc.xgboost4j.scala.QuantileDMatrix.<init>(QuantileDMatrix.scala:36)
	at ml.dmlc.xgboost4j.scala.spark.GpuXGBoostPlugin.$anonfun$buildRddWatches$7(GpuXGBoostPlugin.scala:144)
	at scala.Option.getOrElse(Option.scala:189)
	at ml.dmlc.xgboost4j.scala.spark.GpuXGBoostPlugin.ml$dmlc$xgboost4j$scala$spark$GpuXGBoostPlugin$$buildQuantileDMatrix$1(GpuXGBoostPlugin.scala:144)
	at ml.dmlc.xgboost4j.scala.spark.GpuXGBoostPlugin$$anon$2.next(GpuXGBoostPlugin.scala:167)
	at ml.dmlc.xgboost4j.scala.spark.GpuXGBoostPlugin$$anon$2.next(GpuXGBoostPlugin.scala:164)
	at ml.dmlc.xgboost4j.scala.spark.XGBoost$.$anonfun$train$2(XGBoost.scala:252)
	at org.apache.spark.rdd.RDDBarrier.$anonfun$mapPartitions$2(RDDBarrier.scala:51)
	at org.apache.spark.rdd.RDDBarrier.$anonfun$mapPartitions$2$adapted(RDDBarrier.scala:51)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:364)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:328)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:364)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:328)
	at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:104)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:54)
	at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161)
	at org.apache.spark.scheduler.Task.run(Task.scala:141)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:620)
	at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally(SparkErrorUtils.scala:64)
	at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally$(SparkErrorUtils.scala:61)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:94)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:623)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:750)
  2. Passed with the following GPU settings:
    • gpu0: DEFAULT mode
    • gpu1: EXCLUSIVE_PROCESS
    • gpu2: EXCLUSIVE_PROCESS
    • gpu3: EXCLUSIVE_PROCESS

The processes observed on GPUs 1, 2, and 3 were also accessing GPU 0.
[screenshot]

@yinqingh
Author

cc @wbo4958 @NvTimLiu

@trivialfis
Member

Hi, could you please share the use case for exclusive mode with a Spark cluster? It seems quite difficult to work around if only a single process is allowed to access the GPU.

@yinqingh
Author

yinqingh commented Dec 20, 2024

No real use case, actually. I had not realized that it only works with DEFAULT mode.

Another strange behavior: processes 3981172, 3981171, and 3981167 (presumably the Spark executor processes) first ran on GPUs 1, 2, and 3, and then all three were accessing GPU 0 instead of GPUs 1, 2, and 3. I'm not sure whether this is expected behavior. You can see this in the processes section of the screenshot.

I also tried setting GPU 1 to DEFAULT mode, and the process still tried to access different GPUs.
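
For anyone who wants to check the same thing without a screenshot, here is a rough sketch that lists which GPUs each process holds a compute context on. It is not part of XGBoost or Spark. It assumes `nvidia-smi` is on the PATH and that its `--query-gpu=index,uuid` and `--query-compute-apps=gpu_uuid,pid` CSV queries are available in the installed driver. The helper names `gpus_per_pid` and `report` are made up for illustration.

```python
import subprocess
from collections import defaultdict

# nvidia-smi queries (assumed to be supported by the installed driver).
GPU_QUERY = ["nvidia-smi", "--query-gpu=index,uuid", "--format=csv,noheader"]
APP_QUERY = ["nvidia-smi", "--query-compute-apps=gpu_uuid,pid", "--format=csv,noheader"]


def gpus_per_pid(gpu_csv, apps_csv):
    """Map each PID to the sorted GPU indices on which it has a compute context.

    gpu_csv:  lines of "index, uuid"
    apps_csv: lines of "gpu_uuid, pid"
    """
    uuid_to_index = {}
    for line in gpu_csv.splitlines():
        if not line.strip():
            continue
        index, uuid = (field.strip() for field in line.split(","))
        uuid_to_index[uuid] = int(index)

    seen = defaultdict(set)
    for line in apps_csv.splitlines():
        if not line.strip():
            continue
        uuid, pid = (field.strip() for field in line.split(","))
        seen[int(pid)].add(uuid_to_index[uuid])
    return {pid: sorted(indices) for pid, indices in seen.items()}


def report():
    """Run both queries on the local machine and print one line per process."""
    def run(cmd):
        return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout

    for pid, gpus in sorted(gpus_per_pid(run(GPU_QUERY), run(APP_QUERY)).items()):
        print(f"pid {pid}: GPUs {gpus}")
```

Calling `report()` on the worker node while the job is running should show whether an executor PID appears on more than one GPU.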

@trivialfis
Member

trivialfis commented Jan 2, 2025

I don't think Spark or XGBoost takes GPU "modes" into consideration when allocating or accessing GPUs, and it's unlikely we will try to check the admin settings of the GPUs.
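
Since the compute mode is not checked by the libraries, a user could run a small pre-flight check on each node before submitting a job. The following is only a sketch under assumptions: it relies on `nvidia-smi --query-gpu=index,compute_mode --format=csv,noheader` being available, and on the mode being reported as a string such as `Default` or `Exclusive_Process` (the exact spelling may vary by driver, so the comparison is case-insensitive). The function names are hypothetical and not part of any library.

```python
import subprocess

# Assumed nvidia-smi query; prints one "index, compute_mode" line per GPU.
MODE_QUERY = ["nvidia-smi", "--query-gpu=index,compute_mode", "--format=csv,noheader"]


def non_default_gpus(mode_csv):
    """Return {gpu_index: compute_mode} for every GPU not in Default mode."""
    offenders = {}
    for line in mode_csv.splitlines():
        if not line.strip():
            continue
        index, mode = (field.strip() for field in line.split(",", 1))
        if mode.lower() != "default":
            offenders[int(index)] = mode
    return offenders


def check_compute_modes():
    """Query the local GPUs and raise if any is not in Default compute mode."""
    out = subprocess.run(MODE_QUERY, capture_output=True, text=True, check=True).stdout
    offenders = non_default_gpus(out)
    if offenders:
        raise RuntimeError(f"GPUs not in Default compute mode: {offenders}")
```

Calling `check_compute_modes()` at the start of a job script would fail early with the offending GPU indices, rather than surfacing later as `cudaErrorDevicesUnavailable` inside a training task.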

@trivialfis
Member

Feel free to reopen if there are further questions.
