Error compile: --bazel_options=--repo_env=LOCAL_CUDA_PATH="${CUDA_HOME}" issues with clang & gcc #23575

benkirk · 2024-09-11T18:56:46Z

Description

I'm attempting to build jaxlib with a local CUDA, cuDNN, and NCCL. I'm running into (different) issues with either gcc or clang. Any ideas?

Build command:

python build/build.py \
       --build_gpu_plugin --gpu_plugin_cuda_version=12 \
       --verbose \
       --enable_mkl_dnn \
       --enable_nccl \
       --enable_cuda \
       --cuda_compute_capabilities 8.0 \
       --target_cpu_features release \
       --bazel_options=--repo_env=LOCAL_CUDA_PATH="${CUDA_HOME}" \
       --bazel_options=--repo_env=LOCAL_CUDNN_PATH="${NCAR_ROOT_CUDNN}" \
       --bazel_options=--repo_env=LOCAL_NCCL_PATH="${PREFIX}"

`clang` error:

external/tsl/tsl/profiler/lib/nvtx_utils.cc:32:10: fatal error: 'third_party/gpus/cuda/include/cuda.h' file not found

`gcc` error:

# Configuration: d3d6c18c79c5128461901902331e6ad5ab5bc83fb9ca1bc29bc506f7fe919c16
# Execution platform: @local_execution_config_platform//:platform
gcc: error: unrecognized command-line option '--cuda-path=external/cuda_nvcc'

System info (python version, jaxlib version, accelerator, etc.)

jax:    0.4.31
jaxlib: 0.4.31
numpy:  2.1.1
python: 3.11.10 | packaged by conda-forge | (main, Sep 10 2024, 11:01:28) [GCC 13.3.0]
jax.devices (2 total, 2 local): [CudaDevice(id=0) CudaDevice(id=1)]
process_count: 1
platform: uname_result(system='Linux', node='derecho7', release='5.14.21-150400.24.18-default', version='#1 SMP PREEMPT_DYNAMIC Thu Aug 4 14:17:48 UTC 2022 (e9f7bfc)', machine='x86_64')


$ nvidia-smi
Wed Sep 11 12:37:51 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100 80GB PCIe          Off | 00000000:03:00.0 Off |                    0 |
| N/A   51C    P0              68W / 300W |    429MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA A100 80GB PCIe          Off | 00000000:C3:00.0 Off |                    0 |
| N/A   53C    P0              75W / 300W |    429MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A     54137      C   python                                      416MiB |
|    1   N/A  N/A     54137      C   python                                      416MiB |
+---------------------------------------------------------------------------------------+

johnnynunez · 2024-09-11T19:06:09Z

Unfortunately, you have to specify the CUDA version and the cuDNN version.
Clang does not detect them automatically.
If you are using this setup, it may be better to use JAX Toolbox:
https://github.com/NVIDIA/JAX-Toolbox

ybaturina · 2024-09-11T19:06:23Z

Hi @benkirk I'm going to update JAX docs with the link to XLA instructions.

From your command, I see that you provided environment variables:

       --bazel_options=--repo_env=LOCAL_CUDA_PATH="${CUDA_HOME}" \
       --bazel_options=--repo_env=LOCAL_CUDNN_PATH="${NCAR_ROOT_CUDNN}" \
       --bazel_options=--repo_env=LOCAL_NCCL_PATH="${PREFIX}"

Could you please provide the values of ${CUDA_HOME}, ${NCAR_ROOT_CUDNN}, and ${PREFIX} here?
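
Because an unset variable expands to an empty path in that command, a guard along the following lines could be run before the build. This is only a sketch: `local_path_flags` is a hypothetical helper (not part of JAX or XLA), it reuses only the three `--repo_env` names from the build command above, and it assumes each one should point at an existing directory.

```shell
#!/bin/bash
# Hypothetical helper (not part of JAX or XLA): print the three
# --bazel_options flags used in this thread, or fail if any path is
# empty or is not a directory.
local_path_flags() {
  local names=(LOCAL_CUDA_PATH LOCAL_CUDNN_PATH LOCAL_NCCL_PATH)
  local i

  if [[ $# -ne 3 ]]; then
    echo "usage: local_path_flags CUDA_DIR CUDNN_DIR NCCL_DIR" >&2
    return 2
  fi
  local dirs=("$@")

  # Validate everything first so no flags are printed on failure.
  for i in 0 1 2; do
    if [[ ! -d "${dirs[$i]}" ]]; then
      echo "error: ${names[$i]}='${dirs[$i]}' is not a directory" >&2
      return 1
    fi
  done

  for i in 0 1 2; do
    echo "--bazel_options=--repo_env=${names[$i]}=${dirs[$i]}"
  done
}

# Example (variable names taken from the build command above):
#   local_path_flags "${CUDA_HOME}" "${NCAR_ROOT_CUDNN}" "${PREFIX}"
```

An empty `${CUDA_HOME}` then shows up as an explicit error rather than as a confusing compiler failure later in the build.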

johnnynunez · 2024-09-11T19:07:15Z

The problem is described here:
openxla/xla#16877

I avoided a lot of problems this way:
dusty-nv/jetson-containers#626

johnnynunez · 2024-09-11T19:10:33Z

Also, this is necessary:
https://github.com/NVIDIA/JAX-Toolbox/blob/main/.github/container/install-cudnn.sh
and this:
https://github.com/NVIDIA/JAX-Toolbox/blob/main/.github/container/build-jax.sh

ln -s /usr/local/cuda/lib64 /usr/local/cuda/lib

johnnynunez · 2024-09-11T19:11:18Z

I've updated the script so that it does not download the files.

#!/bin/bash

set -e

CUDNN_MAJOR_VERSION=9
CUDA_MAJOR_VERSION=12.2
prefix=/opt/nvidia/cudnn
arch=$(uname -m)-linux-gnu
cuda_base_path="/usr/local/cuda-${CUDA_MAJOR_VERSION}"

# Check whether the specified CUDA path exists
if [[ -d "${cuda_base_path}" ]]; then
  cuda_lib_path="${cuda_base_path}/lib64"
  output_path="/usr/local/cuda-${CUDA_MAJOR_VERSION}/lib"
else
  cuda_lib_path="/usr/local/cuda/lib64"
  output_path="/usr/local/cuda/lib"
fi

# Create the symbolic link for CUDA (lib -> lib64)
sudo ln -s "${cuda_lib_path}" "${output_path}"

# Process the cuDNN files
for cudnn_file in $(dpkg -L libcudnn${CUDNN_MAJOR_VERSION} libcudnn${CUDNN_MAJOR_VERSION}-dev | sort -u); do
  if [[ -f "${cudnn_file}" || -h "${cudnn_file}" ]]; then
    nosysprefix="${cudnn_file#"/usr/"}"
    noarchinclude="${nosysprefix/#"include/${arch}"/include}"
    noverheader="${noarchinclude/%"_v${CUDNN_MAJOR_VERSION}.h"/.h}"
    noarchlib="${noverheader/#"lib/${arch}"/lib}"
    
    # Link under cuda_base_path if it exists, otherwise under /usr/local/cuda
    # (noarchlib already starts with include/ or lib/)
    if [[ -d "${cuda_base_path}" ]]; then
      link_name="${cuda_base_path}/${noarchlib}"
    else
      link_name="/usr/local/cuda/${noarchlib}"
    fi
    
    link_dir=$(dirname "${link_name}")
    mkdir -p "${link_dir}"
    ln -s "${cudnn_file}" "${link_name}"
  fi
done

benkirk · 2024-09-11T19:13:16Z

Thank you both, in my case

 --bazel_options=--repo_env=LOCAL_CUDA_PATH="/glade/u/apps/common/23.08/spack/opt/spack/cuda/12.2.1" \
 --bazel_options=--repo_env=LOCAL_CUDNN_PATH="/glade/u/apps/common/23.08/spack/opt/spack/cudnn/9.2.0.82-12" \
 --bazel_options=--repo_env=LOCAL_NCCL_PATH="<my_conda_build_prefix>"

I'll attempt providing the version strings on the command line as well and follow XLA instructions.

Building from source without a container definitely wasn't my first choice, but we do need a site-provided NCCL on this machine: it has a proprietary vendor network - Slingshot 11 - that needs some care & feeding.

johnnynunez · 2024-09-11T19:16:27Z

Yes, but that does not work because, as I mentioned before, CUDA needs a lib directory, not lib64, and the cuDNN files need to be renamed while maintaining a certain directory structure. It's very tricky. In the 0.4.31 release this was easier because it used cuda_path etc., but now JAX uses XLA's hermetic CUDA, which handles everything automatically.
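
A quick way to see whether a local CUDA root has the pieces mentioned in this thread might look like the sketch below. `check_local_cuda_layout` is a hypothetical helper, not part of JAX or XLA. It checks only two things taken from the discussion (a `lib` directory and `include/cuda.h`, the header named in the clang error); the complete layout the build expects is not spelled out here, so passing this check does not guarantee a working build.

```shell
#!/bin/bash
# Hypothetical helper (not part of JAX or XLA): report which of the pieces
# discussed in this thread are missing under a local CUDA root.
check_local_cuda_layout() {
  local root="$1"
  local missing=0

  # The thread states that a lib directory is needed, not only lib64.
  # -d follows symlinks, so a lib -> lib64 link is accepted.
  if [[ ! -d "${root}/lib" ]]; then
    echo "missing: ${root}/lib"
    missing=1
  fi

  # The clang error in the issue was about a cuda.h that could not be found.
  if [[ ! -f "${root}/include/cuda.h" ]]; then
    echo "missing: ${root}/include/cuda.h"
    missing=1
  fi

  return "${missing}"
}

# Example:
#   check_local_cuda_layout "${CUDA_HOME}" || echo "layout needs fixing"
```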

hawkinsp · 2024-09-11T19:16:31Z

@benkirk You don't need to build JAX from source to use a custom NCCL. We'll use whichever libnccl.so we find in your LD_LIBRARY_PATH.
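
As a rough sketch of that approach, the directory holding the site-provided libnccl.so can be placed at the front of LD_LIBRARY_PATH before starting Python. `prepend_ld_library_path` is a hypothetical helper, and the path in the usage comment is a placeholder rather than a real location.

```shell
#!/bin/bash
# Hypothetical helper: put a directory at the front of LD_LIBRARY_PATH so
# that libraries in it are found before any others on the path.
prepend_ld_library_path() {
  local dir="$1"
  if [[ -z "${LD_LIBRARY_PATH:-}" ]]; then
    export LD_LIBRARY_PATH="${dir}"
  else
    export LD_LIBRARY_PATH="${dir}:${LD_LIBRARY_PATH}"
  fi
}

# Example (placeholder path; substitute the directory that holds libnccl.so):
#   prepend_ld_library_path /path/to/site/nccl/lib
#   python -c "import jax; print(jax.devices())"
```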

benkirk · 2024-09-11T19:26:03Z

Thanks @hawkinsp, I've got my NCCL injected properly with jax[cuda12]==0.4.31 installed from pip. I had a few issues trying jax[cuda12_local]==0.4.31; I'll revisit that as an alternative parallel path.

ybaturina · 2024-09-11T19:57:48Z

Hi @johnnynunez, I understand your concerns; I tried to address them in the comment here.

benkirk added the bug Something isn't working label Sep 11, 2024
