Why is clang(CUDA) + OpenMP not supported? #639

Open
tdd11235813 opened this issue Sep 12, 2018 · 27 comments · May be fixed by #2153

@tdd11235813
Contributor

tdd11235813 commented Sep 12, 2018

What was the reason for disabling CUDA (clang) + OpenMP? Does clang not support this, or was it alpaka related?
( corresponding PR )

@BenjaminW3
Member

It was already disabled before this PR. I do not remember all the details, but as far as I recall, clang simply does not support doing both at once.

@BenjaminW3
Member

If you want to find out the reason, you can simply remove those lines from the CMakeLists and create a PR. The CI will tell you the truth. Maybe it is supported now.

@tdd11235813
Contributor Author

It seems clang disables OpenMP when clang (CUDA) is involved: _OPENMP is not defined and the omp pragmas have no effect. When CUDA is disabled, it works again. Tested with clang 5 + CUDA 8 and clang 6 + CUDA 9.0.
I have opened an issue over there.
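
For reference, a minimal sketch of the kind of check that exposes this behaviour (the file name, flags, and messages below are assumptions for illustration, not taken from this thread):

// check_openmp.cu -- hypothetical reproduction sketch: does -fopenmp still define
// _OPENMP when the translation unit is compiled as CUDA?
// Assumed invocation:
//   clang++ -x cuda -std=c++11 -fopenmp --cuda-gpu-arch=sm_30 check_openmp.cu \
//           -o check_openmp -L/usr/local/cuda/lib64 -lcudart -ldl -lrt -pthread
#include <cstdio>

__global__ void dummy() {} // forces the CUDA compilation path

int main()
{
#ifdef _OPENMP
    std::printf("_OPENMP = %d\n", _OPENMP);   // expected when -fopenmp is honoured
#else
    std::printf("_OPENMP is not defined\n");  // the behaviour reported above
#endif
    dummy<<<1, 1>>>();
    return 0;
}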

@ax3l changed the title from "Why clang(CUDA) + OpenMP is not supported?" to "Why is clang(CUDA) + OpenMP not supported?" on Sep 17, 2018
@ax3l
Member

ax3l commented Sep 17, 2018

From a quick test, nvcc -std=c++11 -Xcompiler=-fopenmp (CUDA 9.2) works, and clang -x cuda -std=c++11 -fopenmp (trunk) does as well...

If you want to find out the reason, you can simply remove those lines from the CMakeLists and create a PR. The CI will tell you the truth. Maybe it is supported now.

My suggestion as well.

@tdd11235813
Contributor Author

I will test with the new versions next week, when our Intel/Nvidia developer system is up.
(On the cluster, the modules are not built properly or are missing.)

@tdd11235813
Contributor Author

tdd11235813 commented Sep 24, 2018

Using clang 7 and CUDA 10.0 I ran into a known issue:

clang-7: error: cannot find libdevice for sm_30. Provide path to different CUDA installation via --cuda-path, or pass -nocudalib to build without linking with libdevice.

libdevice is where clang expects it.

$ ls /usr/local/cuda/nvvm/libdevice
libdevice.10.bc

However, as you can see, there is only that single libdevice.10.bc. As of CUDA 9, the toolkit no longer ships separate libdevice files per architecture. See also this forum thread on NVIDIA.
Clang recognizes the "one-size-fits-all" libdevice, but maybe it still does not like CUDA 10.0.

So I installed CUDA 9.2 via spack (which silently overwrote external system libraries again^^). With that, clang picks up the libdevice, but the OpenMP and CUDA (clang) back-ends still do not play together: OpenMP is somehow disabled by clang when CUDA (clang) is used in conjunction.

@ax3l
Member

ax3l commented Sep 24, 2018

OK, as of Friday CUDA 10 is not yet supported in clang trunk ;-)
https://github.com/llvm-mirror/clang/blob/master/include/clang/Basic/Cuda.h
See the all-mighty CUDA gist: https://gist.github.com/ax3l/9489132

@tdd11235813
Contributor Author

12 hours ago they pushed the CUDA 10 commit :)

@ax3l
Member

ax3l commented Sep 25, 2018

Yep, I mailed the author :)

@BenjaminW3
Member

It seems clang disables OpenMP when clang (CUDA) is involved: _OPENMP is not defined and the omp pragmas have no effect. When CUDA is disabled, it works again. Tested with clang 5 + CUDA 8 and clang 6 + CUDA 9.0.
I have opened an issue over there.

There has been an answer on the linked issue. We may want to reevaluate the CUDA (clang) + OpenMP support for clang 8.0.

@BenjaminW3
Member

BenjaminW3 commented Mar 25, 2019

I just tried with clang 8.0 and I get:
AccCpuOmp2Blocks.hpp:15:5: error: '_OPENMP' is not defined, evaluates to 0 [-Werror,-Wundef]
I am not sure whether the answer in the bug thread is still the same as when I last read it.
It references a patched-upstream branch of this clang repo. It would be really cool if someone could test that branch; otherwise we can simply wait for clang 9.0 ;-)
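
For context, a sketch of how this kind of -Wundef error arises and what a tolerant check could look like (the exact condition and message used in AccCpuOmp2Blocks.hpp are assumptions here, not the real alpaka code):

// Hypothetical illustration only.
// With -Werror,-Wundef, evaluating an undefined macro inside #if is a hard error:
#if _OPENMP < 200203   // error: '_OPENMP' is not defined, evaluates to 0
    #error An OpenMP 2.0 capable compiler is required!
#endif

// A -Wundef-tolerant variant tests for the macro's existence first; the
// short-circuited right-hand side is then not diagnosed:
#if !defined(_OPENMP) || _OPENMP < 200203
    #error An OpenMP 2.0 capable compiler is required!
#endif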

@tdd11235813
Contributor Author

Sorry for not replying earlier. I can only report the latest state:
I got build errors when using clang-9 (patched-upstream) and CUDA (tested 9.0.176, 9.2.88, 10.0.130).

clang compilation

spack load [email protected]
SOURCE_ROOT=$HOME/sources
LLVM_INSTALL_DIR=$HOME/software/llvm-dev-gcc5/

cd $SOURCE_ROOT
# get llvm, using a different clang source (the patched-upstream branch)
git clone --single-branch -b patched-upstream https://github.com/clang-ykt/llvm
cd llvm/tools
# rm -r clang   # cleanup of a previous in-tree clang checkout
# get the clang development sources
git clone --single-branch -b patched-upstream https://github.com/clang-ykt/clang
cd ..
mkdir -p build
cd build
cmake -DLLVM_BUILD_TESTS=OFF -DLLVM_PARALLEL_COMPILE_JOBS=8 -DLLVM_PARALLEL_LINK_JOBS=8 -DLLVM_BUILD_UTILS=OFF -DCMAKE_INSTALL_PREFIX=$LLVM_INSTALL_DIR -DCMAKE_BUILD_TYPE=RelWithDebInfo ../

make -j 8
make install

Alpaka compilation

module load [email protected]
module load [email protected]
cmake -DCMAKE_CXX_COMPILER=$HOME/software/llvm-dev-gcc5/bin/clang-9 -DALPAKA_CUDA_COMPILER=clang ..
make axpy

Errors

make axpy
[  0%] Building CXX object test/CMakeFiles/CatchMain.dir/CatchMain.cpp.o
[  0%] Linking CXX static library libCatchMain.a
[  0%] Built target CatchMain
[ 50%] Building CXX object test/common/CMakeFiles/common.dir/src/Dummy.cpp.o
[ 50%] Linking CXX static library libcommon.a
[ 50%] Built target common
[100%] Building CXX object test/integ/axpy/CMakeFiles/axpy.dir/src/axpy.cpp.o
In file included from <built-in>:1:
In file included from /home/mwerner/software/llvm-dev-gcc5/lib/clang/9.0.0/include/__clang_cuda_runtime_wrapper.h:327:
/opt/spack-modules/linux-ubuntu16.04-x86_64/gcc-5.4.0/cuda-10.0.130-2ncant54wqot3prhetq4yyqybnynppi4/include/crt/math_functions.hpp:671:10: error: no matching function for call to 'lgammaf'
  return lgammaf(a);
         ^~~~~~~
/usr/include/x86_64-linux-gnu/bits/math-finite.h:319:37: note: candidate function not viable: call to __host__ function from __device__ function
__extern_always_inline float __NTH (lgammaf (float __d))
                                    ^
In file included from <built-in>:1:
In file included from /home/mwerner/software/llvm-dev-gcc5/lib/clang/9.0.0/include/__clang_cuda_runtime_wrapper.h:327:
/opt/spack-modules/linux-ubuntu16.04-x86_64/gcc-5.4.0/cuda-10.0.130-2ncant54wqot3prhetq4yyqybnynppi4/include/crt/math_functions.hpp:676:10: error: no matching function for call to 'tgammaf'
  return tgammaf(a);
         ^~~~~~~
/usr/include/x86_64-linux-gnu/bits/math-finite.h:487:37: note: candidate function not viable: call to __host__ function from __device__ function
__extern_always_inline float __NTH (tgammaf (float __d))
                                    ^
In file included from /home/mwerner/cuda-workspace/alpaka/test/integ/axpy/src/axpy.cpp:15:
In file included from /home/mwerner/cuda-workspace/alpaka/include/alpaka/alpaka.hpp:22:
In file included from /home/mwerner/cuda-workspace/alpaka/include/alpaka/acc/AccCpuSerial.hpp:38:
/home/mwerner/cuda-workspace/alpaka/include/alpaka/dev/DevCpu.hpp:68:21: error: explicitly defaulted copy constructor is implicitly deleted [-Werror,-Wdefaulted-function-deleted]
                    DevCpuImpl(DevCpuImpl const &) = default;

[...]

In file included from /home/mwerner/cuda-workspace/alpaka/test/integ/axpy/src/axpy.cpp:15:
In file included from /home/mwerner/cuda-workspace/alpaka/include/alpaka/alpaka.hpp:29:
In file included from /home/mwerner/cuda-workspace/alpaka/include/alpaka/acc/AccGpuCudaRt.hpp:31:
In file included from /home/mwerner/cuda-workspace/alpaka/include/alpaka/rand/RandCuRand.hpp:27:
In file included from /opt/spack-modules/linux-ubuntu16.04-x86_64/gcc-5.4.0/cuda-10.0.130-2ncant54wqot3prhetq4yyqybnynppi4/include/curand_kernel.h:70:
/opt/spack-modules/linux-ubuntu16.04-x86_64/gcc-5.4.0/cuda-10.0.130-2ncant54wqot3prhetq4yyqybnynppi4/include/curand_philox4x32_x.h:81:9: error: macro name is a reserved identifier
      [-Werror,-Wreserved-id-macro]
#define CURAND_PHILOX4X32_X__H_
        ^
/opt/spack-modules/linux-ubuntu16.04-x86_64/gcc-5.4.0/cuda-10.0.130-2ncant54wqot3prhetq4yyqybnynppi4/include/curand_philox4x32_x.h:108:23: error: use of old-style cast
      [-Werror,-Wold-style-cast]
   unsigned int nlo = (unsigned int)(n);

@SimeonEhrig reported that clang-9 was working with a simple CUDA program. I'll talk to him about this issue...

@BenjaminW3
Member

BenjaminW3 commented Apr 2, 2019

Thank you for trying this!

  • The DevCpu.hpp:68:21: error: explicitly defaulted copy constructor is implicitly deleted should have been fixed by the clang-8 support PR last week.
  • The issues within curand_philox4x32_x.h should probably be ignored or suppressed (or we could disable -Werror for those experiments).
  • I am not yet sure what to do about the lgammaf and tgammaf issues. We do not call those functions; we simply include crt/math_functions.hpp. This looks like a bug in clang-9's CUDA support.

@tdd11235813
Contributor Author

Thanks for the quick response. I will try the new version of clang.
Regarding the math issue: the header __clang_cuda_math_forward_declares.h does not declare lgammaf, only lgamma; the same holds for tgamma. It looks like they are used later on via clang's Builtins.def.
I'll report back after testing the new branch. The declarations still seem to be missing in that clang CUDA math header.
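
For illustration, a sketch of the declarations that appear to be missing (the exact spelling is an assumption; the real header defines its own __DEVICE__ macro, and the stand-in below is only there to make the snippet self-contained):

// Hypothetical addition to __clang_cuda_math_forward_declares.h.
#ifndef __DEVICE__
#define __DEVICE__ static __inline__ __attribute__((always_inline)) __attribute__((device))
#endif

// The header forward-declares the lgamma/tgamma overloads, but apparently not the
// f-suffixed C names that crt/math_functions.hpp calls:
__DEVICE__ float lgammaf(float __a);
__DEVICE__ float tgammaf(float __a);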

@tdd11235813
Contributor Author

OK, the new versions of alpaka and clang helped, except for the math stuff, but adding the missing functions to the aforementioned header works ... now I just get a glibc linker issue; our Ubuntu 16.04 probably has a glibc that is too old, so __logl_finite is not available.

/usr/bin/ld: CMakeFiles/axpy.dir/src/axpy.cpp.o: undefined reference to symbol '__logl_finite@@GLIBC_2.15'

Well, let's call it a day...

@SimeonEhrig
Member

@tdd11235813
I tried OpenMP with CUDA in an Ubuntu 18.04 container with Clang 9.0 and CUDA 10.0 on an Ubuntu 16.04 host system. Using an Ubuntu 18.04 environment might solve the problem, or at least help to narrow it down.

@ax3l
Member

ax3l commented Apr 3, 2019

Thanks for digging into this everyone!

@SimeonEhrig can you pls post your Dockerfile for reproducibility?

@tdd11235813 can you pls link the upstream LLVM bug report regarding the missing define in the __clang_cuda_math_forward_declares.h header?

@SimeonEhrig
Member

SimeonEhrig commented Apr 3, 2019

Here are my source code and the test environment:

source code:

#include <iostream>
#include <omp.h>

__global__ void foo(float * v, unsigned int start, unsigned int size, float id){
        for(unsigned int i = start; i < start+size; ++i){
          v[i] = lgammaf(id);
        }
}


int main(int argc, char const *argv[])
{
        const unsigned int number_of_threads = 14;
        const unsigned int vector_size = 4;

        float v_host[number_of_threads*vector_size];
        float * v_device;

        cudaMalloc( (void**) &v_device, sizeof(float)*number_of_threads*vector_size);

        // try to generate a race condition to show that OpenMP works
        #pragma omp parallel for
        for(int i = 0; i < number_of_threads; ++i)
        {
                std::cout << "Hello my id is: " << omp_get_thread_num() << std::endl;
        }

        #pragma omp parallel for
        for(unsigned int i = 0; i < number_of_threads; ++i){
          foo<<<1, 1>>>(v_device, i*vector_size, vector_size, static_cast<float>(omp_get_thread_num()+1) );
        }

        cudaMemcpy(v_host, v_device, sizeof(float)*number_of_threads*vector_size, cudaMemcpyDeviceToHost);

        for(unsigned int i = 0; i < number_of_threads; ++i){
                for(unsigned int j = 0; j < vector_size; ++j){
                        std::cout << v_host[i*vector_size+j] << " ";
                }
                std::cout << std::endl;
        }

        return 0;
}

makefile:

id_vector : id_vector_omp.cu
        clang++ -std=c++11 id_vector_omp.cu -o id_vector --cuda-gpu-arch=sm_30 -L/usr/local/cuda/lib64 -lcudart_static -ldl -lrt -pthread -fopenmp

singularity container:

Bootstrap: docker
From: nvidia/cuda:10.0-devel-ubuntu18.04

%help
        This container is based on the CUDA 10.0 docker container with Ubuntu 18.04 environment: https://hub.docker.com/r/nvidia/cuda/
        
%setup

%files

%labels
        Maintainer Simeon Ehrig
        Version 1.0

%environment
        export PATH=$PATH:/usr/local/cuda/bin/
        export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda/lib64   

%post
        apt update
        apt install -y git nano wget gdb cmake make python2.7 zlib1g-dev libomp-dev

I tested it with the current development branch of clang/llvm (9.0-dev c64d73cce240148ea5f38d70f1851373036e716d - https://clang.llvm.org/get_started.html) and with clang-ykt (following Matthias's instructions).

It works with the vanilla Clang. The clang-ykt build throws the following error:

clang++ -std=c++11 id_vector_omp.cu -o id_vector --cuda-gpu-arch=sm_30 -L/usr/local/cuda/lib64 -lcudart_static -ldl -lrt -pthread -fopenmp
In file included from <built-in>:1:
In file included from /home/sehrig/projects/clang-ykt/install/lib/clang/9.0.0/include/__clang_cuda_runtime_wrapper.h:162:
/home/sehrig/projects/clang-ykt/install/lib/clang/9.0.0/include/__clang_cuda_device_functions.h:61:59: error: no
      matching function for call to '__nv_brev'
__DEVICE__ unsigned int __brev(unsigned int __a) { return __nv_brev(__a); }
                                                          ^~~~~~~~~
/home/sehrig/projects/clang-ykt/install/lib/clang/9.0.0/include/__clang_cuda_libdevice_declares.h:52:16: note: 
      candidate function not viable: call to __host__ function from __device__ function
__DEVICE__ int __nv_brev(int __a);
               ^
In file included from <built-in>:1:
In file included from /home/sehrig/projects/clang-ykt/install/lib/clang/9.0.0/include/__clang_cuda_runtime_wrapper.h:162:
/home/sehrig/projects/clang-ykt/install/lib/clang/9.0.0/include/__clang_cuda_device_functions.h:63:10: error: no
      matching function for call to '__nv_brevll'
  return __nv_brevll(__a);
         ^~~~~~~~~~~
/home/sehrig/projects/clang-ykt/install/lib/clang/9.0.0/include/__clang_cuda_libdevice_declares.h:53:22: note: 
      candidate function not viable: call to __host__ function from __device__ function
__DEVICE__ long long __nv_brevll(long long __a);
                     ^
In file included from <built-in>:1:
In file included from /home/sehrig/projects/clang-ykt/install/lib/clang/9.0.0/include/__clang_cuda_runtime_wrapper.h:162:
/home/sehrig/projects/clang-ykt/install/lib/clang/9.0.0/include/__clang_cuda_device_functions.h:71:10: error: no
      matching function for call to '__nv_byte_perm'
  return __nv_byte_perm(__a, __b, __c);
         ^~~~~~~~~~~~~~
/home/sehrig/projects/clang-ykt/install/lib/clang/9.0.0/include/__clang_cuda_libdevice_declares.h:54:16: note: 
      candidate function not viable: call to __host__ function from __device__ function
__DEVICE__ int __nv_byte_perm(int __a, int __b, int __c);
               ^
In file included from <built-in>:1:
In file included from /home/sehrig/projects/clang-ykt/install/lib/clang/9.0.0/include/__clang_cuda_runtime_wrapper.h:162:
/home/sehrig/projects/clang-ykt/install/lib/clang/9.0.0/include/__clang_cuda_device_functions.h:73:40: error: no
      matching function for call to '__nv_clz'
__DEVICE__ int __clz(int __a) { return __nv_clz(__a); }
                                       ^~~~~~~~
/home/sehrig/projects/clang-ykt/install/lib/clang/9.0.0/include/__clang_cuda_libdevice_declares.h:59:16: note: 
      candidate function not viable: call to __host__ function from __device__ function
__DEVICE__ int __nv_clz(int __a);
               ^
In file included from <built-in>:1:
In file included from /home/sehrig/projects/clang-ykt/install/lib/clang/9.0.0/include/__clang_cuda_runtime_wrapper.h:162:
/home/sehrig/projects/clang-ykt/install/lib/clang/9.0.0/include/__clang_cuda_device_functions.h:74:48: error: no
      matching function for call to '__nv_clzll'
__DEVICE__ int __clzll(long long __a) { return __nv_clzll(__a); }
                                               ^~~~~~~~~~
/home/sehrig/projects/clang-ykt/install/lib/clang/9.0.0/include/__clang_cuda_libdevice_declares.h:60:16: note: 
      candidate function not viable: call to __host__ function from __device__ function
__DEVICE__ int __nv_clzll(long long __a);
               ^
In file included from <built-in>:1:
In file included from /home/sehrig/projects/clang-ykt/install/lib/clang/9.0.0/include/__clang_cuda_runtime_wrapper.h:162:
/home/sehrig/projects/clang-ykt/install/lib/clang/9.0.0/include/__clang_cuda_device_functions.h:75:45: error: no
      matching function for call to '__nv_fast_cosf'
__DEVICE__ float __cosf(float __a) { return __nv_fast_cosf(__a); }
                                            ^~~~~~~~~~~~~~
/home/sehrig/projects/clang-ykt/install/lib/clang/9.0.0/include/__clang_cuda_libdevice_declares.h:144:18: note: 
      candidate function not viable: call to __host__ function from __device__ function
__DEVICE__ float __nv_fast_cosf(float __a);
                 ^
In file included from <built-in>:1:
In file included from /home/sehrig/projects/clang-ykt/install/lib/clang/9.0.0/include/__clang_cuda_runtime_wrapper.h:162:
/home/sehrig/projects/clang-ykt/install/lib/clang/9.0.0/include/__clang_cuda_device_functions.h:86:10: error: no
      matching function for call to '__nv_dadd_rd'
  return __nv_dadd_rd(__a, __b);
         ^~~~~~~~~~~~
/home/sehrig/projects/clang-ykt/install/lib/clang/9.0.0/include/__clang_cuda_libdevice_declares.h:73:19: note: 
      candidate function not viable: call to __host__ function from __device__ function
__DEVICE__ double __nv_dadd_rd(double __a, double __b);
                  ^
In file included from <built-in>:1:
In file included from /home/sehrig/projects/clang-ykt/install/lib/clang/9.0.0/include/__clang_cuda_runtime_wrapper.h:162:
/home/sehrig/projects/clang-ykt/install/lib/clang/9.0.0/include/__clang_cuda_device_functions.h:89:10: error: no
      matching function for call to '__nv_dadd_rn'
  return __nv_dadd_rn(__a, __b);
         ^~~~~~~~~~~~
/home/sehrig/projects/clang-ykt/install/lib/clang/9.0.0/include/__clang_cuda_libdevice_declares.h:74:19: note: 
      candidate function not viable: call to __host__ function from __device__ function
__DEVICE__ double __nv_dadd_rn(double __a, double __b);
                  ^
In file included from <built-in>:1:
In file included from /home/sehrig/projects/clang-ykt/install/lib/clang/9.0.0/include/__clang_cuda_runtime_wrapper.h:162:
/home/sehrig/projects/clang-ykt/install/lib/clang/9.0.0/include/__clang_cuda_device_functions.h:92:10: error: no
      matching function for call to '__nv_dadd_ru'
  return __nv_dadd_ru(__a, __b);
         ^~~~~~~~~~~~
/home/sehrig/projects/clang-ykt/install/lib/clang/9.0.0/include/__clang_cuda_libdevice_declares.h:75:19: note: 
      candidate function not viable: call to __host__ function from __device__ function
__DEVICE__ double __nv_dadd_ru(double __a, double __b);
                  ^
In file included from <built-in>:1:
In file included from /home/sehrig/projects/clang-ykt/install/lib/clang/9.0.0/include/__clang_cuda_runtime_wrapper.h:162:
/home/sehrig/projects/clang-ykt/install/lib/clang/9.0.0/include/__clang_cuda_device_functions.h:95:10: error: no
      matching function for call to '__nv_dadd_rz'
  return __nv_dadd_rz(__a, __b);
         ^~~~~~~~~~~~
/home/sehrig/projects/clang-ykt/install/lib/clang/9.0.0/include/__clang_cuda_libdevice_declares.h:76:19: note: 
      candidate function not viable: call to __host__ function from __device__ function
__DEVICE__ double __nv_dadd_rz(double __a, double __b);
                  ^
In file included from <built-in>:1:
In file included from /home/sehrig/projects/clang-ykt/install/lib/clang/9.0.0/include/__clang_cuda_runtime_wrapper.h:162:
/home/sehrig/projects/clang-ykt/install/lib/clang/9.0.0/include/__clang_cuda_device_functions.h:98:10: error: no
      matching function for call to '__nv_ddiv_rd'
  return __nv_ddiv_rd(__a, __b);
         ^~~~~~~~~~~~
/home/sehrig/projects/clang-ykt/install/lib/clang/9.0.0/include/__clang_cuda_libdevice_declares.h:77:19: note: 
      candidate function not viable: call to __host__ function from __device__ function
__DEVICE__ double __nv_ddiv_rd(double __a, double __b);
                  ^
In file included from <built-in>:1:
In file included from /home/sehrig/projects/clang-ykt/install/lib/clang/9.0.0/include/__clang_cuda_runtime_wrapper.h:162:
/home/sehrig/projects/clang-ykt/install/lib/clang/9.0.0/include/__clang_cuda_device_functions.h:101:10: error: no
      matching function for call to '__nv_ddiv_rn'
  return __nv_ddiv_rn(__a, __b);
         ^~~~~~~~~~~~
/home/sehrig/projects/clang-ykt/install/lib/clang/9.0.0/include/__clang_cuda_libdevice_declares.h:78:19: note: 
      candidate function not viable: call to __host__ function from __device__ function
__DEVICE__ double __nv_ddiv_rn(double __a, double __b);
                  ^
In file included from <built-in>:1:
In file included from /home/sehrig/projects/clang-ykt/install/lib/clang/9.0.0/include/__clang_cuda_runtime_wrapper.h:162:
/home/sehrig/projects/clang-ykt/install/lib/clang/9.0.0/include/__clang_cuda_device_functions.h:104:10: error: no
      matching function for call to '__nv_ddiv_ru'
  return __nv_ddiv_ru(__a, __b);
         ^~~~~~~~~~~~
/home/sehrig/projects/clang-ykt/install/lib/clang/9.0.0/include/__clang_cuda_libdevice_declares.h:79:19: note: 
      candidate function not viable: call to __host__ function from __device__ function
__DEVICE__ double __nv_ddiv_ru(double __a, double __b);
                  ^
In file included from <built-in>:1:
In file included from /home/sehrig/projects/clang-ykt/install/lib/clang/9.0.0/include/__clang_cuda_runtime_wrapper.h:162:
/home/sehrig/projects/clang-ykt/install/lib/clang/9.0.0/include/__clang_cuda_device_functions.h:107:10: error: no
      matching function for call to '__nv_ddiv_rz'
  return __nv_ddiv_rz(__a, __b);
         ^~~~~~~~~~~~
/home/sehrig/projects/clang-ykt/install/lib/clang/9.0.0/include/__clang_cuda_libdevice_declares.h:80:19: note: 
      candidate function not viable: call to __host__ function from __device__ function
__DEVICE__ double __nv_ddiv_rz(double __a, double __b);
                  ^
In file included from <built-in>:1:
In file included from /home/sehrig/projects/clang-ykt/install/lib/clang/9.0.0/include/__clang_cuda_runtime_wrapper.h:162:
/home/sehrig/projects/clang-ykt/install/lib/clang/9.0.0/include/__clang_cuda_device_functions.h:110:10: error: no
      matching function for call to '__nv_dmul_rd'
  return __nv_dmul_rd(__a, __b);
         ^~~~~~~~~~~~
/home/sehrig/projects/clang-ykt/install/lib/clang/9.0.0/include/__clang_cuda_libdevice_declares.h:81:19: note: 
      candidate function not viable: call to __host__ function from __device__ function
__DEVICE__ double __nv_dmul_rd(double __a, double __b);
                  ^
In file included from <built-in>:1:
In file included from /home/sehrig/projects/clang-ykt/install/lib/clang/9.0.0/include/__clang_cuda_runtime_wrapper.h:162:
/home/sehrig/projects/clang-ykt/install/lib/clang/9.0.0/include/__clang_cuda_device_functions.h:113:10: error: no
      matching function for call to '__nv_dmul_rn'
  return __nv_dmul_rn(__a, __b);
         ^~~~~~~~~~~~
/home/sehrig/projects/clang-ykt/install/lib/clang/9.0.0/include/__clang_cuda_libdevice_declares.h:82:19: note: 
      candidate function not viable: call to __host__ function from __device__ function
__DEVICE__ double __nv_dmul_rn(double __a, double __b);
                  ^
In file included from <built-in>:1:
In file included from /home/sehrig/projects/clang-ykt/install/lib/clang/9.0.0/include/__clang_cuda_runtime_wrapper.h:162:
/home/sehrig/projects/clang-ykt/install/lib/clang/9.0.0/include/__clang_cuda_device_functions.h:116:10: error: no
      matching function for call to '__nv_dmul_ru'
  return __nv_dmul_ru(__a, __b);
         ^~~~~~~~~~~~
/home/sehrig/projects/clang-ykt/install/lib/clang/9.0.0/include/__clang_cuda_libdevice_declares.h:83:19: note: 
      candidate function not viable: call to __host__ function from __device__ function
__DEVICE__ double __nv_dmul_ru(double __a, double __b);
                  ^
In file included from <built-in>:1:
In file included from /home/sehrig/projects/clang-ykt/install/lib/clang/9.0.0/include/__clang_cuda_runtime_wrapper.h:162:
/home/sehrig/projects/clang-ykt/install/lib/clang/9.0.0/include/__clang_cuda_device_functions.h:119:10: error: no
      matching function for call to '__nv_dmul_rz'
  return __nv_dmul_rz(__a, __b);
         ^~~~~~~~~~~~
/home/sehrig/projects/clang-ykt/install/lib/clang/9.0.0/include/__clang_cuda_libdevice_declares.h:84:19: note: 
      candidate function not viable: call to __host__ function from __device__ function
__DEVICE__ double __nv_dmul_rz(double __a, double __b);
                  ^
In file included from <built-in>:1:
In file included from /home/sehrig/projects/clang-ykt/install/lib/clang/9.0.0/include/__clang_cuda_runtime_wrapper.h:162:
/home/sehrig/projects/clang-ykt/install/lib/clang/9.0.0/include/__clang_cuda_device_functions.h:122:10: error: no
      matching function for call to '__nv_double2float_rd'
  return __nv_double2float_rd(__a);
         ^~~~~~~~~~~~~~~~~~~~
/home/sehrig/projects/clang-ykt/install/lib/clang/9.0.0/include/__clang_cuda_libdevice_declares.h:85:18: note: 
      candidate function not viable: call to __host__ function from __device__ function
__DEVICE__ float __nv_double2float_rd(double __a);
                 ^
fatal error: too many errors emitted, stopping now [-ferror-limit=]
20 errors generated when compiling for host.
makefile:2: recipe for target 'id_vector' failed
make: *** [id_vector] Error 1

@tdd11235813
Contributor Author

So you did not get the lgammaf error, or did you fix the clang CUDA header beforehand as well?

@SimeonEhrig
Member

No, I didn't fix the header beforehand. I built clang/llvm from the original GitHub repo and compiled the source code with the makefile you see in the previous comment.

What is special about clang-ykt? I know the project as a fork of clang which implemented the OpenMP 4.0 GPU offloading for CUDA, but this functionality was merged upstream. So what is its special purpose now?

@tdd11235813
Contributor Author

I can't remember; if upstream works, then upstream is fine. I'll test the whole thing with the upstream version again.

@tdd11235813
Contributor Author

Current WIP:
I tried upstream llvm and only some linker errors were left, which could be fixed in alpaka:

diff --git a/alpakaConfig.cmake b/alpakaConfig.cmake
index 9bd4732..0d2d0dc 100644
--- a/alpakaConfig.cmake
+++ b/alpakaConfig.cmake
@@ -864,6 +864,12 @@ ELSE()
     ELSEIF(CMAKE_CXX_COMPILER_ID MATCHES "Clang")
         LIST(APPEND _ALPAKA_COMPILE_OPTIONS_PUBLIC "-ftemplate-depth=512")
     ENDIF()
+
+    # clang 9.0 fails at linker level due to missing libs
+    IF(CMAKE_CXX_COMPILER_ID MATCHES "Clang" AND NOT CMAKE_CXX_COMPILER_VERSION VERSION_LESS 9.0)
+        LIST(APPEND _ALPAKA_LINK_LIBRARIES_PUBLIC "-lstdc++;m")
+    ENDIF()
+
 ENDIF()
 
 #-------------------------------------------------------------------------------

However, runtime GPU tests fail now.

/home/mwerner/cuda-workspace/alpaka/test/integ/axpy/src/axpy.cpp:221: FAILED:
  {Unknown expression after the reported line}
due to unexpected exception with message:
  /home/mwerner/cuda-workspace/alpaka/include/alpaka/queue/QueueCudaRtAsync.hpp
  (340) 'cudaStreamSynchronize_ptsz( queue.m_spQueueImpl->m_CudaQueue)' A
  previous CUDA call (not this one) set the error  :
  'cudaErrorLaunchOutOfResources': 'too many resources requested for launch'!

===============================================================================
test cases: 1 | 1 failed
assertions: 3 | 2 passed | 1 failed

Tested with CUDA 10.0.130 and CUDA 9.2.88 (requires -Werror to be disabled due to philox/curand conversion warnings), with and without the CPU OpenMP back-ends.

PS:
updated compile script for llvm:

spack load [email protected]

SOURCE_ROOT=$HOME/sources
LLVM_INSTALL_DIR=$HOME/software/llvm-dev-gcc5/

cd $SOURCE_ROOT
git clone https://github.com/llvm/llvm-project.git llvm

cd llvm
mkdir -p build
cd build

cmake -DLLVM_ENABLE_PROJECTS=clang -G "Unix Makefiles" -DLLVM_BUILD_TESTS=OFF -DLLVM_PARALLEL_COMPILE_JOBS=12 -DLLVM_PARALLEL_LINK_JOBS=12 -DLLVM_BUILD_UTILS=OFF -DCMAKE_INSTALL_PREFIX=$LLVM_INSTALL_DIR -DCMAKE_BUILD_TYPE=RelWithDebInfo ../llvm

make -j 12
make install

@tdd11235813
Contributor Author

Current WIP (clang-cuda with upstream clang):

  • The runtime error "too many resources" only happens with undefined CMAKE_BUILD_TYPE for CUDA 9.0.176 - 10.0.130 (does clang produce invalid PTX code?).
  • When clang (CUDA) is run in Debug mode, ptxas crashes during compilation for CUDA < 9.2.
CMAKE_BUILD_TYPE   clang-9                     nvcc
None               too many resources          ✔️
Debug              ptxas error for CUDA <9.2   ✔️
Release            ✔️                          ✔️

Resource Usage

Resource usage of the alpaka axpy test.

clang-9, Undefined CMAKE_BUILD_TYPE

Resource usage printed with: cuobjdump -res-usage test/integ/axpy/axpy

I am not sure which of the following is "too many". The stack can be much higher (tested), and the register count can be higher as well (not tested, just from theory).

 Fatbin elf code:
 ================
 arch = sm_30
 code version = [1,7]
 producer = cuda
 host = linux
 compile_size = 64bit
 
 Resource usage:
  Common:
   GLOBAL:324
  Function _ZN6alpaka6kernel4cuda6detail10cudaKernelISt17integral_constantImLm1EEm10AxpyKernelJmfPfS7_EEEvNS_3vec3VecIT_T0_EET1_DpT2_:
   REG:45 STACK:488 SHARED:0 LOCAL:0 CONSTANT[0]:368 TEXTURE:0 SURFACE:0 SAMPLER:0
 
 Fatbin ptx code:
 ================
 arch = sm_30
 code version = [6,0]
 producer = cuda
 host = linux
 compile_size = 64bit
 compressed

Note:

Note that value for REG, TEXTURE, SURFACE and SAMPLER denotes the count and for other resources it denotes no. of byte(s) used. 

clang-9, Debug

CUDA<9.2: ptxas => double free or corruption error when compiling axpy

Fatbin elf code:
 ================
 arch = sm_30
 code version = [1,7]
 producer = cuda
 host = linux
 compile_size = 64bit
 has debug info
 compressed

# ... lots of device functions ...

 Fatbin ptx code:
 ================
 arch = sm_30
 code version = [6,1]
 producer = cuda
 host = linux
 compile_size = 64bit
 has debug info
 compressed

clang-9, Release

 Fatbin elf code:
 ================
 arch = sm_30
 code version = [1,7]
 producer = cuda
 host = linux
 compile_size = 64bit
 
 Resource usage:
  Common:
   GLOBAL:0
  Function _ZN6alpaka6kernel4cuda6detail10cudaKernelISt17integral_constantImLm1EEm10AxpyKernelJmfPfS7_EEEvNS_3vec3VecIT_T0_EET1_DpT2_:
   REG:16 STACK:0 SHARED:0 LOCAL:0 CONSTANT[0]:368 TEXTURE:0 SURFACE:0 SAMPLER:0
 
 Fatbin ptx code:
 ================
 arch = sm_30
 code version = [6,0]
 producer = cuda
 host = linux
 compile_size = 64bit
 compressed

cuda9.0.176, Debug

Fatbin ptx code:
 ================
 arch = sm_30
 code version = [6,0]
 producer = cuda
 host = linux
 compile_size = 64bit
 has debug info
 compressed
 identifier = /home/mwerner/cuda-workspace/alpaka/test/integ/axpy/src//axpy.cpp
 ptxasOptions =  -g --dont-merge-basicblocks --return-at-end 
 
 Fatbin elf code:
 ================
 arch = sm_30
 code version = [1,7]
 producer = cuda
 host = linux
 compile_size = 64bit
 has debug info
 compressed
 identifier = /home/mwerner/cuda-workspace/alpaka/test/integ/axpy/src//axpy.cpp

## ... lots of small device functions ...

cuda9.0.176, Release or Undefined CMAKE_BUILD_TYPE

Fatbin ptx code:
 ================
 arch = sm_30
 code version = [6,0]
 producer = cuda
 host = linux
 compile_size = 64bit
 compressed
 
 Fatbin elf code:
 ================
 arch = sm_30
 code version = [1,7]
 producer = cuda
 host = linux
 compile_size = 64bit

  Resource usage:
  Common:
   GLOBAL:77696 CONSTANT[3]:72
  Function _ZN6alpaka6kernel4cuda6detail10cudaKernelISt17integral_constantImLm1EEm10AxpyKernelJmfPfS7_EEEvNS_3vec3VecIT_T0_EET1_DpT2_:
   REG:16 STACK:0 SHARED:0 LOCAL:0 CONSTANT[0]:368 TEXTURE:0 SURFACE:0 SAMPLER:0

Minimal working example for clang?

Well, my various minimal working examples really do work, i.e., I cannot provoke the "too many resources" issue. However, I could not push the register usage above 32 with clang-cuda, although higher numbers would be fine from the CUDA perspective, as would the other resource numbers. And the number of instructions is far too low to exceed any limit.

Clang-CUDA + OpenMP?

I did a new build of clang + OpenMP. The error still persists.
When the clang-cuda back-end is not used everything is fine, but when clang-cuda is chosen together with the OMP2 back-end, clang again throws OpenMP errors, indicating that it has disabled OpenMP on the fly. The OMP4 back-end cannot be selected at all.
(I am going to reply upstream with the current results.)

@SimeonEhrig
Member

The bug still exists with clang++ 11.0.0-++20200928083541+eb83b551d3e-1~exp1~20200928184208.110, a.k.a. clang 11 RC4.

@j-stephan
Member

@SimeonEhrig Do you know whether this is still a problem?

@SimeonEhrig
Member

@SimeonEhrig Do you know whether this is still a problem?

My example compiles and runs with clang 11.1.0 and CUDA 10.1 (the latest officially supported version). Therefore, I think we can test alpaka.

@SimeonEhrig
Member

PR #2099 allows creating the test cases, but enabling the combination needs some extra work, such as removing this line of code in the CMakeLists.txt:

message(FATAL_ERROR "Clang as a CUDA compiler does not support OpenMP 2!")
