Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] Memory corruption/undefined behavior on GemmUniversal in 3.4.0 - 3.6.0 🐛 #2017

Open
warpuv opened this issue Dec 28, 2024 · 4 comments
Labels
? - Needs Triage bug Something isn't working

Comments

@warpuv
Copy link

warpuv commented Dec 28, 2024

Description of the bug:

Affected versions are 3.4.0 and 3.6.0 and in between.

When using example cutlass/examples/36_gather_scatter_fusion/gather_scatter_fusion.cu, and linking with some other code (attached to this report) I've got the error:

/workspace/src/cutlass/tools/util/include/cutlass/util/host_tensor.h:181  cutlass::HostTensor::reserve(count=253952, device_backed_=true)
/workspace/src/cutlass/tools/util/include/cutlass/util/host_tensor.h:189  cutlass::HostTensor::reserve: host_.resize(253952)
/workspace/src/cutlass/tools/util/include/cutlass/util/host_tensor.h:197  cutlass::HostTensor::reserve: device_memory::allocate(253952)
/workspace/src/cutlass/tools/util/include/cutlass/util/device_memory.h:77  cutlass::device_memory::allocate: Successful cudaMalloc: bytes=507904
/workspace/src/cutlass/tools/util/include/cutlass/util/host_tensor.h:181  cutlass::HostTensor::reserve(count=1048576, device_backed_=true)
/workspace/src/cutlass/tools/util/include/cutlass/util/host_tensor.h:189  cutlass::HostTensor::reserve: host_.resize(1048576)
/workspace/src/cutlass/tools/util/include/cutlass/util/host_tensor.h:197  cutlass::HostTensor::reserve: device_memory::allocate(1048576)
/workspace/src/cutlass/tools/util/include/cutlass/util/device_memory.h:77  cutlass::device_memory::allocate: Successful cudaMalloc: bytes=2097152
/workspace/src/cutlass/tools/util/include/cutlass/util/host_tensor.h:181  cutlass::HostTensor::reserve(count=253952, device_backed_=true)
/workspace/src/cutlass/tools/util/include/cutlass/util/host_tensor.h:189  cutlass::HostTensor::reserve: host_.resize(253952)
/workspace/src/cutlass/tools/util/include/cutlass/util/host_tensor.h:197  cutlass::HostTensor::reserve: device_memory::allocate(253952)
/workspace/src/cutlass/tools/util/include/cutlass/util/device_memory.h:77  cutlass::device_memory::allocate: Successful cudaMalloc: bytes=1015808
/workspace/src/cutlass/tools/util/include/cutlass/util/host_tensor.h:181  cutlass::HostTensor::reserve(count=253952, device_backed_=true)
/workspace/src/cutlass/tools/util/include/cutlass/util/host_tensor.h:189  cutlass::HostTensor::reserve: host_.resize(253952)
/workspace/src/cutlass/tools/util/include/cutlass/util/host_tensor.h:197  cutlass::HostTensor::reserve: device_memory::allocate(253952)
/workspace/src/cutlass/tools/util/include/cutlass/util/device_memory.h:77  cutlass::device_memory::allocate: Successful cudaMalloc: bytes=1015808
/workspace/src/cutlass/tools/util/include/cutlass/util/host_tensor.h:181  cutlass::HostTensor::reserve(count=240, device_backed_=true)
/workspace/src/cutlass/tools/util/include/cutlass/util/host_tensor.h:189  cutlass::HostTensor::reserve: host_.resize(240)
/workspace/src/cutlass/tools/util/include/cutlass/util/host_tensor.h:197  cutlass::HostTensor::reserve: device_memory::allocate(240)
/workspace/src/cutlass/tools/util/include/cutlass/util/device_memory.h:77  cutlass::device_memory::allocate: Successful cudaMalloc: bytes=960
/workspace/src/cutlass/include/cutlass/gemm/kernel/params_universal_base.h:95  GemmUniversal::Arguments::Arguments() - problem_size: cutlass::gemm::GemmCoord {248,240,1024}
/workspace/src/cutlass/include/cutlass/gemm/kernel/gemm_universal.h:191  GemmUniversal::Arguments::Arguments() - problem_size: cutlass::gemm::GemmCoord {248,240,1024}
/workspace/src/cutlass/include/cutlass/gemm/device/gemm_universal_base.h:310  GemmUniversalBase::get_workspace_size()
/workspace/src/cutlass/include/cutlass/gemm/device/gemm_universal_base.h:132  GemmUniversalBase::init_device_props()
/workspace/src/cutlass/include/cutlass/gemm/device/gemm_universal_base.h:190    device_ordinal: (0), device_sms: (108), sm_occupancy: (2) smem_size: (81920) GemmKernel::kThreadCount: (128)
/workspace/src/cutlass/include/cutlass/gemm/device/gemm_universal_base.h:321    workspace_bytes: 0
/workspace/src/cutlass/tools/util/include/cutlass/util/device_memory.h:77  cutlass::device_memory::allocate: Successful cudaMalloc: bytes=0
/workspace/src/cutlass/include/cutlass/gemm/device/gemm_universal_base.h:271  GemmUniversalBase::can_implement()
/workspace/src/cutlass/include/cutlass/gemm/device/gemm_universal_base.h:329  GemmUniversalBase::get_grid_shape()
/workspace/src/cutlass/include/cutlass/gemm/device/gemm_universal_base.h:132  GemmUniversalBase::init_device_props()
/workspace/src/cutlass/include/cutlass/gemm/device/gemm_universal_base.h:342    tiled_shape: cutlass::gemm::GemmCoord {2,2,1}
  grid_dims: {2, 2, 1}
/workspace/src/cutlass/include/cutlass/gemm/kernel/gemm_universal.h:368  GemmUniversal::can_implement()
/workspace/src/cutlass/include/cutlass/gemm/kernel/gemm_universal.h:438    returning kSuccess
/workspace/src/cutlass/include/cutlass/gemm/device/gemm_universal_base.h:406  GemmUniversalBase::initialize() - workspace 0, stream: null
/workspace/src/cutlass/include/cutlass/gemm/device/gemm_universal_base.h:132  GemmUniversalBase::init_device_props()
/workspace/src/cutlass/tools/util/include/cutlass/util/host_tensor.h:181  cutlass::HostTensor::reserve(count=253952, device_backed_=true)
/workspace/src/cutlass/tools/util/include/cutlass/util/host_tensor.h:189  cutlass::HostTensor::reserve: host_.resize(253952)
/workspace/src/cutlass/tools/util/include/cutlass/util/host_tensor.h:197  cutlass::HostTensor::reserve: device_memory::allocate(253952)
/workspace/src/cutlass/tools/util/include/cutlass/util/device_memory.h:77  cutlass::device_memory::allocate: Successful cudaMalloc: bytes=1015808
/workspace/src/cutlass/include/cutlass/gemm/device/gemm_universal_base.h:434  GemmUniversalBase::run()
/workspace/src/cutlass/include/cutlass/gemm/device/gemm_universal_base.h:444    grid: (2, 2, 1), block: (128, 1, 1), SMEM: (81920)
/workspace/src/cutlass/include/cutlass/gemm/device/gemm_universal_base.h:466    grid launch failed with error invalid argument
Got cutlass error: Error Internal at: 387

Steps to reproduce:

  1. I've taken the official example cutlass/examples/36_gather_scatter_fusion/gather_scatter_fusion.cu, this file remains unchanged.
  2. Just use attached gather_scatter_fusion1.cu or:
    2.1) Made a copy of the example into gather_scatter_fusion1.cu.
    2.2) Remove the main function and rename run function to run2 in gather_scatter_fusion1.cu.
    2.3) Remove some code from the ending of the function run2 until status = gemm_op(); statement, not keeping it.
  3. Compile both .cu files into one executable & run.

IMPORTANT: gather_scatter_fusion1.cu MUST be the first in nvcc command to reproduce the error, otherwise everything works fine!

nvcc -O0 -arch=native -ccbin=clang --expt-relaxed-constexpr -I./cutlass/include -I./cutlass/tools/util/include gather_scatter_fusion1.cu gather_scatter_fusion.cu -o out
./out

Output result:

Got cutlass error: Error Internal at: 387

But expected:

Passed!
Runtime: 0.0388416 ms
 GFLOPs: 3138.31

Additional notes:

  1. IMPORTANT: Only in case when the template parameters of cutlass::gemm::device::GemmUniversal are exactly the same in both .cu files the error is occurring (that means the sass code exactly the same in both .o files).
  2. The error occurs on clang with -O2 or -O0 flags, and on gcc with -O0. (gcc with -O2 runs as expected at least with this version of code)
  3. None of the functions are called from gather_scatter_fusion1.cu during the test, existence of the function “run2” is enough to break the program.
  4. The last CUDA API call is the cudaLaunchKernel, in debugger the arguments to it looks reasonable.
  5. cudaGetLastError() returns cudaErrorInvalidValue
  6. I've found the problematic commit using git bisect, it is 8236f30 (this is release of 3.4.0 version),
  7. Since the source code of the individual commits of this huge PR is not available I cannot investigate the error further.

Environment:

GPU: A100
nvidia-smi: 470.161.03 CUDA Version: 11.4

gcc version: 13.3.0 (Ubuntu 13.3.0-6ubuntu2~24.04)
clang version: 18.1.3 (1ubuntu1)

Docker container:
nvcr.io/nvidia/cuda:12.6.3-devel-ubuntu24.04

Also reproduced on:
nvcr.io/nvidia/cuda:12.4.1-devel-ubuntu22.04 with corresponding default versions of tools/compilers.

gather_scatter_fusion1.cu.txt
gather_scatter_fusion.cu.txt

cc: @IonThruster

@warpuv warpuv added ? - Needs Triage bug Something isn't working labels Dec 28, 2024
@thakkarV
Copy link
Collaborator

thakkarV commented Dec 29, 2024

Looks like you're rolling your own build system and command line flags. We don't support that. Are you able to repot using our build flags and cmake. Also please specify your CUDA toolkit version

@warpuv
Copy link
Author

warpuv commented Dec 29, 2024

Looks like you're rolling your own build system and command line flags. We don't support that. Are you able to repot using our build flags and cmake. Also please specify your CUDA toolkit version

Hello @thakkarV , same issue when using CMake to build.

1. Copied file gather_scatter_fusion1.cu to the cutlass/examples/36_gather_scatter_fusion directory
2. Added gather_scatter_fusion1.cu line to the cutlass/examples/36_gather_scatter_fusion/CMakeLists.txt file like this:

cutlass_example_add_executable(
  36_gather_scatter_fusion
  gather_scatter_fusion1.cu
  gather_scatter_fusion.cu
  )

Again gather_scatter_fusion1.cu, must be the first in the list.

3. Run the commands:

cd cutlass
mkdir build && cd build
cmake .. -DCUTLASS_NVCC_ARCHS=80 -DCMAKE_BUILD_TYPE=Debug
cmake --build . --target test_examples_36_gather_scatter_fusion

4. Got the error:

Building CUDA object examples/36_gather_scatter_fusion/CMakeFiles/36_gather_scatter_fusion.dir/gather_scatter_fusion1.cu.o
Building CUDA object examples/36_gather_scatter_fusion/CMakeFiles/36_gather_scatter_fusion.dir/gather_scatter_fusion.cu.o
Linking CUDA executable 36_gather_scatter_fusion
Built target 36_gather_scatter_fusion
Got cutlass error: Error Internal at: 387
gmake[3]: *** [examples/36_gather_scatter_fusion/CMakeFiles/test_examples_36_gather_scatter_fusion.dir/build.make:70: examples/36_gather_scatter_fusion/CMakeFiles/test_examples_36_gather_scatter_fusion] Error 1
gmake[2]: *** [CMakeFiles/Makefile2:29530: examples/36_gather_scatter_fusion/CMakeFiles/test_examples_36_gather_scatter_fusion.dir/all] Error 2
gmake[1]: *** [CMakeFiles/Makefile2:29537: examples/36_gather_scatter_fusion/CMakeFiles/test_examples_36_gather_scatter_fusion.dir/rule] Error 2
gmake: *** [Makefile:12082: test_examples_36_gather_scatter_fusion] Error 2

nvcc --version output:

Copyright (c) 2005-2024 NVIDIA Corporation
Built on Tue_Oct_29_23:50:19_PDT_2024
Cuda compilation tools, release 12.6, V12.6.85
Build cuda_12.6.r12.6/compiler.35059454_0

@thakkarV
Copy link
Collaborator

Does it work correctly if you use a release build instead of debug?

@warpuv
Copy link
Author

warpuv commented Dec 29, 2024

Does it work correctly if you use a release build instead of debug?

Yes, it works correctly when using GCC host compiler (Release configuration)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
? - Needs Triage bug Something isn't working
Projects
None yet
Development

No branches or pull requests

2 participants