[BUG] Memory corruption/undefined behavior on GemmUniversal in 3.4.0 - 3.6.0 🐛 #2017

warpuv · 2024-12-28T17:47:54Z

Description of the bug:

Affected versions are 3.4.0 and 3.6.0 and in between.

When using example cutlass/examples/36_gather_scatter_fusion/gather_scatter_fusion.cu, and linking with some other code (attached to this report) I've got the error:

/workspace/src/cutlass/tools/util/include/cutlass/util/host_tensor.h:181  cutlass::HostTensor::reserve(count=253952, device_backed_=true)
/workspace/src/cutlass/tools/util/include/cutlass/util/host_tensor.h:189  cutlass::HostTensor::reserve: host_.resize(253952)
/workspace/src/cutlass/tools/util/include/cutlass/util/host_tensor.h:197  cutlass::HostTensor::reserve: device_memory::allocate(253952)
/workspace/src/cutlass/tools/util/include/cutlass/util/device_memory.h:77  cutlass::device_memory::allocate: Successful cudaMalloc: bytes=507904
/workspace/src/cutlass/tools/util/include/cutlass/util/host_tensor.h:181  cutlass::HostTensor::reserve(count=1048576, device_backed_=true)
/workspace/src/cutlass/tools/util/include/cutlass/util/host_tensor.h:189  cutlass::HostTensor::reserve: host_.resize(1048576)
/workspace/src/cutlass/tools/util/include/cutlass/util/host_tensor.h:197  cutlass::HostTensor::reserve: device_memory::allocate(1048576)
/workspace/src/cutlass/tools/util/include/cutlass/util/device_memory.h:77  cutlass::device_memory::allocate: Successful cudaMalloc: bytes=2097152
/workspace/src/cutlass/tools/util/include/cutlass/util/host_tensor.h:181  cutlass::HostTensor::reserve(count=253952, device_backed_=true)
/workspace/src/cutlass/tools/util/include/cutlass/util/host_tensor.h:189  cutlass::HostTensor::reserve: host_.resize(253952)
/workspace/src/cutlass/tools/util/include/cutlass/util/host_tensor.h:197  cutlass::HostTensor::reserve: device_memory::allocate(253952)
/workspace/src/cutlass/tools/util/include/cutlass/util/device_memory.h:77  cutlass::device_memory::allocate: Successful cudaMalloc: bytes=1015808
/workspace/src/cutlass/tools/util/include/cutlass/util/host_tensor.h:181  cutlass::HostTensor::reserve(count=253952, device_backed_=true)
/workspace/src/cutlass/tools/util/include/cutlass/util/host_tensor.h:189  cutlass::HostTensor::reserve: host_.resize(253952)
/workspace/src/cutlass/tools/util/include/cutlass/util/host_tensor.h:197  cutlass::HostTensor::reserve: device_memory::allocate(253952)
/workspace/src/cutlass/tools/util/include/cutlass/util/device_memory.h:77  cutlass::device_memory::allocate: Successful cudaMalloc: bytes=1015808
/workspace/src/cutlass/tools/util/include/cutlass/util/host_tensor.h:181  cutlass::HostTensor::reserve(count=240, device_backed_=true)
/workspace/src/cutlass/tools/util/include/cutlass/util/host_tensor.h:189  cutlass::HostTensor::reserve: host_.resize(240)
/workspace/src/cutlass/tools/util/include/cutlass/util/host_tensor.h:197  cutlass::HostTensor::reserve: device_memory::allocate(240)
/workspace/src/cutlass/tools/util/include/cutlass/util/device_memory.h:77  cutlass::device_memory::allocate: Successful cudaMalloc: bytes=960
/workspace/src/cutlass/include/cutlass/gemm/kernel/params_universal_base.h:95  GemmUniversal::Arguments::Arguments() - problem_size: cutlass::gemm::GemmCoord {248,240,1024}
/workspace/src/cutlass/include/cutlass/gemm/kernel/gemm_universal.h:191  GemmUniversal::Arguments::Arguments() - problem_size: cutlass::gemm::GemmCoord {248,240,1024}
/workspace/src/cutlass/include/cutlass/gemm/device/gemm_universal_base.h:310  GemmUniversalBase::get_workspace_size()
/workspace/src/cutlass/include/cutlass/gemm/device/gemm_universal_base.h:132  GemmUniversalBase::init_device_props()
/workspace/src/cutlass/include/cutlass/gemm/device/gemm_universal_base.h:190    device_ordinal: (0), device_sms: (108), sm_occupancy: (2) smem_size: (81920) GemmKernel::kThreadCount: (128)
/workspace/src/cutlass/include/cutlass/gemm/device/gemm_universal_base.h:321    workspace_bytes: 0
/workspace/src/cutlass/tools/util/include/cutlass/util/device_memory.h:77  cutlass::device_memory::allocate: Successful cudaMalloc: bytes=0
/workspace/src/cutlass/include/cutlass/gemm/device/gemm_universal_base.h:271  GemmUniversalBase::can_implement()
/workspace/src/cutlass/include/cutlass/gemm/device/gemm_universal_base.h:329  GemmUniversalBase::get_grid_shape()
/workspace/src/cutlass/include/cutlass/gemm/device/gemm_universal_base.h:132  GemmUniversalBase::init_device_props()
/workspace/src/cutlass/include/cutlass/gemm/device/gemm_universal_base.h:342    tiled_shape: cutlass::gemm::GemmCoord {2,2,1}
  grid_dims: {2, 2, 1}
/workspace/src/cutlass/include/cutlass/gemm/kernel/gemm_universal.h:368  GemmUniversal::can_implement()
/workspace/src/cutlass/include/cutlass/gemm/kernel/gemm_universal.h:438    returning kSuccess
/workspace/src/cutlass/include/cutlass/gemm/device/gemm_universal_base.h:406  GemmUniversalBase::initialize() - workspace 0, stream: null
/workspace/src/cutlass/include/cutlass/gemm/device/gemm_universal_base.h:132  GemmUniversalBase::init_device_props()
/workspace/src/cutlass/tools/util/include/cutlass/util/host_tensor.h:181  cutlass::HostTensor::reserve(count=253952, device_backed_=true)
/workspace/src/cutlass/tools/util/include/cutlass/util/host_tensor.h:189  cutlass::HostTensor::reserve: host_.resize(253952)
/workspace/src/cutlass/tools/util/include/cutlass/util/host_tensor.h:197  cutlass::HostTensor::reserve: device_memory::allocate(253952)
/workspace/src/cutlass/tools/util/include/cutlass/util/device_memory.h:77  cutlass::device_memory::allocate: Successful cudaMalloc: bytes=1015808
/workspace/src/cutlass/include/cutlass/gemm/device/gemm_universal_base.h:434  GemmUniversalBase::run()
/workspace/src/cutlass/include/cutlass/gemm/device/gemm_universal_base.h:444    grid: (2, 2, 1), block: (128, 1, 1), SMEM: (81920)
/workspace/src/cutlass/include/cutlass/gemm/device/gemm_universal_base.h:466    grid launch failed with error invalid argument
Got cutlass error: Error Internal at: 387

Steps to reproduce:

I've taken the official example cutlass/examples/36_gather_scatter_fusion/gather_scatter_fusion.cu, this file remains unchanged.
Just use attached gather_scatter_fusion1.cu or:
2.1) Made a copy of the example into gather_scatter_fusion1.cu.
2.2) Remove the main function and rename run function to run2 in gather_scatter_fusion1.cu.
2.3) Remove some code from the ending of the function run2 until status = gemm_op(); statement, not keeping it.
Compile both .cu files into one executable & run.

IMPORTANT: gather_scatter_fusion1.cu MUST be the first in nvcc command to reproduce the error, otherwise everything works fine!

nvcc -O0 -arch=native -ccbin=clang --expt-relaxed-constexpr -I./cutlass/include -I./cutlass/tools/util/include gather_scatter_fusion1.cu gather_scatter_fusion.cu -o out
./out

Output result:

Got cutlass error: Error Internal at: 387

But expected:

Passed!
Runtime: 0.0388416 ms
 GFLOPs: 3138.31

Additional notes:

IMPORTANT: Only in case when the template parameters of cutlass::gemm::device::GemmUniversal are exactly the same in both .cu files the error is occurring (that means the sass code exactly the same in both .o files).
The error occurs on clang with -O2 or -O0 flags, and on gcc with -O0. (gcc with -O2 runs as expected at least with this version of code)
None of the functions are called from gather_scatter_fusion1.cu during the test, existence of the function “run2” is enough to break the program.
The last CUDA API call is the cudaLaunchKernel, in debugger the arguments to it looks reasonable.
cudaGetLastError() returns cudaErrorInvalidValue
I've found the problematic commit using git bisect, it is 8236f30 (this is release of 3.4.0 version),
Since the source code of the individual commits of this huge PR is not available I cannot investigate the error further.

Environment:

GPU: A100
nvidia-smi: 470.161.03 CUDA Version: 11.4

gcc version: 13.3.0 (Ubuntu 13.3.0-6ubuntu2~24.04)
clang version: 18.1.3 (1ubuntu1)

Docker container:
nvcr.io/nvidia/cuda:12.6.3-devel-ubuntu24.04

Also reproduced on:
nvcr.io/nvidia/cuda:12.4.1-devel-ubuntu22.04 with corresponding default versions of tools/compilers.

gather_scatter_fusion1.cu.txt
gather_scatter_fusion.cu.txt

cc: @IonThruster

The text was updated successfully, but these errors were encountered:

thakkarV · 2024-12-29T01:44:20Z

Looks like you're rolling your own build system and command line flags. We don't support that. Are you able to repot using our build flags and cmake. Also please specify your CUDA toolkit version

warpuv · 2024-12-29T12:57:33Z

Looks like you're rolling your own build system and command line flags. We don't support that. Are you able to repot using our build flags and cmake. Also please specify your CUDA toolkit version

Hello @thakkarV , same issue when using CMake to build.

1. Copied file gather_scatter_fusion1.cu to the cutlass/examples/36_gather_scatter_fusion directory
2. Added gather_scatter_fusion1.cu line to the cutlass/examples/36_gather_scatter_fusion/CMakeLists.txt file like this:

cutlass_example_add_executable(
  36_gather_scatter_fusion
  gather_scatter_fusion1.cu
  gather_scatter_fusion.cu
  )

Again gather_scatter_fusion1.cu, must be the first in the list.

3. Run the commands:

cd cutlass
mkdir build && cd build
cmake .. -DCUTLASS_NVCC_ARCHS=80 -DCMAKE_BUILD_TYPE=Debug
cmake --build . --target test_examples_36_gather_scatter_fusion

4. Got the error:

Building CUDA object examples/36_gather_scatter_fusion/CMakeFiles/36_gather_scatter_fusion.dir/gather_scatter_fusion1.cu.o
Building CUDA object examples/36_gather_scatter_fusion/CMakeFiles/36_gather_scatter_fusion.dir/gather_scatter_fusion.cu.o
Linking CUDA executable 36_gather_scatter_fusion
Built target 36_gather_scatter_fusion
Got cutlass error: Error Internal at: 387
gmake[3]: *** [examples/36_gather_scatter_fusion/CMakeFiles/test_examples_36_gather_scatter_fusion.dir/build.make:70: examples/36_gather_scatter_fusion/CMakeFiles/test_examples_36_gather_scatter_fusion] Error 1
gmake[2]: *** [CMakeFiles/Makefile2:29530: examples/36_gather_scatter_fusion/CMakeFiles/test_examples_36_gather_scatter_fusion.dir/all] Error 2
gmake[1]: *** [CMakeFiles/Makefile2:29537: examples/36_gather_scatter_fusion/CMakeFiles/test_examples_36_gather_scatter_fusion.dir/rule] Error 2
gmake: *** [Makefile:12082: test_examples_36_gather_scatter_fusion] Error 2

nvcc --version output:

Copyright (c) 2005-2024 NVIDIA Corporation
Built on Tue_Oct_29_23:50:19_PDT_2024
Cuda compilation tools, release 12.6, V12.6.85
Build cuda_12.6.r12.6/compiler.35059454_0

thakkarV · 2024-12-29T13:07:49Z

Does it work correctly if you use a release build instead of debug?

warpuv · 2024-12-29T13:09:25Z

Does it work correctly if you use a release build instead of debug?

Yes, it works correctly when using GCC host compiler (Release configuration)

warpuv added ? - Needs Triage bug Something isn't working labels Dec 28, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[BUG] Memory corruption/undefined behavior on GemmUniversal in 3.4.0 - 3.6.0 🐛 #2017

[BUG] Memory corruption/undefined behavior on GemmUniversal in 3.4.0 - 3.6.0 🐛 #2017

warpuv commented Dec 28, 2024 •

edited

Loading

thakkarV commented Dec 29, 2024 •

edited

Loading

warpuv commented Dec 29, 2024

thakkarV commented Dec 29, 2024

warpuv commented Dec 29, 2024 •

edited

Loading

[BUG] Memory corruption/undefined behavior on GemmUniversal in 3.4.0 - 3.6.0 🐛 #2017

[BUG] Memory corruption/undefined behavior on GemmUniversal in 3.4.0 - 3.6.0 🐛 #2017

Comments

warpuv commented Dec 28, 2024 • edited Loading

Description of the bug:

Steps to reproduce:

Additional notes:

Environment:

thakkarV commented Dec 29, 2024 • edited Loading

warpuv commented Dec 29, 2024

thakkarV commented Dec 29, 2024

warpuv commented Dec 29, 2024 • edited Loading

warpuv commented Dec 28, 2024 •

edited

Loading

thakkarV commented Dec 29, 2024 •

edited

Loading

warpuv commented Dec 29, 2024 •

edited

Loading