[amdgpu] Part3 update runtime module #6486
Conversation
In general, for hardware-specific code or configurations, we should add a comment with a reference link to the specific chapter/part of the AMD GPU guide whenever we use magic strings or magic numbers. This is especially true when setting up function attributes and the like.
The previously implemented CUDA backend is full of magic numbers and hacks and lacks sufficient comments/explanations. Let's improve this situation starting from the AMD GPU backend. Thanks!
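For example, a hardware-specific attribute setting could carry a short reference comment like the hypothetical snippet below (illustrative only, not code from this PR; the attribute value is made up):

```
// Declare the work-group size range this kernel was compiled for.
// "amdgpu-flat-work-group-size" takes "min,max"; see the
// "LLVM IR Attributes" section of https://llvm.org/docs/AMDGPUUsage.html
kernel_func->addFnAttr("amdgpu-flat-work-group-size", "1,256");
```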
LGTM!
Issue: taichi-dev#6434

### Brief Summary

1. This is a special part of the Taichi runtime module for the `AMDGPU` backend. The `LLVM IR` that Taichi's runtime module generates with `clang++` differs from the CPU-generated `LLVM IR` in how memory allocation is handled. The following is an example.

```
C/C++ code
void func(int *a, int *b) {
    *a = *b;
}

x86_64 backend LLVM IR
define dso_local void @cpu_func(i32* %0, i32* %1) #2 {
  %3 = alloca i32*, align 8
  %4 = alloca i32*, align 8
  store i32* %0, i32** %3, align 8
  store i32* %1, i32** %4, align 8
  %5 = load i32*, i32** %4, align 8
  %6 = load i32, i32* %5, align 4
  %7 = load i32*, i32** %3, align 8
  store i32 %6, i32* %7, align 4
  ret void
}

__global__ function on AMDGPU
define protected amdgpu_kernel void @global_func(i32 addrspace(1)* %0, i32 addrspace(1)* %1) #4 {
  %3 = alloca i32*, align 8, addrspace(5)
  %4 = alloca i32*, align 8, addrspace(5)
  %5 = alloca i32*, align 8, addrspace(5)
  %6 = alloca i32*, align 8, addrspace(5)
  %7 = addrspacecast i32* addrspace(5)* %3 to i32**
  %8 = addrspacecast i32* addrspace(5)* %4 to i32**
  %9 = addrspacecast i32* addrspace(5)* %5 to i32**
  %10 = addrspacecast i32* addrspace(5)* %6 to i32**
  %11 = addrspacecast i32 addrspace(1)* %0 to i32*
  store i32* %11, i32** %7, align 8
  %12 = load i32*, i32** %7, align 8
  %13 = addrspacecast i32 addrspace(1)* %1 to i32*
  store i32* %13, i32** %8, align 8
  %14 = load i32*, i32** %8, align 8
  store i32* %12, i32** %9, align 8
  store i32* %14, i32** %10, align 8
  %15 = load i32*, i32** %10, align 8
  %16 = load i32, i32* %15, align 4
  %17 = load i32*, i32** %9, align 8
  store i32 %16, i32* %17, align 4
  ret void
}

__device__ function on AMDGPU
define hidden void @device_func(i32* %0, i32* %1) #2 {
  %3 = alloca i32*, align 8, addrspace(5)
  %4 = alloca i32*, align 8, addrspace(5)
  %5 = addrspacecast i32* addrspace(5)* %3 to i32**
  %6 = addrspacecast i32* addrspace(5)* %4 to i32**
  store i32* %0, i32** %5, align 8
  store i32* %1, i32** %6, align 8
  %7 = load i32*, i32** %6, align 8
  %8 = load i32, i32* %7, align 4
  %9 = load i32*, i32** %5, align 8
  store i32 %8, i32* %9, align 4
  ret void
}
```

2. The differences are in the `allocainst` handling, specifically the address space (for `AMDGPU`, [this](https://llvm.org/docs/AMDGPUUsage.html#address-spaces) will be helpful). I have not found documentation describing how to write the correct `LLVM IR` for `AMDGPU`, so these rules come from my observation of the `LLVM IR` generated by `clang++/hipcc`. For a `__global__` function we need to handle both the function arguments and the `allocainst` (specifying the addrspace of the `allocainst` and performing an addrspacecast), while for a `__device__` function we do not need to handle the function arguments.

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
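For illustration, here is a minimal sketch of how the `allocainst` handling described above could be expressed with the LLVM C++ API: create the alloca in the AMDGPU private address space (5) and immediately `addrspacecast` it back to the generic address space (0), mirroring the `clang++/hipcc` output shown in the example. This is an assumption-laden sketch, not the PR's actual code; the helper name and constants are hypothetical.

```
#include "llvm/IR/IRBuilder.h"

using namespace llvm;

// AMDGPU address spaces (see
// https://llvm.org/docs/AMDGPUUsage.html#address-spaces):
//   0 = generic (flat), 1 = global, 5 = private (scratch)
constexpr unsigned kAMDGPUGenericAS = 0;
constexpr unsigned kAMDGPUPrivateAS = 5;

// Hypothetical helper: emit `alloca ..., addrspace(5)` followed by an
// `addrspacecast` to the generic address space, so downstream codegen
// that expects addrspace(0) pointers keeps working unchanged.
Value *create_private_alloca(IRBuilder<> &builder, Type *elem_ty) {
  AllocaInst *slot =
      builder.CreateAlloca(elem_ty, kAMDGPUPrivateAS, /*ArraySize=*/nullptr);
  return builder.CreateAddrSpaceCast(
      slot, PointerType::get(elem_ty, kAMDGPUGenericAS));
}
```

For a `__global__` (kernel) function, the `addrspace(1)` arguments would additionally be cast to generic pointers before use, as in the `@global_func` listing above; `__device__` functions leave their arguments untouched.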
…7023) Issue: taichi-dev#6434 ### Brief Summary These unit tests are for taichi-dev#6486 Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>