Taichi slower than CUDA/OpenCL #8526
Can you show us the kernels you used for cuda_matvec / opencl_matvec?
You can find the code for the other implementations by clicking on the word "here" in my post. For your convenience, here is the link again: https://github.com/99991/matvec-gpu/tree/main?tab=readme-ov-file#results
I tried multiple block dims and found that a block dim of 8 was among the fastest. Here is a graph showing block dim vs min execution time over 10000 runs, but note that there is some jitter in there, so it probably does not matter as long as the block dim is not larger than 128.
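For reference, a sweep like that can be expressed by making the block dim a template parameter, so each value compiles its own kernel specialization. This is a minimal sketch (assuming `ti.loop_config` accepts the template-specialized constant `bd`), not the exact script used for the graph:

```python
import taichi as ti
import time

@ti.kernel
def matvec(A: ti.template(), x: ti.template(), b: ti.template(), bd: ti.template()):
    ti.loop_config(block_dim=bd)  # bd is a compile-time constant via ti.template()
    for i in range(A.shape[0]):   # one thread per output row
        s = 0.0
        for j in range(A.shape[1]):
            s += A[i, j] * x[j]
        b[i] = s

def best_time(A, x, b, bd, runs=100):
    best = float("inf")
    for _ in range(runs):
        ti.sync()
        t0 = time.perf_counter()
        matvec(A, x, b, bd)
        ti.sync()
        best = min(best, time.perf_counter() - t0)
    return best

# Usage: for bd in (8, 16, 32, 64, 128, 256): print(bd, best_time(A, x, b, bd))
```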
This makes the code slower, because there is a race condition when summing up the result, which has to be resolved with atomics:

```python
for i, j in A:
    b[i] += A[i, j] * x[j]
    # ^^^^^^^
    # race condition when multiple threads want to add to the same b[i]
```

But this is not important, because I am not interested in a faster matvec implementation. Instead, I want to know why Taichi is slower than CuPy and OpenCL for the same implementation. This matvec implementation just serves as an example.
I would argue that forcing a block dim of 8 is never a good idea, and this isn't a typical implementation of mat @ vec; we have tested a more conventional approach before. Taichi has an optimization for atomic reductions, so I'd recommend you test with that. Also, judging from the numbers you posted, the matrix is way too small; the GPU results here might very well be just a CPU overhead test. This might also explain why the tiny block size works well: it's only using 512 threads, which means that unless you use a very small block size there simply aren't enough threads to cover the entire GPU. And yes, we admit Taichi has pretty high CPU overhead.
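One way to separate that CPU overhead from device time is Taichi's built-in kernel profiler, using the same calls that appear in the matmul script below. A minimal sketch:

```python
import taichi as ti

ti.init(arch=ti.cuda, kernel_profiler=True)

# ... define fields and run the matvec kernel many times ...

ti.sync()
# Reports per-kernel device time, excluding Python-side launch overhead,
# so it can be compared against the wall-clock numbers above.
ti.profiler.print_kernel_profiler_info()
```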
Again, the point of my issue is not to obtain an efficient matvec implementation. It just serves as an example to demonstrate that Taichi is slower than CUDA.
To make sure that CPU overhead is definitely not the issue, I increased the amount of work per kernel call: here is a naive matrix multiplication implementation.

**Naive matrix multiplication with Taichi (~800 ms)**

```python
import taichi as ti
import numpy as np
import time

@ti.kernel
def matmul(A: ti.template(), B: ti.template(), C: ti.template()):
    _, n = A.shape
    ti.loop_config(block_dim=128)
    for i, j in C:  # parallel over all output entries
        s = 0.0
        for k in range(n):
            s += A[i, k] * B[k, j]
        C[i, j] = s

@ti.kernel
def init(x: ti.template()):
    for i in ti.grouped(x):
        x[i] = ti.random(ti.float32)

def main():
    m = 4096
    k = 4096
    n = 4096
    A = ti.field(shape=(m, k), dtype=ti.float32)
    B = ti.field(shape=(k, n), dtype=ti.float32)
    C = ti.field(shape=(m, n), dtype=ti.float32)
    init(A)
    init(B)
    init(C)
    C_expected_np = A.to_numpy() @ B.to_numpy()
    min_time = float("inf")
    for _ in range(100):
        ti.sync()
        start_time = time.perf_counter()
        matmul(A, B, C)
        C_np = C.to_numpy()
        ti.sync()
        elapsed_time = time.perf_counter() - start_time
        print(f"{elapsed_time * 1e3:9.3f} ms")
        assert np.allclose(C_expected_np, C_np)
        min_time = min(min_time, elapsed_time)
    print(f"Min: {min_time * 1e3:9.3f} ms")

if __name__ == "__main__":
    ti.init(arch=ti.cuda, kernel_profiler=True)
    main()
    ti.profiler.print_kernel_profiler_info()
```

**Naive matrix multiplication with CuPy (~300 ms)**

```python
import cupy as cp
import numpy as np
import time

matmul = cp.RawKernel(
    r'''
extern "C" __global__
void matmul(
    float *A,
    float *B,
    float *C,
    int m,
    int k,
    int n
){
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    int j = blockDim.y * blockIdx.y + threadIdx.y;

    if (i >= m || j >= n) return;

    float s = 0;
    for (int k2 = 0; k2 < k; k2++){
        s += A[i * k + k2] * B[k2 * n + j];
    }
    C[i * n + j] = s;
}
''', "matmul")

def main():
    m = 4096
    k = 4096
    n = 4096
    A = cp.random.rand(m, k, dtype=cp.float32)
    B = cp.random.rand(k, n, dtype=cp.float32)
    C = cp.random.rand(m, n, dtype=cp.float32)
    C_expected_np = A.get() @ B.get()
    min_time = float("inf")
    for _ in range(100):
        cp.cuda.Device().synchronize()
        start_time = time.perf_counter()
        block_size = (1, 128)
        grid_size = (ceil_div(m, block_size[0]), ceil_div(n, block_size[1]))
        matmul(grid_size, block_size, (A, B, C, m, k, n))
        C_np = C.get()
        cp.cuda.Device().synchronize()
        elapsed_time = time.perf_counter() - start_time
        print(f"{elapsed_time * 1e3:9.3f} ms")
        assert np.allclose(C_expected_np, C_np)
        min_time = min(min_time, elapsed_time)
    print(f"Min: {min_time * 1e3:9.3f} ms")

def ceil_div(a, b):
    return (a + b - 1) // b

if __name__ == "__main__":
    main()
```

As can be seen, Taichi is even slower here. To be fair, the difference could also be due to Taichi not tiling the matrix in a favorable way (which is why I chose the matrix-vector multiplication example in my original issue), but …
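For context on what favorable tiling would mean here: a textbook shared-memory tiled kernel stages sub-blocks of A and B in shared memory so each global value is loaded once per tile rather than once per thread. The sketch below is purely illustrative (not from the repo), with an assumed, untuned tile size of 16:

```python
import cupy as cp

# Illustrative shared-memory tiling (assumed TILE = 16, untuned).
tiled_matmul = cp.RawKernel(r'''
#define TILE 16
extern "C" __global__
void tiled_matmul(const float *A, const float *B, float *C, int m, int k, int n){
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];
    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float s = 0.0f;
    for (int t = 0; t < (k + TILE - 1) / TILE; t++){
        // Each thread loads one element of the A tile and one of the B tile;
        // the whole block then reuses them from shared memory.
        As[threadIdx.y][threadIdx.x] = (row < m && t * TILE + threadIdx.x < k)
            ? A[row * k + t * TILE + threadIdx.x] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] = (t * TILE + threadIdx.y < k && col < n)
            ? B[(t * TILE + threadIdx.y) * n + col] : 0.0f;
        __syncthreads();
        for (int k2 = 0; k2 < TILE; k2++)
            s += As[threadIdx.y][k2] * Bs[k2][threadIdx.x];
        __syncthreads();
    }
    if (row < m && col < n)
        C[row * n + col] = s;
}
''', "tiled_matmul")

# Launch with block = (16, 16) and grid = (ceil(n/16), ceil(m/16)).
```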
@99991, thank you for this thread and your GitHub repo. Using your repo, it turns out the Vulkan and OpenGL backends perform well after simply commenting out the `ti.loop_config(block_dim=8)` line.

I captured the Unix time while computing the timing loop:

```python
for _ in range(1000):
    ...
```

The CUDA version was compiled with `nvcc -o cuda_matvec -O3 -fmad=false cuda_matvec.cu`.

I added Numba type signatures to the kernel:

```python
@cuda.jit('void(f4[:,:], f4[:], f4[:], i4, i4)')
def matvec(A, x, b, m, n):
    ...
```
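For reference, the kernel body behind that signature presumably mirrors the CUDA version: one thread per output row. This is a sketch, not the exact code from the repo:

```python
import numba
from numba import cuda

@cuda.jit('void(f4[:,:], f4[:], f4[:], i4, i4)')
def matvec(A, x, b, m, n):
    i = cuda.grid(1)  # global thread index, one thread per row
    if i < m:
        s = numba.float32(0.0)  # accumulate in float32, like the CUDA kernel
        for j in range(n):
            s += A[i, j] * x[j]
        b[i] = s
```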
Here is the CPU variant where I tried using external arrays as Taichi kernel arguments. This runs slower on the GPU, but has no impact when running on the CPU.

**taichi_cpu_matvec.py**

```python
import taichi as ti
import numpy as np
import time

@ti.kernel
def matvec(A: ti.types.ndarray(), x: ti.types.ndarray(), b: ti.types.ndarray()):
    m, n = A.shape
    for i in range(m):
        s = 0.0
        for j in range(n):
            s += A[i, j] * x[j]
        b[i] = s

def main():
    m = 8192
    n = 8192
    A = np.random.rand(m, n).astype(np.float32)
    x = np.random.rand(n).astype(np.float32)
    b = np.zeros(m, dtype=np.float32)
    b_expected = A @ x
    for _ in range(1000):
        start_time = time.perf_counter()
        matvec(A, x, b)
        ti.sync()
        elapsed_time = time.perf_counter() - start_time
        print(f"{elapsed_time * 1e6:9.3f} µs")
        assert np.allclose(b_expected, b)

if __name__ == "__main__":
    ti.init(arch=ti.cpu)
    main()
```
**Describe the bug**
I am currently evaluating various frameworks for GPU acceleration for a project of mine and found that Taichi is slower than expected. Due to foreign function call overhead, Taichi is expected to be a little slower than native CUDA, but it should not be three times slower than CuPy with custom kernels.
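That per-call overhead itself is easy to quantify by timing an empty kernel. A minimal sketch (not part of the repo):

```python
import taichi as ti
import time

ti.init(arch=ti.cuda)

@ti.kernel
def empty():
    pass

empty()  # warm up so JIT compilation is not measured
ti.sync()

best = float("inf")
for _ in range(1000):
    t0 = time.perf_counter()
    empty()
    ti.sync()
    best = min(best, time.perf_counter() - t0)
print(f"per-call overhead: {best * 1e6:.1f} µs")
```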
**To Reproduce**

Here is a Taichi implementation of matrix-vector multiplication ($Ax = b$). Am I missing something?
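A sketch of that implementation, assuming fields and the `ti.loop_config(block_dim=8)` setting discussed in the comments (the exact script in the repo may differ):

```python
import taichi as ti
import numpy as np
import time

ti.init(arch=ti.cuda)

m = n = 8192
A = ti.field(ti.f32, shape=(m, n))
x = ti.field(ti.f32, shape=n)
b = ti.field(ti.f32, shape=m)
A.from_numpy(np.random.rand(m, n).astype(np.float32))
x.from_numpy(np.random.rand(n).astype(np.float32))

@ti.kernel
def matvec():
    ti.loop_config(block_dim=8)  # block size under discussion in this thread
    for i in range(m):           # parallel over rows
        s = 0.0
        for j in range(n):
            s += A[i, j] * x[j]
        b[i] = s

b_expected = A.to_numpy() @ x.to_numpy()
for _ in range(1000):
    t0 = time.perf_counter()
    matvec()
    ti.sync()
    print(f"{(time.perf_counter() - t0) * 1e6:9.3f} µs")
# Loose tolerance: float32 accumulation over 8192 terms drifts from NumPy's result.
assert np.allclose(b_expected, b.to_numpy(), rtol=1e-3)
```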
I've also got `matvec` implementations for CUDA, OpenCL, CuPy, CuBLAS, Numba, and Taichi with other backends here for comparison.

**Log/Screenshots**
**Additional comments**
I have tried this with other Taichi versions, CUDA drivers and GPUs. The results were similar.
**System Info**