- https://cs.nyu.edu/manycores/cuda_many_cores.pdf
- https://www.youtube.com/watch?v=SrAMBi_8tIk&list=PLFtDTZgdIZy5n8LWmhTic07zTN3hJQVfy
- https://www.youtube.com/watch?v=lGefnd7Fmmo&list=PLFtDTZgdIZy5n8LWmhTic07zTN3hJQVfy&index=4
- https://nyu-cds.github.io/python-gpu/02-cuda/
- https://en.wikipedia.org/wiki/Thread_block_(CUDA_programming)
- https://developer.nvidia.com/blog/cuda-refresher-cuda-programming-model/
- https://stackoverflow.com/a/10467342
- https://www3.nd.edu/~zxu2/acms60212-40212-S16/Lec-11-GPU.pdf
- http://15418.courses.cs.cmu.edu/spring2017/lecture/gpuarch/slide_004
- https://cvw.cac.cornell.edu/GPUarch/threadcore
- https://kdm.icm.edu.pl/Tutorials/GPU-intro/introduction.en/
- I think this cheatsheet is best read after you understand a GPU's architecture and terminology.
- CUDA is Nvidia's proprietary API to execute code on their GPUs.
- Kernels are CUDA programs that run on Nvidia GPUs.
- CPUs are hosts; GPUs are devices.
- Blocks contain threads. Threads in a block can have 1D, 2D, or 3D indices.
- Grids contain blocks. Blocks in a grid can have 1D, 2D, or 3D indices.
- Streaming multiprocessors (SMs) are made up of cores.
- Streaming multiprocessors execute blocks. An entire block must run on a single SM.
- Warps always have 32 threads. A block is executed as several warps.
- A warp consists of lanes, one per thread (the term doesn't seem to be used much, though). The sketch after this list shows how warp and lane indices fall out of a thread's index.
- The GPU is responsible for allocating thread blocks to SMs.
- Nvidia has changed the definition of "core" over time for marketing purposes.
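To make this vocabulary concrete, here is a minimal sketch of a 1D launch (my own example; all names are hypothetical). Each thread computes its global index from its block and thread indices, and derives its warp and lane from `threadIdx`. It assumes a CUDA-capable GPU and compiles with `nvcc`:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Each thread records its global index; one thread per warp prints
// where that warp starts. warpSize is a built-in (32 on current GPUs).
__global__ void whoAmI(int *globalIds) {
    int globalId = blockIdx.x * blockDim.x + threadIdx.x;
    int warpId = threadIdx.x / warpSize;  // which warp within the block
    int lane   = threadIdx.x % warpSize;  // position within the warp
    globalIds[globalId] = globalId;
    if (lane == 0)
        printf("block %d, warp %d starts at global thread %d\n",
               blockIdx.x, warpId, globalId);
}

int main() {
    const int threadsPerBlock = 64;  // two warps per block
    const int blocksPerGrid = 2;
    const int n = threadsPerBlock * blocksPerGrid;

    int *d_ids;  // device (GPU) memory; the CPU side is the host
    cudaMalloc(&d_ids, n * sizeof(int));
    whoAmI<<<blocksPerGrid, threadsPerBlock>>>(d_ids);
    cudaDeviceSynchronize();

    int h_ids[n];  // host memory; copy the results back
    cudaMemcpy(h_ids, d_ids, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_ids);
    return 0;
}
```

With 64 threads per block, threads 0-31 of each block form warp 0 and threads 32-63 form warp 1, matching the "a block is executed as several warps" point above.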
- The programmer writes a CUDA kernel. When launching it, they specify the grid and block dimensions:
- How many blocks per grid? How many threads per block?
- The GPU takes these blocks and allocates them across streaming multiprocessors.
- One block gets mapped to a single SM. No spreading threads in a block across SMs.
- To execute a block, an SM further divides its threads into warps. Warps are currently of size 32 across all Nvidia GPUs.
- One thread is scheduled to run on a single core, so a warp requires 32 cores to run. The number of cores in an SM is a multiple of 32.
- Depending on the number of blocks, several blocks may be allocated to a single SM. An SM can only hold so many resident blocks at once; blocks beyond that limit execute later, as resources free up.
- An SM can context-switch between active warps: if one warp is waiting on a memory access to complete, the scheduler can issue another warp that is ready to run.
- Warp scheduling: https://www.cc.gatech.edu/fac/hyesoon/gputhread.pdf
- How to choose grid and block dimensions https://stackoverflow.com/a/9986748
- There are people writing PhD theses on the quantitative analysis of aspects of this problem. The usual starting heuristic is sketched below.
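A minimal sketch of that heuristic, assuming a 1D array of `n` floats already on the device (the function names and the 256-thread default are my choices): pick a block size that is a multiple of the warp size, then compute the grid size by ceiling division so every element is covered.

```cuda
#include <cuda_runtime.h>

// Scale every element of x by s. The bounds check matters because the
// last block may have threads past the end of the array.
__global__ void scale(float *x, float s, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        x[i] = s * x[i];
}

void launchScale(float *d_x, float s, int n) {
    // A multiple of 32 (the warp size), so no warp is partially used.
    const int threadsPerBlock = 256;
    // Ceiling division: enough blocks to cover all n elements.
    const int blocksPerGrid = (n + threadsPerBlock - 1) / threadsPerBlock;
    scale<<<blocksPerGrid, threadsPerBlock>>>(d_x, s, n);
}
```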
- `float` computation is much faster on GPUs than `double` computation: consumer cards have far fewer FP64 units than FP32 units (on Turing, FP64 throughput is 1/32 of FP32).
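One easy way to fall onto the slow `double` path by accident (my own example, not from the notes above): in CUDA C++ an unsuffixed literal like `0.5` is a `double`, so mixing it with `float` promotes the arithmetic to FP64. The `f` suffix keeps it in FP32:

```cuda
// Unsuffixed 0.5 is a double: the multiply runs on the scarce FP64
// units, then the result is truncated back to float.
__global__ void halveSlow(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = x[i] * 0.5;
}

// The f suffix keeps the whole expression in float, on the FP32 units.
__global__ void halveFast(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = x[i] * 0.5f;
}
```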
- I have an Nvidia GeForce GTX 1660 Ti.
- I learned this by running `nvidia-smi --query-gpu=name --format=csv`.
- Article on this particular GPU: https://www.nvidia.com/en-us/geforce/news/geforce-gtx-1660-ti-advanced-shaders-streaming-multiprocessor/
- Turing architecture.
- This GPU has 24 streaming multiprocessors and 1536 cores total (so 64 cores per SM).
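The same numbers can be read programmatically. A minimal sketch using the CUDA runtime API, assuming device 0 (cores per SM is not exposed by the API, so it has to be known per architecture):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  // query device 0
    printf("name: %s\n", prop.name);
    printf("compute capability: %d.%d\n", prop.major, prop.minor);
    printf("streaming multiprocessors: %d\n", prop.multiProcessorCount);
    printf("warp size: %d\n", prop.warpSize);
    // Cores per SM depends on the architecture (64 for Turing), so
    // total cores = multiProcessorCount * 64 on this card.
    return 0;
}
```

On this card it should report 24 multiprocessors; with Turing's 64 FP32 cores per SM, that gives the 1536 total.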
- Vector addition: https://gist.github.com/vhxs/5bc0f4a00050ff277c43248a382e680f
- Matrix multiplication: https://github.com/vhxs/matrix-cuda
- More to come