Writing a CUDA software ray tracing renderer from scratch: a renderer with Analysis-Driven Optimization

Enigmatisms/cuda-pt

CUDA-PT


Software Path Tracing renderer implemented in CUDA, from scratch.

[Images: sports-cars, Malorian-Arms-3516, modern-kitchen, dispersion]

Compile & Run

This repo pulls in several external dependencies, so clone it recursively with the following command:

git clone https://github.com/Enigmatisms/cuda-pt.git --recursive

The interactive viewer (./build/xx/cpt) depends on GLEW. If GLEW is not installed, only the offline application (./build/xx/pt) is available. The code base originally ran on Linux (tested on Ubuntu 22.04), but I haven't tried that since my Ubuntu machine broke down. Currently, I build with MSVC (VS2022) and CMake:

mkdir build && cd build
cmake -DCMAKE_BUILD_TYPE=Release ..
cmake --build . --config Release

The resulting executables are ./build/xx/cpt.exe and ./build/xx/pt.exe. For example, to run the interactive viewer:

cd build/Release
./cpt.exe ../../scene/xml/bunny.xml
More info

This repo currently has no plan to support OptiX, since the goal here is to experience building the wheel myself and making it fast, rather than implementing useful features. Useful features are incorporated in the experimental path tracer AdaPT. Check my GitHub homepage for more information.

The scalability of this repo might be worse than that of AdaPT, but it will improve over time, since I plan to migrate from Taichi Lang to a pure-CUDA code base. Currently, this repo supports:

  • Toy CUDA depth renderer with profiling
  • Megakernel unidirectional path tracing. Two major ray-scene intersection schemes are employed: shared-memory based AABB culling, and GPU BVH (see below).
  • Wavefront unidirectional path tracing (WFPT) with stream compaction. Currently, WFPT is not as fast as megakernel PT due to the simplicity of the test scenes (and possibly due to uncoalesced global memory (GMEM) access, which I am working on).
  • GPU BVH: A stackless GPU surface area heuristic BVH. The current implementation is not optimal (the ordering of the left and right children is not accounted for, and there is no 'look-back' op), but it is fast enough. Profiling for this part is not very complete. A 1D CUDA texture is used to store the BVH nodes, for better cache performance.
  • CUDA pitched textures for environment maps, normal, roughness, index of refraction and albedo.
  • Online modification of the scene. Check out the video down below.
CUDA-PT.2024-12-22.20-54-16.mp4
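To make the "stackless" idea in the GPU BVH item above concrete, here is a minimal CPU-side C++ sketch of one common scheme, a skip-link (threaded) traversal. It is an illustration under stated assumptions, not this repo's implementation: the names `LinearNode`, `Ray`, `hit_aabb` and `traverse` are hypothetical, nodes are assumed to be stored in depth-first order, and leaves hold a single primitive id. Like the implementation described above, it ignores child ordering and has no look-back step.

```cpp
#include <algorithm>
#include <array>
#include <utility>
#include <vector>

// Hypothetical node layout (not taken from the repo): nodes are stored in
// depth-first order, so the first child of inner node i is node i + 1.
struct LinearNode {
    std::array<float, 3> lo, hi; // AABB corners
    int skip;                    // node to jump to when this subtree is skipped
    int prim;                    // primitive id for leaves, -1 for inner nodes
};

struct Ray {
    std::array<float, 3> o, d;   // origin and direction
};

// Slab test: true if the ray overlaps the box for some t in [0, t_max].
// Zero direction components rely on IEEE-754 infinities.
inline bool hit_aabb(const Ray& r, const LinearNode& n, float t_max) {
    float t0 = 0.f, t1 = t_max;
    for (int a = 0; a < 3; ++a) {
        const float inv = 1.f / r.d[a];
        float t_near = (n.lo[a] - r.o[a]) * inv;
        float t_far  = (n.hi[a] - r.o[a]) * inv;
        if (t_near > t_far) std::swap(t_near, t_far);
        t0 = std::max(t0, t_near);
        t1 = std::min(t1, t_far);
    }
    return t0 <= t1;
}

// Visits the tree without a stack and returns the primitive ids of all
// leaves whose boxes the ray overlaps, in depth-first order.
std::vector<int> traverse(const std::vector<LinearNode>& nodes,
                          const Ray& r, float t_max) {
    std::vector<int> hits;
    const int n = static_cast<int>(nodes.size());
    int i = 0;
    while (i < n) {
        const LinearNode& node = nodes[i];
        if (hit_aabb(r, node, t_max)) {
            if (node.prim >= 0) hits.push_back(node.prim);
            ++i;            // descend to the first child, or step past a leaf
        } else {
            i = node.skip;  // miss: jump over the whole subtree
        }
    }
    return hits;
}
```

Because the only per-ray state is the integer `i`, a loop of this shape needs no per-thread stack, which is the usual motivation for going stackless on a GPU. A real tracer would also shrink `t_max` as closer hits are found.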
TODO
  • (Recent) An imgui based interactive UI.
  • (Around 2025.01, stay tuned) Benchmarking against AdaPT (Taichi Lang based renderer) and OptiX (optional). More profiling, and finally, I plan to write several blog posts on "How to implement an efficient software path tracing renderer with CUDA". The posts will focus on software- and hardware-related analysis-driven optimization, so they will summarize (and teach) some best practices for programming tasks with extremely imbalanced workloads.
Tricks (that will be covered in my upcoming blog posts)

I've tried a handful of tricks. Unfortunately, due to limited time, I haven't documented any of them (including the statistical profiling and analysis), and I currently have only a (somewhat) vague sense of the DOs and DON'Ts. I really want to summarize all of them in November, after landing a good job. So wish me good luck.

  • Divergence control part I (loop 'pre-converge')
  • Divergence control part II: megakernel or wavefront?
  • Stream compaction for WFPT. Shader Execution Reordering (SER) on the Ada Lovelace architecture (NVIDIA RTX 40-series GPUs). (This needs more in-depth reading, since NVIDIA said almost nothing important in their SER white paper.)
  • Coalesced access: SoA in WFPT and lg-throttle problem for AoS
  • Local memory: dynamic indexing considered harmful
  • Dynamic polymorphism: GPU-based variant or device-side inheritance (virtual functions and their pointers)?
  • Avoiding bank conflicts and using vectorized loads / stores
  • IMC (constant cache miss): when should you use constant cache
  • CPU multi-threading and GPU stream-based concurrency (maybe Hyper-Q).
  • (More in-depth reading on this topic) What makes a good GPU-based spatial-partitioning data structure (like a BVH): I am no expert in this and should read more papers on the topic.
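As a small illustration of the stream compaction item in the list above, the sketch below shows the two-phase pattern (exclusive scan over alive flags, then scatter) in plain CPU-side C++. It is only a sketch of the general technique, not this repo's kernel: `compact_active` is a hypothetical name, and on the GPU both phases would run in parallel instead of as sequential loops.

```cpp
#include <cstddef>
#include <vector>

// Returns the indices of the rays that are still alive, preserving order.
// alive[i] != 0 means ray i has not terminated yet.
std::vector<int> compact_active(const std::vector<unsigned char>& alive) {
    const std::size_t n = alive.size();

    // Phase 1: exclusive prefix sum over the flags. offset[i] is the number
    // of alive rays before i, i.e. the output slot of ray i if it is alive.
    std::vector<int> offset(n);
    int total = 0;
    for (std::size_t i = 0; i < n; ++i) {
        offset[i] = total;
        total += alive[i] ? 1 : 0;
    }

    // Phase 2: scatter. Each alive ray writes its index to its own slot, so
    // no two writes collide (which is what makes this step parallelizable).
    std::vector<int> out(static_cast<std::size_t>(total));
    for (std::size_t i = 0; i < n; ++i) {
        if (alive[i]) out[static_cast<std::size_t>(offset[i])] = static_cast<int>(i);
    }
    return out;
}
```

After compaction, the next wavefront stage only needs to launch work for `out.size()` rays, so terminated paths no longer occupy threads.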

Visualizer Notes

  • imgui ships without a CMakeLists.txt, so we have to write one ourselves.
  • I find it painful to use GLEW on Windows: GLEW has to be built manually, and after compilation glew32.dll should be manually copied to Windows/System32.

Misc

This repo originated from w3ntao/smallpt-megakernel, but it is now very different from it. I answered his question on Computer Graphics Stack Exchange and tweaked his code, so I thought to myself: why not build on that repo and try to make it better? (I won't call it small-pt, though, since it definitely won't be small after I heavily optimize the code.)
