Caveat: the overheads of Python/PyTorch can be substantial. For example, the bundled mlp_learning_an_image example runs ~2x slower through PyTorch than through native CUDA. (This is still faster than implementing everything from scratch in Python, but it is something to be aware of.)
Major Changes
Added a GPU memory arena that permits efficient, stream-ordered allocation and de-allocation of temporary buffers. This removes the need for pre-allocation and often results in ~3x lower memory consumption.
The memory arena uses the GPU's virtual memory mapper to achieve this performance without invalidating pointers or shuffling memory around.
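The release notes don't detail the arena's implementation, but the mechanism they allude to — reserving a large virtual address range once and mapping physical memory into it on demand, so existing pointers stay valid as the arena grows — can be sketched with CUDA's virtual memory management API. The snippet below is a minimal, self-contained illustration of that general pattern, not tiny-cuda-nn's actual code; the 1 GiB reservation size and all names are illustrative.

```cpp
// Illustrative sketch only: arena-style growth via CUDA's virtual memory API.
// Reserve a large virtual range up front, then commit and map physical memory
// on demand; the base pointer never changes as more memory is mapped.
#include <cuda.h>
#include <cstdio>

#define CU_CHECK(x)                                                              \
    do {                                                                         \
        CUresult res_ = (x);                                                     \
        if (res_ != CUDA_SUCCESS) {                                              \
            const char* msg;                                                     \
            cuGetErrorString(res_, &msg);                                        \
            std::printf("CUDA error: %s (%s:%d)\n", msg, __FILE__, __LINE__);    \
            return 1;                                                            \
        }                                                                        \
    } while (0)

int main() {
    CU_CHECK(cuInit(0));
    CUdevice dev;
    CU_CHECK(cuDeviceGet(&dev, 0));
    CUcontext ctx;
    CU_CHECK(cuCtxCreate(&ctx, 0, dev));

    CUmemAllocationProp prop = {};
    prop.type = CU_MEM_ALLOCATION_TYPE_PINNED;
    prop.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
    prop.location.id = dev;

    size_t granularity;
    CU_CHECK(cuMemGetAllocationGranularity(&granularity, &prop, CU_MEM_ALLOC_GRANULARITY_MINIMUM));

    // Reserve a large virtual address range (1 GiB here); no physical memory is committed yet.
    const size_t reserved_size = 1ull << 30;
    CUdeviceptr base;
    CU_CHECK(cuMemAddressReserve(&base, reserved_size, 0, 0, 0));

    // Commit one granule of physical memory and map it at the start of the range.
    // Growing later means mapping further granules; `base` stays valid throughout.
    CUmemGenericAllocationHandle handle;
    CU_CHECK(cuMemCreate(&handle, granularity, &prop, 0));
    CU_CHECK(cuMemMap(base, granularity, 0, handle, 0));

    CUmemAccessDesc access = {};
    access.location = prop.location;
    access.flags = CU_MEM_ACCESS_FLAGS_PROT_READWRITE;
    CU_CHECK(cuMemSetAccess(base, granularity, &access, 1));

    std::printf("arena base %p: %zu of %zu bytes mapped\n", (void*)base, granularity, reserved_size);

    // Teardown in reverse order.
    CU_CHECK(cuMemUnmap(base, granularity));
    CU_CHECK(cuMemRelease(handle));
    CU_CHECK(cuMemAddressFree(base, reserved_size));
    CU_CHECK(cuCtxDestroy(ctx));
    return 0;
}
```

Compile against the CUDA driver library (e.g. `nvcc sketch.cu -lcuda`); the virtual memory management API requires a driver and GPU that support it.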
All neural networks in tiny-cuda-nn now additionally support a row-major input memory layout. This affords higher performance and lower memory usage in cases where a transposition would otherwise be required (illustrated below).
GridEncoding naturally outputs row-major data and is thus sped up by ~20% when followed by a neural network.
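To make the layout distinction concrete, here is a small, generic host-side illustration (not tiny-cuda-nn's API) of the same B x D batch stored row-major versus column-major, and the extra transposition pass needed when producer and consumer disagree on layout — the pass that direct row-major support avoids. The sizes and names are made up for the example.

```cpp
// Generic illustration: row-major vs. column-major storage of a B x D input batch.
#include <cstdio>
#include <vector>

int main() {
    const int B = 4; // batch size (number of input elements)
    const int D = 3; // dimensions per element

    // Row-major: the D values of one element are contiguous.
    std::vector<float> row_major(B * D);
    for (int b = 0; b < B; ++b)
        for (int d = 0; d < D; ++d)
            row_major[b * D + d] = static_cast<float>(b * 10 + d);

    // If the consumer expects column-major data (the B values of one dimension
    // contiguous), an extra pass and an extra buffer are required -- this is the
    // transposition that accepting row-major inputs directly avoids.
    std::vector<float> col_major(B * D);
    for (int b = 0; b < B; ++b)
        for (int d = 0; d < D; ++d)
            col_major[d * B + b] = row_major[b * D + d];

    std::printf("element 2, dim 1: row-major %.0f, column-major %.0f\n",
                row_major[2 * D + 1], col_major[1 * B + 2]);
    return 0;
}
```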
tiny-cuda-nn now runs on older GPUs, down to compute capability 3.7.
Minor Changes
Sped up the input gradient computation of GridEncoding by ~3x.
Sped up SyncedMultiStream.
Fixed incorrect gradients of SphericalHarmonicsEncoding.
Fixed incorrect gradients of GridEncoding when max_level arguments were provided or Interpolation::Nearest was used.