JIT
Do we need a JIT? Why would you need one? Isn't compile time all the rage? Let's find out.
Why do you need a JIT?
Look at the above `Sigmoid` operation. It consists of 4 operations. Without the ability to fuse operations together, this would naïvely dispatch 4 GPU kernels.
This is a crime. Why?
If the operators are not fused together (i.e. the 4 kernels merged into 1 kernel), they can only communicate via global GPU memory. A read from global memory takes roughly 300-800 clock cycles, which is orders of magnitude longer than it takes to actually do all 4 computations.
Therefore, we can conclusively see we need some way of fusing these kernels together.
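To make the cost concrete, here is a minimal CPU-side sketch in Rust. It assumes the 4 operations are negate, exp, add 1 and reciprocal (the post does not spell them out), and it uses intermediate `Vec`s as a stand-in for the global-memory round trips between kernels. The function names are hypothetical and not taken from any library.

```rust
/// Unfused: each op is its own pass and materialises a full intermediate
/// buffer, standing in for one kernel dispatch plus a trip through global memory.
/// ASSUMPTION: the 4 ops are neg, exp, add 1, reciprocal.
fn sigmoid_unfused(x: &[f32]) -> Vec<f32> {
    let neg: Vec<f32> = x.iter().map(|&v| -v).collect(); // "kernel" 1
    let exp: Vec<f32> = neg.iter().map(|&v| v.exp()).collect(); // "kernel" 2
    let add: Vec<f32> = exp.iter().map(|&v| v + 1.0).collect(); // "kernel" 3
    add.iter().map(|&v| 1.0 / v).collect() // "kernel" 4
}

/// Fused: a single pass; the intermediates never leave registers.
fn sigmoid_fused(x: &[f32]) -> Vec<f32> {
    x.iter().map(|&v| 1.0 / (1.0 + (-v).exp())).collect()
}
```

Both functions compute the same values. The unfused version writes and re-reads three full-size intermediate buffers, which is exactly the traffic that fusion removes.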
Why a JIT and not AOT?
"Let's do it at compile time," I hear you cry. "It will be faster!"
Currently, the Rust type system (or any type system in a production language) is not capable of encoding all of the information required to do this at compile time in a way that won't make users run screaming. In particular, we do not have all shape information at compile time, given the nature of autoregressive models (e.g. `seq_len`).
QED: we need a JIT.
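As a rough illustration of why runtime shapes push towards a JIT, the sketch below specialises a kernel on `seq_len` the first time that value is seen and reuses it afterwards. Everything here is hypothetical: `KernelCache`, `get_or_compile` and the generated "source" string are placeholders for illustration, not this project's actual API or real shader code.

```rust
use std::collections::HashMap;

/// Hypothetical cache of generated kernel source, keyed by a shape
/// that is only known at runtime (here just `seq_len`).
struct KernelCache {
    kernels: HashMap<usize, String>,
    compiles: usize, // how many times we actually had to generate a kernel
}

impl KernelCache {
    fn new() -> Self {
        Self { kernels: HashMap::new(), compiles: 0 }
    }

    /// Return the kernel specialised for `seq_len`, generating it on first use.
    fn get_or_compile(&mut self, seq_len: usize) -> &str {
        if !self.kernels.contains_key(&seq_len) {
            self.compiles += 1;
            // Placeholder for real code generation: bake the shape into the source.
            let src = format!("// fused sigmoid kernel, seq_len = {}", seq_len);
            self.kernels.insert(seq_len, src);
        }
        &self.kernels[&seq_len]
    }
}
```

Calling `get_or_compile(128)` twice generates the kernel once; a later call with `129` (the next autoregressive step) triggers a second generation. An AOT approach would need every such `seq_len` to be known before the program runs.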