GroupedGemm: FP8 per-tensor via cuBLAS

Add GroupedGemm support for FP8 with per-tensor quantization, backed by cuBLAS. Grouped operations must be batched efficiently and remain compatible with device-supplied data buffers.
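
For reviewers who want a numerical baseline, the intended semantics can be sketched as a host-side reference in plain Python. This is only an illustration under stated assumptions, not the cuBLAS code path: it assumes the E4M3 format (maximum finite value 448), one scale per tensor computed as amax / 448, and a simplified rounding model that ignores subnormals and NaN. The names `fake_e4m3`, `quantize_per_tensor`, and `grouped_gemm_fp8_reference` are hypothetical and do not appear in this change.

```python
import math

E4M3_MAX = 448.0  # largest finite value in the FP8 E4M3 format


def fake_e4m3(x):
    """Simulate E4M3 rounding: keep 3 mantissa bits, clamp to +/-448.

    Simplification (assumption): subnormals and NaN are not modelled.
    """
    if x == 0.0:
        return 0.0
    m, e = math.frexp(x)  # x == m * 2**e with 0.5 <= |m| < 1
    y = math.ldexp(round(m * 16) / 16, e)  # 1 implicit + 3 explicit bits
    return max(-E4M3_MAX, min(E4M3_MAX, y))


def quantize_per_tensor(mat):
    """Return (quantized matrix, scale) using a single scale for the tensor."""
    amax = max(abs(v) for row in mat for v in row)
    scale = amax / E4M3_MAX if amax > 0 else 1.0
    return [[fake_e4m3(v / scale) for v in row] for row in mat], scale


def matmul(a, b):
    """Plain row-major matrix product on lists of lists."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def grouped_gemm_fp8_reference(groups):
    """For each (A, B) pair: quantize both, multiply, then rescale the result."""
    outs = []
    for a, b in groups:
        qa, sa = quantize_per_tensor(a)
        qb, sb = quantize_per_tensor(b)
        acc = matmul(qa, qb)  # accumulation stays in high precision
        outs.append([[v * sa * sb for v in row] for row in acc])
    return outs
```

Each group may have its own shape; the reference loops over groups one at a time, whereas a real grouped kernel would process them in a single batched launch. Comparing device output against such a baseline needs a tolerance, since FP8 rounding alone contributes several percent of relative error per operand.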