### 🐛 Describe the bug

We use the kernel-specific maximum work-group size to avoid platform compatibility issues. The routine is:

```cpp
// Query the kernel-specific maximum work-group size for this device.
auto kid = ::sycl::get_kernel_id<KernelClass>();
auto kbundle = ::sycl::get_kernel_bundle<::sycl::bundle_state::executable>(
    ctx, {dev}, {kid});
sycl::kernel k = kbundle.get_kernel(kid);
int max_work_group_size =
    k.get_info<::sycl::info::kernel_device_specific::work_group_size>(dev);
```
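One possible mitigation (a sketch only, not the fix adopted in torch-xpu-ops) is to cache the queried value per kernel type, so the bundle lookup cost is paid once per process rather than on every launch. Here `query_max_wg_size` is a hypothetical stand-in for the SYCL query routine above, and the returned limit of 1024 is an assumed value for illustration:

```cpp
// Hypothetical stand-in for the SYCL bundle query shown above; it counts
// its invocations so the caching behavior is observable.
inline int g_query_calls = 0;

int query_max_wg_size() {
    ++g_query_calls;  // in real code: get_kernel_bundle + get_info
    return 1024;      // assumed device limit, for illustration only
}

// The template parameter stands in for the kernel class. The static local
// is initialized exactly once per KernelClass instantiation, so subsequent
// launches skip the expensive query entirely.
template <typename KernelClass>
int cached_max_wg_size() {
    static const int cached = query_max_wg_size();
    return cached;
}

struct KernelA {};  // example kernel tag type
```

With this shape, repeated calls to `cached_max_wg_size<KernelA>()` hit the cached value and the query runs only once per kernel type.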
`sycl::get_kernel_bundle` incurs severe host overhead. The measured data is below:

Impact: all kernels in torch-xpu-ops that are launched with the kernel-specific max work-group size are affected.

A ~40 µs overhead is not acceptable for some single-batch inference cases, where kernel latency can be below 10 µs.

For comparison, the CUDA runtime typically spends ~6 µs per kernel launch.
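A minimal host-side timing harness can make overhead numbers like the ones above reproducible. The sketch below times an arbitrary callable over many iterations and returns the average cost in microseconds; in a real reproducer the callable would wrap the `sycl::get_kernel_bundle` routine shown earlier (here it is left generic, since no SYCL device is assumed):

```cpp
#include <chrono>

// Times `fn` over `iters` calls and returns the average host-side cost in
// microseconds. Averaging over many iterations smooths out timer jitter,
// which matters when the per-call cost is in the tens of microseconds.
template <typename F>
double avg_host_us(F fn, int iters) {
    using clock = std::chrono::steady_clock;
    auto t0 = clock::now();
    for (int i = 0; i < iters; ++i) fn();
    auto t1 = clock::now();
    auto us = std::chrono::duration<double, std::micro>(t1 - t0).count();
    return us / iters;
}
```

For example, `avg_host_us([&]{ /* get_kernel_bundle + get_info */ }, 1000)` would report the average per-call host overhead of the query path.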
intel/llvm#15824
### Versions