nvFuser Profiler Instructions
A CUDA device synchronize is executed prior to each measured fusion in order to expose the host time to measurement. The first line of each profile includes the aggregate fusion information and the first segment's information. Successive segments have their aggregate fusion information omitted.
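The relationship between the timing columns can be pictured with the illustrative sketch below. It is not nvFuser's implementation; the `runFusion` callback is hypothetical and CUPTI is omitted. It only shows how a prior device synchronize, a host timer, and a pair of CUDA events relate to the `HstTm`, `CuEvtTm`, and `KerTm` columns described below.

```cpp
// Illustrative sketch only: assumes a hypothetical runFusion() callback that
// compiles (if needed) and launches a fusion's kernels on the default stream.
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

template <typename RunFusionFn>
void timeFusion(RunFusionFn runFusion) {
  // Drain previously queued GPU work so the host work below is exposed to
  // measurement instead of overlapping with earlier kernels.
  cudaDeviceSynchronize();

  cudaEvent_t start, stop;
  cudaEventCreate(&start);
  cudaEventCreate(&stop);

  auto host_begin = std::chrono::steady_clock::now();
  cudaEventRecord(start);  // GPU is idle, so this event completes immediately

  runFusion();  // host work (lowering/compilation) plus asynchronous kernel launches

  cudaEventRecord(stop);   // enqueued after all of the fusion's kernels
  auto host_end = std::chrono::steady_clock::now();

  cudaEventSynchronize(stop);
  float cuda_evt_ms = 0.0f;
  cudaEventElapsedTime(&cuda_evt_ms, start, stop);
  double host_ms =
      std::chrono::duration<double, std::milli>(host_end - host_begin).count();

  // cuda_evt_ms ~ CuEvtTm(ms): host work plus kernel execution.
  // host_ms     ~ HstTm(ms):   the host-side portion only.
  // Per-kernel KerTm(ms) comes from CUPTI, which is not shown here.
  std::printf("CuEvtTm %.3f ms, HstTm %.3f ms\n", cuda_evt_ms, host_ms);

  cudaEventDestroy(start);
  cudaEventDestroy(stop);
}
```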
- Aggregate Fusion Information
  - `Fus#` (nocupti) - Fusion number
  - `NSegs` (nocupti) - Number of segments
  - `CuEvtTm(ms)` (nocupti) - The CUDA event time that wraps the entire fusion process, including kernel execution
  - `HstTm(ms)` (nocupti) - The host timer measurement of the fusion-level activity
  - `CmpTm(ms)` (nocupti) - The parallel compilation time, or the aggregate of serialized segment compilation, depending on which option is used. Compilation time includes kernel lowering and NVRTC compilation
  - `KerTm(ms)` - The aggregated kernel time of all the segments executed
  - `EffBw(GB/s)` - The effective bandwidth of the inputs/outputs of the fusion, which does not include the I/O specific to each segment. Therefore, segmentation should penalize the aggregate effective bandwidth (a worked sketch of the bandwidth formulas follows the Segment Information list)
  - `%PeakBw` - The percentage of the device's peak bandwidth achieved by the fusion, based on the preceding effective bandwidth. This is also the `%SOL` of the fusion
  - `In(MB)` (verbose) - The size of the fusion inputs in MB
  - `Out(MB)` (verbose) - The size of the fusion outputs in MB
- Segment Information
  - `S-Seg#` - The index of the segment reported on the line
  - `S-KerName` - The CUDA kernel's short name
  - `S-KerTm(ms)` - The CUPTI-profiled kernel time for the given segment
  - `S-CmpTm(ms)` (verbose) - The compile time for the given segment if kernels are compiled without parallel compilation
  - `S-EffBw(GB/s)` - The effective bandwidth of the segment given the specific I/O of this segment's kernel
  - `S-%PeakBw` - The percentage of the device's peak bandwidth achieved by this kernel; the kernel's `%SOL`, assuming the kernel is bandwidth limited
  - `S-In(MB)` - The size of the kernel inputs in MB
  - `S-Out(MB)` - The size of the kernel outputs in MB
  - `S-Smem[Dyn,Stat]` - The dynamic and static shared memory used by the kernel, in bytes
  - `S-Regs` - The registers per thread used by the kernel
  - `S-Grid` - The grid dimensions of the kernel
  - `S-Block` - The block dimensions of the kernel
  - `S-Cluster` (verbose) - The cluster dimensions of the kernel
  - `S-Dev` (verbose) - The device id of the kernel execution
  - `S-Stm` (verbose) - The stream id of the kernel execution
  - `S-DeviceName` (verbose) - The market name of the device used for kernel execution
  - `S-PeakBw(GB/s)` (verbose) - The peak bandwidth of the device used for execution
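As referenced in the lists above, the sketch below spells out the arithmetic that appears to relate the bandwidth columns to the I/O sizes, the kernel time, and the device's peak bandwidth. The formulas are an inference from the column descriptions, not nvFuser code, and they use fusion 3 from the `print.verbose` sample output further down; small differences from the printed values come from the profiler using unrounded kernel times.

```cpp
// Sketch of the assumed arithmetic behind EffBw(GB/s) and %PeakBw, checked
// against fusion 3 of the print.verbose sample output (A100 80GB).
#include <cstdio>

int main() {
  const double in_bytes = 209715200.0;   // In(MB)  = 209.715 (assuming MB = 10^6 bytes)
  const double out_bytes = 209715200.0;  // Out(MB) = 209.715
  const double kernel_time_ms = 0.244;   // KerTm(ms)
  const double peak_bw_gbs = 1935.36;    // S-PeakBw(GB/s) reported for the device

  // Effective bandwidth: total bytes moved divided by kernel time.
  const double eff_bw_gbs =
      (in_bytes + out_bytes) / (kernel_time_ms * 1.0e-3) / 1.0e9;

  // %PeakBw (a.k.a. %SOL): effective bandwidth as a fraction of device peak.
  const double pct_peak = 100.0 * eff_bw_gbs / peak_bw_gbs;

  std::printf("EffBw %.2f GB/s (printed: 1719.43), %%PeakBw %.2f (printed: 88.84)\n",
              eff_bw_gbs, pct_peak);
  return 0;
}
```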
This profiler adds device synchronizations prior to every fusion. Therefore, it is not suitable for end-to-end profiling of a network, where you should expect nvFuser's host work to overlap with previous device work. It is suitable for measuring the host and device activity of an individual fusion and its kernels.
For end-to-end measurement, these Nsight Systems commands are appropriate:

- Print a GPU trace to the console:
  `nsys nvprof --print-gpu-trace <Your Program>`
- Create an Nsight Systems report:
  `nsys profile -w true -t cublas,cuda,nvtx,osrt -o <Report Name> <Your Program>`
All `NVFUSER_PROF` values minimally enable the Fusion Profiler that populates the `FusionProfile` object. For example, running your program with `NVFUSER_PROF=print` in the environment prints a profile for every fusion executed.

`NVFUSER_PROF` values:

- `enable`: Enables full profiling and records the activity in the `FusionProfile` object only.
- `enable.nocupti`: Only enables host timer profiling of fusion-level activity and records the activity in the `FusionProfile` object only. The "nocupti" option lets you see the host timing without the CUPTI overhead, which can be as significant as 10us to 30us.
- `print`: Enables full profiling, records the activity in the `FusionProfile` object, and prints it to the console.
- `print.verbose`: Enables full profiling, records the activity in the `FusionProfile` object, and prints it to the console. Additional information is printed in extra columns that may cause line wrapping on smaller terminal windows. The (verbose) label above marks the set of additional columns printed.
- `print.nocupti`: Only enables host timer profiling of fusion-level activity, records the activity in the `FusionProfile` object, and prints it to the console with an abbreviated set of columns. The (nocupti) label above marks the set of columns printed.
You can collect statistics from profiling a specific fusion by querying the output object:

```cpp
FusionProfile fprof = FusionProfiler::get()->profile();
```
The `FusionProfile` object and its per-segment `KernelProfile` entries contain the following fields (the comments map each field to the printed column it feeds):

```cpp
struct FusionProfile {
  int64_t fusion_id{-1};                 // Fus#
  int64_t segments{0};                   // NSegs
  double cuda_evt_time_ms{0.0};          // CuEvtTm(ms)
  double host_time_ms{0.0};              // HstTm(ms)
  double compile_time_ms{0.0};           // CmpTm(ms)
  double kernel_time_ms{0.0};            // KerTm(ms)
  int64_t input_bytes{0};                // In(MB), stored in bytes
  int64_t output_bytes{0};               // Out(MB), stored in bytes
  double effective_bandwidth_gbs{0.0};   // EffBw(GB/s)
  double percentage_peak_bandwidth{0.0}; // %PeakBw
  std::vector<KernelProfile> kernel_profiles{}; // one entry per segment
};

struct KernelProfile {
  std::string name{};                    // S-KerName
  int64_t segment_id;                    // S-Seg#
  int device{-1};                        // S-Dev
  uint32_t stream{0};                    // S-Stm
  uint32_t correlation_id{0};            // CUPTI correlation id
  double compile_time_ms{0.0};           // S-CmpTm(ms)
  double time_ms{0.0};                   // S-KerTm(ms)
  double effective_bandwidth_gbs{0.0};   // S-EffBw(GB/s)
  double percentage_peak_bandwidth{0.0}; // S-%PeakBw
  std::array<int32_t, 3> grid{0, 0, 0};  // S-Grid
  std::array<int32_t, 3> block{0, 0, 0}; // S-Block
  std::array<uint32_t, 3> cluster{0, 0, 0}; // S-Cluster
  int32_t dynamic_shared_mem{0};         // S-Smem[Dyn]
  int32_t static_shared_mem{0};          // S-Smem[Stat]
  uint32_t registers{0};                 // S-Regs
  int64_t input_bytes{0};                // S-In(MB), stored in bytes
  int64_t output_bytes{0};               // S-Out(MB), stored in bytes
  std::string device_name{};             // S-DeviceName
  double peak_bandwidth_gbs{0.0};        // S-PeakBw(GB/s)
};
```
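Building on the query shown above, here is a minimal sketch of pulling a few statistics out of the most recent profile. The `printLastProfile` helper and its output format are made up for the example; only the `FusionProfiler::get()->profile()` call and the struct fields come from this page, and the nvFuser header that declares them depends on your build.

```cpp
// Minimal sketch: read the most recent FusionProfile and report per-segment
// kernel statistics. Assumes the nvFuser profiler declarations (FusionProfiler,
// FusionProfile, KernelProfile) are visible via the appropriate nvFuser header.
#include <iostream>

void printLastProfile() {
  const FusionProfile fprof = FusionProfiler::get()->profile();

  std::cout << "fusion " << fprof.fusion_id << ": " << fprof.segments
            << " segment(s), " << fprof.kernel_time_ms << " ms kernel time, "
            << fprof.effective_bandwidth_gbs << " GB/s ("
            << fprof.percentage_peak_bandwidth << "% of peak)\n";

  for (const KernelProfile& kp : fprof.kernel_profiles) {
    std::cout << "  segment " << kp.segment_id << " " << kp.name << ": "
              << kp.time_ms << " ms, " << kp.registers << " regs/thread, grid ["
              << kp.grid[0] << ", " << kp.grid[1] << ", " << kp.grid[2] << "]\n";
  }
}
```

This is one way to reproduce the printed columns programmatically, for example after running a fusion with `NVFUSER_PROF=enable`.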
Example console output with `NVFUSER_PROF=print`:

```
Fus# NSegs CuEvtTm(ms) HstTm(ms) CmpTm(ms) KerTm(ms) EffBw(GB/s) %PeakBw Seg# S-KerName S-KerTm(ms) S-EffBw(GB/s) S-%PeakBw S-In(MB) S-Out(MB) S-Smem[Dyn,Stat] S-Regs S-Grid S-Block
0 1 328.582 328.529 277.940 0.059 1765.88 91.24 0 kernel1 0.059 1765.88 91.24 52.442 52.494 [512, 16] 41 [8192, 1, 1] [128, 1, 1]
1 1 95.395 95.345 93.025 0.057 1845.05 95.33 0 kernel2 0.057 1845.05 95.33 52.429 52.429 [0, 0] 16 [25600, 1, 1] [128, 1, 1]
2 1 1076.190 1076.054 1012.063 0.141 1583.82 81.84 0 kernel3 0.141 1583.82 81.84 104.870 118.030 [512, 16] 52 [8192, 1, 1] [128, 1, 1]
3 1 139.589 139.350 129.783 0.246 1707.12 88.21 0 kernel4 0.246 1707.12 88.21 209.715 209.715 [0, 0] 16 [102400, 1, 1] [128, 1, 1]
4 1 283.830 283.744 276.638 0.092 1855.98 95.90 0 kernel5 0.092 1855.98 95.90 104.858 65.536 [0, 4] 34 [25600, 1, 1] [128, 1, 1]
5 1 124.316 124.253 119.113 0.069 1716.20 88.68 0 kernel6 0.069 1716.20 88.68 65.536 52.429 [0, 0] 18 [25600, 1, 1] [128, 1, 1]
6 2 369.042 368.662 226.779 0.431 1582.54 81.77 0 kernel7 0.048 1085.88 56.11 52.429 0.006 [2048, 16] 32 [50, 17, 1] [32, 16, 1]
- - - - - - - - 1 kernel8 0.382 1645.26 85.01 419.430 209.715 [0, 0] 22 [102400, 1, 1] [128, 1, 1]
7 2 2024.829 2024.633 978.544 0.338 1436.02 74.20 0 kernel9 0.132 1585.10 81.90 209.715 0.026 [2048, 16] 32 [50, 30, 1] [64, 8, 1]
- - - - - - - - 1 kernel10 0.205 1340.02 69.24 170.466 104.870 [512, 16] 124 [1, 432, 1] [2, 64, 1]
8 2 197.137 197.089 193.144 0.101 1559.95 80.60 0 kernel12 0.049 1070.98 55.34 52.429 0.006 [2048, 16] 32 [50, 17, 1] [32, 16, 1]
- - - - - - - - 1 kernel11 0.052 2021.47 104.45 52.429 52.429 [0, 0] 16 [25600, 1, 1] [128, 1, 1]
9 2 1729.789 1729.616 872.094 0.284 1294.23 66.87 0 kernel13 0.104 1506.07 77.82 157.286 0.019 [2048, 16] 32 [38, 31, 1] [64, 8, 1]
- - - - - - - - 1 kernel14 0.179 1170.76 60.49 157.358 52.442 [512, 16] 126 [1, 432, 1] [2, 64, 1]
10 1 0.149 0.099 0.000 0.058 1809.74 93.51 0 kernel1 0.058 1809.74 93.51 52.442 52.494 [512, 16] 41 [8192, 1, 1] [128, 1, 1]
```

The `print.verbose` option adds columns on the far right with details such as the device profiled and its peak bandwidth:

```
Fus# NSegs CuEvtTm(ms) HstTm(ms) CmpTm(ms) KerTm(ms) EffBw(GB/s) %PeakBw Seg# S-KerName S-KerTm(ms) S-CmpTm(ms) S-EffBw(GB/s) S-%PeakBw S-In(MB) S-Out(MB) S-Smem[Dyn,Stat] S-Regs S-Grid S-Block S-Cluster S-Dev S-Stm S-DeviceName S-PeakBw(GB/s)
0 1 328.880 328.828 277.962 0.059 1777.37 91.84 0 kernel1 0.059 276.450 1777.37 91.84 52.442 52.494 [512, 16] 41 [8192, 1, 1] [128, 1, 1] [0, 0, 0] 0 7 NVIDIA A100 80GB P 1935.36
1 1 95.295 95.241 92.902 0.059 1791.61 92.57 0 kernel2 0.059 92.573 1791.61 92.57 52.429 52.429 [0, 0] 16 [25600, 1, 1] [128, 1, 1] [0, 0, 0] 0 7 NVIDIA A100 80GB P 1935.36
2 1 1078.698 1078.567 1014.494 0.139 1599.47 82.64 0 kernel3 0.139 1013.166 1599.47 82.64 104.870 118.030 [512, 16] 52 [8192, 1, 1] [128, 1, 1] [0, 0, 0] 0 7 NVIDIA A100 80GB P 1935.36
3 1 139.003 138.766 129.382 0.244 1719.43 88.84 0 kernel4 0.244 128.690 1719.43 88.84 209.715 209.715 [0, 0] 16 [102400, 1, 1] [128, 1, 1] [0, 0, 0] 0 7 NVIDIA A100 80GB P 1935.36
4 1 283.905 283.822 276.647 0.092 1855.35 95.87 0 kernel5 0.092 276.213 1855.35 95.87 104.858 65.536 [0, 4] 34 [25600, 1, 1] [128, 1, 1] [0, 0, 0] 0 7 NVIDIA A100 80GB P 1935.36
5 1 124.911 124.845 119.094 0.069 1706.67 88.18 0 kernel6 0.069 118.597 1706.67 88.18 65.536 52.429 [0, 0] 18 [25600, 1, 1] [128, 1, 1] [0, 0, 0] 0 7 NVIDIA A100 80GB P 1935.36
6 2 368.721 368.344 226.877 0.428 1593.32 82.33 0 kernel7 0.048 0.000 1085.91 56.11 52.429 0.006 [2048, 16] 32 [50, 17, 1] [32, 16, 1] [0, 0, 0] 0 7 NVIDIA A100 80GB P 1935.36
- - - - - - - - 1 kernel8 0.379 0.000 1657.88 85.66 419.430 209.715 [0, 0] 22 [102400, 1, 1] [128, 1, 1] [0, 0, 0] 0 7 NVIDIA A100 80GB P 1935.36
7 2 2031.692 2031.500 975.809 0.336 1441.90 74.50 0 kernel9 0.133 0.000 1571.43 81.20 209.715 0.026 [2048, 16] 32 [50, 30, 1] [64, 8, 1] [0, 0, 0] 0 7 NVIDIA A100 80GB P 1935.36
- - - - - - - - 1 kernel10 0.203 0.000 1356.72 70.10 170.466 104.870 [512, 16] 124 [1, 432, 1] [2, 64, 1] [0, 0, 0] 0 7 NVIDIA A100 80GB P 1935.36
8 2 196.050 196.000 192.079 0.103 1533.67 79.24 0 kernel12 0.049 0.000 1073.08 55.45 52.429 0.006 [2048, 16] 32 [50, 17, 1] [32, 16, 1] [0, 0, 0] 0 7 NVIDIA A100 80GB P 1935.36
- - - - - - - - 1 kernel11 0.054 0.000 1952.80 100.90 52.429 52.429 [0, 0] 16 [25600, 1, 1] [128, 1, 1] [0, 0, 0] 0 7 NVIDIA A100 80GB P 1935.36
9 2 1722.642 1722.485 865.384 0.266 1381.35 71.37 0 kernel13 0.103 0.000 1530.45 79.08 157.286 0.019 [2048, 16] 32 [38, 31, 1] [64, 8, 1] [0, 0, 0] 0 7 NVIDIA A100 80GB P 1935.36
- - - - - - - - 1 kernel14 0.163 0.000 1287.31 66.52 157.358 52.442 [512, 16] 126 [1, 432, 1] [2, 64, 1] [0, 0, 0] 0 7 NVIDIA A100 80GB P 1935.36
The "no-cupti" option omits kernel profiling and, therefore, you do not see any segment information. The purpose of this option is to show you the Host Time HstTm(ms)
without the overhead of CUPTI. It empirically appears to be between 10us
and 30us
in the Host Time when the Compile Time CmpTm(ms)
does not dominate the Host Time.
Fus# NSegs CuEvtTm(ms) HstTm(ms) CmpTm(ms)
0 1 328.874 328.818 278.639
1 1 94.973 94.920 92.629
2 1 1076.356 1076.217 1012.515
3 1 139.532 139.288 129.602
4 1 283.904 283.817 277.174
5 1 124.174 124.110 119.053
6 2 370.288 369.907 226.604
7 2 2031.641 2031.442 977.103
8 2 196.486 196.436 192.158
9 2 1726.821 1726.661 870.991
```