cwpearson/nvidia-performance-tools on Docker Hub.
docker pull cwpearson/nvidia-performance-tools:latest-amd64 # for x86
docker pull cwpearson/nvidia-performance-tools:latest-ppc64le # for POWER
Typically, you'll want the latest-amd64 or latest-ppc64le tags.
If you are developing a workflow and want stability, choose a tag like amd64-10.1-master-ce03360, which describes the architecture, CUDA version, branch, and short SHA of the corresponding git commit for cwpearson/nvidia-performance-tools on GitHub.
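For example, to pin to the tag mentioned above instead of a moving "latest" tag:

```shell
# pull a pinned build (architecture-CUDA-branch-SHA) for reproducibility
docker pull cwpearson/nvidia-performance-tools:amd64-10.1-master-ce03360
```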
- April 21-23 2020 University of Illinois ECE 498 guest lecture for Professors Hwu, Chen, and Xiong.
- Slides
- Recorded Lectures
- April 16 2020 University of Illinois ECE 408 guest lecture for Professor Lumetta.
- Slides
- Recorded Lecture (75 mins)
- sgemm: featuring basic, shared-memory tiled, and joint shared-memory and register tiling.
- coalescing: featuring a simple code with and without memory coalescing, and a discussion of how to analyze efficiency in Nsight Compute.
Each tool comes in a command-line (CLI) and a graphical (GUI) version. They are installed together unless you download a CLI-only package.
- macOS: You probably don't have CUDA installed, so download the Nsight Systems or Compute installer from the Nvidia website.
- Windows
- with CUDA: You may already find Nsight Systems or Compute in your start menu. You can download a more recent release from the Nvidia website. If you install it, you will have two entries in the start menu for different versions.
- without CUDA: Download the Nsight Systems or Compute installer from the CUDA website.
- Linux
  - with CUDA: you may already have Nsight Systems and Compute (check /usr/local/cuda/bin/nsight-sys and /usr/local/cuda/bin/nv-nsight-cu). If so, you can still download the Nsight Systems or Compute .deb package to update; it may override the package that was installed with CUDA. You can also use the .run file, which you should install to a directory not managed by the package manager, then add the location of the resulting binary files to your path.
  - without CUDA:
    - .deb: download the .deb package and install it. Requires root privileges.
    - .run: download the .run package and execute it. Choose a file system that you have permission to install to, then add the resulting binary directory to your path.
#include <nvToolsExt.h>
nvtxRangePush("span 1");
nvtxRangePush("a nested span");
nvtxRangePop(); // end nested span
nvtxRangePop(); // end span 1
Also link with -lnvToolsExt.
Compile with optimizations turned on and without debug information.
The most likely relevant flags for nvcc are below:
--profile (-pg)
Instrument generated code/executable for use by gprof (Linux only).
--debug (-g)
Generate debug information for host code.
--device-debug (-G)
Generate debug information for device code. Turns off all optimizations.
Don't use for profiling; use -lineinfo instead.
--generate-line-info (-lineinfo)
Generate line-number information for device code.
So, change nvcc -g/-pg/-G ... to nvcc <your optimization flags> -lineinfo ...
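As a sketch (the file names and optimization level here are placeholders, not from this project):

```shell
# before: -G generates device debug info and disables all device optimizations
nvcc -G -o a.out kernel.cu

# after: keep optimizations, but embed line-number info for the profiler
nvcc -O3 -lineinfo -o a.out kernel.cu
```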
If your code writes to unallocated memory, it may corrupt the profiling process.
If profiling fails, try running your code under cuda-memcheck.
This will instrument your binary to detect bad GPU memory activity.
Fix any errors that occur, and try profiling again.
cuda-memcheck usually causes a ~100x slowdown, so try a small dataset first.
cuda-memcheck my-binary
Run nsys status -e. You should see something like:
Sampling Environment Check
Linux Kernel Paranoid Level = 2: OK
Linux Distribution = Ubuntu
Linux Kernel Version = 4.16.15-41615: OK
Linux perf_event_open syscall available: OK
Sampling trigger event available: OK
Intel(c) Last Branch Record support: Available
Sampling Environment: OK
Errors may reduce the amount of information collected, or cause profiling to fail. Consult the documentation for troubleshooting steps.
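If the paranoid-level check fails, one possible fix (an assumption about your setup; requires root and reverts at reboot) is to lower the kernel's perf_event_paranoid value:

```shell
# allow CPU sampling for non-root users; persists only until reboot
sudo sh -c 'echo 1 > /proc/sys/kernel/perf_event_paranoid'
```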
Under this scheme, we
- use the CLI on the target to record a profiling file
- transfer that file to the client
- use the GUI on the client to analyze the record
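For example, assuming the target is reachable over SSH as target and the report file is a.qdrep (both names are placeholders), the transfer step might be:

```shell
# copy the recorded profile from the target to the client
scp target:/path/to/a.qdrep .
```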
This command will
- Generate a.nsight-cuprof-report with recorded profiling information
- Measure metrics associated with all sections
- Profile the 6th invocation of __global__ void kernel_name(...)
- Run a.out
nv-nsight-cu-cli \
-o a \
--sections ".*" \
--kernel-id ::kernel_name:6 \
a.out
To see sections that will be recorded for a command, add --list-sections.
nv-nsight-cu-cli --list-sections
----------------------------- ------------------------------- --------------------------------------------------
Identifier Display Name Filename
----------------------------- ------------------------------- --------------------------------------------------
ComputeWorkloadAnalysis Compute Workload Analysis .../../../sections/ComputeWorkloadAnalysis.section
InstructionStats Instruction Statistics ...64/../../sections/InstructionStatistics.section
LaunchStats Launch Statistics ...1_3-x64/../../sections/LaunchStatistics.section
MemoryWorkloadAnalysis Memory Workload Analysis ...4/../../sections/MemoryWorkloadAnalysis.section
MemoryWorkloadAnalysis_Chart Memory Workload Analysis Chart ..../sections/MemoryWorkloadAnalysis_Chart.section
MemoryWorkloadAnalysis_Tables Memory Workload Analysis Tables .../sections/MemoryWorkloadAnalysis_Tables.section
Occupancy Occupancy ...ibc_2_11_3-x64/../../sections/Occupancy.section
SchedulerStats Scheduler Statistics ...-x64/../../sections/SchedulerStatistics.section
SourceCounters Source Counters ..._11_3-x64/../../sections/SourceCounters.section
SpeedOfLight GPU Speed Of Light ..._2_11_3-x64/../../sections/SpeedOfLight.section
WarpStateStats Warp State Statistics ...-x64/../../sections/WarpStateStatistics.section
To see supported metrics on a device, run nv-nsight-cu-cli --devices 0 --query-metrics.
On some newer devices, the base metric name alone will not work; you need to append an allowed suffix. To see all the legal names and suffixes, run nv-nsight-cu-cli --devices 0 --query-metrics --query-metrics-mode all.
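As an illustration (the metric name here is only an example; check the --query-metrics output for your device):

```shell
# collect a single suffixed metric instead of whole sections
nv-nsight-cu-cli --metrics dram__bytes_read.sum a.out
```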
The --kernel-id flag takes a string like context-id:stream-id:[name-operator:]kernel-name:invocation-nr.
Commonly, we might only use kernel-name, to select kernels to profile by name, and invocation-nr, to select which invocation of the kernels to profile.
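For example, to profile only the 3rd invocation of a hypothetical kernel named mykernel, leaving context and stream unrestricted:

```shell
# empty context-id and stream-id fields match any context and stream
nv-nsight-cu-cli --kernel-id ::mykernel:3 a.out
```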
This command will
- Record profiling info to a.qdrep
- Run a.out
nsys profile \
  -o a \
  a.out
In Nsight Compute:
File > Open File ... > file.nsight-cuprof-report
If you profiled on a different system than the one the GUI tool is running on, and you want to look at a View that includes the source, you may have to click the "resolve" button and navigate to a local copy of the source file.
In Nsight Systems:
File > Open > file.qdrep
instructions to come
docker ps -a                  # list all containers
docker rm `docker ps -a -q`   # remove all containers
docker system prune           # remove unused docker data
Run a profiling container:
docker run cwpearson/nvidia-performance-tools:latest-amd64
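If you want the container to see the host's GPUs and your source tree, a sketch (assuming Docker 19.03+ with the NVIDIA container toolkit installed) is:

```shell
# expose all GPUs and mount the current directory at /work
docker run --gpus all -v "$(pwd)":/work -it \
  cwpearson/nvidia-performance-tools:latest-amd64
```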
Resume a previously exited container:
- docker ps -a # find the ID
- docker start <ID> # resume the exited container
- docker attach <ID> # attach a terminal to the container
See CONTRIBUTING.md
- NVIDIA Developer Blog
  - Interpreting Nsight Compute Results
    - Workload Memory Analysis
    - Stall Reasons
    - Issue Efficiency
    - Occupancy
- Slides
  - docs/GEMM-joint-tiling.ppt: joint-tiling slide deck from ECE 508 Spring 2017
- GTC
  - Volta Architecture and Performance Optimization: Volta L1 will cache writes