
Slower than expected GPU inference in deployment/libtorch example #273

Closed
mattpopovich opened this issue Jan 12, 2022 · 3 comments
Labels: question (Further information is requested)

@mattpopovich (Contributor)

🐛 Describe the bug

I created some yolov5-rt-stack TorchScript models by following the script here. I then followed the README instructions to build the LibTorch C++ example. Everything works as expected, except that inference on the GPU is much slower (roughly 7x) than on the CPU.

Can you confirm these results, or am I doing something wrong? I believe that previously (in the July-August 2021 timeframe) I was seeing inference times in the 8-10 ms range.

v4.0:


root@pc:yolov5-rt-stack/deployment/libtorch/build# ./yolort_torch --input_source ../../../bus.jpg --checkpoint ../../../yolov5s-v4.0-RT-v0.5.2-YOLOv5.torchscript.pt --labelmap ../../../coco.names
Set CPU mode
Loading model
Model loaded
Run once on empty image
[W TensorImpl.h:1153] Warning: Named tensors and all their associated APIs are an experimental feature and subject to change. Please do not use them for anything important until they are released as stable. (function operator())
Pre-process takes : 18 ms
Inference takes : 106 ms
Detected labels:  0
 0
 0
 5
 0
[ CPULongType{5} ]
Detected boxes:  669.2656  391.3025  809.8663  885.2344
  54.0635  397.8318  235.9531  901.3731
 222.8834  406.8119  341.5572  854.7792
  18.6320  232.9767  810.9739  760.1169
   0.4640  502.0519   88.5140  887.0480
[ CPUFloatType{5,4} ]
Detected scores:  0.8901
 0.8733
 0.8537
 0.7234
 0.3769
[ CPUFloatType{5} ]
root@pc:yolov5-rt-stack/deployment/libtorch/build# ./yolort_torch --input_source ../../../bus.jpg --checkpoint ../../../yolov5s-v4.0-RT-v0.5.2-YOLOv5.torchscript.pt --labelmap ../../../coco.names --gpu 
Set GPU mode
Loading model
Model loaded
Run once on empty image
[W TensorImpl.h:1153] Warning: Named tensors and all their associated APIs are an experimental feature and subject to change. Please do not use them for anything important until they are released as stable. (function operator())
Pre-process takes : 21 ms
Inference takes : 748 ms
Detected labels:  0
 0
 0
 5
 0
[ CUDALongType{5} ]
Detected boxes:  669.2656  391.3025  809.8663  885.2344
  54.0635  397.8318  235.9531  901.3730
 222.8834  406.8120  341.5572  854.7791
  18.6320  232.9767  810.9739  760.1170
   0.4640  502.0522   88.5139  887.0480
[ CUDAFloatType{5,4} ]
Detected scores:  0.8901
 0.8733
 0.8537
 0.7234
 0.3769
[ CUDAFloatType{5} ]

v6.0:


root@pc:yolov5-rt-stack/deployment/libtorch/build# ./yolort_torch --input_source ../../../bus.jpg --checkpoint ../../../yolov5s-v6.0-RT-v0.5.2-YOLOv5.torchscript.pt --labelmap ../../../coco.names
Set CPU mode
Loading model
Model loaded
Run once on empty image
[W TensorImpl.h:1153] Warning: Named tensors and all their associated APIs are an experimental feature and subject to change. Please do not use them for anything important until they are released as stable. (function operator())
Pre-process takes : 15 ms
Inference takes : 95 ms
Detected labels:  0
 0
 0
 5
 0
[ CPULongType{5} ]
Detected boxes:  224.5497  402.5811  342.7194  862.6057
  51.8626  398.3438  245.3290  906.3114
 679.8232  385.5574  809.3773  883.1394
   0.1952  201.8805  812.9611  786.3345
   0.0480  558.7347   75.8148  871.5754
[ CPUFloatType{5,4} ]
Detected scores:  0.8959
 0.8846
 0.8579
 0.5181
 0.3932
[ CPUFloatType{5} ]
root@pc:yolov5-rt-stack/deployment/libtorch/build# ./yolort_torch --input_source ../../../bus.jpg --checkpoint ../../../yolov5s-v6.0-RT-v0.5.2-YOLOv5.torchscript.pt --labelmap ../../../coco.names --gpu 
Set GPU mode
Loading model
Model loaded
Run once on empty image
[W TensorImpl.h:1153] Warning: Named tensors and all their associated APIs are an experimental feature and subject to change. Please do not use them for anything important until they are released as stable. (function operator())
Pre-process takes : 28 ms
Inference takes : 746 ms
Detected labels:  0
 0
 0
 5
 0
[ CUDALongType{5} ]
Detected boxes:  224.5497  402.5810  342.7194  862.6058
  51.8626  398.3439  245.3289  906.3113
 679.8232  385.5574  809.3773  883.1393
   0.1954  201.8804  812.9608  786.3347
   0.0480  558.7346   75.8148  871.5754
[ CUDAFloatType{5,4} ]
Detected scores:  0.8959
 0.8846
 0.8579
 0.5181
 0.3932
[ CUDAFloatType{5} ]

Thanks again for all your help thus far. I'm going to look into deployment/tensorrt next to see what inference times I can get there.

Versions


# python3 -m torch.utils.collect_env
Collecting environment information...
PyTorch version: 1.9.0a0+gitd69c22d
Is debug build: False
CUDA used to build PyTorch: 11.2
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.2 LTS (x86_64)
GCC version: (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
Clang version: Could not collect
CMake version: version 3.21.1
Libc version: glibc-2.31

Python version: 3.8 (64-bit runtime)
Python platform: Linux-5.4.0-92-generic-x86_64-with-glibc2.29
Is CUDA available: True
CUDA runtime version: 11.2.152
GPU models and configuration: 
GPU 0: GeForce GTX 1080
GPU 1: GeForce GTX 1080
GPU 2: GeForce GTX 1080

Nvidia driver version: 460.91.03
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.1.0
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] numpy==1.21.4
[pip3] pytorch-lightning==1.5.8
[pip3] torch==1.9.0a0+gitd69c22d
[pip3] torchmetrics==0.6.2
[pip3] torchvision==0.10.0a0+300a8a4
[conda] Could not collect

@zhiqwang (Owner)

zhiqwang commented Jan 12, 2022

Hi @mattpopovich ,

It seems that PyTorch 1.9 requires two warm-ups on the GPU, so we need to ignore the first two timing measurements. Could you test it again, or upgrade your PyTorch to 1.10.1? (See pytorch/pytorch#58801 for more details.)

The TensorRT C++ example is still under development. We have implemented the core parts of the model conversion, but several pieces still need work:

  1. We use the YOLO.load_from_yolov5() strategy for TensorRT, so we need to implement the pre-processing in the C++ example; the existing version is a bit rough.
  2. We use a static-shape mechanism when converting the model to a TensorRT engine; we need to add dynamic-shape support, which is very important for practical applications. See Support dynamic shape mechanism for TensorRT #266.

And all contributions are welcome here!

@zhiqwang zhiqwang added the question Further information is requested label Jan 12, 2022
@mattpopovich (Contributor, Author)

mattpopovich commented Jan 15, 2022

Great find! I ran many tests on my machine (below) with PyTorch, TorchVision, and OpenCV built from source. (Originally I was seeing slow inference no matter how many times I "warmed up" the model, but I have since been unable to reproduce that.)

It looks like 3 warm-ups are necessary for all recent versions of PyTorch:


CUDA 11.4.3, PyTorch 1.10.1, TorchVision 0.11.2, OpenCV 4.5.5:


# python3 -m torch.utils.collect_env
Collecting environment information...
PyTorch version: 1.10.0a0+git302ee7b
Is debug build: False
CUDA used to build PyTorch: 11.4
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.3 LTS (x86_64)
GCC version: (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
Clang version: Could not collect
CMake version: version 3.21.1
Libc version: glibc-2.31

Python version: 3.8.10 (default, Nov 26 2021, 20:14:08)  [GCC 9.3.0] (64-bit runtime)
Python platform: Linux-5.4.0-89-generic-x86_64-with-glibc2.29
Is CUDA available: True
CUDA runtime version: 11.4.152
GPU models and configuration: 
GPU 0: Tesla V100-SXM2-32GB
GPU 1: Tesla V100-SXM2-32GB
GPU 2: Tesla V100-SXM2-32GB
GPU 3: Tesla V100-SXM2-32GB

Nvidia driver version: 470.74
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.1.0
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] mypy-extensions==0.4.3
[pip3] numpy==1.22.0
[pip3] pytorch-lightning==1.5.8
[pip3] torch==1.10.0a0+git302ee7b
[pip3] torchmetrics==0.6.2
[pip3] torchvision==0.11.0a0+e7ec7e2
[conda] Could not collect

  • 1 warm-up of the model, the next inference takes: 1278ms
  • 2 warm-ups of the model, the next inference takes: 1044ms
  • 3 warm-ups of the model, the next inference takes: 11ms
  • 4 warm-ups of the model, the next inference takes: 10ms
  • 6 warm-ups of the model, the next inference takes: 10ms

CUDA 11.4.2, PyTorch 1.10.1, TorchVision 0.11.2, OpenCV 4.5.5:


# python3 -m torch.utils.collect_env
Collecting environment information...
PyTorch version: 1.10.0a0+git302ee7b
Is debug build: False
CUDA used to build PyTorch: 11.4
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.3 LTS (x86_64)
GCC version: (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
Clang version: Could not collect
CMake version: version 3.21.1
Libc version: glibc-2.31

Python version: 3.8.10 (default, Sep 28 2021, 16:10:42)  [GCC 9.3.0] (64-bit runtime)
Python platform: Linux-5.4.0-89-generic-x86_64-with-glibc2.29
Is CUDA available: True
CUDA runtime version: 11.4.120
GPU models and configuration: 
GPU 0: Tesla V100-SXM2-32GB
GPU 1: Tesla V100-SXM2-32GB
GPU 2: Tesla V100-SXM2-32GB
GPU 3: Tesla V100-SXM2-32GB

Nvidia driver version: 470.74
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.1.0
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] mypy-extensions==0.4.3
[pip3] numpy==1.21.4
[pip3] pytorch-lightning==1.5.8
[pip3] torch==1.10.0a0+git302ee7b
[pip3] torchmetrics==0.6.2
[pip3] torchvision==0.11.0a0+e7ec7e2
[conda] Could not collect

  • 1 warm-up of the model, the next inference takes: 1260ms
  • 2 warm-ups of the model, the next inference takes: 1075ms
  • 3 warm-ups of the model, the next inference takes: 11ms
  • 4 warm-ups of the model, the next inference takes: 15ms
  • 6 warm-ups of the model, the next inference takes: 11ms

CUDA 11.4.2, PyTorch 1.10.0, TorchVision 0.11.1, OpenCV 4.5.4:


# python3 -m torch.utils.collect_env
Collecting environment information...
PyTorch version: 1.10.0a0+git36449ea
Is debug build: False
CUDA used to build PyTorch: 11.4
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.3 LTS (x86_64)
GCC version: (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
Clang version: Could not collect
CMake version: version 3.21.1
Libc version: glibc-2.31

Python version: 3.8.10 (default, Sep 28 2021, 16:10:42)  [GCC 9.3.0] (64-bit runtime)
Python platform: Linux-5.4.0-89-generic-x86_64-with-glibc2.29
Is CUDA available: True
CUDA runtime version: 11.4.120
GPU models and configuration: 
GPU 0: Tesla V100-SXM2-32GB
GPU 1: Tesla V100-SXM2-32GB
GPU 2: Tesla V100-SXM2-32GB
GPU 3: Tesla V100-SXM2-32GB

Nvidia driver version: 470.74
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.1.0
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] numpy==1.21.4
[pip3] pytorch-lightning==1.5.4
[pip3] torch==1.10.0a0+git36449ea
[pip3] torchmetrics==0.6.0
[pip3] torchvision==0.11.0a0+fa347eb
[conda] Could not collect

  • 1 warm-up of the model, the next inference takes: 1282ms
  • 2 warm-ups of the model, the next inference takes: 1040ms
  • 3 warm-ups of the model, the next inference takes: 9ms
  • 4 warm-ups of the model, the next inference takes: 15ms
  • 6 warm-ups of the model, the next inference takes: 10ms

CUDA 11.4.1, PyTorch 1.10.1, TorchVision 0.11.2, OpenCV 4.5.5:


# python3 -m torch.utils.collect_env
Collecting environment information...
PyTorch version: 1.10.0a0+git302ee7b
Is debug build: False
CUDA used to build PyTorch: 11.4
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.3 LTS (x86_64)
GCC version: (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
Clang version: Could not collect
CMake version: version 3.21.1
Libc version: glibc-2.31

Python version: 3.8.10 (default, Sep 28 2021, 16:10:42)  [GCC 9.3.0] (64-bit runtime)
Python platform: Linux-5.4.0-89-generic-x86_64-with-glibc2.29
Is CUDA available: True
CUDA runtime version: 11.4.120
GPU models and configuration: 
GPU 0: Tesla V100-SXM2-32GB
GPU 1: Tesla V100-SXM2-32GB
GPU 2: Tesla V100-SXM2-32GB
GPU 3: Tesla V100-SXM2-32GB

Nvidia driver version: 470.74
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.1.0
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] mypy-extensions==0.4.3
[pip3] numpy==1.21.4
[pip3] pytorch-lightning==1.5.8
[pip3] torch==1.10.0a0+git302ee7b
[pip3] torchmetrics==0.6.2
[pip3] torchvision==0.11.0a0+e7ec7e2
[conda] Could not collect

  • 1 warm-up of the model, the next inference takes: 1232ms
  • 2 warm-ups of the model, the next inference takes: 1087ms
  • 3 warm-ups of the model, the next inference takes: 11ms
  • 4 warm-ups of the model, the next inference takes: 9ms
  • 6 warm-ups of the model, the next inference takes: 15ms

CUDA 11.4.1, PyTorch 1.10.0 commit 3fd9dcf, TorchVision 0.11.1, OpenCV 4.5.4:


# python3 -m torch.utils.collect_env
Collecting environment information...
PyTorch version: 1.10.0a0+git3fd9dcf
Is debug build: False
CUDA used to build PyTorch: 11.4
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.3 LTS (x86_64)
GCC version: (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
Clang version: Could not collect
CMake version: version 3.21.1
Libc version: glibc-2.31

Python version: 3.8.10 (default, Sep 28 2021, 16:10:42)  [GCC 9.3.0] (64-bit runtime)
Python platform: Linux-5.4.0-89-generic-x86_64-with-glibc2.29
Is CUDA available: True
CUDA runtime version: 11.4.120
GPU models and configuration: 
GPU 0: Tesla V100-SXM2-32GB
GPU 1: Tesla V100-SXM2-32GB
GPU 2: Tesla V100-SXM2-32GB
GPU 3: Tesla V100-SXM2-32GB

Nvidia driver version: 470.74
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.1.0
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] numpy==1.21.4
[pip3] pytorch-lightning==1.5.4
[pip3] torch==1.10.0a0+git3fd9dcf
[pip3] torchmetrics==0.6.1
[pip3] torchvision==0.11.0a0+fa347eb
[conda] Could not collect

  • 1 warm-up of the model, the next inference takes: 1153ms
  • 2 warm-ups of the model, the next inference takes: 779ms
  • 3 warm-ups of the model, the next inference takes: 7ms
  • 4 warm-ups of the model, the next inference takes: 8ms

CUDA 11.4.0, PyTorch 1.10.0, TorchVision 0.11.1, OpenCV 4.5.4:


# python3 -m torch.utils.collect_env
Collecting environment information...
PyTorch version: 1.10.0a0+git36449ea
Is debug build: False
CUDA used to build PyTorch: 11.4
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.3 LTS (x86_64)
GCC version: (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
Clang version: Could not collect
CMake version: version 3.21.1
Libc version: glibc-2.31

Python version: 3.8.10 (default, Sep 28 2021, 16:10:42)  [GCC 9.3.0] (64-bit runtime)
Python platform: Linux-5.4.0-89-generic-x86_64-with-glibc2.29
Is CUDA available: True
CUDA runtime version: 11.4.120
GPU models and configuration: 
GPU 0: Tesla V100-SXM2-32GB
GPU 1: Tesla V100-SXM2-32GB
GPU 2: Tesla V100-SXM2-32GB
GPU 3: Tesla V100-SXM2-32GB

Nvidia driver version: 470.74
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.1.0
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] numpy==1.21.4
[pip3] pytorch-lightning==1.5.4
[pip3] torch==1.10.0a0+git36449ea
[pip3] torchmetrics==0.6.0
[pip3] torchvision==0.11.0a0+fa347eb
[conda] Could not collect

  • 1 warm-up of the model, the next inference takes: 1290ms
  • 2 warm-ups of the model, the next inference takes: 1091ms
  • 3 warm-ups of the model, the next inference takes: 10ms
  • 4 warm-ups of the model, the next inference takes: 8ms
  • 6 warm-ups of the model, the next inference takes: 11ms

CUDA 11.3.1, PyTorch 1.9.1, TorchVision 0.10.1, OpenCV 4.5.4:


python3 -m torch.utils.collect_env
Collecting environment information...
PyTorch version: 1.9.0a0+gitdfbd030
Is debug build: False
CUDA used to build PyTorch: 11.3
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.3 LTS (x86_64)
GCC version: (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
Clang version: Could not collect
CMake version: version 3.21.1
Libc version: glibc-2.31

Python version: 3.8 (64-bit runtime)
Python platform: Linux-5.4.0-89-generic-x86_64-with-glibc2.29
Is CUDA available: True
CUDA runtime version: 11.3.109
GPU models and configuration: 
GPU 0: Tesla V100-SXM2-32GB
GPU 1: Tesla V100-SXM2-32GB
GPU 2: Tesla V100-SXM2-32GB
GPU 3: Tesla V100-SXM2-32GB

Nvidia driver version: 470.74
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.1.0
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] numpy==1.21.4
[pip3] pytorch-lightning==1.5.4
[pip3] torch==1.9.0a0+gitdfbd030
[pip3] torchmetrics==0.6.0
[pip3] torchvision==0.10.0a0+ca1a620
[conda] Could not collect

  • 1 warm-up of the model, the next inference takes: 1167ms
  • 2 warm-ups of the model, the next inference takes: 894ms
  • 3 warm-ups of the model, the next inference takes: 7ms
  • 4 warm-ups of the model, the next inference takes: 8ms
  • 6 warm-ups of the model, the next inference takes: 7ms

CUDA 11.2.0, PyTorch 1.9.0, TorchVision 0.10.0, OpenCV 4.5.2:


python3 -m torch.utils.collect_env
Collecting environment information...
PyTorch version: 1.9.0a0+gitd69c22d
Is debug build: False
CUDA used to build PyTorch: 11.2
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.1 LTS (x86_64)
GCC version: (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
Clang version: Could not collect
CMake version: version 3.21.1
Libc version: glibc-2.31

Python version: 3.8 (64-bit runtime)
Python platform: Linux-5.4.0-89-generic-x86_64-with-glibc2.29
Is CUDA available: True
CUDA runtime version: 11.2.67
GPU models and configuration: 
GPU 0: Tesla V100-SXM2-32GB
GPU 1: Tesla V100-SXM2-32GB
GPU 2: Tesla V100-SXM2-32GB
GPU 3: Tesla V100-SXM2-32GB

Nvidia driver version: 470.74
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.1.0
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] numpy==1.21.4
[pip3] pytorch-lightning==1.5.2
[pip3] torch==1.9.0a0+gitd69c22d
[pip3] torchmetrics==0.6.0
[pip3] torchvision==0.10.0a0+300a8a4
[conda] Could not collect

  • 1 warm-up of the model, the next inference takes: 1188ms
  • 2 warm-ups of the model, the next inference takes: 899ms
  • 3 warm-ups of the model, the next inference takes: 10ms
  • 4 warm-ups of the model, the next inference takes: 15ms
  • 6 warm-ups of the model, the next inference takes: 12ms

Different PC with everything pre-built, not running in Docker:

CUDA 11.5, PyTorch 1.10.0, TorchVision 0.11.0, OpenCV 4.5.3:


$ python3 -m torch.utils.collect_env 
Collecting environment information...
PyTorch version: 1.10.0a0+git36449ea
Is debug build: False
CUDA used to build PyTorch: 11.5
ROCM used to build PyTorch: N/A

OS: Ubuntu 18.04.6 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~18.04) 9.4.0
Clang version: Could not collect
CMake version: version 3.21.3
Libc version: glibc-2.25

Python version: 3.6.9 (default, Dec  8 2021, 21:08:43)  [GCC 8.4.0] (64-bit runtime)
Python platform: Linux-5.4.0-1065-azure-x86_64-with-Ubuntu-18.04-bionic
Is CUDA available: True
CUDA runtime version: 11.5.119
GPU models and configuration: GPU 0: Tesla V100-PCIE-16GB
Nvidia driver version: 495.29.05
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.3.1
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.3.1
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.3.1
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.3.1
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.3.1
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.3.1
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.3.1
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] numpy==1.19.5
[pip3] torch==1.10.0a0+git36449ea
[pip3] torchvision==0.11.0a0+cdacbe0
[conda] Could not collect

  • 1 warm-up of the model, the next inference takes: 1601ms
  • 2 warm-ups of the model, the next inference takes: 1239ms
  • 3 warm-ups of the model, the next inference takes: 13ms
  • 4 warm-ups of the model, the next inference takes: 12ms
  • 6 warm-ups of the model, the next inference takes: 13ms

@zhiqwang (Owner)

Hi @mattpopovich , thanks for the detailed experimental data! I believe this phenomenon is now explained, so I'm closing this ticket, but let us know if you have further questions.
