This repository serves as an example of deploying the YOLOv7 model (FP16) and the YOLOv7 QAT model (INT8) on Triton Inference Server for performance testing. It also includes support for applications developed with NVIDIA DeepStream.
Instructions to deploy YOLOv7 as a TensorRT engine to Triton Inference Server.
This repo uses exported YOLOv7 ONNX models to generate the TensorRT engines.
Users can either build the ONNX files themselves or simply use the start-container-triton-server.sh script to start the container and the start-triton-server.sh script to download the models, generate the TensorRT engines, and start Triton Server.
This repository is a continuation of philipp-schmidt's work in https://github.com/WongKinYiu/yolov7/tree/main/deploy/triton-inference-server.
This repo does not export PyTorch models to ONNX.
You can use the YOLOv7 repository or the YOLOv7 Docker image for your convenience.
python export.py --weights yolov7.pt \
--grid \
--end2end \
--dynamic-batch \
--simplify \
--topk-all 100 \
--iou-thres 0.65 \
--conf-thres 0.35 \
--img-size 640 640
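Before building an engine, you can optionally sanity-check the exported graph. This is a minimal check, assuming the onnx Python package is installed and the exported file is named yolov7.onnx:

# Validates the ONNX graph and prints its input/output tensor names
python -c "import onnx; m = onnx.load('yolov7.onnx'); onnx.checker.check_model(m); print([i.name for i in m.graph.input], [o.name for o in m.graph.output])"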
This repo does not export PyTorch models to ONNX.
You can use the YOLOv7 QAT repository or the YOLOv7 Docker image for your convenience.
python scripts/qat.py export qat.pt \
--save=qat.onnx \
--dynamic \
--img-size 640 \
--end2end \
--topk-all 100 \
--simplify \
--iou-thres 0.65 \
--conf-thres 0.35
bash ./start-container-triton-server.sh
Run this script to start the Triton Inference Server container (a minimal manual docker run sketch follows the prerequisites below).
Note: This script must be executed on the host operating system.
- NVIDIA Docker must be installed.
- One or more NVIDIA GPUs must be available.
- Models exported from PyTorch to ONNX (or downloaded automatically by start-triton-server.sh).
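If you prefer to start the container manually instead of using the script, the command below is a minimal sketch: the image tag (matching the 23.08 SDK image used later in this README), the /apps mount point, and the flags are assumptions and may differ from what start-container-triton-server.sh actually runs.

# Mount this repository into the container (the /apps path is illustrative)
docker run --gpus all -it --rm \
  --net=host --ipc=host \
  -v $(pwd):/apps -w /apps \
  nvcr.io/nvidia/tritonserver:23.08-py3 bash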
The start-triton-server.sh script automatically downloads the YOLOv7 ONNX models from https://github.com/levipereira/Docker-Yolov7-Nvidia-Kit/releases/tag/v1.0.
Usage:
bash ./start-triton-server.sh <max_batch_size> <opt_batch_size> [--force-build]
# - max_batch_size: Maximum batch size for TensorRT engines.
# - opt_batch_size: Optimal batch size for TensorRT engines.
# - Use the flag --force-build to rebuild TensorRT engines even if they already exist.
Example
bash ./start-triton-server.sh 16 8 --force-build
This script converts ONNX models to TensorRT engines and starts the NVIDIA Triton Inference Server.
Note: This script is intended to be executed from within the Docker Triton container.
- Checks for the existence of YOLOv7 ONNX model files.
- Downloads ONNX models if they do not exist.
- Converts the YOLOv7 ONNX model to a TensorRT engine with FP16 precision.
- Converts the YOLOv7 Quantization-Aware Training (QAT) ONNX model to a TensorRT engine with INT8 precision (a trtexec sketch of both conversions follows this list).
- Updates the batch size configurations in the Triton Server config files.
- Starts Triton Inference Server with the converted models.
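For reference, the two conversion steps are roughly equivalent to the trtexec commands below. The file names, engine paths, and shape profiles are assumptions for illustration and may differ from what start-triton-server.sh actually runs; the min/opt/max shapes correspond to the <opt_batch_size> and <max_batch_size> arguments (8 and 16 in the example above).

# FP16 engine from the standard YOLOv7 ONNX export (paths are illustrative)
trtexec --onnx=yolov7.onnx \
        --saveEngine=yolov7.plan \
        --fp16 \
        --minShapes=images:1x3x640x640 \
        --optShapes=images:8x3x640x640 \
        --maxShapes=images:16x3x640x640

# INT8 engine from the QAT ONNX export; the Q/DQ nodes already carry the
# quantization scales, so no calibration cache is required
trtexec --onnx=qat.onnx \
        --saveEngine=yolov7_qat.plan \
        --int8 --fp16 \
        --minShapes=images:1x3x640x640 \
        --optShapes=images:8x3x640x640 \
        --maxShapes=images:16x3x640x640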
After running the script, you can verify the availability of the models by checking this output:
+------------+---------+--------+
| Model | Version | Status |
+------------+---------+--------+
| yolov7 | 1 | READY |
| yolov7_qat | 1 | READY |
+------------+---------+--------+
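You can also check readiness directly through Triton's HTTP endpoint. The commands below assume the default HTTP port 8000 has not been changed:

# Server readiness (prints 200 when the server is ready)
curl -s -o /dev/null -w "%{http_code}\n" localhost:8000/v2/health/ready

# Per-model readiness
curl -s -o /dev/null -w "%{http_code}\n" localhost:8000/v2/models/yolov7/ready
curl -s -o /dev/null -w "%{http_code}\n" localhost:8000/v2/models/yolov7_qat/ready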
triton-server-yolov7/
├── models_config
│ ├── yolov7
│ │ └── config.pbtxt
│ └── yolov7_qat
│ └── config.pbtxt
See Triton Model Configuration Documentation for more info.
Example of a YOLOv7 model configuration (yolov7_qat)
Note:
- The value 100 in the det_boxes/det_scores/det_classes dimensions corresponds to the --topk-all value used at export time.
- The setting "max_queue_delay_microseconds: 30000" is tuned for a 30 fps input rate: 30 ms is roughly one frame interval, so the dynamic batcher waits at most about one frame to assemble a larger batch.
name: "yolov7_qat"
platform: "tensorrt_plan"
max_batch_size: 8
input [
{
name: "images"
data_type: TYPE_FP32
dims: [ 3, 640, 640 ]
}
]
output [
{
name: "num_dets"
data_type: TYPE_INT32
dims: [ 1 ]
},
{
name: "det_boxes"
data_type: TYPE_FP32
dims: [ 100, 4 ]
},
{
name: "det_scores"
data_type: TYPE_FP32
dims: [ 100 ]
},
{
name: "det_classes"
data_type: TYPE_INT32
dims: [ 100 ]
}
]
instance_group [
{
count: 2
kind: KIND_GPU
gpus: [ 0 ]
}
]
version_policy: { latest: { num_versions: 1}}
dynamic_batching {
max_queue_delay_microseconds: 30000
}
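Once the server is up, you can retrieve the configuration Triton actually loaded for a model over the HTTP endpoint (default port 8000 assumed):

curl -s localhost:8000/v2/models/yolov7_qat/config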
In the log you should see:
+--------+---------+--------+
| Model | Version | Status |
+--------+---------+--------+
| yolov7 | 1 | READY |
+--------+---------+--------+
See Triton Model Analyzer Documentation for more info.
Triton-Server config: Max Batch Size 16 / 2 Instances
Test Report:
YOLOv7 FP16 - Best Result: Concurrency: 48, throughput: 1049.65 infer/sec, latency 45729 usec
YOLOv7 QAT (INT8) - Best Result: Concurrency: 48, throughput: 1219.37 infer/sec, latency 39340 usec
Example test for a maximum of 128 concurrent clients, starting with 8 clients and incrementing by 8, using shared memory.
docker run -it --ipc=host --net=host nvcr.io/nvidia/tritonserver:23.08-py3-sdk /bin/bash
./install/bin/perf_analyzer -m yolov7_qat -u 127.0.0.1:8001 -i grpc --shared-memory system --concurrency-range 8:128:8
# Result (truncated)
Inferences/Second vs. Client Average Batch Latency
Concurrency: 8, throughput: 146.657 infer/sec, latency 54498 usec
Concurrency: 16, throughput: 559.946 infer/sec, latency 28576 usec
Concurrency: 24, throughput: 561.719 infer/sec, latency 42684 usec
Concurrency: 32, throughput: 1009.66 infer/sec, latency 31695 usec
Concurrency: 40, throughput: 1013.22 infer/sec, latency 39475 usec
Concurrency: 48, throughput: 1219.37 infer/sec, latency 39340 usec
Concurrency: 56, throughput: 1220.32 infer/sec, latency 45875 usec
Concurrency: 64, throughput: 1220.29 infer/sec, latency 52430 usec
Concurrency: 72, throughput: 1220.26 infer/sec, latency 58979 usec
Concurrency: 80, throughput: 1220.3 infer/sec, latency 65533 usec
Concurrency: 88, throughput: 1220.29 infer/sec, latency 72078 usec
Concurrency: 96, throughput: 1220.3 infer/sec, latency 78639 usec
Concurrency: 104, throughput: 1220.28 infer/sec, latency 85189 usec
Concurrency: 112, throughput: 1220.29 infer/sec, latency 91745 usec
Concurrency: 120, throughput: 1220.29 infer/sec, latency 98297 usec
Concurrency: 128, throughput: 1220.26 infer/sec, latency 104851 usec
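To run the same sweep against the FP16 model and keep the per-concurrency results, perf_analyzer can write a CSV report with the -f flag; the output file name below is just an example:

./install/bin/perf_analyzer -m yolov7 -u 127.0.0.1:8001 -i grpc --shared-memory system --concurrency-range 8:128:8 -f yolov7_perf.csv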