Abstract
With automation gradually dominating many aspects of human life, the demand for highly accurate and responsive automated systems has become essential. In transportation specifically, self-driving vehicles and automated traffic monitoring and analysis systems must be able to read and comprehend the traffic context at a given moment to make informed decisions. My research, "Scene Text Detection for Driving Videos", aims to support automated transportation systems in capturing textual information from traffic signs.
- Module 1 (M1, traffic sign detection): PP-YOLOE+ from PaddleDetection
PP-YOLOE architecture 🡵
- Module 2 (M2, scene text detection): PP-OCRv3 (detection) from PaddleOCR
  - Uses the student model for lightweight inference
PP-OCRv3 (detection) architecture 🡵
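Together, the two modules form a two-stage pipeline: M1 localizes traffic signs in each frame, the detected regions are cropped, and M2 detects text lines inside each crop. The sketch below illustrates that flow under stated assumptions: `detect_signs` is a hypothetical placeholder for a PP-YOLOE+ inference wrapper (PaddleDetection is normally driven through its config files and deploy tools), and the PaddleOCR call follows the 2.x Python interface, whose exact return structure can vary between releases.

```python
# Minimal two-stage inference sketch: PP-YOLOE+ (M1) -> PP-OCRv3 detection (M2).
# `detect_signs` is a hypothetical placeholder; only the PaddleOCR usage below
# reflects the library's public Python API (PaddleOCR 2.x).
import cv2
from paddleocr import PaddleOCR

# Detection-only usage: recognition is disabled to keep inference lightweight.
text_detector = PaddleOCR(use_angle_cls=False)

def detect_signs(frame):
    """Placeholder for PP-YOLOE+ traffic-sign detection (M1).

    Replace with a PaddleDetection inference wrapper returning
    (x1, y1, x2, y2, score) boxes. Returns an empty list here.
    """
    return []

def detect_text_on_signs(frame, score_threshold=0.5):
    """Run M1 on a frame, then M2 on every detected sign crop."""
    results = []
    for x1, y1, x2, y2, score in detect_signs(frame):
        if score < score_threshold:
            continue
        crop = frame[int(y1):int(y2), int(x1):int(x2)]
        # In recent PaddleOCR versions, the result is one list of quadrilateral
        # text boxes per input image, hence the trailing [0].
        boxes = text_detector.ocr(crop, det=True, rec=False, cls=False)[0]
        results.append({"sign_box": (x1, y1, x2, y2), "text_boxes": boxes})
    return results

if __name__ == "__main__":
    frame = cv2.imread("sample_frame.jpg")  # assumed example frame
    print(detect_text_on_signs(frame))
```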
| # | Dataset | Description | Detail | M1 usage | M2 usage |
|---|---------|-------------|--------|----------|----------|
| #1 | Vietnam Traffic Signs Dataset | Open-source traffic videos recorded around Ho Chi Minh City | 40 videos (total length: 1h24m44s) | Fine-tuning + testing | Fine-tuning + testing |
| #2 | VinText | Largest Vietnamese scene text dataset | 2,000 labeled images, ~56,000 text objects (~10,500 unique objects) | Testing | Fine-tuning + testing |
| #3 | Zalo AI Challenge - Traffic Sign Detection Dataset | Dataset of the 2020 Zalo AI Challenge "Traffic Signs Detection" contest, with images collected from Google Maps Street View | ~8,000 traffic images with traffic sign labels | Testing | Testing |
| #4 | Extra | Self-collected dataset around Ho Chi Minh City | 198 images, 393 traffic sign objects | Improved fine-tuning + testing | Testing |
Since Dataset #1 was originally labeled for a different task in another project, we re-processed it to match our project's target:
- Splitting and filtering frames from the raw videos (a minimal extraction sketch follows the label statistics below)
- Using CVAT to label traffic signs and text
- Label statistics:
| Label type | Count |
|---|---|
| Images | 296 |
| Traffic sign objects | 603 |
| Traffic sign classes | 12 |
| Word objects | 1,538 (274 unique words) |
| Textline objects | 628 |
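The frame splitting step above can be done with OpenCV. The sketch below samples roughly one frame per second from each video and drops near-duplicate frames; the sampling rate, the difference-based filter, and the directory layout are illustrative assumptions, not the exact procedure used in the project.

```python
# Illustrative frame extraction for Dataset #1: sample about one frame per second
# and skip frames that barely differ from the last saved one.
import glob
import os
import cv2
import numpy as np

def extract_frames(video_path, out_dir, every_sec=1.0, min_diff=10.0):
    os.makedirs(out_dir, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30.0
    step = max(int(fps * every_sec), 1)
    prev_gray, idx, saved = None, 0, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            # Keep the frame only if it differs enough from the last saved one.
            if prev_gray is None or np.mean(cv2.absdiff(gray, prev_gray)) > min_diff:
                name = f"{os.path.splitext(os.path.basename(video_path))[0]}_{idx:06d}.jpg"
                cv2.imwrite(os.path.join(out_dir, name), frame)
                prev_gray, saved = gray, saved + 1
        idx += 1
    cap.release()
    return saved

if __name__ == "__main__":
    for video in glob.glob("raw_videos/*.mp4"):  # assumed input layout
        print(video, extract_frames(video, "frames"))
```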
Traffic sign classes and data distribution
| Module | Model | Pre-trained dataset | Fine-tuned dataset | Performance | FPS |
|---|---|---|---|---|---|
| #1 | PP-YOLOE+ | Objects365 | Customized VTSD | mAP: ~0.677 | ~18.3 |
| #2 | PP-OCRv3 (detection) | Baidu images + public datasets | Customized VTSD + VinText | H-mean: ~0.82 | ~29.5 |
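For reference, the H-mean reported for M2 is the harmonic mean of detection precision and recall (the standard scene-text detection metric), and mAP for M1 is mean average precision over the traffic sign classes. A minimal H-mean computation follows; the example precision and recall values are illustrative, not the project's measured numbers.

```python
# H-mean (F1) of detection precision and recall, as reported for M2.
def h_mean(precision: float, recall: float) -> float:
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Example: precision 0.85 and recall 0.79 give an H-mean of about 0.82.
print(round(h_mean(0.85, 0.79), 3))  # 0.819
```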
- Improving M1 performance by merging Dataset #4 into the customized VTSD (see the merge sketch after this list):
  - The total numbers of images and traffic sign objects increase by ~40%
  - mAP after the improvement: ~0.69
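Combining Dataset #4 with the customized VTSD amounts to merging two annotation sets while keeping image and annotation IDs unique. The sketch below assumes both sets are stored as COCO-format JSON with an identical category list (a common choice for PaddleDetection training data); the file names are placeholders, not the project's actual files.

```python
# Illustrative merge of two COCO-format annotation files (e.g. customized VTSD +
# Dataset #4) into one fine-tuning set. Assumes both files share the same
# category list; file names are placeholders.
import json

def merge_coco(path_a, path_b, out_path):
    with open(path_a) as f:
        a = json.load(f)
    with open(path_b) as f:
        b = json.load(f)

    # Offset IDs in set B so they cannot collide with set A.
    img_offset = max(img["id"] for img in a["images"]) + 1
    ann_offset = max(ann["id"] for ann in a["annotations"]) + 1
    for img in b["images"]:
        img["id"] += img_offset
    for ann in b["annotations"]:
        ann["id"] += ann_offset
        ann["image_id"] += img_offset

    merged = {
        "images": a["images"] + b["images"],
        "annotations": a["annotations"] + b["annotations"],
        "categories": a["categories"],  # assumed identical in both sets
    }
    with open(out_path, "w") as f:
        json.dump(merged, f)

merge_coco("vtsd_train.json", "extra_train.json", "vtsd_plus_extra_train.json")
```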
Improvement samples #1–#3 (for each sample, the upper image shows the result before the improvement and the lower image the result after it)
Future work:
- Fine-tuning and integrating a scene text recognition module into the system
- Building an end-to-end Transformer-based model
- Developing a web application for demonstration