Jetson-aware Embedded Deep learning Inference acceleration framework with TensorRT
JEDI is a simple framework that applies various parallelization techniques to tkDNN-based deep learning applications running on NVIDIA Jetson boards such as the NVIDIA Jetson AGX Xavier and NVIDIA Jetson Xavier NX.
The main goal of this tool is to maximize the throughput of deep learning inference by combining these parallelization techniques.
If you use JEDI in your research, please cite the following paper.
```bibtex
@article{10.1145/3508391,
author = {Jeong, EunJin and Kim, Jangryul and Ha, Soonhoi},
title = {TensorRT-Based Framework and Optimization Methodology for Deep Learning Inference on Jetson Boards},
year = {2022},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
issn = {1539-9087},
url = {https://doi.org/10.1145/3508391},
doi = {10.1145/3508391},
journal = {ACM Trans. Embed. Comput. Syst.},
}
```
- Preprocessing parallelization
- Postprocessing parallelization
- Intra-network pipelining with GPU and DLA
- Stream assignment per pipelining stage (see the conceptual sketch after this list)
- Intermediate buffer assignment between pipelining stages
- Partial network duplication
- INT8 quantization on pipelined networks
- Batched execution
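The stream assignment idea can be pictured with the following conceptual sketch (this is not JEDI source code; the `launchStages` helper and its arguments are made up for illustration): each pipelining stage gets its own CUDA stream so that, for example, a GPU stage and a DLA stage can execute concurrently.

```cpp
#include <cuda_runtime.h>
#include <NvInfer.h>
#include <vector>

// Conceptual sketch only: every pipelining stage is enqueued on its own CUDA
// stream, allowing GPU and DLA stages to overlap.
void launchStages(std::vector<nvinfer1::IExecutionContext*>& stageContexts,
                  std::vector<std::vector<void*>>& stageBindings) {
    std::vector<cudaStream_t> streams(stageContexts.size());
    for (auto& s : streams) cudaStreamCreate(&s);

    // Enqueue each stage on its own stream; inter-stage synchronization
    // (events, intermediate buffers) is omitted for brevity.
    for (size_t i = 0; i < stageContexts.size(); ++i)
        stageContexts[i]->enqueueV2(stageBindings[i].data(), streams[i], nullptr);

    for (auto& s : streams) cudaStreamSynchronize(s);
    for (auto& s : streams) cudaStreamDestroy(s);
}
```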
- Test environment: NVIDIA Jetson AGX Xavier (MAXN mode with jetson_clocks), Jetpack 4.3
- Input image size: 416x416
- The recent experiments were run with the opencv_parallel_num = 0 option. Throughput numbers below are in frames per second (FPS).
| Network | Baseline GPU | GPU with JEDI | GPU + DLA with JEDI |
|---|---|---|---|
| Yolov2 relu | 74 | 187 | 295 |
| Yolov2tiny relu | 91 | 625 | 701 |
| Yolov3 relu | 50 | 85 | 128 |
| Yolov3tiny relu | 102 | 614 | 729 |
| Yolov4 relu | 45 | 81 | 128 |
| Yolov4tiny relu | 103 | 620 | 598 |
| Yolov4csp relu | 41 | 94 | 141 |
| CSPNet relu | 40 | 65 | 80 |
| Densenet+Yolo relu | 44 | 86 | 118 |

| Network | Baseline GPU | GPU with JEDI | GPU + DLA with JEDI |
|---|---|---|---|
| Yolov2 relu | 90 | 401 | 502 |
| Yolov2tiny relu | 96 | 749 | - |
| Yolov3 relu | 67 | 169 | 222 |
| Yolov3tiny relu | 110 | 833 | - |
| Yolov4 relu | 59 | 156 | 216 |
| Yolov4tiny relu | 108 | 810 | - |
| Yolov4csp relu | 49 | 180 | 233 |
| CSPNet relu | 63 | 145 | 147 |
| Densenet+Yolo relu | 61 | 186 | 230 |
The results below are based on an old version of this software. (The target version is commit )
- Test environment: NVIDIA Jetson AGX Xavier (MAXN mode with jetson_clocks), Jetpack 4.3
- Input image size: 416x416
| Network | Baseline GPU (FP16) | GPU with parallelization techniques (FP16) | GPU + DLA pipelining (FP16) |
|---|---|---|---|
| Yolov2 relu | 74 | 193 | 291 |
| Yolov3 relu | 50 | 87 | 133 |
| Yolov4 relu | 43 | 73 | 90 |
| Yolov4tiny relu | 103 | 459 | 504 |
| CSPNet relu | 40 | 62 | 72 |
| Densenet+Yolo relu | 44 | 86 | 120 |
- Supported Platforms
- Prerequisite
- How to Compile JEDI
- JEDI Configuration Parameters
- How to Run JEDI
- How to Add a New Application in JEDI
- Supported and Tested Networks
- References
- NVIDIA Jetson boards are supported. (Tested on NVIDIA Jetson AGX Xavier and NVIDIA Jetson Xavier NX)
- Forked tkDNN
- All dependencies required by tkDNN
- Jetpack 4.3 or higher
- libconfig++
- OpenMP
After installing the forked version of tkDNN, compile JEDI with the following commands.
```
git clone https://github.com/urmydata/tkDNN.git
mkdir build && cd build
cmake ..
make
```
- To run JEDI, use the following command with these parameters:
```
./build/bin/proc -c <JEDI configuration file> -r <JSON result file> -p <tegrastats log> -t <inference time output file>
```
where
- `-c <JEDI configuration file>`: JEDI configuration file (the configuration file format is explained below)
- `-r <JSON result file>` (optional): output file of detection results in COCO JSON format
- `-p <tegrastats log output file>` (optional): tegrastats log recorded during inference, used for computing utilization and power
- `-t <inference time output file>` (optional): output file containing the total inference time

Example commands for running JEDI:
```
./build/bin/proc -h # print the help message
./build/bin/proc -c sample.cfg -r result.json -p power.log # an example run
```
- The JEDI configuration file is based on the libconfig format.
- sample.cfg is a sample configuration file with a detailed explanation of each configuration parameter.
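For orientation, the sketch below shows the general shape of a configuration file using only the two parameters that appear in this README (`app_type` and `opencv_parallel_num`); every other parameter, and the authoritative syntax, should be taken from sample.cfg.

```
# Hypothetical minimal sketch; see sample.cfg for the full parameter list and syntax.
app_type = "YoloApplication"    # application class registered via REGISTER_JEDI_APPLICATION
opencv_parallel_num = 0         # OpenCV parallelism setting used in the experiments above
```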
- JEDI provides an interface to add a new tkDNN-based deep learning application.
- Currently, `YoloApplication` and `CenternetApplication` are implemented.
- Write your own deep learning application with the inference application implementation interface (a hedged skeleton is sketched after this list):
  - `readCustomOptions`: add a custom option used by this application
  - `createNetwork`: create a tkDNN-based network
  - `referNetworkRTInfo`: refer to the NetworkRT class if any of its information is needed
  - `initializePreprocessing`: initialize preprocessing and the input dataset
  - `initializePostprocessing`: initialize postprocessing
  - `preprocessing`: execute preprocessing
  - `postprocessing`: execute postprocessing (batched execution must be performed inside this method)
  - Call order: `readCustomOptions` => `createNetwork` => `referNetworkRTInfo` => `initializePreprocessing` => `initializePostprocessing` => `preprocessing`/`postprocessing`
- You can also implement your own dataset with the dataset implementation interface.
- Register your application with the following code in your source code:
```cpp
REGISTER_JEDI_APPLICATION([Your application class name]);
```
- Add your source code to CMakeLists.txt
- Insert `app_type = "[Your application class name]"` in the JEDI configuration file.
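As a rough illustration only, a new application could follow the skeleton below; the class name is hypothetical and all method signatures are placeholders, so the actual base class and signatures should be copied from the existing `YoloApplication` or `CenternetApplication` sources.

```cpp
// Skeleton only: signatures are placeholders; derive from JEDI's application
// base class exactly as YoloApplication/CenternetApplication do.
class MyApplication /* : public <JEDI application base class> */ {
public:
    void readCustomOptions()        { /* read application-specific options from the config file */ }
    void createNetwork()            { /* build the tkDNN-based network */ }
    void referNetworkRTInfo()       { /* consult NetworkRT if any of its information is needed */ }
    void initializePreprocessing()  { /* prepare the input dataset and preprocessing */ }
    void initializePostprocessing() { /* prepare postprocessing */ }
    void preprocessing()            { /* per-frame preprocessing */ }
    void postprocessing()           { /* postprocessing; batched execution must happen here */ }
};

// Register the class so it can be selected with app_type = "MyApplication" in the configuration file.
REGISTER_JEDI_APPLICATION(MyApplication);
```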
| Network | Trained Dataset | Input size | Network cfg | Weights |
|---|---|---|---|---|
| YOLO v2 [1] with relu | COCO 2014 trainval | 416x416 | cfg | weights |
| YOLO v2 tiny [1] with relu | COCO 2014 trainval | 416x416 | cfg | weights |
| YOLO v3 [2] with relu | COCO 2014 trainval | 416x416 | cfg | weights |
| YOLO v3 tiny [2] with relu | COCO 2014 trainval | 416x416 | cfg | weights |
| Centernet [4] (DLA34 backend) | COCO 2017 train | 512x512 | - | weights |
| Cross Stage Partial Network [7] with relu | COCO 2014 trainval | 416x416 | cfg | weights |
| Yolov4 [8] with relu | COCO 2014 trainval | 416x416 | cfg | weights |
| Yolov4 tiny [8] with relu | COCO 2014 trainval | 416x416 | cfg | weights |
| Scaled Yolov4 [10] with relu | COCO 2017 train | 512x512 | cfg | weights |
| Densenet+Yolo [9] with relu | COCO 2014 trainval | 416x416 | cfg | weights |
1. Redmon, Joseph, and Ali Farhadi. "YOLO9000: Better, Faster, Stronger." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017.
2. Redmon, Joseph, and Ali Farhadi. "YOLOv3: An Incremental Improvement." arXiv preprint arXiv:1804.02767 (2018).
3. Yu, Fisher, et al. "Deep Layer Aggregation." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018.
4. Zhou, Xingyi, Dequan Wang, and Philipp Krähenbühl. "Objects as Points." arXiv preprint arXiv:1904.07850 (2019).
5. Sandler, Mark, et al. "MobileNetV2: Inverted Residuals and Linear Bottlenecks." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018.
6. He, Kaiming, et al. "Deep Residual Learning for Image Recognition." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016.
7. Wang, Chien-Yao, et al. "CSPNet: A New Backbone That Can Enhance Learning Capability of CNN." arXiv preprint arXiv:1911.11929 (2019).
8. Bochkovskiy, Alexey, Chien-Yao Wang, and Hong-Yuan Mark Liao. "YOLOv4: Optimal Speed and Accuracy of Object Detection." arXiv preprint arXiv:2004.10934 (2020).
9. Bochkovskiy, Alexey. "Yolo v4, v3 and v2 for Windows and Linux" (https://github.com/AlexeyAB/darknet).
10. Wang, Chien-Yao, Alexey Bochkovskiy, and Hong-Yuan Mark Liao. "Scaled-YOLOv4: Scaling Cross Stage Partial Network." arXiv preprint arXiv:2011.08036 (2020).