modelopt-onnx-ptq is a CLI for ONNX post-training quantization (PTQ) and TensorRT on exported object-detection ONNX models: NVIDIA Model Optimizer, COCO calibration, quantize (optional --autotune), build-trt, eval-trt, pipeline-e2e, and reports. Training and export to ONNX stay outside this repo — you start from models/*.onnx.
| CLI | modelopt-onnx-ptq |
| Docs | docs/README.md |
| AI coding agents | skills/README.md |
- Pipeline
- TREx workflow (diagram)
- Quick Steps
- Supported Output Formats
- Technology Stack
- AI coding agents
- Getting Started
- Community
- License
Export and training are not in this project — bring models/*.onnx from your stack (Ultralytics, DeepStream-Yolo, etc.). Then: download-coco → calib → quantize (optional --autotune) → build-trt → eval-trt / trt-bench; or run everything with pipeline-e2e --onnx models/…onnx (FP16 baseline + PTQ matrix + report-runs under artifacts/pipeline_e2e/sessions/…). Details: docs/workflow.md.
Optional TensorRT Engine Explorer profiling (env_trex, trex-analyze, process_engine.py, notebooks) — not required for PTQ.
You need a machine with an NVIDIA GPU and software on the host so containers can use CUDA / TensorRT:
| Requirement | Notes |
|---|---|
| NVIDIA GPU | A CUDA-capable graphics card (e.g. GeForce / RTX / datacenter GPU). |
| NVIDIA driver | Installed on the host; nvidia-smi should work before you use Docker. |
| Docker | Docker Engine installed and running. |
| NVIDIA Container Toolkit | Lets docker run --gpus all pass the GPU into the container. Install guide. |
Verify the driver with nvidia-smi on the host. After installing the toolkit, follow NVIDIA's guide to confirm GPU access from Docker (e.g. run nvidia-smi inside a test container).
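The host checks above can also be scripted. The following is a minimal sketch, not part of modelopt-onnx-ptq: it only confirms that nvidia-smi and docker are on PATH and that nvidia-smi exits cleanly. The function names are hypothetical, and it does not replace NVIDIA's own GPU-in-container verification.

```python
import shutil
import subprocess

# Tools the requirements table expects on the host.
REQUIRED_TOOLS = ("nvidia-smi", "docker")


def find_tools(names=REQUIRED_TOOLS):
    """Map each tool name to its resolved path on PATH, or None if missing."""
    return {name: shutil.which(name) for name in names}


def driver_responds(timeout_s=15):
    """Return True if nvidia-smi is present, runs, and exits with status 0."""
    if shutil.which("nvidia-smi") is None:
        return False
    try:
        result = subprocess.run(
            ["nvidia-smi"], capture_output=True, timeout=timeout_s, check=False
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0


if __name__ == "__main__":
    for tool, path in find_tools().items():
        print(f"{tool}: {path or 'NOT FOUND'}")
    print("driver responds:", driver_responds())
```

A missing entry or a False result here suggests fixing the host setup before building the image.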
The modelopt-onnx-ptq package is installed inside the image at build time. You do not need to mount the Git repository to run — only bind-mount three folders on the host so ONNX, datasets, and outputs persist when the container stops.
Clone once (or copy the docker/ context elsewhere) and build:
git clone https://github.com/levipereira/Model-Optimizer-ONNX.git
cd Model-Optimizer-ONNX
docker build -f docker/Dockerfile -t modelopt-onnx-ptq .
Pick a root directory on the host (any path you like) and create the three subfolders:
export DATA_ROOT="$HOME/modelopt-onnx-ptq"
mkdir -p "$DATA_ROOT/models" "$DATA_ROOT/data" "$DATA_ROOT/artifacts"
docker run --gpus all --rm -it --network host \
  -w /workspace/modelopt-onnx-ptq \
  -v "$DATA_ROOT/models:/workspace/modelopt-onnx-ptq/models" \
  -v "$DATA_ROOT/data:/workspace/modelopt-onnx-ptq/data" \
  -v "$DATA_ROOT/artifacts:/workspace/modelopt-onnx-ptq/artifacts" \
  modelopt-onnx-ptq
Inside the container, the working directory is /workspace/modelopt-onnx-ptq. Use the same relative paths as in the docs (models/..., data/coco/..., artifacts/...); they map to $DATA_ROOT on the host.
| Host | Container |
|---|---|
| $DATA_ROOT/models | /workspace/modelopt-onnx-ptq/models |
| $DATA_ROOT/data | /workspace/modelopt-onnx-ptq/data |
| $DATA_ROOT/artifacts | /workspace/modelopt-onnx-ptq/artifacts |
Change DATA_ROOT to another disk or folder if you want.
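To locate on the host a file that a command wrote inside the container, the mount mapping above can be applied mechanically. This is an illustrative sketch only; host_path is a hypothetical helper, not something the package provides.

```python
import posixpath

# Container working directory and the three bind-mounted subfolders.
CONTAINER_ROOT = "/workspace/modelopt-onnx-ptq"
MOUNTED = ("models", "data", "artifacts")


def host_path(data_root, container_path):
    """Translate a container path to its location under data_root on the host.

    container_path may be absolute or relative to the container working
    directory. Paths outside the three mounts do not persist on the host,
    so they raise ValueError.
    """
    absolute = posixpath.normpath(posixpath.join(CONTAINER_ROOT, container_path))
    relative = posixpath.relpath(absolute, CONTAINER_ROOT)
    top = relative.split("/", 1)[0]
    if top not in MOUNTED:
        raise ValueError(f"{container_path!r} is not under a bind-mounted folder")
    return posixpath.join(data_root, relative)
```

For example, host_path("/mnt/disk", "artifacts/trt_engine") resolves to /mnt/disk/artifacts/trt_engine, while a path such as /tmp/scratch is rejected because it lives only in the container.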
See docs/docker-reference.md for build args and persistence details.
Run inside the container (or locally after pip install -e .):
- One command (calib → FP16 baseline → PTQ combos → report): modelopt-onnx-ptq pipeline-e2e --onnx models/your.onnx. Add --img-size, --input-name, and --output-format as needed (--output-format auto passes --onnx to eval-trt for layout inference). Use --no-fp16-baseline only if you do not want the FP16 comparison row (see docs/workflow.md).
Or step by step:
- ONNX under models/: already exported from your stack (see scope above); match letterbox, input name, and outputs to what you will use in production.
- modelopt-onnx-ptq download-coco --output-dir data/coco
- modelopt-onnx-ptq calib --images_dir data/coco/val2017 --calibration_data_size 500 --img_size 640
- modelopt-onnx-ptq quantize --calibration_data artifacts/calibration/…npy --onnx_path models/your.onnx (optional: --autotune default)
- Optional FP16 reference: modelopt-onnx-ptq build-trt --onnx models/your.onnx --mode fp16 (default output artifacts/trt_engine/<stem>.fp16.b<batch>_i<img>.engine), then eval-trt / trt-bench on that engine (use SESSION_ID or --session-id on each command for a unified report).
- modelopt-onnx-ptq build-trt --onnx artifacts/quantized/your…quant.onnx --img-size 640 --batch 1 → artifacts/trt_engine/<stem>.b<batch>_i<img>.engine (default --mode strongly-typed; see docs)
- modelopt-onnx-ptq eval-trt --output-format auto --onnx models/your.onnx --engine … (or set ultralytics / deepstream_yolo explicitly; see the table below)
- modelopt-onnx-ptq trt-bench --engine … for throughput logs used by report-runs
- modelopt-onnx-ptq report-runs (with SESSION_ID set), or pass --session-id / --trt-logs-dir / --eval-logs-dir as needed
CLI details: docs/cli-reference.md · optional docs site: pip install -e ".[docs]" && mkdocs serve (mkdocs.yml)
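The step-by-step sequence above can be assembled as a dry-run plan before anything is executed. This sketch only builds and prints argument lists using the flags shown in the steps. The helper names are hypothetical, and it assumes that <stem> in the default engine name is the ONNX file name without its final extension. The calibration file and quantized ONNX paths are left as arguments because their exact names come from the earlier steps.

```python
import shlex
from pathlib import PurePosixPath

CLI = "modelopt-onnx-ptq"


def default_engine_path(onnx_path, batch=1, img=640, fp16=False):
    """Engine path following the documented pattern
    artifacts/trt_engine/<stem>[.fp16].b<batch>_i<img>.engine.
    Assumption: <stem> is the file name minus its last extension."""
    stem = PurePosixPath(onnx_path).stem
    tag = ".fp16" if fp16 else ""
    return f"artifacts/trt_engine/{stem}{tag}.b{batch}_i{img}.engine"


def fp16_reference_command(onnx_path):
    """Optional FP16 baseline build from the original ONNX."""
    return [CLI, "build-trt", "--onnx", onnx_path, "--mode", "fp16"]


def step_commands(onnx_path, calibration_npy, quantized_onnx, img=640, batch=1):
    """Return the PTQ steps as argv lists, in execution order."""
    engine = default_engine_path(quantized_onnx, batch, img)
    return [
        [CLI, "download-coco", "--output-dir", "data/coco"],
        [CLI, "calib", "--images_dir", "data/coco/val2017",
         "--calibration_data_size", "500", "--img_size", str(img)],
        [CLI, "quantize", "--calibration_data", calibration_npy,
         "--onnx_path", onnx_path],
        [CLI, "build-trt", "--onnx", quantized_onnx,
         "--img-size", str(img), "--batch", str(batch)],
        [CLI, "eval-trt", "--output-format", "auto",
         "--onnx", onnx_path, "--engine", engine],
        [CLI, "trt-bench", "--engine", engine],
        # report-runs relies on SESSION_ID being set in the environment.
        [CLI, "report-runs"],
    ]


def render(commands):
    """Quote each argv list into a shell-ready line for review or logging."""
    return [shlex.join(cmd) for cmd in commands]
```

Printing render(step_commands(...)) gives one reviewable line per step. Keeping the commands as lists also makes it straightforward to hand them to subprocess.run one at a time and stop at the first failure.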
The PyTorch → ONNX step defines tensor names, ranks, and post-processing semantics. eval-trt only supports a single detection tensor [B, N, 6] (see below). Use auto with --onnx to pick ultralytics vs deepstream_yolo. Four-tensor exports (num_dets, det_*) are not supported. Flows discussed here assume ONNX exported with --dynamic, --simplify, and --opset 18 or newer (or equivalent flags in your exporter) so shapes and graphs stay consistent through PTQ and trtexec.
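As a quick sanity check before running eval-trt, the single-tensor constraint described above can be expressed as a shape test. This is a hedged sketch: is_supported_layout is a hypothetical helper, and it only checks the count, rank, and last dimension. It does not reproduce how auto chooses between ultralytics and deepstream_yolo.

```python
def is_supported_layout(output_shapes):
    """Return True when the model has exactly one output shaped [B, N, 6].

    output_shapes is a list of shapes, one per output tensor. Dynamic
    dimensions may appear as None, -1, or a symbolic name such as "batch";
    only the rank and the last dimension are checked.
    """
    if len(output_shapes) != 1:
        return False  # e.g. four-tensor exports (num_dets, det_*)
    shape = output_shapes[0]
    return len(shape) == 3 and shape[-1] == 6
```

A raw head such as [1, 84, 8400], or a four-tensor NMS export, fails this check and would need a different export before evaluation.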
modelopt-onnx-ptq eval-trt scores a TensorRT .engine on COCO by decoding how detections leave the network for your stack. Pass --output-format (or auto + --onnx) accordingly. Full flags and shapes: docs/cli-reference.md.
| --output-format | Typical source | Role |
|---|---|---|
| auto | Auto Select | With --onnx, selects ultralytics or deepstream_yolo for a single [B,N,6] output; without --onnx, uses engine tensor names/shapes. |
| ultralytics | ultralytics/ultralytics TensorRT export with integrated NMS: a single output tensor (e.g. output0) shaped [B, N, 6] (e.g. N = 300). | Each row is x1, y1, x2, y2, score, class in letterboxed input space (NMS already applied in the graph). Filter by --conf-thres, letterbox inverse, COCO mapping, mAP. |
| deepstream_yolo | marcoslucianops/DeepStream-Yolo engines aligned with the DeepStream custom bbox parser (nvdsparsebbox_Yolo): one output (often named output) [B, num_anchors, 6] (e.g. 8400 proposals at 640×640). | Same six fields as the parser (xyxy + score + class). In DeepStream, clustering/NMS runs in the pipeline; in eval-trt we apply per-class NMS in Python (--iou-thres), then letterbox inverse and mAP. |
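For the deepstream_yolo layout, the per-class NMS step can be pictured with a small greedy implementation over rows of x1, y1, x2, y2, score, class. This is a pure-Python sketch for illustration, not the code eval-trt runs. The default thresholds here are assumptions standing in for whatever --conf-thres and --iou-thres are set to.

```python
def iou(a, b):
    """Intersection over union of two xyxy boxes (first four fields used)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def per_class_nms(rows, conf_thres=0.25, iou_thres=0.45):
    """Greedy NMS applied independently per class.

    rows: iterable of [x1, y1, x2, y2, score, class].
    Rows below conf_thres are dropped; the rest are visited in descending
    score order and kept unless they overlap an already kept row of the
    same class by more than iou_thres.
    """
    candidates = sorted(
        (r for r in rows if r[4] >= conf_thres), key=lambda r: r[4], reverse=True
    )
    kept = []
    for row in candidates:
        suppressed = any(
            int(k[5]) == int(row[5]) and iou(k, row) > iou_thres for k in kept
        )
        if not suppressed:
            kept.append(row)
    return kept
```

Because suppression only compares boxes of the same class, two overlapping detections with different class ids both survive, which is the behaviour the per-class wording implies. For the ultralytics layout only the confidence filter would apply, since NMS is already in the graph.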
Input tensor: engines may use images, input, or another name; eval-trt binds the first input — ensure your build profile matches NCHW and the same letterbox normalization as calibration (÷255, RGB).
Batch: B may be dynamic in the engine; evaluation uses B = 1 per image.
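Both layouts report boxes in letterboxed input space, so evaluation has to map them back to original image coordinates. The sketch below assumes a square input and symmetric (centred) padding, which is a common letterbox convention but not confirmed for this project. The function names are hypothetical.

```python
def letterbox_params(orig_w, orig_h, size=640):
    """Return (scale, pad_x, pad_y) for fitting an image into size x size.

    Assumes aspect-preserving resize followed by equal padding on both
    sides of the shorter dimension.
    """
    scale = min(size / orig_w, size / orig_h)
    new_w, new_h = round(orig_w * scale), round(orig_h * scale)
    pad_x = (size - new_w) / 2
    pad_y = (size - new_h) / 2
    return scale, pad_x, pad_y


def unletterbox(box, orig_w, orig_h, size=640):
    """Map an xyxy box from letterboxed space back to the original image,
    clamping the result to the image bounds."""
    scale, pad_x, pad_y = letterbox_params(orig_w, orig_h, size)
    x1, y1, x2, y2 = box

    def clamp(value, upper):
        return min(max(value, 0.0), float(upper))

    return (
        clamp((x1 - pad_x) / scale, orig_w),
        clamp((y1 - pad_y) / scale, orig_h),
        clamp((x2 - pad_x) / scale, orig_w),
        clamp((y2 - pad_y) / scale, orig_h),
    )
```

For a 1280×720 image at size 640 the scale is 0.5 with 140 pixels of vertical padding on each side, so a box spanning the whole unpadded region maps back to the full image. If calibration and evaluation use different letterbox conventions, this inverse will be off by the padding difference, which is one reason to keep preprocessing identical across steps.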
| Layer | Choice |
|---|---|
| Quantization | nvidia-modelopt[onnx] 0.43.0 (NVIDIA PyPI; default MODELOPT_VERSION in docker/Dockerfile) |
| Calibration | ONNX Runtime GPU (CUDA 13 nightly, aligned with the image) |
| Engine | TensorRT 26.02 (NGC tensorrt:26.02-py3) |
| License | Apache 2.0 — LICENSE, NOTICE |
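When running outside the image, it can help to print what is actually installed and compare it with the table above. This is a small sketch using the standard library only. The distribution names below are assumptions about how each component is packaged and may differ in your environment.

```python
from importlib import metadata

# Assumed distribution names for the quantization, calibration, and engine layers.
STACK = ("nvidia-modelopt", "onnxruntime-gpu", "tensorrt")


def installed_versions(names=STACK):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions().items():
        print(f"{name}: {version or 'not installed'}")
```

A None entry means the distribution was not found under that name, which is not necessarily proof that the component is missing.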
This repository supports AI coding agents—IDE assistants, agent-style CLI tools, and other automation that load structured project context. Conventions, PTQ/TensorRT workflows, and troubleshooting are written as Agent Skills-style markdown so agents are not limited to generic chat knowledge.
Full guide (layout, format, how to use each skill): skills/README.md
| Path | Role |
|---|---|
| skills/modelopt-onnx-ptq-dev/SKILL.md | Umbrella: repo layout, maintainer conventions (Part A), pointers to domain skills. |
| skills/onnx-ptq/SKILL.md + reference.md | PTQ workflow, mode/method tables, quantize() / modelopt CLI reference. |
| skills/ptq-trt-performance/SKILL.md | Benchmarking (pipeline-e2e, report-runs), backbone/neck/head Conv whitelists. |
| skills/modelopt-troubleshooting/SKILL.md | CUDA/ORT/TRT/modelopt diagnostics and common failures. |
All Agent Skills for this project are maintained under skills/.
The Docker image clones the NVIDIA TensorRT repository at branch release/10.15 into /workspace/TREx and installs TREx in a separate virtualenv (source install.sh --venv --full → env_trex) so TREx’s pandas pins do not collide with modelopt-onnx-ptq, CuPy, or numpy in the main image Python. Use this only for TensorRT engine profiling (layer graphs, trtexec JSON, notebooks). It is not part of the PTQ pipeline.
source /workspace/TREx/tools/experimental/trt-engine-explorer/env_trex/bin/activate
trex --help
# Notebooks and utilities: /workspace/TREx/tools/experimental/trt-engine-explorer/
trex-analyze re-runs itself with env_trex when trex is not importable from the default interpreter. See docs/docker-reference.md (TREx).
The 10.15 tree is source for TREx; the TensorRT runtime matches the NGC base image. Details: docs/docker-reference.md — TREx.
To develop using the image: build it, then bind-mount your Git clone into /workspace/modelopt-onnx-ptq so you edit the repo on the host and run inside the container. Step-by-step: Edit mode with Docker (developers) in Installation.
If you want to change this project and run outside Docker, clone the repo, then install in editable mode from the repository root:
git clone https://github.com/levipereira/Model-Optimizer-ONNX.git
cd Model-Optimizer-ONNX
pip install -e .
modelopt-onnx-ptq --help
You still need a matching CUDA / TensorRT / ONNX Runtime stack on the host; the Docker image is the supported baseline.
- Discussions — questions, benchmarks, results, PTQ recipes per model. Read the welcome thread.
- Issues — confirmed bugs only, with versions, commands, and a minimal repro.
Official reference tables (mAP, latency, recommended PTQ settings) are in progress; the community can still share findings in Discussions.
Copyright © 2026 Levi Pereira. Licensed under the Apache License, Version 2.0. See LICENSE and NOTICE for terms and third-party notices.

