This project provides a complete pipeline to pre-train the backbone of a custom YOLOv11 pose estimation model using knowledge distillation from a powerful DINOv3 vision foundation model. The entire process is handled by a single, configurable Python script that leverages the lightly-train and ultralytics libraries.
The goal is to transfer the rich, general-purpose visual understanding from the massive DINOv3 "teacher" model to a lightweight, efficient YOLOv11 "student" model. This pre-training step, performed on unlabeled images, gives the YOLO model a significant head start, leading to better performance, faster convergence, and improved data efficiency when you later fine-tune it on your specific (and often limited) labeled dataset.
| Model | Format | Processor | Quantization | Box(P) | Box(R) | Box(mAP50) | Box(mAP50-95) | Pose(P) | Pose(R) | Pose(mAP50) | Pose(mAP50-95) | Size (MB) | metrics/mAP50-95(P) | Inference time (ms/im) | FPS |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| YOLOv11l-pose_distilled | OpenVINO | CPU | INT8 | 0.958 | 0.961 | 0.98 | 0.958 | 0.961 | 0.964 | 0.985 | 0.953 | 26.3 | 0.9528 | 46.3 | 21.6 |
| YOLOv11n-pose_distilled | OpenVINO | CPU | INT8 | 0.966 | 0.958 | 0.981 | 0.951 | 0.966 | 0.958 | 0.981 | 0.909 | 3.6 | 0.9091 | 8.05 | 124.25 |
| YOLOv11n-pose_distilled | ONNX | GPU | FP16 | 0.972 | 0.95 | 0.981 | 0.952 | 0.972 | 0.95 | 0.981 | 0.91 | 5.5 | 0.9104 | 9.39 | 106.44 |
| YOLOv11l-pose_distilled | ONNX | GPU | FP16 | 0.967 | 0.948 | 0.981 | 0.962 | 0.967 | 0.948 | 0.981 | 0.95 | 49.6 | 0.9497 | 18.38 | 54.4 |
| YOLOv11l-pose_distilled | TensorRT | GPU | INT8 | 0.967 | 0.945 | 0.98 | 0.953 | 0.967 | 0.945 | 0.98 | 0.946 | 33.7 | 0.9459 | 2.36 | 422.96 |
| YOLOv11n-pose_distilled | TensorRT | GPU | INT8 | 0.959 | 0.965 | 0.98 | 0.945 | 0.959 | 0.965 | 0.98 | 0.901 | 9.6 | 0.9013 | 1.64 | 607.92 |
| YOLOv11n-pose_distilled | TensorRT | GPU | FP16 | 0.971 | 0.951 | 0.981 | 0.955 | 0.971 | 0.951 | 0.981 | 0.91 | 11.1 | 0.9102 | 1.48 | 675.3 |
| YOLOv11l-pose_distilled | TensorRT | GPU | FP16 | 0.967 | 0.949 | 0.981 | 0.961 | 0.967 | 0.949 | 0.981 | 0.949 | 56.2 | 0.9495 | 2.84 | 351.43 |
- Model performance measured on an RTX 4080 Super (16 GB VRAM) and an Intel i9-14900F. All metrics in the table were obtained after fine-tuning on the custom dataset. OpenVINO = CPU; TensorRT = GPU.
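- For reference, the FPS column is the reciprocal of the per-image latency, FPS ≈ 1000 / inference time (ms/im); for example, 1000 / 1.48 ms ≈ 676 FPS for the fastest TensorRT FP16 export.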
- Key Concepts Explained
- Features
- Prerequisites
- Step-by-Step Instructions
- Understanding the Output
- Code Breakdown
- Acknowledgements
To understand what this script does, let's break down the core ideas.
Knowledge Distillation is a machine learning technique where we train a smaller, more efficient "student" model by transferring knowledge from a larger, more powerful "teacher" model. Instead of training the student directly on ground-truth labels (which we don't have for pre-training), we train it to mimic the outputs or internal representations of the teacher.
Think of it like an apprenticeship. A master artisan (the teacher) doesn't just show the apprentice (the student) the finished product. Instead, the master demonstrates the process and guides the apprentice's technique. In our case, the DINOv3 teacher shows the YOLOv11 student how to "see" and interpret an image by forcing the student's internal feature maps to match the teacher's. The training loss is calculated based on the difference between the student's and teacher's representations.
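The sketch below is a minimal, self-contained illustration of such a feature-level distillation loss. It is not lightly-train's actual implementation: the 1x1-conv projector, the MSE objective, and all tensor shapes are assumptions made for the example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def feature_distillation_loss(student_feats, teacher_feats, projector):
    """Penalize the gap between student and teacher feature maps (illustrative)."""
    # Match spatial resolution: pool the student map down to the teacher's grid.
    student_feats = F.adaptive_avg_pool2d(student_feats, teacher_feats.shape[-2:])
    # Map student channels into the teacher's embedding space with a learned 1x1 conv.
    projected = projector(student_feats)
    # The teacher is frozen, so its features are detached from the graph.
    return F.mse_loss(projected, teacher_feats.detach())

# Made-up shapes: batch of 2, student 256-ch @ 40x40, ViT-B/16-style teacher 768-ch @ 14x14.
projector = nn.Conv2d(256, 768, kernel_size=1)
loss = feature_distillation_loss(
    torch.randn(2, 256, 40, 40), torch.randn(2, 768, 14, 14), projector
)
print(loss.item())
```

The teacher's features are detached because only the student (and the projector) should receive gradients; the teacher acts purely as a fixed target.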
- Teacher: DINOv3 (`dinov3/vitb16`)
  - DINOv3 is a state-of-the-art vision foundation model developed by Meta AI. It was trained using self-supervised learning on a massive, diverse dataset of images.
  - Because it wasn't trained for a single, narrow task, it has developed a profound and generalizable understanding of visual patterns, textures, shapes, and object parts. Its internal representations (feature maps) are incredibly rich and semantically meaningful.
- Student: YOLOv11-Pose (a native YOLO YAML model file such as `IntegraPose11x-pose.yaml`)
  - YOLO (You Only Look Once) is a family of models famous for being extremely fast and efficient, making them ideal for real-time applications like pose estimation.
  - Our student is a custom-defined YOLOv11 architecture for pose estimation. By itself, its randomly initialized backbone knows nothing about the visual world.
By distilling knowledge from DINOv3, we are essentially "imprinting" the teacher's sophisticated visual understanding onto the student's smaller, more efficient architecture. This provides several key advantages:
- Leverages Unlabeled Data: We can use a massive corpus of cheap, unlabeled images to give our model a robust starting point.
- Better Initialization: The student model starts the final fine-tuning process not with random weights, but with a backbone that already understands visual concepts. This is far more effective than traditional pre-training on datasets like ImageNet.
- Improved Performance & Data Efficiency: A well-pre-trained model requires less labeled data and fewer epochs during the final fine-tuning stage to achieve high performance on the target task (pose estimation).
- Distillation Pre-training: Uses `lightly-train` to distill knowledge from a local DINOv3 teacher into a YOLOv11 student backbone.
- Custom YOLO YAML: Supports any custom YOLOv8/v9/v10-style pose estimation YAML file.
- Automatic YAML Patching: Intelligently patches common YAML configuration errors (e.g., residual channel mismatches) to prevent training failures; see the sketch after this list.
- Flexible Resumption: Robust policies for resuming interrupted runs (`resume`), starting new runs with old weights (`warm_start`), or starting fresh.
- Automatic Export: The final pre-trained backbone is automatically exported to a `.pt` file, ready for use with the `ultralytics` framework.
- Optional Fine-tuning: A built-in option to immediately proceed to fine-tuning the pre-trained model on a labeled dataset.
- Optional Embedding Extraction: A utility to generate image embeddings from your final pre-trained model.
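To make the YAML-patching feature concrete, here is a rough sketch of that kind of patch. It is not the script's `_autopatch_yaml()` code: the `[from, repeats, module, args]` layer layout, the module names, and the idea that a trailing boolean argument controls the residual shortcut are all assumptions for illustration.

```python
import yaml

def autopatch_yaml(src_path, dst_path, default_scale="x"):
    """Write a patched copy of a YOLO model YAML (illustrative sketch only)."""
    with open(src_path) as f:
        cfg = yaml.safe_load(f)

    # Make sure a model scale is defined so channel widths resolve consistently.
    cfg.setdefault("scale", default_scale)

    # Assumed layer layout: [from, repeats, module, args]. Flipping a trailing
    # boolean to False stands in for "disable the residual shortcut" here.
    for section in ("backbone", "head"):
        for layer in cfg.get(section, []):
            _from, _repeats, module, args = layer
            if module in ("C2f", "C3k2") and args and isinstance(args[-1], bool):
                args[-1] = False

    with open(dst_path, "w") as f:
        yaml.safe_dump(cfg, f, sort_keys=False)
    return dst_path
```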
- Hardware: A modern NVIDIA GPU with at least 8GB of VRAM is recommended.
- Software: Python 3.8+ and `pip`.
- Environment: A CUDA-enabled environment to run PyTorch on the GPU.
First, get the project files on your local machine.
git clone https://github.com/farhanaugustine/DINOv3_Distillation_YOLO-pose.git
cd DINOv3_Distillation_YOLO-pose
The script requires lightly-train with the ultralytics extra, as well as ultralytics itself.
pip install "lightly-train[ultralytics]" ultralytics
You need the pre-trained weights for the DINOv3 teacher model. This script is configured for the ViT-B/16 version.
- Download Link: You can find download links and instructions on the official DINOv3 GitHub repository. The script is configured for `dinov3_vitb16_pretrain.pth`.
- Save Location: Place the downloaded `.pth` file in a known location. You will provide the path to this file in the script's configuration.
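If you want to confirm the download is intact before starting a long run, an optional check is to load the checkpoint with PyTorch and inspect its top-level keys (the path below is a placeholder):

```python
import torch

ckpt_path = r"C:\path\to\your\dinov3_vitb16_pretrain.pth"  # placeholder path
state = torch.load(ckpt_path, map_location="cpu")
# Depending on the release, the file may hold the weights directly or nest them
# under a key such as "model" or "state_dict"; just peek at the top level.
if isinstance(state, dict):
    print("Top-level keys:", list(state.keys())[:5])
else:
    print("Loaded object of type:", type(state))
```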
- Unlabeled Images: Gather all the images you want to use for pre-training. Place them in a single folder. The script will recursively scan this folder for images. This can be thousands or millions of images.
  - Example: `C:\data\unlabeled_images\`
- YOLOv11 YAML File: You need a YOLOv11-Pose model definition file (e.g., `IntegraPose11x-pose.yaml`). This file must be placed in the same folder as the Python script.
- (Optional) Labeled Dataset: If you plan to use the automatic fine-tuning feature (`DO_FINETUNE = True`), prepare your labeled dataset in the Ultralytics format and have the `dataset.yaml` file ready.
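A small optional helper to verify how many images the folder actually contains before training starts (the path and extension list are assumptions; adjust them to your data):

```python
from pathlib import Path

# Count candidate training images under the unlabeled data root (recursive scan).
IMG_EXTS = {".jpg", ".jpeg", ".png", ".bmp", ".webp"}  # assumed extension list
root = Path(r"C:\data\unlabeled_images")                # placeholder path
n_images = sum(1 for p in root.rglob("*") if p.suffix.lower() in IMG_EXTS)
print(f"Found {n_images} images under {root}")
```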
Open the pre-train_distill_yolo11.py file and edit the variables in the CONFIG section at the top. This is the most important step.
# Unlabeled images root (Lightly scans recursively)
UNLABELED_DATA_DIR = r"C:\path\to\your\unlabeled_images"
# Your YAML is in the SAME FOLDER as this script. Put its filename here:
YOLO_MODEL = r"IntegraPose11x-pose.yaml"
# Lightly run output directory
OUT_DIR = r"C:\path\to\save\training\output"
# DINOv3 teacher weights you downloaded
TEACHER_WEIGHTS = r"C:\path\to\your\dinov3_vitb16_pretrain.pth"
# Training knobs (start low, increase if your GPU can handle it)
EPOCHS = 50
BATCH_SIZE = 2
PRECISION = "16-mixed"
# --- Optional Steps ---
# Set to True to run fine-tuning after pre-training
DO_FINETUNE = False
ULTRALYTICS_DATA_YAML = r"C:\path\to\your\labeled_dataset.yaml"
# Set to True to extract embeddings after pre-training
DO_EMBED = False
Once configured, execute the script from your terminal.
python pre-train_distill_yolo11.py
The script will log its progress, including the configuration, model setup, and training epochs.
All artifacts from the training run will be saved in the directory you specified in OUT_DIR.
C:\path\to\save\training\output\
├── checkpoints\
│   ├── last.ckpt                  # Checkpoint for resuming
│   └── epoch=X-step=Y.ckpt
├── exported_models\
│   └── exported_last.pt           # <<< YOUR FINAL, USABLE MODEL FOR ULTRALYTICS FINE-TUNING
└── ... (other Lightly log files)
The most important file is exported_last.pt. This is your pre-trained YOLOv11 model, ready to be used for fine-tuning or inference with Ultralytics.
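Loading that file with Ultralytics then works like any other YOLO checkpoint. A minimal fine-tuning sketch (paths and hyperparameters are placeholders to adapt to your setup):

```python
from ultralytics import YOLO

# Load the distilled backbone exported by the pre-training run.
model = YOLO(r"C:\path\to\save\training\output\exported_models\exported_last.pt")
# Fine-tune on the labeled pose dataset defined by your dataset.yaml.
model.train(data=r"C:\path\to\your\labeled_dataset.yaml", epochs=100, imgsz=640)
```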
- `CONFIG` Block: All user-configurable parameters are centralized at the top of the script for easy access.
- `_get_model_for_lightly()`: Resolves the `YOLO_MODEL` variable. It smartly handles local YAML files and includes a fallback to an official alias if the local file fails to build.
- `_autopatch_yaml()`: A crucial utility that prevents common training errors by creating a patched copy of your YAML on the fly. It disables shortcuts (residuals) that might have mismatched channel counts due to model scaling and ensures a `scale` parameter is present.
- `_make_resume_kwargs()`: Implements the logic for the `RESUME_POLICY`, ensuring that training can be resumed correctly without conflicting arguments.
- `pretrain_distill()`: The main function that sets up and launches the `lightly_train.train` process. It configures the dataloader, optimizer, and distillation method arguments. After training, it ensures the model is exported to the Ultralytics `.pt` format.
- `finetune_ultralytics()` / `embed_from_checkpoint()`: Wrapper functions that handle the optional post-training steps.
This project stands on the shoulders of giants. Special thanks to:
- Meta AI for developing and open-sourcing the powerful DINOv3 models.
- Lightly AI for their excellent `lightly-train` library, which makes complex self-supervised learning and distillation techniques accessible.
- Ultralytics for the versatile and easy-to-use YOLO framework.