feat(rl): Implement PPO delta-waypoint training for RL refinement #131

Open
Capri2014 wants to merge 6 commits into main from feature/vadv2-digest-survey

Conversation

@Capri2014
Owner

Pull Request Template

Summary

Brief description of what changed (1-2 sentences).

Changes

  • Code changes
  • Docs changes
  • New files added

Testing

  • Tests pass (if applicable)
  • Manual verification steps
  • Verified no merge conflicts with main

Checklist

  • Based on latest main branch
  • No merge conflicts
  • Commit messages follow convention
  • Documentation updated (if applicable)
  • Related issue linked (if applicable)

Related PRs/Issues

Link to related PRs or issues.


Note: This repository uses squash merging. All commits will be collapsed into one.

Capri2014 and others added 6 commits February 18, 2026 13:34
- Add train_ppo_delta_waypoint.py: Full PPO training for residual delta-head
  - DeltaHead and ValueHead architectures
  - GAE (Generalized Advantage Estimation) implementation
  - PPO update with clipping, value loss, entropy bonus
  - Support for toy and CARLA environments
  - Configurable hyperparameters via argparse
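
The GAE and clipped PPO pieces listed above can be sketched in plain Python as follows. This is a minimal illustration, not the code in train_ppo_delta_waypoint.py: the function names, argument names, and default values (gamma=0.99, lam=0.95, clip_eps=0.2) are assumptions chosen for the example, and the value loss and entropy bonus are omitted.

```python
import math


def compute_gae(rewards, values, last_value, dones, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one rollout.

    rewards, values, dones are per-step lists of equal length;
    last_value is the critic's estimate for the state after the final step.
    Returns (advantages, returns), where returns = advantages + values.
    """
    advantages = [0.0] * len(rewards)
    gae = 0.0
    next_value = last_value
    # Walk backwards so each step can reuse the accumulated advantage.
    for t in reversed(range(len(rewards))):
        nonterminal = 0.0 if dones[t] else 1.0
        # One-step TD error; bootstrapping is cut at episode boundaries.
        delta = rewards[t] + gamma * next_value * nonterminal - values[t]
        gae = delta + gamma * lam * nonterminal * gae
        advantages[t] = gae
        next_value = values[t]
    returns = [a + v for a, v in zip(advantages, values)]
    return advantages, returns


def ppo_clip_loss(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
    """Clipped surrogate policy loss (to be minimized)."""
    total = 0.0
    for new_lp, old_lp, adv in zip(new_log_probs, old_log_probs, advantages):
        ratio = math.exp(new_lp - old_lp)
        clipped_ratio = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
        # Take the pessimistic (smaller) of the two surrogate objectives.
        total += min(ratio * adv, clipped_ratio * adv)
    return -total / len(advantages)
```

When the new and old policies agree (ratio 1), the loss reduces to the negative mean advantage; clipping only takes effect once the ratio leaves the interval [1 - clip_eps, 1 + clip_eps].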

- Add test_ppo_delta_smoke.py: Smoke tests for validation
  - Unit tests for DeltaHead, ValueHead, GAE
  - Toy environment testing
  - Policy forward pass testing
  - Minimal training loop integration test

- Update training/rl/README.md: Documentation
  - Architecture overview
  - Usage examples
  - Key arguments reference
  - Output structure
  - Comparison workflow for SFT vs RL

Architecture: final_waypoints = sft_waypoints + delta_head(z)
- Frozen SFT encoder (safer, stable)
- Trainable delta head (sample-efficient)
- Residual correction for online improvement
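
The residual formulation above can be illustrated with a small sketch. It assumes waypoints are (x, y) pairs in meters; the function name apply_delta and the optional max_delta clamp are hypothetical additions for illustration and are not described in this PR.

```python
def apply_delta(sft_waypoints, deltas, max_delta=None):
    """Compose final waypoints as SFT prediction plus a learned correction.

    sft_waypoints and deltas are equal-length lists of (x, y) tuples.
    max_delta (hypothetical) bounds each correction component so the RL
    policy cannot move far from the SFT trajectory.
    """
    if len(sft_waypoints) != len(deltas):
        raise ValueError("sft_waypoints and deltas must have the same length")
    final = []
    for (x, y), (dx, dy) in zip(sft_waypoints, deltas):
        if max_delta is not None:
            dx = max(-max_delta, min(dx, max_delta))
            dy = max(-max_delta, min(dy, max_delta))
        final.append((x + dx, y + dy))
    return final
```

A zero delta reproduces the SFT waypoints exactly, so a delta head that starts near zero begins training from the SFT policy's behavior.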
- Add Pipeline PR #3 summary
- Update pipeline status table
- Mark all stages as implemented
- Add eval_toy_waypoint_env.py for policy evaluation
- Compute ADE/FDE with confidence intervals (95% CI)
- Two-sample t-test for statistical significance (p-values)
- Side-by-side SFT vs RL comparison report
- Configurable episode count (default: 100 for statistical power)

Usage:
  python -m training.rl.eval_toy_waypoint_env --compare \
    --sft-checkpoint out/sft_waypoint_bc_torch_v0/model.pt \
    --rl-checkpoint out/rl_delta_ppo_v0/final.pt --episodes 100

Output:
  ADE: 5.27m ± 0.12m (SFT) → 5.19m (RL) [-2%]*
  FDE: 5.83m (SFT) → 5.66m (RL) [-3%]*
  * p < 0.05 (statistically significant)
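
The metrics in this report can be sketched with the standard library alone. This is an illustration under stated assumptions, not the code in eval_toy_waypoint_env.py: the function names are hypothetical, the 95% CI uses the normal critical value 1.96, and the p-value uses a normal approximation to the t distribution (reasonable at around 100 episodes per group; an exact t-test would need a t-distribution CDF).

```python
import math
from statistics import NormalDist, mean, stdev


def ade(pred, target):
    """Average displacement error: mean L2 distance over all waypoints."""
    return mean(math.hypot(px - tx, py - ty)
                for (px, py), (tx, ty) in zip(pred, target))


def fde(pred, target):
    """Final displacement error: L2 distance at the last waypoint."""
    (px, py), (tx, ty) = pred[-1], target[-1]
    return math.hypot(px - tx, py - ty)


def mean_ci95(samples):
    """Return (mean, half_width) of an approximate 95% confidence interval."""
    half_width = 1.96 * stdev(samples) / math.sqrt(len(samples))
    return mean(samples), half_width


def welch_test(a, b):
    """Two-sample Welch statistic with a normal-approximation p-value."""
    se = math.sqrt(stdev(a) ** 2 / len(a) + stdev(b) ** 2 / len(b))
    if se == 0.0:
        raise ValueError("both samples have zero variance")
    t = (mean(a) - mean(b)) / se
    p = 2.0 * (1.0 - NormalDist().cdf(abs(t)))  # two-sided
    return t, p
```

Calling welch_test on the per-episode ADE lists of the SFT and RL policies gives the statistic behind the significance marker; p < 0.05 corresponds to the asterisk in the report.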
- Survey digest for VADv2 (ICLR 2026), a modern VLM-augmented end-to-end
  autonomous driving stack newer than UniAD.
- Covers system decomposition, inputs/outputs, training objectives,
  evaluation protocol, Tesla/Ashok claims mapping, and AIResearch recommendations.
- Includes citations, code links, and 3-bullet summary.

Ref: cron:Survey PR #3 (4:00pm PT)
- Add WaymoEpisodeLoader class supporting stub, synthetic, and Waymo formats
- Data classes: Pose, Waypoint, CameraFrame, WaymoRoute, WaymoEpisode
- to_ssl_dataset(): Convert episodes to SSL pretraining format
- get_statistics(): Dataset statistics (locations, weather conditions)
- CLI for listing and loading episodes

Part of driving-first pipeline: Waymo episodes → SSL pretrain → waypoint BC → RL → CARLA
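
A rough sketch of how such data classes and helpers might fit together is shown below. Only the class and function names come from the commit message; every field, the output dictionary layout, and the choice to write the helpers as free functions rather than WaymoEpisodeLoader methods are assumptions made for illustration.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Pose:
    x: float    # hypothetical fields: planar position and heading
    y: float
    yaw: float


@dataclass
class CameraFrame:
    timestamp: float
    image_path: str
    pose: Pose


@dataclass
class WaymoEpisode:
    episode_id: str
    location: str
    weather: str
    frames: List[CameraFrame] = field(default_factory=list)


def get_statistics(episodes: List[WaymoEpisode]) -> Dict[str, object]:
    """Summarize episode counts by location and weather."""
    return {
        "num_episodes": len(episodes),
        "num_frames": sum(len(e.frames) for e in episodes),
        "locations": dict(Counter(e.location for e in episodes)),
        "weathers": dict(Counter(e.weather for e in episodes)),
    }


def to_ssl_dataset(episodes: List[WaymoEpisode]) -> List[Dict[str, str]]:
    """Flatten episodes into one record per camera frame for SSL pretraining."""
    return [
        {"episode_id": e.episode_id, "image_path": f.image_path}
        for e in episodes
        for f in e.frames
    ]
```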