feat(rl): Implement PPO delta-waypoint training for RL refinement by Capri2014 · Pull Request #131 · Capri2014/AIResearch

Capri2014 · 2026-02-19T01:03:04Z

Pull Request Template

Summary

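Implements PPO training for a residual delta-waypoint head on top of a frozen SFT encoder (final_waypoints = sft_waypoints + delta_head(z)), with smoke tests, README docs, and an SFT-vs-RL evaluation script that reports ADE/FDE with confidence intervals and t-test p-values. Also adds a VADv2 survey digest and a WaymoEpisodeLoader for the driving-first pipeline.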

Changes

Code changes
Docs changes
New files added

Testing

Tests pass (if applicable)
Manual verification steps
Verified no merge conflicts with main

Checklist

Based on latest main branch
No merge conflicts
Commit messages follow convention
Documentation updated (if applicable)
Related issue linked (if applicable)

Related PRs/Issues

Link to related PRs or issues.

Note: This repository uses squash merging. All commits will be collapsed into one.

- Add train_ppo_delta_waypoint.py: full PPO training for the residual delta head
  - DeltaHead and ValueHead architectures
  - GAE (Generalized Advantage Estimation) implementation
  - PPO update with clipping, value loss, and entropy bonus
  - Support for toy and CARLA environments
  - Configurable hyperparameters via argparse
- Add test_ppo_delta_smoke.py: smoke tests for validation
  - Unit tests for DeltaHead, ValueHead, and GAE
  - Toy environment testing
  - Policy forward-pass testing
  - Minimal training-loop integration test
- Update training/rl/README.md: documentation
  - Architecture overview
  - Usage examples
  - Key arguments reference
  - Output structure
  - Comparison workflow for SFT vs. RL

Architecture: final_waypoints = sft_waypoints + delta_head(z)
- Frozen SFT encoder (safer, stable)
- Trainable delta head (sample-efficient)
- Residual correction for online improvement
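The GAE and clipped-PPO pieces listed above can be sketched framework-free as follows. This is a minimal illustration under stated assumptions, not the code in train_ppo_delta_waypoint.py: the function names `compute_gae` and `ppo_loss`, the list-based argument layout, and the default coefficients (gamma=0.99, lam=0.95, clip_eps=0.2, value_coef=0.5, entropy_coef=0.01) are hypothetical choices. A real trainer would compute the same quantities on tensors with autograd.

```python
import math
from typing import List, Sequence, Tuple


def compute_gae(
    rewards: Sequence[float],
    values: Sequence[float],
    dones: Sequence[bool],
    last_value: float,
    gamma: float = 0.99,
    lam: float = 0.95,
) -> Tuple[List[float], List[float]]:
    """Generalized Advantage Estimation over one rollout.

    values[t] is V(s_t); last_value is V(s_T), used to bootstrap the final step.
    dones[t] is True if the episode terminated after step t.
    Returns (advantages, returns) where returns = advantages + values.
    """
    T = len(rewards)
    advantages = [0.0] * T
    gae = 0.0
    for t in reversed(range(T)):
        next_value = last_value if t == T - 1 else values[t + 1]
        not_done = 0.0 if dones[t] else 1.0
        # TD residual: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * next_value * not_done - values[t]
        # Exponentially weighted sum of residuals, reset at episode boundaries
        gae = delta + gamma * lam * not_done * gae
        advantages[t] = gae
    returns = [a + v for a, v in zip(advantages, values)]
    return advantages, returns


def ppo_loss(
    new_log_probs: Sequence[float],
    old_log_probs: Sequence[float],
    advantages: Sequence[float],
    new_values: Sequence[float],
    returns: Sequence[float],
    entropy: float,  # mean policy entropy over the batch (hypothetical input)
    clip_eps: float = 0.2,
    value_coef: float = 0.5,
    entropy_coef: float = 0.01,
) -> float:
    """Scalar PPO loss: -(clipped surrogate) + value loss - entropy bonus."""
    n = len(advantages)
    surrogate = 0.0
    for lp_new, lp_old, adv in zip(new_log_probs, old_log_probs, advantages):
        ratio = math.exp(lp_new - lp_old)  # pi_new(a|s) / pi_old(a|s)
        clipped = max(min(ratio, 1.0 + clip_eps), 1.0 - clip_eps)
        # Pessimistic bound: take the smaller of the unclipped and clipped terms
        surrogate += min(ratio * adv, clipped * adv)
    policy_loss = -surrogate / n
    value_loss = sum((v - r) ** 2 for v, r in zip(new_values, returns)) / n
    return policy_loss + value_coef * value_loss - entropy_coef * entropy
```

With lam=1 the advantages reduce to discounted returns minus the value baseline, which makes the GAE recursion easy to check by hand. The `min` over clipped and unclipped terms is what stops a single update from pushing the probability ratio far outside [1 - clip_eps, 1 + clip_eps] when the advantage is positive.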

- Add Pipeline PR #3 summary
- Update pipeline status table
- Mark all stages as implemented

- Add eval_toy_waypoint_env.py for policy evaluation
- Compute ADE/FDE with 95% confidence intervals
- Two-sample t-test for statistical significance (p-values)
- Side-by-side SFT vs. RL comparison report
- Configurable episode count (default: 100 for statistical power)

Usage:

    python -m training.rl.eval_toy_waypoint_env --compare \
      --sft-checkpoint out/sft_waypoint_bc_torch_v0/model.pt \
      --rl-checkpoint out/rl_delta_ppo_v0/final.pt --episodes 100

Output:

    ADE: 5.27m ± 0.12m (SFT) → 5.19m (RL) [-2%]*
    FDE: 5.83m (SFT) → 5.66m (RL) [-3%]*
    * p < 0.05 (statistically significant)
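The evaluation statistics described above (per-episode ADE/FDE, a 95% CI on the mean, and a two-sample test between SFT and RL) can be sketched with the standard library alone. This is an illustration, not eval_toy_waypoint_env.py: the helper names are hypothetical, the CI uses the normal critical value 1.96, and the test shown is Welch's unequal-variance variant whose p-value substitutes a standard normal for the t distribution. That approximation is reasonable at roughly 100 episodes per arm but optimistic for small samples.

```python
import math
from statistics import NormalDist, mean, stdev
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def ade_fde(pred: Sequence[Point], gt: Sequence[Point]) -> Tuple[float, float]:
    """Average and final displacement error (metres) for one trajectory."""
    dists = [math.dist(p, g) for p, g in zip(pred, gt)]
    return sum(dists) / len(dists), dists[-1]


def mean_ci95(samples: List[float]) -> Tuple[float, float]:
    """Mean and half-width of a normal-approximation 95% CI."""
    m = mean(samples)
    half = 1.96 * stdev(samples) / math.sqrt(len(samples))
    return m, half


def welch_test(a: List[float], b: List[float]) -> Tuple[float, float]:
    """Welch's t statistic and an approximate two-sided p-value.

    The p-value uses a standard normal in place of the t distribution,
    which is only accurate when both samples are reasonably large.
    """
    se = math.sqrt(stdev(a) ** 2 / len(a) + stdev(b) ** 2 / len(b))
    t = (mean(a) - mean(b)) / se
    p = 2.0 * (1.0 - NormalDist().cdf(abs(t)))
    return t, p
```

In a comparison run, each policy would produce one ADE and one FDE per episode; `mean_ci95` then summarizes each list and `welch_test(sft_ade, rl_ade)` decides whether the difference is flagged with `*`. If SciPy is available, `scipy.stats.ttest_ind(a, b, equal_var=False)` gives the same Welch statistic with an exact t-distribution p-value.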

- Survey digest for VADv2 (ICLR 2026), a modern VLM-augmented end-to-end autonomous driving stack newer than UniAD.
- Covers system decomposition, inputs/outputs, training objectives, evaluation protocol, a mapping of Tesla/Ashok claims, and AIResearch recommendations.
- Includes citations, code links, and a 3-bullet summary.

Ref: cron:Survey PR #3 (4:00pm PT)

- Added WaymoEpisodeLoader class supporting stub, synthetic, and Waymo formats
- Data classes: Pose, Waypoint, CameraFrame, WaymoRoute, WaymoEpisode
- to_ssl_dataset(): convert episodes to SSL pretraining format
- get_statistics(): dataset statistics (locations, weathers)
- CLI for listing and loading episodes

Part of the driving-first pipeline: Waymo episodes → SSL pretrain → waypoint BC → RL → CARLA
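A minimal sketch of the kind of episode record and location/weather statistics the loader describes. Everything here is hypothetical: the field names (`episode_id`, `location`, `weather`, `poses`, `waypoints`, `yaw`) and the helper `summarize_episodes` are illustrative stand-ins, not the PR's actual classes or its get_statistics() method, and CameraFrame and WaymoRoute are omitted.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Pose:
    x: float
    y: float
    yaw: float  # heading in radians (assumed convention)


@dataclass
class Waypoint:
    x: float
    y: float


@dataclass
class WaymoEpisode:
    # Hypothetical fields; the real WaymoEpisode may differ.
    episode_id: str
    location: str
    weather: str
    poses: List[Pose] = field(default_factory=list)
    waypoints: List[Waypoint] = field(default_factory=list)


def summarize_episodes(episodes: List[WaymoEpisode]) -> Dict[str, Dict[str, int]]:
    """Count episodes per location and per weather condition."""
    return {
        "locations": dict(Counter(ep.location for ep in episodes)),
        "weathers": dict(Counter(ep.weather for ep in episodes)),
    }
```

Using `field(default_factory=list)` gives each episode its own pose and waypoint lists instead of one shared mutable default.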

Capri2014 and others added 6 commits February 18, 2026 13:34

docs(clawbot): Update status for 2026-02-18 (f772a9c)


docs: Add PR body for RL evaluation with statistical significance (42091ee)

Capri2014 wants to merge 6 commits into main from feature/vadv2-digest-survey

Capri2014 commented Feb 19, 2026
