Feature/daily 2026 02 18 rl trainer #138

Open
Capri2014 wants to merge 4 commits into main from feature/daily-2026-02-18-rl-trainer

@Capri2014
Owner

Pull Request Template

Summary

Brief description of what changed (1-2 sentences).

Changes

  • Code changes
  • Docs changes
  • New files added

Testing

  • Tests pass (if applicable)
  • Manual verification steps
  • Verified no merge conflicts with main

Checklist

  • Based on latest main branch
  • No merge conflicts
  • Commit messages follow convention
  • Documentation updated (if applicable)
  • Related issue linked (if applicable)

Related PRs/Issues

Link to related PRs or issues.


Note: This repository uses squash merging. All commits will be collapsed into one.

- Add train_ppo_delta_waypoint.py: Full PPO training for residual delta-head
  - DeltaHead and ValueHead architectures
  - GAE (Generalized Advantage Estimation) implementation
  - PPO update with clipping, value loss, entropy bonus
  - Support for toy and CARLA environments
  - Configurable hyperparameters via argparse
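
The GAE and clipped-PPO items above can be sketched in a few lines. This is a minimal, framework-free illustration and not the code in train_ppo_delta_waypoint.py: the function names `compute_gae` and `ppo_loss`, the argument layout, and the default coefficients (`gamma`, `lam`, `clip_eps`, `vf_coef`, `ent_coef`) are all assumptions chosen for the example.

```python
import math


def compute_gae(rewards, values, dones, last_value, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one rollout.

    rewards, values, dones are equal-length lists; last_value is the
    value estimate of the state that follows the final step.
    Returns (advantages, returns).
    """
    advantages = [0.0] * len(rewards)
    gae = 0.0
    next_value = last_value
    for t in reversed(range(len(rewards))):
        not_done = 0.0 if dones[t] else 1.0
        # One-step TD error; no bootstrapping across an episode boundary.
        delta = rewards[t] + gamma * next_value * not_done - values[t]
        # Exponentially weighted sum of TD errors.
        gae = delta + gamma * lam * not_done * gae
        advantages[t] = gae
        next_value = values[t]
    returns = [a + v for a, v in zip(advantages, values)]
    return advantages, returns


def ppo_loss(new_logp, old_logp, advantages, values, returns, entropies,
             clip_eps=0.2, vf_coef=0.5, ent_coef=0.01):
    """Clipped surrogate loss + value loss - entropy bonus (to minimize)."""
    n = len(advantages)
    policy_loss = 0.0
    for lp_new, lp_old, adv in zip(new_logp, old_logp, advantages):
        ratio = math.exp(lp_new - lp_old)
        clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
        # Pessimistic (minimum) of the unclipped and clipped objectives.
        policy_loss -= min(ratio * adv, clipped * adv)
    policy_loss /= n
    value_loss = sum((v - r) ** 2 for v, r in zip(values, returns)) / n
    entropy = sum(entropies) / n
    return policy_loss + vf_coef * value_loss - ent_coef * entropy
```

In a real trainer the same arithmetic would run on batched tensors so that gradients flow through `new_logp`, `values`, and `entropies`.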

- Add test_ppo_delta_smoke.py: Smoke tests for validation
  - Unit tests for DeltaHead, ValueHead, GAE
  - Toy environment testing
  - Policy forward pass testing
  - Minimal training loop integration test

- Update training/rl/README.md: Documentation
  - Architecture overview
  - Usage examples
  - Key arguments reference
  - Output structure
  - Comparison workflow for SFT vs RL

Architecture: final_waypoints = sft_waypoints + delta_head(z)
- Frozen SFT encoder (safer, stable)
- Trainable delta head (sample-efficient)
- Residual correction for online improvement
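
The residual formula above can be illustrated without any framework. The sketch below is hypothetical: `LinearDeltaHead` is a one-layer stand-in for the PR's DeltaHead, the zero initialization is an assumption (it makes the initial policy identical to the SFT policy), and waypoints are assumed to be (x, y) pairs.

```python
class LinearDeltaHead:
    """Hypothetical stand-in for DeltaHead: one linear layer on the latent z."""

    def __init__(self, latent_dim, num_waypoints):
        # Zero init (an assumption): the initial correction is exactly zero,
        # so the combined output starts out equal to the SFT waypoints.
        self.out_dim = num_waypoints * 2
        self.weight = [[0.0] * latent_dim for _ in range(self.out_dim)]
        self.bias = [0.0] * self.out_dim

    def __call__(self, z):
        flat = [sum(w * x for w, x in zip(row, z)) + b
                for row, b in zip(self.weight, self.bias)]
        # Group the flat output into one (dx, dy) pair per waypoint.
        return [(flat[i], flat[i + 1]) for i in range(0, self.out_dim, 2)]


def final_waypoints(sft_waypoints, z, delta_head):
    """final_waypoints = sft_waypoints + delta_head(z)."""
    deltas = delta_head(z)
    return [(x + dx, y + dy)
            for (x, y), (dx, dy) in zip(sft_waypoints, deltas)]


# The frozen SFT model would supply both z and sft_waypoints; only the
# delta head's weight and bias would receive RL updates.
head = LinearDeltaHead(latent_dim=3, num_waypoints=2)
sft = [(1.0, 2.0), (3.0, 4.0)]
unchanged = final_waypoints(sft, [0.5, -0.5, 1.0], head)
```

With zero-initialized parameters, `unchanged` equals `sft`, so training can only move the policy away from the SFT baseline through the learned correction.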

- Add Pipeline PR #3 summary
- Update pipeline status table
- Mark all stages as implemented

- Add eval_toy_waypoint_env.py for policy evaluation
- Compute ADE/FDE with confidence intervals (95% CI)
- Two-sample t-test for statistical significance (p-values)
- Side-by-side SFT vs RL comparison report
- Configurable episode count (default: 100 for statistical power)

Usage:
  python -m training.rl.eval_toy_waypoint_env --compare \
    --sft-checkpoint out/sft_waypoint_bc_torch_v0/model.pt \
    --rl-checkpoint out/rl_delta_ppo_v0/final.pt --episodes 100

Output:
  ADE: 5.27m ± 0.12m (SFT) → 5.19m (RL) [-2%]*
  FDE: 5.83m (SFT) → 5.66m (RL) [-3%]*
  * p < 0.05 (statistically significant)
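
For readers unfamiliar with the metrics in this report, here is a minimal standard-library sketch of how ADE/FDE, a 95% confidence interval, and a two-sample test could be computed. It is not the code in eval_toy_waypoint_env.py. The helper names are hypothetical, the interval uses the normal approximation (1.96 standard errors), and the p-value uses Welch's statistic with a normal approximation in place of the exact t distribution, which is only reasonable for sample sizes near the default of 100 episodes.

```python
import math
from statistics import NormalDist, mean, stdev


def ade_fde(pred, target):
    """Average and final displacement error for one trajectory of (x, y) points."""
    dists = [math.dist(p, t) for p, t in zip(pred, target)]
    return mean(dists), dists[-1]


def mean_ci95(samples):
    """Sample mean and half-width of a normal-approximation 95% CI."""
    half_width = 1.96 * stdev(samples) / math.sqrt(len(samples))
    return mean(samples), half_width


def welch_test(a, b):
    """Welch's statistic with a two-sided, normal-approximation p-value."""
    var_a = stdev(a) ** 2 / len(a)
    var_b = stdev(b) ** 2 / len(b)
    t = (mean(a) - mean(b)) / math.sqrt(var_a + var_b)
    p = 2.0 * (1.0 - NormalDist().cdf(abs(t)))
    return t, p
```

Per-episode ADE values for the SFT and RL policies would be passed as `a` and `b` to `welch_test`, and each list separately to `mean_ci95`.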