76 changes: 45 additions & 31 deletions clawbot/STATUS.md
@@ -1,55 +1,69 @@
# Status (ClawBot)

_Last updated: 2026-02-16 (Pipeline PR #4)_
_Last updated: 2026-02-18 (Pipeline PR #1)_

## Current focus
Driving-first pipeline: **Waymo episodes → PyTorch SSL pretrain → waypoint BC → RL refinement → CARLA ScenarioRunner eval**.

## Recent changes
## Daily Cadence

- ✅ **Pipeline PR #1** (2026-02-18): RL Checkpoint Selection with Policy Entropy
- ⏳ **Pipeline PR #9** (2026-02-17): Evaluation + Metrics Hardening for RL Refinement - awaiting review
- ⏳ **Pipeline PR #8** (2026-02-17): CARLA Closed-Loop Waypoint BC Evaluation - awaiting review
- ⏳ **Pipeline PR #5** (2026-02-16): RL Refinement Stub for Residual Delta-Waypoint Learning - awaiting review

### Pipeline PR #4: CARLA ScenarioRunner Integration (Today, 1:30pm PT)
- **New: `training/eval/carla_scenariorunner_eval.py`**
- `CARLAScenarioRunner` class: Vehicle control interface for CARLA simulation
- `EvalResult` dataclass: Metrics (route completion, collisions, offroad, deviation)
- `evaluate_waypoint_policy()`: Closed-loop policy evaluation function
- `CARLAEvalConfig`: Configuration for host, port, fps, weather, map
- Connects waypoint BC models to CARLA for end-to-end evaluation
## Recent changes

- **New: `training/eval/run_carla_smoke.py`**
- Module validation smoke tests
### Pipeline PR #1: RL Checkpoint Selection with Policy Entropy (Today, 5:30am PT)
- **Updated: `training/rl/train_rl_delta_waypoint.py`**
- Added `policy_entropy` field to evaluation metrics
- Best checkpoint selection: saves `best_entropy.pt` when entropy improves
- Entropy history tracking: `entropy_history.json` with episode-wise records
- Enhanced training summary with `best_checkpoint` section
- Rationale: higher entropy indicates more exploration, which this selection rule treats as a proxy for better RL generalization

### Pipeline PR #3: Waypoint BC with Evaluation Metrics (Today, 10:30am PT) - merged
- `training/sft/train_waypoint_bc_with_metrics.py`: Full trainer with ADE/FDE
- `run_waypoint_bc_smoke.py`: Smoke tests
- Architecture: `final_waypoints = sft_waypoints + delta_head(z)`
- Evaluation-first: metrics computed every epoch for checkpoint selection
**Key additions:**
- `_save_best_checkpoint()`: Saves checkpoint when entropy reaches new best
- `_save_entropy_history()`: Records entropy per eval interval
- Updated `compute_metrics()` to include entropy
- Updated `_save_train_summary()` with best checkpoint metadata

### Pipeline PR #2: Training-Time Metrics (Yesterday)
- `training/sft/training_metrics.py`: ADE/FDE computation, checkpoint tracking
### Pipeline PR #9: Evaluation + Metrics Hardening for RL Refinement (Yesterday)
- `training/rl/eval_toy_waypoint_env.py`: Deterministic evaluation with ADE/FDE
- ADE/FDE computation per episode for measuring RL refinement quality
- Summary metrics with mean/std, success_rate
- 3-line comparison report (ADE, FDE, Success Rate)
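
For reference, a minimal sketch of the per-episode ADE/FDE computation, using the standard definitions (ADE is the mean Euclidean error over all waypoints, FDE is the error at the final waypoint). The function name `ade_fde` and the `(x, y)` tuple layout are assumptions for illustration and may not match `eval_toy_waypoint_env.py`.

```python
import math


def ade_fde(pred, target):
    """Return (ADE, FDE) for two equal-length lists of (x, y) waypoints."""
    if len(pred) != len(target) or not pred:
        raise ValueError("pred and target must be non-empty and the same length")
    # Euclidean distance between each predicted and target waypoint.
    dists = [math.hypot(px - tx, py - ty) for (px, py), (tx, ty) in zip(pred, target)]
    # ADE averages over the horizon; FDE looks only at the last waypoint.
    return sum(dists) / len(dists), dists[-1]


# Illustrative waypoints, not real evaluation data.
pred = [(1.0, 0.0), (2.0, 0.0), (3.0, 1.0)]
target = [(1.0, 0.0), (2.0, 1.0), (3.0, 3.0)]
ade, fde = ade_fde(pred, target)
print(f"ADE={ade:.2f} FDE={fde:.2f}")
```

With these illustrative inputs the per-waypoint errors are 0, 1, and 2, so ADE is 1.0 and FDE is 2.0.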

### Pipeline PR #1: Unified Policy Evaluation Framework (2026-02-16) - merged
- `training/rl/unified_eval.py`: SFT vs PPO vs GRPO comparison
### Pipeline PR #8: CARLA Closed-Loop Waypoint BC Evaluation (Yesterday)
- `training/eval/run_carla_closed_loop_eval.py`: Comprehensive closed-loop evaluation
- 5 scenarios: straight_clear, straight_cloudy, straight_night, straight_rain, turn_clear
- WaypointBCModelWrapper for checkpoint loading

## Next (top 3)
1. Integrate CARLA evaluation with unified_eval.py
2. Add checkpoint selection by best FDE
3. Run full training on Waymo episode data
1. Run training with new entropy tracking
2. Compare entropy curves across different seeds
3. Integrate entropy-based checkpointing with CARLA evaluation

## Blockers / questions for owner
- Confirm CARLA server availability for integration testing
- PR reviews pending for #9, #8, #5

## Architecture Reference

**Driving-First Pipeline:**
```
Waymo episodes → SSL pretrain → waypoint BC → CARLA eval
Waymo episodes → SSL pretrain → waypoint BC → RL refinement → CARLA eval
```

**Residual Delta Learning:**
```
final_waypoints = sft_waypoints + delta_head(z)
```
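
A minimal sketch of the residual formula above, in plain Python. The linear `linear_delta_head` and its weight layout are hypothetical stand-ins for whatever `delta_head` actually is in the trainer; only the additive structure comes from the formula.

```python
def linear_delta_head(z, weights):
    """Hypothetical delta head: a linear map from latent z to per-waypoint (dx, dy).

    `weights` has 2 * horizon rows, each of length len(z); consecutive row
    pairs produce the dx and dy of one waypoint.
    """
    flat = [sum(w * zi for w, zi in zip(row, z)) for row in weights]
    return list(zip(flat[0::2], flat[1::2]))


def apply_residual(sft_waypoints, deltas):
    """final_waypoints = sft_waypoints + delta, applied waypoint by waypoint."""
    return [(x + dx, y + dy) for (x, y), (dx, dy) in zip(sft_waypoints, deltas)]


# Illustrative values only.
z = [1.0, 2.0]
weights = [[0.1, 0.0], [0.0, 0.1], [0.0, 0.0], [0.5, 0.0]]
sft_waypoints = [(1.0, 0.0), (2.0, 0.0)]

deltas = linear_delta_head(z, weights)
final_waypoints = apply_residual(sft_waypoints, deltas)
print(final_waypoints)
```

If the delta head outputs zeros, the final waypoints equal the SFT waypoints, so the RL stage starts from the BC policy's behavior under this structure.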

**Evaluation-First Design:**
- Add ADE/FDE metrics **during training**, not after
- Enables checkpoint selection based on quality metrics
- Critical for autonomous driving where precision matters
**Checkpoint Selection:**
- Reward-based: best_reward.pt
- Entropy-based: best_entropy.pt (NEW)
- Metrics: ADE/FDE, route_completion, collisions

## Links
- Daily notes: `clawbot/daily/2026-02-16.md`
- PR: https://github.com/Capri2014/AIResearch/pull/new/feature/daily-2026-02-16-d
- Daily notes: `clawbot/daily/2026-02-18.md`
- Branch: `feature/daily-2026-02-18-a`
101 changes: 101 additions & 0 deletions clawbot/daily/2026-02-18.md
@@ -0,0 +1,101 @@
# 2026-02-18 Daily Notes

## Pipeline PR #1 (Daily Cadence)

**Focus:** RL Checkpoint Selection with Policy Entropy

### Changes

- **Updated:** `training/rl/train_rl_delta_waypoint.py` - Checkpoint selection with policy entropy metrics

### Key Additions

1. **Policy Entropy Tracking**
- Added `policy_entropy` field to evaluation metrics
- Tracks entropy at each evaluation interval to monitor policy exploration
- Stored in `entropy_history.json`, indexed by episode number

2. **Best Checkpoint Selection**
- Added `_save_best_checkpoint()` method for entropy-based checkpointing
- Rationale: higher entropy indicates more exploration, which this selection rule treats as better for RL
- Saves `best_entropy.pt` when new best entropy is found
- Includes metadata: episode, entropy, config

3. **Entropy History Tracking**
- Added `entropy_history` list and `_save_entropy_history()` method
- Records entropy at each eval interval
- Best entropy and episode saved for easy retrieval

4. **Training Summary Enhancement**
- Added `best_checkpoint` section to `train_metrics.json`
- Includes path, episode, and entropy value
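
The selection logic described above can be sketched as follows. This is a minimal illustration, not the code in `train_rl_delta_waypoint.py`: the class name `BestEntropyTracker` and its methods are hypothetical, and the actual checkpoint write is left to the caller.

```python
class BestEntropyTracker:
    """Hypothetical helper: records entropy per eval and flags new bests."""

    def __init__(self):
        self.episodes = []
        self.entropy = []
        self.best_entropy = float("-inf")
        self.best_episode = None

    def update(self, episode, policy_entropy):
        """Record one evaluation; return True if this is a new best entropy."""
        self.episodes.append(episode)
        self.entropy.append(policy_entropy)
        improved = policy_entropy > self.best_entropy
        if improved:
            self.best_entropy = policy_entropy
            self.best_episode = episode
        return improved

    def to_history(self):
        """Build a dict with the same keys as the entropy_history.json schema."""
        return {
            "episodes": list(self.episodes),
            "entropy": list(self.entropy),
            "best_entropy": self.best_entropy,
            "best_episode": self.best_episode,
        }


tracker = BestEntropyTracker()
saved_at = []  # episodes where the caller would write best_entropy.pt
for episode, entropy in [(1, 0.5), (10, 0.7), (20, 0.6)]:  # illustrative values
    if tracker.update(episode, entropy):
        saved_at.append(episode)

print(saved_at, tracker.to_history())
```

In this sketch the checkpoint would be rewritten at episodes 1 and 10 and left untouched at episode 20, where entropy falls below the running best.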

### Metrics Schema

```python
# New eval_info structure (values shown as types)
eval_info = {
    "mean_delta_norm": float,  # Mean delta norm
    "max_delta_norm": float,   # Max delta norm
    "std_delta_norm": float,   # Std delta norm
    "policy_entropy": float,   # NEW: Policy entropy
}

# New entropy_history.json structure
entropy_history = {
    "episodes": [1, 10, 20, ...],
    "entropy": [0.5, 0.6, 0.7, ...],
    "best_entropy": 0.9,
    "best_episode": 150,
}

# New best_checkpoint section in train_metrics.json
train_metrics = {
    "best_checkpoint": {
        "path": "out/.../best_entropy.pt",
        "episode": 150,
        "entropy": 0.9,
    },
}
```

### Usage

```bash
# Training automatically tracks entropy and saves best checkpoint
python -m training.rl.train_rl_delta_waypoint \
  --out-dir out/rl_delta_waypoint_v0/run_001 \
  --episodes 500

# After training, best checkpoint is saved at:
# out/rl_delta_waypoint_v0/run_001/best_entropy.pt

# Entropy history for analysis:
# out/rl_delta_waypoint_v0/run_001/entropy_history.json
```
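
For the analysis step, a minimal sketch of reading `entropy_history.json` and summarizing it. The helper name `summarize_entropy_history` and the summary keys are hypothetical; the file keys follow the schema above. The demo writes a small temporary file with illustrative values so the snippet runs without a real training run.

```python
import json
import tempfile
from pathlib import Path


def summarize_entropy_history(path):
    """Hypothetical helper: report best and final entropy from a history file."""
    history = json.loads(Path(path).read_text())
    episodes, entropy = history["episodes"], history["entropy"]
    if len(episodes) != len(entropy) or not episodes:
        raise ValueError("episodes and entropy must be non-empty and aligned")
    # Index of the highest recorded entropy (first occurrence on ties).
    best_idx = max(range(len(entropy)), key=entropy.__getitem__)
    return {
        "best_episode": episodes[best_idx],
        "best_entropy": entropy[best_idx],
        "final_entropy": entropy[-1],
        "drop_from_best": entropy[best_idx] - entropy[-1],
    }


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "entropy_history.json"
    # Illustrative values, not results from a real run.
    path.write_text(json.dumps({
        "episodes": [1, 10, 20],
        "entropy": [0.5, 0.75, 0.25],
        "best_entropy": 0.75,
        "best_episode": 10,
    }))
    summary = summarize_entropy_history(path)

print(summary)
```

A large `drop_from_best` suggests the policy became markedly more deterministic after its best-entropy checkpoint.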

### Why Entropy Matters

- **Higher entropy** = more diverse action distribution = policy explores more
- **Lower entropy** = policy becomes deterministic (may overfit)
- Entropy-based checkpoint selection is intended to favor policies that have not collapsed to a near-deterministic action distribution
- Complements reward-based selection with exploration quality signal
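
To make the points above concrete, here is a minimal sketch of how policy entropy is computed for two common policy families. The notes do not say which distribution the delta-waypoint policy uses, so both the categorical and the diagonal-Gaussian forms are assumptions, and the function names are hypothetical.

```python
import math


def categorical_entropy(probs):
    """Shannon entropy (in nats) of a discrete action distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def diag_gaussian_entropy(log_stds):
    """Differential entropy (in nats) of a diagonal Gaussian policy.

    For each dimension: 0.5 * log(2 * pi * e) + log(std).
    """
    d = len(log_stds)
    return sum(log_stds) + 0.5 * d * math.log(2.0 * math.pi * math.e)


uniform = categorical_entropy([0.25, 0.25, 0.25, 0.25])  # most diverse
peaked = categorical_entropy([0.97, 0.01, 0.01, 0.01])   # nearly deterministic
wide = diag_gaussian_entropy([0.0, 0.0])                 # std = 1 per dim
narrow = diag_gaussian_entropy([-2.0, -2.0])             # std = e^-2 per dim

print(uniform > peaked, wide > narrow)
```

In both families, entropy falls as the policy concentrates on fewer actions or shrinks its standard deviation, which is the behavior the entropy metric is meant to surface.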

### Next Steps

- [ ] Run training with new entropy tracking
- [ ] Compare entropy curves across different seeds
- [ ] Add entropy-based early stopping (stop if entropy drops too low)
- [ ] Integrate with CARLA evaluation for closed-loop validation

---

## Pipeline Context

Driving-first pipeline:
```
Waymo episodes → SSL pretrain → waypoint BC (SFT) → RL refinement → eval (ADE/FDE/entropy)
```

Today's contribution:
- RL training now has **best checkpoint selection** based on policy entropy
- Enables automated model selection for deployment
- Provides exploration quality signal alongside reward