Capri2014 · Feb 18, 2026
diff --git a/clawbot/STATUS.md b/clawbot/STATUS.md
@@ -1,55 +1,69 @@
 # Status (ClawBot)
 
-_Last updated: 2026-02-16 (Pipeline PR #4)_
+_Last updated: 2026-02-18 (Pipeline PR #1)_
 
 ## Current focus
 Driving-first pipeline: **Waymo episodes → PyTorch SSL pretrain → waypoint BC → RL refinement → CARLA ScenarioRunner eval**.
 
-## Recent changes
+## Daily Cadence
+
+- ✅ **Pipeline PR #1** (2026-02-18): RL Checkpoint Selection with Policy Entropy
+- ⏳ **Pipeline PR #9** (2026-02-17): Evaluation + Metrics Hardening for RL Refinement - awaiting review
+- ⏳ **Pipeline PR #8** (2026-02-17): CARLA Closed-Loop Waypoint BC Evaluation - awaiting review
+- ⏳ **Pipeline PR #5** (2026-02-16): RL Refinement Stub for Residual Delta-Waypoint Learning - awaiting review
 
-### Pipeline PR #4: CARLA ScenarioRunner Integration (Today, 1:30pm PT)
-- **New: `training/eval/carla_scenariorunner_eval.py`**
-  - `CARLAScenarioRunner` class: Vehicle control interface for CARLA simulation
-  - `EvalResult` dataclass: Metrics (route completion, collisions, offroad, deviation)
-  - `evaluate_waypoint_policy()`: Closed-loop policy evaluation function
-  - `CARLAEvalConfig`: Configuration for host, port, fps, weather, map
-  - Connects waypoint BC models to CARLA for end-to-end evaluation
+## Recent changes
 
-- **New: `training/eval/run_carla_smoke.py`**
-  - Module validation smoke tests
+### Pipeline PR #1: RL Checkpoint Selection with Policy Entropy (Today, 5:30am PT)
+- **Updated: `training/rl/train_rl_delta_waypoint.py`**
+  - Added `policy_entropy` field to evaluation metrics
+  - Best checkpoint selection: saves `best_entropy.pt` whenever policy entropy reaches a new maximum
+  - Entropy history tracking: `entropy_history.json` with episode-wise records
+  - Enhanced training summary with `best_checkpoint` section
+  - Rationale: higher entropy indicates a more exploratory policy, which is assumed here to help RL generalization
 
-### Pipeline PR #3: Waypoint BC with Evaluation Metrics (Today, 10:30am PT) - merged
-- `training/sft/train_waypoint_bc_with_metrics.py`: Full trainer with ADE/FDE
-- `run_waypoint_bc_smoke.py`: Smoke tests
-- Architecture: `final_waypoints = sft_waypoints + delta_head(z)`
-- Evaluation-first: metrics computed every epoch for checkpoint selection
+**Key additions:**
+- `_save_best_checkpoint()`: Saves checkpoint when entropy reaches new best
+- `_save_entropy_history()`: Records entropy per eval interval
+- Updated `compute_metrics()` to include entropy
+- Updated `_save_train_summary()` with best checkpoint metadata
 
-### Pipeline PR #2: Training-Time Metrics (Yesterday)
-- `training/sft/training_metrics.py`: ADE/FDE computation, checkpoint tracking
+### Pipeline PR #9: Evaluation + Metrics Hardening for RL Refinement (Yesterday)
+- `training/rl/eval_toy_waypoint_env.py`: Deterministic evaluation with ADE/FDE
+- ADE/FDE computation per episode for measuring RL refinement quality
+- Summary metrics with mean/std, success_rate
+- 3-line comparison report (ADE, FDE, Success Rate)
 
-### Pipeline PR #1: Unified Policy Evaluation Framework (2026-02-16) - merged
-- `training/rl/unified_eval.py`: SFT vs PPO vs GRPO comparison
+### Pipeline PR #8: CARLA Closed-Loop Waypoint BC Evaluation (Yesterday)
+- `training/eval/run_carla_closed_loop_eval.py`: Comprehensive closed-loop evaluation
+- 5 scenarios: straight_clear, straight_cloudy, straight_night, straight_rain, turn_clear
+- WaypointBCModelWrapper for checkpoint loading
 
 ## Next (top 3)
-1. Integrate CARLA evaluation with unified_eval.py
-2. Add checkpoint selection by best FDE
-3. Run full training on Waymo episode data
+1. Run training with new entropy tracking
+2. Compare entropy curves across different seeds
+3. Integrate entropy-based checkpointing with CARLA evaluation
 
 ## Blockers / questions for owner
-- Confirm CARLA server availability for integration testing
+- PR reviews pending for #9, #8, #5
 
 ## Architecture Reference
 
 **Driving-First Pipeline:**
 ```
-Waymo episodes → SSL pretrain → waypoint BC → CARLA eval
+Waymo episodes → SSL pretrain → waypoint BC → RL refinement → CARLA eval
+```
+
+**Residual Delta Learning:**
+```
+final_waypoints = sft_waypoints + delta_head(z)
 ```
 
-**Evaluation-First Design:**
-- Add ADE/FDE metrics **during training**, not after
-- Enables checkpoint selection based on quality metrics
-- Critical for autonomous driving where precision matters
+**Checkpoint Selection:**
+- Reward-based: best_reward.pt
+- Entropy-based: best_entropy.pt (NEW)
+- Metrics: ADE/FDE, route_completion, collisions
 
 ## Links
-- Daily notes: `clawbot/daily/2026-02-16.md`
-- PR: https://github.com/Capri2014/AIResearch/pull/new/feature/daily-2026-02-16-d
+- Daily notes: `clawbot/daily/2026-02-18.md`
+- Branch: `feature/daily-2026-02-18-a`
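
The residual formulation and the ADE/FDE metrics referenced in `STATUS.md` can be illustrated with a minimal, dependency-free sketch. It uses the standard definitions (ADE is the mean per-waypoint Euclidean error, FDE is the error at the final waypoint). The function names `ade_fde` and `residual_waypoints` are hypothetical and are not taken from the repository, and the `delta` list merely stands in for the output of `delta_head(z)`.

```python
import math


def ade_fde(pred, target):
    """Return (ADE, FDE) for two equal-length lists of (x, y) waypoints."""
    if not pred or len(pred) != len(target):
        raise ValueError("trajectories must be non-empty and of equal length")
    # Per-waypoint Euclidean distance between prediction and ground truth.
    dists = [math.hypot(px - tx, py - ty) for (px, py), (tx, ty) in zip(pred, target)]
    return sum(dists) / len(dists), dists[-1]


def residual_waypoints(sft_waypoints, delta):
    """final_waypoints = sft_waypoints + delta (delta stands in for delta_head(z))."""
    return [(x + dx, y + dy) for (x, y), (dx, dy) in zip(sft_waypoints, delta)]


# Illustrative data only: the SFT prediction misses the last waypoint by 4 m.
target = [(1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
sft = [(1.0, 0.0), (2.0, 0.0), (3.0, 4.0)]
delta = [(0.0, 0.0), (0.0, 0.0), (0.0, -4.0)]  # hypothetical learned correction

ade_sft, fde_sft = ade_fde(sft, target)
ade_rl, fde_rl = ade_fde(residual_waypoints(sft, delta), target)
print(f"SFT: ADE={ade_sft:.3f} FDE={fde_sft:.3f}")
print(f"RL : ADE={ade_rl:.3f} FDE={fde_rl:.3f}")
```

Comparing the two lines of output is the kind of SFT-versus-refined comparison the PR #9 report describes. The actual evaluation script may compute these metrics differently.
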
diff --git a/clawbot/daily/2026-02-18.md b/clawbot/daily/2026-02-18.md
@@ -0,0 +1,101 @@
+# 2026-02-18 Daily Notes
+
+## Pipeline PR #1 (Daily Cadence)
+
+**Focus:** RL Checkpoint Selection with Policy Entropy
+
+### Changes
+
+- **Updated:** `training/rl/train_rl_delta_waypoint.py` - Checkpoint selection with policy entropy metrics
+
+### Key Additions
+
+1. **Policy Entropy Tracking**
+   - Added `policy_entropy` field to evaluation metrics
+   - Tracks entropy at each eval interval for monitoring policy exploration
+   - Stored in `entropy_history.json` with episode-wise records
+
+2. **Best Checkpoint Selection**
+   - Added `_save_best_checkpoint()` method for entropy-based checkpointing
+   - Rationale: higher entropy indicates more exploration, which is assumed to benefit RL refinement
+   - Saves `best_entropy.pt` when new best entropy is found
+   - Includes metadata: episode, entropy, config
+
+3. **Entropy History Tracking**
+   - Added `entropy_history` list and `_save_entropy_history()` method
+   - Records entropy at each eval interval
+   - Best entropy and episode saved for easy retrieval
+
+4. **Training Summary Enhancement**
+   - Added `best_checkpoint` section to `train_metrics.json`
+   - Includes path, episode, and entropy value
+
+### Metrics Schema
+
+```python
+# New eval_info structure
+eval_info = {
+    "mean_delta_norm": float,      # Mean delta norm
+    "max_delta_norm": float,       # Max delta norm
+    "std_delta_norm": float,       # Std delta norm
+    "policy_entropy": float,       # NEW: Policy entropy
+}
+
+# New entropy_history.json structure
+{
+    "episodes": [1, 10, 20, ...],
+    "entropy": [0.5, 0.6, 0.7, ...],
+    "best_entropy": 0.9,
+    "best_episode": 150,
+}
+
+# New best_checkpoint section in train_metrics.json
+{"best_checkpoint": {
+    "path": "out/.../best_entropy.pt",
+    "episode": 150,
+    "entropy": 0.9,
+}}
+```
+
+### Usage
+
+```bash
+# Training automatically tracks entropy and saves best checkpoint
+python -m training.rl.train_rl_delta_waypoint \
+  --out-dir out/rl_delta_waypoint_v0/run_001 \
+  --episodes 500
+
+# After training, best checkpoint is saved at:
+# out/rl_delta_waypoint_v0/run_001/best_entropy.pt
+
+# Entropy history for analysis:
+# out/rl_delta_waypoint_v0/run_001/entropy_history.json
+```
+
+### Why Entropy Matters
+
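+- **Higher entropy** = more diverse action distribution = policy explores more
+- **Lower entropy** = near-deterministic policy (may indicate premature convergence or overfitting)
+- Entropy-based checkpoint selection is intended to favor policies that have not collapsed to a single action
+- Complements reward-based selection with an exploration-quality signal; it is not a substitute, since a high-entropy policy is not necessarily a high-reward one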
+
+### Next Steps
+
+- [ ] Run training with new entropy tracking
+- [ ] Compare entropy curves across different seeds
+- [ ] Add entropy-based early stopping (stop if entropy drops too low)
+- [ ] Integrate with CARLA evaluation for closed-loop validation
+
+---
+
+## Pipeline Context
+
+Driving-first pipeline:
+```
+Waymo episodes → SSL pretrain → waypoint BC (SFT) → RL refinement → eval (ADE/FDE/entropy)
+```
+
+Today's contribution:
+- RL training now has **best checkpoint selection** based on policy entropy
+- Enables automated model selection for deployment
+- Provides exploration quality signal alongside reward
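
The entropy-based selection described in the daily notes can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in `train_rl_delta_waypoint.py`: the class `EntropyTracker` and its methods are hypothetical names, and the actual checkpoint write (what `_save_best_checkpoint()` does in the PR) is reduced to a boolean "save now" signal. The output dictionary follows the `entropy_history.json` schema shown in the notes.

```python
import json
from pathlib import Path


class EntropyTracker:
    """Hypothetical helper: tracks the best policy entropy across eval intervals."""

    def __init__(self):
        self.episodes = []
        self.entropy = []
        self.best_entropy = None
        self.best_episode = None

    def update(self, episode, policy_entropy):
        """Record one eval result; return True when it is a new best."""
        self.episodes.append(episode)
        self.entropy.append(policy_entropy)
        if self.best_entropy is None or policy_entropy > self.best_entropy:
            self.best_entropy = policy_entropy
            self.best_episode = episode
            return True  # caller would write best_entropy.pt here
        return False

    def to_dict(self):
        """Mirror the entropy_history.json layout from the daily notes."""
        return {
            "episodes": self.episodes,
            "entropy": self.entropy,
            "best_entropy": self.best_entropy,
            "best_episode": self.best_episode,
        }

    def save(self, out_dir):
        """Write entropy_history.json into out_dir and return its path."""
        path = Path(out_dir) / "entropy_history.json"
        path.write_text(json.dumps(self.to_dict(), indent=2))
        return path


# Illustrative values only: entropy rises, then dips at episode 20.
tracker = EntropyTracker()
saved_at = [ep for ep, h in [(1, 0.5), (10, 0.7), (20, 0.6)] if tracker.update(ep, h)]
print(saved_at)
print(json.dumps(tracker.to_dict()))
```

Because the tracker only reports when a new maximum is seen, a caller could pair it with the existing reward-based check so that `best_reward.pt` and `best_entropy.pt` are updated independently.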