Add Golden Gate Claude interactive tutorial and CLI demo #7
Conversation
Adds an interactive tutorial that teaches SAE feature steering by recreating Anthropic's famous Golden Gate Claude experiment at miniature scale using nanochat-SAE.

New files:
- golden_gate_demo.py: Standalone CLI walkthrough (~2 min on CPU)
- golden_gate_walkthrough.ipynb: Rich Colab notebook with visualizations (probability distributions, heatmaps, steering landscape plots)

Both walk through the full pipeline: build model -> collect activations -> train SAE -> find features -> steer behavior, with detailed educational explanations at every step. README updated with a new section and tutorial links.

https://claude.ai/code/session_01RuUhm1SYvQYE61feFRAc3e
Code Review
This pull request introduces a 'Golden Gate Claude' tutorial to the repository, featuring a new interactive Jupyter notebook (golden_gate_walkthrough.ipynb) and a CLI demo script (golden_gate_demo.py). These resources guide users through building a miniature transformer, training a Sparse Autoencoder (SAE), and performing feature steering to manipulate model behavior. Feedback on the demo script highlights a missing comparison in the feature landscape visualization step and suggests simplifying redundant tensor slicing for better readability.
```python
# We need to manually steer and track features
steering_config = {hook_point: (golden_feature, 10.0)}
with interp_model.steering_enabled(steering_config):
    interp_model.model(test_tokens)
```
The demo script for Step 6 claims to "compare" unsteered vs steered feature landscapes, but it currently only prints the unsteered landscape (lines 382-390). Line 401 runs the model with steering enabled, but the resulting feature activations are never retrieved or printed. To make this comparison meaningful for the user, you should enable interpretation alongside steering, capture the active features, and print them similarly to the unsteered section.
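A minimal sketch of the pattern the reviewer describes, using a toy module in place of the demo's real model (the `interp_model`, `steering_enabled`, and SAE encoding step are specific to the PR and not reproduced here): register a forward hook at the layer of interest, run the steered pass, and record which features fired so they can be printed next to the unsteered landscape.

```python
import torch
import torch.nn as nn

# Toy stand-in for the demo's transformer; in the real script the hook
# would sit at the SAE hook point inside interp_model (an assumption here).
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 16))
hook_layer = model[1]

captured = {}

def capture_steered_features(module, inputs, output):
    # Record how many units are active; the demo would instead record the
    # SAE feature activations of the steered residual stream.
    captured["active"] = (output > 0).sum().item()

handle = hook_layer.register_forward_hook(capture_steered_features)
try:
    with torch.no_grad():
        model(torch.randn(1, 16))  # the steered forward pass goes here
finally:
    handle.remove()

print(f"{captured['active']} active features on the steered pass")
```

The same `captured` dict can then be printed with the formatting already used for the unsteered landscape, making the side-by-side comparison concrete.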
golden_gate_demo.py
```python
steered_probs = F.softmax(steered_logits[:, -1, :], dim=-1)
diff = (steered_logits - baseline_logits).abs()
top_tok = steered_probs[:, :].argmax(dim=-1).item()
```
The slicing [:, :] on steered_probs is redundant as the tensor is already 2D with shape (1, vocab_size). You can simplify this to steered_probs.argmax(dim=-1).item() or even steered_probs.argmax().item() for better readability, matching the style used later in the script (e.g., line 447).
Suggested change:
```diff
-top_tok = steered_probs[:, :].argmax(dim=-1).item()
+top_tok = steered_probs.argmax().item()
```
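A quick check of the reviewer's point: on a (1, vocab_size) tensor the `[:, :]` slice is a no-op, so all three argmax spellings pick the same token id (a 50-entry dummy distribution stands in for the real vocabulary).

```python
import torch
import torch.nn.functional as F

steered_probs = F.softmax(torch.randn(1, 50), dim=-1)  # shape (1, vocab_size)

a = steered_probs[:, :].argmax(dim=-1).item()  # redundant slice
b = steered_probs.argmax(dim=-1).item()        # same result without the slice
c = steered_probs.argmax().item()              # same result, simplest form
assert a == b == c
```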
Address code review feedback:
- Step 6 now captures and prints the steered feature landscape alongside the unsteered one, making the comparison meaningful for readers
- Simplify redundant `[:, :]` slice to `.argmax().item()`

https://claude.ai/code/session_01RuUhm1SYvQYE61feFRAc3e
/gemini review
Code Review
This pull request introduces a "Golden Gate Claude" tutorial, featuring a new CLI demo and a Jupyter notebook. These resources demonstrate how to use Sparse Autoencoders (SAEs) to identify and steer internal model features, recreating Anthropic's famous interpretability experiment at a smaller scale. The review feedback suggests optimizing the activation collection process by terminating the loop once the maximum activation limit is reached and ensuring robust hook management through the use of try...finally blocks to prevent persistent hooks in the event of an error.
```python
with torch.no_grad(), collector:
    for i in range(30):
        # Feed random "text" through the model
        tokens = torch.randint(0, config.vocab_size, (8, config.sequence_len))
        model(tokens)
```
The activation collection loop continues for all 30 iterations even after the ActivationCollector has reached its max_activations limit (15,000). Since each batch collects 1,024 activations (8 samples * 128 tokens), the collector will be full after approximately 15 batches. Breaking the loop early will save unnecessary computation.
Suggested change:
```diff
 with torch.no_grad(), collector:
     for i in range(30):
+        if collector.counts[hook_point] >= collector.max_activations:
+            break
         # Feed random "text" through the model
         tokens = torch.randint(0, config.vocab_size, (8, config.sequence_len))
         model(tokens)
```
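The arithmetic behind the suggestion can be sketched without the real `ActivationCollector` (its `counts`/`max_activations` attributes are taken from the review, not verified here): with 8 × 128 = 1,024 activations per batch and a 15,000 cap, the early break stops the loop after 15 batches instead of running all 30.

```python
# Minimal stand-in for the capped collection loop; CappedCollector is a
# hypothetical simplification of the repo's ActivationCollector.
class CappedCollector:
    def __init__(self, max_activations=15_000):
        self.max_activations = max_activations
        self.count = 0

    def add(self, n):
        # Clamp at the cap, like a fixed-size activation buffer.
        self.count = min(self.count + n, self.max_activations)

collector = CappedCollector()
batches_run = 0
for i in range(30):
    if collector.count >= collector.max_activations:
        break  # buffer is full; further batches are wasted compute
    collector.add(8 * 128)  # one batch: 8 samples x 128 tokens = 1,024
    batches_run += 1

print(batches_run)  # 15: ceil(15_000 / 1_024) batches fill the buffer
```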
```python
handle = module.register_forward_hook(capture_steered_features)
interp_model.model(test_tokens)
handle.remove()
```
The manual hook registration and removal should be wrapped in a try...finally block. This ensures that the hook is properly removed even if an exception occurs during the model forward pass, preventing the hook from persisting and affecting subsequent model executions.
Suggested change:
```diff
 handle = module.register_forward_hook(capture_steered_features)
-interp_model.model(test_tokens)
-handle.remove()
+try:
+    interp_model.model(test_tokens)
+finally:
+    handle.remove()
```
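A runnable illustration of why the try...finally matters, using a plain `nn.Linear` in place of the demo's model: if the forward pass raises while a hook is attached, only the `finally` guarantees the hook is detached before later calls.

```python
import torch
import torch.nn as nn

model = nn.Linear(4, 4)

def failing_hook(module, inputs, output):
    raise RuntimeError("boom")  # simulate an error during the forward pass

handle = model.register_forward_hook(failing_hook)
try:
    try:
        model(torch.randn(1, 4))
    finally:
        handle.remove()  # runs even though the hook raised
except RuntimeError:
    pass  # the error propagates, but the hook is already gone

# Subsequent forward passes work because the hook no longer fires.
out = model(torch.randn(1, 4))
print(out.shape)  # torch.Size([1, 4])
```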
Summary
This PR adds two comprehensive educational resources that teach users how to recreate Anthropic's famous "Golden Gate Claude" experiment using nanochat-SAE: an interactive Jupyter notebook for Google Colab and a standalone CLI walkthrough script.
Key Changes
- golden_gate_walkthrough.ipynb - Interactive Jupyter notebook (202 cells) that guides users through the full pipeline
- golden_gate_demo.py - Standalone Python script (535 lines) providing a CLI walkthrough
- README.md - Updated documentation with a new section and tutorial links

Implementation Details
Both resources follow the same pedagogical structure:
The notebook includes matplotlib visualizations of training loss, feature statistics, and probability distributions. The CLI script uses formatted text output with progress bars and ASCII art for engagement.
Both implementations use the existing nanochat and SAE infrastructure (GPT, TopKSAE, ActivationCollector, SAETrainer, SAEEvaluator, FeatureVisualizer, InterpretableModel) without requiring modifications to core libraries.
https://claude.ai/code/session_01RuUhm1SYvQYE61feFRAc3e