
Add Golden Gate Claude interactive tutorial and CLI demo#7

Open
SolshineCode wants to merge 2 commits into claude/nanochat-sae-interpretability-011CUT2TocZpFerXthoW9LMf from claude/nanochat-educational-walkthrough-Le6mP

Conversation

@SolshineCode
Owner

Summary

This PR adds two comprehensive educational resources that teach users how to recreate Anthropic's famous "Golden Gate Claude" experiment using nanochat-SAE: an interactive Jupyter notebook for Google Colab and a standalone CLI walkthrough script.

Key Changes

  • golden_gate_walkthrough.ipynb - Interactive Jupyter notebook (202 cells) that guides users through:

    • Building a miniature transformer model
    • Collecting and visualizing internal activations
    • Training a Sparse Autoencoder to decompose thoughts into interpretable features
    • Identifying and steering with individual features
    • Creating interactive visualizations and dashboards
    • Includes setup instructions for Google Colab with automatic dependency installation
  • golden_gate_demo.py - Standalone Python script (535 lines) providing:

    • Step-by-step terminal walkthrough with detailed educational explanations
    • ASCII art visualizations (Golden Gate Bridge banner)
    • Pretty-printed progress tracking and results
    • Feature steering demonstrations at various amplification strengths
    • Comparison of steered vs. unsteered model outputs
    • Feature landscape visualization and analysis
    • Negative steering (feature suppression) examples
    • HTML dashboard generation for feature exploration
    • Runs on CPU in ~2 minutes with no GPU required
  • README.md - Updated documentation:

    • Added "Build Your Own Golden Gate Claude!" section with Colab badge
    • Added comparison table explaining key SAE concepts
    • Updated file structure to reference new tutorials
    • Reorganized tutorials section to highlight new resources

Implementation Details

Both resources follow the same pedagogical structure:

  1. Build a small GPT model (8 layers, 128-dim hidden state, ~1.3M parameters)
  2. Collect 15,000 activation vectors from layer 4 (middle of the model)
  3. Train a TopK-SAE with 1,024 features (8x expansion) and k=16 sparsity
  4. Identify the most active feature as the "Golden Gate" feature
  5. Demonstrate feature steering by amplifying the feature at strengths 1.0→50.0
  6. Show how steering changes output probability distributions
  7. Compare steering with different features and negative steering
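For readers who want the gist of step 3 in code, here is a minimal TopK-SAE sketch. This is an illustrative toy, not the repository's `TopKSAE` class; the dimensions mirror the demo's 128-dim activations, 1,024 features, and k=16 sparsity.

```python
import torch
import torch.nn as nn

class TinyTopKSAE(nn.Module):
    """Toy TopK sparse autoencoder (illustration only, not the repo's TopKSAE)."""
    def __init__(self, d_model=128, n_features=1024, k=16):
        super().__init__()
        self.k = k
        self.encoder = nn.Linear(d_model, n_features)
        self.decoder = nn.Linear(n_features, d_model)

    def forward(self, x):
        # Encode, then keep only the k largest feature activations per vector
        pre = torch.relu(self.encoder(x))
        topk = torch.topk(pre, self.k, dim=-1)
        features = torch.zeros_like(pre)
        features.scatter_(-1, topk.indices, topk.values)
        # Decode back to the residual-stream dimension
        return self.decoder(features), features

sae = TinyTopKSAE()
recon, feats = sae(torch.randn(4, 128))  # 4 activation vectors in, 4 reconstructions out
```

The TopK constraint replaces the usual L1 sparsity penalty: at most k features fire per input, which is what makes individual features interpretable enough to steer with in steps 4 and 5.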

The notebook includes matplotlib visualizations of training loss, feature statistics, and probability distributions. The CLI script uses formatted text output with progress bars and ASCII art for engagement.

Both implementations use the existing nanochat and SAE infrastructure (GPT, TopKSAE, ActivationCollector, SAETrainer, SAEEvaluator, FeatureVisualizer, InterpretableModel) without requiring modifications to core libraries.
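Mechanically, the feature steering used in both resources boils down to adding a scaled copy of one feature's decoder direction to the hooked layer's activations. A hedged sketch of that arithmetic follows; the `steer` helper and the toy identity decoder are illustrative, not part of the InterpretableModel API.

```python
import torch

def steer(activation, decoder_weight, feature_idx, strength):
    """Add `strength` units of one SAE feature's decoder direction.

    activation: (batch, seq, d_model) hidden states at the hooked layer
    decoder_weight: (d_model, n_features) SAE decoder matrix
    """
    direction = decoder_weight[:, feature_idx]   # (d_model,) direction for this feature
    return activation + strength * direction     # broadcasts over batch and sequence

acts = torch.zeros(1, 8, 128)
W_dec = torch.eye(128, 1024)  # toy decoder: feature i maps to basis vector e_i
steered = steer(acts, W_dec, feature_idx=3, strength=10.0)
```

Sweeping `strength` from 1.0 to 50.0, as the demo does, simply scales this added vector, which is why the output distribution shifts smoothly toward the feature's associated tokens.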

https://claude.ai/code/session_01RuUhm1SYvQYE61feFRAc3e

Adds an interactive tutorial that teaches SAE feature steering by
recreating Anthropic's famous Golden Gate Claude experiment at
miniature scale using nanochat-SAE.

New files:
- golden_gate_demo.py: Standalone CLI walkthrough (~2min on CPU)
- golden_gate_walkthrough.ipynb: Rich Colab notebook with
  visualizations (probability distributions, heatmaps, steering
  landscape plots)

Both walk through the full pipeline: build model -> collect
activations -> train SAE -> find features -> steer behavior,
with detailed educational explanations at every step.

README updated with new section and tutorial links.

https://claude.ai/code/session_01RuUhm1SYvQYE61feFRAc3e

@gemini-code-assist (bot) left a comment


Code Review

This pull request introduces a 'Golden Gate Claude' tutorial to the repository, featuring a new interactive Jupyter notebook (golden_gate_walkthrough.ipynb) and a CLI demo script (golden_gate_demo.py). These resources guide users through building a miniature transformer, training a Sparse Autoencoder (SAE), and performing feature steering to manipulate model behavior. Feedback on the demo script highlights a missing comparison in the feature landscape visualization step and suggests simplifying redundant tensor slicing for better readability.

# We need to manually steer and track features
steering_config = {hook_point: (golden_feature, 10.0)}
with interp_model.steering_enabled(steering_config):
    interp_model.model(test_tokens)


Severity: medium

The demo script for Step 6 claims to "compare" unsteered vs steered feature landscapes, but it currently only prints the unsteered landscape (lines 382-390). Line 401 runs the model with steering enabled, but the resulting feature activations are never retrieved or printed. To make this comparison meaningful for the user, you should enable interpretation alongside steering, capture the active features, and print them similarly to the unsteered section.
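A minimal self-contained sketch of the fix the reviewer describes: register a forward hook to capture activations during the steered pass, then print or rank them like the unsteered section. The toy `module` below stands in for the SAE wrapper inside `interp_model`; `capture_steered_features` follows the hook name used elsewhere in this review.

```python
import torch
import torch.nn as nn

# Toy stand-in for the SAE module at the hook point; in the demo this would
# be the module that produces feature activations inside `interp_model`.
module = nn.Linear(128, 128)
captured = {}

def capture_steered_features(mod, inputs, output):
    # Store the activations produced while steering is active
    captured["features"] = output.detach()

handle = module.register_forward_hook(capture_steered_features)
try:
    module(torch.randn(1, 128))  # stands in for interp_model.model(test_tokens)
finally:
    handle.remove()

# Rank the captured activations so the steered landscape can be printed
# alongside the unsteered one
top = captured["features"].abs().topk(5, dim=-1)
```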


steered_probs = F.softmax(steered_logits[:, -1, :], dim=-1)
diff = (steered_logits - baseline_logits).abs()
top_tok = steered_probs[:, :].argmax(dim=-1).item()


Severity: medium

The slicing [:, :] on steered_probs is redundant as the tensor is already 2D with shape (1, vocab_size). You can simplify this to steered_probs.argmax(dim=-1).item() or even steered_probs.argmax().item() for better readability, matching the style used later in the script (e.g., line 447).

Suggested change

Before: top_tok = steered_probs[:, :].argmax(dim=-1).item()
After:  top_tok = steered_probs.argmax().item()
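The equivalence the reviewer relies on is easy to check: for a tensor of shape (1, vocab_size), the `[:, :]` slice is a no-op, and a flattened `argmax()` picks the same index as `argmax(dim=-1)` because there is only one row. Toy probabilities stand in for the real vocabulary distribution.

```python
import torch

torch.manual_seed(0)
steered_probs = torch.softmax(torch.randn(1, 50), dim=-1)  # toy (1, vocab) distribution

a = steered_probs[:, :].argmax(dim=-1).item()  # original spelling, redundant slice
b = steered_probs.argmax(dim=-1).item()        # slice dropped
c = steered_probs.argmax().item()              # flattened; same index for a 1-row tensor
```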

Address code review feedback:
- Step 6 now captures and prints the steered feature landscape alongside
  the unsteered one, making the comparison meaningful for readers
- Simplify redundant `[:, :]` slice to `.argmax().item()`

https://claude.ai/code/session_01RuUhm1SYvQYE61feFRAc3e
@SolshineCode
Owner Author

/gemini review


@gemini-code-assist (bot) left a comment


Code Review

This pull request introduces a "Golden Gate Claude" tutorial, featuring a new CLI demo and a Jupyter notebook. These resources demonstrate how to use Sparse Autoencoders (SAEs) to identify and steer internal model features, recreating Anthropic's famous interpretability experiment at a smaller scale. The review feedback suggests optimizing the activation collection process by terminating the loop once the maximum activation limit is reached and ensuring robust hook management through the use of try...finally blocks to prevent persistent hooks in the event of an error.

Comment on lines +156 to +160
with torch.no_grad(), collector:
    for i in range(30):
        # Feed random "text" through the model
        tokens = torch.randint(0, config.vocab_size, (8, config.sequence_len))
        model(tokens)


Severity: medium

The activation collection loop continues for all 30 iterations even after the ActivationCollector has reached its max_activations limit (15,000). Since each batch collects 1,024 activations (8 samples * 128 tokens), the collector will be full after approximately 15 batches. Breaking the loop early will save unnecessary computation.

Suggested change

Before:

    with torch.no_grad(), collector:
        for i in range(30):
            # Feed random "text" through the model
            tokens = torch.randint(0, config.vocab_size, (8, config.sequence_len))
            model(tokens)

After:

    with torch.no_grad(), collector:
        for i in range(30):
            if collector.counts[hook_point] >= collector.max_activations:
                break
            # Feed random "text" through the model
            tokens = torch.randint(0, config.vocab_size, (8, config.sequence_len))
            model(tokens)
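The reviewer's batch arithmetic checks out: each forward pass contributes 8 × 128 = 1,024 activation vectors, so the 15,000-activation limit is reached after 15 batches, and the remaining iterations of the 30-iteration loop do wasted work. Plain arithmetic, no project dependencies:

```python
import math

batch_size, seq_len = 8, 128
per_batch = batch_size * seq_len            # activations collected per forward pass
max_activations = 15_000

batches_needed = math.ceil(max_activations / per_batch)
wasted = 30 - batches_needed                # iterations the early break would skip
```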

Comment on lines +412 to +414
handle = module.register_forward_hook(capture_steered_features)
interp_model.model(test_tokens)
handle.remove()

Severity: medium

The manual hook registration and removal should be wrapped in a try...finally block. This ensures that the hook is properly removed even if an exception occurs during the model forward pass, preventing the hook from persisting and affecting subsequent model executions.

Suggested change

Before:

    handle = module.register_forward_hook(capture_steered_features)
    interp_model.model(test_tokens)
    handle.remove()

After:

    handle = module.register_forward_hook(capture_steered_features)
    try:
        interp_model.model(test_tokens)
    finally:
        handle.remove()
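A toy demonstration of why the try...finally pattern matters: if the forward pass (or the hook itself) raises, the hook would otherwise stay registered and corrupt every later forward pass. The module here is a stand-in, and `_forward_hooks` is a private PyTorch attribute used only to verify cleanup.

```python
import torch
import torch.nn as nn

module = nn.Linear(4, 4)

def boom_hook(mod, inputs, output):
    # Simulate a failure during the hooked forward pass
    raise RuntimeError("hook failure")

handle = module.register_forward_hook(boom_hook)
try:
    module(torch.randn(1, 4))
except RuntimeError:
    pass  # the error propagated, but cleanup below still runs
finally:
    handle.remove()

leftover = len(module._forward_hooks)  # the hook is gone despite the error
```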
