Add interpretability analysis and checkpoint saving #10
Open
machinesbefree wants to merge 1 commit into pathwaycom:main from
The README claims BDH has "sparse and positive" activations and a "Hebbian
working memory... displaying monosemanticity", but the repository has no
code for measuring or reproducing those claims. This change adds a
self-contained analysis workflow.
- train.py: save the trained model + config to bdh_checkpoint.pt at the end
of training (path overridable via BDH_CHECKPOINT). The sample generation
step is unchanged.
- analyze.py: a new standalone script that loads a checkpoint, runs the
held-out validation split through an instrumented forward pass (mirrors
BDH.forward without editing bdh.py), and reports per-layer statistics
for the xy_sparse units:
* mean firing rate and firing-rate distribution
* selectivity (max/mean activation) for candidate monosemantic neurons
* byte-context window for each surfaced neuron's strongest activation
A matplotlib figure with firing-rate and selectivity histograms is
written to analysis.png when matplotlib is available; the text report
works without it.
- README: short section documenting the analysis workflow.
On a default-config model trained on tiny-Shakespeare, the report shows
~3% mean firing rate across layers, ~40% of neurons firing on <1% of
tokens, and top-selective neurons that cluster cleanly around Shakespeare
dialogue turn boundaries (e.g. "\n\nPETRUCHIO:", "\n\nARIEL:"). No
changes to bdh.py.
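The byte-context windows mentioned above can be recovered by locating a neuron's strongest activation and slicing the surrounding validation bytes. A minimal sketch of that idea (the names `byte_context`, `data`, and `acts` are illustrative, not the PR's actual API):

```python
import torch

def byte_context(data: bytes, acts: torch.Tensor, neuron: int, window: int = 16) -> str:
    """Return the bytes surrounding the position where `neuron` fires hardest.

    `acts` is a [tokens, neurons] activation matrix aligned position-for-position
    with `data` (the held-out validation bytes).
    """
    pos = int(acts[:, neuron].argmax())               # strongest activation
    lo = max(0, pos - window)                         # clamp to data bounds
    hi = min(len(data), pos + window + 1)
    return data[lo:hi].decode("utf-8", errors="replace")
```

A dialogue-boundary neuron surfaced this way would show a window containing a speaker tag such as "\n\nPETRUCHIO:".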
Motivation
The README presents interpretability as a core property of BDH — "interpretable activations that are sparse and positive" and "Hebbian working memory... displaying monosemanticity" — but the repository currently has no code for measuring or reproducing those properties. Anyone wanting to verify those claims from the public repo has to write the tooling themselves.
This PR adds a small, self-contained workflow that quantifies them on a trained model.
What changes
- `train.py` (+16 lines): save the trained `state_dict` + config to `bdh_checkpoint.pt` at the end of training (path overridable via `BDH_CHECKPOINT=...`). `torch.compile`'s `_orig_mod.` prefix is stripped so the checkpoint loads cleanly into a fresh `BDH` instance. The sample-generation step is unchanged.
- `analyze.py` (new, ~380 lines): standalone script that loads a checkpoint, runs the held-out split through an instrumented forward pass (mirrors `BDH.forward` without editing `bdh.py`), and reports per-layer statistics on the `xy_sparse` units:
  - mean firing rate and firing-rate distribution
  - selectivity (max / mean activation) for candidate monosemantic neurons
  - byte-context window for each surfaced neuron's strongest activation
  - a matplotlib figure (`analysis.png`) when `matplotlib` is available; the text report works without it
- `README.md` (+13 lines): one short section documenting the workflow.
- `.gitignore`: ignore `__pycache__/`, `*.pt`, and `analysis.png`.
- No changes to `bdh.py`.

Example output
On a default-config BDH trained on tiny-Shakespeare (GPU memory permitting; a smaller batch was used locally), `analyze.py` produces a per-layer text report. Several of the top-selective neurons cluster cleanly around Shakespeare dialogue turn boundaries (`\n\nPETRUCHIO:`, `\n\nARIEL:`, `\n\nKATHARINA:`) — the sort of pattern a monosemantic detector should exhibit.

Test plan
- `train.py` runs, converges (loss 5.65 → 1.33 at the default config, with a smaller batch for an 8 GB GPU), and saves a valid checkpoint
- `analyze.py` loads the checkpoint and produces the report and figure
- `analyze.py` works without `matplotlib` installed (text report only)
- the instrumented forward pass matches `BDH.forward` numerically (same math, same order)
- `bdh.py` is unchanged

Notes
- Default `analyze.py` settings analyze 10 batches of 8×256 (~20k token positions); adjustable with `--n-batches`, `--batch-size`, and `--block-size`.
- Selectivity is computed as max / mean, which favours rarely-firing but strongly-activating neurons. Neurons that never fire on the sample are excluded from the median selectivity.
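The firing-rate and selectivity statistics described in the notes can be computed roughly as below. This is a sketch under stated conventions, not the PR's actual code: `acts` is a `[tokens, neurons]` matrix of non-negative activations for one layer, and the function name is hypothetical.

```python
import torch

def sparse_unit_stats(acts: torch.Tensor):
    """Summarize one layer's [tokens, neurons] activation matrix.

    Returns (mean firing rate across neurons, fraction of neurons firing
    on <1% of tokens, median selectivity). Selectivity is max / mean
    activation, and neurons that never fire on the sample are excluded
    from the median, as described in the notes above.
    """
    fired = acts > 0                                   # boolean firing mask
    rate = fired.float().mean(dim=0)                   # per-neuron firing rate
    mean_rate = rate.mean().item()
    frac_rare = (rate < 0.01).float().mean().item()    # rarely-firing fraction
    mean_act = acts.mean(dim=0)
    active = mean_act > 0                              # drop never-firing units
    selectivity = acts.max(dim=0).values[active] / mean_act[active]
    return mean_rate, frac_rare, selectivity.median().item()
```

Note that max / mean selectivity is unbounded for a neuron that fires once with a large activation, which is exactly why the rarely-firing caveat in the notes matters.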