
Add interpretability analysis and checkpoint saving#10

Open
machinesbefree wants to merge 1 commit into pathwaycom:main from machinesbefree:analyze-interpretability

Conversation

@machinesbefree

Motivation

The README presents interpretability as a core property of BDH — "interpretable activations that are sparse and positive" and "Hebbian working memory... displaying monosemanticity" — but the repository currently has no code for measuring or reproducing those properties. Anyone wanting to verify those claims from the public repo has to write the tooling themselves.

This PR adds a small, self-contained workflow that quantifies them on a trained model.

What changes

train.py (+16 lines): save the trained state_dict + config to bdh_checkpoint.pt at the end of training (path overridable via BDH_CHECKPOINT=...). torch.compile's _orig_mod. prefix is stripped so the checkpoint loads cleanly into a fresh BDH instance. The sample-generation step is unchanged.

analyze.py (new, ~380 lines): standalone script that loads a checkpoint, runs the held-out split through an instrumented forward pass (mirrors BDH.forward without editing bdh.py), and reports per-layer statistics on the xy_sparse units:

  • mean firing rate, firing-rate distribution, fraction of near-silent neurons
  • selectivity (max / mean activation) — candidate monosemantic detectors
  • byte-context window for each surfaced neuron's strongest activation
  • histogram figure (analysis.png) when matplotlib is available; text report works without it
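A minimal sketch of how those per-layer statistics could be computed, assuming the instrumented forward pass has collected one layer's xy_sparse activations into a non-negative `(tokens, neurons)` array (`layer_stats` and its thresholds are illustrative, not the script's actual code):

```python
import numpy as np


def layer_stats(acts: np.ndarray, silent_frac: float = 0.01) -> dict:
    """acts: (n_tokens, n_neurons) non-negative activations for one layer."""
    fired = acts > 0                     # boolean firing mask
    rate = fired.mean(axis=0)            # per-neuron firing rate
    mean_act = acts.mean(axis=0)
    max_act = acts.max(axis=0)
    # Selectivity = max / mean activation; neurons that never fire on the
    # sample are excluded (NaN) so they don't distort the median.
    active = mean_act > 0
    selectivity = np.where(active,
                           max_act / np.where(active, mean_act, 1.0),
                           np.nan)
    return {
        "mean_firing_pct": 100 * rate.mean(),
        "median_firing_pct": 100 * float(np.median(rate)),
        "pct_near_silent": 100 * (rate < silent_frac).mean(),
        "median_selectivity": float(np.nanmedian(selectivity)),
    }
```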

README.md (+13 lines): one short section documenting the workflow.

.gitignore: ignore __pycache__/, *.pt, and analysis.png.

No changes to bdh.py.

Example output

On a default-config BDH trained on tiny-Shakespeare (with a smaller batch locally to fit GPU memory), analyze.py produces:

  layer  mean firing %  median firing %  % neurons <1% fire   median selectivity
  ----------------------------------------------------------------------
      0          2.69%            1.01%              49.83%               484.07
      1          2.69%            1.17%              46.81%               531.09
      2          3.11%            1.60%              39.68%               475.59
      3          3.37%            1.82%              36.89%               456.28
      4          3.56%            1.88%              36.32%               442.31
      5          3.67%            1.84%              36.95%               440.48

Top-selective neurons per layer (candidate monosemantic detectors):

  Layer 2
    neuron   firing %  selectivity  context (±24 bytes around max)
  h00·u07169      0.01%      19649.5  he ground indeed is tawny.\n\nSEBASTIAN:\nWith an ey
  h02·u04357      0.02%      19489.6  arth, thou! speak.\n\nCALIBAN:\n\nPROSPERO:\nCome fort
  h00·u07701      0.02%      19406.6  uest,\nThat, upon knowledge of my parentage,\nI may
  ...

Several of the top-selective neurons cluster cleanly around Shakespeare dialogue turn boundaries (\n\nPETRUCHIO:, \n\nARIEL:, \n\nKATHARINA:) — the sort of pattern a monosemantic detector should exhibit.
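The context column in that table can be produced by a helper along these lines (`context_window` is an illustrative name; the ±24-byte radius matches the report header, and newlines are escaped so each window prints on one line):

```python
def context_window(data: bytes, pos: int, radius: int = 24) -> str:
    """Return the bytes within ±radius of the position of a neuron's
    strongest activation, as a single printable line."""
    lo = max(0, pos - radius)
    hi = min(len(data), pos + radius + 1)
    # Escape newlines so the window stays on one report line, as in the
    # "\n\nSEBASTIAN:" excerpts above.
    return data[lo:hi].decode("utf-8", errors="replace").replace("\n", "\\n")
```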

Test plan

  • train.py runs, converges (loss 5.65 → 1.33 at default config w/ smaller batch for 8 GB GPU), saves a valid checkpoint
  • analyze.py loads the checkpoint, produces the report and figure
  • analyze.py works without matplotlib installed (text report only)
  • Instrumented forward matches BDH.forward numerically (same math, same order)
  • No edits to bdh.py

Notes

  • Default analyze.py settings analyze 10 batches of 8×256 (~20k token positions); adjustable with --n-batches, --batch-size, --block-size.
  • Selectivity is reported as max / mean, which favours rarely-firing but strongly-activating neurons. Neurons that never fire on the sample are excluded from median selectivity.

