
Add interpretability analysis and checkpoint saving#10

Open
machinesbefree wants to merge 1 commit into pathwaycom:main from machinesbefree:analyze-interpretability

Conversation

@machinesbefree

Motivation

The README presents interpretability as a core property of BDH — "interpretable activations that are sparse and positive" and "Hebbian working memory... displaying monosemanticity" — but the repository currently has no code for measuring or reproducing those properties. Anyone wanting to verify those claims from the public repo has to write the tooling themselves.

This PR adds a small, self-contained workflow that quantifies them on a trained model.

What changes

train.py (+16 lines): save the trained state_dict + config to bdh_checkpoint.pt at the end of training (path overridable via BDH_CHECKPOINT=...). torch.compile's _orig_mod. prefix is stripped so the checkpoint loads cleanly into a fresh BDH instance. The sample-generation step is unchanged.

analyze.py (new, ~380 lines): standalone script that loads a checkpoint, runs the held-out split through an instrumented forward pass (mirrors BDH.forward without editing bdh.py), and reports per-layer statistics on the xy_sparse units:

  • mean firing rate, firing-rate distribution, fraction of near-silent neurons
  • selectivity (max / mean activation) — candidate monosemantic detectors
  • byte-context window for each surfaced neuron's strongest activation
  • histogram figure (analysis.png) when matplotlib is available; text report works without it
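A minimal sketch of how those per-layer statistics could be computed, assuming the instrumented forward pass has collected one layer's xy_sparse activations into a non-negative `(tokens, neurons)` array (`layer_stats` and its thresholds are illustrative, not the script's actual code):

```python
import numpy as np


def layer_stats(acts: np.ndarray, silent_frac: float = 0.01) -> dict:
    """acts: (n_tokens, n_neurons) non-negative activations for one layer."""
    fired = acts > 0                     # boolean firing mask
    rate = fired.mean(axis=0)            # per-neuron firing rate
    mean_act = acts.mean(axis=0)
    max_act = acts.max(axis=0)
    # Selectivity = max / mean activation; neurons that never fire on the
    # sample are excluded (NaN) so they don't distort the median.
    active = mean_act > 0
    selectivity = np.where(active,
                           max_act / np.where(active, mean_act, 1.0),
                           np.nan)
    return {
        "mean_firing_pct": 100 * rate.mean(),
        "median_firing_pct": 100 * float(np.median(rate)),
        "pct_near_silent": 100 * (rate < silent_frac).mean(),
        "median_selectivity": float(np.nanmedian(selectivity)),
    }
```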

README.md (+13 lines): one short section documenting the workflow.

.gitignore: ignore __pycache__/, *.pt, and analysis.png.

No changes to bdh.py.

Example output

On a default-config BDH trained on tiny-Shakespeare (with a smaller batch locally to fit GPU memory), analyze.py produces:

  layer  mean firing %  median firing %  % neurons <1% fire   median selectivity
  ----------------------------------------------------------------------
      0          2.69%            1.01%              49.83%               484.07
      1          2.69%            1.17%              46.81%               531.09
      2          3.11%            1.60%              39.68%               475.59
      3          3.37%            1.82%              36.89%               456.28
      4          3.56%            1.88%              36.32%               442.31
      5          3.67%            1.84%              36.95%               440.48

Top-selective neurons per layer (candidate monosemantic detectors):

  Layer 2
    neuron   firing %  selectivity  context (±24 bytes around max)
  h00·u07169      0.01%      19649.5  he ground indeed is tawny.\n\nSEBASTIAN:\nWith an ey
  h02·u04357      0.02%      19489.6  arth, thou! speak.\n\nCALIBAN:\n\nPROSPERO:\nCome fort
  h00·u07701      0.02%      19406.6  uest,\nThat, upon knowledge of my parentage,\nI may
  ...

Several of the top-selective neurons cluster cleanly around Shakespeare dialogue turn boundaries (\n\nPETRUCHIO:, \n\nARIEL:, \n\nKATHARINA:) — the sort of pattern a monosemantic detector should exhibit.
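The context column in that table can be produced by a helper along these lines (`context_window` is an illustrative name; the ±24-byte radius matches the report header, and newlines are escaped so each window prints on one line):

```python
def context_window(data: bytes, pos: int, radius: int = 24) -> str:
    """Return the bytes within ±radius of the position of a neuron's
    strongest activation, as a single printable line."""
    lo = max(0, pos - radius)
    hi = min(len(data), pos + radius + 1)
    # Escape newlines so the window stays on one report line, as in the
    # "\n\nSEBASTIAN:" excerpts above.
    return data[lo:hi].decode("utf-8", errors="replace").replace("\n", "\\n")
```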

Test plan

  • train.py runs, converges (loss 5.65 → 1.33 at default config w/ smaller batch for 8 GB GPU), saves a valid checkpoint
  • analyze.py loads the checkpoint, produces the report and figure
  • analyze.py works without matplotlib installed (text report only)
  • Instrumented forward matches BDH.forward numerically (same math, same order)
  • No edits to bdh.py

Notes

  • Default analyze.py settings analyze 10 batches of 8×256 (~20k token positions); adjustable with --n-batches, --batch-size, --block-size.
  • Selectivity is reported as max / mean, which favours rarely-firing but strongly-activating neurons. Neurons that never fire on the sample are excluded from median selectivity.

