merge by masterkni6 · Pull Request #1 · masterkni6/lczero-training

masterkni6 · 2022-02-23T01:59:39Z

No description provided.

The per-instance _eval_jit field was a latent bug — it ignored graphdef on subsequent calls if the instance was reused with a different graphdef. Since _make_eval_jit already deduplicates by (graphdef, loss_fn) id at module level, the field is redundant. Also tighten Any to a typed alias.

- describe-training: --weight_paths lists all model weight paths in slash-separated format, sorted numerically
- migrate-checkpoint: --dump_source_paths / --dump_destination_paths print paths as proto rule stubs and exit early

…ozen params

optax.masked(tx, mask) does not zero updates for False-masked params; it passes them through unchanged. This meant every "frozen" param got param += raw_gradient each step, causing the activation explosions seen during block-14 reset training. Fix: chain the masked optimizer with optax.set_to_zero() so frozen params receive exactly zero updates.
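The pass-through behaviour described above can be illustrated with a toy model in plain Python. This is not optax itself: `masked`, `chain`, `sgd`, and `set_to_zero` below are simplified, hypothetical stand-ins that operate on a flat dict of floats, and pairing `set_to_zero` with the inverted mask is an assumption about how the fix is wired.

```python
from typing import Callable, Dict

Updates = Dict[str, float]


def masked(transform: Callable[[float], float],
           mask: Dict[str, bool]) -> Callable[[Updates], Updates]:
    """Toy model: apply `transform` where mask is True, pass through otherwise."""
    def apply(updates: Updates) -> Updates:
        return {k: transform(u) if mask[k] else u for k, u in updates.items()}
    return apply


def chain(*steps: Callable[[Updates], Updates]) -> Callable[[Updates], Updates]:
    """Toy model: run each step on the output of the previous one."""
    def apply(updates: Updates) -> Updates:
        for step in steps:
            updates = step(updates)
        return updates
    return apply


def sgd(g: float) -> float:
    # Toy optimizer with learning rate 0.5.
    return -0.5 * g


def set_to_zero(g: float) -> float:
    return 0.0


grads = {"trainable": 4.0, "frozen": 4.0}
train_mask = {"trainable": True, "frozen": False}
frozen_mask = {k: not v for k, v in train_mask.items()}

# Masking alone: the frozen entry still carries the raw gradient.
buggy_updates = masked(sgd, train_mask)(grads)

# Masking, then zeroing the complement (assumed wiring of the fix).
fixed = chain(masked(sgd, train_mask), masked(set_to_zero, frozen_mask))
fixed_updates = fixed(grads)
```

In the toy model the "frozen" entry leaves the first stage unchanged, so only the second stage guarantees a zero update.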

is_leaf stopped at any NamedTuple (MaskedState, etc.), preventing traversal to inner ScaleByAdamState/ScaleByScheduleState nodes. Now is_leaf only stops at known count-bearing types. Also adds a post-update assertion that no state with a 'count' field was missed, so new wrapper types fail loudly instead of silently leaving step counters at 0.

mooskagh added 30 commits October 3, 2025 09:53

Ensure tune_lr uses updated training state for validation

405b763

Merge pull request #9 from mooskagh/codex/fix-weights-not-updating-during-training

b3c98d0

Add overfit training utility

400a2bc

Fix overfit step logging for JAX arrays

2a3d6ea

Merge pull request #10 from mooskagh/codex/2025-10-03-15-50-59

91ddabd

Add coin flip overfit mode and stream CSV results

7e89ce5

Merge pull request #11 from mooskagh/codex/2025-10-04-06-20-05

ef8e970

Do not forward chunks that failed rescoring.

312d98f

Merge branch 'master' of github.com:mooskagh/lczero-training

a4b54f5

Log chunk metadata on rescore failure

33c464c

Merge pull request #12 from mooskagh/codex/2025-10-04-10-54-21

68afe10

Add debug chunk source for synthetic training data

1ffd4dc

Merge pull request #13 from mooskagh/codex/2025-10-04-11-05-02

8b8efbf

Add tool to dump V6 chunk files

0c7c74c

Merge pull request #14 from mooskagh/codex/2025-10-04-13-04-05

d3d8349

Wire max_grad_norm into daemon optimizer

018d411

Different seed every time.

6777bf7

Tool for startpos stats.

64c802c

Add tests for shuffling chunk pool metrics

0aa468c

Merge pull request #17 from mooskagh/codex/2025-10-06-20-14-16

3d24e8b

Merge branch 'master' of github.com:mooskagh/lczero-training

1af1d89

Fixing what codex did..

c448cdb

Way to debug batches.

d93bd90

More tools

daf6e30

Fix deepnorm initializer

ee82296

Passing deepnorm_beta as parameter. Not sure whether better.

a4f042d

Merge pull request #20 from mooskagh/deepnorm

8a57b6e

Allow shuffling pool startup with fewer chunks

37cbcb2

Fail shuffling pool startup without any chunks

bb1bdfa

Merge pull request #21 from mooskagh/codex/2025-10-09-20-06-31

fccdd17

mooskagh and others added 30 commits March 2, 2026 22:08

Donate jit_state buffers in training loop to reduce memory fragmentation

0c8a44f

Use module-level jax.jit with static_argnums instead of manual cache

76e3d9c

Replace _eval_jit_cache global dict and _make_eval_jit factory with a single @functools.partial(jax.jit, static_argnums=(0, 1)) function, letting JAX cache per (graphdef, loss_fn) pair natively.

Revert "Use module-level jax.jit with static_argnums instead of manual cache"

b8da2dd

This reverts commit 76e3d9c.

Addressing review comments

ac23c8b

Revert "Addressing review comments"

acb4c6b

This reverts commit ac23c8b.

Reapply "Use module-level jax.jit with static_argnums instead of manual cache"

6a3c193

This reverts commit b8da2dd.

Addressing review comments

a15066b

Update jax and jaxlib to 0.9.1

1e83626

Merge pull request #232 from mooskagh/heads

4ad434a

Merge remote-tracking branch 'upstream/master' into heads

a03fc23

Fix JIT caching in metrics.py to avoid unhashable ModelConfig

90be7e4

Replace static_argnums=(0,1) approach (which requires graphdef to be hashable) with an id-keyed cache of JIT closures that capture graphdef and loss_fn, matching the working ea35598 approach.

Merge pull request #233 from mooskagh/heads

2cdcce0

Merge pull request #234 from mooskagh/fix-jit-hashability

779c1d0

Add freeze_selector to OptimizerConfig

71ae329

Allows freezing selected weights during training by wrapping the gradient transformation with optax.masked, so frozen weights receive no gradient updates.

Merge pull request #235 from mooskagh/freeze

8cf0f4e

Merge pull request #236 from mooskagh/migrate-dump

3175301

Fix false positive in update_optimizer_step assertion

3eb91b4

tuple.count() is a builtin method on all tuples, so hasattr(x, "count") matched MaskedState and other NamedTuples. Check _fields instead to only match NamedTuples that have count as an actual data field.
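The false positive described above is easy to reproduce in plain Python. The sketch below uses two hypothetical stand-in NamedTuples (`FakeMaskedState`, `FakeAdamState`) rather than the real optax classes, and `has_count_field` is an illustrative helper name, not the one used in the repository.

```python
from typing import Any, NamedTuple


class FakeMaskedState(NamedTuple):
    # Hypothetical stand-in for a wrapper state with no step counter.
    inner_state: Any


class FakeAdamState(NamedTuple):
    # Hypothetical stand-in for a state that really stores a step counter.
    count: int
    mu: float


def has_count_field(x: Any) -> bool:
    """True only if x is a NamedTuple that declares 'count' as a data field."""
    return isinstance(x, tuple) and "count" in getattr(x, "_fields", ())


wrapper = FakeMaskedState(inner_state=None)
adam = FakeAdamState(count=0, mu=0.0)

# hasattr matches both, because every tuple has a builtin .count() method.
naive = [hasattr(wrapper, "count"), hasattr(adam, "count")]

# Checking _fields matches only the state that declares the field.
strict = [has_count_field(wrapper), has_count_field(adam)]
```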

Simplify update_optimizer_step validation

1d274b1

Stop at any NamedTuple with a 'count' field via is_leaf, assert it's a known type, and replace. Regular array leaves pass through unchanged. Replaces the broken second-traversal approach.

Merge pull request #237 from mooskagh/fix-freeze-selector

5771095

Export training_steps in network exports

1e819fc

Both the train command and daemon pipeline were missing the training_steps field when constructing LeelaExportOptions.

Move mutex to file source

4775209

Prefetch batch

beb94e2

Merge pull request #238 from mooskagh/fix-export

c235eae

Address PR review

215d937

Revert "Prefetch batch"

1be6b6c

This reverts commit beb94e2.

Merge pull request #239 from john-sp/parallel-tar-fetch

a60c7d2

masterkni6 wants to merge 578 commits into masterkni6:master from LeelaChessZero:master

4 participants
