Skip to content

Restore backbone-only .compile() override on forcefield regressors#165

Closed
timduignan wants to merge 2 commits into
orbital-materials:mainfrom
timduignan:restore-backbone-compile-override
Closed

Restore backbone-only .compile() override on forcefield regressors#165
timduignan wants to merge 2 commits into
orbital-materials:mainfrom
timduignan:restore-backbone-compile-override

Conversation

@timduignan

Copy link
Copy Markdown
Contributor

Summary

Without an override, model.compile(...) falls through to nn.Module.compile() which compiles self.__call__ (= forward). That has two problems:

  • DirectForcefieldRegressor: predict() calls self.model(batch) directly and never goes through __call__, so the compiled callable is created but never invoked. .compile() is silently a no-op for inference.
  • ConservativeForcefieldRegressor: predict() does flow through __call__, but compiling the full regressor pulls the energy autograd backward and (for OrbMol-v2) the Coulomb / PME path into the traced graph, which dynamo fragments badly.

This PR restores the small .compile() override on both classes so self.model (the GNS message-passing backbone, where ~all the FLOPs live) is the thing actually compiled; post-backbone work runs eager. This matches what the internal core repo's orbmolv2 branch already does, and was the behaviour before the recent porting commit that exposed this regression.

Numbers

orb_v3_direct_20_omat on a single H100, 3 trials each (variance ~1%):

variant 1k atoms 10k atoms
compile=True (pre-fix default) 9.23 ms 38.36 ms
compile=False (eager) 9.23 ms 38.32 ms
manual model.model.compile(...) (this fix) 7.07 ms 30.04 ms

So the pre-fix .compile() was a literal no-op in inference for direct models, and restoring the backbone compile recovers a 22-24% inference speedup. Conservative models also benefit by getting a clean single-fragment compile graph instead of the fragmented full-regressor trace.

Tests

Adds test_compile_engages_backbone to both tests/forcefield/test_direct_regressor.py and tests/forcefield/test_conservative.py. These assert that model.model._compiled_call_impl is not None after .compile(), so this can't silently re-break if someone removes the override again in a future refactor.

Other changes

  • README update: the "Full-model compilation" bullet wasn't accurate (we compile the backbone, not the full regressor). Reworded to reflect what's actually happening and the realistic speedup range.

Test plan

  • pytest tests/forcefield/test_direct_regressor.py::test_compile_engages_backbone tests/forcefield/test_conservative.py::test_compile_engages_backbone passes
  • Full tests/forcefield/test_{direct_regressor,conservative}.py passes in CI
  • Existing test_regressor_compile_matches_eager still passes (compiled output equivalent to eager)

🤖 Generated with Claude Code

Without an override, model.compile(...) falls through to
nn.Module.compile() which compiles self.__call__ (= forward). For
DirectForcefieldRegressor that has no effect at inference: predict()
calls self.model(batch) directly and never goes through __call__, so
the compiled callable is created but never invoked. For
ConservativeForcefieldRegressor predict() does flow through __call__,
but compiling the full regressor pulls the energy autograd backward
and (for OrbMol-v2) the Coulomb / PME path into the traced graph,
which dynamo fragments badly.

Restore the override on both classes so that .compile() compiles
self.model (the GNS message-passing backbone) where almost all the
FLOPs live; post-backbone work runs eager. This matches what the
internal core repo's orbmolv2 branch already does, and was the
behaviour before the porting commit that exposed this regression.

Verified on a single H100, orb_v3_direct_20_omat, 3 trials each
(variance ~1%):

  variant                                     1k atoms   10k atoms
  compile=True (pre-fix default)              9.23 ms    38.36 ms
  compile=False (eager)                       9.23 ms    38.32 ms
  manual model.model.compile() (this fix)     7.07 ms    30.04 ms

i.e. the pre-fix .compile() was a no-op in inference for direct
models, and the restored backbone compile recovers a 22-24% inference
speedup.

Adds tests/forcefield/test_{direct,conservative}.py::test_compile_engages_backbone
that asserts the backbone has _compiled_call_impl set after .compile(),
so this can't silently re-break.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@timduignan timduignan marked this pull request as draft June 4, 2026 20:20
The replacement bullet I added in the previous commit framed the
restored backbone compile as a feature with a ~1.3x speedup, which is
misleading. Older releases already had backbone compile working via
the same override; v0.7.0 silently broke it (df8f4f0) and the
previous commit just restores prior behaviour. There's no new
speedup to announce, so the cleanest thing is to drop the bullet
entirely.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@vsimkus

vsimkus commented Jun 4, 2026

Copy link
Copy Markdown
Contributor

Aren't we overcorrecting here by removing the full compilation? Especially since it's passed our tests/evaluations for the conservative model. The direct regressor was indeed missing the compiled path via .predict(), but I've already put together a fix and tested it here: https://github.com/orbital-materials/orb/pull/3071

@timduignan

Copy link
Copy Markdown
Contributor Author

Brilliant thanks yep sorry yeah this is wrong deleting

@timduignan timduignan closed this Jun 5, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants