Vectorize O(N^2) bond feature computation with NumPy#16
Vectorize O(N^2) bond feature computation with NumPy#16longleo17 wants to merge 1 commit intobytedance:mainfrom
Conversation
There was a problem hiding this comment.
Pull request overview
This PR replaces the O(N_token²) nested-loop implementation of Featurizer.get_bond_features() with a vectorized NumPy approach by mapping bonded atom pairs directly to token indices, and adds a test module intended to validate correctness and benchmark speed.
Changes:
- Vectorize bond feature computation in
pxdesign/data/featurizer.pyusingnp.nonzero()+ token/atom index mapping and boolean masks. - Add a new test module
tests/test_vectorize_bond_features.pywith correctness checks and a speed comparison benchmark.
Reviewed changes
Copilot reviewed 2 out of 3 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
pxdesign/data/featurizer.py |
Replaces nested Python loops with vectorized NumPy filtering/assignment to compute token_bonds. |
tests/test_vectorize_bond_features.py |
Adds mock-based validation and a performance benchmark for the vectorized approach. |
tests/__init__.py |
Present but no functional changes. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| def test_vectorized_speedup(): | ||
| """Benchmark vectorized vs original implementation.""" | ||
| np.random.seed(42) | ||
| num_tokens = 500 | ||
| atoms_per_token = 5 | ||
| num_bonds = 1000 | ||
| token_array, atom_array = _make_mock_data(num_tokens, atoms_per_token, num_bonds) | ||
|
|
||
| # Benchmark original | ||
| start = time.monotonic() | ||
| for _ in range(3): | ||
| _bond_features_original(token_array, atom_array) | ||
| time_original = (time.monotonic() - start) / 3 | ||
|
|
||
| # Benchmark vectorized | ||
| start = time.monotonic() | ||
| for _ in range(3): | ||
| _bond_features_vectorized(token_array, atom_array) | ||
| time_vectorized = (time.monotonic() - start) / 3 | ||
|
|
||
| speedup = time_original / max(time_vectorized, 1e-9) | ||
| print(f"\nBond features benchmark ({num_tokens} tokens, {num_tokens * atoms_per_token} atoms):") | ||
| print(f" original (O(N^2) loops): {time_original:.3f}s") | ||
| print(f" vectorized (NumPy): {time_vectorized:.3f}s") | ||
| print(f" speedup: {speedup:.0f}x") | ||
|
|
||
| assert time_vectorized < time_original, ( | ||
| f"Vectorized ({time_vectorized:.3f}s) should be faster than original ({time_original:.3f}s)" | ||
| ) |
There was a problem hiding this comment.
test_vectorized_speedup() asserts a wall-clock speed comparison, which is often flaky across CI hardware/load and can cause nondeterministic failures. Consider marking this as a benchmark/slow test (skipped by default) or rewriting it to validate correctness only (keeping performance measurement out of the pass/fail criteria).
| # Filter: skip design tokens | ||
| valid = ~is_design[token_i] & ~is_design[token_j] | ||
| token_i = token_i[valid] | ||
| token_j = token_j[valid] |
There was a problem hiding this comment.
This changes the design-token exclusion behavior: the previous loop skipped bonds only when the outer-loop token was a design residue, but the new vectorized filter drops any bond where either endpoint is a design token. If this semantic change is intentional, it should be called out in the PR description and ideally covered by a regression test against Featurizer.get_bond_features(); if it’s meant to be a pure vectorization, adjust the filter to preserve the prior behavior.
|
|
||
| def _bond_features_original(token_array, atom_array): | ||
| """Original O(N^2) implementation for reference.""" | ||
| # Inline constants to avoid importing pxdesign.data.constants (requires rdkit) |
There was a problem hiding this comment.
Line 94 is over-indented relative to the function body (8 spaces instead of 4), which will raise an IndentationError and prevent the test module from importing. Align the # Inline constants ... comment and subsequent constants with the function indentation.
| # Inline constants to avoid importing pxdesign.data.constants (requires rdkit) | |
| # Inline constants to avoid importing pxdesign.data.constants (requires rdkit) |
| def _bond_features_vectorized(token_array, atom_array): | ||
| """Vectorized implementation.""" | ||
| # Inline constants to avoid importing pxdesign.data.constants (requires rdkit) | ||
| DESIGN_RESIDUES = {"xpb": 32, "xpa": 33, "rbb": 34, "raa": 35} | ||
| STD_RESIDUES = { |
There was a problem hiding this comment.
Line 149 is over-indented relative to the function body (8 spaces instead of 4), which will raise an IndentationError and prevent the test module from importing. Align the # Inline constants ... comment and subsequent constants with the function indentation.
| import torch | ||
|
|
||
|
|
There was a problem hiding this comment.
The module imports torch but never uses it. Removing the unused import will avoid unnecessary dependency loading and keeps the test file clean.
| import torch |
| matrix, mapping bonded atom pairs directly to their token indices. This | ||
| eliminates the inner loops entirely. | ||
|
|
||
| Expected: ~50-100x speedup on the bond feature computation. |
There was a problem hiding this comment.
The docstring claims an expected ~50–100x speedup, but the PR description reports ~13x. This mismatch is likely to confuse future readers; update the expectation text (or remove the numeric claim) to reflect the actual measured speedup.
| Expected: ~50-100x speedup on the bond feature computation. | |
| Expected: substantial (often order-of-magnitude) speedup on the bond feature computation. |
Summary
Replaces two nested Python
forloops (O(N_token^2)) inget_bond_features()with vectorized NumPy operations. The original iterated over all token pairs checking atom-level bonds in pure Python.Measured impact (synthetic benchmark)
Changes
pxdesign/data/featurizer.py—Featurizer.get_bond_features()np.nonzero, filter with vectorized boolean masksTest plan
pytest tests/test_vectorize_bond_features.py -v