-
Notifications
You must be signed in to change notification settings - Fork 764
ENH: Change get_hbond_map to use KDTree for distance search
#5182
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: develop
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hello there first time contributor! Welcome to the MDAnalysis community! We ask that all contributors abide by our Code of Conduct and that first time contributors introduce themselves on GitHub Discussions so we can get to know you. You can learn more about participating here. Please also add yourself to package/AUTHORS as part of this PR.
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## develop #5182 +/- ##
===========================================
- Coverage 92.72% 92.71% -0.02%
===========================================
Files 180 180
Lines 22472 22480 +8
Branches 3188 3191 +3
===========================================
+ Hits 20838 20842 +4
- Misses 1176 1179 +3
- Partials 458 459 +1 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
| hmap = np.tile(h_1, (1, 1, n)).reshape(n, n, 3) | ||
| # We can use KDTree to find candidate pairs within cutoff distance without having to | ||
| # compute a full pairwise distance matrix, much faster especially for larger proteins | ||
| n_kdtree = KDTree(n_atoms) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know the original code was vendored from somewhere else, so it explains the heavy reliance on linalg.norm and the lack of PBC awareness - however, now that we're making changes, should we be using one of our lib.distances calls (maybe capped_distance) instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not at all experienced with lib.distances are you talking about for the KDTree / potential pair generation or for the actual distance computation? If just for computing distances then on the KDTree generation we would already be discarding some potential across-boundary bonds.
We should be able to provide simple periodic checking with KDTree you can specify a boxsize parameter. But this wouldn't take into account the more complex boundaries we might come across?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay I just dug in and found capped_distance which should do exactly what we are after with better periodic boundary checking.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The performance isn't as good but also as expected with more checks. Still a large improvement over current.
============================================================
Scaling benchmark:
Residues Time (ms) Time/Res (μs)
------------------------------------------------------------
50 0.142 2.835
100 0.291 2.906
150 0.489 3.257
200 0.736 3.682
250 1.056 4.226
300 1.629 5.429
============================================================
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've currently updated the pair search to use capped_distances - but the actual distance computation still isn't taking into account periodic boundaries. Is there a periodic boundary aware distance calculation that doesn't do "everything vs everything" which distance_array seems to do? Ideally we would avoid repeatedly calling capped_distance as we have already constructed the KDTree once and can just use those results for computation later on, but still taking into account periodic boundary crossing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pair-wise distances between two equal groups of atoms can be calculated with MDAnalysis.lib.distances.calc_bonds.
|
@BradyAJohnston wow, that's super impressive results, thanks! Some comments:
Just to make sure I understand this correctly: the current implementation is plain wrong when the boundary condition is crossed, since it just computes the wrong distance, right? I mean this part: # compute distances and energy as previously but only for the potential pairs
# still returning the same energy matrix that would have otherwise been made
o_indices = pairs[:, 1]
n_indices = pairs[:, 0]
d_on = np.linalg.norm(o_atoms[o_indices] - n_atoms[n_indices], axis=-1)
d_ch = np.linalg.norm(c_atoms[o_indices] - h_1[n_indices], axis=-1)
d_oh = np.linalg.norm(o_atoms[o_indices] - h_1[n_indices], axis=-1)
d_cn = np.linalg.norm(c_atoms[o_indices] - n_atoms[n_indices], axis=-1)so we're looking for periodic contition aware d_on = np.linalg.norm(capped_distance(o_atoms[o_indices], n_atoms[n_indices]))
...When defaulting to |
|
Hi @marinegor yes if We could just have multiple If there was just a In the current setup while we do detect periodic bonds to check - the results end up being wrong like you say. |
perhaps there are some functions in distopia (namely,
do you see this being a separate issue, perhaps if you don't find a quick solution to use among |
@BradyAJohnston have a look at |
Please don't try to use distopia directly, it's meant to be an (optional for now) backend replacement. |
|
Thanks for the suggestion @IAlibay - this would then just become as such: d_on = np.linalg.norm(minimize_vectors(o_atoms[o_indices] - n_atoms[n_indices], box=box), axis=-1)
d_ch = np.linalg.norm(minimize_vectors(c_atoms[o_indices] - h_1[n_indices], box=box), axis=-1)
d_oh = np.linalg.norm(minimize_vectors(o_atoms[o_indices] - h_1[n_indices], box=box), axis=-1)
d_cn = np.linalg.norm(minimize_vectors(c_atoms[o_indices] - n_atoms[n_indices], box=box), axis=-1)The usage of |
I guess so, yeah -- it's just a norm calculation, there should not be any problem with that as long as the vector itself is boundary-corrected. As for cryoEM default, the placeholder unit cell is UPD: added this remark to changes request. |
| dssp = assign( | ||
| coords, | ||
| donor_mask=self._donor_mask, | ||
| box=self._trajectory.ts.dimensions, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if dimensions are (1, 1, 1, 90, 90, 90) (cryoEM placeholder), just set to None.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This shouldn't be done here, but rather when loading the PDB - there is an a bad edge case where you might actually just have that unit box.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed we do this already: https://github.com/MDAnalysis/mdanalysis/blob/develop/package/MDAnalysis/coordinates/PDB.py#L228-L231
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On further thought, I wouldn't be completely against this check - maybe a warning would be ok too if we don't want to be too strict
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On further thought, I wouldn't be completely against this check - maybe a warning would be ok too if we don't want to be too strict
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not 100% with how this should be handled. If I am raising a warning in _single_frame won't that warning be raised a bunch of times for every frame?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
def _single_frame(self):
coords = self._get_coords()
DEFAULT_BOX_DIMENSIONS = (1, 1, 1, 90, 90, 90)
box = self._trajectory.ts.dimensions
if box == DEFAULT_BOX_DIMENSIONS:
box = None
warnings.warn(
f"Default box dimensions {DEFAULT_BOX_DIMENSIONS} are treated as no box"
)
dssp = assign(
coords,
donor_mask=self._donor_mask,
box=self._trajectory.ts.dimensions,
)
self.results.dssp_ndarray.append(dssp)There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@BradyAJohnston - how about making that warning happen in init? The atomgroups are set there, so it would be best to do it when the class is created.
Yep, don't trust me fully (I'm sleep deprived), but that sounds reasonable. |
|
One thing, it probably shouldn't happen in `_single_frame` but rather in `__init__`, otherwise i'm ok with a warning.
|
| if np.array_equal(self._box, (1, 1, 1, 90, 90, 90)): | ||
| self._box = None | ||
| warnings.warn( | ||
| "Box dimensions are (1, 1, 1, 90, 90, 90), not using periodic boundary conditions in DSSP calculations" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not 100% on what the warning should be - but this is how we are imagining it? @IAlibay
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes that looks about right. With regards to the warning, your message is close enough to what I was thinking, i.e. something like "Default unit cells of 1 A cubed found. This likely comes from a cryo-em file. To avoid usses, we are disabling periodic boundary conditions in the DSSP calculation".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A warning is currently already raised twice: once when loading the universe and again when running the DSSP (not sure why it is happening for the run(). As the unit cell dimensions are already set to None it's not triggering the new warning. In most cases the unit cell wouldn't make it "all the way" to DSSP but I'll change it to just be the same warning message.
In [9]: p = Path("/Users/brady/Downloads/9N9M.pdb")
In [10]: u = mda.Universe(p)
/Users/brady/git/mdanalysis/package/MDAnalysis/coordinates/PDB.py:479: UserWarning: 1 A^3 CRYST1 record, this is usually a placeholder. Unit cell dimensions will be set to None.
warnings.warn(
In [11]: a = dssp.DSSP(u)
In [12]: a.run()
/Users/brady/git/mdanalysis/package/MDAnalysis/coordinates/PDB.py:479: UserWarning: 1 A^3 CRYST1 record, this is usually a placeholder. Unit cell dimensions will be set to None.
warnings.warn(
/Users/brady/git/mdanalysis/package/MDAnalysis/analysis/base.py:562: UserWarning: Reader has no dt information, set to 1.0 ps
self.times[idx] = ts.time
Out[12]: <MDAnalysis.analysis.dssp.dssp.DSSP at 0x110d4c910>There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah this is what I was trying to convey earlier in the conversation when I was saying that the PDB reader takes care of that edge case already. I'm not sure what the best approach is here, thoughts @marinegor ?
| return np.linalg.norm(x[o_indices] - y[n_indices], axis=-1) | ||
| else: | ||
| return np.linalg.norm( | ||
| minimize_vectors(x[o_indices] - y[n_indices], box=box), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
or use directly MDAnalysis.lib.distances.calc_bonds
return calc_bonds(x[o_indices], y[n_indices], box=box)There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah this is good to know! (and yeah I missed it from naming). Will update to use that instead of this combo.
orbeckst
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good idea to use our distance functions. (see comments on using calc_bonds, whose name betrays its flexibility and thus may not be widely know as the the function for PBC-aware distances.
I would add an explicit backend keyword argument to all calls of our distance functions (and a way for the user to also set the backend when invoking the class) so that it becomes possible to enable parallel/distopia backends.
|
@marinegor may I assign you as the PR shepherd? If you don't have the bandwidth, please unassign yourself. |
Changes made in this Pull Request:
This updates the
get_hbond_mapfunction to useMDAnalysis.lib.distances.capped_distanceto speed up distance calculations. Nothing changes in the API, but rather than computing a pairwise distance matrix of every residue against every other residue, we can first construct a KDTree for both theNandOatoms and use that for finding relevant pairs within a certain cutoff distance. The returned value fromget_hbond_mapremains the same but we have avoided having to compute distances and energies between residues that would never form an HBond to begin with.In my benchmarks running just the
get_hbond_mapfunction this produces a speedup from 2-5x, with that difference continuing to increase with size of the protein system:developThis PR
PR Checklist
package/CHANGELOGfile updated?package/AUTHORS? (If it is not, add it!)Developers Certificate of Origin
I certify that I can submit this code contribution as described in the Developer Certificate of Origin, under the MDAnalysis LICENSE.
📚 Documentation preview 📚: https://mdanalysis--5182.org.readthedocs.build/en/5182/