Reworking ECOD data wrangling to use ECOD hierarchy IDs instead of manually constructed ones#19
Open
Previously, the code was manually constructing and assigning hierarchy/lineage IDs to the different levels of ECOD annotations (http://prodata.swmed.edu/ecod/af2_pdb/documentation); however, there is a hierarchy file that actually has predefined IDs that we could use: http://prodata.swmed.edu/ecod/distributions/ecod.v294.hierarchy.txt
Another issue with the existing code is that it can overwrite a child's parent ID when the child has multiple parents.
Also, the parameter names are extremely confusing and are reused with different meanings throughout the code.
This PR is intended to replace the manually constructed ID process with the use of predefined ECOD IDs, allow for multiple parents, and clean up the code.
I got through most of this, but at the last moment realized that I am inadvertently overwriting the mappings between certain ECOD group names and their IDs, because apparently even within a group (say, `X`), the same name can be used with a different ID. So, I think the solution will be to create the child: parent mapping based on the `1.1.1.1` string, in which `1.1.1` is the parent, and so forth. Then, `A` group IDs (which are of a different format) would need to be included in the mapping where relevant (based on the content of the ECOD flat file).

There is a lot of commented-out code and notes still in there, but it will all be removed once I figure out this last bit. I just want to create this PR so I don't lose track of it.
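
For reference, here is a rough sketch of the dotted-ID idea described above: the parent of `1.1.1.1` is taken to be `1.1.1`, and so on up the chain. It is only an illustration, not the code in this PR. The function names `parent_of` and `build_child_parent_map` are hypothetical, and it assumes the IDs have already been read from the hierarchy file as plain dotted strings. Keying on the ID rather than the group name avoids the name-collision problem, and storing parents in a set leaves room for more than one parent per child.

```python
from collections import defaultdict
from typing import Dict, Iterable, Optional, Set


def parent_of(ecod_id: str) -> Optional[str]:
    """Return the dotted ID one level up, or None for a top-level ID."""
    head, sep, _ = ecod_id.rpartition(".")
    return head if sep else None


def build_child_parent_map(ecod_ids: Iterable[str]) -> Dict[str, Set[str]]:
    """Map each dotted ID, and each of its ancestors, to a set of parent IDs.

    Hypothetical helper: assumes IDs look like "1.1.1.1" and that dropping
    the last dotted component gives the parent.
    """
    parents: Dict[str, Set[str]] = defaultdict(set)
    for ecod_id in ecod_ids:
        current = ecod_id.strip()
        # Walk up the chain: 1.1.1.1 -> 1.1.1 -> 1.1 -> 1
        while (parent := parent_of(current)) is not None:
            parents[current].add(parent)
            current = parent
    return dict(parents)


if __name__ == "__main__":
    mapping = build_child_parent_map(["1.1.1.1", "1.1.2"])
    for child in sorted(mapping):
        print(child, "->", sorted(mapping[child]))
```

Because the values are sets, `A` group IDs taken from the ECOD flat file could later be added to the relevant entries without replacing the parent derived from the dotted string.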