Fix CIF parsing failures for structures with numeric values and relative paths#39
Open
adobles96 wants to merge 8 commits intosteineggerlab:masterfrom
Open
Fix CIF parsing failures for structures with numeric values and relative paths#39adobles96 wants to merge 8 commits intosteineggerlab:masterfrom
adobles96 wants to merge 8 commits intosteineggerlab:masterfrom
Conversation
The pdbtbx-cif lexer parses unquoted all-numeric tokens like `919`
(a valid PDB ligand comp_id) as Value::Numeric instead of Value::Text.
The CIF field parsers (get_three_char_array, get_four_char_array,
get_one_char) only handled Value::Text, returning None for Numeric
values, which caused a panic at cif.rs:254 ("Residue name should be
provided") when indexing structures like PDB 6MWE.
Fix: coerce Value::Numeric to a string in all three text-field parsers
before applying the existing length-based conversion logic.
Reproducer: `folddisco index` on any CIF with an all-numeric comp_id
(e.g. RCSB 6MWE which contains ligand 919).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…x-cif) The CIF lexer treated any token starting with '.' as Inapplicable and '?' as Unknown, consuming only the single character. This broke parsing of CIF files containing unquoted values like './path/to/file' (e.g. in _ihm_external_files.file_path) — the '.' was consumed as Inapplicable, leaving '/path/to/file' as a stray token that threw off the loop item count, causing "Loop has incorrect number of data items". Fix: only treat '.' and '?' as special values when they stand alone (followed by whitespace or EOF). When followed by other characters: - '.' + digit → try numeric parse (e.g. '.42'), fall back to text - '.' + non-digit → parse as text identifier (e.g. './path') - '?' + any char → parse as text identifier (e.g. '?foo') Reproducer: RCSB 9A1O which contains `./EM/...` paths in _ihm_external_files.file_path. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Field parser tests (cif.rs): - get_three_char_array with Numeric(919) → "919" (ligand code) - get_four_char_array with Numeric(10) → " 10 " (atom name) - get_one_char with Numeric(1) and Numeric(10) (single and multi-digit chain IDs) Lexer tests (pdbtbx-cif): - '.' followed by digit parses as numeric (e.g. '.5' → 0.5) - '.' followed by path parses as text (e.g. './EM/part_4.mrc') - '.' and '?' at end of input parse as Inapplicable/Unknown Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Truncating "10" to b'1' would silently merge different chains, corrupting the structural data and geometric hashes. Instead, return None for multi-character values so the caller falls through to label_asym_id (which is typically a proper single-char like "A"). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…t-parsing Fix CIF parsing failures for structures with numeric values and relative paths
Large cryo-EM structures (e.g. PDB 9A1O) have multi-character chain IDs
like "10" in auth_asym_id. Previously these were truncated to a single
byte. This commit widens chain storage to String throughout the pipeline:
- AtomVector, Structure, CompactStructure chain fields: u8 -> String
- ResidueMatch type: (u8, u64) -> (String, u64)
- CIF parser: replace get_one_char with get_text_string (preserves full chain name)
- PDB/FCZ parsers: convert at boundary, Atom struct unchanged for FFI compat
- Query format: underscore separator (e.g. "AA_250") for unambiguous multi-char chains,
with backward compat for single-char legacy format ("A250")
- All output formatting uses CHAIN_RESIDUE separator
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…sname-fields feat: widen chain IDs from u8 to String for multi-char CIF support
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
Folddisco's CIF parser crashes or silently corrupts data on a number of real-world RCSB structures. We hit these while building folddisco indices at scale.
Fixes
1. Lexer:
.and?misparsed when followed by non-whitespace (lib/pdbtbx-cif/src/lib.rs)The lexer treated . as Inapplicable and ? as Unknown regardless of what followed. In PDB 9A1O, the field
_ihm_external_files.file_pathcontains unquoted relative paths like./EM/part_4.mrc. The leading . was consumed as Inapplicable, leaving/EM/part_4.mrcas an orphan token and breaking loop parsing with"Loop has incorrect number of data items".Fix:
.and?are only treated as special values when followed by whitespace or EOF. Otherwise they're parsed as text identifiers.2. Numeric residue/ligand names cause panic (
src/structure/io/cif.rs)All-numeric ligand codes (e.g.
919in PDB 6MWE, the Tie2 inhibitor DCC-2036) are lexed asValue::Numeric(919.0).get_three_char_arrayandget_four_char_arrayonly handledValue::Text, returningNoneand triggering the panic at line 254:"Residue name should be provided".Fix:
get_three_char_array,get_four_char_array, andget_one_charnow convertValue::Numericto its string representation before processing.3. Multi-character chain IDs cause error (
src/structure/io/cif.rs)Large cryo-EM structures like PDB 9A1O have multi-character chain ids (e.g.
auth_asym_id = 10).get_one_charerrored because "10" can't fit in a single byte.Fix:
get_one_charreturnsNonefor multi-character chain IDs, allowing the caller to fall back tolabel_asym_id(which is typically single-character). NOTE: this is only a band-aid fix. The proper fix is to allow chain ids to be 4 bytes (thus matching mmCIF).Tests
Added unit tests covering: