Skip to content

Fix CIF parsing failures for structures with numeric values and relative paths#39

Open
adobles96 wants to merge 8 commits intosteineggerlab:masterfrom
genesistherapeutics:master
Open

Fix CIF parsing failures for structures with numeric values and relative paths#39
adobles96 wants to merge 8 commits intosteineggerlab:masterfrom
genesistherapeutics:master

Conversation

@adobles96
Copy link
Copy Markdown

@adobles96 adobles96 commented Mar 27, 2026

Problem

Folddisco's CIF parser crashes or silently corrupts data on a number of real-world RCSB structures. We hit these while building folddisco indices at scale.

Fixes

1. Lexer: . and ? misparsed when followed by non-whitespace (lib/pdbtbx-cif/src/lib.rs)

The lexer treated . as Inapplicable and ? as Unknown regardless of what followed. In PDB 9A1O, the field _ihm_external_files.file_path contains unquoted relative paths like ./EM/part_4.mrc. The leading . was consumed as Inapplicable, leaving /EM/part_4.mrc as an orphan token and breaking loop parsing with "Loop has incorrect number of data items".

Fix: . and ? are only treated as special values when followed by whitespace or EOF. Otherwise they're parsed as text identifiers.

2. Numeric residue/ligand names cause panic (src/structure/io/cif.rs)

All-numeric ligand codes (e.g. 919 in PDB 6MWE, the Tie2 inhibitor DCC-2036) are lexed as Value::Numeric(919.0). get_three_char_array and get_four_char_array only handled Value::Text, returning None and triggering the panic at line 254: "Residue name should be provided".

Fix: get_three_char_array, get_four_char_array, and get_one_char now convert Value::Numeric to its string representation before processing.

3. Multi-character chain IDs cause error (src/structure/io/cif.rs)

Large cryo-EM structures like PDB 9A1O have multi-character chain ids (e.g. auth_asym_id = 10). get_one_char errored because "10" can't fit in a single byte.

Fix: get_one_char returns None for multi-character chain IDs, allowing the caller to fall back to label_asym_id (which is typically single-character). NOTE: this is only a band-aid fix. The proper fix is to allow chain ids to be 4 bytes (thus matching mmCIF).

Tests

Added unit tests covering:

  • Numeric Value handling in get_three_char_array, get_four_char_array, get_one_char
  • Multi-digit chain ID fallback
  • Lexer behavior for . and ? in various positions (bare, before digits, before path characters, at EOF)

adobles96 and others added 8 commits March 25, 2026 15:59
The pdbtbx-cif lexer parses unquoted all-numeric tokens like `919`
(a valid PDB ligand comp_id) as Value::Numeric instead of Value::Text.
The CIF field parsers (get_three_char_array, get_four_char_array,
get_one_char) only handled Value::Text, returning None for Numeric
values, which caused a panic at cif.rs:254 ("Residue name should be
provided") when indexing structures like PDB 6MWE.

Fix: coerce Value::Numeric to a string in all three text-field parsers
before applying the existing length-based conversion logic.

Reproducer: `folddisco index` on any CIF with an all-numeric comp_id
(e.g. RCSB 6MWE which contains ligand 919).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…x-cif)

The CIF lexer treated any token starting with '.' as Inapplicable and
'?' as Unknown, consuming only the single character. This broke parsing
of CIF files containing unquoted values like './path/to/file' (e.g. in
_ihm_external_files.file_path) — the '.' was consumed as Inapplicable,
leaving '/path/to/file' as a stray token that threw off the loop item
count, causing "Loop has incorrect number of data items".

Fix: only treat '.' and '?' as special values when they stand alone
(followed by whitespace or EOF). When followed by other characters:
- '.' + digit → try numeric parse (e.g. '.42'), fall back to text
- '.' + non-digit → parse as text identifier (e.g. './path')
- '?' + any char → parse as text identifier (e.g. '?foo')

Reproducer: RCSB 9A1O which contains `./EM/...` paths in
_ihm_external_files.file_path.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Field parser tests (cif.rs):
- get_three_char_array with Numeric(919) → "919" (ligand code)
- get_four_char_array with Numeric(10) → " 10 " (atom name)
- get_one_char with Numeric(1) and Numeric(10) (single and multi-digit chain IDs)

Lexer tests (pdbtbx-cif):
- '.' followed by digit parses as numeric (e.g. '.5' → 0.5)
- '.' followed by path parses as text (e.g. './EM/part_4.mrc')
- '.' and '?' at end of input parse as Inapplicable/Unknown

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Truncating "10" to b'1' would silently merge different chains,
corrupting the structural data and geometric hashes. Instead, return
None for multi-character values so the caller falls through to
label_asym_id (which is typically a proper single-char like "A").

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…t-parsing

Fix CIF parsing failures for structures with numeric values and relative paths
Large cryo-EM structures (e.g. PDB 9A1O) have multi-character chain IDs
like "10" in auth_asym_id. Previously these were truncated to a single
byte. This commit widens chain storage to String throughout the pipeline:

- AtomVector, Structure, CompactStructure chain fields: u8 -> String
- ResidueMatch type: (u8, u64) -> (String, u64)
- CIF parser: replace get_one_char with get_text_string (preserves full chain name)
- PDB/FCZ parsers: convert at boundary, Atom struct unchanged for FFI compat
- Query format: underscore separator (e.g. "AA_250") for unambiguous multi-char chains,
  with backward compat for single-char legacy format ("A250")
- All output formatting uses CHAIN_RESIDUE separator

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…sname-fields

feat: widen chain IDs from u8 to String for multi-char CIF support
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant