Fix CIF parsing failures for structures with numeric values and relative paths by adobles96 · Pull Request #39 · steineggerlab/folddisco

adobles96 · 2026-03-27T19:31:18Z

Problem

Folddisco's CIF parser crashes or silently corrupts data on a number of real-world RCSB structures. We hit these while building folddisco indices at scale.

Fixes

1. Lexer: `.` and `?` misparsed when followed by non-whitespace (`lib/pdbtbx-cif/src/lib.rs`)

The lexer treated . as Inapplicable and ? as Unknown regardless of what followed. In PDB 9A1O, the field _ihm_external_files.file_path contains unquoted relative paths like ./EM/part_4.mrc. The leading . was consumed as Inapplicable, leaving /EM/part_4.mrc as an orphan token and breaking loop parsing with "Loop has incorrect number of data items".

Fix: . and ? are only treated as special values when followed by whitespace or EOF. Otherwise they're parsed as text identifiers.

2. Numeric residue/ligand names cause panic (`src/structure/io/cif.rs`)

All-numeric ligand codes (e.g. 919 in PDB 6MWE, the Tie2 inhibitor DCC-2036) are lexed as Value::Numeric(919.0). get_three_char_array and get_four_char_array only handled Value::Text, returning None and triggering the panic at line 254: "Residue name should be provided".

Fix: get_three_char_array, get_four_char_array, and get_one_char now convert Value::Numeric to its string representation before processing.

3. Multi-character chain IDs cause error (`src/structure/io/cif.rs`)

Large cryo-EM structures like PDB 9A1O have multi-character chain ids (e.g. auth_asym_id = 10). get_one_char errored because "10" can't fit in a single byte.

Fix: get_one_char returns None for multi-character chain IDs, allowing the caller to fall back to label_asym_id (which is typically single-character). NOTE: this is only a band-aid fix. The proper fix is to allow chain ids to be 4 bytes (thus matching mmCIF).

Tests

Added unit tests covering:

Numeric Value handling in get_three_char_array, get_four_char_array, get_one_char
Multi-digit chain ID fallback
Lexer behavior for . and ? in various positions (bare, before digits, before path characters, at EOF)

The pdbtbx-cif lexer parses unquoted all-numeric tokens like `919` (a valid PDB ligand comp_id) as Value::Numeric instead of Value::Text. The CIF field parsers (get_three_char_array, get_four_char_array, get_one_char) only handled Value::Text, returning None for Numeric values, which caused a panic at cif.rs:254 ("Residue name should be provided") when indexing structures like PDB 6MWE. Fix: coerce Value::Numeric to a string in all three text-field parsers before applying the existing length-based conversion logic. Reproducer: `folddisco index` on any CIF with an all-numeric comp_id (e.g. RCSB 6MWE which contains ligand 919). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…x-cif) The CIF lexer treated any token starting with '.' as Inapplicable and '?' as Unknown, consuming only the single character. This broke parsing of CIF files containing unquoted values like './path/to/file' (e.g. in _ihm_external_files.file_path) — the '.' was consumed as Inapplicable, leaving '/path/to/file' as a stray token that threw off the loop item count, causing "Loop has incorrect number of data items". Fix: only treat '.' and '?' as special values when they stand alone (followed by whitespace or EOF). When followed by other characters: - '.' + digit → try numeric parse (e.g. '.42'), fall back to text - '.' + non-digit → parse as text identifier (e.g. './path') - '?' + any char → parse as text identifier (e.g. '?foo') Reproducer: RCSB 9A1O which contains `./EM/...` paths in _ihm_external_files.file_path. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Field parser tests (cif.rs): - get_three_char_array with Numeric(919) → "919" (ligand code) - get_four_char_array with Numeric(10) → " 10 " (atom name) - get_one_char with Numeric(1) and Numeric(10) (single and multi-digit chain IDs) Lexer tests (pdbtbx-cif): - '.' followed by digit parses as numeric (e.g. '.5' → 0.5) - '.' followed by path parses as text (e.g. './EM/part_4.mrc') - '.' and '?' at end of input parse as Inapplicable/Unknown Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Truncating "10" to b'1' would silently merge different chains, corrupting the structural data and geometric hashes. Instead, return None for multi-character values so the caller falls through to label_asym_id (which is typically a proper single-char like "A"). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…t-parsing Fix CIF parsing failures for structures with numeric values and relative paths

Large cryo-EM structures (e.g. PDB 9A1O) have multi-character chain IDs like "10" in auth_asym_id. Previously these were truncated to a single byte. This commit widens chain storage to String throughout the pipeline: - AtomVector, Structure, CompactStructure chain fields: u8 -> String - ResidueMatch type: (u8, u64) -> (String, u64) - CIF parser: replace get_one_char with get_text_string (preserves full chain name) - PDB/FCZ parsers: convert at boundary, Atom struct unchanged for FFI compat - Query format: underscore separator (e.g. "AA_250") for unambiguous multi-char chains, with backward compat for single-char legacy format ("A250") - All output formatting uses CHAIN_RESIDUE separator Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…sname-fields feat: widen chain IDs from u8 to String for multi-char CIF support

adobles96 and others added 8 commits March 25, 2026 15:59

manual edits

95c9c0f

Merge pull request #1 from genesistherapeutics/fix/cif-numeric-and-do…

d30dfd3

…t-parsing Fix CIF parsing failures for structures with numeric values and relative paths

Merge pull request #2 from genesistherapeutics/fix/widen-chain-and-re…

56894e7

…sname-fields feat: widen chain IDs from u8 to String for multi-char CIF support

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix CIF parsing failures for structures with numeric values and relative paths#39

Fix CIF parsing failures for structures with numeric values and relative paths#39
adobles96 wants to merge 8 commits intosteineggerlab:masterfrom
genesistherapeutics:master

adobles96 commented Mar 27, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

adobles96 commented Mar 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Problem

Fixes

1. Lexer: . and ? misparsed when followed by non-whitespace (lib/pdbtbx-cif/src/lib.rs)

2. Numeric residue/ligand names cause panic (src/structure/io/cif.rs)

3. Multi-character chain IDs cause error (src/structure/io/cif.rs)

Tests

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

adobles96 commented Mar 27, 2026 •

edited

Loading

1. Lexer: `.` and `?` misparsed when followed by non-whitespace (`lib/pdbtbx-cif/src/lib.rs`)

2. Numeric residue/ligand names cause panic (`src/structure/io/cif.rs`)

3. Multi-character chain IDs cause error (`src/structure/io/cif.rs`)