Add merged structure IO and mmCIF parity #74

Closed
heathcliff233 wants to merge 2 commits into steineggerlab:master from heathcliff233:feature/merged-structure-io

@heathcliff233

Summary

This PR adds practical multi-chain support while keeping FCZ chunk storage unchanged (single-chain per chunk).

Highlights

  • Added split/merge workflow for multi-chain/discontinuous structures:
    • Python: compress(..., split=True) and open(..., merge_fragments=True)
  • Added Python/CLI format parity for decompression output:
    • pdb | mmcif | cif
    • CLI: foldcomp decompress --output-format ...
  • Added shared mmCIF writer path in C++ output utilities.
  • Improved robustness by creating/checking parent directories for tar/db outputs.
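The split/merge idea in the first highlight can be illustrated with a minimal, self-contained sketch. This is not the PR's implementation and does not call foldcomp: `split_by_chain` and `merge_fragments` are hypothetical helpers that work on plain PDB text, assuming the chain ID sits in column 22 of each ATOM/HETATM record as in the standard PDB layout.

```python
def split_by_chain(pdb_text):
    """Group ATOM/HETATM records by chain ID (PDB column 22, index 21).

    Returns a list of (chain_id, fragment_text) in order of first appearance.
    Hypothetical helper for illustration; not part of foldcomp.
    """
    chains = {}
    for line in pdb_text.splitlines():
        if line.startswith(("ATOM", "HETATM")) and len(line) > 21:
            chains.setdefault(line[21], []).append(line)
    return [(chain_id, "\n".join(lines)) for chain_id, lines in chains.items()]


def merge_fragments(fragments):
    """Concatenate single-chain fragments back into one structure.

    Also returns the index of each source fragment, mirroring the idea of
    exposing source fragment indices for merged entries.
    """
    merged_lines = []
    source_indices = []
    for index, (_chain_id, text) in enumerate(fragments):
        merged_lines.extend(text.splitlines())
        source_indices.append(index)
    return "\n".join(merged_lines), source_indices
```

In a workflow like the one described, each fragment would be compressed into its own single-chain chunk, and a merged read would reassemble the fragments that belong to the same structure.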

Compatibility

  • Backward compatible with existing FCZ/DB files.
  • Single-chain behavior remains unchanged.

Validation

  • conda run -n foldcomp pytest test -q → 12 passed
  • conda run -n foldcomp ./build.sh test → pass

- add Python split compression and merged fragment database reads

- expose source fragment indices for merged entries

- support format-selectable decompression in Python and CLI (pdb|mmcif|cif)

- add shared mmCIF atom writer in C++ output path

- harden tar/db output path handling with parent-directory checks

- expand tests and docs using existing multichain fixture
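
The parent-directory check mentioned in the commit notes can be sketched as follows. This is only an illustration of the general technique in Python, assuming a hypothetical helper name `ensure_parent_dir`; the PR's actual tar/db handling is not shown in this conversation.

```python
from pathlib import Path


def ensure_parent_dir(output_path):
    """Create the parent directory of output_path if it is missing.

    Raises NotADirectoryError when the parent exists but is a regular file.
    Hypothetical helper for illustration; not part of foldcomp.
    """
    path = Path(output_path)
    parent = path.parent
    if parent.exists() and not parent.is_dir():
        raise NotADirectoryError(f"{parent} exists and is not a directory")
    # parents=True creates intermediate directories; exist_ok=True makes
    # the call safe when the directory is already present.
    parent.mkdir(parents=True, exist_ok=True)
    return path
```

Calling such a helper before opening a tar or database output avoids a late failure after compression work has already been done.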
@heathcliff233
Author

Hi authors, thanks for the great tool and for your efforts in maintaining it. I have checked the error log and applied the required black formatting. The remaining errors appear to come from the GitHub server side, where apt install failed.

This PR aims to add protein multimer support on top of the current storage format. Segment storage already appears to be supported, so I reuse it for multi-chain support and add a layer on top that allows sample-level iteration (plus an additional mmCIF write option via gemmi). I hope it helps.

Best,
Liang

@milot-mirdita
Member

Thanks a lot for the work. However, we have been exploring a different approach for storing full complexes. We have started to implement a container format that stores multiple models, chains, etc. directly in the Foldcomp codebase (and not in the Python API).

It's still incomplete, but the work is here:
https://github.com/milot-mirdita/foldcomp

@heathcliff233
Author

Thanks for the clarification and for sharing the new direction. Glad to know that a container format for full complexes is being developed.

I’ll keep an eye on the progress, and I’d be happy to contribute once the new format stabilizes.

Best,
Liang
