Replace pickle with safer serialization for shard caching #41

Description

@david-thrower

Background:

The streaming preprocessing path in helix_lm/dataset.py uses pickle to save/load shard cache files (.pkl). While these are internal temporary files, pickle has known security risks (arbitrary code execution on untrusted data) and performance limitations.
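To make the code-execution risk concrete, here is a minimal, harmless sketch using only the standard library. The `Payload` class is hypothetical and not part of HelixLM; it shows that `pickle.loads` calls whatever callable the file's author chose, here the builtin `len` standing in for something malicious.

```python
import pickle


class Payload:
    """Hypothetical object whose pickled form asks the loader to call len("abc")."""

    def __reduce__(self):
        # pickle stores (callable, args); the loader calls callable(*args) on load.
        return (len, ("abc",))


blob = pickle.dumps(Payload())

# Loading does not rebuild a Payload. It runs len("abc") and returns the result.
result = pickle.loads(blob)
print(type(result).__name__, result)
```

An attacker who can write a .pkl file into the cache directory could substitute any importable callable for `len`, which is why the risk only matters if the cache can be modified by an untrusted party.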

Current usage (branch: [43-deterministic-shard-order](https://github.com/david-thrower/HelixLM/tree/43-deterministic-shard-order)):

• pickle.dump() at lines ~755, ~790, ~945 (shard serialization)
• pickle.load() in HelixShardedDataset.__getitem__

Proposed alternatives:

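1. torch.save() paired with torch.load(weights_only=True) - Near drop-in replacement. The file is still pickle-based, but the restricted unpickler used on load is safer than plain pickle.load()
2. safetensors - Preferred for the HF ecosystem, but requires flattening the nested dict structure
3. numpy.savez_compressed() - Simple, but loses torch tensor metadata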
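A rough sketch of option 1, assuming a shard is a List[Dict[str, Tensor]] as described in the recommendation. The field names (`input_ids`, `labels`) and the file name are placeholders, not taken from helix_lm/dataset.py. Note that `weights_only` is an argument to `torch.load`, not `torch.save`.

```python
import os
import tempfile

import torch

# Hypothetical shard: a list of per-sample dicts of tensors (lengths may differ).
shard = [
    {"input_ids": torch.tensor([1, 2, 3]), "labels": torch.tensor([2, 3, 4])},
    {"input_ids": torch.tensor([4, 5]), "labels": torch.tensor([5, 6])},
]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "shard_00000.pt")  # placeholder file name

    # Writing is unchanged apart from swapping pickle.dump for torch.save.
    torch.save(shard, path)

    # weights_only=True restricts unpickling to tensors and plain containers.
    restored = torch.load(path, weights_only=True)

print(len(restored), restored[1]["input_ids"].tolist())
```

If a shard ever contains custom Python objects, loading with `weights_only=True` would be expected to fail, so this option only works while the cache holds tensors and plain containers.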

Trade-offs:

| Approach | Effort | Safety | Performance | Notes |
| --- | --- | --- | --- | --- |
| torch.save + torch.load w/ weights_only | Low | Medium | Similar | Minimal code change |
| safetensors | Medium | High | Fast | Requires data model refactor |
| Keep pickle | None | Low | Baseline | Acceptable for trusted internal cache |

Recommendation:

Evaluate after training stability is confirmed. If we proceed with safetensors, we need to either flatten the List[Dict[str, Tensor]] structure into a single flat mapping or pad samples into uniform tensors per shard.
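The flattening variant could look roughly like the sketch below. The helper names `flatten_shard` and `unflatten_shard` and the `"<index>.<field>"` key scheme are assumptions for illustration, not existing HelixLM code. The helpers are written against generic values so they run without torch; in practice the values would be tensors and the flat dict would be passed to the safetensors save function.

```python
from typing import Any, Dict, List


def flatten_shard(samples: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Turn a list of per-sample dicts into one flat dict keyed '<index>.<field>'."""
    flat: Dict[str, Any] = {}
    for index, sample in enumerate(samples):
        for name, value in sample.items():
            flat[f"{index}.{name}"] = value
    return flat


def unflatten_shard(flat: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Rebuild the list of per-sample dicts from the flat '<index>.<field>' keys."""
    if not flat:
        return []
    # Split on the first dot only, so field names may themselves contain dots.
    count = 1 + max(int(key.split(".", 1)[0]) for key in flat)
    samples: List[Dict[str, Any]] = [{} for _ in range(count)]
    for key, value in flat.items():
        index, name = key.split(".", 1)
        samples[int(index)][name] = value
    return samples


# Plain lists stand in for tensors so the sketch has no third-party dependency.
shard = [
    {"input_ids": [1, 2, 3], "labels": [2, 3, 4]},
    {"input_ids": [4, 5], "labels": [5, 6]},
]
flat = flatten_shard(shard)
roundtrip = unflatten_shard(flat)
print(sorted(flat))
```

One limitation of this key scheme: trailing samples with no fields leave no keys behind, so the sample count would need to be stored separately if empty samples are possible.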

Priority: Low to medium (internal cache files, same-process read/write)
