Replace pickle with safer serialization for shard caching
Background:
The streaming preprocessing path in helix_lm/dataset.py uses pickle to save/load shard cache files (.pkl). While these are internal temporary files, pickle has known security risks (unpickling a tampered or untrusted file can execute arbitrary code) and performance limitations.
Current usage:
• pickle.dump() at lines ~755, ~790, ~945 (shard serialization)
• pickle.load() in `HelixShardedDataset.__getitem__`
Proposed alternatives:
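1. torch.save() for writing, torch.load(weights_only=True) for reading - Near drop-in replacement; the restricted unpickler is safer than plain pickle.load()
2. safetensors - Preferred for the HF ecosystem, but requires flattening the nested dict structure
3. numpy.savez_compressed() - Simple, but loses torch tensor metadata (e.g. device, requires_grad)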
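A minimal sketch of option 1, assuming each shard is a List[Dict[str, Tensor]] as described in the recommendation. Note that weights_only is an argument of torch.load(), not torch.save(). The helper names save_shard and load_shard, the field names, and the .pt file name are hypothetical and do not come from helix_lm/dataset.py.

```python
import os
import tempfile
from typing import Dict, List

import torch

# Assumed shard layout: one dict of tensors per sample.
Shard = List[Dict[str, torch.Tensor]]


def save_shard(samples: Shard, path: str) -> None:
    """Write one shard to disk (hypothetical helper)."""
    torch.save(samples, path)


def load_shard(path: str) -> Shard:
    """Read one shard back with the restricted unpickler (hypothetical helper)."""
    # weights_only=True limits unpickling to tensors, primitive types and
    # plain containers such as list and dict.
    return torch.load(path, weights_only=True)


if __name__ == "__main__":
    shard = [
        {"input_ids": torch.tensor([1, 2, 3]), "labels": torch.tensor([2, 3, 4])},
        {"input_ids": torch.tensor([5, 6]), "labels": torch.tensor([6, 7])},
    ]
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "shard_00000.pt")
        save_shard(shard, path)
        restored = load_shard(path)
    # Variable-length samples survive the round trip unchanged.
    assert torch.equal(restored[1]["input_ids"], shard[1]["input_ids"])
```

Because the nested list-of-dicts layout is kept as-is, this variant would likely only touch the dump/load call sites, which is why it sits at the low-effort end of the trade-offs.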
Trade-offs:
| Approach | Effort | Safety | Performance | Notes |
|---|---|---|---|---|
| torch.save + torch.load(weights_only=True) | Low | Medium | Similar | Minimal code change |
| safetensors | Medium | High | Fast | Requires data model refactor |
| Keep pickle | None | Low | Baseline | Acceptable for trusted internal cache |
Recommendation:
Evaluate once training stability is confirmed. If we proceed with safetensors, we need to either flatten the List[Dict[str, Tensor]] structure or pad samples to uniform tensors per shard.
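The flattening route could look roughly like the sketch below. It maps a list of per-sample dicts to a single flat dict with keys of the form "<index>.<field>", which is the shape a flat tensor store expects. The function names, the separator, and the key scheme are assumptions for illustration, not existing HelixLM code. The helpers are type-agnostic, so the demo uses plain lists in place of tensors.

```python
from typing import Any, Dict, List

SEP = "."  # hypothetical separator; field names must not contain it


def flatten_shard(samples: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Turn [{'input_ids': t0, ...}, ...] into {'0.input_ids': t0, ...}."""
    flat: Dict[str, Any] = {}
    for index, sample in enumerate(samples):
        for field, value in sample.items():
            if SEP in field:
                raise ValueError(f"field name {field!r} contains {SEP!r}")
            flat[f"{index}{SEP}{field}"] = value
    return flat


def unflatten_shard(flat: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Inverse of flatten_shard; assumes every sample had at least one field."""
    by_index: Dict[int, Dict[str, Any]] = {}
    for key, value in flat.items():
        index_text, field = key.split(SEP, 1)
        by_index.setdefault(int(index_text), {})[field] = value
    # Sort numerically so sample 10 comes after sample 2.
    return [by_index[i] for i in sorted(by_index)]


if __name__ == "__main__":
    samples = [
        {"input_ids": [1, 2, 3], "labels": [2, 3, 4]},
        {"input_ids": [5, 6], "labels": [6, 7]},
    ]
    flat = flatten_shard(samples)
    assert unflatten_shard(flat) == samples
```

With real tensors as values, the flat dict could then be written with safetensors.torch.save_file(flat, path) and read back with safetensors.torch.load_file(path). This keeps variable-length samples without padding, at the cost of one key per sample and field. Padding to uniform tensors would instead give one key per field.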
Priority: Low to medium (internal cache files, same-process read/write)
Branch for the current usage listed above: [43-deterministic-shard-order](https://github.com/david-thrower/HelixLM/tree/43-deterministic-shard-order)