# Replace pickle with safer serialization for shard caching #41

## Background: 

The streaming preprocessing path in helix_lm/dataset.py uses pickle to save and load shard cache files (.pkl). Although these are internal temporary files, pickle carries known security risks (loading untrusted or tampered data can execute arbitrary code) and performance limitations.
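To make the risk concrete, here is a minimal, harmless sketch using only the standard library. The `Payload` class and the `NoGlobalsUnpickler` name are hypothetical and not part of helix_lm; `eval("6 * 7")` stands in for whatever a tampered cache file might run. The restricted unpickler follows the pattern from the Python pickle documentation and is shown only to illustrate the kind of restriction the safer loaders below rely on.

```python
import io
import pickle


class Payload:
    """Hypothetical stand-in for a tampered cache entry."""

    def __reduce__(self):
        # pickle stores "call eval('6 * 7')" and executes it at load time.
        return (eval, ("6 * 7",))


class NoGlobalsUnpickler(pickle.Unpickler):
    """Unpickler that refuses to resolve any global (class or function)."""

    def find_class(self, module, name):
        raise pickle.UnpicklingError(f"blocked global: {module}.{name}")


blob = pickle.dumps(Payload())

# Plain pickle.loads runs the embedded call instead of rebuilding a Payload.
result = pickle.loads(blob)

# The restricted unpickler rejects the same bytes before anything is called.
try:
    NoGlobalsUnpickler(io.BytesIO(blob)).load()
    blocked = False
except pickle.UnpicklingError:
    blocked = True
```

Data made only of built-in containers and scalars still loads through the restricted unpickler, because it never needs to look up a global.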

## Current usage: (Branch: [43-deterministic-shard-order](https://github.com/david-thrower/HelixLM/tree/43-deterministic-shard-order))

• pickle.dump() at lines ~755, ~790, ~945 (shard serialization)
• pickle.load() in HelixShardedDataset.__getitem__

## Proposed alternatives:

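1. torch.save() paired with torch.load(weights_only=True) - Near drop-in replacement. The file is still pickle-based, but loading is restricted to tensors and plain containers, so it is safer than a bare pickle.load()
2. safetensors - Preferred for the HF ecosystem, but requires flattening the nested dict structure
3. numpy.savez_compressed() - Simple, but loses torch tensor metadata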
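As a rough sketch of the first option, the snippet below round-trips a shard through `torch.save()` and `torch.load(..., weights_only=True)`. The shard contents, the `save_shard`/`load_shard` helper names, and the `.pt` file name are assumptions for illustration; the actual shard layout in helix_lm/dataset.py may differ.

```python
import os
import tempfile

import torch

# Hypothetical shard: one dict of tensors per sample (List[Dict[str, Tensor]]).
shard = [
    {"input_ids": torch.tensor([1, 2, 3]), "labels": torch.tensor([2, 3, 4])},
    {"input_ids": torch.tensor([5, 6]), "labels": torch.tensor([6, 7])},
]


def save_shard(samples, path):
    """Write a shard to disk (hypothetical helper)."""
    torch.save(samples, path)


def load_shard(path):
    """Read a shard back, refusing arbitrary pickled objects."""
    # weights_only is an argument of torch.load, not torch.save.
    return torch.load(path, weights_only=True)


with tempfile.TemporaryDirectory() as tmp:
    shard_path = os.path.join(tmp, "shard_00000.pt")
    save_shard(shard, shard_path)
    restored = load_shard(shard_path)
```

Variable-length samples survive unchanged here, which is why this option needs no change to the data model.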


## Trade-offs:

| Approach | Effort | Safety | Performance | Notes |
|---|---|---|---|---|
| torch.save + torch.load(weights_only=True) | Low | Medium | Similar | Minimal code change |
| safetensors | Medium | High | Fast | Requires data model refactor |
| Keep pickle | None | Low | Baseline | Acceptable for trusted internal cache |

## Recommendation: 

Revisit once training stability is confirmed. If we proceed with safetensors, we will need to either flatten the List[Dict[str, Tensor]] structure into a single flat mapping of names to tensors, or pad samples into uniform tensors per shard.

Priority: Low to medium (internal cache files, same-process read/write)

