Describe the bug
BatchedDataDict.shard_by_batch_size() currently performs incremental concatenation inside the chunk loop for tensor and PackedTensor fields. Because each concat call allocates a new output tensor and re-copies everything accumulated so far, the total copying cost grows quadratically with the number of chunks. The overhead becomes more noticeable as the rollout sample count per step increases.
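A minimal sketch of the pattern (not the actual BatchedDataDict code; plain Python lists stand in for tensors, but the same cost model applies to calling torch.cat inside the loop versus once at the end):

```python
# Contrast incremental concatenation, which re-copies the accumulated
# output on every iteration, with collecting chunks and concatenating once.

def concat_incremental(chunks):
    # O(n^2) total copying: `out + chunk` allocates a new sequence and
    # copies everything accumulated so far, on every iteration.
    out = []
    for chunk in chunks:
        out = out + chunk
    return out

def concat_once(chunks):
    # O(n) total copying: gather references to the chunks first,
    # then perform a single final concatenation.
    parts = []
    for chunk in chunks:
        parts.append(chunk)
    return [x for part in parts for x in part]

chunks = [[i] * 4 for i in range(8)]
assert concat_incremental(chunks) == concat_once(chunks)
```

In the tensor case, the fix analogous to `concat_once` is to append each chunk to a list and call torch.cat (or the PackedTensor equivalent) once after the loop.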