Describe the bug
BatchedDataDict.shard_by_batch_size() currently performs incremental concatenation inside the chunk loop for tensor and PackedTensor fields. Because each concat call allocates a new output tensor and re-copies everything accumulated so far, the total copying cost grows quadratically with the number of chunks. The overhead becomes more noticeable as the rollout sample count per step increases.
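A minimal sketch of the pattern (not the actual BatchedDataDict code; plain Python lists stand in for tensors, but the same cost model applies to calling torch.cat inside the loop versus once at the end):

```python
# Contrast incremental concatenation, which re-copies the accumulated
# output on every iteration, with collecting chunks and concatenating once.

def concat_incremental(chunks):
    # O(n^2) total copying: `out + chunk` allocates a new sequence and
    # copies everything accumulated so far, on every iteration.
    out = []
    for chunk in chunks:
        out = out + chunk
    return out

def concat_once(chunks):
    # O(n) total copying: gather references to the chunks first,
    # then perform a single final concatenation.
    parts = []
    for chunk in chunks:
        parts.append(chunk)
    return [x for part in parts for x in part]

chunks = [[i] * 4 for i in range(8)]
assert concat_incremental(chunks) == concat_once(chunks)
```

In the tensor case, the fix analogous to `concat_once` is to append each chunk to a list and call torch.cat (or the PackedTensor equivalent) once after the loop.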