Conversation
stephantul
left a comment
Lots of small things, nothing major. Most of it relates to prior things we could fix.
> Tokenlearn was developed by the [Minish](https://github.com/MinishLab) team consisting of [Stephan Tulkens](https://github.com/stephantul) and [Thomas van Dongen](https://github.com/Pringled).
>
> ## Citation
Maybe we can add a zenodo citation for the software as well? If that's possible. (not this PR, just future)
```
    {"text": texts, "embedding": [e.tolist() for e in embeddings]},
    features=_FEATURES,
)
part.save_to_disk(str(checkpoints_dir / f"part_{part_idx:08d}"))
```
Regarding saving: you can save this directly as a Parquet file, e.g.

```
shard_00000.parquet
shard_00001.parquet
```

To do this, you can apparently first force the dataset into a single shard, and then write it to Parquet:

```python
single_shard = ds.shard(num_shards=1, index=0)
single_shard.to_parquet(f"shard_{part_idx:08d}.parquet")
```

Note that this only works for `Dataset`s, not `DatasetDict`s.
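A side observation on the zero-padded naming suggested above: lexicographic order matches numeric order, so a plain sorted glob reloads the shards in write order. A minimal sketch (the `shard_name` helper is hypothetical, just illustrating the format string):

```python
# Hypothetical helper: zero-padded shard names sort lexicographically in
# the same order as their indices, so sorted(glob("shard_*.parquet"))
# yields them in write order when reloading.
def shard_name(part_idx: int) -> str:
    return f"shard_{part_idx:08d}.parquet"

names = sorted(shard_name(i) for i in (10, 2, 1))
# names == ["shard_00000001.parquet", "shard_00000002.parquet", "shard_00000010.parquet"]
```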
```
def _compact_checkpoints(checkpoints_dir: Path, output_dir: Path, keep_checkpoints: bool) -> None:
```
If you do the above, you'd just need to write the metadata. But I think you can get away with not writing any metadata tbh.
```
@@ -53,7 +124,7 @@
if i * batch_size >= max_means:
```
Maybe rename this to `max_rows`, or call the other variable `means_done` (I prefer renaming the variable on this line).
```
logger.info(f"Reached maximum number of means: {max_means}")
break
if largest_batch and i <= largest_batch:
if i * batch_size < rows_done:
```
You compute `i * batch_size` twice.
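A hypothetical sketch of that loop (the names `featurize` and `num_batches` are assumptions; `max_means` and `rows_done` come from the diff context), hoisting the repeated product into a single local:

```python
def featurize(num_batches: int, batch_size: int, max_means: int, rows_done: int) -> list[int]:
    """Return the batch indices that would actually be processed."""
    processed = []
    for i in range(num_batches):
        rows_seen = i * batch_size  # computed once, reused by both checks
        if rows_seen >= max_means:
            break
        if rows_seen < rows_done:
            continue  # already covered by a previous run; skip
        processed.append(i)
    return processed
```

With `batch_size=2`, `max_means=12`, and `rows_done=4`, this processes batches 2 through 5.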
```
json.dump(texts, open(output_dir_path / f"feature_{i}.json", "w"), indent=4)
np.save(output_dir_path / f"feature_{i}.npy", embeddings)
_save_checkpoint(checkpoints_dir, texts, embeddings, part_idx)
part_idx += 1
```
`part_idx` is necessarily equal to `i // _SAVE_EVERY`, so there's no need to bookkeep it here.
A follow-up to the above: if someone were to switch batch size after resuming, the resume logic would still work, I guess, but you would no longer be able to derive `part_idx`. So the above holds, except when people switch batch size.

Relying on `i * batch_size` is therefore a bit brittle. What I would suggest instead: reinterpret `_SAVE_EVERY` as the number of items rather than the number of batches. That way, the shard size no longer depends on the batch size. Right now it's actually a bit odd that smaller batch sizes produce much smaller shards. This still wouldn't solve the resume problem, though.
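A sketch of that reinterpretation (the value and the `shard_index` helper are assumptions for illustration): with `_SAVE_EVERY` counted in items, the shard index depends only on how many rows have been written, so it survives a batch-size change across resumes.

```python
_SAVE_EVERY = 1000  # items per shard (illustrative value, not from the PR)

def shard_index(rows_written: int) -> int:
    # Same index whether those rows arrived in batches of 32 or 256.
    return rows_written // _SAVE_EVERY
```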
```
model.max_seq_length = max_length
logger.info(f"Set tokenizer maximum length to {max_length}.")
# Binding i in case the dataset is empty.
i = 0
```
Not sure if this is still necessary (not part of this PR, I know, sorry!).