adds a new sort utility to order h5ad files for faster cell load #251
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,72 @@ | ||
| import argparse as ap | ||
|
|
||
|
|
||
| def add_arguments_sort(parser: ap.ArgumentParser): | ||
| """Add arguments for the sort subcommand.""" | ||
| parser.add_argument( | ||
| "--input", | ||
| type=str, | ||
| required=True, | ||
| help="Path to input AnnData file (.h5ad)", | ||
| ) | ||
| parser.add_argument( | ||
| "--output", | ||
| type=str, | ||
| required=True, | ||
| help="Path to output sorted AnnData file (.h5ad)", | ||
| ) | ||
| parser.add_argument( | ||
| "--context-col", | ||
| type=str, | ||
| required=True, | ||
| help="obs column to sort by context (e.g. cell type)", | ||
| ) | ||
| parser.add_argument( | ||
| "--batch-col", | ||
| type=str, | ||
| required=False, | ||
| default=None, | ||
| help="optional obs column to sort by batch (if omitted, sorts by context + perturbation)", | ||
| ) | ||
| parser.add_argument( | ||
| "--pert-col", | ||
| type=str, | ||
| required=True, | ||
| help="obs column to sort by perturbation", | ||
| ) | ||
|
|
||
|
|
||
| def run_tx_sort(args: ap.Namespace): | ||
| import logging | ||
| from pathlib import Path | ||
|
|
||
| import anndata as ad | ||
|
|
||
| logging.basicConfig(level=logging.INFO) | ||
|
|
||
| logger = logging.getLogger(__name__) | ||
|
|
||
| input_path = args.input | ||
| output_path = args.output | ||
| sort_cols = [args.context_col] | ||
| if args.batch_col: | ||
| sort_cols.append(args.batch_col) | ||
| sort_cols.append(args.pert_col) | ||
|
|
||
| logger.info("Loading AnnData from %s", input_path) | ||
| adata = ad.read_h5ad(input_path) | ||
|
|
||
| missing = [col for col in sort_cols if col not in adata.obs.columns] | ||
| if missing: | ||
| raise ValueError(f"Missing obs columns for sorting: {missing}") | ||
|
|
||
| logger.info("Sorting AnnData by columns: %s", sort_cols) | ||
| order = adata.obs.sort_values(by=sort_cols, kind="mergesort").index | ||
| adata_sorted = adata[order].copy() | ||
|
|
||
| output_dir = Path(output_path).parent | ||
| if output_dir and not output_dir.exists(): | ||
|
|
||
| output_dir.mkdir(parents=True, exist_ok=True) | ||
|
|
||
| logger.info("Writing sorted AnnData to %s", output_path) | ||
| adata_sorted.write_h5ad(output_path) | ||
| logger.info("Sort complete. Wrote %d cells.", adata_sorted.n_obs) | ||
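
The ordering the diff builds (context first, then the optional batch column, then perturbation, with ties left in their original order) can be sketched without `anndata` or pandas. This is only an illustration under stated assumptions: `build_sort_cols` and `stable_order` are hypothetical helper names, the toy `obs` rows and their values are invented, and Python's built-in stable `sorted` stands in for `sort_values(kind="mergesort")`. Pandas may order categorical columns or missing values differently.

```python
def build_sort_cols(context_col, pert_col, batch_col=None):
    """Mirror the column order used in the diff: context, optional batch, perturbation."""
    cols = [context_col]
    if batch_col:
        cols.append(batch_col)
    cols.append(pert_col)
    return cols


def stable_order(rows, sort_cols):
    """Return row indices sorted by sort_cols; rows with equal keys keep their input order."""
    missing = [c for c in sort_cols if any(c not in row for row in rows)]
    if missing:
        raise ValueError(f"Missing obs columns for sorting: {missing}")
    # sorted() is stable, which plays the role of kind="mergesort" here.
    return sorted(range(len(rows)), key=lambda i: tuple(rows[i][c] for c in sort_cols))


# Toy stand-in for adata.obs (hypothetical column names and values).
obs = [
    {"cell_type": "T", "batch": "b2", "gene": "ko1"},
    {"cell_type": "B", "batch": "b1", "gene": "ctrl"},
    {"cell_type": "T", "batch": "b1", "gene": "ctrl"},
    {"cell_type": "B", "batch": "b1", "gene": "ctrl"},
]

cols = build_sort_cols("cell_type", "gene", batch_col="batch")
order = stable_order(obs, cols)
print(cols)   # ['cell_type', 'batch', 'gene']
print(order)  # [1, 3, 2, 0]
```

Rows 1 and 3 share the same key and stay in their original relative order, which is the property a stable sort guarantees.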
For better code organization and adherence to PEP 8, standard library imports like `logging` and `pathlib` should be placed at the top of the file. While it's a good practice to delay importing heavy libraries like `anndata` inside a function for CLI tools to improve startup time for other subcommands, this doesn't apply to lightweight standard libraries.
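
One way the suggestion could look, as a sketch rather than the change actually made in the PR: the standard-library imports move to module level while `anndata` stays deferred inside `run_tx_sort`. The helper `ensure_parent_dir` is a hypothetical name introduced here. It drops the `if output_dir and not output_dir.exists()` guard on the assumption that it is redundant, since a `Path` object is always truthy and `mkdir(parents=True, exist_ok=True)` already tolerates an existing directory.

```python
import argparse as ap
import logging
from pathlib import Path

logger = logging.getLogger(__name__)


def ensure_parent_dir(output_path: str) -> Path:
    """Create the parent directory of output_path if needed and return it (hypothetical helper)."""
    parent = Path(output_path).parent
    parent.mkdir(parents=True, exist_ok=True)
    return parent


def run_tx_sort(args: ap.Namespace) -> None:
    # Deferred on purpose: anndata is the heavy import, so other subcommands
    # do not pay for it at startup.
    import anndata as ad

    logging.basicConfig(level=logging.INFO)

    sort_cols = [args.context_col]
    if args.batch_col:
        sort_cols.append(args.batch_col)
    sort_cols.append(args.pert_col)

    logger.info("Loading AnnData from %s", args.input)
    adata = ad.read_h5ad(args.input)

    missing = [col for col in sort_cols if col not in adata.obs.columns]
    if missing:
        raise ValueError(f"Missing obs columns for sorting: {missing}")

    logger.info("Sorting AnnData by columns: %s", sort_cols)
    order = adata.obs.sort_values(by=sort_cols, kind="mergesort").index
    adata_sorted = adata[order].copy()

    ensure_parent_dir(args.output)
    logger.info("Writing sorted AnnData to %s", args.output)
    adata_sorted.write_h5ad(args.output)
    logger.info("Sort complete. Wrote %d cells.", adata_sorted.n_obs)
```

Importing this module does not load `anndata`; that cost is paid only when `run_tx_sort` is called.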