Motivation
gtfparse today uses polars to do the CSV-style read of GTF/GFF files, then immediately converts to pandas for everything downstream. There's a load-bearing comment in read_gtf.py explaining the conversion is there because "Polars bugs manifest as pyo3_runtime.PanicException: assertion left == right failed: impl error and are generally insane to chase down." That, plus a steady stream of issues caused by polars API renames (#36, #47, #50, #54), plus the global string cache footgun guarded by polars.enable_string_cache() in parse_with_polars_lazy, suggests the polars dependency isn't earning back its cost for a small TSV parser. Filing this issue to evaluate the migration on numbers.
What polars actually buys us today
Fast threaded CSV reader with column-type overrides.
Lazy plan for the features= row filter (minor: files are small enough for eager evaluation).
Categorical dtype for seqname/source/feature/strand.
That's it. The rest of read_gtf (attribute expansion, alias rename, version casting, biotype inference, usecols filter, result_type conversion) is all pandas.
What it costs us
Bug surface: opaque PyO3 panics; the explicit to_pandas() round-trip in the middle of read_gtf exists because of them.
API churn: #36 (scan_csv sep kwarg removed), #50 (dtypes kwarg deprecated), #47 (pyarrow pinning), and open PR #54 (polars version + warnings) are all instances of the same maintenance tax.
String-cache global state: polars.enable_string_cache() is process-global. We added test_version_cast_is_idempotent_via_double_read to prove repeated reads don't bite us; that test wouldn't exist with pyarrow.
Two-DataFrame API surface: result_type="polars" / "pandas" / "dict" means we maintain three output shapes against two underlying libraries. Dropping polars collapses this.
Proposed migration target
Replace the polars.read_csv path with pyarrow.csv.read_csv (or pyarrow.csv.open_csv for streaming), hand off to pandas via Table.to_pandas(types_mapper=pd.ArrowDtype, self_destruct=True, split_blocks=True). Concretely:
Known pyarrow gaps to engineer around
No comment="#" shortcut. Use skip_rows + an invalid_row_handler that returns "skip" when row.text.startswith("#"). The handler only fires on column-count mismatch, so a #!genome-build header (1 field vs expected 9) triggers it cleanly.
All-or-nothing quoting. quote_char=False is the right call for GTF since Ensembl release 78 famously has unescaped quotes; the existing fix_quotes_columns post-processing in parse_with_polars_lazy can move into the new path as a pandas-side regex.
23.0.0 type-inference regression on extreme scientific-notation strings ([Python] CSV reader returns different values in 23.0.0, apache/arrow#49003); fixed in 23.0.1. We already pin types defensively for the columns that matter, so this doesn't bite, but it's a reminder to always declare column_types and not rely on inference.
BOM not auto-stripped. Add a one-time sniff for \xef\xbb\xbf.
Performance expectations (from May 2026 benchmarks)
Independent reproductions of the Polars PDS-H benchmark and the pandas-vs-polars-2025 writeups put us at:
polars.read_csv: ~10x faster than pandas C engine, ~2-3x faster than pyarrow.csv on raw CSV reads at SF-10 / SF-100 scale.
Memory: polars ~179 MB peak on a 1 GB CSV vs pandas numpy backend ~1.4 GB. pyarrow.csv → to_pandas(types_mapper=pd.ArrowDtype) closes most of that gap because buffers stay Arrow.
Net call: ~2-3x slower parse, ~same memory. For typical GTF sizes (the Ensembl human GTF is ~1.3 GB uncompressed, GENCODE comprehensive ~1.5 GB), we're looking at adding 5-15 seconds to a one-shot load that downstream tools cache anyway. Not a blocker; pyensembl and varcode load this once at index time.
What we'd gain
No more global string cache.
No more PyO3 panic exception class: pyarrow exceptions are normal Python ArrowInvalid / ArrowTypeError.
Stable API surface: pyarrow CSV options have been stable across major versions; polars renames its CSV kwargs roughly once a year.
Native nullable Int64 for the *_version columns without the pd.to_numeric(...).astype("Int64") dance.
Native arrow-backed string columns via dtype_backend="pyarrow"; categoricals become a pandas extension dtype instead of a polars-side Categorical that converts oddly.
Table.filter(pyarrow.compute.field("feature")...) for the features= filter, with the same expressive power as polars' lazy filter.
Users who want polars can still call pl.from_arrow(table) zero-copy.
Decision criteria — when to actually pull the trigger
Concrete benchmark plan to settle the call:
Reproducer file: Ensembl release 114 Homo_sapiens.GRCh38.114.gtf.gz (~60 MB compressed, ~1.4 GB uncompressed). Run read_gtf under timeit / memray with each of:
Re-test with GENCODE v48 primary (gencode.v48.primary_assembly.annotation.gtf) for the GENCODE-specific corner cases.
If the pyarrow path is within 3x of polars on parse time and within 1.5x on total read_gtf time, migrate. If it's worse than that, hold.
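The thresholds above can be written down as a small check. This is a sketch only: best_time and should_migrate are hypothetical names, the timings in the usage lines are made-up placeholders rather than measurements, and "within" is read here as less than or equal to.

```python
import timeit

PARSE_RATIO_LIMIT = 3.0   # pyarrow parse may be at most 3x the polars parse
TOTAL_RATIO_LIMIT = 1.5   # pyarrow read_gtf may be at most 1.5x the polars total


def best_time(fn, repeat: int = 3) -> float:
    """Best-of-N wall-clock seconds for a zero-argument callable."""
    return min(timeit.repeat(fn, number=1, repeat=repeat))


def should_migrate(
    polars_parse: float,
    pyarrow_parse: float,
    polars_total: float,
    pyarrow_total: float,
) -> bool:
    """Apply both ratio limits; both must hold to migrate."""
    parse_ok = pyarrow_parse <= PARSE_RATIO_LIMIT * polars_parse
    total_ok = pyarrow_total <= TOTAL_RATIO_LIMIT * polars_total
    return parse_ok and total_ok


# Placeholder timings in seconds, for illustration only.
print(should_migrate(polars_parse=1.0, pyarrow_parse=2.5,
                     polars_total=10.0, pyarrow_total=14.0))
```

In a real run, best_time would wrap each read_gtf configuration on the reproducer file, with memory measured separately under memray.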
Non-goals
Not migrating attribute expansion (expand_attribute_strings): that's pure-Python and stays as-is.
Not breaking the public API. read_gtf keeps the same kwargs; only the internal parser changes. result_type="polars" becomes a thin pl.from_arrow(table) wrapper (still supported, just no longer the native path).
Not rewriting create_missing_features: it's already pandas.