Evaluate migration from polars to pyarrow + pandas (with measurement plan) #69

@iskandr

Motivation

gtfparse today uses polars to do the CSV-style read of GTF/GFF files, then immediately converts to pandas for everything downstream — there's a load-bearing comment in read_gtf.py explaining the conversion is there because "Polars bugs manifest as pyo3_runtime.PanicException: assertion left == right failed: impl error and are generally insane to chase down." That, plus a steady stream of polars-API-renamed issues (#36, #47, #50, #54), plus the global string cache footgun guarded by polars.enable_string_cache() in parse_with_polars_lazy, suggests the polars dependency is paying its cost without earning back enough value for a small TSV parser. Filing this issue to evaluate the migration on numbers.

What polars actually buys us today

  • Fast threaded CSV reader with column-type overrides.
  • Lazy plan for the features= row filter (minor — files small enough for eager).
  • Cheap categorical encoding of seqname / source / feature / strand.

That's it. The rest of read_gtf (attribute expansion, alias rename, version casting, biotype inference, usecols filter, result_type conversion) is all pandas.

What it costs us

Proposed migration target

Replace the polars.read_csv path with pyarrow.csv.read_csv (or pyarrow.csv.open_csv for streaming), hand off to pandas via Table.to_pandas(types_mapper=pd.ArrowDtype, self_destruct=True, split_blocks=True). Concretely:

import pandas as pd
import pyarrow as pa
import pyarrow.compute as pc
import pyarrow.csv as pacsv

def _read_gtf_arrow(filepath_or_buffer, features=None):
    read_opts = pacsv.ReadOptions(
        column_names=REQUIRED_COLUMNS,
        # skip the leading "#..." header lines before parsing starts
        skip_rows=_count_leading_header_lines(filepath_or_buffer),
    )
    parse_opts = pacsv.ParseOptions(
        delimiter="\t",
        quote_char=False,            # GTFs have unescaped quotes in attributes
        # only called for rows with the wrong column count: drop stray comment lines
        invalid_row_handler=lambda row: "skip" if row.text.startswith("#") else "error",
    )
    convert_opts = pacsv.ConvertOptions(
        null_values=["."],
        strings_can_be_null=True,
        # declare every type up front instead of relying on inference
        column_types={
            "start": pa.int64(), "end": pa.int64(),
            "score": pa.float32(), "frame": pa.uint32(),
            "seqname": pa.dictionary(pa.int32(), pa.string()),
            "source": pa.dictionary(pa.int32(), pa.string()),
            "feature": pa.dictionary(pa.int32(), pa.string()),
            "strand": pa.dictionary(pa.int32(), pa.string()),
        },
    )
    table = pacsv.read_csv(filepath_or_buffer, read_opts, parse_opts, convert_opts)
    if features is not None:
        # filter rows on the Arrow side, before the pandas conversion
        table = table.filter(pc.is_in(pc.field("feature"), pa.array(sorted(features))))
    return table.to_pandas(types_mapper=pd.ArrowDtype, self_destruct=True, split_blocks=True)

Known pyarrow gaps to engineer around

  • No comment="#" shortcut. Use skip_rows + an invalid_row_handler that returns "skip" when row.text.startswith("#"). The handler only fires on column-count mismatch, so a #!genome-build header (1 field vs expected 9) triggers it cleanly.
  • All-or-nothing quoting. quote_char=False is the right call for GTF since Ensembl release 78 famously has unescaped quotes; existing fix_quotes_columns post-processing in parse_with_polars_lazy can move into the new path as a pandas-side regex.
  • 23.0.0 type-inference regression on extreme scientific notation strings ([Python] CSV reader returns different values in 23.0.0 apache/arrow#49003); fixed in 23.0.1. We already pin types defensively for the columns that matter, so this doesn't bite — but it's a reminder to always declare column_types and not rely on inference.
  • BOM not auto-stripped. Add a one-time sniff for \xef\xbb\xbf.
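
A rough sketch of the BOM sniff from the last bullet, assuming a seekable binary buffer; strip_utf8_bom is a hypothetical name and not an existing gtfparse function:

```python
import codecs
import io

def strip_utf8_bom(buffer):
    """Advance a seekable binary buffer past a leading UTF-8 BOM, if present.

    Returns True when a BOM was found and skipped, False otherwise.
    """
    start = buffer.tell()
    head = buffer.read(len(codecs.BOM_UTF8))  # BOM_UTF8 == b"\xef\xbb\xbf"
    if head == codecs.BOM_UTF8:
        return True
    # No BOM: rewind so the parser still sees the first bytes.
    buffer.seek(start)
    return False

with_bom = io.BytesIO(codecs.BOM_UTF8 + b"chr1\tsrc\tgene\n")
without_bom = io.BytesIO(b"chr1\tsrc\tgene\n")
found_bom = strip_utf8_bom(with_bom)
found_none = strip_utf8_bom(without_bom)
```

The buffer can be handed to the CSV reader straight after the call, since the read position then sits on the first real byte in both cases.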

Performance expectations (from May 2026 benchmarks)

Independent reproductions of the Polars PDS-H benchmark and the pandas-vs-polars-2025 writeups put us at:

  • polars.read_csv: ~10x faster than pandas C engine, ~2-3x faster than pyarrow.csv on raw CSV reads at SF-10 / SF-100 scale.
  • Memory: polars ~179 MB peak on a 1 GB CSV vs pandas numpy backend ~1.4 GB. pyarrow.csv → to_pandas(types_mapper=pd.ArrowDtype) closes most of that gap because buffers stay Arrow.

Net call: ~2-3x slower parse, ~same memory. For typical GTF sizes (Ensembl release human GTF is ~1.3 GB uncompressed, GENCODE comprehensive ~1.5 GB), we're looking at adding 5-15 seconds to a one-shot load that downstream tools cache anyway. Not a blocker; pyensembl and varcode load this once at index time.

What we'd gain

  1. No more global string cache.
  2. No more PyO3 panic exception class — pyarrow exceptions are normal Python ArrowInvalid / ArrowTypeError.
  3. Stable API surface — pyarrow CSV options have been stable across major versions; polars renames its CSV kwargs roughly once a year.
  4. Native nullable Int64 for the *_version columns without the pd.to_numeric(...).astype("Int64") dance.
  5. Native arrow-backed string columns via types_mapper=pd.ArrowDtype (the Table.to_pandas counterpart of dtype_backend="pyarrow"), so categoricals become a pandas extension dtype instead of a polars-side Categorical that converts oddly.
  6. Table.filter(pyarrow.compute.field("feature")...) for the features= filter, with the same expressive power as polars' lazy filter.
  7. Users who want polars can still call pl.from_arrow(table) zero-copy.

Decision criteria — when to actually pull the trigger

Concrete benchmark plan to settle the call:

  1. Reproducer file: Ensembl release 114 Homo_sapiens.GRCh38.114.gtf.gz (~60 MB compressed, ~1.4 GB uncompressed). Run read_gtf under timeit / memray with each of:
    • current code path (polars 0.20.x → pandas)
    • pyarrow.csv → pandas (numpy backend)
    • pyarrow.csv → pandas (pyarrow backend via types_mapper=pd.ArrowDtype)
  2. Capture: wall-clock parse + attribute-expansion + cast, peak RSS, intermediate frame size.
  3. Re-test with GENCODE v48 primary (gencode.v48.primary_assembly.annotation.gtf) for the GENCODE-specific corner cases.
  4. If pyarrow path is within 3x of polars on parse and within 1.5x on total read_gtf time, migrate. If it's worse than that, hold.
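
The thresholds in step 4 can be written down as a small check so the benchmark script reports the verdict directly. This is only a sketch: best_of and should_migrate are hypothetical names, and the timings in the usage lines are made-up placeholders, not measurements.

```python
import timeit

def best_of(fn, repeat=3):
    """Fastest of `repeat` single-call wall-clock timings, in seconds."""
    return min(timeit.repeat(fn, number=1, repeat=repeat))

def should_migrate(polars_parse_s, arrow_parse_s, polars_total_s, arrow_total_s,
                   max_parse_ratio=3.0, max_total_ratio=1.5):
    """Apply the decision rule: pyarrow within 3x on parse AND within 1.5x on total."""
    parse_ratio = arrow_parse_s / polars_parse_s
    total_ratio = arrow_total_s / polars_total_s
    return parse_ratio <= max_parse_ratio and total_ratio <= max_total_ratio

# Placeholder numbers purely to exercise the rule.
verdict_ok = should_migrate(2.0, 5.0, 10.0, 14.0)         # ratios 2.5x and 1.4x
verdict_slow_parse = should_migrate(2.0, 7.0, 10.0, 14.0)  # parse ratio 3.5x
verdict_slow_total = should_migrate(2.0, 5.0, 10.0, 16.0)  # total ratio 1.6x
sample_timing = best_of(lambda: sum(range(1000)))
```

In a real run the four inputs would come from best_of(...) wrapped around each code path in step 1; peak RSS still needs memray or a similar tool, since timeit only measures wall-clock time.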

Non-goals

  • Not migrating attribute expansion (expand_attribute_strings) — that's pure-Python and stays as-is.
  • Not breaking the public API. read_gtf keeps the same kwargs; only the internal parser changes. result_type="polars" becomes a thin pl.from_arrow(table) wrapper (still supported, just no longer the native path).
  • Not rewriting create_missing_features — it's already pandas.

References
