Evaluate migration from polars to pyarrow + pandas (with measurement plan) #69

## Motivation

`gtfparse` today uses polars for the CSV-style read of GTF/GFF files, then immediately converts to pandas for everything downstream. A load-bearing comment in `read_gtf.py` explains that the conversion is there because *"Polars bugs manifest as ``pyo3_runtime.PanicException: assertion `left == right` failed: impl error`` and are generally insane to chase down."* That, plus a steady stream of polars-API-renamed issues (#36, #47, #50, #54), plus the global string cache footgun guarded by `polars.enable_string_cache()` in `parse_with_polars_lazy`, suggests the polars dependency costs more than it earns back for a small TSV parser. I'm filing this issue to evaluate the migration on numbers.

## What polars actually buys us today

- Fast threaded CSV reader with column-type overrides.
- Lazy plan for the `features=` row filter (minor — files small enough for eager).
- Cheap categorical encoding of `seqname` / `source` / `feature` / `strand`.

That's it. The rest of `read_gtf` (attribute expansion, alias rename, version casting, biotype inference, `usecols` filter, `result_type` conversion) is all pandas.

## What it costs us

- **Bug surface**: opaque PyO3 panics; the explicit `to_pandas()` round-trip in the middle of `read_gtf` exists because of them.
- **API churn**: polars releases biweekly and breaks public API between minors. #36 (`scan_csv` `sep` kwarg removed), #50 (`dtypes` kwarg deprecated), #47 (pyarrow pinning), open PR #54 (polars version + warnings) are all instances of the same maintenance tax.
- **String-cache global state**: `polars.enable_string_cache()` is process-global. We added `test_version_cast_is_idempotent_via_double_read` to prove repeated reads don't bite us; that test wouldn't exist with pyarrow.
- **Two-DataFrame API surface**: `result_type="polars"` / `"pandas"` / `"dict"` means we maintain three output shapes against two underlying libraries. Dropping polars as the native parser collapses this to one underlying library.

## Proposed migration target

Replace the `polars.read_csv` path with `pyarrow.csv.read_csv` (or `pyarrow.csv.open_csv` for streaming), hand off to pandas via `Table.to_pandas(types_mapper=pd.ArrowDtype, self_destruct=True, split_blocks=True)`. Concretely:

```python
import pandas as pd
import pyarrow as pa
import pyarrow.compute as pc
import pyarrow.csv as pacsv

# REQUIRED_COLUMNS and _count_leading_header_lines are assumed to be provided
# by the surrounding gtfparse module.

def _read_gtf_arrow(filepath_or_buffer, features=None):
    read_opts = pacsv.ReadOptions(
        column_names=REQUIRED_COLUMNS,
        skip_rows=_count_leading_header_lines(filepath_or_buffer),
    )
    parse_opts = pacsv.ParseOptions(
        delimiter="\t",
        quote_char=False,            # GTFs have unescaped quotes in attributes
        # Called only on column-count mismatch: drop comment lines, fail on anything else.
        invalid_row_handler=lambda row: "skip" if row.text.startswith("#") else "error",
    )
    convert_opts = pacsv.ConvertOptions(
        null_values=["."],
        strings_can_be_null=True,
        column_types={
            "start": pa.int64(), "end": pa.int64(),
            "score": pa.float32(), "frame": pa.uint32(),
            "seqname": pa.dictionary(pa.int32(), pa.string()),
            "source": pa.dictionary(pa.int32(), pa.string()),
            "feature": pa.dictionary(pa.int32(), pa.string()),
            "strand": pa.dictionary(pa.int32(), pa.string()),
        },
    )
    table = pacsv.read_csv(filepath_or_buffer, read_opts, parse_opts, convert_opts)
    if features is not None:
        # Keep only rows whose feature type was requested.
        table = table.filter(pc.is_in(pc.field("feature"), pa.array(sorted(features))))
    return table.to_pandas(types_mapper=pd.ArrowDtype, self_destruct=True, split_blocks=True)
```

## Known pyarrow gaps to engineer around

- **No `comment="#"` shortcut**. Use `skip_rows` + an `invalid_row_handler` that returns `"skip"` when `row.text.startswith("#")`. The handler only fires on column-count mismatch, so a `#!genome-build` header (1 field vs expected 9) triggers it cleanly.
- **All-or-nothing quoting**. `quote_char=False` is the right call for GTF since Ensembl release 78 famously has unescaped quotes; existing `fix_quotes_columns` post-processing in `parse_with_polars_lazy` can move into the new path as a pandas-side regex.
- **23.0.0 type-inference regression** on extreme scientific notation strings (apache/arrow#49003); fixed in 23.0.1. We already pin types defensively for the columns that matter, so this doesn't bite — but it's a reminder to always declare `column_types` and not rely on inference.
- **BOM not auto-stripped**. Add a one-time sniff for `\xef\xbb\xbf`.
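
The `skip_rows` and BOM items above both amount to a small pre-pass over the start of the file. Below is a minimal stdlib-only sketch, assuming a seekable, already-decompressed binary stream. The proposal references `_count_leading_header_lines` without showing it, so `count_leading_header_lines` here is only a guess at its behavior, and `strip_bom` is a hypothetical helper name.

```python
import io

UTF8_BOM = b"\xef\xbb\xbf"


def strip_bom(stream):
    """Advance a seekable binary stream past a UTF-8 BOM, if one is present.

    Returns True when a BOM was found and skipped, False otherwise.
    """
    start = stream.tell()
    if stream.read(len(UTF8_BOM)) == UTF8_BOM:
        return True
    stream.seek(start)  # no BOM: rewind to where we started
    return False


def count_leading_header_lines(stream):
    """Count consecutive '#' lines from the current position, then rewind."""
    start = stream.tell()
    count = 0
    for line in stream:
        if not line.startswith(b"#"):
            break
        count += 1
    stream.seek(start)
    return count


# Tiny in-memory GTF: BOM, two header lines, one data row.
sample = io.BytesIO(
    UTF8_BOM
    + b"#!genome-build GRCh38\n"
    + b"#!genome-version GRCh38\n"
    + b'1\thavana\tgene\t11869\t14409\t.\t+\t.\tgene_id "ENSG00000223972";\n'
)
had_bom = strip_bom(sample)
n_header = count_leading_header_lines(sample)
print(had_bom, n_header)  # True 2
```

Counting only the leading run of `#` lines keeps the pre-pass cheap; comment lines further down the file would still be left to the `invalid_row_handler`.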

## Performance expectations (from May 2025 benchmarks)

Independent reproductions of the Polars PDS-H benchmark and the pandas-vs-polars-2025 writeups put us at:

- polars.read_csv: ~10x faster than pandas C engine, ~2-3x faster than pyarrow.csv on raw CSV reads at SF-10 / SF-100 scale.
- Memory: polars ~179 MB peak on a 1 GB CSV vs pandas numpy backend ~1.4 GB. pyarrow.csv → `to_pandas(types_mapper=pd.ArrowDtype)` closes most of that gap because buffers stay Arrow.

**Net call**: ~2-3x slower parse, ~same memory. For typical GTF sizes (the Ensembl human GTF is ~1.3 GB uncompressed, GENCODE comprehensive ~1.5 GB), we're looking at adding 5-15 seconds to a one-shot load that downstream tools cache anyway. Not a blocker; pyensembl and varcode load this once at index time.

## What we'd gain

1. No more global string cache.
2. No more PyO3 panic exception class — pyarrow exceptions are normal Python `ArrowInvalid` / `ArrowTypeError`.
3. Stable API surface — pyarrow CSV options have been stable across major versions; polars renames its CSV kwargs roughly once a year.
4. Native nullable `Int64` for the `*_version` columns without the `pd.to_numeric(...).astype("Int64")` dance.
5. Native arrow-backed string columns via `dtype_backend="pyarrow"` (on the Arrow hand-off, `types_mapper=pd.ArrowDtype`); categoricals become a pandas extension dtype instead of a polars-side `Categorical` that converts oddly.
6. `Table.filter(pyarrow.compute.field("feature")...)` for the `features=` filter, with the same expressive power as polars' lazy filter.
7. Users who want polars can still call `pl.from_arrow(table)` zero-copy.

## Decision criteria — when to actually pull the trigger

Concrete benchmark plan to settle the call:

1. Reproducer file: Ensembl release 114 `Homo_sapiens.GRCh38.114.gtf.gz` (~60 MB compressed, ~1.4 GB uncompressed). Run `read_gtf` under timeit / memray with each of:
   - current code path (polars 0.20.x → pandas)
   - pyarrow.csv → pandas (numpy backend)
   - pyarrow.csv → pandas (`dtype_backend="pyarrow"`)
2. Capture: wall-clock parse + attribute-expansion + cast, peak RSS, intermediate frame size.
3. Re-test with GENCODE v48 primary (`gencode.v48.primary_assembly.annotation.gtf`) for the GENCODE-specific corner cases.
4. If pyarrow path is within 3x of polars on parse and within 1.5x on total `read_gtf` time, migrate. If it's worse than that, hold.
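
The step 4 go/no-go rule can be written as a small check over the measured timings. This is a sketch only: `best_of` and `should_migrate` are hypothetical names, "within" is assumed to mean less than or equal to, and the numbers in the example calls are placeholders, not measurements.

```python
import time


def best_of(fn, repeats=3):
    """Return the fastest wall-clock time in seconds over `repeats` calls of fn."""
    timings = []
    for _ in range(repeats):
        t0 = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - t0)
    return min(timings)


def should_migrate(polars_parse, arrow_parse, polars_total, arrow_total,
                   max_parse_ratio=3.0, max_total_ratio=1.5):
    """Apply the step 4 thresholds to timings given in seconds."""
    parse_ratio = arrow_parse / polars_parse
    total_ratio = arrow_total / polars_total
    return parse_ratio <= max_parse_ratio and total_ratio <= max_total_ratio


# Placeholder timings for illustration only.
print(should_migrate(4.0, 10.0, 30.0, 40.0))  # True: 2.5x parse, ~1.33x total
print(should_migrate(4.0, 10.0, 30.0, 50.0))  # False: ~1.67x total exceeds 1.5x
```

In practice each timing would come from something like `best_of(lambda: read_gtf(path))` per code path, with peak RSS captured separately under memray.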

## Non-goals

- Not migrating attribute expansion (`expand_attribute_strings`) — that's pure-Python and stays as-is.
- Not breaking the public API. `read_gtf` keeps the same kwargs; only the internal parser changes. `result_type="polars"` becomes a thin `pl.from_arrow(table)` wrapper (still supported, just no longer the native path).
- Not rewriting `create_missing_features` — it's already pandas.

## References

- [Apache Arrow CSV docs (23.x)](https://arrow.apache.org/docs/python/csv.html)
- [pyarrow.csv.read_csv](https://arrow.apache.org/docs/python/generated/pyarrow.csv.read_csv.html)
- [pyarrow.csv.InvalidRow](https://arrow.apache.org/docs/python/generated/pyarrow.csv.InvalidRow.html) (the malformed-row escape hatch)
- [Polars PDS-H May 2025 benchmarks](https://pola.rs/posts/benchmarks/)
- [Pandas vs Polars 2025](https://python.plainenglish.io/pandas-vs-polars-in-2025-should-you-finally-make-the-switch-90fb2756ffe1)
- [polars-bio, Oxford Bioinformatics 2025](https://academic.oup.com/bioinformatics/article/41/12/btaf640/8362264) — the value is Arrow, not Polars specifically
- [apache/arrow#49003](https://github.com/apache/arrow/issues/49003) — 23.0.0 type-inference regression we'd want to pin around
- Related polars footgun reports: pola-rs/polars#23224, #20233, #19285 (string-cache panics)
