parq-cli

A command-line tool for inspecting, transforming, and comparing tabular files.

Overview

parq focuses on the workflows that come up most often when working with .parquet, .csv, and .xlsx files:

inspect metadata and schema
preview the first or last rows
count rows
split large files
compute lightweight column stats
convert between supported formats
diff two datasets by key
merge compatible files

The CLI keeps startup light with lazy imports, preserves plain and json output modes for automation, and avoids unnecessary full-table materialization for large CSV/XLSX workflows where possible.

Installation

pip install parq-cli

Enable .xlsx support with the optional dependency:

pip install "parq-cli[xlsx]"

Quick Start

# Inspect metadata
parq meta data.parquet
parq meta --fast data.csv

# Show schema
parq schema data.xlsx

# Preview rows
parq head data.parquet
parq head -n 10 --columns id,name data.csv
parq tail -n 20 data.csv

# Count rows
parq count data.parquet

# Split files
parq split data.csv --record-count 100000 -n "chunks/part-%03d.csv"
parq split data.parquet --file-count 4 -n "chunks/part-%02d.parquet"

# Column statistics
parq stats sales.parquet --columns amount,discount --limit 10

# Format conversion
parq convert raw.xlsx cleaned.parquet
parq convert source.parquet export.csv --columns id,name,status

# Dataset diff
parq diff old.parquet new.parquet --key id --columns status,amount
parq diff left.csv right.csv --key id --summary-only

# Merge compatible inputs
parq merge part-001.parquet part-002.parquet merged.parquet

Supported Formats

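| Command | Parquet | CSV | XLSX |
| --- | --- | --- | --- |
| `meta` | yes | yes | yes |
| `schema` | yes | yes | yes |
| `head` / `tail` | yes | yes | yes |
| `count` | yes | yes | yes |
| `split` | yes | yes | yes |
| `stats` | yes | yes | yes |
| `convert` | yes | yes | yes |
| `diff` | yes | yes | no, convert first |
| `merge` | yes | yes | yes |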

XLSX support requires openpyxl.

Command Reference

`meta`

parq meta FILE
parq meta --fast FILE

Shows file-level metadata such as path, format, column count, file size, row-group count, and when available, row count and Parquet-specific metadata.

Use --fast when you want a cheap metadata pass on CSV/XLSX files. In fast mode, expensive fields such as full row counts are skipped.

`schema`

parq schema FILE

Shows column names, types, and nullable information.

`head` and `tail`

parq head FILE
parq head -n 20 FILE
parq head -n 20 --columns id,name FILE

parq tail FILE
parq tail -n 20 FILE
parq tail -n 20 --columns id,name FILE

Notes:

default preview size is 5 rows
--columns accepts a comma-separated list
missing files return a friendly error with exit code 1
empty header-only CSV/XLSX files return an empty preview with detected columns
an empty CSV file with no header raises a friendly "Empty CSV file" error

`count`

parq count FILE

Returns the total row count.

`split`

parq split FILE --file-count N
parq split FILE --record-count N
parq split FILE --record-count 100000 -n "chunks/part-%03d.parquet"

Splits one input file into multiple output files.

Rules:

specify exactly one of --file-count or --record-count
output format is inferred from the suffix of --name-format (passed as -n in the examples above)
existing target files are not overwritten
in --record-count mode, CSV/XLSX inputs are streamed in a single pass instead of pre-counting the entire file
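
The single-pass behaviour of --record-count mode can be illustrated with a short Python sketch. This is not parq's implementation: `split_rows` and `split_csv_text` are hypothetical helpers, chunks are returned as in-memory strings rather than written to disk, and numbering outputs from 0 is an assumption, since the numbering scheme is not stated here.

```python
import csv
import io
from typing import Dict, Iterator, List, Tuple


def split_rows(
    rows: Iterator[List[str]], record_count: int
) -> Iterator[Tuple[int, List[List[str]]]]:
    """Yield (chunk_index, rows) groups of at most record_count rows.

    The input is consumed once; the total row count is never needed.
    """
    if record_count <= 0:
        raise ValueError("record_count must be positive")
    chunk: List[List[str]] = []
    index = 0  # assumption: chunks are numbered from 0
    for row in rows:
        chunk.append(row)
        if len(chunk) == record_count:
            yield index, chunk
            chunk = []
            index += 1
    if chunk:  # final, possibly shorter, chunk
        yield index, chunk


def split_csv_text(text: str, record_count: int, name_format: str) -> Dict[str, str]:
    """Split CSV text into named chunks, repeating the header in each one."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader, None)
    if header is None:
        raise ValueError("Empty CSV file")
    outputs: Dict[str, str] = {}
    for index, chunk in split_rows(reader, record_count):
        buffer = io.StringIO()
        writer = csv.writer(buffer, lineterminator="\n")
        writer.writerow(header)
        writer.writerows(chunk)
        # printf-style pattern such as "part-%03d.csv"
        outputs[name_format % index] = buffer.getvalue()
    return outputs
```

Because each chunk is emitted as soon as it fills up, memory use is bounded by the chunk size rather than the file size.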

`stats`

parq stats FILE
parq stats FILE --columns amount,discount
parq stats FILE --limit 20

Computes simple per-column statistics.

numeric columns include count, null_count, min, max, mean
non-numeric columns include count and null_count
default --limit is 50 to avoid flooding the terminal on very wide tables
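
As a rough illustration of the statistics listed above, the following Python sketch summarizes one column in a single pass. `column_stats` is a hypothetical helper, not part of parq, and it assumes that `count` means the number of non-null values; the source does not say whether nulls are included.

```python
from typing import Dict, Iterable, Optional


def column_stats(values: Iterable[Optional[object]]) -> Dict[str, object]:
    """Summarize one column in a single pass; None marks a null value."""
    count = 0  # assumption: counts non-null values only
    null_count = 0
    numeric = True
    total = 0.0
    lowest = None
    highest = None
    for value in values:
        if value is None:
            null_count += 1
            continue
        count += 1
        # bool is a subclass of int, so exclude it explicitly.
        if isinstance(value, (int, float)) and not isinstance(value, bool):
            total += value
            lowest = value if lowest is None else min(lowest, value)
            highest = value if highest is None else max(highest, value)
        else:
            numeric = False
    stats: Dict[str, object] = {"count": count, "null_count": null_count}
    if numeric and count > 0:
        # Only fully numeric columns get min, max, and mean.
        stats.update(min=lowest, max=highest, mean=total / count)
    return stats
```

For example, `column_stats([1, 2, None, 3])` reports three values, one null, a minimum of 1, a maximum of 3, and a mean of 2.0, while a text column reports only `count` and `null_count`.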

`convert`

parq convert SOURCE OUTPUT
parq convert SOURCE OUTPUT --columns id,name,status

Converts a supported input file to another supported output format. The output format is determined by the OUTPUT suffix.

Notes:

current targets are .parquet, .csv, and .xlsx
conversion is streaming-based where possible
existing output files raise an error instead of being overwritten
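
The suffix-based format selection and the no-overwrite rule can be sketched in a few lines of Python. `infer_format` and `check_target` are hypothetical names used only for illustration, and treating the suffix case-insensitively is an assumption.

```python
from pathlib import Path

# Output suffixes named in this README.
SUPPORTED = {".parquet": "parquet", ".csv": "csv", ".xlsx": "xlsx"}


def infer_format(path: str) -> str:
    """Map an output path to a format name based on its suffix."""
    suffix = Path(path).suffix.lower()  # assumption: case-insensitive match
    if suffix not in SUPPORTED:
        raise ValueError(f"unsupported output suffix: {suffix!r}")
    return SUPPORTED[suffix]


def check_target(path: str) -> str:
    """Return the output format, refusing to overwrite an existing file."""
    fmt = infer_format(path)
    if Path(path).exists():
        raise FileExistsError(f"refusing to overwrite {path}")
    return fmt
```

Validating the target before reading any input means a bad suffix or an existing file is reported before expensive work starts.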

`diff`

parq diff LEFT RIGHT --key id
parq diff LEFT RIGHT --key id1,id2 --columns status,amount
parq diff LEFT RIGHT --key id --summary-only

Compares two datasets by key and reports:

row count delta
rows only present on the left
rows only present on the right
changed rows for the selected columns
schema-only columns and same-name type mismatches

Notes:

--key is required
diff currently supports Parquet and CSV inputs
XLSX files should be converted first
duplicate keys on either side are treated as an error
--summary-only keeps the counts and omits sample payloads
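
The keyed comparison described above can be sketched in plain Python over lists of dicts. This is an illustration under assumptions, not parq's implementation: `index_by_key` and `diff_rows` are hypothetical helpers, the result field names are invented for the example, and the row count delta is taken as right minus left.

```python
from typing import Dict, Iterable, List, Sequence, Tuple

Row = Dict[str, object]
Key = Tuple[object, ...]


def index_by_key(rows: Iterable[Row], key: Sequence[str]) -> Dict[Key, Row]:
    """Index rows by their key columns; duplicate keys are an error."""
    indexed: Dict[Key, Row] = {}
    for row in rows:
        k = tuple(row[name] for name in key)
        if k in indexed:
            raise ValueError(f"duplicate key: {k!r}")
        indexed[k] = row
    return indexed


def diff_rows(
    left: Iterable[Row],
    right: Iterable[Row],
    key: Sequence[str],
    columns: Sequence[str],
) -> Dict[str, object]:
    """Compare two row sets by key over the selected columns."""
    l = index_by_key(left, key)
    r = index_by_key(right, key)
    changed: List[Key] = sorted(
        k
        for k in l.keys() & r.keys()
        if any(l[k].get(c) != r[k].get(c) for c in columns)
    )
    return {
        "row_count_delta": len(r) - len(l),  # assumption: right minus left
        "left_only": sorted(l.keys() - r.keys()),
        "right_only": sorted(r.keys() - l.keys()),
        "changed": changed,
    }
```

Building a dictionary per side is what makes the duplicate-key rule necessary: with repeated keys there is no single row on each side to compare.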

`merge`

parq merge INPUT1 INPUT2 OUTPUT
parq merge chunks/*.parquet merged.parquet

Merges multiple compatible input files into a single output file. The last positional argument is the output path.

Notes:

schemas must be identical or safely unifiable by Arrow
existing output files are not overwritten
output format is inferred from the output suffix

Output Modes

Global options:

--version, -v: show version information
--output, -o: select output format
--help: show command help

Available output modes:

rich: human-friendly terminal rendering
plain: low-overhead tabular output for shell pipelines
json: machine-readable structured output

Examples:

parq meta data.parquet --output json
parq stats data.csv --output plain
parq diff left.parquet right.parquet --key id --summary-only --output json
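
For automation, the json mode can be consumed from a script. The sketch below assumes `parq` is on the PATH; `build_command` and `run_parq_json` are hypothetical wrappers, and the shape of the returned JSON is not documented here, so the result is passed back without interpretation.

```python
import json
import subprocess
from typing import Any, List


def build_command(subcommand: str, *args: str, output: str = "json") -> List[str]:
    """Build an argv list, placing --output last as in the examples above."""
    return ["parq", subcommand, *args, "--output", output]


def run_parq_json(subcommand: str, *args: str) -> Any:
    """Run parq and parse its stdout as JSON.

    Raises subprocess.CalledProcessError on a non-zero exit code.
    """
    result = subprocess.run(
        build_command(subcommand, *args),
        capture_output=True,
        text=True,
        check=True,
    )
    return json.loads(result.stdout)
```

A call such as `run_parq_json("meta", "data.parquet")` would then return the parsed metadata, whatever structure the tool emits.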

On Windows terminals that cannot safely render emoji or extended characters, Rich headings automatically fall back to a safe plain style instead of crashing.

Large File Notes

Parquet metadata, row counts, and previews use Arrow metadata and row-group shortcuts where available.
CSV tail uses a fixed-size column window instead of materializing every row as Python dicts.
CSV/XLSX split --record-count streams in one pass.
meta --fast is the best option when you need quick metadata from large CSV/XLSX inputs.
XLSX schema inference samples the first 1000 rows instead of scanning the entire sheet up front.
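
One common way to keep a tail preview cheap is to stream the file through a bounded window so that only the last rows are ever held in memory. The Python sketch below shows that general idea with `collections.deque`; `tail_csv` is a hypothetical helper and is not a description of how parq implements its window.

```python
import csv
import io
from collections import deque
from typing import List, Tuple


def tail_csv(text: str, n: int = 5) -> Tuple[List[str], List[List[str]]]:
    """Return the header and the last n data rows of CSV text."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader, None)
    if header is None:
        raise ValueError("Empty CSV file")
    # A deque with maxlen discards the oldest row whenever a new one
    # arrives, so at most n rows are held at any time.
    window = deque(reader, maxlen=n)
    return header, list(window)
```

A header-only input yields the detected columns and an empty row list, which matches the preview behaviour described for `head` and `tail`.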

For repeated heavy workflows, converting large CSV/XLSX files to Parquet is still the best path for throughput.

Development

Install development dependencies:

uv sync --extra dev

or:

pip install -e ".[dev]"

Useful commands:

python -m parq --help
pytest -m "not performance"
pytest tests/test_performance.py -m performance -q -s
ruff check parq tests
ruff check --fix parq tests
pytest --cov=parq --cov-report=html

Status

Implemented:

metadata and schema inspection
head and tail preview
row counting
file splitting
column statistics
format conversion
keyed dataset diff
compatible file merge

Planned improvements are now centered on deeper performance tuning, richer diff workflows, and broader reporting capabilities rather than adding the core commands from scratch.

License

MIT
