# parq

A command-line tool for inspecting, transforming, and comparing tabular files.

parq focuses on the workflows that come up most often when working with `.parquet`, `.csv`, and `.xlsx` files:
- inspect metadata and schema
- preview the first or last rows
- count rows
- split large files
- compute lightweight column stats
- convert between supported formats
- diff two datasets by key
- merge compatible files
The CLI keeps startup light with lazy imports, preserves `plain` and `json` output modes for automation, and avoids unnecessary full-table materialization for large CSV/XLSX workflows where possible.
## Installation

```bash
pip install parq-cli
```

Enable `.xlsx` support with the optional dependency:

```bash
pip install "parq-cli[xlsx]"
```

## Quick start

```bash
# Inspect metadata
parq meta data.parquet
parq meta --fast data.csv

# Show schema
parq schema data.xlsx

# Preview rows
parq head data.parquet
parq head -n 10 --columns id,name data.csv
parq tail -n 20 data.csv

# Count rows
parq count data.parquet

# Split files
parq split data.csv --record-count 100000 -n "chunks/part-%03d.csv"
parq split data.parquet --file-count 4 -n "chunks/part-%02d.parquet"

# Column statistics
parq stats sales.parquet --columns amount,discount --limit 10

# Format conversion
parq convert raw.xlsx cleaned.parquet
parq convert source.parquet export.csv --columns id,name,status

# Dataset diff
parq diff old.parquet new.parquet --key id --columns status,amount
parq diff left.csv right.csv --key id --summary-only

# Merge compatible inputs
parq merge part-001.parquet part-002.parquet merged.parquet
```

## Format support

| Command | Parquet | CSV | XLSX |
|---|---|---|---|
| meta | yes | yes | yes |
| schema | yes | yes | yes |
| head / tail | yes | yes | yes |
| count | yes | yes | yes |
| split | yes | yes | yes |
| stats | yes | yes | yes |
| convert | yes | yes | yes |
| diff | yes | yes | no, convert first |
| merge | yes | yes | yes |
XLSX support requires `openpyxl`.
## Commands

### parq meta

```bash
parq meta FILE
parq meta --fast FILE
```

Shows file-level metadata such as path, format, column count, file size, row-group count, and, when available, row count and Parquet-specific metadata.

Use `--fast` when you want a cheap metadata pass on CSV/XLSX files. In fast mode, expensive fields such as full row counts are skipped.
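The fast-mode trade-off can be illustrated with a minimal sketch (an assumption-laden stand-in, not parq's actual implementation): a cheap pass reads only the header and the file size, while a full pass counts every row.

```python
import csv
import os


def csv_meta(path: str, fast: bool = False) -> dict:
    """Collect lightweight metadata for a CSV file.

    In fast mode only the header row is read, so the cost is O(1) in
    the number of data rows; the full pass counts every remaining row.
    Illustrative only -- not the parq implementation.
    """
    with open(path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader, [])
        meta = {
            "path": path,
            "format": "csv",
            "column_count": len(header),
            "file_size": os.path.getsize(path),
        }
        if not fast:
            # The expensive part that --fast skips.
            meta["row_count"] = sum(1 for _ in reader)
    return meta
```

The same shape generalizes to XLSX, where skipping the row count avoids loading the sheet at all.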
### parq schema

```bash
parq schema FILE
```

Shows column names, types, and nullability.
### parq head / parq tail

```bash
parq head FILE
parq head -n 20 FILE
parq head -n 20 --columns id,name FILE
parq tail FILE
parq tail -n 20 FILE
parq tail -n 20 --columns id,name FILE
```

Notes:

- the default preview size is `5`
- `--columns` accepts a comma-separated list
- missing files return a friendly error with exit code `1`
- empty header-only CSV/XLSX files return an empty preview with detected columns
- an empty CSV with no header raises a friendly `Empty CSV file` error
### parq count

```bash
parq count FILE
```

Returns the total row count.
### parq split

```bash
parq split FILE --file-count N
parq split FILE --record-count N
parq split FILE --record-count 100000 -n "chunks/part-%03d.parquet"
```

Splits one input file into multiple output files.

Rules:

- specify exactly one of `--file-count` or `--record-count`
- the output format is inferred from `--name-format`
- existing target files are not overwritten
- in `--record-count` mode, CSV/XLSX inputs stream in a single pass instead of pre-counting the entire file
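Single-pass record-count splitting can be sketched like this (a simplified stand-in for parq's streaming behavior; the `%03d`-style name pattern mirrors the `-n` examples above):

```python
import csv


def split_csv(path: str, record_count: int, name_format: str) -> list[str]:
    """Split a CSV into chunks of at most record_count rows in one pass.

    The header is re-emitted in every chunk, and no pre-count of the
    input is needed: the file is read exactly once. Illustrative only.
    """
    outputs: list[str] = []
    with open(path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)
        writer, out, rows_in_chunk = None, None, 0
        for row in reader:
            if writer is None or rows_in_chunk == record_count:
                # Start a new chunk file, e.g. part-000.csv, part-001.csv, ...
                if out:
                    out.close()
                outputs.append(name_format % len(outputs))
                out = open(outputs[-1], "w", newline="")
                writer = csv.writer(out)
                writer.writerow(header)
                rows_in_chunk = 0
            writer.writerow(row)
            rows_in_chunk += 1
        if out:
            out.close()
    return outputs
```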
### parq stats

```bash
parq stats FILE
parq stats FILE --columns amount,discount
parq stats FILE --limit 20
```

Computes simple per-column statistics.

- numeric columns include `count`, `null_count`, `min`, `max`, and `mean`
- non-numeric columns include `count` and `null_count`
- the default `--limit` is `50`, to avoid flooding the terminal on very wide tables
### parq convert

```bash
parq convert SOURCE OUTPUT
parq convert SOURCE OUTPUT --columns id,name,status
```

Converts a supported input file to another supported output format. The output format is determined by the OUTPUT suffix.

Notes:

- current targets are `.parquet`, `.csv`, and `.xlsx`
- conversion is streaming-based where possible
- existing output files raise an error instead of being overwritten
### parq diff

```bash
parq diff LEFT RIGHT --key id
parq diff LEFT RIGHT --key id1,id2 --columns status,amount
parq diff LEFT RIGHT --key id --summary-only
```

Compares two datasets by key and reports:

- the row count delta
- rows only present on the left
- rows only present on the right
- changed rows for the selected columns
- schema-only columns and same-name type mismatches

Notes:

- `--key` is required
- `diff` currently supports Parquet and CSV inputs; XLSX files should be converted first
- duplicate keys on either side are treated as an error
- `--summary-only` keeps the counts and omits sample payloads
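The keyed comparison semantics above, including the duplicate-key error, can be sketched as a dictionary join (a simplified single-key model, not parq's code):

```python
def keyed_diff(left: list[dict], right: list[dict], key: str,
               columns: list[str]) -> dict:
    """Compare two lists of row dicts by a single key column.

    Reports left-only keys, right-only keys, and keys whose selected
    columns changed. Duplicate keys on either side are an error.
    Illustrative only -- parq also handles composite keys and schemas.
    """
    def index(rows: list[dict]) -> dict:
        idx = {}
        for row in rows:
            if row[key] in idx:
                raise ValueError(f"duplicate key: {row[key]}")
            idx[row[key]] = row
        return idx

    l, r = index(left), index(right)
    changed = [k for k in l.keys() & r.keys()
               if any(l[k].get(c) != r[k].get(c) for c in columns)]
    return {
        "row_count_delta": len(right) - len(left),
        "left_only": sorted(l.keys() - r.keys()),
        "right_only": sorted(r.keys() - l.keys()),
        "changed": sorted(changed),
    }
```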
### parq merge

```bash
parq merge INPUT1 INPUT2 OUTPUT
parq merge chunks/*.parquet merged.parquet
```

Merges multiple compatible input files into a single output file. The last positional argument is the output path.

Notes:

- schemas must be identical or safely unifiable by Arrow
- existing output files are not overwritten
- the output format is inferred from the output suffix
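Compatibility checking before a merge can be sketched for the CSV case (a simplified stand-in; for Parquet, parq relies on Arrow's schema unification rather than a string comparison like this):

```python
import csv


def merge_csv(inputs: list[str], output: str) -> int:
    """Concatenate CSV files that share an identical header.

    Headers are validated against the first input, so mismatching
    schemas raise instead of producing a silently broken file.
    Illustrative only -- not the parq implementation.
    """
    rows_written = 0
    header = None
    with open(output, "x", newline="") as out:  # "x": refuse to overwrite
        writer = csv.writer(out)
        for path in inputs:
            with open(path, newline="") as f:
                reader = csv.reader(f)
                this_header = next(reader)
                if header is None:
                    header = this_header
                    writer.writerow(header)
                elif this_header != header:
                    raise ValueError(f"schema mismatch in {path}")
                for row in reader:
                    writer.writerow(row)
                    rows_written += 1
    return rows_written
```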
## Output and global options

Global options:

- `--version`, `-v`: show version information
- `--output`, `-o`: select the output format
- `--help`: show command help

Available output modes:

- `rich`: human-friendly terminal rendering
- `plain`: low-overhead tabular output for shell pipelines
- `json`: machine-readable structured output

Examples:

```bash
parq meta data.parquet --output json
parq stats data.csv --output plain
parq diff left.parquet right.parquet --key id --summary-only --output json
```

On Windows terminals that cannot safely render emoji or extended characters, Rich headings automatically fall back to a safe plain style instead of crashing.
## Performance notes

- Parquet metadata, row counts, and previews use Arrow metadata and row-group shortcuts where available.
- CSV `tail` uses a fixed-size column window instead of materializing every row as Python dicts.
- CSV/XLSX `split --record-count` streams in one pass.
- `meta --fast` is the best option when you need quick metadata from large CSV/XLSX inputs.
- XLSX schema inference samples the first 1000 rows instead of scanning the entire sheet up front.
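The fixed-size window technique behind a streaming `tail` can be sketched with `collections.deque` (an illustration of the idea, not parq's exact code): a bounded deque keeps only the last n rows, so memory stays O(n) no matter how large the file is.

```python
import csv
from collections import deque


def tail_csv(path: str, n: int = 5) -> list[list[str]]:
    """Return the last n data rows of a CSV using O(n) memory.

    A deque with maxlen evicts old rows as new ones stream in, so the
    whole file is scanned once but never held in memory at once.
    Illustrative only -- not the parq implementation.
    """
    with open(path, newline="") as f:
        reader = csv.reader(f)
        next(reader, None)  # skip the header row
        return list(deque(reader, maxlen=n))
```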
For repeated heavy workflows, converting large CSV/XLSX files to Parquet is still the best path for throughput.
## Development

Install development dependencies:

```bash
uv sync --extra dev
```

or:

```bash
pip install -e ".[dev]"
```

Useful commands:

```bash
python -m parq --help
pytest -m "not performance"
pytest tests/test_performance.py -m performance -q -s
ruff check parq tests
ruff check --fix parq tests
pytest --cov=parq --cov-report=html
```

## Status

Implemented:
- metadata and schema inspection
- head and tail preview
- row counting
- file splitting
- column statistics
- format conversion
- keyed dataset diff
- compatible file merge
Planned improvements now center on deeper performance tuning, richer diff workflows, and broader reporting capabilities, rather than on adding core commands from scratch.