Mittal-Analytics/fast-pdf-extract

fast-pdf-extract

Rust-backed PDF text extraction library for Python.

Features

  • Detect and remove headers and footers
  • Clean bilingual PDFs
  • Mark headings in bold (basic Markdown)
  • High accuracy
  • Performance
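
To illustrate the first feature, here is a minimal Python sketch of one common way to detect repeated headers and footers: a line counts as a header or footer if it appears as the first or last line on most pages. This is an assumption made for illustration only, not the library's actual Rust implementation or its API. The function name `strip_repeated_margins` and the `min_fraction` parameter are hypothetical.

```python
import math
from collections import Counter


def strip_repeated_margins(pages, min_fraction=0.6):
    """Drop first/last lines that repeat across most pages.

    `pages` is a list of pages, each page a list of text lines.
    Hypothetical helper for illustration; not part of fast-pdf-extract.
    """
    # A line must repeat on at least two pages to count as a header/footer.
    threshold = max(2, math.ceil(len(pages) * min_fraction))

    first_lines = Counter(page[0].strip() for page in pages if page)
    last_lines = Counter(page[-1].strip() for page in pages if page)

    headers = {line for line, n in first_lines.items() if n >= threshold}
    footers = {line for line, n in last_lines.items() if n >= threshold}

    cleaned = []
    for page in pages:
        lines = list(page)
        if lines and lines[0].strip() in headers:
            lines = lines[1:]
        if lines and lines[-1].strip() in footers:
            lines = lines[:-1]
        cleaned.append(lines)
    return cleaned
```

Footers in real documents often contain changing page numbers, so a practical detector would need to normalize digits before counting. The sketch leaves that out to stay short.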

Development

uv sync --only-dev

# run tests (it rebuilds automatically)
uv run python -m unittest

# updating dependencies
cargo update
uv lock --upgrade

Publishing a new version

  1. Check the latest published version.
python - <<'PY'
import json
import urllib.request

with urllib.request.urlopen("https://pypi.org/pypi/fast-pdf-extract/json") as response:
    data = json.load(response)

print(data["info"]["version"])
PY
  2. Bump the version in Cargo.toml.
[package]
version = "0.6.1"
  3. Refresh lockfiles and run checks.
cargo check
uv lock
just test
  4. Build the release artifacts.
rm -rf target/wheels dist
uv run maturin build --release
  5. Publish to PyPI.
# MATURIN_PYPI_TOKEN must be set in the environment.
uv run maturin publish --skip-existing
  6. Verify PyPI shows the new version.
python - <<'PY'
import json
import urllib.request

with urllib.request.urlopen("https://pypi.org/pypi/fast-pdf-extract/json") as response:
    data = json.load(response)

print(data["info"]["version"])
PY
  7. Commit the version bump.
git add Cargo.toml Cargo.lock
git commit -m "Bump version to <version>"

Troubleshooting

If cargo build complains about a missing Python version, clean the build artifacts and rebuild:

cargo clean
cargo build

About

Our *fast* text-extraction library for extracting text from annual reports and similar documents.
