TextSpitter is a lightweight Python library that extracts text from documents and source-code files with a single call. It normalises diverse input types (file paths, BytesIO streams, SpooledTemporaryFile objects, and raw bytes) into plain strings, making it well suited to pipelines that feed text into LLMs, search engines, or data-processing workflows.
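To make the idea of normalising input types concrete, here is a minimal sketch using only the standard library. It is not TextSpitter's implementation: `read_all_bytes` is a hypothetical helper invented for illustration, and treating every `str` as a file path is an assumption of this sketch.

```python
from io import BytesIO
from pathlib import Path
from tempfile import SpooledTemporaryFile


def read_all_bytes(source):
    """Return the full contents of `source` as bytes.

    Hypothetical helper for illustration only; not part of TextSpitter.
    """
    if isinstance(source, (bytes, bytearray)):
        return bytes(source)
    if isinstance(source, (str, Path)):
        # Assumption of this sketch: strings are always file paths.
        return Path(source).read_bytes()
    if hasattr(source, "read"):
        # File-like objects (BytesIO, SpooledTemporaryFile, ...):
        # rewind when possible so the whole stream is read.
        if hasattr(source, "seek"):
            source.seek(0)
        data = source.read()
        return data if isinstance(data, bytes) else data.encode("utf-8")
    raise TypeError(f"Unsupported input type: {type(source).__name__}")


spooled = SpooledTemporaryFile(max_size=1024)
spooled.write(b"hello")

results = [
    read_all_bytes(b"hello"),
    read_all_bytes(BytesIO(b"hello")),
    read_all_bytes(spooled),
]
spooled.close()
```

All three calls produce the same bytes, which is the property a stream-first API relies on: downstream extraction code only ever sees one input shape.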
Why TextSpitter?
- Multi-format extraction: PDF (PyMuPDF + PyPDF fallback), DOCX, TXT, CSV, and 50+ programming-language file types.
- Stream-first API: accepts file paths, BytesIO, SpooledTemporaryFile, or raw bytes; no temp files required.
- Optional structured logging: install textspitter[logging] to add loguru; falls back to stdlib logging transparently.
- CLI included: uv tool install textspitter gives you a textspitter command for quick one-off extractions.
- Automated CI/CD: GitHub Actions run the test matrix (Python 3.12–3.14) and publish docs to GitHub Pages on every push.
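The optional-logging behaviour described above follows a common import-fallback pattern. The sketch below shows that general pattern, not the contents of TextSpitter's own logger module; the logger name `textspitter_example` and the `USING_LOGURU` flag are assumptions made for this example.

```python
import logging

try:
    # Only importable when loguru is installed (for example via the optional extra).
    from loguru import logger

    USING_LOGURU = True
except ImportError:
    # Fall back to the standard library so callers can log either way.
    logging.basicConfig(level=logging.INFO)
    logger = logging.getLogger("textspitter_example")  # hypothetical name
    USING_LOGURU = False

# Both logger objects expose .info(), so calling code does not need to branch.
logger.info("extraction started")
```

Because both objects share the basic `info`, `warning`, and `error` methods, the rest of a codebase can log without knowing which backend is active.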
| Component | Details |
|---|---|
| Architecture | |
| Code Quality | |
| Documentation | |
| Integrations | |
| Modularity | |
| Testing | |
| Performance | |
| Dependencies | |
TextSpitter/
├── .github/
│   └── workflows/
│       ├── docs.yml            # pdoc → GitHub Pages
│       ├── python-publish.yml  # PyPI release
│       └── tests.yml           # pytest matrix (3.12–3.14)
├── TextSpitter/
│   ├── __init__.py             # TextSpitter() + WordLoader public API
│   ├── cli.py                  # argparse CLI entry point
│   ├── core.py                 # FileExtractor class
│   ├── logger.py               # Optional loguru / stdlib fallback
│   ├── main.py                 # WordLoader dispatcher
│   ├── py.typed                # PEP 561 marker
│   └── guide/                  # pdoc documentation pages (subpackage)
├── tests/
│   ├── conftest.py             # shared fixtures (log_capture)
│   ├── test_cli.py
│   ├── test_file_extractor.py
│   ├── test_txt.py
│   └── ...
├── CHANGELOG.md
├── CONTRIBUTING.md
├── pyproject.toml
└── uv.lock

- Python ≥ 3.12
- uv (recommended) or pip
From PyPI:
pip install textspitter
# With optional loguru logging
pip install "textspitter[logging]"

Using uv:
uv add textspitter
# With optional loguru logging
uv add "textspitter[logging]"

As a standalone CLI tool:
uv tool install textspitter

From source:
git clone https://github.com/fsecada01/TextSpitter.git
cd TextSpitter
uv sync --all-extras --dev

As a library (one-liner):
from io import BytesIO

from TextSpitter import TextSpitter

# From a file path
text = TextSpitter(filename="report.pdf")
print(text)

# From a BytesIO stream (pdf_bytes is assumed to hold the PDF's contents as bytes)
text = TextSpitter(file_obj=BytesIO(pdf_bytes), filename="report.pdf")

# From raw bytes (docx_bytes is assumed to hold the DOCX file's contents as bytes)
text = TextSpitter(file_obj=docx_bytes, filename="contract.docx")

Using the WordLoader class directly:
from TextSpitter.main import WordLoader

loader = WordLoader(filename="data.csv")
text = loader.file_load()

As a CLI tool:
# Extract a single file to stdout
textspitter report.pdf
# Extract multiple files and write to a combined output file
textspitter file1.pdf file2.docx notes.txt -o combined.txt

uv run pytest tests/
# With coverage
uv run pytest tests/ --cov=TextSpitter --cov-report=term-missing

- Stream-based API (BytesIO, SpooledTemporaryFile, raw bytes)
- CLI entry point (uv tool install textspitter)
- Optional loguru logging with stdlib fallback
- Programming-language file support (50+ extensions)
- CI matrix (Python 3.12–3.14) + GitHub Pages docs
- Async extraction API
- CSV → structured output (list of dicts)
- PPTX support
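The list above mentions CSV to structured output as a list of dicts. As a rough illustration of that output shape, the sketch below uses only the standard library's `csv.DictReader`. It is not TextSpitter's API: `csv_to_records` is a hypothetical helper, and the sample data is made up for the example.

```python
import csv
from io import StringIO


def csv_to_records(csv_text):
    """Parse CSV text into a list of dicts keyed by the header row.

    Hypothetical helper for illustration; not a TextSpitter function.
    """
    return list(csv.DictReader(StringIO(csv_text)))


# Made-up sample data: one header row and one data row.
sample = "name,language\nTextSpitter,Python\n"
records = csv_to_records(sample)
print(records)
```

Each row becomes one dict, so downstream code can address columns by name instead of by position.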
v2.0: Rust backend (full roadmap)
- Rust splitting core via PyO3 + Maturin (10x–40x batch throughput)
- Graceful Python fallback when Rust extension is unavailable
- manylinux wheels on PyPI: zero-compile install for Linux users
- Memory-mapped file processing for very large PDFs (memmap2)
- SIMD-accelerated string search for separator detection
- Streaming iterator API (yield chunks instead of collecting all)
- Optional SIMD feature flag (pip install "textspitter[simd]")
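The streaming iterator item above contrasts yielding chunks with collecting everything first. The generator below is a generic sketch of that idea in plain Python, not the planned TextSpitter API. The name `iter_chunks` is hypothetical, and treating a chunk as a fixed-size block of bytes is an assumption; the roadmap does not say how chunks will be defined.

```python
from io import BytesIO


def iter_chunks(stream, chunk_size=4096):
    """Yield successive byte chunks from a binary stream until it is exhausted.

    Hypothetical helper for illustration; only one chunk is held in memory at a time.
    """
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        yield chunk


# A tiny in-memory stream stands in for a large file.
chunks = list(iter_chunks(BytesIO(b"abcdefghij"), chunk_size=4))
```

In real use the caller would loop over `iter_chunks(...)` directly rather than wrapping it in `list`, which is what keeps memory use flat for large inputs.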
- Join the Discussions: Share insights, give feedback, or ask questions.
- Report Issues: Submit bugs or log feature requests.
- Submit Pull Requests: Review open PRs or submit your own.
Contributing Guidelines
- Fork the Repository: Fork the project to your GitHub account.
- Clone Locally: Clone the forked repository.
git clone https://github.com/fsecada01/TextSpitter.git
- Create a New Branch: Always work on a new branch.
git checkout -b new-feature-x
- Make Your Changes: Develop and test your changes locally.
- Commit Your Changes: Commit with a clear message.
git commit -m 'Add new feature x.'
- Push to GitHub: Push the changes to your fork.
git push origin new-feature-x
- Submit a Pull Request: Create a PR against main. Describe the changes and motivation clearly.
- Review: Once approved, your PR will be merged. Thanks for contributing!
TextSpitter is released under the MIT License.