Folder Tokenizer

Analyze token counts for all documents in a folder. Supports nested folders, zip files, and various document types.

Features

Multiple tokenizers: Use any Hugging Face tokenizer (GPT-2, BERT, LLaMA, etc.)
Recursive scanning: Handles nested folders and zip archives
Wide format support: Text, code, PDF, DOCX, images (OCR), and more
Interactive UI: Streamlit-based interface for easy use
Detailed reports: Export results as CSV or JSON
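
To illustrate what recursive scanning with zip handling involves, here is a minimal sketch using only the standard library. It is not the project's implementation: the names `scan_folder`, `_scan_zip`, `whitespace_count`, and `TEXT_SUFFIXES` are hypothetical, the whitespace splitter is a stand-in for a real tokenizer, and the `archive.zip!member` naming for archive members is an assumption made for this example.

```python
import io
import zipfile
from pathlib import Path
from typing import Callable, Dict

# Illustrative subset of plain-text extensions (assumption, not the tool's list).
TEXT_SUFFIXES = {".txt", ".md", ".py", ".json", ".csv"}


def whitespace_count(text: str) -> int:
    """Stand-in tokenizer: counts whitespace-separated words.

    A real Hugging Face tokenizer would produce different counts.
    """
    return len(text.split())


def _scan_zip(data: bytes, prefix: str, count: Callable[[str], int],
              results: Dict[str, int]) -> None:
    """Count tokens for text members of a zip, recursing into nested zips."""
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue
            name = f"{prefix}!{info.filename}"
            suffix = Path(info.filename).suffix.lower()
            if suffix == ".zip":
                _scan_zip(archive.read(info), name, count, results)
            elif suffix in TEXT_SUFFIXES:
                text = archive.read(info).decode("utf-8", errors="replace")
                results[name] = count(text)


def scan_folder(root: Path,
                count: Callable[[str], int] = whitespace_count) -> Dict[str, int]:
    """Walk `root` recursively and return {relative path: token count}."""
    results: Dict[str, int] = {}
    for path in sorted(root.rglob("*")):
        if not path.is_file():
            continue
        rel = path.relative_to(root).as_posix()
        suffix = path.suffix.lower()
        if suffix == ".zip":
            _scan_zip(path.read_bytes(), rel, count, results)
        elif suffix in TEXT_SUFFIXES:
            text = path.read_text(encoding="utf-8", errors="replace")
            results[rel] = count(text)
    return results
```

Passing the counting function as a parameter keeps the traversal independent of the tokenizer, so the same walk could be reused with any model's tokenizer.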

Supported File Types

Text: .txt, .md, .json, .csv, .xml, .yaml, .yml
Code: .py, .js, .ts, .java, .c, .cpp, .go, .rs, .rb, etc.
Documents: .pdf, .docx
Images (OCR): .png, .jpg, .jpeg, .tiff, .bmp
Archives: .zip (automatically extracted and processed)
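
As a rough sketch of how the extensions listed above could be routed to a handler category, the snippet below maps a file name to one of the five groups. The `FILE_TYPES` table and the `classify` function are hypothetical names for illustration; the Code group in the list ends with "etc.", so only the extensions actually named there are included.

```python
from pathlib import Path

# Extension groups taken from the Supported File Types list.
# The code group is truncated with "etc." in the README, so it is incomplete here.
FILE_TYPES = {
    "text": {".txt", ".md", ".json", ".csv", ".xml", ".yaml", ".yml"},
    "code": {".py", ".js", ".ts", ".java", ".c", ".cpp", ".go", ".rs", ".rb"},
    "document": {".pdf", ".docx"},
    "image": {".png", ".jpg", ".jpeg", ".tiff", ".bmp"},
    "archive": {".zip"},
}


def classify(filename: str) -> str:
    """Return the category for a file name, or "unsupported" if none matches."""
    suffix = Path(filename).suffix.lower()  # case-insensitive match on the last suffix
    for category, suffixes in FILE_TYPES.items():
        if suffix in suffixes:
            return category
    return "unsupported"
```

Only the final suffix is considered, so a name like `notes.tar.gz` is looked up as `.gz` and falls through to "unsupported".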

Installation

# Clone the repository
git clone https://github.com/sgciv2805/folder-tokenizer
cd folder-tokenizer

# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install core dependencies (CLI only)
pip install -e .

# Or install with the Streamlit web UI
pip install -e ".[ui]"

OCR Support (Optional)

For image OCR support, install Tesseract:

# macOS
brew install tesseract

# Ubuntu/Debian
sudo apt-get install tesseract-ocr

# Windows
# Download from: https://github.com/UB-Mannheim/tesseract/wiki
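
After installing, it can be useful to confirm that the `tesseract` executable is visible on your PATH. The check below is a small sketch using the standard library; `tesseract_available` is a hypothetical helper, and how the project itself detects Tesseract is not stated here.

```python
import shutil


def tesseract_available() -> bool:
    """Return True if a `tesseract` executable is found on PATH."""
    return shutil.which("tesseract") is not None


if __name__ == "__main__":
    if tesseract_available():
        print("Tesseract found")
    else:
        print("Tesseract not found on PATH")
```

On Windows the installer may not add Tesseract to PATH automatically, in which case this check reports it as missing until PATH is updated.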

Usage

Web UI (Streamlit)

streamlit run src/folder_tokenizer/app.py

Command Line

folder-tokenizer /path/to/folder --model gpt2
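
For context on what a per-model token count means, the sketch below counts tokens for a string with a Hugging Face tokenizer. `AutoTokenizer.from_pretrained` and `encode` are part of the `transformers` library; the helper names `count_tokens` and `load_tokenizer` are hypothetical, and this is not necessarily how the `folder-tokenizer` command computes its numbers.

```python
def count_tokens(text: str, tokenizer) -> int:
    """Count tokens using any object with a Hugging Face style `encode` method."""
    return len(tokenizer.encode(text))


def load_tokenizer(model_name: str = "gpt2"):
    """Load a tokenizer by model name (downloads tokenizer files on first use)."""
    # Imported lazily so this module can be imported without transformers installed.
    from transformers import AutoTokenizer
    return AutoTokenizer.from_pretrained(model_name)


# Example usage (requires `pip install transformers` and network access):
#   tokenizer = load_tokenizer("gpt2")
#   print(count_tokens("Hello, world!", tokenizer))
```

Note that `encode` adds special tokens by default for some models (for example `[CLS]` and `[SEP]` with BERT), so counts for the same text can differ between models for reasons beyond the vocabulary.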

License

MIT

