Analyze token counts for all documents in a folder. Supports nested folders, zip files, and various document types.
- Multiple tokenizers: Use any HuggingFace tokenizer (GPT-2, BERT, LLaMA, etc.)
- Recursive scanning: Handles nested folders and zip archives
- Wide format support: Text, code, PDF, DOCX, images (OCR), and more
- Interactive UI: Streamlit-based interface for easy use
- Detailed reports: Export results as CSV or JSON
Supported formats:
- Text: .txt, .md, .json, .csv, .xml, .yaml, .yml
- Code: .py, .js, .ts, .java, .c, .cpp, .go, .rs, .rb, etc.
- Documents: .pdf, .docx
- Images (OCR): .png, .jpg, .jpeg, .tiff, .bmp
- Archives: .zip (automatically extracted and processed)
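To illustrate the recursive scanning described above, here is a minimal sketch using only the Python standard library. It handles nested folders and .zip archives for plain-text extensions; the real tool also covers PDF, DOCX, and image OCR, and the names `scan_folder` and `TEXT_EXTS` are illustrative, not the package's actual API.

```python
import zipfile
from pathlib import Path

# Illustrative subset of the supported text extensions
TEXT_EXTS = {".txt", ".md", ".json", ".csv", ".xml", ".yaml", ".yml", ".py"}

def scan_folder(root):
    """Recursively yield (path, text) pairs for supported files,
    descending into nested folders and .zip archives."""
    root = Path(root)
    for path in root.rglob("*"):
        if not path.is_file():
            continue
        suffix = path.suffix.lower()
        if suffix in TEXT_EXTS:
            yield str(path), path.read_text(encoding="utf-8", errors="replace")
        elif suffix == ".zip":
            # Read supported entries directly from the archive
            with zipfile.ZipFile(path) as zf:
                for name in zf.namelist():
                    if Path(name).suffix.lower() in TEXT_EXTS:
                        text = zf.read(name).decode("utf-8", errors="replace")
                        yield f"{path}!{name}", text
```

Each yielded text can then be fed to the chosen tokenizer for counting.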
# Clone the repository
git clone https://github.com/sgciv2805/folder-tokenizer
cd folder-tokenizer
# Create virtual environment
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install core dependencies (CLI only)
pip install -e .
# Or install with the Streamlit web UI
pip install -e ".[ui]"

For image OCR support, install Tesseract:
# macOS
brew install tesseract
# Ubuntu/Debian
sudo apt-get install tesseract-ocr
# Windows
# Download from: https://github.com/UB-Mannheim/tesseract/wiki

Launch the Streamlit web UI:
streamlit run src/folder_tokenizer/app.py

Or run the CLI directly:
folder-tokenizer /path/to/folder --model gpt2

License: MIT
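The CSV export could be produced along these lines. This is a minimal sketch, not the package's actual code: `encode` stands in for any callable that maps text to a list of token IDs (a HuggingFace tokenizer's `.encode` method fits this shape), and the helper names `count_tokens` and `write_report` are hypothetical.

```python
import csv

def count_tokens(texts_by_path, encode):
    """Count tokens per file; `encode` maps text -> list of token IDs,
    e.g. a HuggingFace tokenizer's .encode method."""
    return {path: len(encode(text)) for path, text in texts_by_path.items()}

def write_report(counts, out_path):
    """Write a CSV report with one row per file plus a TOTAL row."""
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["path", "tokens"])
        for path, n in sorted(counts.items()):
            writer.writerow([path, n])
        writer.writerow(["TOTAL", sum(counts.values())])
```

Swapping `encode` for a different tokenizer changes only the counts, not the report format.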