A modular Python package for analyzing folder structures and PDF files with both GUI and CLI interfaces.
- Dual Interface: Use either GUI (tkinter) or CLI (argparse)
- Recursive Folder Scanning: Analyze entire directory trees
- File Analysis: Track file paths, names, sizes (in MB)
- PDF Analysis: Extract page counts and word counts from PDF files using PyMuPDF
- Comprehensive Logging:
  - Summary reports with grouped counts and sizes
  - Separate error log for exceptions
  - Detailed CSV export of all files
- Flexible Filtering: Optional file extension filter
- Statistics:
  - Total file counts and sizes
  - Grouped by folder
  - Grouped by extension
  - PDF-specific totals (pages, words)
- Clone or download this repository
- Install dependencies:
pip install -r requirements.txt

Run without arguments to launch the graphical interface:
python -m file_analyzer
Or explicitly:
python -m file_analyzer.gui
GUI Features:
- Browse buttons for easy folder selection
- Optional file extension filter
- Real-time log output
- Progress indicator
- Clear and intuitive interface
Run with arguments for command-line operation:
python -m file_analyzer -i /path/to/input -o /path/to/output
CLI Arguments:
- -i, --input: Input folder to analyze (required)
- -o, --output: Output folder for logs and reports (required)
- -e, --extension: File extension filter, e.g., .pdf, .txt (optional)
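As a rough illustration, the three options above could be declared with argparse along these lines. This is a minimal sketch, not the package's actual `cli.py`: `build_parser` and `normalize_extension` are hypothetical names, and the extension normalization is an assumption about how a filter such as `pdf` or `.PDF` might be treated.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented -i / -o / -e options."""
    parser = argparse.ArgumentParser(
        prog="file_analyzer",
        description="Analyze a folder tree and any PDF files in it.",
    )
    parser.add_argument("-i", "--input", required=True,
                        help="Input folder to analyze")
    parser.add_argument("-o", "--output", required=True,
                        help="Output folder for logs and reports")
    parser.add_argument("-e", "--extension", default=None,
                        help="Optional file extension filter, e.g. .pdf")
    return parser


def normalize_extension(ext):
    """Hypothetical helper: turn 'pdf', '.pdf' or '.PDF' into '.pdf'."""
    if not ext:
        return None
    ext = ext.strip().lower()
    return ext if ext.startswith(".") else "." + ext


# Parse an explicit argument list instead of sys.argv for demonstration.
args = build_parser().parse_args(["-i", "docs", "-o", "reports", "-e", "PDF"])
print(args.input, args.output, normalize_extension(args.extension))
```

Because `-i` and `-o` are marked `required=True`, argparse itself prints a usage message and exits when either is missing.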
Examples:
# Analyze all files
python -m file_analyzer.cli -i C:\Documents -o C:\Reports
# Analyze only PDF files
python -m file_analyzer.cli -i C:\Documents -o C:\Reports -e .pdf
# Analyze only text files
python -m file_analyzer.cli -i C:\Documents -o C:\Reports -e .txt

All output files are saved to <output_folder>/logs/:
- summary_YYYYMMDD_HHMMSS.log - Comprehensive summary report with:
  - Overall statistics
  - Statistics by folder
  - Statistics by extension
  - PDF statistics (if applicable)
- errors_YYYYMMDD_HHMMSS.log - Error log with exception details
- file_details_YYYYMMDD_HHMMSS.csv - Detailed CSV with all file information
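The naming scheme above can be produced with a single shared timestamp. The following is a sketch under assumptions: `build_output_paths` is a hypothetical helper, not a function confirmed to exist in the package, and it only builds the paths without creating any files.

```python
from datetime import datetime
from pathlib import Path


def build_output_paths(output_folder, now=None):
    """Return the three timestamped output paths under <output_folder>/logs/."""
    now = now or datetime.now()
    stamp = now.strftime("%Y%m%d_%H%M%S")  # e.g. 20240131_090507
    logs_dir = Path(output_folder) / "logs"
    return {
        "summary": logs_dir / f"summary_{stamp}.log",
        "errors": logs_dir / f"errors_{stamp}.log",
        "details": logs_dir / f"file_details_{stamp}.csv",
    }


# A fixed datetime makes the result reproducible.
paths = build_output_paths("reports", datetime(2024, 1, 31, 9, 5, 7))
print(paths["summary"].name)  # summary_20240131_090507.log
```

Using one timestamp for all three files keeps the outputs of a single run easy to match up.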
file_analyzer/
├── __init__.py # Package initialization
├── __main__.py # Main entry point
├── scanner.py # File system scanning module
├── pdf_analyzer.py # PDF analysis module
├── logger.py # Logging and reporting module
├── cli.py # Command-line interface
└── gui.py # Graphical user interface
FileScanner: Recursively scans directories and collects file information
- Supports optional file extension filtering
- Calculates statistics by folder and extension
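To make the scanning and grouping behaviour concrete, here is a minimal standalone sketch. It does not reproduce the real `FileScanner` class: `scan_folder` and `group_stats` are hypothetical functions, the record keys are assumptions, and sizes are converted with 1 MB = 1024 * 1024 bytes, which may differ from the package's convention.

```python
import os
from collections import defaultdict


def scan_folder(root, extension=None):
    """Walk root recursively and return one record per matching file."""
    records = []
    for folder, _dirs, files in os.walk(root):
        for name in files:
            ext = os.path.splitext(name)[1].lower()
            # Skip files that do not match the optional filter.
            if extension and ext != extension.lower():
                continue
            path = os.path.join(folder, name)
            records.append({
                "path": path,
                "name": name,
                "folder": folder,
                "extension": ext,
                "size_mb": os.path.getsize(path) / (1024 * 1024),
            })
    return records


def group_stats(records, key):
    """Return {group: {"count": n, "size_mb": total}} for a record key."""
    stats = defaultdict(lambda: {"count": 0, "size_mb": 0.0})
    for rec in records:
        stats[rec[key]]["count"] += 1
        stats[rec[key]]["size_mb"] += rec["size_mb"]
    return dict(stats)
```

Calling `group_stats(records, "folder")` and `group_stats(records, "extension")` on the same record list yields the two groupings mentioned above.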
PDFAnalyzer: Analyzes PDF files using PyMuPDF (fitz)
- Extracts page counts and word counts
- Handles errors gracefully for corrupted PDFs
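The page and word counting described above might look roughly like this with PyMuPDF. It is a sketch, not the package's `PDFAnalyzer`: `analyze_pdf` and the result keys are hypothetical, and counting words as the entries returned by `page.get_text("words")` is an assumption about how the package defines a word.

```python
def analyze_pdf(path):
    """Return page and word counts for one PDF, or an error message."""
    result = {"path": path, "pages": 0, "words": 0, "error": None}
    try:
        # Imported here so a missing PyMuPDF install is reported, not raised.
        import fitz  # PyMuPDF

        with fitz.open(path) as doc:
            result["pages"] = doc.page_count
            for page in doc:
                # get_text("words") returns one tuple per word on the page.
                result["words"] += len(page.get_text("words"))
    except Exception as exc:  # corrupted, encrypted, or missing file
        result["pages"] = 0
        result["words"] = 0
        result["error"] = f"{type(exc).__name__}: {exc}"
    return result
```

A caller can check `result["error"]` and send any message to the error log while continuing with the next file.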
AnalysisLogger: Comprehensive logging system
- Creates separate logs for summaries and errors
- Generates CSV reports with detailed file information
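As an illustration of the CSV report, the standard library's `csv.DictWriter` is enough. The column names below are assumptions chosen to match the fields listed earlier (path, name, size in MB, pages, words); the actual `AnalysisLogger` may use different headers, and `write_details_csv` is a hypothetical name.

```python
import csv

# Assumed column layout; the real report may differ.
FIELDNAMES = ["path", "name", "extension", "size_mb", "pages", "words"]


def write_details_csv(records, csv_path):
    """Write one row per file record and return the number of rows written."""
    with open(csv_path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(
            fh,
            fieldnames=FIELDNAMES,
            restval="",             # non-PDF rows leave pages/words empty
            extrasaction="ignore",  # drop keys that are not report columns
        )
        writer.writeheader()
        for rec in records:
            row = dict(rec)
            row["size_mb"] = f"{rec['size_mb']:.3f}"
            writer.writerow(row)
    return len(records)
```

Opening the file with `newline=""` follows the `csv` module's documented recommendation and avoids blank lines between rows on Windows.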
- Command-line interface using argparse
- Validates inputs and coordinates analysis workflow
- Graphical interface using tkinter
- Threaded analysis to prevent UI freezing
- Real-time log output and progress indication
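One common way to keep a tkinter window responsive is to run the work on a background thread and pass messages to the UI through a queue. The sketch below shows that pattern under assumptions: `start_worker` and `poll_messages` are hypothetical helpers, and the real `gui.py` may be organized differently. The only tkinter call relied on is the widget method `after(ms, func, *args)`.

```python
import queue
import threading


def start_worker(task, items, messages):
    """Run task(item) for each item on a background thread.

    Progress strings are put on `messages`; None marks completion.
    """
    def run():
        total = len(items)
        for index, item in enumerate(items, start=1):
            try:
                task(item)
                messages.put(f"[{index}/{total}] done: {item}")
            except Exception as exc:
                messages.put(f"[{index}/{total}] failed: {item} ({exc})")
        messages.put(None)

    thread = threading.Thread(target=run, daemon=True)
    thread.start()
    return thread


def poll_messages(widget, messages, on_message, interval_ms=100):
    """Drain the queue on the UI thread, then reschedule via widget.after()."""
    try:
        while True:
            msg = messages.get_nowait()
            if msg is None:
                return  # worker finished; stop polling
            on_message(msg)
    except queue.Empty:
        widget.after(interval_ms, poll_messages,
                     widget, messages, on_message, interval_ms)
```

In a GUI, `poll_messages(root, q, append_to_log)` would be called once after `start_worker(...)`. Only the polling function touches widgets, which matters because tkinter widgets should be updated from the main thread.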
The package includes comprehensive exception handling:
- Invalid paths are validated before processing
- File access errors are logged and don't stop the scan
- PDF processing errors are captured and reported separately
- All exceptions are logged to the error log with full details
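The behaviour listed above (validate first, log failures with full details, keep going) can be sketched with the standard `logging` module. The names `validate_input_dir`, `make_error_logger`, and `collect_sizes` are hypothetical, as is the log format; the package's own error log may look different.

```python
import logging
import os


def validate_input_dir(path):
    """Fail early, before any scanning, if the input folder is unusable."""
    if not os.path.isdir(path):
        raise ValueError(f"Input folder does not exist: {path}")


def make_error_logger(error_log_path):
    """Return (logger, handler) writing ERROR records to a dedicated file."""
    logger = logging.getLogger("file_analyzer.errors.sketch")
    logger.setLevel(logging.ERROR)
    logger.propagate = False  # keep these records out of the root logger
    handler = logging.FileHandler(error_log_path, encoding="utf-8")
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
    logger.addHandler(handler)
    return logger, handler


def collect_sizes(paths, logger):
    """Read file sizes, logging failures instead of aborting the loop."""
    sizes = {}
    for path in paths:
        try:
            sizes[path] = os.path.getsize(path)
        except OSError:
            # logger.exception records the message plus the full traceback.
            logger.exception("Could not read %s", path)
    return sizes
```

The caller should close the returned handler once the run is finished so the log file is flushed and released.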
- Python 3.7+
- PyMuPDF >= 1.23.0
- tkinter (usually included with Python)
This project is provided as-is for educational and practical use.
Feel free to submit issues, fork the repository, and create pull requests for any improvements.