An automated tool that reads job application emails from Outlook, summarizes them with AI, and syncs the results to a Notion database, with deduplication and incremental sync.
- Email Processing: Fetches emails from Outlook using Microsoft Graph API
- AI Summarization: Extracts company, role, location, and application stage using OpenAI GPT-4o-mini
- Smart Deduplication: Aggregates multiple emails for the same application into a single lifecycle entry
- Notion Integration: Syncs to Notion databases with automatic company and application tracking
- Incremental Sync: 4-stage pipeline with caching, so only new emails are processed
- Fully Tested: Comprehensive pytest test suite for all modules
pip install -r requirements.txt
Create a .env file in the project root:
# Microsoft Azure App Registration
TENANT_ID=your-tenant-id
CLIENT_ID=your-client-id
# OpenAI API
OPENAI_API_KEY=your-openai-api-key
# Notion API
NOTION_API_KEY=your-notion-integration-token
NOTION_COMPANIES_DB_ID=your-companies-database-id
NOTION_APPLICATIONS_DB_ID=your-applications-database-id
# Optional: User email for contact filtering
USER_EMAIL=your-email@example.com
Note: Using Public Client Flow (device code flow) does NOT require CLIENT_SECRET.
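The settings above could be validated at startup along these lines. This is a minimal sketch, not the project's actual loader: the `Settings` class and `load_settings` function are hypothetical names, and it assumes the `.env` values have already been loaded into the process environment (for example by a dotenv loader).

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

# Keys the .env file above marks as required.
REQUIRED_KEYS = (
    "TENANT_ID",
    "CLIENT_ID",
    "OPENAI_API_KEY",
    "NOTION_API_KEY",
    "NOTION_COMPANIES_DB_ID",
    "NOTION_APPLICATIONS_DB_ID",
)


@dataclass(frozen=True)
class Settings:
    """Hypothetical container for the values read from the environment."""
    tenant_id: str
    client_id: str
    openai_api_key: str
    notion_api_key: str
    notion_companies_db_id: str
    notion_applications_db_id: str
    user_email: Optional[str] = None  # optional, used for contact filtering


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Build Settings from a mapping, failing early if a required key is missing."""
    missing = [key for key in REQUIRED_KEYS if not env.get(key)]
    if missing:
        raise ValueError("Missing required settings: " + ", ".join(missing))
    values = {key.lower(): env[key] for key in REQUIRED_KEYS}
    return Settings(user_email=env.get("USER_EMAIL") or None, **values)
```

Passing the mapping explicitly keeps the function easy to test without touching the real environment.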
Create two databases in Notion:
Companies Database:
- Name (Title) - Required
- Location (Text) - Optional
- Industry (Text) - Optional
- Website (URL) - Optional
- Notes (Text) - Optional
Job Applications Database:
- Job Role (Title) - Required
- Company (Relation → Companies) - Required
- Location (Text) - Optional
- Contact (Text) - Optional
- Status (Status) - Required (options: Applied, Interview, Rejected, Offer, Withdrawn)
- Application Date (Date) - Optional
- Last Communication Date (Date) - Optional
- Last Communication Type (Select) - Optional
- Last Communication Notes (Text) - Optional
- Email IDs (Text) - Required (comma-separated list of synced email IDs)
- Notes (Text) - Optional
Important: Both databases must be connected to your Notion Integration.
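For orientation, here is a sketch of how one Job Applications row might be expressed as a Notion API `properties` payload. The property names come from the schema above and the value shapes follow the public Notion API; the function name `build_application_properties` and its parameters are hypothetical and may not match what `notion_writer.py` does.

```python
from typing import Any, Dict, List, Optional


def build_application_properties(
    job_role: str,
    company_page_id: str,
    status: str,
    email_ids: List[str],
    application_date: Optional[str] = None,
) -> Dict[str, Any]:
    """Return a Notion `properties` dict for one Job Applications row."""
    properties: Dict[str, Any] = {
        # Title property
        "Job Role": {"title": [{"text": {"content": job_role}}]},
        # Relation to a page in the Companies database
        "Company": {"relation": [{"id": company_page_id}]},
        # Status property: one of Applied, Interview, Rejected, Offer, Withdrawn
        "Status": {"status": {"name": status}},
        # Text property holding the comma-separated list of synced email IDs
        "Email IDs": {"rich_text": [{"text": {"content": ",".join(email_ids)}}]},
    }
    if application_date is not None:
        # Notion date properties take an ISO 8601 string in "start".
        properties["Application Date"] = {"date": {"start": application_date}}
    return properties
```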
# Process emails (default: 100 emails)
python process_emails.py
# Process specific number of emails
python process_emails.py --limit 50
# Process from specific folder
python process_emails.py --folder "Applications" --limit 20
# Dry run (test without writing to Notion)
python process_emails.py --limit 5 --dry-run
The system uses a 4-stage pipeline with intelligent caching:
- Stage 1 → Stage 2: Fetch emails from Outlook and cache raw email data
- Stage 2 → Stage 3: Summarize emails using AI and cache summaries
- Stage 3 → Stage 4: Deduplicate emails by Company + Role and sync to Notion
- Stage 4: Track synced emails using Email IDs stored in Notion
Each stage only processes items from the previous stage that haven't been processed yet, using cache files to track progress. This enables efficient incremental sync even after months of inactivity.
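The incremental behaviour described above can be pictured as a chain of set differences. The sketch below is illustrative only: `plan_incremental_run` and its parameter names are hypothetical, and it assumes each stage can report the set of email IDs it has already handled.

```python
from typing import Dict, List, Set


def plan_incremental_run(
    fetched: Set[str],
    cached_emails: Set[str],
    cached_summaries: Set[str],
    synced: Set[str],
) -> Dict[str, List[str]]:
    """Work out which email IDs each stage still has to process."""
    # Stage 1 -> 2: emails seen in Outlook that are not cached yet.
    to_cache = fetched - cached_emails
    available_emails = cached_emails | to_cache
    # Stage 2 -> 3: cached emails that have no summary yet.
    to_summarize = available_emails - cached_summaries
    available_summaries = cached_summaries | to_summarize
    # Stage 3 -> 4: summaries whose email ID is not recorded in Notion yet.
    to_sync = available_summaries - synced
    return {
        "cache": sorted(to_cache),
        "summarize": sorted(to_summarize),
        "sync": sorted(to_sync),
    }
```

For example, with three fetched emails of which two are cached and one is summarized, only the third email is fetched again and only the two unsummarized ones are sent to the AI stage.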
Outlook Application Summary New/
├── src/                        # Core modules (pure, no CLI)
│   ├── email_summary.py        # EmailSummary dataclass
│   ├── read_outlook_emails.py  # Outlook API integration
│   ├── summarize_emails.py     # AI summarization
│   ├── deduplicate.py          # Email deduplication logic
│   ├── notion_writer.py        # Notion API integration
│   ├── cache.py                # File-based caching
│   └── test/                   # Pytest test suite
├── cache/                      # Cache storage (gitignored)
│   ├── email/                  # Cached raw emails (email_<id>.json)
│   └── summary/                # Cached summaries (summary_<id>.json)
├── process_emails.py           # Main orchestrator script
├── requirements.txt            # Python dependencies
└── README.md                   # This file
- src/: Pure modules with no CLI, file I/O, or environment variable loading
- process_emails.py: Orchestrator script that ties everything together
- src/test/: Pytest tests for all modules
src/email_summary.py: Data structure representing a summarized email with company, role, location, application stage, and contact information.
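A summary record of this kind might look roughly like the following. The field names and the `to_dict`/`from_dict` helpers are assumptions made for illustration; the real `EmailSummary` dataclass may differ.

```python
from dataclasses import asdict, dataclass
from typing import Any, Dict, Optional


@dataclass(frozen=True)
class EmailSummary:
    """Illustrative shape only; field names are assumed, not taken from the project."""
    email_id: str
    company: str
    role: str
    stage: str                      # e.g. "Applied", "Interview", "Rejected"
    location: Optional[str] = None
    contact: Optional[str] = None

    def to_dict(self) -> Dict[str, Any]:
        """Convert to a plain dict, e.g. for writing a JSON cache file."""
        return asdict(self)

    @classmethod
    def from_dict(cls, data: Dict[str, Any]) -> "EmailSummary":
        """Rebuild a summary from a dict produced by to_dict()."""
        return cls(**data)
```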
src/read_outlook_emails.py: Handles authentication with Microsoft Graph API and fetches emails from Outlook folders using device code flow.
src/summarize_emails.py: Uses OpenAI GPT-4o-mini to extract structured information from emails. Cleans HTML content and extracts company, role, location, and application stage.
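The HTML-cleaning step could be done with the standard library alone, as in the sketch below. This is one possible approach, not necessarily the one the module uses; `html_to_text` and `_TextExtractor` are hypothetical names.

```python
from html.parser import HTMLParser
from typing import List


class _TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    _SKIPPED = {"script", "style"}

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self._chunks: List[str] = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self._SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self._SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self) -> str:
        # Join the chunks and collapse every run of whitespace to one space.
        return " ".join(" ".join(self._chunks).split())


def html_to_text(html: str) -> str:
    """Return the visible text of an HTML email body as a single line."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

Stripping markup before calling the model keeps the prompt shorter, which matters when processing many emails.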
src/deduplicate.py: Aggregates multiple emails for the same job application (Company + Role) into a single ApplicationEntry with lifecycle tracking.
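Grouping by Company + Role can be sketched as follows. The field names, the case-insensitive key, and the "latest email sets the status" rule are all assumptions made for illustration; the `EmailSummary` here is a minimal stand-in, and the real `ApplicationEntry` may track more.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class EmailSummary:
    """Minimal stand-in with only the fields this sketch needs (assumed names)."""
    email_id: str
    company: str
    role: str
    stage: str
    received: str  # ISO 8601 timestamp; assumed to share one format and timezone


@dataclass
class ApplicationEntry:
    """One application, aggregated from all emails about it."""
    company: str
    role: str
    status: str
    email_ids: List[str] = field(default_factory=list)


def _key(summary: EmailSummary) -> Tuple[str, str]:
    # Assumption: company and role are compared case-insensitively.
    return (summary.company.strip().lower(), summary.role.strip().lower())


def deduplicate(summaries: Iterable[EmailSummary]) -> List[ApplicationEntry]:
    """Merge summaries that share a (company, role) key into one entry each."""
    entries: Dict[Tuple[str, str], ApplicationEntry] = {}
    for summary in sorted(summaries, key=lambda s: s.received):
        key = _key(summary)
        if key not in entries:
            entries[key] = ApplicationEntry(summary.company, summary.role, summary.stage)
        entry = entries[key]
        entry.status = summary.stage          # the most recent email wins
        entry.email_ids.append(summary.email_id)
    return list(entries.values())
```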
src/notion_writer.py: Manages Notion database operations: creating/updating companies and job applications, fetching existing entries, and tracking synced email IDs.
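Because synced emails are tracked as a comma-separated "Email IDs" text property, deciding what is new for an existing entry comes down to a small merge. A minimal sketch, with hypothetical helper names:

```python
from typing import Iterable, List, Tuple


def parse_email_ids(raw: str) -> List[str]:
    """Split the comma-separated property value, ignoring blanks and spaces."""
    return [part.strip() for part in raw.split(",") if part.strip()]


def merge_email_ids(existing_raw: str, candidate_ids: Iterable[str]) -> Tuple[str, List[str]]:
    """Return the updated property value and the IDs that were not synced before."""
    existing = parse_email_ids(existing_raw)
    seen = set(existing)
    new_ids: List[str] = []
    for email_id in candidate_ids:
        if email_id not in seen:
            seen.add(email_id)
            new_ids.append(email_id)
    return ",".join(existing + new_ids), new_ids
```

If the second return value is empty, the Notion entry is already up to date and no write is needed.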
src/cache.py: File-based caching system for emails and summaries, enabling incremental processing.
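A file-based cache in this style can be very small. The sketch below follows the `<kind>_<id>.json` naming shown in the project structure; the `FileCache` class and its methods are hypothetical, and it assumes email IDs are safe to use in file names.

```python
import json
from pathlib import Path
from typing import Any, Dict, Optional


class FileCache:
    """Store one JSON file per item; a file's existence marks the item as cached."""

    def __init__(self, root: Path, kind: str) -> None:
        # kind is "email" or "summary", matching cache/email/ and cache/summary/
        self._kind = kind
        self._dir = Path(root) / kind

    def _path(self, item_id: str) -> Path:
        # Assumption: item_id contains no characters that are invalid in file names.
        return self._dir / f"{self._kind}_{item_id}.json"

    def has(self, item_id: str) -> bool:
        return self._path(item_id).exists()

    def save(self, item_id: str, data: Dict[str, Any]) -> None:
        self._dir.mkdir(parents=True, exist_ok=True)
        self._path(item_id).write_text(json.dumps(data), encoding="utf-8")

    def load(self, item_id: str) -> Optional[Dict[str, Any]]:
        if not self.has(item_id):
            return None
        return json.loads(self._path(item_id).read_text(encoding="utf-8"))
```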
# Run all tests
pytest src/test/ -v
# Run specific test file
pytest src/test/test_email_summary.py -v
# Run with coverage
pytest src/test/ --cov=src --cov-report=html
All modules in src/ follow these principles:
- No CLI code
- No file I/O helpers
- No environment variable loading
- Single responsibility
- Pure functions where possible
- Explicit dependencies
- Full type hints
- Comprehensive tests
- Go to Azure Portal → Azure Active Directory → App registrations
- Create a new registration
- Note your Application (client) ID and Directory (tenant) ID
- Go to API permissions → Add permission → Microsoft Graph → Delegated permissions
- Add Mail.Read permission
- Grant admin consent
- Go to Authentication → Enable "Allow public client flows" → Save
- Get API key from OpenAI Platform
- Add to .env file as OPENAI_API_KEY
- Create Integration at Notion Integrations
- Get your integration token
- Create Companies and Job Applications databases
- Connect both databases to your integration
- Extract database IDs from URLs (32 characters, remove hyphens)
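Extracting the database ID by hand is error-prone because a Notion URL can also carry a view ID in its query string. A small helper like the one below can do it; the function name is hypothetical, and it assumes the ID is the trailing 32 hexadecimal characters of the last path segment, with or without hyphens.

```python
import re
from urllib.parse import urlparse

# 32 hexadecimal characters at the end of the (hyphen-free) path segment.
_ID_AT_END = re.compile(r"[0-9a-fA-F]{32}$")


def extract_database_id(url: str) -> str:
    """Return the 32-character database ID from a Notion database URL."""
    path = urlparse(url).path  # ignores the query string, e.g. ?v=<view id>
    last_segment = path.rstrip("/").rsplit("/", 1)[-1]
    match = _ID_AT_END.search(last_segment.replace("-", ""))
    if match is None:
        raise ValueError(f"No 32-character database ID found in: {url}")
    return match.group(0).lower()
```

The returned value is what goes into NOTION_COMPANIES_DB_ID or NOTION_APPLICATIONS_DB_ID.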
AADSTS7000218 Error:
- Go to Azure Portal → App registrations → Your App → Authentication
- Enable "Allow public client flows" → Save
- Re-run the script
Permission Denied:
- Ensure Mail.Read permission is granted with admin consent
- Verify you've clicked "Grant admin consent" in Azure Portal
- Verify database IDs are correct (32 characters, remove hyphens from URL)
- Ensure both databases are connected to your integration
- Check that property names match exactly (case-sensitive, e.g., "Status" not "status")
- Ensure "Email IDs" property exists in Job Applications database (Text type)
- Cache is stored in the cache/ directory (created automatically)
- Delete the cache/ directory to force reprocessing of all emails
- Cache is purely file-based: the existence of a file determines cache status
- Each email gets two cache files: email_<id>.json and summary_<id>.json
Outlook Email → Cache (Stage 2)
→ Summarize (Stage 3)
→ Deduplicate by Company+Role
→ Sync to Notion (Stage 4)
→ Track Email IDs in Notion
This project is provided as-is for personal use.