octrow/stackpulse

stackpulse

Collects job postings across Europe, saves them as JSON, and analyzes the skill landscape to answer: "what do employers actually want in 2026?"

Default path uses a fast HTTP guest scraper (scrape_fast.py). Optional browser mode uses Patchright (patchright_shim.py) plus a custom job-page scraper that fixes the library's broken content extraction — use it when you need fields like applicant_count. Built to be source-agnostic — LinkedIn is just the first feed.


Project layout

stackpulse/
├── config.py               # search queries, delays, timeouts, paths, and LLM settings
├── cli.py                  # Typer + Rich CLI (stackpulse command)
├── setup_session.py        # one-time LinkedIn login → session file
├── scrape.py               # main scraper orchestrator + per-query/per-job helpers
├── patchright_shim.py      # map playwright.async_api → patchright (linkedin-scraper compatibility)
├── job_scraper_direct.py   # custom browser scraper (replaces broken library JobScraper)
├── analyze.py              # analysis orchestration/reporting + LLM extraction pipeline
├── analysis_db.py          # DB schema/migrations + shared canonical_linkedin_job_key helper
├── analysis_candidates.py  # candidate queue and promotion workflows
├── analysis_llm_cache.py   # LLM cache read/write helpers
├── pyproject.toml          # packaging + console script entrypoint
├── stackpulse              # repo launcher (auto bootstrap + CLI dispatch)
├── install.sh              # one-time symlink installer to ~/.local/bin/stackpulse
├── requirements.txt
├── .env                    # LinkedIn credentials (gitignored)
├── .env.example
└── data/
    ├── jobs_YYYY-MM-DD.json        # one file per scrape day
    ├── jobs_*_analysis.xlsx        # Excel export with one column per skill category
    ├── skills.db                   # SQLite: skills catalog + LLM cache + scrape dedupe ledger
    ├── scraper.log                 # timestamped run log
    └── debug/                      # screenshots + HTML dumped when a page fails to load

Setup

Recommended (repo launcher + one-time install):

cp .env.example .env
# fill in LINKEDIN_EMAIL and LINKEDIN_PASSWORD

./install.sh
stackpulse --help

What this does:

  • install.sh creates ~/.local/bin/stackpulse symlink to this repo launcher
  • launcher follows symlinks to resolve the real repo path
  • launcher auto-creates .venv on first run
  • launcher auto-installs dependencies when missing
  • launcher auto-installs Patchright Chromium when missing

Optional packaging-based install (alternative):

pip install -e .

Patchright has no patchright shell command. Install the Chromium build with the same Python that has the dependencies (e.g. your venv):

python3 -m patchright install chromium
# or, from this repo after ./stackpulse created .venv:
.venv/bin/python -m patchright install chromium

If stackpulse is still not found, add this to your shell profile (~/.bashrc or ~/.zshrc):

export PATH="$HOME/.local/bin:$PATH"

Then reload shell config (source ~/.bashrc or source ~/.zshrc).


CLI (Typer + Rich)

Long-running steps show a floral status spinner (Unicode frames) plus a whimsical verb paired with the real task (venv install, session setup, scrape, or analyze --no-verbose). The task name uses cyan so it stays readable next to the verb. To use Rich’s classic dot spinner instead, set STACKPULSE_STATUS_SPINNER=dots in the environment.

Running stackpulse with no arguments launches an interactive wizard that prompts for command and options:

StackPulse — what would you like to do?

  1  analyze        Analyze scraped jobs and export stats
  2  scrape         Scrape LinkedIn for new jobs
  3  auto           Bootstrap + scrape + analyze end-to-end
  4  setup-session  Create or refresh LinkedIn session
  5  review-skills  Accept or reject LLM skill candidates (queue)
  6  quit

Choice [1-6]:

All subcommands still work non-interactively via flags:

stackpulse --help
stackpulse setup-session
stackpulse scrape --limit 3
stackpulse analyze --all --llm
stackpulse analyze --all --title-contains "Backend" --location-contains "Berlin"
stackpulse analyze --candidates
stackpulse review-skills

Auto workflow (one command)

stackpulse auto
stackpulse auto --limit 3 --all
stackpulse auto --llm --promote 2

stackpulse auto behavior:

  • Creates virtualenv if missing (.venv by default)
  • Installs dependencies only when missing
  • Installs Patchright Chromium only when missing
  • Skips session setup if session.json already exists
  • Fails fast on the first failed step and prints a Rich summary table

You can still run legacy script entrypoints (py setup_session.py, py scrape.py, py analyze.py).

Shell completion

Requires a working stackpulse command in $PATH (via ./install.sh symlink setup or pip install -e .).

stackpulse --install-completion   # install tab completion for your current shell
stackpulse --show-completion      # print the completion script (for manual setup)

Restart your shell (or source ~/.zshrc / ~/.bashrc) after installing.


Workflow

1. Create a session (once, or when session expires)

py setup_session.py
# or
stackpulse setup-session
  • If credentials are in .env, logs in programmatically
  • Otherwise opens a browser window for manual login
  • Saves cookies/storage to SESSION_FILE from config.py (default session.json)
  • Re-run whenever LinkedIn shows you a login page again
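
A cheap pre-flight check before browser-mode scraping can be sketched from the points above. This helper is not part of the repo, and the 14-day lifetime is a guess; the authoritative signal remains LinkedIn serving a login page:

```python
import time
from pathlib import Path

def session_needs_refresh(session_file: Path, max_age_days: float = 14) -> bool:
    """Heuristic: treat a missing or old session file as stale.
    The real signal is LinkedIn serving a login page, at which point
    setup-session must be re-run regardless of file age."""
    if not session_file.exists():
        return True
    age_days = (time.time() - session_file.stat().st_mtime) / 86400
    return age_days > max_age_days
```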

2. Scrape jobs

py scrape.py              # full run — all queries in config.py
py scrape.py --limit 3    # quick test, 3 jobs per query
py scrape.py --fresh      # ignore all previous results, re-scrape everything

# Typer CLI equivalents (default mode is fast HTTP — no browser)
stackpulse scrape
stackpulse scrape --limit 3
stackpulse scrape --fresh
stackpulse scrape --mode browser   # optional: Patchright + session.json (e.g. applicant_count)

# Legacy scripts: scrape.py = browser only; scrape_fast.py = HTTP only
py scrape_fast.py --limit 3

Scrape modes: stackpulse scrape defaults to --mode fast (guest HTTP, fastest, no session.json). Use --mode browser when you need logged-in fields such as applicant_count. The interactive wizard asks [fast/browser] with fast as the default.

Resume is automatic — on every run the scraper loads previously scraped job keys from data/skills.db (table scraped_jobs) and skips them. Existing historical data/jobs_*.json files are backfilled once into the DB when the ledger is empty. No flag needed.

Query-level resume — if you stop mid-run (Ctrl+C), the next scrape continues from the same search query index (order of SEARCH_QUERIES in config.py), not from the beginning: fast mode writes data/scrape_resume_fast.json, browser mode writes data/scrape_resume_browser.json. Delete that file (or use --fresh) to start again from query 1.

Ctrl+C exits cleanly — progress is saved after every job. Re-run at any time to pick up where you left off.
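
The query-level resume bookkeeping described above can be sketched as follows (file name and JSON shape are assumptions based on this README; scrape.py may persist more fields):

```python
import json
from pathlib import Path

RESUME_FILE = Path("data/scrape_resume_fast.json")  # fast mode; browser mode uses its own file

def load_resume_index() -> int:
    """Return the SEARCH_QUERIES index to continue from (0 on a fresh run)."""
    if RESUME_FILE.exists():
        return json.loads(RESUME_FILE.read_text()).get("query_index", 0)
    return 0

def save_resume_index(query_index: int) -> None:
    """Persist progress after finishing a query; --fresh simply deletes the file."""
    RESUME_FILE.parent.mkdir(parents=True, exist_ok=True)
    RESUME_FILE.write_text(json.dumps({"query_index": query_index}))
```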

Output is saved incrementally after each job to data/jobs_YYYY-MM-DD.json.

Deduplication key is canonicalized via shared analysis_db.canonical_linkedin_job_key (LinkedIn job ID when available; otherwise normalized URL path), and persisted in skills.db so the same posting is skipped across days even if URL query params differ.
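
The canonicalization can be sketched like this; the real helper lives in analysis_db.canonical_linkedin_job_key and may differ in detail (the regex and key prefixes here are assumptions):

```python
import re
from urllib.parse import urlparse

def canonical_linkedin_job_key(url: str) -> str:
    """Sketch of the dedupe key: prefer the numeric LinkedIn job ID,
    otherwise fall back to the normalized URL path (query params dropped)."""
    path = urlparse(url).path.rstrip("/")
    match = re.search(r"/jobs/view/(?:[^/]*-)?(\d+)", path)
    if match:
        return f"job:{match.group(1)}"
    return f"path:{path.lower()}"

# Same posting with different query params collapses to one key:
a = canonical_linkedin_job_key("https://www.linkedin.com/jobs/view/3948571234/?refId=abc")
b = canonical_linkedin_job_key("https://www.linkedin.com/jobs/view/3948571234?trk=search")
```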

3. Analyze skills

py analyze.py                              # analyze today's file
py analyze.py --file data/jobs_2026-04-01.json
py analyze.py --all                        # merge all collected files
py analyze.py --llm                        # + open LLM extraction (free, via 9router)
py analyze.py --all --llm

# Typer CLI equivalents
stackpulse analyze
stackpulse analyze --file data/jobs_2026-04-01.json
stackpulse analyze --all
stackpulse analyze --llm          # note: double-dash required; -llm and "analyze llm" are invalid
stackpulse analyze --all --llm

# Cohort filters (case-insensitive substring match)
stackpulse analyze --all --title-contains "Backend"
stackpulse analyze --all --location-contains "Berlin"
stackpulse analyze --all --llm --title-contains "Senior" --location-contains "Germany"

Prints a skill frequency table to stdout and saves data/jobs_*_analysis.xlsx with one column per skill category.

Report sections:

Section Description
Extraction quality Jobs with empty description and jobs with zero skills extracted (%)
Top skills Frequency + prevalence % + bar; uses merged regex+LLM metric when --llm was run
By category Top terms per skill category with prevalence % (regex + LLM unified)
Top locations Most frequent scraped location values
Skills by location Top 3 skills per search_location (only shown when >1 search location present)
Salary hints Postings where a salary pattern was regex-extracted (with company + location)
Coverage gaps Skills discovered by LLM but not yet in catalog (only with --llm)

--llm mode calls NINEROUTER_MODEL through your local 9router endpoint (NINEROUTER_BASE_URL, default http://localhost:20128/v1) with a skills-aware prompt — the full skills catalog is sent to the LLM so it matches against known terms first and only flags genuinely new discoveries. Results are cached in data/skills.db — repeat runs are instant with no API calls.

Regex vs LLM “skill counts”: the activity log line “N skill row(s) to store” counts LLM JSON terms written to llm_results (matched catalog terms plus new_terms), not regex taxonomy hits. A typical posting can show many regex matches across the catalog (~8–12+) but only a few LLM rows — that is normal. To catch pathological under-extraction (e.g. a model that always returns almost nothing), analyze.py tracks a rolling window of the last 5 jobs: if the combined stored row count falls below LLM_LOW_SIGNAL_WARN_BELOW_SUM (default 24, half of LLM_LOW_SIGNAL_REFERENCE_SUM 48), it logs a one-shot warning per low episode in the Live panel and in data/analysis_activity.log (the episode resets once the rolling sum climbs back to the threshold or above).
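
The rolling-window health check can be sketched like this (class and method names are illustrative, not analyze.py's actual code):

```python
from collections import deque

# Assumed constants mirroring the config.py defaults quoted above
LLM_LOW_SIGNAL_WINDOW_JOBS = 5
LLM_LOW_SIGNAL_WARN_BELOW_SUM = 24

class LowSignalMonitor:
    """Warn once per 'low episode' when the rolling sum of stored LLM
    rows over the last N jobs drops below the threshold."""

    def __init__(self):
        self.window = deque(maxlen=LLM_LOW_SIGNAL_WINDOW_JOBS)
        self.warned = False

    def record(self, stored_rows: int) -> bool:
        """Return True exactly when a new warning should be emitted."""
        self.window.append(stored_rows)
        low = (len(self.window) == self.window.maxlen
               and sum(self.window) < LLM_LOW_SIGNAL_WARN_BELOW_SUM)
        if low and not self.warned:
            self.warned = True
            return True
        if not low:
            self.warned = False  # episode resets when the signal recovers
        return False
```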

After each --llm run, newly discovered terms (seen in ≥ LLM_CANDIDATE_THRESHOLD jobs, default 2) are automatically queued in skill_candidates. Because the LLM is skills-aware, uncovered terms are genuinely new technologies/tools — not synonyms or generic concepts.

analyze.py now splits entrypoint orchestration into _build_parser(), _handle_mode_only_paths(), and _load_run_context() to keep mode routing, path resolution, and runtime setup easier to maintain.

--llm prints two gap metrics: raw uncovered terms and actionable uncovered terms. Actionable terms satisfy jobs_count >= threshold, are not in SKIP_TERMS, and are not already present in skill_candidates.

Rate-limit handling (429): analyze.py parses retry wait from the provider error. If wait ≤ LLM_RATE_LIMIT_MAX_WAIT_SECONDS (default 30), it sleeps and retries once. For longer waits (daily quota exhausted), it falls back to NINEROUTER_FALLBACK_MODEL if configured.
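
A sketch of that decision logic (the retry-wait parsing regex and return strings are illustrative assumptions, not analyze.py's actual code):

```python
import re

# Assumed defaults from config.py as described above
LLM_RATE_LIMIT_MAX_WAIT_SECONDS = 30
RETRY_AFTER_BUFFER_SECONDS = 2

def plan_429_recovery(error_message: str) -> str:
    """Short waits -> sleep (plus buffer) and retry once;
    long or unparseable waits -> switch to the fallback model."""
    match = re.search(r"retry.*?(\d+(?:\.\d+)?)\s*s", error_message, re.IGNORECASE)
    wait = float(match.group(1)) if match else None
    if wait is not None and wait <= LLM_RATE_LIMIT_MAX_WAIT_SECONDS:
        return f"sleep {wait + RETRY_AFTER_BUFFER_SECONDS:g}s then retry"
    return "use NINEROUTER_FALLBACK_MODEL if configured"
```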

4. Promote LLM-discovered skills into catalog

py analyze.py --candidates                 # inspect the promotion queue (all statuses + pending)
py analyze.py --promote                    # promote pending terms (≥2 jobs) into skills catalog
py analyze.py --promote 3                  # same, threshold = 3 jobs
py analyze.py --all --promote              # promote first, then analyze with enriched skills

# Typer CLI equivalents
stackpulse analyze --candidates
stackpulse analyze --promote 2
stackpulse analyze --promote 3
stackpulse analyze --all --promote 2

Interactive review (Rich + prompts): after --llm runs have filled the queue, use stackpulse review-skills (or wizard option 5) to approve or reject each pending term, or bulk-promote by minimum job count — no SQL required.

Once promoted, terms are matched by regex in all future runs — no --llm flag needed.

To reject a term so it never reappears:

sqlite3 data/skills.db "UPDATE skill_candidates SET status='rejected' WHERE term='<term>'"

Configuration (config.py)

Core scrape settings

Variable Default Description
SEARCH_QUERIES 11 queries List of (keywords, location) tuples
JOBS_PER_QUERY 25 Max jobs fetched per query
DELAY_BETWEEN_JOBS 3 Pause (seconds) between individual job page scrapes
DELAY_BETWEEN_QUERIES 5 Pause (seconds) between search queries
OUTPUT_DIR "data" Directory for JSON output, logs, DB, and debug dumps
SESSION_FILE "session.json" Saved browser session

Scraper timeouts / extraction behavior

Variable Default Description
PAGE_LOAD_TIMEOUT_MS 60000 Timeout for page.goto(..., wait_until="domcontentloaded")
H1_WAIT_TIMEOUT_MS 5000 Timeout waiting for job title <h1>
BUTTON_CLICK_TIMEOUT_MS 3000 Timeout for clicking expand buttons
POST_CLICK_SETTLE_SECONDS 0.5 Sleep after each expand click
POST_EXPAND_SETTLE_SECONDS 1.0 Sleep before extraction starts
DEBUG_HTML_SNIPPET_CHARS 8000 Max HTML chars written to debug file
DESCRIPTION_MIN_CHARS 100 Minimum length for accepted job description

LLM / analysis settings

Variable Default Description
DB_FILENAME "skills.db" SQLite filename inside OUTPUT_DIR
NINEROUTER_BASE_URL "http://localhost:20128/v1" OpenAI-compatible 9router endpoint
NINEROUTER_MODEL "9router-combo" Primary model (use a concrete provider only if wired in 9router)
NINEROUTER_FALLBACK_MODEL "" Optional fallback when primary fails (e.g. after 429)
NINEROUTER_API_KEY "local" API key passed to the OpenAI-compatible 9router client
LLM_MAX_INPUT_CHARS 8000 Max characters sent from each posting to LLM
LLM_MAX_OUTPUT_TOKENS 1000 LLM completion token cap
LLM_RESPONSE_FORMAT_JSON_OBJECT True Request JSON-object mode; retries once without it if the proxy rejects the parameter
LLM_RATE_LIMIT_MAX_WAIT_SECONDS 30 Max retry sleep for 429 before fallback
LLM_LOW_SIGNAL_WINDOW_JOBS 5 Rolling window size for LLM stored-row health check
LLM_LOW_SIGNAL_REFERENCE_SUM 48 Reference total LLM rows for that window (~regex baseline)
LLM_LOW_SIGNAL_WARN_BELOW_SUM 24 Warn when rolling sum is below this (50% of reference by default)
RETRY_AFTER_BUFFER_SECONDS 2 Safety buffer added to parsed retry-after
LLM_CANDIDATE_THRESHOLD 2 Min job occurrences to promote a candidate term

Current search targets: Berlin, Hamburg, Munich, Germany (general), Vienna, Amsterdam, Luxembourg, Barcelona, Madrid, London, Remote — all for senior Python/FastAPI backend roles.

Recommended minimal config profile

Use these as a practical baseline in config.py:

# Fast local test profile (quick feedback)
JOBS_PER_QUERY = 3
DELAY_BETWEEN_JOBS = 1
DELAY_BETWEEN_QUERIES = 2
PAGE_LOAD_TIMEOUT_MS = 45_000
H1_WAIT_TIMEOUT_MS = 5_000
LLM_RATE_LIMIT_MAX_WAIT_SECONDS = 15
# Stable full-run profile (fewer limits/blocks)
JOBS_PER_QUERY = 25
DELAY_BETWEEN_JOBS = 3
DELAY_BETWEEN_QUERIES = 5
PAGE_LOAD_TIMEOUT_MS = 60_000
H1_WAIT_TIMEOUT_MS = 15_000
LLM_RATE_LIMIT_MAX_WAIT_SECONDS = 30
# Optional second route if primary exhausts: NINEROUTER_FALLBACK_MODEL = "groq/llama-3.3-70b-versatile"

Tip: use the fast profile for selector/debug iteration, then switch back to the stable profile for production collection runs.


Collected fields

Field Source
linkedin_url Job URL
job_title Page <h1>
company Company link near title
company_linkedin_url Company /company/ href
location Location text in top card
posted_date "X days ago" text
applicant_count "N applicants" text
job_description Full description text
salary_extracted Regex over description (best-effort)
search_keywords Which query found this job
search_location Which location was searched
scraped_date ISO date of the scrape

LinkedIn does not expose salary as a structured field. salary_extracted is regex-based and best effort.


Skills catalog (data/skills.db)

17 categories, 227+ terms, stored in data/skills.db (SQLite). The DB is auto-created and seeded from SKILLS_SEED in analyze.py on first run. Terms are stored as plain lowercase text; regex escaping is applied at load time only. Additional terms accumulate automatically via the LLM promotion pipeline. The same DB also stores scraped_jobs, used by scrape.py as the persistent dedupe ledger across daily JSON outputs.

Category Examples
Languages python, go, rust, java, kotlin, typescript
Python Frameworks fastapi, django, flask, aiohttp, starlette
Python Libraries sqlalchemy, pydantic, celery, asyncpg, boto3
Databases — Relational postgresql, mysql, cockroachdb, aurora
Databases — NoSQL/Search mongodb, redis, elasticsearch, cassandra, dynamodb
Databases — Analytical clickhouse, bigquery, snowflake, dbt
Cloud aws, gcp, azure, lambda, s3, step functions, bedrock
Containers & Orchestration kubernetes, docker, helm, argo, istio
IaC & CI/CD terraform, pulumi, github actions, gitlab ci, argocd
Messaging & Streaming kafka, rabbitmq, sqs, kinesis, nats
API & Architecture rest, graphql, grpc, microservices, solid, cqrs, ddd
Auth & Security oauth2, jwt, keycloak, auth0, vault
Monitoring & Observability prometheus, grafana, datadog, opentelemetry
Testing pytest, tdd, testcontainers, hypothesis, coverage.py
AI / ML (in JD) ai, llm, langchain, pgvector, rag, generative ai, cursor
Soft / Process agile, mentoring, tech lead, staff engineer
Languages (non-technical) english, german, french, dutch
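
The load-time regex escaping mentioned above can be sketched as follows (helper name assumed). Lookarounds are used instead of \b because terms like c++ and node.js begin or end with non-word characters, where \b misbehaves:

```python
import re

def compile_skill_patterns(terms):
    """Terms are stored as plain lowercase text; escape them only when
    building matchers, with word-ish boundaries so 'go' != 'django'."""
    return {
        t: re.compile(r"(?<!\w)" + re.escape(t) + r"(?!\w)", re.IGNORECASE)
        for t in terms
    }

patterns = compile_skill_patterns(["c++", "go", "node.js"])
text = "We use Go and Node.js; C++ is a plus. Django experience welcome."
found = [t for t, p in patterns.items() if p.search(text)]
```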

Add term without code change:

sqlite3 data/skills.db "INSERT OR IGNORE INTO skills(category_id,term) SELECT id,'hetzner' FROM categories WHERE name='Cloud'"

Add alias (e.g. multilingual synonym):

sqlite3 data/skills.db \
  "INSERT INTO skill_aliases(skill_id,alias,canonical,lang,alias_type)
   SELECT id,'Node.js','node.js','en','variant' FROM skills WHERE term='node.js'"

Top LLM-discovered skills across all jobs:

sqlite3 data/skills.db "SELECT skill, COUNT(DISTINCT url_key) jobs FROM llm_results GROUP BY skill ORDER BY jobs DESC LIMIT 20"

Promotion queue view:

sqlite3 data/skills.db "SELECT sc.term, c.name AS category, sc.jobs_count, sc.status FROM skill_candidates sc JOIN categories c ON c.id=sc.category_id ORDER BY sc.jobs_count DESC"

Known issues & limitations

  • LinkedIn SPA timing: library JobScraper is unreliable for React-rendered content. job_scraper_direct.py waits for <h1> and uses selector fallbacks.
  • No structured salary data: salary extraction is regex over free text.
  • Session expiry: LinkedIn sessions expire; rerun setup_session.py.
  • LinkedIn UI drift: CSS selectors can break due to A/B tests. If fields become null, inspect data/debug/ snapshots and update selector lists.
  • LLM quota limits: long quota windows skip sleep-retry and use fallback model if configured.
  • LLM invalid JSON: some models occasionally emit prose inside JSON arrays. The prompt forbids this, response_format JSON mode is requested when supported, and the analyzer retries once with a stricter instruction. If logs still show LLM output was not valid JSON, set NINEROUTER_FALLBACK_MODEL; if your OpenAI-compatible proxy errors on the response_format parameter, set LLM_RESPONSE_FORMAT_JSON_OBJECT = False (the client also retries without JSON mode when the error looks like an unsupported parameter).
  • Single-letter language matching: c is matched as \bc\b, which can false-positive on a stray letter "c" in prose (e.g. "appendix c"). LLM extraction correctly disambiguates the C language from ordinary text.
  • Best-effort extraction paths: scraper field extractors fail soft by design and continue fallback traversal; debug logs now include selector/button context to make drift diagnosis faster.
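
The \bc\b false positive from the list above can be demonstrated directly (the sample strings are illustrative):

```python
import re

# The catalog stores the language "c" as a bare term, matched with word boundaries
c_pattern = re.compile(r"\bc\b", re.IGNORECASE)

hits = [s for s in [
    "experience with C and Rust",   # true positive
    "see appendix c for details",   # false positive on a stray letter
    "experience with Scala",        # no match: 'c' sits inside a word
] if c_pattern.search(s)]
```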

About

LinkedIn scraper + skill analyzer for Python backend roles. Playwright extraction, SQLite taxonomy, optional 9router LLM candidate pipeline.
