Collects job postings across Europe, saves them as JSON, and analyzes the skill landscape to answer: "what do employers actually want in 2026?"

The default path uses a fast HTTP guest scraper (`scrape_fast.py`). Optional browser mode uses Patchright (`patchright_shim.py`) plus a custom job-page scraper that fixes the library's broken content extraction — use it when you need fields like `applicant_count`. Built to be source-agnostic — LinkedIn is just the first feed.
```
stackpulse/
├── config.py               # search queries, delays, timeouts, paths, and LLM settings
├── cli.py                  # Typer + Rich CLI (stackpulse command)
├── setup_session.py        # one-time LinkedIn login → session file
├── scrape.py               # main scraper orchestrator + per-query/per-job helpers
├── patchright_shim.py      # maps playwright.async_api → patchright (linkedin-scraper compatibility)
├── job_scraper_direct.py   # custom browser scraper (replaces the broken library JobScraper)
├── analyze.py              # analysis orchestration/reporting + LLM extraction pipeline
├── analysis_db.py          # DB schema/migrations + shared canonical_linkedin_job_key helper
├── analysis_candidates.py  # candidate queue and promotion workflows
├── analysis_llm_cache.py   # LLM cache read/write helpers
├── pyproject.toml          # packaging + console script entrypoint
├── stackpulse              # repo launcher (auto bootstrap + CLI dispatch)
├── install.sh              # one-time symlink installer to ~/.local/bin/stackpulse
├── requirements.txt
├── .env                    # LinkedIn credentials (gitignored)
├── .env.example
└── data/
    ├── jobs_YYYY-MM-DD.json   # one file per scrape day
    ├── jobs_*_analysis.xlsx   # Excel export with one column per skill category
    ├── skills.db              # SQLite: skills catalog + LLM cache + scrape dedupe ledger
    ├── scraper.log            # timestamped run log
    └── debug/                 # screenshots + HTML dumped when a page fails to load
```
Recommended (repo launcher + one-time install):

```
cp .env.example .env
# fill in LINKEDIN_EMAIL and LINKEDIN_PASSWORD
./install.sh
stackpulse --help
```

What this does:
- `install.sh` creates a `~/.local/bin/stackpulse` symlink to the repo launcher
- the launcher follows symlinks to resolve the real repo path
- the launcher auto-creates `.venv` on first run
- the launcher auto-installs dependencies when missing
- the launcher auto-installs Patchright Chromium when missing
Optional packaging-based install (alternative):

```
pip install -e .
```

Patchright has no `patchright` shell command. Install the Chromium build with the same Python that has the dependencies (e.g. your venv):

```
python3 -m patchright install chromium
# or, from this repo after ./stackpulse created .venv:
.venv/bin/python -m patchright install chromium
```

If `stackpulse` is still not found, add this to your shell profile (`~/.bashrc` or `~/.zshrc`):

```
export PATH="$HOME/.local/bin:$PATH"
```

Then reload your shell config (`source ~/.bashrc` or `source ~/.zshrc`).
Long-running steps show a floral status spinner (Unicode frames) plus a whimsical verb paired with the real task (venv install, session setup, scrape, or `analyze --no-verbose`). The task name uses cyan so it stays readable next to the verb. To use Rich's classic dot spinner instead, set `STACKPULSE_STATUS_SPINNER=dots` in the environment.
Running `stackpulse` with no arguments launches an interactive wizard that prompts for a command and its options:

```
StackPulse — what would you like to do?
1  analyze        Analyze scraped jobs and export stats
2  scrape         Scrape LinkedIn for new jobs
3  auto           Bootstrap + scrape + analyze end-to-end
4  setup-session  Create or refresh LinkedIn session
5  review-skills  Accept or reject LLM skill candidates (queue)
6  quit
Choice [1-6]:
```
All subcommands still work non-interactively via flags:

```
stackpulse --help
stackpulse setup-session
stackpulse scrape --limit 3
stackpulse analyze --all --llm
stackpulse analyze --all --title-contains "Backend" --location-contains "Berlin"
stackpulse analyze --candidates
stackpulse review-skills
stackpulse auto
stackpulse auto --limit 3 --all
stackpulse auto --llm --promote 2
```

`stackpulse auto` behavior:
- Creates the virtualenv if missing (`.venv` by default)
- Installs dependencies only when missing
- Installs Patchright Chromium only when missing
- Skips session setup if `session.json` already exists
- Fails fast on the first failed step and prints a Rich summary table
You can still run the legacy script entrypoints (`py setup_session.py`, `py scrape.py`, `py analyze.py`).

Tab completion requires a working `stackpulse` command in `$PATH` (via the `./install.sh` symlink setup or `pip install -e .`):

```
stackpulse --install-completion  # install tab completion for your current shell
stackpulse --show-completion     # print the completion script (for manual setup)
```

Restart your shell (or `source ~/.zshrc` / `source ~/.bashrc`) after installing.
```
py setup_session.py
# or
stackpulse setup-session
```

- If credentials are in `.env`, logs in programmatically
- Otherwise opens a browser window for manual login
- Saves cookies/storage to `SESSION_FILE` from `config.py` (default `session.json`)
- Re-run whenever LinkedIn shows you a login page again
```
py scrape.py           # full run — all queries in config.py
py scrape.py --limit 3 # quick test, 3 jobs per query
py scrape.py --fresh   # ignore all previous results, re-scrape everything

# Typer CLI equivalents (default mode is fast HTTP — no browser)
stackpulse scrape
stackpulse scrape --limit 3
stackpulse scrape --fresh
stackpulse scrape --mode browser # optional: Patchright + session.json (e.g. applicant_count)

# Legacy scripts: scrape.py = browser only; scrape_fast.py = HTTP only
py scrape_fast.py --limit 3
```

Scrape modes: `stackpulse scrape` defaults to `--mode fast` (guest HTTP, fastest, no `session.json`). Use `--mode browser` when you need logged-in fields such as `applicant_count`. The interactive wizard asks `[fast/browser]` with `fast` as the default.
Resume is automatic — on every run the scraper loads previously scraped job keys from `data/skills.db` (table `scraped_jobs`) and skips them. Existing historical `data/jobs_*.json` files are backfilled into the DB once, when the ledger is empty. No flag needed.
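The ledger idea can be sketched with a tiny SQLite table — a sketch only, assuming a single-key schema; the real `scraped_jobs` table in `analysis_db.py` may carry more columns:

```python
import sqlite3

# Illustrative dedupe ledger; the real one lives in data/skills.db.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE IF NOT EXISTS scraped_jobs (job_key TEXT PRIMARY KEY)")

def already_scraped(job_key: str) -> bool:
    row = con.execute(
        "SELECT 1 FROM scraped_jobs WHERE job_key = ?", (job_key,)
    ).fetchone()
    return row is not None

def mark_scraped(job_key: str) -> None:
    # INSERT OR IGNORE makes re-marking after a Ctrl+C restart harmless
    con.execute("INSERT OR IGNORE INTO scraped_jobs(job_key) VALUES (?)", (job_key,))
```

The primary-key constraint is what makes the skip check cheap: membership is one indexed lookup per job.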
Query-level resume — if you stop mid-run (Ctrl+C), the next scrape continues from the same search query index
(order of SEARCH_QUERIES in config.py), not from the beginning: fast mode writes data/scrape_resume_fast.json,
browser mode writes data/scrape_resume_browser.json. Delete that file (or use --fresh) to start again from query 1.
Ctrl+C exits cleanly — progress is saved after every job. Re-run at any time to pick up where you left off.
Output is saved incrementally after each job to data/jobs_YYYY-MM-DD.json.
The deduplication key is canonicalized via the shared `analysis_db.canonical_linkedin_job_key` helper (LinkedIn job ID when available; otherwise the normalized URL path) and persisted in `skills.db`, so the same posting is skipped across days even if URL query params differ.
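As a rough sketch of that canonicalization — the URL pattern below is an assumption, and the real `canonical_linkedin_job_key` implementation may differ:

```python
import re
from urllib.parse import urlparse

def canonical_job_key(url: str) -> str:
    """Prefer the numeric LinkedIn job ID; fall back to the normalized path."""
    path = urlparse(url).path.rstrip("/")
    m = re.search(r"/jobs/view/(?:[^/]*-)?(\d+)$", path)
    if m:
        return m.group(1)   # the job ID survives query-param changes
    return path.lower()     # source-agnostic fallback for non-LinkedIn feeds
```

Dropping the query string is what makes `?refId=...` variants of the same posting collapse onto one key.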
```
py analyze.py                              # analyze today's file
py analyze.py --file data/jobs_2026-04-01.json
py analyze.py --all                        # merge all collected files
py analyze.py --llm                        # + open LLM extraction (free, via 9router)
py analyze.py --all --llm

# Typer CLI equivalents
stackpulse analyze
stackpulse analyze --file data/jobs_2026-04-01.json
stackpulse analyze --all
stackpulse analyze --llm                   # note: double-dash required; -llm and "analyze llm" are invalid
stackpulse analyze --all --llm

# Cohort filters (case-insensitive substring match)
stackpulse analyze --all --title-contains "Backend"
stackpulse analyze --all --location-contains "Berlin"
stackpulse analyze --all --llm --title-contains "Senior" --location-contains "Germany"
```

Prints a skill frequency table to stdout and saves `data/jobs_*_analysis.xlsx` with one column per skill category.
Report sections:
| Section | Description |
|---|---|
| Extraction quality | Jobs with empty description and jobs with zero skills extracted (%) |
| Top skills | Frequency + prevalence % + bar; uses merged regex+LLM metric when --llm was run |
| By category | Top terms per skill category with prevalence % (regex + LLM unified) |
| Top locations | Most frequent scraped location values |
| Skills by location | Top 3 skills per search_location (only shown when >1 search location present) |
| Salary hints | Postings where a salary pattern was regex-extracted (with company + location) |
| Coverage gaps | Skills discovered by LLM but not yet in catalog (only with --llm) |
`--llm` mode calls `NINEROUTER_MODEL` through your local 9router endpoint (`NINEROUTER_BASE_URL`, default `http://localhost:20128/v1`) with a skills-aware prompt — the full skills catalog is sent to the LLM so it matches against known terms first and only flags genuinely new discoveries. Results are cached in `data/skills.db` — repeat runs are instant, with no API calls.
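The caching pattern is roughly this — a sketch only; the table layout and key derivation in `analysis_llm_cache.py` are assumptions here:

```python
import hashlib
import json
import sqlite3

# Illustrative cache table; the real one lives in data/skills.db.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE IF NOT EXISTS llm_cache (key TEXT PRIMARY KEY, response TEXT)")

def cached_call(model: str, prompt: str, call_fn):
    """Return a cached LLM response, calling call_fn only on a miss."""
    key = hashlib.sha256(f"{model}\n{prompt}".encode()).hexdigest()
    row = con.execute("SELECT response FROM llm_cache WHERE key = ?", (key,)).fetchone()
    if row:
        return json.loads(row[0])   # cache hit: no API call
    result = call_fn(prompt)        # cache miss: real LLM round-trip
    con.execute("INSERT INTO llm_cache VALUES (?, ?)", (key, json.dumps(result)))
    return result
```

Keying on model plus prompt means switching `NINEROUTER_MODEL` naturally invalidates old entries without deleting anything.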
Regex vs LLM “skill counts”: the activity-log line `N skill row(s) to store` counts LLM JSON terms written to `llm_results` (matched catalog terms plus `new_terms`), not regex taxonomy hits. A typical posting can show many regex matches across the catalog (~8–12+) but only a few LLM rows — that is normal. To catch pathological under-extraction (e.g. the model always returning almost nothing), `analyze.py` tracks a rolling window of the last 5 jobs: if the combined stored row count falls below `LLM_LOW_SIGNAL_WARN_BELOW_SUM` (default 24, half of `LLM_LOW_SIGNAL_REFERENCE_SUM` = 48), it logs a one-shot warning per low episode in the Live panel and in `data/analysis_activity.log` (the episode resets when the rolling sum rises back to the threshold or above).
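A minimal sketch of that one-shot rolling-window check — the constant names mirror the config variables, but the episode logic is a simplified reading of the description, not the actual code:

```python
from collections import deque

LLM_LOW_SIGNAL_WINDOW_JOBS = 5
LLM_LOW_SIGNAL_WARN_BELOW_SUM = 24

def make_low_signal_monitor():
    window = deque(maxlen=LLM_LOW_SIGNAL_WINDOW_JOBS)
    state = {"in_low_episode": False}

    def record(stored_rows: int) -> bool:
        """Return True exactly once per low episode (the one-shot warning)."""
        window.append(stored_rows)
        total = sum(window)
        if len(window) == LLM_LOW_SIGNAL_WINDOW_JOBS and total < LLM_LOW_SIGNAL_WARN_BELOW_SUM:
            if not state["in_low_episode"]:
                state["in_low_episode"] = True
                return True   # warn now; stay silent for the rest of the episode
        elif total >= LLM_LOW_SIGNAL_WARN_BELOW_SUM:
            state["in_low_episode"] = False   # episode resets
        return False

    return record
```

Summing over a `deque(maxlen=...)` keeps the check O(window) per job with no explicit eviction logic.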
After each `--llm` run, newly discovered terms (seen in ≥ `LLM_CANDIDATE_THRESHOLD` jobs, default 2) are automatically queued in `skill_candidates`. Because the LLM is skills-aware, uncovered terms are genuinely new technologies/tools — not synonyms or generic concepts.
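Conceptually, the queueing step counts distinct jobs per discovered term — a sketch, where `queue_candidates` is an illustrative name rather than the real function:

```python
from collections import Counter

LLM_CANDIDATE_THRESHOLD = 2   # min distinct jobs for a term to be queued

def queue_candidates(new_terms_per_job):
    """new_terms_per_job: one list of discovered terms per analyzed job."""
    counts = Counter()
    for terms in new_terms_per_job:
        counts.update(set(terms))   # each job counts a term at most once
    return sorted(term for term, n in counts.items() if n >= LLM_CANDIDATE_THRESHOLD)
```

Deduplicating within each job (`set(terms)`) is what makes the threshold mean "seen in N jobs", not "mentioned N times".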
`analyze.py` now splits entrypoint orchestration into `_build_parser()`, `_handle_mode_only_paths()`, and `_load_run_context()` to keep mode routing, path resolution, and runtime setup easier to maintain.

`--llm` prints two gap metrics: raw uncovered terms and actionable uncovered terms. Actionable terms satisfy `jobs_count >= threshold`, are not in `SKIP_TERMS`, and are not already present in `skill_candidates`.
Rate-limit handling (429): `analyze.py` parses the retry wait from the provider error. If the wait is ≤ `LLM_RATE_LIMIT_MAX_WAIT_SECONDS` (default 30), it sleeps and retries once. For longer waits (daily quota exhausted), it falls back to `NINEROUTER_FALLBACK_MODEL` if configured.
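A hedged sketch of that decision — the wait-parsing regex here is illustrative; the real parser reads the provider's actual error payload:

```python
import re

LLM_RATE_LIMIT_MAX_WAIT_SECONDS = 30
RETRY_AFTER_BUFFER_SECONDS = 2

def plan_rate_limit_action(error_message: str) -> str:
    """Decide between sleep-and-retry and switching to the fallback model."""
    m = re.search(r"retry(?:ing)?(?: after| in)?\s+(\d+(?:\.\d+)?)\s*s", error_message, re.I)
    wait = float(m.group(1)) if m else None
    if wait is not None and wait <= LLM_RATE_LIMIT_MAX_WAIT_SECONDS:
        return f"sleep {wait + RETRY_AFTER_BUFFER_SECONDS:g}s, retry once"
    # No parsable hint, or the wait is too long (e.g. daily quota): don't sleep
    return "switch to NINEROUTER_FALLBACK_MODEL (if configured)"
```

The buffer absorbs clock skew between the provider's retry hint and the local sleep.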
```
py analyze.py --candidates    # inspect the promotion queue (all statuses + pending)
py analyze.py --promote       # promote pending terms (≥2 jobs) into skills catalog
py analyze.py --promote 3     # same, threshold = 3 jobs
py analyze.py --all --promote # promote first, then analyze with enriched skills

# Typer CLI equivalents
stackpulse analyze --candidates
stackpulse analyze --promote 2
stackpulse analyze --promote 3
stackpulse analyze --all --promote 2
```

Interactive review (Rich + prompts): after `--llm` runs have filled the queue, use `stackpulse review-skills` (or wizard option 5) to approve or reject each pending term, or bulk-promote by minimum job count — no SQL required. Once promoted, terms are matched by regex in all future runs — no `--llm` flag needed.

To reject a term so it never reappears:
```
sqlite3 data/skills.db "UPDATE skill_candidates SET status='rejected' WHERE term='<term>'"
```

| Variable | Default | Description |
|---|---|---|
| `SEARCH_QUERIES` | 11 queries | List of (keywords, location) tuples |
| `JOBS_PER_QUERY` | 25 | Max jobs fetched per query |
| `DELAY_BETWEEN_JOBS` | 3 | Pause (seconds) between individual job page scrapes |
| `DELAY_BETWEEN_QUERIES` | 5 | Pause (seconds) between search queries |
| `OUTPUT_DIR` | "data" | Directory for JSON output, logs, DB, and debug dumps |
| `SESSION_FILE` | "session.json" | Saved browser session |
| Variable | Default | Description |
|---|---|---|
| `PAGE_LOAD_TIMEOUT_MS` | 60000 | Timeout for `page.goto(..., wait_until="domcontentloaded")` |
| `H1_WAIT_TIMEOUT_MS` | 5000 | Timeout waiting for job title `<h1>` |
| `BUTTON_CLICK_TIMEOUT_MS` | 3000 | Timeout for clicking expand buttons |
| `POST_CLICK_SETTLE_SECONDS` | 0.5 | Sleep after each expand click |
| `POST_EXPAND_SETTLE_SECONDS` | 1.0 | Sleep before extraction starts |
| `DEBUG_HTML_SNIPPET_CHARS` | 8000 | Max HTML chars written to debug file |
| `DESCRIPTION_MIN_CHARS` | 100 | Minimum length for an accepted job description |
| Variable | Default | Description |
|---|---|---|
| `DB_FILENAME` | "skills.db" | SQLite filename inside `OUTPUT_DIR` |
| `NINEROUTER_BASE_URL` | "http://localhost:20128/v1" | OpenAI-compatible 9router endpoint |
| `NINEROUTER_MODEL` | "9router-combo" | Primary model (use a concrete provider only if wired in 9router) |
| `NINEROUTER_FALLBACK_MODEL` | "" | Optional fallback when the primary fails (e.g. after 429) |
| `NINEROUTER_API_KEY` | "local" | API key passed to the OpenAI-compatible 9router client |
| `LLM_MAX_INPUT_CHARS` | 8000 | Max characters sent from each posting to the LLM |
| `LLM_MAX_OUTPUT_TOKENS` | 1000 | LLM completion token cap |
| `LLM_RESPONSE_FORMAT_JSON_OBJECT` | True | Request JSON-object mode; retries once without it if the proxy rejects the parameter |
| `LLM_RATE_LIMIT_MAX_WAIT_SECONDS` | 30 | Max retry sleep for 429 before fallback |
| `LLM_LOW_SIGNAL_WINDOW_JOBS` | 5 | Rolling window size for the LLM stored-row health check |
| `LLM_LOW_SIGNAL_REFERENCE_SUM` | 48 | Reference total LLM rows for that window (~regex baseline) |
| `LLM_LOW_SIGNAL_WARN_BELOW_SUM` | 24 | Warn when the rolling sum is below this (50% of reference by default) |
| `RETRY_AFTER_BUFFER_SECONDS` | 2 | Safety buffer added to the parsed retry-after |
| `LLM_CANDIDATE_THRESHOLD` | 2 | Min job occurrences to promote a candidate term |
Current search targets: Berlin, Hamburg, Munich, Germany (general), Vienna, Amsterdam, Luxembourg, Barcelona, Madrid, London, Remote — all for senior Python/FastAPI backend roles.
Use these as a practical baseline in `config.py`:

```
# Fast local test profile (quick feedback)
JOBS_PER_QUERY = 3
DELAY_BETWEEN_JOBS = 1
DELAY_BETWEEN_QUERIES = 2
PAGE_LOAD_TIMEOUT_MS = 45_000
H1_WAIT_TIMEOUT_MS = 5_000
LLM_RATE_LIMIT_MAX_WAIT_SECONDS = 15

# Stable full-run profile (fewer limits/blocks)
JOBS_PER_QUERY = 25
DELAY_BETWEEN_JOBS = 3
DELAY_BETWEEN_QUERIES = 5
PAGE_LOAD_TIMEOUT_MS = 60_000
H1_WAIT_TIMEOUT_MS = 15_000
LLM_RATE_LIMIT_MAX_WAIT_SECONDS = 30
# Optional second route if primary exhausts: NINEROUTER_FALLBACK_MODEL = "groq/llama-3.3-70b-versatile"
```

Tip: use the fast profile for selector/debug iteration, then switch back to the stable profile for production collection runs.
| Field | Source |
|---|---|
| `linkedin_url` | Job URL |
| `job_title` | Page `<h1>` |
| `company` | Company link near title |
| `company_linkedin_url` | Company `/company/` href |
| `location` | Location text in top card |
| `posted_date` | "X days ago" text |
| `applicant_count` | "N applicants" text |
| `job_description` | Full description text |
| `salary_extracted` | Regex over description (best-effort) |
| `search_keywords` | Which query found this job |
| `search_location` | Which location was searched |
| `scraped_date` | ISO date of the scrape |
LinkedIn does not expose salary as a structured field; `salary_extracted` is regex-based and best-effort.
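A best-effort extractor of this kind might look like the following — the pattern is illustrative only, not the regex `analyze.py` actually uses:

```python
import re

# Illustrative euro-range pattern: "€70.000 - €90.000", "EUR 70k–90k", etc.
SALARY_RE = re.compile(
    r"(?:€|EUR)\s*\d{2,3}(?:[.,]\d{3})?k?\s*(?:-|–|to)\s*(?:€|EUR)?\s*\d{2,3}(?:[.,]\d{3})?k?",
    re.IGNORECASE,
)

def extract_salary(description: str):
    """Return the first salary-looking span, or None (best-effort by design)."""
    m = SALARY_RE.search(description)
    return m.group(0) if m else None
```

Because the field is free text, misses and partial matches are expected; downstream reporting treats `None` as "no salary hint".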
17 categories, 227+ terms, stored in `data/skills.db` (SQLite). The DB is auto-created and seeded from `SKILLS_SEED` in `analyze.py` on first run. Terms are stored as plain lowercase text; regex escaping is applied at load time only. Additional terms accumulate automatically via the LLM promotion pipeline. The same DB also stores `scraped_jobs`, used by `scrape.py` as the persistent dedupe ledger across daily JSON outputs.
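Load-time escaping can be sketched like this (illustrative; punctuation-heavy terms such as `c++` need extra care because `\b` assumes word characters at the match edges):

```python
import re

def compile_term(term: str) -> re.Pattern:
    # Terms live in the DB as plain lowercase text;
    # escaping happens only when the patterns are built.
    return re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
```

Escaping at load time keeps the catalog readable — `coverage.py` stays `coverage.py` in SQL, and the dot is only neutralized when matching.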
| Category | Examples |
|---|---|
| Languages | python, go, rust, java, kotlin, typescript |
| Python Frameworks | fastapi, django, flask, aiohttp, starlette |
| Python Libraries | sqlalchemy, pydantic, celery, asyncpg, boto3 |
| Databases — Relational | postgresql, mysql, cockroachdb, aurora |
| Databases — NoSQL/Search | mongodb, redis, elasticsearch, cassandra, dynamodb |
| Databases — Analytical | clickhouse, bigquery, snowflake, dbt |
| Cloud | aws, gcp, azure, lambda, s3, step functions, bedrock |
| Containers & Orchestration | kubernetes, docker, helm, argo, istio |
| IaC & CI/CD | terraform, pulumi, github actions, gitlab ci, argocd |
| Messaging & Streaming | kafka, rabbitmq, sqs, kinesis, nats |
| API & Architecture | rest, graphql, grpc, microservices, solid, cqrs, ddd |
| Auth & Security | oauth2, jwt, keycloak, auth0, vault |
| Monitoring & Observability | prometheus, grafana, datadog, opentelemetry |
| Testing | pytest, tdd, testcontainers, hypothesis, coverage.py |
| AI / ML (in JD) | ai, llm, langchain, pgvector, rag, generative ai, cursor |
| Soft / Process | agile, mentoring, tech lead, staff engineer |
| Languages (non-technical) | english, german, french, dutch |
Add a term without a code change:

```
sqlite3 data/skills.db "INSERT OR IGNORE INTO skills(category_id,term) SELECT id,'hetzner' FROM categories WHERE name='Cloud'"
```

Add an alias (e.g. a multilingual synonym):

```
sqlite3 data/skills.db \
  "INSERT INTO skill_aliases(skill_id,alias,canonical,lang,alias_type)
   SELECT id,'Node.js','node.js','en','variant' FROM skills WHERE term='node.js'"
```

Top LLM-discovered skills across all jobs:

```
sqlite3 data/skills.db "SELECT skill, COUNT(DISTINCT url_key) jobs FROM llm_results GROUP BY skill ORDER BY jobs DESC LIMIT 20"
```

Promotion queue view:

```
sqlite3 data/skills.db "SELECT sc.term, c.name AS category, sc.jobs_count, sc.status FROM skill_candidates sc JOIN categories c ON c.id=sc.category_id ORDER BY sc.jobs_count DESC"
```

- LinkedIn SPA timing: the library `JobScraper` is unreliable for React-rendered content. `job_scraper_direct.py` waits for `<h1>` and uses selector fallbacks.
- No structured salary data: salary extraction is regex over free text.
- Session expiry: LinkedIn sessions expire; rerun `setup_session.py`.
- LinkedIn UI drift: CSS selectors can break due to A/B tests. If fields become `null`, inspect `data/debug/` snapshots and update the selector lists.
- LLM quota limits: long quota windows skip the sleep-retry and use the fallback model if configured.
- LLM invalid JSON: some models occasionally emit prose inside JSON arrays; the prompt forbids that, `response_format` JSON mode is requested when supported, and the analyzer retries once with a stricter instruction. If logs still show `LLM output was not valid JSON`, set `NINEROUTER_FALLBACK_MODEL`, or set `LLM_RESPONSE_FORMAT_JSON_OBJECT = False` if your OpenAI-compatible proxy errors on that parameter (the client also retries without JSON mode when the error looks like an unsupported parameter).
- Single-letter language matching: `c` is matched as `\bc\b`, which can false-positive on the English word. LLM extraction correctly disambiguates the C language from prose.
- Best-effort extraction paths: scraper field extractors fail soft by design and continue fallback traversal; debug logs now include selector/button context to make drift diagnosis faster.