octrow/stackpulse

stackpulse

Collects job postings across Europe, saves them as JSON, and analyzes the skill landscape to answer: "what do employers actually want in 2026?"

Default path uses a fast HTTP guest scraper (scrape_fast.py). Optional browser mode uses Patchright (patchright_shim.py) plus a custom job-page scraper that fixes the library's broken content extraction — use it when you need fields like applicant_count. Built to be source-agnostic — LinkedIn is just the first feed.


Project layout

stackpulse/
├── config.py               # search queries, delays, timeouts, paths, and LLM settings
├── cli.py                  # Typer + Rich CLI (stackpulse command)
├── setup_session.py        # one-time LinkedIn login → session file
├── scrape.py               # main scraper orchestrator + per-query/per-job helpers
├── patchright_shim.py      # map playwright.async_api → patchright (linkedin-scraper compatibility)
├── job_scraper_direct.py   # custom browser scraper (replaces broken library JobScraper)
├── analyze.py              # analysis orchestration/reporting + LLM extraction pipeline
├── analysis_db.py          # DB schema/migrations + shared canonical_linkedin_job_key helper
├── analysis_candidates.py  # candidate queue and promotion workflows
├── analysis_llm_cache.py   # LLM cache read/write helpers
├── pyproject.toml          # packaging + console script entrypoint
├── stackpulse              # repo launcher (auto bootstrap + CLI dispatch)
├── install.sh              # one-time symlink installer to ~/.local/bin/stackpulse
├── requirements.txt
├── .env                    # LinkedIn credentials (gitignored)
├── .env.example
└── data/
    ├── jobs_YYYY-MM-DD.json        # one file per scrape day
    ├── jobs_*_analysis.xlsx        # Excel export with one column per skill category
    ├── skills.db                   # SQLite: skills catalog + LLM cache + scrape dedupe ledger
    ├── scraper.log                 # timestamped run log
    └── debug/                      # screenshots + HTML dumped when a page fails to load

Setup

Recommended (repo launcher + one-time install):

cp .env.example .env
# fill in LINKEDIN_EMAIL and LINKEDIN_PASSWORD

./install.sh
stackpulse --help

What this does:

  • install.sh creates ~/.local/bin/stackpulse symlink to this repo launcher
  • launcher follows symlinks to resolve the real repo path
  • launcher auto-creates .venv on first run
  • launcher auto-installs dependencies when missing
  • launcher auto-installs Patchright Chromium when missing

Optional packaging-based install (alternative):

pip install -e .

Patchright has no patchright shell command. Install the Chromium build with the same Python that has the dependencies (e.g. your venv):

python3 -m patchright install chromium
# or, from this repo after ./stackpulse created .venv:
.venv/bin/python -m patchright install chromium

If stackpulse is still not found, add this to your shell profile (~/.bashrc or ~/.zshrc):

export PATH="$HOME/.local/bin:$PATH"

Then reload shell config (source ~/.bashrc or source ~/.zshrc).


CLI (Typer + Rich)

Long-running steps show a floral status spinner (Unicode frames) plus a whimsical verb paired with the real task (venv install, session setup, scrape, or analyze --no-verbose). The task name uses cyan so it stays readable next to the verb. To use Rich’s classic dot spinner instead, set STACKPULSE_STATUS_SPINNER=dots in the environment.

Running stackpulse with no arguments launches an interactive wizard that prompts for command and options:

StackPulse — what would you like to do?

  1  analyze        Analyze scraped jobs and export stats
  2  scrape         Scrape LinkedIn for new jobs
  3  auto           Bootstrap + scrape + analyze end-to-end
  4  setup-session  Create or refresh LinkedIn session
  5  review-skills  Accept or reject LLM skill candidates (queue)
  6  quit

Choice [1-6]:

All subcommands still work non-interactively via flags:

stackpulse --help
stackpulse setup-session
stackpulse scrape --limit 3
stackpulse analyze --all --llm
stackpulse analyze --all --title-contains "Backend" --location-contains "Berlin"
stackpulse analyze --candidates
stackpulse review-skills

Auto workflow (one command)

stackpulse auto
stackpulse auto --limit 3 --all
stackpulse auto --llm --promote 2

stackpulse auto behavior:

  • Creates virtualenv if missing (.venv by default)
  • Installs dependencies only when missing
  • Installs Patchright Chromium only when missing
  • Skips session setup if session.json already exists
  • Fails fast on the first failed step and prints a Rich summary table

You can still run legacy script entrypoints (py setup_session.py, py scrape.py, py analyze.py).

Shell completion

Requires a working stackpulse command in $PATH (via ./install.sh symlink setup or pip install -e .).

stackpulse --install-completion   # install tab completion for your current shell
stackpulse --show-completion      # print the completion script (for manual setup)

Restart your shell (or source ~/.zshrc / ~/.bashrc) after installing.


Workflow

1. Create a session (once, or when session expires)

py setup_session.py
# or
stackpulse setup-session
  • If credentials are in .env, logs in programmatically
  • Otherwise opens a browser window for manual login
  • Saves cookies/storage to SESSION_FILE from config.py (default session.json)
  • Re-run whenever LinkedIn shows you a login page again
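
A cheap pre-flight check before browser-mode scraping can be sketched from the points above. This helper is not part of the repo, and the 14-day lifetime is a guess; the authoritative signal remains LinkedIn serving a login page:

```python
import time
from pathlib import Path

def session_needs_refresh(session_file: Path, max_age_days: float = 14) -> bool:
    """Heuristic: treat a missing or old session file as stale.
    The real signal is LinkedIn serving a login page, at which point
    setup-session must be re-run regardless of file age."""
    if not session_file.exists():
        return True
    age_days = (time.time() - session_file.stat().st_mtime) / 86400
    return age_days > max_age_days
```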

2. Scrape jobs

py scrape.py              # full run — all queries in config.py
py scrape.py --limit 3    # quick test, 3 jobs per query
py scrape.py --fresh      # ignore all previous results, re-scrape everything

# Typer CLI equivalents (default mode is fast HTTP — no browser)
stackpulse scrape
stackpulse scrape --limit 3
stackpulse scrape --fresh
stackpulse scrape --mode browser   # optional: Patchright + session.json (e.g. applicant_count)

# Legacy scripts: scrape.py = browser only; scrape_fast.py = HTTP only
py scrape_fast.py --limit 3

Scrape modes: stackpulse scrape defaults to --mode fast (guest HTTP, fastest, no session.json). Use --mode browser when you need logged-in fields such as applicant_count. The interactive wizard asks [fast/browser] with fast as the default.

Resume is automatic — on every run the scraper loads previously scraped job keys from data/skills.db (table scraped_jobs) and skips them. Existing historical data/jobs_*.json files are backfilled once into the DB when the ledger is empty. No flag needed.

Query-level resume — if you stop mid-run (Ctrl+C), the next scrape continues from the same search query index (order of SEARCH_QUERIES in config.py), not from the beginning: fast mode writes data/scrape_resume_fast.json, browser mode writes data/scrape_resume_browser.json. Delete that file (or use --fresh) to start again from query 1.

Ctrl+C exits cleanly — progress is saved after every job. Re-run at any time to pick up where you left off.
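
The query-level resume bookkeeping described above can be sketched as follows (file name and JSON shape are assumptions based on this README; scrape.py may persist more fields):

```python
import json
from pathlib import Path

RESUME_FILE = Path("data/scrape_resume_fast.json")  # fast mode; browser mode uses its own file

def load_resume_index() -> int:
    """Return the SEARCH_QUERIES index to continue from (0 on a fresh run)."""
    if RESUME_FILE.exists():
        return json.loads(RESUME_FILE.read_text()).get("query_index", 0)
    return 0

def save_resume_index(query_index: int) -> None:
    """Persist progress after finishing a query; --fresh simply deletes the file."""
    RESUME_FILE.parent.mkdir(parents=True, exist_ok=True)
    RESUME_FILE.write_text(json.dumps({"query_index": query_index}))
```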

Output is saved incrementally after each job to data/jobs_YYYY-MM-DD.json.

Deduplication key is canonicalized via shared analysis_db.canonical_linkedin_job_key (LinkedIn job ID when available; otherwise normalized URL path), and persisted in skills.db so the same posting is skipped across days even if URL query params differ.
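
The canonicalization can be sketched like this; the real helper lives in analysis_db.canonical_linkedin_job_key and may differ in detail (the regex and key prefixes here are assumptions):

```python
import re
from urllib.parse import urlparse

def canonical_linkedin_job_key(url: str) -> str:
    """Sketch of the dedupe key: prefer the numeric LinkedIn job ID,
    otherwise fall back to the normalized URL path (query params dropped)."""
    path = urlparse(url).path.rstrip("/")
    match = re.search(r"/jobs/view/(?:[^/]*-)?(\d+)", path)
    if match:
        return f"job:{match.group(1)}"
    return f"path:{path.lower()}"

# Same posting with different query params collapses to one key:
a = canonical_linkedin_job_key("https://www.linkedin.com/jobs/view/3948571234/?refId=abc")
b = canonical_linkedin_job_key("https://www.linkedin.com/jobs/view/3948571234?trk=search")
```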

3. Analyze skills

py analyze.py                              # analyze today's file
py analyze.py --file data/jobs_2026-04-01.json
py analyze.py --all                        # merge all collected files
py analyze.py --llm                        # + open LLM extraction (free, via 9router)
py analyze.py --all --llm

# Typer CLI equivalents
stackpulse analyze
stackpulse analyze --file data/jobs_2026-04-01.json
stackpulse analyze --all
stackpulse analyze --llm          # note: double-dash required; -llm and "analyze llm" are invalid
stackpulse analyze --all --llm

# Cohort filters (case-insensitive substring match)
stackpulse analyze --all --title-contains "Backend"
stackpulse analyze --all --location-contains "Berlin"
stackpulse analyze --all --llm --title-contains "Senior" --location-contains "Germany"

Prints a skill frequency table to stdout and saves data/jobs_*_analysis.xlsx with one column per skill category.

Report sections:

Section Description
Extraction quality Jobs with empty description and jobs with zero skills extracted (%)
Top skills Frequency + prevalence % + bar; uses merged regex+LLM metric when --llm was run
By category Top terms per skill category with prevalence % (regex + LLM unified)
Top locations Most frequent scraped location values
Skills by location Top 3 skills per search_location (only shown when >1 search location present)
Salary hints Postings where a salary pattern was regex-extracted (with company + location)
Coverage gaps Skills discovered by LLM but not yet in catalog (only with --llm)

--llm mode calls NINEROUTER_MODEL through your local 9router endpoint (NINEROUTER_BASE_URL, default http://localhost:20128/v1) with a skills-aware prompt — the full skills catalog is sent to the LLM so it matches against known terms first and only flags genuinely new discoveries. Results are cached in data/skills.db — repeat runs are instant with no API calls.

Regex vs LLM “skill counts”: the activity log line “N skill row(s) to store” counts LLM JSON terms written to llm_results (matched catalog terms plus new_terms), not regex taxonomy hits. A typical posting can show many regex matches across the catalog (~8–12+) but only a few LLM rows — that is normal. To catch pathological under-extraction (e.g. a model that always returns almost nothing), analyze.py tracks a rolling window of the last 5 jobs: if the combined stored row count falls below LLM_LOW_SIGNAL_WARN_BELOW_SUM (default 24, half of LLM_LOW_SIGNAL_REFERENCE_SUM 48), it logs a one-shot warning per low episode in the Live panel and in data/analysis_activity.log (the episode resets once the rolling sum climbs back to the threshold or above).
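
The rolling-window health check can be sketched like this (class and method names are illustrative, not analyze.py's actual code):

```python
from collections import deque

# Assumed constants mirroring the config.py defaults quoted above
LLM_LOW_SIGNAL_WINDOW_JOBS = 5
LLM_LOW_SIGNAL_WARN_BELOW_SUM = 24

class LowSignalMonitor:
    """Warn once per 'low episode' when the rolling sum of stored LLM
    rows over the last N jobs drops below the threshold."""

    def __init__(self):
        self.window = deque(maxlen=LLM_LOW_SIGNAL_WINDOW_JOBS)
        self.warned = False

    def record(self, stored_rows: int) -> bool:
        """Return True exactly when a new warning should be emitted."""
        self.window.append(stored_rows)
        low = (len(self.window) == self.window.maxlen
               and sum(self.window) < LLM_LOW_SIGNAL_WARN_BELOW_SUM)
        if low and not self.warned:
            self.warned = True
            return True
        if not low:
            self.warned = False  # episode resets when the signal recovers
        return False
```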

After each --llm run, newly discovered terms (seen in ≥ LLM_CANDIDATE_THRESHOLD jobs, default 2) are automatically queued in skill_candidates. Because the LLM is skills-aware, uncovered terms are genuinely new technologies/tools — not synonyms or generic concepts.

analyze.py now splits entrypoint orchestration into _build_parser(), _handle_mode_only_paths(), and _load_run_context() to keep mode routing, path resolution, and runtime setup easier to maintain.

--llm prints two gap metrics: raw uncovered terms and actionable uncovered terms. Actionable terms satisfy jobs_count >= threshold, are not in SKIP_TERMS, and are not already present in skill_candidates.

Rate-limit handling (429): analyze.py parses retry wait from the provider error. If wait ≤ LLM_RATE_LIMIT_MAX_WAIT_SECONDS (default 30), it sleeps and retries once. For longer waits (daily quota exhausted), it falls back to NINEROUTER_FALLBACK_MODEL if configured.
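
A sketch of that decision logic (the retry-wait parsing regex and return strings are illustrative assumptions, not analyze.py's actual code):

```python
import re

# Assumed defaults from config.py as described above
LLM_RATE_LIMIT_MAX_WAIT_SECONDS = 30
RETRY_AFTER_BUFFER_SECONDS = 2

def plan_429_recovery(error_message: str) -> str:
    """Short waits -> sleep (plus buffer) and retry once;
    long or unparseable waits -> switch to the fallback model."""
    match = re.search(r"retry.*?(\d+(?:\.\d+)?)\s*s", error_message, re.IGNORECASE)
    wait = float(match.group(1)) if match else None
    if wait is not None and wait <= LLM_RATE_LIMIT_MAX_WAIT_SECONDS:
        return f"sleep {wait + RETRY_AFTER_BUFFER_SECONDS:g}s then retry"
    return "use NINEROUTER_FALLBACK_MODEL if configured"
```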

4. Promote LLM-discovered skills into catalog

py analyze.py --candidates                 # inspect the promotion queue (all statuses + pending)
py analyze.py --promote                    # promote pending terms (≥2 jobs) into skills catalog
py analyze.py --promote 3                  # same, threshold = 3 jobs
py analyze.py --all --promote              # promote first, then analyze with enriched skills

# Typer CLI equivalents
stackpulse analyze --candidates
stackpulse analyze --promote 2
stackpulse analyze --promote 3
stackpulse analyze --all --promote 2

Interactive review (Rich + prompts): after --llm runs have filled the queue, use stackpulse review-skills (or wizard option 5) to approve or reject each pending term, or bulk-promote by minimum job count — no SQL required.

Once promoted, terms are matched by regex in all future runs — no --llm flag needed.

To reject a term so it never reappears:

sqlite3 data/skills.db "UPDATE skill_candidates SET status='rejected' WHERE term='<term>'"

Configuration (config.py)

Core scrape settings

Variable Default Description
SEARCH_QUERIES 11 queries List of (keywords, location) tuples
JOBS_PER_QUERY 25 Max jobs fetched per query
DELAY_BETWEEN_JOBS 3 Pause (seconds) between individual job page scrapes
DELAY_BETWEEN_QUERIES 5 Pause (seconds) between search queries
OUTPUT_DIR "data" Directory for JSON output, logs, DB, and debug dumps
SESSION_FILE "session.json" Saved browser session

Scraper timeouts / extraction behavior

Variable Default Description
PAGE_LOAD_TIMEOUT_MS 60000 Timeout for page.goto(..., wait_until="domcontentloaded")
H1_WAIT_TIMEOUT_MS 5000 Timeout waiting for job title <h1>
BUTTON_CLICK_TIMEOUT_MS 3000 Timeout for clicking expand buttons
POST_CLICK_SETTLE_SECONDS 0.5 Sleep after each expand click
POST_EXPAND_SETTLE_SECONDS 1.0 Sleep before extraction starts
DEBUG_HTML_SNIPPET_CHARS 8000 Max HTML chars written to debug file
DESCRIPTION_MIN_CHARS 100 Minimum length for accepted job description

LLM / analysis settings

Variable Default Description
DB_FILENAME "skills.db" SQLite filename inside OUTPUT_DIR
NINEROUTER_BASE_URL "http://localhost:20128/v1" OpenAI-compatible 9router endpoint
NINEROUTER_MODEL "9router-combo" Primary model (use a concrete provider only if wired in 9router)
NINEROUTER_FALLBACK_MODEL "" Optional fallback when primary fails (e.g. after 429)
NINEROUTER_API_KEY "local" API key passed to the OpenAI-compatible 9router client
LLM_MAX_INPUT_CHARS 8000 Max characters sent from each posting to LLM
LLM_MAX_OUTPUT_TOKENS 1000 LLM completion token cap
LLM_RESPONSE_FORMAT_JSON_OBJECT True Request JSON-object mode; retries once without it if the proxy rejects the parameter
LLM_RATE_LIMIT_MAX_WAIT_SECONDS 30 Max retry sleep for 429 before fallback
LLM_LOW_SIGNAL_WINDOW_JOBS 5 Rolling window size for LLM stored-row health check
LLM_LOW_SIGNAL_REFERENCE_SUM 48 Reference total LLM rows for that window (~regex baseline)
LLM_LOW_SIGNAL_WARN_BELOW_SUM 24 Warn when rolling sum is below this (50% of reference by default)
RETRY_AFTER_BUFFER_SECONDS 2 Safety buffer added to parsed retry-after
LLM_CANDIDATE_THRESHOLD 2 Min job occurrences to promote a candidate term

Current search targets: Berlin, Hamburg, Munich, Germany (general), Vienna, Amsterdam, Luxembourg, Barcelona, Madrid, London, Remote — all for senior Python/FastAPI backend roles.

Recommended minimal config profile

Use these as a practical baseline in config.py:

# Fast local test profile (quick feedback)
JOBS_PER_QUERY = 3
DELAY_BETWEEN_JOBS = 1
DELAY_BETWEEN_QUERIES = 2
PAGE_LOAD_TIMEOUT_MS = 45_000
H1_WAIT_TIMEOUT_MS = 5_000
LLM_RATE_LIMIT_MAX_WAIT_SECONDS = 15
# Stable full-run profile (fewer limits/blocks)
JOBS_PER_QUERY = 25
DELAY_BETWEEN_JOBS = 3
DELAY_BETWEEN_QUERIES = 5
PAGE_LOAD_TIMEOUT_MS = 60_000
H1_WAIT_TIMEOUT_MS = 15_000
LLM_RATE_LIMIT_MAX_WAIT_SECONDS = 30
# Optional second route if primary exhausts: NINEROUTER_FALLBACK_MODEL = "groq/llama-3.3-70b-versatile"

Tip: use the fast profile for selector/debug iteration, then switch back to the stable profile for production collection runs.


Collected fields

Field Source
linkedin_url Job URL
job_title Page <h1>
company Company link near title
company_linkedin_url Company /company/ href
location Location text in top card
posted_date "X days ago" text
applicant_count "N applicants" text
job_description Full description text
salary_extracted Regex over description (best-effort)
search_keywords Which query found this job
search_location Which location was searched
scraped_date ISO date of the scrape

LinkedIn does not expose salary as a structured field. salary_extracted is regex-based and best effort.


Skills catalog (data/skills.db)

17 categories, 227+ terms, stored in data/skills.db (SQLite). The DB is auto-created and seeded from SKILLS_SEED in analyze.py on first run. Terms are stored as plain lowercase text; regex escaping is applied at load time only. Additional terms accumulate automatically via the LLM promotion pipeline. The same DB also stores scraped_jobs, used by scrape.py as the persistent dedupe ledger across daily JSON outputs.

Category Examples
Languages python, go, rust, java, kotlin, typescript
Python Frameworks fastapi, django, flask, aiohttp, starlette
Python Libraries sqlalchemy, pydantic, celery, asyncpg, boto3
Databases — Relational postgresql, mysql, cockroachdb, aurora
Databases — NoSQL/Search mongodb, redis, elasticsearch, cassandra, dynamodb
Databases — Analytical clickhouse, bigquery, snowflake, dbt
Cloud aws, gcp, azure, lambda, s3, step functions, bedrock
Containers & Orchestration kubernetes, docker, helm, argo, istio
IaC & CI/CD terraform, pulumi, github actions, gitlab ci, argocd
Messaging & Streaming kafka, rabbitmq, sqs, kinesis, nats
API & Architecture rest, graphql, grpc, microservices, solid, cqrs, ddd
Auth & Security oauth2, jwt, keycloak, auth0, vault
Monitoring & Observability prometheus, grafana, datadog, opentelemetry
Testing pytest, tdd, testcontainers, hypothesis, coverage.py
AI / ML (in JD) ai, llm, langchain, pgvector, rag, generative ai, cursor
Soft / Process agile, mentoring, tech lead, staff engineer
Languages (non-technical) english, german, french, dutch
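
The load-time regex escaping mentioned above can be sketched as follows (helper name assumed). Lookarounds are used instead of \b because terms like c++ and node.js begin or end with non-word characters, where \b misbehaves:

```python
import re

def compile_skill_patterns(terms):
    """Terms are stored as plain lowercase text; escape them only when
    building matchers, with word-ish boundaries so 'go' != 'django'."""
    return {
        t: re.compile(r"(?<!\w)" + re.escape(t) + r"(?!\w)", re.IGNORECASE)
        for t in terms
    }

patterns = compile_skill_patterns(["c++", "go", "node.js"])
text = "We use Go and Node.js; C++ is a plus. Django experience welcome."
found = [t for t, p in patterns.items() if p.search(text)]
```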

Add term without code change:

sqlite3 data/skills.db "INSERT OR IGNORE INTO skills(category_id,term) SELECT id,'hetzner' FROM categories WHERE name='Cloud'"

Add alias (e.g. multilingual synonym):

sqlite3 data/skills.db \
  "INSERT INTO skill_aliases(skill_id,alias,canonical,lang,alias_type)
   SELECT id,'Node.js','node.js','en','variant' FROM skills WHERE term='node.js'"

Top LLM-discovered skills across all jobs:

sqlite3 data/skills.db "SELECT skill, COUNT(DISTINCT url_key) jobs FROM llm_results GROUP BY skill ORDER BY jobs DESC LIMIT 20"

Promotion queue view:

sqlite3 data/skills.db "SELECT sc.term, c.name AS category, sc.jobs_count, sc.status FROM skill_candidates sc JOIN categories c ON c.id=sc.category_id ORDER BY sc.jobs_count DESC"

Known issues & limitations

  • LinkedIn SPA timing: library JobScraper is unreliable for React-rendered content. job_scraper_direct.py waits for <h1> and uses selector fallbacks.
  • No structured salary data: salary extraction is regex over free text.
  • Session expiry: LinkedIn sessions expire; rerun setup_session.py.
  • LinkedIn UI drift: CSS selectors can break due to A/B tests. If fields become null, inspect data/debug/ snapshots and update selector lists.
  • LLM quota limits: long quota windows skip sleep-retry and use fallback model if configured.
  • LLM invalid JSON: some models occasionally emit prose inside JSON arrays. The prompt forbids this, response_format JSON mode is requested when supported, and the analyzer retries once with a stricter instruction. If logs still show LLM output was not valid JSON, set NINEROUTER_FALLBACK_MODEL; if your OpenAI-compatible proxy errors on the response_format parameter, set LLM_RESPONSE_FORMAT_JSON_OBJECT = False (the client also retries without JSON mode when the error looks like an unsupported parameter).
  • Single-letter language matching: c is matched as \bc\b, which can false-positive on a stray letter "c" in prose (e.g. "appendix c"). LLM extraction correctly disambiguates the C language from ordinary text.
  • Best-effort extraction paths: scraper field extractors fail soft by design and continue fallback traversal; debug logs now include selector/button context to make drift diagnosis faster.
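
The \bc\b false positive from the list above can be demonstrated directly (the sample strings are illustrative):

```python
import re

# The catalog stores the language "c" as a bare term, matched with word boundaries
c_pattern = re.compile(r"\bc\b", re.IGNORECASE)

hits = [s for s in [
    "experience with C and Rust",   # true positive
    "see appendix c for details",   # false positive on a stray letter
    "experience with Scala",        # no match: 'c' sits inside a word
] if c_pattern.search(s)]
```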

About

LinkedIn scraper + skill analyzer for Python backend roles. Playwright extraction, SQLite taxonomy, optional 9router LLM candidate pipeline.
