chore: promote develop to staging #36

Closed
spideystreet wants to merge 338 commits into staging from develop

@spideystreet
Collaborator

Promotion of develop to staging to cut a release candidate.

What's included

| PR | Subject |
| --- | --- |
| #34 | feat(api): optional X-Service-Token auth for backend-only access |
| #35 | chore(ci): remove claude-code-review and claude mention workflows |

Deployment notes

  • The service-token check in #34 (feat(api): optional X-Service-Token auth for backend-only access) is env-gated. When OST_LINKER_SERVICE_TOKEN is unset, the API behaves exactly as it does today (open).
  • Do not set OST_LINKER_SERVICE_TOKEN on linker prod until ost-backend (with the paired sending code) is configured with the same value. To avoid 401s, set it on the backend first, then on the linker.

spideystreet and others added 30 commits December 8, 2025 16:06
…odels

- Rename 'analytics' schema to 'github'
- Implement upsert logic in Python assets
- Consolidate dbt models into 'pvt_github_project'
- Add 'clean_text' macro for context preparation
- Filter rejected projects via INNER JOIN
- Rename prod_github_project to prd_github_project
- Update .env.example with ML and Scraper variables
…_embedding_job

project classification pipeline
spideystreet and others added 27 commits March 6, 2026 21:02
- Rename project_classification_job.py -> project_enrichment_job.py
- Rename run_all_schedule.py -> project_enrichment_schedule.py
- Delete classification_sensor.py (no longer registered in definitions)
- Fix architecture.md data flow to use current group names
- Update all imports accordingly

Co-Authored-By: spidecode-bot <263227865+spicode-bot@users.noreply.github.com>
- Add contract enforcement (data_type + constraints) on all 4 marts
- Add relationship tests on match models (FK to Project and User)
- Add not_null/unique tests on key columns
- Create clamp() macro for score bounding
- Create safe_divide() macro for zero-safe division

Co-Authored-By: spidecode-bot <263227865+spicode-bot@users.noreply.github.com>
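
The two macros added in the commit above are dbt/Jinja SQL and their source is not shown on this page. As a rough illustration of the intended semantics only, here is a Python sketch. It assumes that clamp() bounds a score to [0, 1] by default (matching the greatest/least pattern it replaces) and that safe_divide() yields NULL on a zero denominator (matching the nullif pattern). The function signatures are hypothetical.

```python
from typing import Optional


def clamp(value: float, lower: float = 0.0, upper: float = 1.0) -> float:
    """Bound value to [lower, upper], like least(greatest(value, lower), upper)."""
    return min(max(value, lower), upper)


def safe_divide(numerator: float, denominator: float) -> Optional[float]:
    """Return None (standing in for SQL NULL) when the denominator is zero."""
    if denominator == 0:
        return None
    return numerator / denominator
```

For example, clamp(1.7) gives 1.0 and safe_divide(1, 0) gives None rather than raising.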
…iate schema

- Replace manual greatest/least with clamp() macro in match_user_recommendation
- Replace manual ::float/nullif patterns with safe_divide() macro
- Add missing column descriptions to int_user_enriched.yml

Co-Authored-By: spidecode-bot <263227865+spicode-bot@users.noreply.github.com>
Document all macros in _macros.yml with descriptions and typed arguments:
build_project_context, build_user_context, clamp, clean_text,
deduplicate, generate_schema_name, jsonb_to_list, safe_divide

Co-Authored-By: spidecode-bot <263227865+spicode-bot@users.noreply.github.com>
Replace monolithic _macros.yml with individual yml files matching each .sql:
build_project_context, build_user_context, clamp, clean_text,
deduplicate, generate_schema_name, jsonb_to_list, safe_divide

Co-Authored-By: spidecode-bot <263227865+spicode-bot@users.noreply.github.com>
Add yml documentation for each custom SQL test:
- unique_user_project_recommendation: no duplicate (user_id, project_id) pairs
- valid_hybrid_score_bounds: all scores within [0, 1] range

Co-Authored-By: spidecode-bot <263227865+spicode-bot@users.noreply.github.com>
…s, and fixed issues

- Add .sql = .yml file convention as review checklist item #1
- Update Dagster group mappings (project_ml/user_ml replace ml_preparation/matching)
- Add data contracts and dbt 1.10 arguments syntax to checklist
- Move resolved issues to "Fixed" section (clamp, relationships, O(n³), passwords)
- Update score bounds to reference {{ clamp() }} macro

Co-Authored-By: spidecode-bot <263227865+spicode-bot@users.noreply.github.com>
Co-Authored-By: spidecode-bot <263227865+spicode-bot@users.noreply.github.com>
Co-Authored-By: spidecode-bot <263227865+spicode-bot@users.noreply.github.com>
- fix(go): bound io.ReadAll with 10MB LimitReader in fetcher/common.go
- fix(dbt): wrap popularity_score in {{ clamp() }} macro
- fix(dbt): add missing updatedAt column to stg_public__project.yml
- fix(ci): add setup-buildx-action to publish-develop.yml
- style: fix line-too-long in run_all_job.py description

Co-Authored-By: spidecode-bot <263227865+spicode-bot@users.noreply.github.com>
Ingestion is now part of the enrichment flow instead of a separate
manual-only job. This ensures the full project pipeline runs atomically:
scrape → classify → sync → embed → recommend.

- Add "ingestion" group to project_enrichment_job selection
- Delete project_scraper_job.py (no longer needed)
- Remove from definitions.py and test expectations
- Update docs submodule

Co-Authored-By: spidecode-bot <263227865+spicode-bot@users.noreply.github.com>
Query match.project_classification to get existing projectIds and
filter them out before calling the LLM. This avoids redundant API
calls on subsequent runs — only new/unclassified projects are sent
to OpenRouter.

Co-Authored-By: spidecode-bot <263227865+spicode-bot@users.noreply.github.com>
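
The pre-filtering described in the commit above can be sketched as follows. This is a minimal illustration, not the asset's actual code: the function name and the dict shape (an "id" key per project) are assumptions, and the query against match.project_classification is replaced by a plain iterable of IDs.

```python
from typing import Iterable


def select_unclassified(projects: Iterable[dict], classified_ids: Iterable[str]) -> list[dict]:
    """Keep only projects whose id is not already classified."""
    # A set makes each membership check O(1) regardless of table size.
    seen = set(classified_ids)
    return [project for project in projects if project["id"] not in seen]
```

Only the returned projects would then be sent to the LLM, so a rerun with no new projects makes no API calls.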
…iance

The clamp macro returns numeric (DECIMAL) due to literal 1.0, but the
data contract expects double precision (FLOAT). Also increase Dagster
boot timeout from 30s to 60s for the integration test.

Co-Authored-By: spidecode-bot <263227865+spicode-bot@users.noreply.github.com>
…ging

- Extract language_detection and serialization helpers into src/linker/utils/
- Harden IO manager and LLM classifier resource error handling
- Fix int_project_enriched dbt model
- Improve Go scraper structured logging and error handling

Co-Authored-By: spidecode-bot <263227865+spicode-bot@users.noreply.github.com>
- Unit tests: IO manager, LLM classifier, language detection, serialization, Docker infra
- Integration test: Dagster startup smoke test
- Go tests: scraper URL building, fetcher common utilities
- Update CI workflow to run Go tests and pytest markers

Co-Authored-By: spidecode-bot <263227865+spicode-bot@users.noreply.github.com>
- Add dbt file convention rule, update Docker compose services docs
- Add Go test and integration test commands to CLAUDE.md
- Add .mcp.json to gitignore
- Initialize agent memory files

Co-Authored-By: spidecode-bot <263227865+spicode-bot@users.noreply.github.com>
feat: test strategy, pipeline hardening, and dbt contracts
Co-authored-by: spidecode-bot <263227865+spicode-bot@users.noreply.github.com>
)

- Add env_var defaults for POSTGRES_USER/POSTGRES_PASSWORD in dbt profiles
- Skip test_dagster_definitions when dbt manifest is missing in CI
- Update docs submodule to latest ost-docs/main commit

Co-authored-by: spidecode-bot <263227865+spicode-bot@users.noreply.github.com>
* fix(ci): resolve dbt-check, quality, and docs-submodule CI failures

- Add env_var defaults for POSTGRES_USER/POSTGRES_PASSWORD in dbt profiles
- Skip test_dagster_definitions when dbt manifest is missing in CI
- Update docs submodule to latest ost-docs/main commit

Co-Authored-By: spidecode-bot <263227865+spicode-bot@users.noreply.github.com>

* chore(ci): unify sync tokens and add security contact email

- Replace OST_DOCS_TOKEN and OST_BACKEND_TOKEN with single OST_SYNC_TOKEN
- Update all workflows: publish-develop, publish-prod, sync-docs, sync-prisma, quality-checks
- Update SECURITY.md with contact@opensource-together.com for vulnerability reports

Co-Authored-By: spidecode-bot <263227865+spicode-bot@users.noreply.github.com>

---------

Co-authored-by: spidecode-bot <263227865+spicode-bot@users.noreply.github.com>
…50% (#31)

* fix(ci): resolve dbt-check, quality, and docs-submodule CI failures

- Add env_var defaults for POSTGRES_USER/POSTGRES_PASSWORD in dbt profiles
- Skip test_dagster_definitions when dbt manifest is missing in CI
- Update docs submodule to latest ost-docs/main commit

Co-Authored-By: spidecode-bot <263227865+spicode-bot@users.noreply.github.com>

* chore(ci): unify sync tokens and add security contact email

- Replace OST_DOCS_TOKEN and OST_BACKEND_TOKEN with single OST_SYNC_TOKEN
- Update all workflows: publish-develop, publish-prod, sync-docs, sync-prisma, quality-checks
- Update SECURITY.md with contact@opensource-together.com for vulnerability reports

Co-Authored-By: spidecode-bot <263227865+spicode-bot@users.noreply.github.com>

* fix(ci): rename token to OST_LINKER_SYNC_TOKEN and lower coverage to 50%

Co-Authored-By: spidecode-bot <263227865+spicode-bot@users.noreply.github.com>

---------

Co-authored-by: spidecode-bot <263227865+spicode-bot@users.noreply.github.com>
Co-Authored-By: spidecode-bot <263227865+spicode-bot@users.noreply.github.com>
* chore(deps): add fastapi, uvicorn, and slowapi
* feat(api): add API config module with pydantic-settings
* feat(api): add connection pool with psycopg2 SimpleConnectionPool
* feat(api): add pydantic response schemas
* feat(api): add FastAPI app with health endpoint and rate limiting
* feat(api): add categories, domains, and techstacks endpoints
* feat(api): add project search, detail, and similarity endpoints
* feat(api): add trending recommendations endpoint
* fix(api): escape ILIKE wildcards and add consistent type::text cast
* chore(docker): add FastAPI service to compose stack

Co-Authored-By: spidecode-bot <263227865+spicode-bot@users.noreply.github.com>
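
For context on the ILIKE fix listed above: a user-supplied search term containing % or _ would otherwise be treated as a wildcard. The sketch below shows one common way to escape it, assuming PostgreSQL's default backslash escape character for LIKE/ILIKE. The helper names are hypothetical and this is not necessarily how the API implements it.

```python
def escape_like(term: str) -> str:
    """Escape LIKE/ILIKE metacharacters so the term matches literally."""
    # Escape the backslash first so the escapes added below are not doubled.
    return term.replace("\\", "\\\\").replace("%", "\\%").replace("_", "\\_")


def contains_pattern(term: str) -> str:
    """Build a 'contains' pattern to pass as a bound query parameter."""
    return f"%{escape_like(term)}%"
```

The resulting pattern should still be passed as a bound parameter, never interpolated into the SQL string.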

* test(api): add auto-marker for api test directory

Co-Authored-By: spidecode-bot <263227865+spicode-bot@users.noreply.github.com>

* docs(env): add API configuration variables to .env.example

Co-Authored-By: spidecode-bot <263227865+spicode-bot@users.noreply.github.com>

* style(api): fix lint and type issues

Co-Authored-By: spidecode-bot <263227865+spicode-bot@users.noreply.github.com>

* test(api): verify SQL params and response relations in project tests

Co-Authored-By: spidecode-bot <263227865+spicode-bot@users.noreply.github.com>

* fix(api): harden security for deployment

- Isolate API container env (no Dagster/GitHub/LLM secrets)
- Move healthcheck to production compose (not just override)
- Remove shared volumes from API service
- Add rate limiting (60 req/min/IP) on all routes via slowapi decorators
- Extract limiter to rate_limit.py to avoid circular imports

Co-Authored-By: spidecode-bot <263227865+spicode-bot@users.noreply.github.com>

* docs: update CLAUDE.md and architecture for REST API & MCP

- CLAUDE.md: add API run command, env vars (API_HOST, API_PORT, API_RATE_LIMIT)
- architecture.md: add REST API section, update Docker services count, add serving layer to data flow
- Bump docs submodule with rest-api.mdx page

Co-Authored-By: spidecode-bot <263227865+spicode-bot@users.noreply.github.com>

* test(api): add MCP contract tests for all endpoints

Verify API response shapes match ost-mcp TypeScript types exactly.
Catches breaking changes in field names, types, or structure before
they reach the MCP server.

Co-Authored-By: spidecode-bot <263227865+spicode-bot@users.noreply.github.com>

---------

Co-authored-by: spidecode-bot <263227865+spicode-bot@users.noreply.github.com>
When OST_LINKER_SERVICE_TOKEN is set, protected endpoints require
X-Service-Token matching (constant-time compare). When unset, the API
behaves as before — preserves backward compat for gradual rollout.
/health stays open for uptime monitors.

The token is read directly from os.environ in the dependency so tests
stay simple and the check never touches the pydantic config path.
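
A minimal sketch of the env-gated check described above, written as a plain function rather than the FastAPI dependency used in the PR. The function name and the `environ` parameter are hypothetical, and treating an empty value as "unset" is an assumption.

```python
import hmac
import os
from typing import Mapping, Optional

ENV_VAR = "OST_LINKER_SERVICE_TOKEN"


def is_request_allowed(
    header_token: Optional[str],
    environ: Mapping[str, str] = os.environ,
) -> bool:
    """Return True if a request carrying this X-Service-Token may proceed."""
    expected = environ.get(ENV_VAR)
    if not expected:
        # Token not configured: the API stays open (backward compatible).
        return True
    if header_token is None:
        return False
    # compare_digest runs in constant time, which avoids leaking how many
    # leading characters of the token matched.
    return hmac.compare_digest(header_token.encode(), expected.encode())
```

In a real route, a False result would map to a 401 response, and /health would skip the check entirely.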
* chore(deps): add fastapi, uvicorn, and slowapi

Co-Authored-By: spidecode-bot <263227865+spicode-bot@users.noreply.github.com>

* feat(api): add API config module with pydantic-settings

Co-Authored-By: spidecode-bot <263227865+spicode-bot@users.noreply.github.com>

* feat(api): add connection pool with psycopg2 SimpleConnectionPool

Co-Authored-By: spidecode-bot <263227865+spicode-bot@users.noreply.github.com>

* feat(api): add pydantic response schemas

Co-Authored-By: spidecode-bot <263227865+spicode-bot@users.noreply.github.com>

* feat(api): add FastAPI app with health endpoint and rate limiting

Co-Authored-By: spidecode-bot <263227865+spicode-bot@users.noreply.github.com>

* feat(api): add categories, domains, and techstacks endpoints

Co-Authored-By: spidecode-bot <263227865+spicode-bot@users.noreply.github.com>

* feat(api): add project search, detail, and similarity endpoints

Co-Authored-By: spidecode-bot <263227865+spicode-bot@users.noreply.github.com>

* feat(api): add trending recommendations endpoint

Co-Authored-By: spidecode-bot <263227865+spicode-bot@users.noreply.github.com>

* fix(api): escape ILIKE wildcards and add consistent type::text cast

Co-Authored-By: spidecode-bot <263227865+spicode-bot@users.noreply.github.com>

* chore(docker): add FastAPI service to compose stack

Co-Authored-By: spidecode-bot <263227865+spicode-bot@users.noreply.github.com>

* test(api): add auto-marker for api test directory

Co-Authored-By: spidecode-bot <263227865+spicode-bot@users.noreply.github.com>

* docs(env): add API configuration variables to .env.example

Co-Authored-By: spidecode-bot <263227865+spicode-bot@users.noreply.github.com>

* style(api): fix lint and type issues

Co-Authored-By: spidecode-bot <263227865+spicode-bot@users.noreply.github.com>

* test(api): verify SQL params and response relations in project tests

Co-Authored-By: spidecode-bot <263227865+spicode-bot@users.noreply.github.com>

* fix(api): harden security for deployment

- Isolate API container env (no Dagster/GitHub/LLM secrets)
- Move healthcheck to production compose (not just override)
- Remove shared volumes from API service
- Add rate limiting (60 req/min/IP) on all routes via slowapi decorators
- Extract limiter to rate_limit.py to avoid circular imports

Co-Authored-By: spidecode-bot <263227865+spicode-bot@users.noreply.github.com>

* docs: update CLAUDE.md and architecture for REST API & MCP

- CLAUDE.md: add API run command, env vars (API_HOST, API_PORT, API_RATE_LIMIT)
- architecture.md: add REST API section, update Docker services count, add serving layer to data flow
- Bump docs submodule with rest-api.mdx page

Co-Authored-By: spidecode-bot <263227865+spicode-bot@users.noreply.github.com>

* test(api): add MCP contract tests for all endpoints

Verify API response shapes match ost-mcp TypeScript types exactly.
Catches breaking changes in field names, types, or structure before
they reach the MCP server.

Co-Authored-By: spidecode-bot <263227865+spicode-bot@users.noreply.github.com>

* docs: add GitHub Trending scraper design spec

* docs: add GitHub Trending scraper implementation plan

* feat(trending): initialize Go module with dependencies

* feat(trending): implement HTML parsing for GitHub Trending page

* feat(trending): add GitHub API client with retry logic

* feat(trending): implement main orchestration — scrape, enrich, upsert

* feat(trending): add RawTrendingProject to Prisma schema

* feat(trending): add Dagster asset and config for trending scraper

* feat(api): add /recommendations/github-trending endpoint

* chore(docker): add trending scraper to build and compose

* chore: add trending scraper to build scripts, env, and CI

* fix(trending): align SQL columns with Prisma schema

Use snake_case column names (created_at) matching the Prisma model,
add gen_random_uuid() for id, remove non-existent updatedAt column.

* feat(trending): dual-upsert into raw_github_project for standard pipeline

Trending repos now also upsert into github.raw_github_project alongside
raw_trending_project, so they flow through the standard pipeline
(fetcher → dbt → classification → sync) and land in public.Project.

* feat(reco): exclude bookmarked projects from user recommendations

Adds stg_public__project_bookmark staging and a LEFT JOIN-based
exclusion in match_user_recommendation so users never receive a
recommendation for a project they have already bookmarked. A singular
test enforces the invariant in CI.

* refactor(llm): migrate classifier from OpenRouter to Mistral official SDK

Replace the OpenAI-client-over-OpenRouter wrapper with the official
mistralai Python SDK. Simpler stack, direct provider, typed messages
for mypy strict.

Default model is now `mistral-small-latest` (tracks current Small
release). Env var renamed from OPENROUTER_API_KEY to MISTRAL_API_KEY
and updated across code, compose, docs, and tests.

* feat(api): add /projects/search-natural for semantic NL search

New endpoint embeds free-text queries with the same SentenceTransformer
used by the pipeline (MiniLM-L6-v2, 384d) and ranks projects by pgvector
cosine similarity. Optional hard filters (language, domain, category,
techstack) narrow the candidate set before ranking.

Model is eagerly loaded in the FastAPI lifespan to avoid cold-request
latency. Unit tests cover query validation, ranking shape, and filter
propagation to SQL.
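
The filter-then-rank order described above can be illustrated in pure Python. In the PR the ranking is done by pgvector in SQL and the vectors come from a SentenceTransformer; the sketch below replaces both with plain lists, and the function names and the project dict shape are assumptions.

```python
import math
from typing import Optional, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_projects(
    query_vec: Sequence[float],
    projects: list[dict],
    language: Optional[str] = None,
    limit: int = 10,
) -> list[dict]:
    """Apply the hard filter first, then rank the survivors by similarity."""
    candidates = [p for p in projects if language is None or p["language"] == language]
    ranked = sorted(
        candidates,
        key=lambda p: cosine_similarity(query_vec, p["embedding"]),
        reverse=True,
    )
    return ranked[:limit]
```

Filtering before ranking means a hard filter can never be outvoted by a high similarity score.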

* feat(reco): exclude shown-but-ignored projects from user recommendations

New RecommendationEvent Prisma model (mirrors the one in ost-backend)
and dbt staging expose a feedback signal: projects shown ≥N times in
the lookback window without being clicked or starred are excluded from
the mart. Keeps the reco surface fresh instead of repeating projects
the user has already dismissed implicitly.

Lookback and threshold are configurable via dbt vars
(ignored_lookback_days=30, ignored_min_shown=3). A singular test
enforces the invariant in CI.
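
The exclusion rule above lives in dbt SQL, which is not shown here. As a sketch of the logic only, the Python below uses the default values from the commit message (30-day lookback, shown at least 3 times). The event type names SHOWN, CLICKED and STARRED and the event dict keys are hypothetical.

```python
from collections import Counter
from datetime import datetime, timedelta


def ignored_pairs(
    events: list[dict],
    now: datetime,
    lookback_days: int = 30,
    min_shown: int = 3,
) -> set[tuple[str, str]]:
    """Return (user_id, project_id) pairs shown >= min_shown times without engagement."""
    cutoff = now - timedelta(days=lookback_days)
    shown: Counter = Counter()
    engaged: set[tuple[str, str]] = set()
    for event in events:
        if event["occurred_at"] < cutoff:
            continue  # outside the lookback window
        key = (event["user_id"], event["project_id"])
        if event["event_type"] == "SHOWN":
            shown[key] += 1
        elif event["event_type"] in {"CLICKED", "STARRED"}:
            engaged.add(key)
    return {key for key, count in shown.items() if count >= min_shown and key not in engaged}
```

Pairs in the returned set would be left out of the recommendation mart; a single click or star inside the window keeps a project eligible.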

* fix(reco-events): promote eventType to enum, add source/rank columns, add user FK

Addresses 3 schema gaps flagged in review:

1. `eventType` is now a Postgres enum (RecommendationEventType) —
   database rejects typos and unknown event types at INSERT time
   instead of relying on application-level validation alone.
2. `source` promoted from `context.source` to a dedicated enum column
   (RecommendationSource: PERSONALIZED | TRENDING | SIMILAR |
   SEMANTIC_SEARCH) + indexed `(source, occurredAt)` for analytics.
   `rank` promoted to an Int column too. `context` stays jsonb for
   unstructured metadata (A/B variant, session_id, etc.).
3. Added FK on `userId` with ON DELETE CASCADE — prevents orphan events
   and ensures RGPD-compliant deletion when users leave.

dbt staging, mart exclusion, and singular test updated to match
uppercase enum values. 46 dbt tests pass.

* feat(classification): parallel classify + cost tracking + DLQ

Hardens the LLM classification harness for production scale:

1. Parallelization via ThreadPoolExecutor (5 workers default) — ~7×
   speedup measured on 23 projects (4.9s vs ~35s sequential).
2. Cost tracking: ClassificationResult now carries token usage + model;
   asset aggregates into Output metadata (prompt_tokens,
   completion_tokens, estimated_cost_usd, model_version) so a bad prompt
   change is visible at the next run.
3. DLQ (`match.project_classification_failure`): persistent failures
   stop consuming LLM budget on each run. Exponential backoff (2h, 4h,
   8h, ..., capped at 7d), max 5 attempts. RateLimitError is a distinct
   exception so 429s don't look like ordinary errors in the logs.

Also: modelVersion column added to match.project_classification — every
classification now carries the model that produced it, so future model
migrations are auditable.
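
The DLQ backoff schedule described above (2h, 4h, 8h, ..., capped at 7d, max 5 attempts) can be sketched as below. The function names are hypothetical, and reading "max 5 attempts" as "stop retrying once 5 attempts have failed" is an assumption.

```python
from datetime import timedelta

BASE_DELAY_HOURS = 2
MAX_DELAY_HOURS = 7 * 24  # 7-day cap
MAX_ATTEMPTS = 5


def retry_delay(attempt: int) -> timedelta:
    """Delay before the next try, after `attempt` failed attempts (1-based)."""
    if attempt < 1:
        raise ValueError("attempt must be >= 1")
    # Doubling is done on plain ints, so very large attempt counts cannot overflow.
    hours = BASE_DELAY_HOURS * 2 ** (attempt - 1)
    return timedelta(hours=min(hours, MAX_DELAY_HOURS))


def is_exhausted(attempt: int) -> bool:
    """True once the project has used up its retry budget."""
    return attempt >= MAX_ATTEMPTS
```

With a budget of 5 attempts the longest delay actually reached is 16h after the fourth failure, so under this reading the 7-day cap only matters if the attempt limit is raised.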

* feat(classification): retry on rate-limit before sending to DLQ

When a worker hits Mistral's 429, the previous behavior was to DLQ the
project immediately. A 1-minute 429 window hitting 5 concurrent workers
would poison the DLQ with 5+ projects whose only fault was being in the
wrong second.

_classify_one now sleeps _RATE_LIMIT_COOLDOWN_SECONDS + jitter on the
first RateLimitError and retries once. Only the second failure (or any
non-429 error) is routed to the DLQ. Successful retries are logged as
warnings via the `rate_limit_hits` counter so operators can see 429
pressure at a glance.
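
A minimal sketch of the retry-once behaviour described above. The RateLimitError class here is a local stand-in for the one in the PR, and the function name, default cooldown, and jitter range are hypothetical; the actual value of _RATE_LIMIT_COOLDOWN_SECONDS is not given on this page.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class RateLimitError(Exception):
    """Stand-in for the 429-specific exception mentioned in the PR."""


def call_with_one_retry(
    fn: Callable[[], T],
    cooldown_seconds: float = 60.0,
    max_jitter_seconds: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn; on the first RateLimitError, wait and retry exactly once."""
    try:
        return fn()
    except RateLimitError:
        # Jitter spreads out workers that all hit the same 429 window.
        sleep(cooldown_seconds + random.uniform(0, max_jitter_seconds))
        # A second failure propagates, and the caller routes it to the DLQ.
        return fn()
```

Any non-429 exception is not caught at all, so it reaches the caller immediately, matching the "only the second failure (or any non-429 error)" rule.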

* feat(classification): harden harness, add prompt registry, and streaming IO manager

* docs(dagster): up cfg_resource definitions

* refactor(config): dagster use pydantic settings

* refactor(dagster): add helper functions for composition, add docstring for docs

* refactor(dagster): GO_TRENDING_PATH mandatory, commentary about CPU usage for embeddings

* docs(dagster): cpu usage info

---------

Co-authored-by: spidecode-bot <263227865+spicode-bot@users.noreply.github.com>
* docs(api): docstring & comms

* fix(docker): OST stack labels, dbt target volume, entrypoint perms

- Name compose project ost-linker; add org.open-source-together.* labels for Docker Desktop filter/search (stack=ost, app=linker, per-role).
- Share dbt_target volume between webserver and daemon; init runs dbt parse before build and daemon ensures manifest when missing.
- Run container entrypoint as root, chown named volumes, then gosu appuser (fixes Permission denied on /app/dbt/target).
- Configurable DAGSTER_HOST_PORT when host :3000 is busy; document in .env.example.
- Label dev db service in compose.override for the same OST metadata.

Made-with: Cursor

* chore: remove tracked Claude agent and rules from repo

Made-with: Cursor

* chore(git): simplify gitignore for prisma tooling and node

Made-with: Cursor

* chore(prisma): add committed package.json and lockfile for Node tooling

Made-with: Cursor

* chore(submodule): rename docs submodule path to ost-docs

Made-with: Cursor

* ci: guard fork PRs and use ost-docs path in submodule checks

Made-with: Cursor

* feat(api): optional strict service token validation at startup

Made-with: Cursor

* refactor(linker): use logging for ML model load messages

Made-with: Cursor

* chore(deps): drop redundant PyPI dotenv dependency

Made-with: Cursor

* docs: add agents guide and refresh README

Made-with: Cursor

* docs: align contributing guide with npm ci and make db-init

Made-with: Cursor

* chore: source env in make targets and pass API token in compose

Made-with: Cursor

* feat(dbt): extend source freshness and add recommendation mart tests

Made-with: Cursor

* feat(prisma): add recommendation_event migration

Made-with: Cursor

* fix(dagster): align cleanup job with storage dirs and add unit tests

Made-with: Cursor

* refactor(api): tighten session lifecycle and sync route tests

Made-with: Cursor

* chore: ignore specs/ and stop tracking planning docs

Made-with: Cursor

* chore(env): shorten .env.example

Made-with: Cursor

* docs(agents): remove Claude CI workflow rows

Made-with: Cursor

* chore(github): align PR checklist with CI (typecheck, format, dbt)

Made-with: Cursor

* ci: run API pytest suite in quality checks

Made-with: Cursor

* chore: add pre-commit hooks mirroring ruff and mypy

Made-with: Cursor

* feat(make): add doctor and ci-check targets

Made-with: Cursor

* chore(prisma): declare prisma db seed script

Made-with: Cursor

* docs: document ci-check, seed scope, and optional pre-commit

Made-with: Cursor

* style: ruff format and fix SQLAlchemy result typing for mypy

Made-with: Cursor

* feat(api): apply API_RATE_LIMIT to SlowAPI and toggle OpenAPI with API_ENABLE_OPENAPI

- Build RATE_LIMIT from API_RATE_LIMIT (same semantics as pydantic defaults)
- When API_ENABLE_OPENAPI is false, disable /openapi.json, /docs, /redoc
- Add unit tests for rate_limit_per_minute()

Made-with: Cursor
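
One way the API_RATE_LIMIT wiring above might look, as a sketch: rate_limit_per_minute() is named in the commit, but its signature and behaviour here are assumptions, as are the rate_limit_string() helper and the default of 60 (taken from the earlier "60 req/min/IP" commit). The "N/minute" form is the limit-string notation SlowAPI accepts.

```python
import os
from typing import Mapping

DEFAULT_RATE_LIMIT = 60  # requests per minute per IP (assumed default)


def rate_limit_per_minute(environ: Mapping[str, str] = os.environ) -> int:
    """Read API_RATE_LIMIT, falling back to the default when unset or blank."""
    raw = environ.get("API_RATE_LIMIT", "").strip()
    if not raw:
        return DEFAULT_RATE_LIMIT
    value = int(raw)  # raises ValueError on non-numeric input
    if value <= 0:
        raise ValueError("API_RATE_LIMIT must be a positive integer")
    return value


def rate_limit_string(environ: Mapping[str, str] = os.environ) -> str:
    """Format the limit in the 'N/minute' notation used for limit decorators."""
    return f"{rate_limit_per_minute(environ)}/minute"
```

Failing fast on a malformed value at startup is a design choice in this sketch; silently falling back to the default would be the alternative.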

* chore(docker,docs): document API auth/OpenAPI env and add readiness audit

- Pass OST_LINKER_REQUIRE_SERVICE_TOKEN and API_ENABLE_OPENAPI via compose
- Extend .env.example and AGENTS.md for new variables
- README: clarify CC BY-NC vs OSI open source
- Add docs/READINESS-AUDIT.md (audit follow-up note)

Made-with: Cursor
@spideystreet
Collaborator Author

Closing stale promotion PR; replaced by a new develop → staging PR.
