chore: promote develop to staging #36
Closed
spideystreet wants to merge 338 commits into
…odels
- Rename 'analytics' schema to 'github'
- Implement upsert logic in Python assets
- Consolidate dbt models into 'pvt_github_project'
- Add 'clean_text' macro for context preparation
- Filter rejected projects via INNER JOIN
- Rename prod_github_project to prd_github_project
- Update .env.example with ML and Scraper variables
…_embedding_job project classification pipeline
…ps://github.com/opensource-together/ost-linker into ost-408-feat-embeddings-for-cosine-similarities
- Rename project_classification_job.py -> project_enrichment_job.py
- Rename run_all_schedule.py -> project_enrichment_schedule.py
- Delete classification_sensor.py (no longer registered in definitions)
- Fix architecture.md data flow to use current group names
- Update all imports accordingly
Co-Authored-By: spidecode-bot <263227865+spicode-bot@users.noreply.github.com>
- Add contract enforcement (data_type + constraints) on all 4 marts
- Add relationship tests on match models (FK to Project and User)
- Add not_null/unique tests on key columns
- Create clamp() macro for score bounding
- Create safe_divide() macro for zero-safe division
Co-Authored-By: spidecode-bot <263227865+spicode-bot@users.noreply.github.com>
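The two macros above are dbt/SQL and their bodies are not shown on this page. As a rough model of the intended semantics only, the Python sketch below assumes that `clamp` bounds a value to [0, 1] by default (the range the `valid_hybrid_score_bounds` test checks) and that `safe_divide` yields NULL (here `None`) on a zero denominator, as the `nullif` pattern it replaces would. The signatures and defaults are illustrative assumptions, not the macros' actual arguments.

```python
from typing import Optional


def clamp(value: float, low: float = 0.0, high: float = 1.0) -> float:
    """Bound value to [low, high], like least(greatest(value, low), high) in SQL."""
    return min(max(value, low), high)


def safe_divide(numerator: float, denominator: float) -> Optional[float]:
    """Return numerator / denominator, or None (SQL NULL) when denominator is 0."""
    if denominator == 0:
        return None
    return numerator / denominator
```

Returning `None` rather than 0 keeps "undefined" distinguishable from a genuine zero score, which is what the SQL `nullif` idiom does as well.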
…iate schema
- Replace manual greatest/least with clamp() macro in match_user_recommendation
- Replace manual ::float/nullif patterns with safe_divide() macro
- Add missing column descriptions to int_user_enriched.yml
Co-Authored-By: spidecode-bot <263227865+spicode-bot@users.noreply.github.com>
Document all macros in _macros.yml with descriptions and typed arguments: build_project_context, build_user_context, clamp, clean_text, deduplicate, generate_schema_name, jsonb_to_list, safe_divide Co-Authored-By: spidecode-bot <263227865+spicode-bot@users.noreply.github.com>
Replace monolithic _macros.yml with individual yml files matching each .sql: build_project_context, build_user_context, clamp, clean_text, deduplicate, generate_schema_name, jsonb_to_list, safe_divide Co-Authored-By: spidecode-bot <263227865+spicode-bot@users.noreply.github.com>
Add yml documentation for each custom SQL test:
- unique_user_project_recommendation: no duplicate (user_id, project_id) pairs
- valid_hybrid_score_bounds: all scores within [0, 1] range
Co-Authored-By: spidecode-bot <263227865+spicode-bot@users.noreply.github.com>
…s, and fixed issues
- Add .sql = .yml file convention as review checklist item #1
- Update Dagster group mappings (project_ml/user_ml replace ml_preparation/matching)
- Add data contracts and dbt 1.10 arguments syntax to checklist
- Move resolved issues to "Fixed" section (clamp, relationships, O(n³), passwords)
- Update score bounds to reference {{ clamp() }} macro
Co-Authored-By: spidecode-bot <263227865+spicode-bot@users.noreply.github.com>
Co-Authored-By: spidecode-bot <263227865+spicode-bot@users.noreply.github.com>
Co-Authored-By: spidecode-bot <263227865+spicode-bot@users.noreply.github.com>
- fix(go): bound io.ReadAll with 10MB LimitReader in fetcher/common.go
- fix(dbt): wrap popularity_score in {{ clamp() }} macro
- fix(dbt): add missing updatedAt column to stg_public__project.yml
- fix(ci): add setup-buildx-action to publish-develop.yml
- style: fix line-too-long in run_all_job.py description
Co-Authored-By: spidecode-bot <263227865+spicode-bot@users.noreply.github.com>
Ingestion is now part of the enrichment flow instead of a separate manual-only job. This ensures the full project pipeline runs atomically: scrape → classify → sync → embed → recommend.
- Add "ingestion" group to project_enrichment_job selection
- Delete project_scraper_job.py (no longer needed)
- Remove from definitions.py and test expectations
- Update docs submodule
Co-Authored-By: spidecode-bot <263227865+spicode-bot@users.noreply.github.com>
Query match.project_classification to get existing projectIds and filter them out before calling the LLM. This avoids redundant API calls on subsequent runs — only new/unclassified projects are sent to OpenRouter. Co-Authored-By: spidecode-bot <263227865+spicode-bot@users.noreply.github.com>
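A minimal sketch of the filtering step described above, assuming the already-classified ids have been loaded into a set (for example from a query on match.project_classification). The function name and the dict shape are hypothetical and only illustrate the idea.

```python
from typing import Dict, Iterable, List, Set


def select_unclassified(
    projects: Iterable[Dict[str, str]],
    classified_ids: Set[str],
) -> List[Dict[str, str]]:
    """Keep only projects whose projectId has no classification yet."""
    return [p for p in projects if p["projectId"] not in classified_ids]
```

Only the returned projects would then be sent to the LLM, so a second run over the same input makes no API calls.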
…iance The clamp macro returns numeric (DECIMAL) due to literal 1.0, but the data contract expects double precision (FLOAT). Also increase Dagster boot timeout from 30s to 60s for the integration test. Co-Authored-By: spidecode-bot <263227865+spicode-bot@users.noreply.github.com>
…ging
- Extract language_detection and serialization helpers into src/linker/utils/
- Harden IO manager and LLM classifier resource error handling
- Fix int_project_enriched dbt model
- Improve Go scraper structured logging and error handling
Co-Authored-By: spidecode-bot <263227865+spicode-bot@users.noreply.github.com>
- Unit tests: IO manager, LLM classifier, language detection, serialization, Docker infra
- Integration test: Dagster startup smoke test
- Go tests: scraper URL building, fetcher common utilities
- Update CI workflow to run Go tests and pytest markers
Co-Authored-By: spidecode-bot <263227865+spicode-bot@users.noreply.github.com>
- Add dbt file convention rule, update Docker compose services docs
- Add Go test and integration test commands to CLAUDE.md
- Add .mcp.json to gitignore
- Initialize agent memory files
Co-Authored-By: spidecode-bot <263227865+spicode-bot@users.noreply.github.com>
feat: test strategy, pipeline hardening, and dbt contracts
Co-authored-by: spidecode-bot <263227865+spicode-bot@users.noreply.github.com>
* fix(ci): resolve dbt-check, quality, and docs-submodule CI failures - Add env_var defaults for POSTGRES_USER/POSTGRES_PASSWORD in dbt profiles - Skip test_dagster_definitions when dbt manifest is missing in CI - Update docs submodule to latest ost-docs/main commit Co-Authored-By: spidecode-bot <263227865+spicode-bot@users.noreply.github.com> * chore(ci): unify sync tokens and add security contact email - Replace OST_DOCS_TOKEN and OST_BACKEND_TOKEN with single OST_SYNC_TOKEN - Update all workflows: publish-develop, publish-prod, sync-docs, sync-prisma, quality-checks - Update SECURITY.md with contact@opensource-together.com for vulnerability reports Co-Authored-By: spidecode-bot <263227865+spicode-bot@users.noreply.github.com> --------- Co-authored-by: spidecode-bot <263227865+spicode-bot@users.noreply.github.com>
…50% (#31) * fix(ci): resolve dbt-check, quality, and docs-submodule CI failures - Add env_var defaults for POSTGRES_USER/POSTGRES_PASSWORD in dbt profiles - Skip test_dagster_definitions when dbt manifest is missing in CI - Update docs submodule to latest ost-docs/main commit Co-Authored-By: spidecode-bot <263227865+spicode-bot@users.noreply.github.com> * chore(ci): unify sync tokens and add security contact email - Replace OST_DOCS_TOKEN and OST_BACKEND_TOKEN with single OST_SYNC_TOKEN - Update all workflows: publish-develop, publish-prod, sync-docs, sync-prisma, quality-checks - Update SECURITY.md with contact@opensource-together.com for vulnerability reports Co-Authored-By: spidecode-bot <263227865+spicode-bot@users.noreply.github.com> * fix(ci): rename token to OST_LINKER_SYNC_TOKEN and lower coverage to 50% Co-Authored-By: spidecode-bot <263227865+spicode-bot@users.noreply.github.com> --------- Co-authored-by: spidecode-bot <263227865+spicode-bot@users.noreply.github.com>
Co-Authored-By: spidecode-bot <263227865+spicode-bot@users.noreply.github.com>
* chore(deps): add fastapi, uvicorn, and slowapi * feat(api): add API config module with pydantic-settings * feat(api): add connection pool with psycopg2 SimpleConnectionPool * feat(api): add pydantic response schemas * feat(api): add FastAPI app with health endpoint and rate limiting * feat(api): add categories, domains, and techstacks endpoints * feat(api): add project search, detail, and similarity endpoints * feat(api): add trending recommendations endpoint * fix(api): escape ILIKE wildcards and add consistent type::text cast * chore(docker): add FastAPI service to compose stack Co-Authored-By: spidecode-bot <263227865+spicode-bot@users.noreply.github.com> * test(api): add auto-marker for api test directory Co-Authored-By: spidecode-bot <263227865+spicode-bot@users.noreply.github.com> * docs(env): add API configuration variables to .env.example Co-Authored-By: spidecode-bot <263227865+spicode-bot@users.noreply.github.com> * style(api): fix lint and type issues Co-Authored-By: spidecode-bot <263227865+spicode-bot@users.noreply.github.com> * test(api): verify SQL params and response relations in project tests Co-Authored-By: spidecode-bot <263227865+spicode-bot@users.noreply.github.com> * fix(api): harden security for deployment - Isolate API container env (no Dagster/GitHub/LLM secrets) - Move healthcheck to production compose (not just override) - Remove shared volumes from API service - Add rate limiting (60 req/min/IP) on all routes via slowapi decorators - Extract limiter to rate_limit.py to avoid circular imports Co-Authored-By: spidecode-bot <263227865+spicode-bot@users.noreply.github.com> * docs: update CLAUDE.md and architecture for REST API & MCP - CLAUDE.md: add API run command, env vars (API_HOST, API_PORT, API_RATE_LIMIT) - architecture.md: add REST API section, update Docker services count, add serving layer to data flow - Bump docs submodule with rest-api.mdx page Co-Authored-By: spidecode-bot <263227865+spicode-bot@users.noreply.github.com> * 
test(api): add MCP contract tests for all endpoints Verify API response shapes match ost-mcp TypeScript types exactly. Catches breaking changes in field names, types, or structure before they reach the MCP server. Co-Authored-By: spidecode-bot <263227865+spicode-bot@users.noreply.github.com> --------- Co-authored-by: spidecode-bot <263227865+spicode-bot@users.noreply.github.com>
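The "escape ILIKE wildcards" fix in the commit list above is not shown in detail here. One common way to do it, sketched under the assumption that the default backslash escape character is used, is to escape `\`, `%` and `_` in the user's search term before wrapping it in wildcards. The helper names are hypothetical.

```python
def escape_like(term: str, escape: str = "\\") -> str:
    """Escape LIKE/ILIKE metacharacters so user input matches literally."""
    return (
        term.replace(escape, escape + escape)  # escape the escape char first
        .replace("%", escape + "%")
        .replace("_", escape + "_")
    )


def contains_pattern(term: str) -> str:
    """Build a 'contains' pattern to pass as a bound query parameter."""
    return f"%{escape_like(term)}%"
```

The resulting pattern is meant to be passed as a bound parameter (for example `WHERE name ILIKE %s` with psycopg2), never interpolated into the SQL string.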
When OST_LINKER_SERVICE_TOKEN is set, protected endpoints require X-Service-Token matching (constant-time compare). When unset, the API behaves as before — preserves backward compat for gradual rollout. /health stays open for uptime monitors. The token is read directly from os.environ in the dependency so tests stay simple and the check never touches the pydantic config path.
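A minimal sketch of the check described above, written without FastAPI so it stays self-contained. It assumes the header value has already been extracted from `X-Service-Token`; the function name is hypothetical, and the real dependency may differ in how it raises the 401.

```python
import hmac
import os
from typing import Optional


def is_authorized(header_value: Optional[str]) -> bool:
    """Return True when the request may proceed.

    Reads the token from os.environ on every call, so tests can toggle it.
    """
    expected = os.environ.get("OST_LINKER_SERVICE_TOKEN", "")
    if not expected:
        return True  # token unset: API stays open (backward compatible)
    if header_value is None:
        return False
    # Constant-time comparison avoids leaking the token through timing.
    return hmac.compare_digest(header_value.encode(), expected.encode())
```

In a FastAPI dependency, a `False` result would typically translate into an HTTP 401, while /health would simply not declare the dependency.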
* chore(deps): add fastapi, uvicorn, and slowapi Co-Authored-By: spidecode-bot <263227865+spicode-bot@users.noreply.github.com> * feat(api): add API config module with pydantic-settings Co-Authored-By: spidecode-bot <263227865+spicode-bot@users.noreply.github.com> * feat(api): add connection pool with psycopg2 SimpleConnectionPool Co-Authored-By: spidecode-bot <263227865+spicode-bot@users.noreply.github.com> * feat(api): add pydantic response schemas Co-Authored-By: spidecode-bot <263227865+spicode-bot@users.noreply.github.com> * feat(api): add FastAPI app with health endpoint and rate limiting Co-Authored-By: spidecode-bot <263227865+spicode-bot@users.noreply.github.com> * feat(api): add categories, domains, and techstacks endpoints Co-Authored-By: spidecode-bot <263227865+spicode-bot@users.noreply.github.com> * feat(api): add project search, detail, and similarity endpoints Co-Authored-By: spidecode-bot <263227865+spicode-bot@users.noreply.github.com> * feat(api): add trending recommendations endpoint Co-Authored-By: spidecode-bot <263227865+spicode-bot@users.noreply.github.com> * fix(api): escape ILIKE wildcards and add consistent type::text cast Co-Authored-By: spidecode-bot <263227865+spicode-bot@users.noreply.github.com> * chore(docker): add FastAPI service to compose stack Co-Authored-By: spidecode-bot <263227865+spicode-bot@users.noreply.github.com> * test(api): add auto-marker for api test directory Co-Authored-By: spidecode-bot <263227865+spicode-bot@users.noreply.github.com> * docs(env): add API configuration variables to .env.example Co-Authored-By: spidecode-bot <263227865+spicode-bot@users.noreply.github.com> * style(api): fix lint and type issues Co-Authored-By: spidecode-bot <263227865+spicode-bot@users.noreply.github.com> * test(api): verify SQL params and response relations in project tests Co-Authored-By: spidecode-bot <263227865+spicode-bot@users.noreply.github.com> * fix(api): harden security for deployment - Isolate API container env (no 
Dagster/GitHub/LLM secrets) - Move healthcheck to production compose (not just override) - Remove shared volumes from API service - Add rate limiting (60 req/min/IP) on all routes via slowapi decorators - Extract limiter to rate_limit.py to avoid circular imports Co-Authored-By: spidecode-bot <263227865+spicode-bot@users.noreply.github.com> * docs: update CLAUDE.md and architecture for REST API & MCP - CLAUDE.md: add API run command, env vars (API_HOST, API_PORT, API_RATE_LIMIT) - architecture.md: add REST API section, update Docker services count, add serving layer to data flow - Bump docs submodule with rest-api.mdx page Co-Authored-By: spidecode-bot <263227865+spicode-bot@users.noreply.github.com> * test(api): add MCP contract tests for all endpoints Verify API response shapes match ost-mcp TypeScript types exactly. Catches breaking changes in field names, types, or structure before they reach the MCP server. Co-Authored-By: spidecode-bot <263227865+spicode-bot@users.noreply.github.com> * docs: add GitHub Trending scraper design spec * docs: add GitHub Trending scraper implementation plan * feat(trending): initialize Go module with dependencies * feat(trending): implement HTML parsing for GitHub Trending page * feat(trending): add GitHub API client with retry logic * feat(trending): implement main orchestration — scrape, enrich, upsert * feat(trending): add RawTrendingProject to Prisma schema * feat(trending): add Dagster asset and config for trending scraper * feat(api): add /recommendations/github-trending endpoint * chore(docker): add trending scraper to build and compose * chore: add trending scraper to build scripts, env, and CI * fix(trending): align SQL columns with Prisma schema Use snake_case column names (created_at) matching the Prisma model, add gen_random_uuid() for id, remove non-existent updatedAt column. 
* feat(trending): dual-upsert into raw_github_project for standard pipeline
  Trending repos now also upsert into github.raw_github_project alongside raw_trending_project, so they flow through the standard pipeline (fetcher → dbt → classification → sync) and land in public.Project.
* feat(reco): exclude bookmarked projects from user recommendations
  Adds stg_public__project_bookmark staging and a LEFT JOIN-based exclusion in match_user_recommendation so users never receive a recommendation for a project they have already bookmarked. A singular test enforces the invariant in CI.
* refactor(llm): migrate classifier from OpenRouter to Mistral official SDK
  Replace the OpenAI-client-over-OpenRouter wrapper with the official mistralai Python SDK. Simpler stack, direct provider, typed messages for mypy strict. Default model is now `mistral-small-latest` (tracks current Small release). Env var renamed from OPENROUTER_API_KEY to MISTRAL_API_KEY and updated across code, compose, docs, and tests.
* feat(api): add /projects/search-natural for semantic NL search
  New endpoint embeds free-text queries with the same SentenceTransformer used by the pipeline (MiniLM-L6-v2, 384d) and ranks projects by pgvector cosine similarity. Optional hard filters (language, domain, category, techstack) narrow the candidate set before ranking. Model is eagerly loaded in the FastAPI lifespan to avoid cold-request latency. Unit tests cover query validation, ranking shape, and filter propagation to SQL.
* feat(reco): exclude shown-but-ignored projects from user recommendations
  New RecommendationEvent Prisma model (mirrors the one in ost-backend) and dbt staging expose a feedback signal: projects shown ≥N times in the lookback window without being clicked or starred are excluded from the mart. Keeps the reco surface fresh instead of repeating projects the user has already dismissed implicitly. Lookback and threshold are configurable via dbt vars (ignored_lookback_days=30, ignored_min_shown=3).
  A singular test enforces the invariant in CI.
* fix(reco-events): promote eventType to enum, add source/rank columns, add user FK
  Addresses 3 schema gaps flagged in review:
  1. `eventType` is now a Postgres enum (RecommendationEventType) — database rejects typos and unknown event types at INSERT time instead of relying on application-level validation alone.
  2. `source` promoted from `context.source` to a dedicated enum column (RecommendationSource: PERSONALIZED | TRENDING | SIMILAR | SEMANTIC_SEARCH) + indexed `(source, occurredAt)` for analytics. `rank` promoted to an Int column too. `context` stays jsonb for unstructured metadata (A/B variant, session_id, etc.).
  3. Added FK on `userId` with ON DELETE CASCADE — prevents orphan events and ensures RGPD-compliant deletion when users leave.
  dbt staging, mart exclusion, and singular test updated to match uppercase enum values. 46 dbt tests pass.
* feat(classification): parallel classify + cost tracking + DLQ
  Hardens the LLM classification harness for production scale:
  1. Parallelization via ThreadPoolExecutor (5 workers default) — ~7× speedup measured on 23 projects (4.9s vs ~35s sequential).
  2. Cost tracking: ClassificationResult now carries token usage + model; asset aggregates into Output metadata (prompt_tokens, completion_tokens, estimated_cost_usd, model_version) so a bad prompt change is visible at the next run.
  3. DLQ (`match.project_classification_failure`): persistent failures stop consuming LLM budget on each run. Exponential backoff (2h, 4h, 8h, ..., capped at 7d), max 5 attempts. RateLimitError is a distinct exception so 429s don't look like ordinary errors in the logs.
  Also: modelVersion column added to match.project_classification — every classification now carries the model that produced it, so future model migrations are auditable.
* feat(classification): retry on rate-limit before sending to DLQ
  When a worker hits Mistral's 429, the previous behavior was to DLQ the project immediately.
  A 1-minute 429 window hitting 5 concurrent workers would poison the DLQ with 5+ projects whose only fault was being in the wrong second. _classify_one now sleeps _RATE_LIMIT_COOLDOWN_SECONDS + jitter on the first RateLimitError and retries once. Only the second failure (or any non-429 error) is routed to the DLQ. Successful retries are logged as warnings via the `rate_limit_hits` counter so operators can see 429 pressure at a glance.
* feat(classification): harden harness, add prompt registry, and streaming IO manager
* docs(dagster): up cfg_resource definitions
* refactor(config): dagster use pydantic settings
* refactor(dagster): add helper functions for composition, add docstring for docs
* refactor(dagster): GO_TRENDING_PATH mandatory, commentary about CPU usage for embeddings
* docs(dagster): cpu usage info
---------
Co-authored-by: spidecode-bot <263227865+spicode-bot@users.noreply.github.com>
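The DLQ backoff schedule and the retry-once-on-429 behaviour from the classification commits above can be sketched as follows. This is an illustration under stated assumptions, not the project's code: the doubling formula is inferred from the "2h, 4h, 8h, ..., capped at 7d" description, the jitter range and the one-second default cooldown are arbitrary, and `dlq_backoff`, `is_exhausted` and `call_with_one_rate_limit_retry` are hypothetical names (the real function is `_classify_one`, whose body is not shown).

```python
import random
import time
from datetime import timedelta
from typing import Callable, TypeVar

T = TypeVar("T")

BASE_DELAY = timedelta(hours=2)  # first DLQ delay, per the commit message
MAX_DELAY = timedelta(days=7)    # cap, per the commit message
MAX_ATTEMPTS = 5                 # per the commit message


class RateLimitError(Exception):
    """Stand-in for the distinct 429 exception mentioned in the commit."""


def dlq_backoff(attempt: int) -> timedelta:
    """Delay before DLQ retry number `attempt` (1-based): 2h, 4h, 8h, ... capped."""
    if attempt < 1:
        raise ValueError("attempt must be >= 1")
    return min(BASE_DELAY * (2 ** (attempt - 1)), MAX_DELAY)


def is_exhausted(attempts: int) -> bool:
    """True once a project has used up its DLQ attempts."""
    return attempts >= MAX_ATTEMPTS


def call_with_one_rate_limit_retry(
    call: Callable[[], T],
    cooldown_seconds: float = 1.0,  # assumed value, for illustration only
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `call`; on the first RateLimitError, cool down and retry once.

    A second RateLimitError, or any other exception, propagates to the
    caller, which would then route the project to the DLQ.
    """
    try:
        return call()
    except RateLimitError:
        sleep(cooldown_seconds + random.uniform(0.0, 1.0))  # cooldown + jitter
        return call()
```

Injecting `sleep` keeps the retry path testable without waiting, and the jitter spreads out concurrent workers that hit the same 429 window.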
* docs(api): docstring & comms * fix(docker): OST stack labels, dbt target volume, entrypoint perms - Name compose project ost-linker; add org.open-source-together.* labels for Docker Desktop filter/search (stack=ost, app=linker, per-role). - Share dbt_target volume between webserver and daemon; init runs dbt parse before build and daemon ensures manifest when missing. - Run container entrypoint as root, chown named volumes, then gosu appuser (fixes Permission denied on /app/dbt/target). - Configurable DAGSTER_HOST_PORT when host :3000 is busy; document in .env.example. - Label dev db service in compose.override for the same OST metadata. Made-with: Cursor * chore: remove tracked Claude agent and rules from repo Made-with: Cursor * chore(git): simplify gitignore for prisma tooling and node Made-with: Cursor * chore(prisma): add committed package.json and lockfile for Node tooling Made-with: Cursor * chore(submodule): rename docs submodule path to ost-docs Made-with: Cursor * ci: guard fork PRs and use ost-docs path in submodule checks Made-with: Cursor * feat(api): optional strict service token validation at startup Made-with: Cursor * refactor(linker): use logging for ML model load messages Made-with: Cursor * chore(deps): drop redundant PyPI dotenv dependency Made-with: Cursor * docs: add agents guide and refresh README Made-with: Cursor * docs: align contributing guide with npm ci and make db-init Made-with: Cursor * chore: source env in make targets and pass API token in compose Made-with: Cursor * feat(dbt): extend source freshness and add recommendation mart tests Made-with: Cursor * feat(prisma): add recommendation_event migration Made-with: Cursor * fix(dagster): align cleanup job with storage dirs and add unit tests Made-with: Cursor * refactor(api): tighten session lifecycle and sync route tests Made-with: Cursor * chore: ignore specs/ and stop tracking planning docs Made-with: Cursor * chore(env): shorten .env.example Made-with: Cursor * docs(agents): 
remove Claude CI workflow rows Made-with: Cursor * chore(github): align PR checklist with CI (typecheck, format, dbt) Made-with: Cursor * ci: run API pytest suite in quality checks Made-with: Cursor * chore: add pre-commit hooks mirroring ruff and mypy Made-with: Cursor * feat(make): add doctor and ci-check targets Made-with: Cursor * chore(prisma): declare prisma db seed script Made-with: Cursor * docs: document ci-check, seed scope, and optional pre-commit Made-with: Cursor * style: ruff format and fix SQLAlchemy result typing for mypy Made-with: Cursor * feat(api): apply API_RATE_LIMIT to SlowAPI and toggle OpenAPI with API_ENABLE_OPENAPI - Build RATE_LIMIT from API_RATE_LIMIT (same semantics as pydantic defaults) - When API_ENABLE_OPENAPI is false, disable /openapi.json, /docs, /redoc - Add unit tests for rate_limit_per_minute() Made-with: Cursor * chore(docker,docs): document API auth/OpenAPI env and add readiness audit - Pass OST_LINKER_REQUIRE_SERVICE_TOKEN and API_ENABLE_OPENAPI via compose - Extend .env.example and AGENTS.md for new variables - README: clarify CC BY-NC vs OSI open source - Add docs/READINESS-AUDIT.md (audit follow-up note) Made-with: Cursor
Closing stale promotion PR; replaced by a new develop → staging PR.
Promotion of `develop` to `staging` to cut a release candidate.
What's included
Deployment notes
- When `OST_LINKER_SERVICE_TOKEN` is unset, the API behaves exactly as today (open).
- Do not set `OST_LINKER_SERVICE_TOKEN` on linker prod until `ost-backend` (with the paired sending code) is also set to the same value. Order to avoid 401s: set on backend first, then on linker.