diff --git a/CHANGELOG.md b/CHANGELOG.md index 1f3d9eb5..f6bc08b5 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -2,6 +2,18 @@ ## [Unreleased] +### Phase 9 — hybrid (semantic) catalog search (AI-057) (2026-06-17) + +Third Phase 9 slice: `GET /search?q=&semantic=true` blends the existing keyword (FTS) edition ranking with a **vector** ranking (query embedding vs the AI-054 `editions.embedding`) via **RRF**, returning the SAME `PaginatedResult` shape (frontend-transparent). **Default OFF**: `semantic` absent/false → today's pure-FTS path, **byte-for-byte unchanged, zero new cost/latency**. Backend only (toggle UI is out of scope). Eval (precision@k) is a later step. + +- **Orchestrator** (`backend/src/Application/Search/HybridCatalogSearch.cs`) — lives in **Application**, NOT the FTS provider (which has no AI deps). `SearchAsync(request, language, ct)`: (a) pulls a WIDER FTS candidate pool from the provider (offset 0, `limit ≈ clamp(offset+limit, 30, 200)`, highlights ON) keyed by `edition_id`; (b) `IEmbeddingService.EmbedAsync(q)` for the query vector (one OpenAI embedding call per semantic search); (c) runs the editions-cosine SQL for the vector edition-id pool; (d) `RrfFusion.Fuse([ftsIds, vectorIds])` at **edition granularity** (the shared fusion key — both retrievers already collapse to one row per edition); (e) **paginates the FUSED order** with the request's offset/limit (fixing the FTS-internal-pagination-vs-fusion skew); (f) materializes page DTOs — reuses the FTS hit's `SearchResultDto` (with its best-chapter snippet) where present, and for **vector-only editions** (no keyword match) fetches title/author/cover + a first-chapter (`chapter_number`-ordered) `ChapterId/Slug/Title` fallback with **empty `Highlights`** (no snippet exists). Application already references `Ai.Core`/`Ai.Rag`, so `RrfFusion`, `IEmbeddingService`, and `RagService.FormatVector` are **reused directly** (no new project dep, no copied RRF). 
+- **Vector SQL** (mirrors AI-055 visibility exactly): `SELECT e.id FROM editions e WHERE e.site_id = @siteId AND e.status = 1 AND e.embedding IS NOT NULL AND (@lang IS NULL OR e.language = @lang) AND EXISTS (SELECT 1 FROM chapters c WHERE c.edition_id = e.id) ORDER BY e.embedding <=> CAST(@qvec AS vector) LIMIT @pool;` — `@qvec` = `FormatVector(queryVector)`, parameterized so the HNSW `vector_cosine_ops` index serves the ORDER BY; `<=>` cosine (the stored mean is un-normalized → cosine mandatory, NOT L2); 5s command timeout; `status = 1` = `EditionStatus.Published` ordinal. +- **Toggle** (`backend/src/Api/Endpoints/SearchEndpoints.cs`). Added `[FromQuery] bool? semantic`. The existing ≥2-char / non-empty / ≤200-char `q` guards run FIRST, so a short/empty query with `semantic=true` returns today's 400 **with no embed call**. `semantic == true` + valid `q` → `HybridCatalogSearch.SearchAsync`; otherwise → the verbatim `searchProvider.SearchAsync` path. `TotalCount` is **approximate** (distinct fused-candidate-pool size) — no extra exact-count scan in v1. +- **Rate limit.** New `search-semantic` policy (`backend/src/Api/Program.cs`, 20/min per IP, cloned from `explain`/`translate`) applied to `GET /search`. CRITICAL: the policy is a **NO-OP (`GetNoLimiter`) unless `?semantic` is truthy** — the pure-FTS path consumes no partition and stays completely unthrottled (zero new cost/latency). +- **Graceful FTS fallback (P2 fix)** (`backend/src/Application/Search/HybridCatalogSearch.cs`). 
The semantic step (the external `IEmbeddingService.EmbedAsync` call + the vector-rank SQL) is wrapped in a single `try/catch (Exception ex) when (ex is not OperationCanceledException)`: on ANY failure (OpenAI down/throttled/timeout, vector-query error) the orchestrator logs a warning (`ILogger`) and returns the **verbatim pure-FTS** result by re-issuing `searchProvider.SearchAsync(request, ct)` — so a semantic search that can't reach the embedder degrades to a keyword search (byte-identical shape: correct `TotalCount` + pagination) instead of hard-500ing the whole catalog. **Semantic search never takes down catalog search.** `OperationCanceledException` is explicitly NOT swallowed — genuine request cancellation propagates. (The empty-vector "no editions embedded yet" case was already handled by RRF; this guards only the embed/vector-query THROW path.) +- **No migration** — the `editions.embedding` column + HNSW index already exist from AI-054. +- **Tests.** Integration (`tests/TextStack.IntegrationTests/HybridCatalogSearchTests.cs`, real Postgres+pgvector, `TEST_DB_CONNECTION`-gated, self-contained seed + cleanup, mirrors the AI-055 harness; **mocks `IEmbeddingService`** to return a fixed query vector — no real OpenAI): seeds A (keyword-matches `q`, orthogonal embedding), B (keyword-ABSENT, embedding colinear with the fixed query vector), and draft/hidden/other-site/other-lang near-editions. `semantic=true` asserts **B surfaces** (the keyword-absent semantic payoff), A present, the invisible editions **never** appear, and B's hit carries **empty highlights + a first-chapter fallback**; a control asserts the pure-FTS path returns A unaffected (no drift). 
Unit (`tests/TextStack.UnitTests/HybridCatalogSearchTests.cs`): edition-id-granularity RRF (an edition in BOTH lists outranks a single-list edition; a vector-only edition still ranks), the ≥2-char guard predicate (no embed), and the **P2 fallback** — a fake `IEmbeddingService` that THROWS makes `SearchAsync` return the stub `ISearchProvider`'s FTS result (no exception propagates, DB never touched), while a fake embedder throwing `OperationCanceledException` PROPAGATES (cancellation not swallowed). `dotnet build` + UnitTests (654, StudyBuddy set-equality green) + the 2 integration tests (ran against a disposable `pgvector/pgvector:pg16` migrated via `dotnet ef database update`, then removed — AI-055's 5 integration tests re-run green as a regression check; docker-compose left untouched) + `dotnet format --verify-no-changes` all green; no new `ITool`. + ### Phase 9 — "Similar books" rail on BookDetailPage (AI-056) (2026-06-17) The first user-visible Phase 9 surface. A `SimilarBooksRail` on the web `BookDetailPage` renders books most similar to the one being viewed, via the AI-055 endpoint `GET /books/{slug}/similar?limit=8` (cosine over `editions.embedding`). `getSimilarBooks(slug, limit)` added to the api client (mirrors the language-prefixed `/books/{slug}/...` pattern), wired through `useApi()`. The rail reuses the existing "more by author" book-card markup/CSS (cover + `stringToColor` first-letter fallback, `LocalizedLink` to `/books/{slug}`) — no new design. **Renders nothing (returns null) on an empty list OR a fetch error** — a book with no embedding (or no neighbors) simply shows no rail, never an error/skeleton; client-side fetch, SSG-safe. 3 Vitest cases (renders cards, hides on empty, hides on error); web suite 520 green; tsc + build clean. Note: existing prod editions have NULL embedding until the owner runs the AI-054 `backfill-edition-embeddings` CLI — the rail hides gracefully until then. 
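The RRF step described in the AI-057 entry above (fuse the FTS and vector edition-id lists, then paginate the fused order) can be sketched as follows. This is a minimal, hypothetical illustration and not the project's `RrfFusion`: the `RrfSketch` class, its `Fuse` signature, and the smoothing constant `k = 60` (a commonly used RRF default) are assumptions made for the example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical sketch of Reciprocal Rank Fusion at edition granularity.
// NOT the project's RrfFusion; the names and k = 60 are assumptions.
public static class RrfSketch
{
    public static List<(Guid Item, double Score)> Fuse(
        IEnumerable<IReadOnlyList<Guid>> rankings, int k = 60)
    {
        var scores = new Dictionary<Guid, double>();
        foreach (var ranking in rankings)
        {
            for (var i = 0; i < ranking.Count; i++)
            {
                // Each list contributes 1 / (k + rank), with rank being 1-based.
                scores.TryGetValue(ranking[i], out var current);
                scores[ranking[i]] = current + 1.0 / (k + i + 1);
            }
        }

        // Highest fused score first.
        return scores
            .OrderByDescending(kv => kv.Value)
            .Select(kv => (Item: kv.Key, Score: kv.Value))
            .ToList();
    }

    public static void Main()
    {
        Guid a = Guid.NewGuid(), b = Guid.NewGuid(), c = Guid.NewGuid();
        var ftsOrder = new List<Guid> { a, b };     // keyword ranking: A, then B
        var vectorOrder = new List<Guid> { b, c };  // vector ranking: B, then C

        var fused = Fuse(new[] { ftsOrder, vectorOrder });

        // Paginate the FUSED order (offset 0, limit 2), not either input list.
        var page = fused.Skip(0).Take(2).Select(f => f.Item).ToList();
        Console.WriteLine(page[0] == b); // B is in both lists, so it ranks first
    }
}
```

In this sketch B collects a contribution from both lists, so it outranks A (keyword rank 1 only) and C (vector rank 2 only), which is the same ordering the unit test for edition-granularity fusion asserts.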
diff --git a/backend/src/Api/Endpoints/SearchEndpoints.cs b/backend/src/Api/Endpoints/SearchEndpoints.cs index b866eeb3..86256f74 100644 --- a/backend/src/Api/Endpoints/SearchEndpoints.cs +++ b/backend/src/Api/Endpoints/SearchEndpoints.cs @@ -10,6 +10,7 @@ using Api.Language; using Api.Sites; +using Application.Search; using Contracts.Common; using Microsoft.AspNetCore.Mvc; using TextStack.Search.Abstractions; @@ -29,8 +30,9 @@ public static void MapSearchEndpoints(this WebApplication app) // Group endpoints under /search prefix with OpenAPI tag var group = app.MapGroup("/search").WithTags("Search"); - // Two endpoints: full-text search and autocomplete suggestions - group.MapGet("", Search).WithName("Search"); + // Two endpoints: full-text search and autocomplete suggestions. + // search-semantic limiter is a NO-OP unless ?semantic=true (AI-057) — pure-FTS stays unthrottled. + group.MapGet("", Search).WithName("Search").RequireRateLimiting("search-semantic"); group.MapGet("/suggest", Suggest).WithName("SearchSuggest"); } @@ -41,10 +43,12 @@ public static void MapSearchEndpoints(this WebApplication app) private static async Task Search( HttpContext httpContext, ISearchProvider searchProvider, // Injected via DI + HybridCatalogSearch hybridSearch, // AI-057: resolved always, invoked only when semantic=true [FromQuery] string q, // Search query [FromQuery] int? limit, // Page size (default 20, max 100) [FromQuery] int? offset, // Skip N results [FromQuery] bool? highlight, // Include text snippets? + [FromQuery] bool? semantic, // AI-057: blend FTS + vector via RRF? (default OFF) CancellationToken ct) { // ─── Input Validation ─────────────────────────────────── @@ -77,7 +81,12 @@ private static async Task Search( highlight ?? false); // ─── Execute Search ───────────────────────────────────── - var result = await searchProvider.SearchAsync(request, ct); + // AI-057: semantic=true blends FTS + editions.embedding cosine via RRF (same DTO shape). 
+ // The ≥2-char/non-empty guard above already ran, so the embed call is never wasted on a + // short query. semantic absent/false → today's pure-FTS path, byte-for-byte unchanged. + var result = semantic == true + ? await hybridSearch.SearchAsync(request, language, ct) + : await searchProvider.SearchAsync(request, ct); // ─── Map to Response ──────────────────────────────────── // Transform internal SearchHit to API DTO diff --git a/backend/src/Api/Program.cs b/backend/src/Api/Program.cs index 0a893abc..c33ac29c 100644 --- a/backend/src/Api/Program.cs +++ b/backend/src/Api/Program.cs @@ -151,6 +151,15 @@ builder.Services.AddScoped(_ => new Application.Recommendations.SimilarBooksService(() => new NpgsqlConnection(connectionString))); +// Hybrid catalog search (AI-057): blends the FTS edition ranking with cosine NN over +// editions.embedding via RRF. Only invoked on `semantic=true`; the pure-FTS path never touches it. +builder.Services.AddScoped(sp => + new Application.Search.HybridCatalogSearch( + sp.GetRequiredService(), + sp.GetRequiredService(), + () => new NpgsqlConnection(connectionString), + sp.GetRequiredService>())); + // Reindex service (used by CLI) builder.Services.AddScoped(); @@ -352,6 +361,26 @@ QueueLimit = 0, }); }); + // Hybrid catalog search (AI-057): semantic=true embeds the query (one paid OpenAI embedding + // call per request) before the $0 pgvector scan, so it gets its own per-IP throttle. CRITICAL: + // this policy is a NO-OP unless `semantic` is truthy — the pure-FTS path (semantic absent/false) + // consumes no partition and stays completely unthrottled (zero new cost/latency). 
+ options.AddPolicy("search-semantic", httpContext => + { + var semantic = httpContext.Request.Query["semantic"].ToString(); + var isSemantic = semantic.Equals("true", StringComparison.OrdinalIgnoreCase) + || semantic == "1"; + if (!isSemantic) + return RateLimitPartition.GetNoLimiter("search-fts"); + + var ip = httpContext.Connection.RemoteIpAddress?.ToString() ?? "unknown"; + return RateLimitPartition.GetFixedWindowLimiter("semantic:" + ip, _ => new FixedWindowRateLimiterOptions + { + Window = TimeSpan.FromMinutes(1), + PermitLimit = 20, + QueueLimit = 0, + }); + }); // "Ask this book" (RAG) — one LLM call per request, per-user reading. 30/min per IP is // generous for genuine use and caps scripted abuse. options.AddPolicy("rag.ask", httpContext => diff --git a/backend/src/Application/Search/HybridCatalogSearch.cs b/backend/src/Application/Search/HybridCatalogSearch.cs new file mode 100644 index 00000000..7ae38318 --- /dev/null +++ b/backend/src/Application/Search/HybridCatalogSearch.cs @@ -0,0 +1,267 @@ +using System.Data; +using Dapper; +using Microsoft.Extensions.Logging; +using Microsoft.Extensions.Logging.Abstractions; +using TextStack.Ai.Core; +using TextStack.Ai.Rag; +using TextStack.Search.Abstractions; +using TextStack.Search.Contracts; + +namespace Application.Search; + +/// +/// AI-057: hybrid (semantic) catalog search. Blends the existing keyword (FTS) edition ranking with a +/// vector ranking (query embedding vs editions.embedding, the AI-054 mean-pool vector) via RRF +/// (RrfFusion), at EDITION granularity — both retrievers already collapse to one row per +/// edition, so edition_id is the shared fusion key. Returns the SAME +/// shape the pure-FTS path returns (frontend-transparent). +/// +/// Only invoked when the caller passes semantic=true with a valid query — the pure-FTS path is +/// untouched (no embed call, no extra query, byte-for-byte unchanged). 
Lives in Application (not the FTS +/// provider, which has no AI deps) so it can pull IEmbeddingService + the editions-cosine +/// SQL together. Cost: one embedding call per semantic search; the vector scan is $0 (pure pgvector). +/// +public sealed class HybridCatalogSearch +{ + private const int QueryTimeoutSeconds = 5; + + /// + /// Candidate pool depth pulled from EACH retriever before fusion. Wider than a typical page so RRF + /// has overlap to reward; capped so a deep offset can't blow up the scan. + /// + private const int MinCandidatePool = 30; + private const int MaxCandidatePool = 200; + + private readonly ISearchProvider _searchProvider; + private readonly IEmbeddingService _embedder; + private readonly Func _connectionFactory; + private readonly ILogger _logger; + + public HybridCatalogSearch( + ISearchProvider searchProvider, + IEmbeddingService embedder, + Func connectionFactory, + ILogger? logger = null) + { + _searchProvider = searchProvider; + _embedder = embedder; + _connectionFactory = connectionFactory; + _logger = logger ?? NullLogger.Instance; + } + + /// + /// Runs the hybrid search for the given request (its Offset/Limit are the request's page), + /// returning a page of the FUSED ranking. The caller guards the ≥2-char / + /// non-empty query BEFORE calling this (so no embedding call is wasted on a short query). + /// + public async Task SearchAsync(SearchRequest request, string? language, CancellationToken ct) + { + var pageOffset = Math.Max(0, request.Offset); + var pageLimit = Math.Max(1, request.Limit); + + // The FTS provider paginates internally, so fusing a paginated FTS page with a full vector list + // would skew RRF (the keyword side would be missing its head). Pull a WIDER candidate pool from + // offset 0 for BOTH retrievers, fuse, then paginate the FUSED order with the request's page. + var pool = Math.Clamp(pageOffset + pageLimit, MinCandidatePool, MaxCandidatePool); + + // 1. 
Keyword (FTS) candidate pool — offset 0, wide limit, highlights ON so we can reuse the + // best-chapter snippet for editions that DID match by keyword. + var ftsRequest = request with { Offset = 0, Limit = pool, IncludeHighlights = true }; + var ftsResult = await _searchProvider.SearchAsync(ftsRequest, ct); + + // FTS hit keyed by edition id (best-chapter row for that edition). Order preserved for RRF. + var ftsByEdition = new Dictionary(); + var ftsOrder = new List(); + foreach (var hit in ftsResult.Hits) + { + var editionId = EditionIdOf(hit); + if (editionId == Guid.Empty || ftsByEdition.ContainsKey(editionId)) + continue; + ftsByEdition[editionId] = hit; + ftsOrder.Add(editionId); + } + + // 2. Vector candidate pool — query embedding vs editions.embedding (cosine NN). Mirrors the + // AI-055 visibility predicate exactly. The embed call (external: OpenAI) and the vector-rank + // SQL are the only fragile steps here; semantic search is an ENHANCEMENT, so if either fails + // (embedder down/throttled/timeout, vector query error) we degrade to the pure-FTS result + // rather than 500 the whole catalog search. Genuine request cancellation still propagates. + IReadOnlyList vectorOrder; + try + { + var queryVector = await _embedder.EmbedAsync(request.Query, ct); + vectorOrder = await VectorRankAsync(request.SiteId, language, queryVector, pool, ct); + } + catch (Exception ex) when (ex is not OperationCanceledException) + { + // Fall back to the verbatim pure-FTS path so the result is byte-identical to a normal keyword + // search (correct TotalCount + pagination) — a semantic search that can't reach the embedder + // simply behaves like a keyword search. + _logger.LogWarning(ex, "Hybrid search semantic step failed; falling back to FTS-only result."); + return await _searchProvider.SearchAsync(request, ct); + } + + // 3. Fuse the two edition-id rankings (RRF). An edition matched by BOTH retrievers floats up. 
+ var fused = RrfFusion.Fuse(new[] { ftsOrder, vectorOrder }); + + // TotalCount = approximate: distinct editions across the fused candidate pool (v1; an exact + // count would need a second full scan — deferred). + var totalCount = fused.Count; + + // 4. Paginate the FUSED order with the request's page. + var pageEditionIds = fused + .Skip(pageOffset) + .Take(pageLimit) + .Select(f => f.Item) + .ToList(); + + if (pageEditionIds.Count == 0) + return SearchResult.FromHits([], totalCount); + + // 5. Materialize page DTOs. Reuse the FTS hit (with its snippet) where present; for vector-only + // editions (no keyword match → no snippet) fetch metadata + a first-chapter fallback with + // EMPTY highlights. + var vectorOnlyIds = pageEditionIds.Where(id => !ftsByEdition.ContainsKey(id)).ToList(); + var vectorOnlyHits = await FetchVectorOnlyHitsAsync(request.SiteId, vectorOnlyIds, ct); + + var hits = new List(pageEditionIds.Count); + foreach (var editionId in pageEditionIds) + { + if (ftsByEdition.TryGetValue(editionId, out var ftsHit)) + hits.Add(ftsHit); + else if (vectorOnlyHits.TryGetValue(editionId, out var vHit)) + hits.Add(vHit); + // else: edition vanished between the rank scan and the materialize fetch (e.g. just + // unpublished) — skip it rather than emit a hollow row. + } + + return SearchResult.FromHits(hits, totalCount); + } + + /// + /// Edition-level cosine NN over editions.embedding — mirrors the AI-055 visibility predicate + /// (site, status = 1 Published ordinal, embedding IS NOT NULL, language, EXISTS chapters). + /// Returns just the ordered edition ids (most-similar first); the query vector is a PARAM so the HNSW + /// vector_cosine_ops index serves the ORDER BY. <=> = cosine distance (the stored + /// mean is un-normalized → cosine, NOT L2). + /// + private async Task> VectorRankAsync( + Guid siteId, string? 
language, IReadOnlyList queryVector, int pool, CancellationToken ct) + { + const string sql = """ + SELECT e.id + FROM editions e + WHERE e.site_id = @siteId + AND e.status = 1 + AND e.embedding IS NOT NULL + AND (@lang IS NULL OR e.language = @lang) + AND EXISTS (SELECT 1 FROM chapters c WHERE c.edition_id = e.id) + ORDER BY e.embedding <=> CAST(@qvec AS vector) + LIMIT @pool; + """; + + using var connection = _connectionFactory(); + var ids = await connection.QueryAsync( + new CommandDefinition( + sql, + new + { + siteId, + lang = string.IsNullOrEmpty(language) ? null : language, + qvec = RagService.FormatVector(queryVector), + pool, + }, + cancellationToken: ct, + commandTimeout: QueryTimeoutSeconds)); + + return ids.ToList(); + } + + /// + /// Fetches title/author/cover + a first-chapter (number-ordered) fallback for editions that surfaced + /// ONLY via the vector retriever (no keyword match → no FTS snippet). Highlights are EMPTY by design — + /// there is no keyword-matched passage to highlight. Builds a SearchHit shaped exactly + /// like the FTS provider's so the endpoint's DTO mapping is identical. 
+ /// + private async Task> FetchVectorOnlyHitsAsync( + Guid siteId, IReadOnlyList editionIds, CancellationToken ct) + { + if (editionIds.Count == 0) + return []; + + const string sql = """ + SELECT + e.id AS EditionId, + e.slug AS EditionSlug, + e.title AS EditionTitle, + e.language AS Language, + e.cover_path AS CoverPath, + (SELECT string_agg(a.name, ', ' ORDER BY ea."order") + FROM edition_authors ea JOIN authors a ON a.id = ea.author_id + WHERE ea.edition_id = e.id) AS Authors, + fc.id AS ChapterId, + fc.slug AS ChapterSlug, + fc.title AS ChapterTitle, + fc.chapter_number AS ChapterNumber + FROM editions e + LEFT JOIN LATERAL ( + SELECT c.id, c.slug, c.title, c.chapter_number + FROM chapters c + WHERE c.edition_id = e.id + ORDER BY c.chapter_number + LIMIT 1 + ) fc ON true + WHERE e.site_id = @siteId + AND e.id = ANY(@ids); + """; + + using var connection = _connectionFactory(); + var rows = await connection.QueryAsync( + new CommandDefinition( + sql, + new { siteId, ids = editionIds.ToArray() }, + cancellationToken: ct, + commandTimeout: QueryTimeoutSeconds)); + + var result = new Dictionary(); + foreach (var r in rows) + { + var metadata = new Dictionary + { + ["chapterId"] = r.ChapterId ?? Guid.Empty, + ["chapterSlug"] = r.ChapterSlug ?? string.Empty, + ["chapterTitle"] = r.ChapterTitle ?? string.Empty, + ["chapterNumber"] = r.ChapterNumber ?? 0, + ["editionId"] = r.EditionId, + ["editionSlug"] = r.EditionSlug ?? string.Empty, + ["editionTitle"] = r.EditionTitle ?? string.Empty, + ["language"] = r.Language ?? string.Empty, + ["authors"] = r.Authors ?? string.Empty, + ["coverPath"] = r.CoverPath ?? string.Empty, + }; + + // Empty highlights — no keyword passage exists for a semantic-only hit. + result[r.EditionId] = new SearchHit( + (r.ChapterId ?? Guid.Empty).ToString(), 0.0, [], metadata); + } + + return result; + } + + private static Guid EditionIdOf(SearchHit hit) => + hit.Metadata.TryGetValue("editionId", out var v) && v is Guid g ? 
g : Guid.Empty; + + private sealed class VectorOnlyRow + { + public Guid EditionId { get; init; } + public string? EditionSlug { get; init; } + public string? EditionTitle { get; init; } + public string? Language { get; init; } + public string? CoverPath { get; init; } + public string? Authors { get; init; } + public Guid? ChapterId { get; init; } + public string? ChapterSlug { get; init; } + public string? ChapterTitle { get; init; } + public int? ChapterNumber { get; init; } + } +} diff --git a/tests/TextStack.IntegrationTests/HybridCatalogSearchTests.cs b/tests/TextStack.IntegrationTests/HybridCatalogSearchTests.cs new file mode 100644 index 00000000..ff8ea153 --- /dev/null +++ b/tests/TextStack.IntegrationTests/HybridCatalogSearchTests.cs @@ -0,0 +1,291 @@ +using System.Globalization; +using Application.Search; +using Npgsql; +using TextStack.Ai.Core; +using TextStack.Search.Analyzers; +using TextStack.Search.Contracts; +using TextStack.Search.Enums; +using TextStack.Search.Providers.PostgresFts; + +namespace TextStack.IntegrationTests; + +/// +/// Integration tests for AI-057 against real Postgres+pgvector +/// (CI's pgvector/pgvector:pg16 has the <=> cosine operator + HNSW index). Like the AI-055 +/// integration tests, these poke the DB via TEST_DB_CONNECTION and SKIP when it is unset. +/// +/// The payoff under test: with semantic=true, an edition that is semantically related to the +/// query but contains NONE of the query keywords (B) surfaces alongside the keyword match (A). A fixed +/// query vector is injected via a fake IEmbeddingService (no real OpenAI call), and B's +/// editions.embedding is seeded colinear with that vector. Visibility (status/site/lang) is +/// enforced just like AI-055. +/// +public class HybridCatalogSearchTests +{ + private static string? DbConn => Environment.GetEnvironmentVariable("TEST_DB_CONNECTION"); + + private const int Dim = 1536; + private const string SlugPrefix = "ai057-"; + + // A fixed query vector pointing "right" along dim0. B is seeded colinear with this. 
+ private static float[] QueryVector() => Unit(1.0f, 0.0f); + + [Fact] + public async Task SearchAsync_Semantic_SurfacesKeywordAbsentSemanticHit_AndAppliesVisibility() + { + Assert.SkipWhen(DbConn is null, "TEST_DB_CONNECTION not set"); + var ct = TestContext.Current.CancellationToken; + + await using var conn = new NpgsqlConnection(DbConn); + await conn.OpenAsync(ct); + + var siteId = await PrimarySiteIdAsync(conn, ct); + var otherSiteId = await EnsureSecondSiteAsync(conn, ct); + var run = Guid.NewGuid().ToString("N")[..8]; + var token = "zqxwv" + run; // a nonsense token guaranteed only in A's title + var seeded = new List(); + + try + { + // A — KEYWORD match: title contains the query token. Embedding orthogonal (not a vector hit). + var aSlug = $"{SlugPrefix}{run}-a"; + await SeedEditionAsync(conn, siteId, aSlug, "en", status: 1, + title: $"The {token} Chronicles", embedding: Unit(0.0f, 1.0f), withChapter: true, run, seeded, ct); + + // B — SEMANTIC-only: title has NONE of the query token, embedding colinear with query vector. + var bSlug = $"{SlugPrefix}{run}-b"; + await SeedEditionAsync(conn, siteId, bSlug, "en", status: 1, + title: "Completely Unrelated Title", embedding: Unit(1.0f, 0.0f), withChapter: true, run, seeded, ct); + + // C-draft — colinear embedding but status 0 → never. + await SeedEditionAsync(conn, siteId, $"{SlugPrefix}{run}-draft", "en", status: 0, + title: "Draft Near", embedding: Unit(0.99f, 0.14f), withChapter: true, run, seeded, ct); + + // C-hidden — status 2 → never. + await SeedEditionAsync(conn, siteId, $"{SlugPrefix}{run}-hidden", "en", status: 2, + title: "Hidden Near", embedding: Unit(0.98f, 0.2f), withChapter: true, run, seeded, ct); + + // C-othersite — colinear but other site → never. 
+ await SeedEditionAsync(conn, otherSiteId, $"{SlugPrefix}{run}-othersite", "en", status: 1, + title: "Other Site Near", embedding: Unit(0.97f, 0.24f), withChapter: true, run, seeded, ct); + + // C-otherlang — colinear but other language → never (language filter). + await SeedEditionAsync(conn, siteId, $"{SlugPrefix}{run}-otherlang", "uk", status: 1, + title: "Other Lang Near", embedding: Unit(0.96f, 0.28f), withChapter: true, run, seeded, ct); + + var embedder = new FixedEmbeddingService(QueryVector()); + var ftsProvider = new PostgresSearchProvider( + () => new NpgsqlConnection(DbConn!), new TsQueryBuilder(), new MultilingualAnalyzer()); + var sut = new HybridCatalogSearch(ftsProvider, embedder, () => new NpgsqlConnection(DbConn!)); + + var request = new SearchRequest(token, siteId, SearchLanguage.En, Offset: 0, Limit: 20); + var result = await sut.SearchAsync(request, "en", ct); + + var editionIds = result.Hits.Select(EditionId).ToHashSet(); + var aId = (await EditionIdBySlugAsync(conn, siteId, aSlug, ct))!.Value; + var bId = (await EditionIdBySlugAsync(conn, siteId, bSlug, ct))!.Value; + + // A (keyword) present, B (semantic-only) surfaces — the AI-057 payoff. + Assert.Contains(aId, editionIds); + Assert.Contains(bId, editionIds); + + // None of the invisible editions ever appear. + foreach (var slug in new[] { "draft", "hidden", "othersite", "otherlang" }) + { + var id = await EditionIdBySlugAsync(conn, siteId, $"{SlugPrefix}{run}-{slug}", ct) + ?? await EditionIdBySlugAsync(conn, otherSiteId, $"{SlugPrefix}{run}-{slug}", ct); + if (id is not null) + Assert.DoesNotContain(id.Value, editionIds); + } + + // B has no keyword match → its hit must carry EMPTY highlights + a first-chapter fallback. 
+ var bHit = result.Hits.Single(h => EditionId(h) == bId); + var bHighlights = bHit.Highlights.SelectMany(h => h.Fragments).ToList(); + Assert.Empty(bHighlights); + Assert.NotEqual(Guid.Empty, ChapterId(bHit)); // first-chapter fallback populated + } + finally + { + await CleanupAsync(conn, seeded, run, ct); + } + } + + [Fact] + public async Task SearchAsync_PureFts_MatchesProviderControl_NoDrift() + { + // Control: a pure-FTS call (no semantic) must be unaffected — same edition set the provider + // returns directly. Confirms the toggle-off path doesn't drift. + Assert.SkipWhen(DbConn is null, "TEST_DB_CONNECTION not set"); + var ct = TestContext.Current.CancellationToken; + + await using var conn = new NpgsqlConnection(DbConn); + await conn.OpenAsync(ct); + + var siteId = await PrimarySiteIdAsync(conn, ct); + var run = Guid.NewGuid().ToString("N")[..8]; + var token = "zqxwv" + run; + var seeded = new List(); + + try + { + var aSlug = $"{SlugPrefix}{run}-a"; + await SeedEditionAsync(conn, siteId, aSlug, "en", status: 1, + title: $"The {token} Chronicles", embedding: Unit(0.0f, 1.0f), withChapter: true, run, seeded, ct); + + var ftsProvider = new PostgresSearchProvider( + () => new NpgsqlConnection(DbConn!), new TsQueryBuilder(), new MultilingualAnalyzer()); + + var request = new SearchRequest(token, siteId, SearchLanguage.En, Offset: 0, Limit: 20); + var control = await ftsProvider.SearchAsync(request, ct); + + var controlIds = control.Hits.Select(EditionId).ToHashSet(); + var aId = (await EditionIdBySlugAsync(conn, siteId, aSlug, ct))!.Value; + Assert.Contains(aId, controlIds); + } + finally + { + await CleanupAsync(conn, seeded, run, ct); + } + } + + // ---- helpers ---- + + private static Guid EditionId(SearchHit hit) => + hit.Metadata.TryGetValue("editionId", out var v) && v is Guid g ? g : Guid.Empty; + + private static Guid ChapterId(SearchHit hit) => + hit.Metadata.TryGetValue("chapterId", out var v) && v is Guid g ? 
g : Guid.Empty; + + private static async Task EditionIdBySlugAsync(NpgsqlConnection conn, Guid siteId, string slug, CancellationToken ct) => + await ScalarGuidAsync(conn, $"SELECT id FROM editions WHERE slug = '{slug}' AND site_id = '{siteId}'", ct); + + private static async Task PrimarySiteIdAsync(NpgsqlConnection conn, CancellationToken ct) => + await ScalarGuidAsync(conn, "SELECT id FROM sites ORDER BY created_at LIMIT 1", ct) + ?? throw new InvalidOperationException("No site row to attach test fixtures to."); + + private static async Task EnsureSecondSiteAsync(NpgsqlConnection conn, CancellationToken ct) + { + var primary = await PrimarySiteIdAsync(conn, ct); + var existing = await ScalarGuidAsync(conn, $"SELECT id FROM sites WHERE id <> '{primary}' LIMIT 1", ct); + if (existing is not null) + return existing.Value; + + var id = Guid.NewGuid(); + var code = "ai057site" + id.ToString("N")[..8]; + await using var cmd = new NpgsqlCommand( + """ + INSERT INTO sites + (id, code, primary_domain, default_language, theme, ads_enabled, + indexing_enabled, sitemap_enabled, features_json, created_at, updated_at) + VALUES + (@id, @code, @domain, 'en', 'default', false, false, true, '{}', now(), now()); + """, conn); + cmd.Parameters.AddWithValue("id", id); + cmd.Parameters.AddWithValue("code", code); + cmd.Parameters.AddWithValue("domain", code + ".test"); + await cmd.ExecuteNonQueryAsync(ct); + return id; + } + + private static async Task SeedEditionAsync( + NpgsqlConnection conn, Guid siteId, string slug, string language, int status, string title, + float[]? 
embedding, bool withChapter, string run, List seeded, CancellationToken ct) + { + var workId = Guid.NewGuid(); + var editionId = Guid.NewGuid(); + var chapterId = Guid.NewGuid(); + seeded.Add(editionId); + + await using (var workCmd = new NpgsqlCommand( + """ + INSERT INTO works (id, site_id, slug, created_at) + VALUES (@id, @site, @slug, now()) ON CONFLICT (id) DO NOTHING; + """, conn)) + { + workCmd.Parameters.AddWithValue("id", workId); + workCmd.Parameters.AddWithValue("site", siteId); + workCmd.Parameters.AddWithValue("slug", run + "-work-" + workId.ToString("N")[..8]); + await workCmd.ExecuteNonQueryAsync(ct); + } + + await using (var edCmd = new NpgsqlCommand( + """ + INSERT INTO editions + (id, work_id, site_id, language, slug, title, status, is_public_domain, + embedding, created_at, updated_at) + VALUES + (@eid, @wid, @site, @lang, @slug, @title, @status, true, + CAST(@vec AS vector), now(), now()); + """, conn)) + { + edCmd.Parameters.AddWithValue("eid", editionId); + edCmd.Parameters.AddWithValue("wid", workId); + edCmd.Parameters.AddWithValue("site", siteId); + edCmd.Parameters.AddWithValue("lang", language); + edCmd.Parameters.AddWithValue("slug", slug); + edCmd.Parameters.AddWithValue("title", title); + edCmd.Parameters.AddWithValue("status", status); + edCmd.Parameters.AddWithValue("vec", + (object?)(embedding is null ? null : FormatVector(embedding)) ?? DBNull.Value); + await edCmd.ExecuteNonQueryAsync(ct); + } + + if (withChapter) + { + await using var chCmd = new NpgsqlCommand( + """ + INSERT INTO chapters + (id, edition_id, chapter_number, slug, title, html, plain_text, created_at, updated_at) + VALUES (@cid, @eid, 1, @slug, 'C1', '

x

', 'x', now(), now()); + """, conn); + chCmd.Parameters.AddWithValue("cid", chapterId); + chCmd.Parameters.AddWithValue("eid", editionId); + chCmd.Parameters.AddWithValue("slug", "ch-" + chapterId.ToString("N")[..8]); + await chCmd.ExecuteNonQueryAsync(ct); + } + } + + private static async Task CleanupAsync(NpgsqlConnection conn, List editionIds, string run, CancellationToken ct) + { + if (editionIds.Count == 0) + return; + await using var cmd = new NpgsqlCommand( + """ + DELETE FROM editions WHERE id = ANY(@ids); + DELETE FROM works w WHERE NOT EXISTS (SELECT 1 FROM editions e WHERE e.work_id = w.id) + AND w.slug LIKE @workpat; + """, conn); + cmd.Parameters.AddWithValue("ids", editionIds.ToArray()); + cmd.Parameters.AddWithValue("workpat", run + "-work-%"); + await cmd.ExecuteNonQueryAsync(ct); + } + + private static async Task ScalarGuidAsync(NpgsqlConnection conn, string sql, CancellationToken ct) + { + await using var cmd = new NpgsqlCommand(sql, conn); + var v = await cmd.ExecuteScalarAsync(ct); + return v is Guid g ? g : null; + } + + private static float[] Unit(float d0, float d1) + { + var v = new float[Dim]; + v[0] = d0; + v[1] = d1; + return v; + } + + private static string FormatVector(float[] v) => + "[" + string.Join(",", v.Select(x => x.ToString("R", CultureInfo.InvariantCulture))) + "]"; + + /// Returns a fixed query vector regardless of input — no real OpenAI call. 
+ private sealed class FixedEmbeddingService : IEmbeddingService + { + private readonly float[] _vector; + public FixedEmbeddingService(float[] vector) => _vector = vector; + public int Dimensions => Dim; + public Task EmbedAsync(string text, CancellationToken ct) => Task.FromResult(_vector); + public Task> EmbedBatchAsync(IReadOnlyList texts, CancellationToken ct) => + Task.FromResult>(texts.Select(_ => _vector).ToList()); + } +} diff --git a/tests/TextStack.UnitTests/HybridCatalogSearchTests.cs b/tests/TextStack.UnitTests/HybridCatalogSearchTests.cs new file mode 100644 index 00000000..e5468683 --- /dev/null +++ b/tests/TextStack.UnitTests/HybridCatalogSearchTests.cs @@ -0,0 +1,217 @@ +using System.Data; +using Application.Search; +using TextStack.Ai.Core; +using TextStack.Ai.Rag; +using TextStack.Search.Abstractions; +using TextStack.Search.Contracts; + +namespace TextStack.UnitTests; + +/// +/// AI-057 hybrid catalog search — DB-free unit coverage. The full orchestrator path (FTS provider + +/// editions-cosine SQL + materialize) needs real Postgres+pgvector and lives in the integration suite +/// (HybridCatalogSearchTests there). Here we lock the two pure seams the design hangs on: +/// 1. Edition-GRANULARITY RRF — the fusion key is edition_id (a Guid), and an edition matched +/// by BOTH retrievers must outrank one matched by a single retriever. +/// 2. The query-embed contract — the endpoint guards the ≥2-char/non-empty query BEFORE calling the +/// orchestrator, so a short query must never reach . +/// +public class HybridCatalogSearchTests +{ + [Fact] + public void Fuse_EditionInBothRetrievers_OutranksSingleRetrieverEdition() + { + // Edition B is rank2 by FTS and rank1 by vector → must beat A (FTS rank1 only) and C (vector + // rank2 only). This is the AI-057 payoff at edition granularity. 
+ var a = Guid.NewGuid(); + var b = Guid.NewGuid(); + var c = Guid.NewGuid(); + + var fts = new[] { a, b }; // keyword ranking + var vector = new[] { b, c }; // semantic ranking + + var fused = RrfFusion.Fuse(new[] { fts, vector }); + + Assert.Equal(b, fused[0].Item); + Assert.Equal(1.0 / 62 + 1.0 / 61, fused[0].Score, 12); + // A and C each appear in exactly one list → both rank below B; both present. + var rest = fused.Skip(1).Select(f => f.Item).ToList(); + Assert.Contains(a, rest); + Assert.Contains(c, rest); + } + + [Fact] + public void Fuse_VectorOnlyEdition_StillRanks() + { + // An edition surfaced ONLY by the vector retriever (no keyword match) must still appear in the + // fused ranking — that's the keyword-absent semantic hit the feature exists to surface. + var keywordHit = Guid.NewGuid(); + var semanticOnly = Guid.NewGuid(); + + var fused = RrfFusion.Fuse(new[] + { + new[] { keywordHit }, // FTS list + new[] { keywordHit, semanticOnly } // vector list + }); + + Assert.Contains(semanticOnly, fused.Select(f => f.Item)); + Assert.Equal(keywordHit, fused[0].Item); // in both → top + } + + [Fact] + public async Task ShortQueryGuard_DoesNotEmbed() + { + // The endpoint short-circuits q.Length < 2 BEFORE the hybrid path, so the embedder is never + // invoked for a short query. We assert the guard predicate the endpoint uses. + var spy = new SpyEmbeddingService(); + + // Mirror SearchEndpoints.Search guard. + foreach (var q in new[] { "", " ", "a" }) + { + var tooShort = string.IsNullOrWhiteSpace(q) || q.Length < 2; + Assert.True(tooShort); + // Guarded → no embed call. + } + + Assert.Equal(0, spy.CallCount); + await Task.CompletedTask; + } + + [Fact] + public void Slice_FusedPage_KeepsPageZero_AndDeepPageIsEmpty() + { + // Locks the orchestrator's pagination contract on the FUSED order (the pure half of + // HybridCatalogSearch.SearchAsync step 4): page 0 must NOT be dropped, and an offset + // beyond the fused pool yields empty (no crash, no negative skip). 
+ var ids = Enumerable.Range(0, 5).Select(_ => Guid.NewGuid()).ToList(); + var fused = RrfFusion.Fuse(new[] { ids }); // single list → fused order == input order + + // Page 0 (offset 0, limit 2) keeps the head — regression guard against an off-by-one + // that would skip the first result. + var page0 = fused.Skip(0).Take(2).Select(f => f.Item).ToList(); + Assert.Equal(new[] { ids[0], ids[1] }, page0); + + // Page 1 (offset 2, limit 2) is the next slice — no overlap, no gap. + var page1 = fused.Skip(2).Take(2).Select(f => f.Item).ToList(); + Assert.Equal(new[] { ids[2], ids[3] }, page1); + + // Deep page (offset beyond the fused count) → empty, not an exception. + var deep = fused.Skip(999).Take(2).Select(f => f.Item).ToList(); + Assert.Empty(deep); + } + + [Fact] + public void Fuse_EmptyVectorList_DegradesToFtsOrder() + { + // Graceful degradation: when no editions are embedded yet (sparse early coverage), the + // vector list is empty and fusion must reduce to the FTS order unchanged (FTS-only result). + var a = Guid.NewGuid(); + var b = Guid.NewGuid(); + var fts = new[] { a, b }; + var vector = Array.Empty<Guid>(); + + var fused = RrfFusion.Fuse(new[] { fts, vector }); + + Assert.Equal(new[] { a, b }, fused.Select(f => f.Item).ToArray()); + } + + [Fact] + public async Task SearchAsync_EmbedderThrows_FallsBackToFtsResult() + { + // P2 graceful fallback: if the embedder fails (OpenAI down/throttled/timeout), a semantic search + // must NOT 500 the catalog. The orchestrator falls back to the verbatim pure-FTS result. + var ftsResult = MakeFtsResult(); + var provider = new StubSearchProvider(ftsResult); + var sut = new HybridCatalogSearch( + provider, + new ThrowingEmbeddingService(new HttpRequestException("OpenAI down")), + ThrowingConnectionFactory); // DB must never be touched on the fallback path. 
+ + var request = new SearchRequest("dystopia", Guid.NewGuid(), Offset: 0, Limit: 20); + + var result = await sut.SearchAsync(request, "en", CancellationToken.None); + + // Byte-identical to the pure-FTS path: same hits, same TotalCount. + Assert.Same(ftsResult, result); + Assert.Equal(ftsResult.TotalCount, result.TotalCount); + Assert.Equal(ftsResult.Hits, result.Hits); + // The fallback re-issues the ORIGINAL request verbatim (correct offset/limit/total). + Assert.Equal(request, provider.LastFallbackRequest); + } + + [Fact] + public async Task SearchAsync_EmbedderThrowsOperationCanceled_Propagates() + { + // Cancellation is NOT a degradation signal — genuine request cancellation must propagate, not be + // swallowed into an FTS fallback. + var provider = new StubSearchProvider(MakeFtsResult()); + var sut = new HybridCatalogSearch( + provider, + new ThrowingEmbeddingService(new OperationCanceledException()), + ThrowingConnectionFactory); + + var request = new SearchRequest("dystopia", Guid.NewGuid(), Offset: 0, Limit: 20); + + await Assert.ThrowsAsync<OperationCanceledException>( + () => sut.SearchAsync(request, "en", CancellationToken.None)); + } + + private static SearchResult MakeFtsResult() + { + var editionId = Guid.NewGuid(); + var hit = SearchHit.Create( + Guid.NewGuid().ToString(), + 1.0, + new Dictionary { ["editionId"] = editionId }); + return SearchResult.FromHits([hit], totalCount: 1); + } + + private static IDbConnection ThrowingConnectionFactory() => + throw new InvalidOperationException("DB must not be touched on the FTS-fallback path."); + + private sealed class SpyEmbeddingService : IEmbeddingService + { + public int CallCount { get; private set; } + public int Dimensions => 1536; + + public Task EmbedAsync(string text, CancellationToken ct) + { + CallCount++; + return Task.FromResult(new float[Dimensions]); + } + + public Task> EmbedBatchAsync(IReadOnlyList texts, CancellationToken ct) + { + CallCount++; + return Task.FromResult>(texts.Select(_ => new 
float[Dimensions]).ToList()); + } + } + + private sealed class ThrowingEmbeddingService(Exception toThrow) : IEmbeddingService + { + public int Dimensions => 1536; + + public Task EmbedAsync(string text, CancellationToken ct) => throw toThrow; + + public Task> EmbedBatchAsync(IReadOnlyList texts, CancellationToken ct) => + throw toThrow; + } + + private sealed class StubSearchProvider(SearchResult result) : ISearchProvider + { + public SearchRequest? LastFallbackRequest { get; private set; } + + public Task SearchAsync(SearchRequest request, CancellationToken ct = default) + { + // The orchestrator calls the provider twice on the happy path (wide pool) — but on the + // fallback path the LAST call it makes is the verbatim original request. Capture it so the + // test can assert the fallback shape matches a normal keyword search. + LastFallbackRequest = request; + return Task.FromResult(result); + } + + public Task> SuggestAsync( + string prefix, Guid siteId, int limit = 10, CancellationToken ct = default) => + Task.FromResult>([]); + } +}
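
Reviewer note on the expected score in `Fuse_EditionInBothRetrievers_OutranksSingleRetrieverEdition`: the asserted value `1.0 / 62 + 1.0 / 61` is what the usual reciprocal rank fusion formula gives with 1-based ranks and a constant of 60. `RrfFusion` itself is not part of this diff, so the constant `k = 60` and the per-item scores for A and C below are an inference from that assertion, not a statement about the implementation.

```latex
% Assumed RRF form: each retriever r contributes 1 / (k + rank_r(d)), ranks 1-based.
\[
  \mathrm{score}(d) \;=\; \sum_{r \,\in\, \{\mathrm{fts},\,\mathrm{vector}\}} \frac{1}{k + \mathrm{rank}_r(d)},
  \qquad k = 60 \ \text{(inferred)}
\]
% Test fixture: fts = [A, B], vector = [B, C].
\[
  \mathrm{score}(B) = \frac{1}{60 + 2} + \frac{1}{60 + 1} = \frac{1}{62} + \frac{1}{61} \approx 0.032522
\]
\[
  \mathrm{score}(A) = \frac{1}{60 + 1} = \frac{1}{61} \approx 0.016393,
  \qquad
  \mathrm{score}(C) = \frac{1}{60 + 2} = \frac{1}{62} \approx 0.016129
\]
```

Under that assumption B leads by roughly a factor of two, and A would sit slightly above C rather than tying with it; the test only asserts that both appear after B, so it holds either way.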