upgrade: real LLM SDKs, multi-model tenants, RAG visualizer, 506-company FinanceBench demo, full production polish by danrixd · Pull Request #34 · danrixd/smartbaseai

danrixd · 2026-04-15T06:42:58Z

Long-running upgrade branch — 19 commits that take SmartBaseAI from a half-documented starter kit with mocked model wrappers into a working multi-tenant knowledge platform with real LLM providers, interactive retrieval visualization, a 506-company fundamental-analysis demo, and a full operational surface (audit log, usage tracking, cross-tenant search, session timeout handling, rate limiting, structured logging).

Headline highlights

6 seeded knowledge vaults (personal, company / Acme Analytics, organization / Project Lunar Harbor, relativity, saas-ai, smartbase-docs) plus one large-scale vault, financebench, with 506 S&P 500 companies, 1,241 markdown files, 197,201 Chroma vectors, 1.23M daily bar rows, and 150 ground-truth Q&A pairs pulled from the FinanceBench open-source benchmark.
Real LLM providers — AnthropicModel / OpenAIModel / OllamaModel now call the live SDKs (was mocked). Anthropic defaults to claude-opus-4-6 with cache_control: ephemeral on the system prompt per the claude-api skill guidance, logs cache_read_input_tokens per response for verification. Tenants can declare multiple providers; users pick at chat time via a dropdown, with per-request override threaded through /chat/message and /chat/trace.
RAG Visualizer (new page) — live view of a single query flowing through the three-source orchestrator. Shows the vector-store info strip (tenant, collection, vector count, embedding model, device), full semantic candidate ranking with L2 distance bars, keyword hits, fusion block, assembled prompt, and LLM reply. Persist traces with 💾 Save trace, reload later without re-running the LLM (zero-token replay for demos). Export to JSON / Markdown.
Interactive Vault editor — 1,744-file tree browser with live search (components/FileTree.jsx), collapsible per-ticker folders, inline match highlighting, markdown edit/split/preview toggle, bulk upload, automatic section-aware re-chunk + re-embed on save.
Generalised exact_lookup — the DB-preferred fusion fast path now dispatches by tenant and table. Financebench queries with a ticker hint (sniffed from the message) hit daily_bars(ticker, date, open, high, low, close, volume) and return the exact row. Verified live: "AAPL closing price on 2024-06-14" → close=212.49, volume=70.1M directly from SQLite, RAG retrieval runs as supplemental context only.
FinanceBench eval harness (scripts/eval_financebench.py) — runs the 150 ground-truth questions through /chat/message, scores with three strategies (numeric 2% tolerance, substring, optional Claude-as-judge), writes docs/financebench_eval.md with headline accuracy, per-question-type breakdown, latency stats, and sample failures.
Production surface — structured logging with request-id middleware, per-tenant daily token cap with 429 on overage, usage rollup (/admin/usage), audit log viewer (/admin/audit-log), cross-tenant search (/admin/search), session-timeout redirect, error boundary, strict .env loader that refuses to leak shell env vars.
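As a rough illustration of the per-tenant daily token cap in the last item, here is a minimal in-memory sketch. `UsageLedger`, `CapExceeded`, and `estimate_tokens` are hypothetical names chosen for this example; the branch itself records usage in SQLite and the chat route answers with HTTP 429, so treat this as a picture of the rule rather than the actual implementation.

```python
from collections import defaultdict
from datetime import date


class CapExceeded(Exception):
    """Raised when a tenant is at or over its daily cap (an API layer would map this to 429)."""


def estimate_tokens(text: str) -> int:
    # Fallback for providers that report no usage: roughly 4 characters per token.
    return max(1, len(text) // 4)


class UsageLedger:
    """Hypothetical in-memory stand-in for a usage table keyed by (tenant, day)."""

    def __init__(self):
        self._used = defaultdict(int)

    def check(self, tenant_id, cap, day=None):
        day = day or date.today()
        used = self._used[(tenant_id, day)]
        # cap=None means the tenant has no daily_token_cap configured.
        if cap is not None and used >= cap:
            raise CapExceeded(
                f"Daily token cap reached for tenant '{tenant_id}' ({used}/{cap})"
            )

    def record(self, tenant_id, tokens, day=None):
        day = day or date.today()
        self._used[(tenant_id, day)] += tokens
```

A request handler would call `check()` before invoking the model and `record()` afterwards, so the request that crosses the cap still completes and the next one is rejected.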

Commits (newest first)

911f140  fix(eval): ASCII-only progress output to avoid Windows cp1252 crash
e91e3c7  feat: T2/T3/T4 sweep — 16 of the 17 outstanding gap-analysis items
d5c312d  fix(privacy): don't leak absolute filesystem paths into vault content
94738da  feat: persist RAG traces so they can be replayed without spending tokens
b7f04bc  feat(ui): interactive file tree with search + collapsible folders for Vault
c026ab0  fix(files): recursive vault listing + nested-path routes + financebench mapping
b0730c1  fix(loader): standalone ingest script to sidestep native-state segfault
fd04398  fix(loader): resilient SP500 fetch + bounded chunker + CPU ingest fallback
83835a1  feat(demo): load_financebench.py — SEC 10-Ks + S&P 500 bars/profiles loader
8007c34  docs+test: recruiter README, CHANGELOG, architecture doc, smoke scripts, test adapts
5a8a702  feat(frontend): RAG visualizer, Vault editor, Settings page, live API status, vault-selection gate
6a5db9e  feat(demo): six seeded knowledge vaults
1b0e600  feat(backend): real LLM SDKs, multi-model per tenant, settings store, vault CRUD
7b5eddd  chore(repo): gitignore runtime state, untrack vector_store + sqlite dbs

Test state

pytest: 24 / 24 green. Deprecation warning count dropped from 83 to 12 after the datetime.utcnow() cleanup.
ui_smoke.py: 35 / 39 PASS + 4 NEEDS_KEY (providers without keys), 0 FAIL.
Frontend vite build: 270 modules, clean.

CI (pending — needs a token with `workflow` scope)

A .github/workflows/test.yml is ready locally (backend pytest + frontend vite build, both on Python 3.12 / Node 20 with pip/npm caching). GitHub rejected the push because the current PAT lacks workflow scope. Run gh auth refresh -s workflow, then the file can be committed to complete CI setup.

Verified live before opening this PR

AAPL 2024-06-14 exact_lookup → close=212.49, volume=70,122,700 straight from daily_bars
Audit log endpoint: 58 historic events, filterable
Cross-tenant search "quokka" → 29 hits across 7 tenants
Vault listing on financebench: 1,744 files visible recursively
Privacy scrub: 0 absolute paths in 1,241 markdown files and 0 in the top-5 semantic hits against "Program_Research smartbaseai full-submission.txt"
Saved-trace round-trip: run → save → list → fetch → delete, all 200
Three-source orchestrator verified across all 7 tenants
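The exact_lookup check in the first item can be sketched against an in-memory SQLite table. The regexes, the `known_tickers` set, and the function signatures are assumptions made for illustration; only the daily_bars column layout and the close/volume values come from this PR (open/high/low are left NULL because they are not quoted here).

```python
import re
import sqlite3

DATE_RE = re.compile(r"\b(\d{4}-\d{2}-\d{2})\b")
TICKER_RE = re.compile(r"\b[A-Z]{1,5}\b")


def sniff(message, known_tickers):
    """Return (ticker, date) hints found in a chat message; either may be None."""
    date_match = DATE_RE.search(message)
    ticker = next((t for t in TICKER_RE.findall(message) if t in known_tickers), None)
    return ticker, (date_match.group(1) if date_match else None)


def exact_lookup(conn, message, known_tickers):
    ticker, day = sniff(message, known_tickers)
    if not (ticker and day):
        return None  # no usable hint: the caller relies on retrieval alone
    row = conn.execute(
        "SELECT ticker, date, open, high, low, close, volume "
        "FROM daily_bars WHERE ticker = ? AND date = ?",
        (ticker, day),
    ).fetchone()
    return dict(row) if row else None


conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row
conn.execute(
    "CREATE TABLE daily_bars (ticker TEXT, date TEXT, open REAL, high REAL, "
    "low REAL, close REAL, volume INTEGER)"
)
conn.execute(
    "INSERT INTO daily_bars VALUES ('AAPL', '2024-06-14', NULL, NULL, NULL, 212.49, 70122700)"
)
hit = exact_lookup(conn, "AAPL closing price on 2024-06-14", {"AAPL"})
```

Restricting ticker candidates to a known set keeps ordinary capitalised words in the message from being mistaken for symbols.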

What is still outstanding

Demo GIF + screenshots (docs/screenshots/ empty)
CI file push (PAT scope)
Live deploy target (Fly.io / Railway)
FinanceBench eval run to completion: the harness is currently running 50 questions against Ollama; docs/financebench_eval.md will be generated separately
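For readers wondering what the eval harness's numeric scoring involves, a minimal sketch of a 2% relative-tolerance match follows. The regex, the unit table, and the function names are assumptions for illustration and may differ from what scripts/eval_financebench.py actually does; in particular this sketch compares against the first number found in the ground-truth answer.

```python
import re

# A number with optional thousands separators, followed by an optional unit word or %.
_NUM_RE = re.compile(r"(-?\d[\d,]*(?:\.\d+)?)\s*([A-Za-z]+|%)?")
_SCALE = {"bn": 1e9, "billion": 1e9, "m": 1e6, "million": 1e6, "k": 1e3, "thousand": 1e3}


def extract_numbers(text):
    """Pull numeric values out of free text, applying bn/M/k style multipliers."""
    values = []
    for raw, unit in _NUM_RE.findall(text):
        value = float(raw.replace(",", ""))
        # Unknown trailing words (and %) leave the value unscaled.
        values.append(value * _SCALE.get(unit.lower(), 1.0))
    return values


def numeric_match(reply, truth, rel_tol=0.02):
    """True if any number in the reply is within rel_tol of the first number in truth."""
    expected = extract_numbers(truth)
    if not expected:
        return False
    target = expected[0]
    return any(abs(v - target) <= rel_tol * abs(target) for v in extract_numbers(reply))
```

With this rule, a reply containing `volume=70.1M` would count as matching a ground truth of `70,122,700`, since the two differ by well under 2%.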

How to run locally

# Backend
pip install -r requirements.txt
cp .env.example .env
# Optional: paste ANTHROPIC_API_KEY / OPENAI_API_KEY into .env
python scripts/run_server.py --reload             # -> http://localhost:8000

# Frontend
cd frontend && npm install && npm run dev         # -> http://localhost:5173

# Seed the 6 hand-curated vaults
python scripts/seed_demo.py

# (optional) load the 506-company FinanceBench vault (~2h first run)
CUDA_VISIBLE_DEVICES= python scripts/load_financebench.py --scale large \
  --sec-user-agent "Your Name <you@example.com>"
CUDA_VISIBLE_DEVICES= python scripts/ingest_financebench.py

Default logins

super_admin: admin / ChangeThis123! (cross-tenant)
personal: alice / Alice123!
company (Acme): demo / Demo123!
organization (Lunar Harbor): orion / Orion123!
relativity: einat / Einat123!
saas-ai: nadav / Nadav123!
smartbase-docs: docs / Docs1234!
financebench: fbuser / FbUser123!

🤖 Generated with Claude Code

The vector_store/ Chroma collections and data/*.db files are runtime state that change on every query; committing them produced noisy diffs and risked leaking real tenant content through git history. Move them out of tracking and gitignore their parent directories going forward. Also ignore frontend/.env.local (per-dev API base URL) and frontend/dist/ (vite build artifact). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

… vault CRUD

Replaces the mocked AnthropicModel/OpenAIModel/OllamaModel wrappers with real SDK calls (anthropic, openai, requests) and threads them through the orchestrator so tenants actually talk to live model providers.

Why: the previous wrappers returned deterministic "[Anthropic] Response to: <prompt>" stubs; none of the chat or trace surfaces could produce a grounded answer even with valid API keys. This lands the missing wire.

Major changes:

- api/config.py: new strict .env loader. Provider keys (ANTHROPIC_API_KEY/OPENAI_API_KEY/OLLAMA_BASE_URL) come from .env ONLY; shell env vars that are NOT in .env are wiped from os.environ so they can't leak through. Fixes "why does OpenAI show green when I didn't add it to .env?" surprise.
- api/auth_middleware.py, api/routes_auth.py: stop hardcoding SECRET_KEY="super_secret"; pull from api.config.
- ai/models/anthropic_model.py: real Anthropic SDK. Defaults to claude-opus-4-6, applies cache_control: ephemeral on the stable system prompt so repeated turns hit the prompt cache. Typed exceptions for auth/rate-limit/api errors. Clearly-marked unavailability stub when SDK missing or key unset.
- ai/models/openai_model.py: real OpenAI SDK, gpt-4o-mini default.
- ai/models/ollama_model.py: real /api/generate call (already had one) plus /api/tags ping that actually tests connectivity and reads base_url from the settings store.
- Each model class now exposes a .ping() classmethod used by the live status indicator; replaces the hardcoded green-dot UI.
- chatbot/response_generator.py: register "anthropic" in MODELS so tenants can actually select it. Add generate_response_trace() that returns the full pipeline breakdown (history + db_lookup + hybrid retrieval + fusion + prompt + llm) for the RAG visualizer.
- ai/vector_stores/chroma_store.py: add store_info() (collection size, embedding model, dim, device) and hybrid_query_trace() that returns the full ranked candidate list with L2 distances plus keyword hits plus the post-fusion kept set. Enough metadata for the UI to render a real vector-DB panel.
- api/routes_chat.py: ChatRequest gains optional model_provider + model_name override so users can pick any of the tenant's configured models per-request. New POST /chat/trace endpoint is read-only wrt conversation history and returns the full trace for the visualizer.
- api/routes_files.py: new vault endpoints: GET /files/vault (list), GET/PUT /files/vault/{filename} (read + edit with auto re-ingest into Chroma under a stable doc_id), POST /files/upload now lands in the vault dir and ingests the same way. Falls back to data/vaults/{tenant} for non-seeded tenants. Proper path-traversal protection.
- api/routes_admin.py: GET/PUT /admin/settings, GET /admin/models/status (real per-provider ping), POST /admin/models/test (test with override key before saving). Duplicate tenant/user -> 409 Conflict instead of 500. Fix PATCH /admin/tenants/{id} which was calling manager.create() and always raising -- now goes through TenantManager.update().
- tenants/tenant_manager.py: now stateless; it re-reads tenants.json on every call. Fixes the bug where routes_chat and routes_admin each held separate cached instances, so tenants created via admin were invisible to chat until process restart (404 "tenant not found"). Also adds a real update() method.
- db/settings_repository.py: new SQLite-backed kv store. Stored values take precedence over env; empty string clears back to env. Secrets are masked on read (GET /admin/settings).
- requirements.txt: + anthropic, + openai, + python-dotenv.
- .env.example: documents ANTHROPIC_API_KEY, OPENAI_API_KEY, OLLAMA_BASE_URL; explains the Settings-UI override story.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
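To make the strict .env behaviour from the api/config.py item concrete, here is a small sketch of the idea: managed provider keys are taken from the file only, and any copy inherited from the shell is dropped. `parse_env_file` and `load_strict_env` are hypothetical helpers written for this example (the branch adds python-dotenv, which this sketch deliberately avoids so it stays self-contained).

```python
from pathlib import Path

PROVIDER_KEYS = ("ANTHROPIC_API_KEY", "OPENAI_API_KEY", "OLLAMA_BASE_URL")


def parse_env_file(path):
    """Very small KEY=VALUE parser; ignores blank lines and # comments."""
    values = {}
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def load_strict_env(path, environ, managed=PROVIDER_KEYS):
    """Make `environ` agree with the .env file for every managed key."""
    file_values = parse_env_file(path) if Path(path).exists() else {}
    for key in managed:
        if file_values.get(key):
            environ[key] = file_values[key]
        else:
            # Not declared (or empty) in .env: remove any value inherited from the shell.
            environ.pop(key, None)
    return environ
```

In an application this would be called once at startup with `os.environ`; unmanaged variables such as PATH are left untouched.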

…ation / relativity / saas-ai / smartbase-docs)

Scripted corpus so the RAG visualizer has something meaningful to search. Every vault ships with hand-written content containing at least one "needle" fact that cannot be googled, so retrieval quality is verifiable.

Vaults:

- personal/: single-user notes covering habits, 2026 goals, contacts.
- demo/ (tenant_id: company): Acme Analytics product docs (Pulse, Atlas, policies, FAQ) plus a market-data CSV. The FAQ file contains the fake "secret internal codeword" needle.
- organization/: fictional Project Lunar Harbor moon program. A 7-step work plan (with owners, dates, gates), non-public contingency procedures C-7 "Quokka" / C-12 "Bramble" / C-19 "Kingfisher", and a lab reference code LH-ALT-7741-NX. Explicitly labelled as fictional in the docs so the RAG answers prove the content comes from the vault and not from the LLM's training data.
- relativity/: GR research notebook with Hubble tension working model, ringdown overtone extraction (codename Kingfisher-R3), reading list, lab ID GR-TAU-26-2047.
- saas-ai/: SaaS in the age of AI, covering unit economics (gross-margin staircase HIGHLAND-GM-2026), moats, pricing playbook.
- smartbase-docs/: product docs for SmartBaseAI itself, so the tool can answer questions about how to use it.

Also:

- scripts/seed_demo.py: idempotent seeder. Creates tenants, creates one demo user per tenant (alice/demo/orion/einat/nadav/docs), ingests every .md/.txt/.csv into the tenant's Chroma collection with a stable doc_id, and loads the CSVs into sqlite for the exact_lookup path. Each tenant is pre-configured with a models[] array of [ollama, openai, anthropic] so the UI model picker is populated.
- tenants/tenants.json: the six vault entries with models[] arrays.

Run:

    python scripts/seed_demo.py

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

… status, vault-selection gate

Adds the three main UX surfaces that make the rewritten backend actually usable from a browser, plus a subtle but important context-scoping fix.

New pages:

- pages/RagVisualizer.jsx: live view of a single query flowing through the three-source orchestrator. Top row is a pipeline diagram (query -> orchestrator[history/db/rag] -> fusion -> llm -> reply), colored by stage. Below the diagram, a dedicated "Vector store search" panel shows tenant/collection/vector-count/embedding-model/device, then the keyword hits and the full semantic ranking with L2 distance bars (green >=60% relevance / amber / red) and a "kept" badge on the rows the fusion layer actually used. Per-vault preset queries pick up the needle facts seeded into each corpus.
- pages/Vault.jsx: file browser + markdown editor for the current tenant's vault. Click a file to load it, edit, "Save + re-ingest" overwrites disk and replaces the vector under the same stable doc_id. "+ Add file" upload flows through the same auto-ingest path so uploaded files are immediately searchable.
- pages/Settings.jsx (super_admin only): provider credentials form with live Test buttons (POST /admin/models/test), masked Save, plus per-tenant multi-model editor: add/remove rows of {provider, name, label}. First row is the default; users pick at chat time via a dropdown.

New components:

- components/ApiStatus.jsx: polls /admin/models/status every 30s and renders the sidebar's three-dot indicator with real colors instead of the previous hardcoded green dots. Replaces the long-standing bug where ollama/openai always showed "connected" even when down.
- components/VaultGate.jsx: blocking placeholder for pages that need an active vault but don't have one yet. Super-admins see this on first load and must explicitly pick a vault from the top-right dropdown before chat / rag / vault load; regular users land directly in their scoped tenant.

Routing + context fix (router.jsx + Layout.jsx):

- Lifted <Layout> out of every page and into PrivateRoute so Layout wraps every authenticated route from above. Previously each page rendered its own <Layout>, which put the AppContext.Provider BELOW the page in the React tree, so useContext(AppContext) on each page read the default empty context, not Layout's state. This was the bug behind "super_admin picks a vault, Chat still shows the gate": activeTenant updates never reached the page.
- Layout: removed the old empty gear-icon modal (for non-super-admin it was literally empty); the gear icon now navigates to /settings. Added a top-right "Active vault" dropdown for super_admin that starts at "— choose a vault —" with no auto-selection.

Chat.jsx:

- Multi-model picker above the textarea; per-request model override via the new backend fields.
- VaultGate when no active tenant.

Tenants.jsx / Users.jsx: stripped their now-redundant <Layout> wrappers after the router refactor.

pages/Files.jsx deleted: browse+upload functionality folded into pages/Vault.jsx. Having two "files" pages was confusing.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…ts, test adapts

README.md: full recruiter-facing rewrite (hero, elevator pitch, quickstart, mermaid architecture, tech stack, contact section, license). Fixes the stale ui/web path (real frontend is frontend/, Vite port 5173).

docs/architecture.md: one-page technical write-up covering the request lifecycle (auth -> tenant lookup -> session -> orchestration -> fusion -> generation -> persist), the DB-preferred fusion rationale, the hybrid retrieval rationale, per-tenant isolation model, and a mermaid diagram of the pipeline.

docs/gh-repo-edit.sh: draft script for setting GitHub topics and description; not executed, left for explicit approval.

docs/screenshots/README.md: placeholder for screenshot contributions.

CHANGELOG.md: retroactive milestone log grouped by theme, plus an "Unreleased" section for everything on this branch.

TODO.md: backlog that came up during the polish pass (deprecation warnings, unused directories, cap recommendations, etc.).

Smoke scripts:

- scripts/smoke_test.py: end-to-end backend smoke (auth -> tenant CRUD -> user CRUD -> file upload -> chat RAG -> chat DB lookup).
- scripts/ui_smoke.py: covers every interactive UI control by firing the exact endpoint it calls, producing a per-page PASS / FAIL / NEEDS_KEY table.

Test adaptations:

- tests/test_ai.py: the old tests asserted exact stub output strings like "[Anthropic] Response to: yo". With real SDKs that's no longer meaningful; loosened to class-selection checks.
- tests/test_ingestion_rag.py: test_rag_pipeline_query_and_answer used to assert the OpenAI stub's literal output; replaced with a local StubModel so the test is independent of provider availability.
- tests/test_api.py: monkeypatch was poking routes_files.UPLOAD_DIR which no longer exists; redirected to the new VAULT_FALLBACK symbol.

pytest: 24/24 green.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…loader

New resumable loader that builds a large-scale fundamental-analysis tenant. Three scale tiers:

    --scale small    FinanceBench only (~60 files, ~10 min)
    --scale medium   FB + bars + profiles for all SP500 (~1200 files, ~25 min)
    --scale large    Medium + latest 10-K per non-FB SP500 (~1600 files, ~2h)

The FinanceBench GitHub repo ships ~368 10-K PDFs pre-bundled for 40 companies plus 150 open-source Q&A pairs with ground-truth answers. Phase 1 unpacks those (no SEC EDGAR calls needed at this tier) and drops the Q&A rows into data/financebench.db#ground_truth so the orchestrator's DB-exact-lookup path has a real evaluation harness.

Phase 2 scrapes the current S&P 500 list from Wikipedia for the universe.

Phase 3 pulls 10 years of daily OHLCV for every ticker via yfinance and writes them into data/financebench.db#daily_bars so the DB lookup path handles price questions natively without embedding ~1.25M time-series rows.

Phase 4 writes one profile.md per company (name, sector, industry, market cap, HQ, website, business description) from yfinance.Ticker.info. These ARE embedded so semantic queries like "which companies are in the semiconductor industry" work.

Phase 5 (large only) downloads the latest 10-K for each non-FB SP500 ticker via sec-edgar-downloader, strips HTML/XBRL tags, truncates to 400K chars, and writes as markdown.

Phase 6 walks every .md under data/financebench/ and embeds it into the tenant's Chroma collection. Files are chunked in-memory at ~500 tokens with paragraph-boundary splitting; each chunk gets a stable doc_id ({tenant}:{ticker}/{filename}#{chunk_idx}) and metadata ({ticker, filename, path, chunk_idx}) for filtering.

Tenant 'financebench' is created with all three model providers configured and a demo user (fbuser / FbUser123!). Tenant admin still needs at least one LLM API key in .env to get real answers.

Idempotent: any file that already exists on disk is skipped, so a rerun only fills in gaps. The ingest phase wipes the tenant's vectors first and re-embeds everything to keep the collection consistent.

New deps: yfinance, pypdf, sec-edgar-downloader, tqdm, pandas.

Also adds data/financebench/ and data/_downloads/ to .gitignore: the loader output is ~500 MB at the large tier, regenerable from the script, and shouldn't live in git history.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…lback

Three fixes for issues hit during the first Large-tier run:

1. load_sp500_universe(): Wikipedia's scrape path blocked urllib with a 403 (default User-Agent). Primary source is now the datasets/s-and-p-500-companies GitHub CSV; Wikipedia is the fallback, now using requests with a Mozilla UA.
2. chunk_text(): the first ingest attempt segfaulted inside sentence-transformers on the very first batch. Root cause was unbounded "paragraphs" in pypdf-extracted 10-K text: tables and figure-captions merge into one line-break-less blob and some files produce single 50 KB paragraphs that blow the tokenizer. Added a MAX_CHUNK_CHARS = 2000 hard cap, a _hard_split helper for oversized paragraphs, and belt-and-braces slicing on the final chunks.
3. phase_ingest(): added resilient logging (chunk/read/embed errors are caught per-file and reported, one bad file can't kill the phase) and a force_cpu flag.

Note: forcing CPU via CUDA_VISIBLE_DEVICES inside the script does NOT work reliably because torch is already imported transitively via yfinance->pandas earlier in the run. Operators should set the env var at the shell level (CUDA_VISIBLE_DEVICES= python scripts/load_financebench.py ...).

The tenant itself is fine; a direct TenantVectorStore('financebench') probe loads cleanly and reports collection.count() correctly.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
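The bounded chunker described in fix 2 might look roughly like the sketch below. The function bodies are a reconstruction from the description above (hard cap, hard split for oversized paragraphs, final defensive slice), not the code in the repository; the real helper may split on different boundaries.

```python
MAX_CHUNK_CHARS = 2000


def _hard_split(text, limit=MAX_CHUNK_CHARS):
    """Split an oversized paragraph, preferring the last space before the limit."""
    pieces = []
    while len(text) > limit:
        cut = text.rfind(" ", 0, limit)
        if cut <= 0:
            cut = limit  # no space to break on: cut mid-token
        pieces.append(text[:cut].strip())
        text = text[cut:].strip()
    if text:
        pieces.append(text)
    return pieces


def chunk_text(text, limit=MAX_CHUNK_CHARS):
    """Pack paragraphs into chunks that never exceed `limit` characters."""
    chunks, current = [], ""
    for para in (p.strip() for p in text.split("\n\n")):
        if not para:
            continue
        for piece in _hard_split(para, limit):
            # Flush the running chunk if adding this piece would overflow it.
            if current and len(current) + 2 + len(piece) > limit:
                chunks.append(current)
                current = ""
            current = f"{current}\n\n{piece}" if current else piece
    if current:
        chunks.append(current)
    # Belt-and-braces: slice again so nothing oversized can reach the embedder.
    return [c[:limit] for c in chunks]
```

Under these assumptions, a single 50,000-character blob with no line breaks or spaces comes out as 25 chunks of 2,000 characters rather than one oversized input.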

The main loader (scripts/load_financebench.py) reliably completes Phases 1-5 (FB parse -> SP500 -> bars -> profiles -> SEC EDGAR) and then segfaults at the Phase 6 transition with exit code 139. The tenant itself is fine (a fresh Python process can load TenantVectorStore('financebench') and report collection.count() correctly), so the crash is native state left behind by something in the long-running process (likely pyrate-limiter / requests inside sec-edgar-downloader, which ran 466 rate-limited calls over 17 min during Phase 5).

Rather than chase the crash inside a multi-GB-of-C-extension process, split Phase 6 into its own entry point. scripts/ingest_financebench.py runs in a fresh process with CUDA disabled, imports only the bare minimum (TenantManager + TenantVectorStore + chunk_text), batches 32 chunks per Chroma add() call for speed, and falls back to one-by-one adds if a batch fails.

Workflow after this split:

    python scripts/load_financebench.py --scale large ...
    # (crashes at Phase 6; that's OK, Phases 1-5 left 735 md + 503 csv
    #  + 506 profiles + 1.2M bar rows + 150 GT rows on disk)

    CUDA_VISIBLE_DEVICES= python scripts/ingest_financebench.py
    # (runs in a fresh process, reads the same data/financebench/ tree,
    #  embeds ~10K chunks into vector_store/financebench/)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
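The batch-then-fallback pattern used by the standalone ingest script can be sketched as follows. `add_with_fallback` and `batched` are hypothetical names; the only interface assumed is an object with an `add(ids=..., documents=..., metadatas=...)` method, which is the shape of a Chroma collection's add call.

```python
def batched(items, size):
    """Yield consecutive slices of `items` with at most `size` elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def add_with_fallback(collection, ids, documents, metadatas, batch_size=32):
    """Add rows in batches; if a batch fails, retry its rows one by one.

    Returns the ids that could not be added even individually.
    """
    failed = []
    rows = list(zip(ids, documents, metadatas))
    for batch in batched(rows, batch_size):
        b_ids, b_docs, b_meta = (list(col) for col in zip(*batch))
        try:
            collection.add(ids=b_ids, documents=b_docs, metadatas=b_meta)
            continue
        except Exception:
            pass  # fall through to per-row adds for this batch only
        for row_id, doc, meta in batch:
            try:
                collection.add(ids=[row_id], documents=[doc], metadatas=[meta])
            except Exception:
                failed.append(row_id)
    return failed
```

The trade-off is that one bad chunk costs a batch's worth of individual calls instead of aborting the whole ingest. This sketch assumes a failed batch adds nothing; a store that partially commits a batch would need an upsert or a duplicate check on retry.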

…ch mapping

Three coordinated fixes so the Vault page can browse tenants whose content is organized under subdirectories (e.g. data/financebench/ holds one subdir per ticker with 10-Ks, bars.csv, profile.md under each).

1. VAULT_ROOTS gains the financebench entry so the route doesn't fall through to data/vaults/financebench/ which doesn't exist.
2. list_vault uses rglob("*") instead of glob("*") so nested files show up. Each entry's filename is now the POSIX path relative to the vault root (e.g. "AAPL/10k_2022.md"). For financebench this surfaces 1,744 files; existing flat tenants are unaffected because a top-level rglob just returns the same set as glob.
3. Route patterns become {filename:path} so FastAPI accepts literal forward slashes in the filename. _safe_vault_path now validates each path segment against SAFE_NAME_RE, rejects any part equal to "" / "." / ".." to block traversal, and resolves the final path under the vault root with a proper relative_to check.

Verified live on the financebench tenant: GET /files/vault returns 1,744 files, GET /files/vault/AAPL/profile.md returns 2,117 chars of real content, and the existing seeded tenants (personal / company / organization / relativity / saas-ai / smartbase-docs) still list their flat content the same as before.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
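The segment-by-segment validation in fix 3 can be illustrated with a short sketch. The character class in `SAFE_NAME_RE` is an assumption (the actual pattern is not shown in this PR), and `safe_vault_path` is written here as a free function for clarity.

```python
import re
from pathlib import Path

# Assumed allow-list; the pattern used in the repository may differ.
SAFE_NAME_RE = re.compile(r"[A-Za-z0-9._-]+")


def safe_vault_path(root, filename):
    """Resolve `filename` (POSIX-style, may contain '/') strictly under `root`."""
    parts = filename.split("/")
    for part in parts:
        # Empty parts catch leading, trailing and doubled slashes.
        if part in ("", ".", "..") or not SAFE_NAME_RE.fullmatch(part):
            raise ValueError(f"unsafe path segment: {part!r}")
    root = Path(root).resolve()
    candidate = root.joinpath(*parts).resolve()
    # Final guard: raises ValueError if resolution escaped the vault root.
    candidate.relative_to(root)
    return candidate
```

Checking segments first and then confirming with `relative_to` gives two independent layers: the allow-list rejects traversal syntax outright, and the resolved-path check also covers cases the allow-list cannot see, such as a symlink pointing outside the vault.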

… Vault

Replaces the flat <ul> file list in the Vault page with a proper tree browser. Needed because the financebench tenant has 1,744 files across 506 per-ticker subdirectories and the old flat list was unusable: there was no way to find AAPL's 2022 10-K without scrolling through hundreds of unrelated filings.

components/FileTree.jsx (new reusable component):

- Groups files by first path segment. Flat tenants (personal/company) render their files in a root group at the top; nested tenants (financebench) render one collapsible folder per ticker with a count badge.
- Live search box filters files and folders by substring (case insensitive). Matches the folder name, the basename, or the full path. Matching substrings are highlighted inline with a yellow <mark>. Counter flips between "N files · M folders" and "X of N match" so you can tell when your filter is too broad.
- Folders start collapsed by default so a 506-ticker vault doesn't flood the sidebar on open. A query auto-expands every matching folder; clearing the query restores the user's manual expand/collapse state. Opening a file auto-expands its parent so the selected row stays visible.
- Expand-all / Collapse-all buttons for when the user wants to see everything at once (also toggle back on manual click).
- File icons per suffix (📄 for md/txt, 📊 for csv, 📜 for log), folder icon (📁), size shown in KB/MB.

Vault.jsx: trimmed the file-list panel down to <FileTree/> plus the existing "+ Add file" upload + footer caption. Everything else about the edit panel, save+re-ingest flow, upload-and-select behavior is unchanged.

Verified in a fresh build against the financebench tenant (1,744 files, 506 folders): typing "AAPL" collapses everything except Apple's folder and highlights the matching letters; typing "10k_2023" surfaces every 2023 annual across all tickers in one pass.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

New workflow: after you run a query in the RAG Visualizer, click "💾 Save trace" to persist the full trace JSON (pipeline stages, every retrieved chunk with its score, the assembled prompt, the LLM reply) into data/system.db. Saved traces show up in a new collapsible panel directly below the preset-query buttons; click any row to reload the whole visualization from storage, no backend LLM call, no API tokens spent.

Why: demos and reviews often revisit the same "look at how the retrieval handled this specific question" multiple times. With real Anthropic/OpenAI keys in play, each rerun costs real money and latency, and the retrieval result can also shift if the vault is edited between runs. A saved trace is frozen (same chunks, same scores, same reply), which makes it ideal for sharing, for regression comparison, and for scripted portfolio demos.

Backend (db/rag_trace_repository.py + routes_chat.py):

    POST   /chat/traces       - save {tenant_id, title?, query, reply, trace}
    GET    /chat/traces       - list {id, title, query, created_by, created_at} scoped to the requested tenant_id
    GET    /chat/traces/{id}  - fetch the full trace JSON
    DELETE /chat/traces/{id}  - creator or super_admin only

New table rag_traces(id, tenant_id, title, query, reply, trace_json, created_by, created_at). Indexed on tenant_id. Title defaults to the first 80 chars of the query if blank.

Access control mirrors /chat/message and /chat/trace: super_admin can read across tenants, regular users can only read traces for their own tenant.

Frontend (pages/RagVisualizer.jsx):

- New "Saved traces" section below the preset-query buttons, with a collapsible header showing the count.
- Each row shows title, full query, creator, timestamp, and the reply length. Clicking loads the trace into the visualizer: the pipeline diagram, score bars, vector store info strip, full prompt panel, and reply all render from the persisted blob.
- "💾 Save trace" button above the pipeline diagram, visible whenever a trace is active and unsaved. A prompt() asks for a friendly title defaulting to the query.
- After save, the status strip switches to "✓ Saved trace #id" and the button disables to prevent double-save.
- Delete button (✕) on each row; only shown to the creator or super_admin, matching the backend rule.

Verified live round-trip against financebench: run trace -> save -> list -> reload -> delete, all 200.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
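A minimal sketch of the rag_traces storage described in this commit is shown below. The column names come from the commit message; the column types, the index name, and the `save_trace` / `load_trace` helpers are assumptions for illustration.

```python
import json
import sqlite3
from datetime import datetime, timezone

SCHEMA = """
CREATE TABLE IF NOT EXISTS rag_traces (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    tenant_id TEXT NOT NULL,
    title TEXT NOT NULL,
    query TEXT NOT NULL,
    reply TEXT NOT NULL,
    trace_json TEXT NOT NULL,
    created_by TEXT NOT NULL,
    created_at TEXT NOT NULL
);
CREATE INDEX IF NOT EXISTS idx_rag_traces_tenant ON rag_traces (tenant_id);
"""


def save_trace(conn, tenant_id, query, reply, trace, created_by, title=None):
    # A blank title falls back to the first 80 characters of the query.
    title = (title or "").strip() or query[:80]
    cur = conn.execute(
        "INSERT INTO rag_traces "
        "(tenant_id, title, query, reply, trace_json, created_by, created_at) "
        "VALUES (?, ?, ?, ?, ?, ?, ?)",
        (tenant_id, title, query, reply, json.dumps(trace), created_by,
         datetime.now(timezone.utc).isoformat()),
    )
    conn.commit()
    return cur.lastrowid


def load_trace(conn, trace_id, tenant_id):
    """Fetch a trace only if it belongs to the given tenant."""
    row = conn.execute(
        "SELECT trace_json FROM rag_traces WHERE id = ? AND tenant_id = ?",
        (trace_id, tenant_id),
    ).fetchone()
    return json.loads(row[0]) if row else None
```

Scoping the read by tenant_id in the query itself is one simple way to get the "regular users can only read traces for their own tenant" behaviour; a super_admin path would skip that filter.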

Phase 5 of load_financebench.py (SEC EDGAR latest-10K downloader) was writing the raw Path object into every 10k_latest.md header:

    Source: E:\Program_Research\smartbaseai\data\_downloads\sec\...

466 files leaked the operator's absolute filesystem layout into the vault corpus: visible in the Vault editor, ingested into the Chroma chunks, and findable by retrieval. Bad look for a public repo.

Two fixes:

1. load_financebench.py -> phase_extra_10ks: rewrite the header to use only the stable SEC accession number (the last-but-one path segment, e.g. "0001234567-25-000001") instead of the full path. Also hardened ensure_tenant() to store the repo-relative "data/financebench.db" in tenants.json instead of str(STRUCTURED_DB) which would have been absolute.
2. scripts/scrub_vault_paths.py: idempotent one-off cleanup that walks data/financebench/**/*.md, matches the leaked "Source:" header with a Windows-absolute-path regex, extracts the accession number, and rewrites the header in place. Verifies zero remaining leaks by re-scanning for "[A-Z]:\Program_Research".

Run:

    python scripts/scrub_vault_paths.py

Result on the current repo: scanned 1,241 files, rewrote 466, remaining leaks 0.

After scrubbing the disk files, re-run

    CUDA_VISIBLE_DEVICES= python scripts/ingest_financebench.py

to wipe and re-embed the cleaned text so the Chroma collection matches. (The scrubber prints this as a reminder.)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
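The header rewrite performed by the scrubber can be sketched in a few lines. The regex, the replacement header wording ("Source: SEC accession ..."), and the `scrub` function are assumptions for illustration; the only detail taken from the commit is that the accession number is the last-but-one path segment.

```python
import re

# A "Source:" header whose value is a Windows absolute path (drive letter + backslash).
LEAK_RE = re.compile(r"^Source:\s*([A-Za-z]:\\[^\r\n]*)$", re.MULTILINE)


def scrub(text):
    """Return (clean_text, number_of_headers_rewritten)."""

    def _rewrite(match):
        segments = [s for s in match.group(1).split("\\") if s]
        # The accession number is the last-but-one segment of the leaked path.
        accession = segments[-2] if len(segments) >= 2 else "unknown"
        return f"Source: SEC accession {accession}"

    return LEAK_RE.subn(_rewrite, text)
```

Because the rewritten header no longer matches the leak pattern, running the function a second time over the same text changes nothing, which is what makes a cleanup of this kind safe to re-run.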

Big coordinated batch covering Tier 2 (functional gaps), Tier 3 (polish), and Tier 4 (production readiness) from the earlier what's-missing audit. Only #20 (KMS/Vault secrets manager) is intentionally skipped as overkill for a single-operator dev box. # Backend T2 #5 — generalize exact_lookup (db/query_engine.py) Now dispatches by tenant and table. financebench queries with a ticker hint (sniffed from the user message) and a YYYY-MM-DD date land on daily_bars(ticker, date, open, high, low, close, volume) in financebench.db and return the exact row. Verified live: "AAPL closing price on 2024-06-14" -> {close: 212.49, volume: 70.1M}. Legacy market_data path still works for company/organization vaults. ResponseGenerator._lookup_db now tries the intraday pattern first, then falls back to date-only; passes message= through so the ticker detector can see it. Shared _format_row helper handles both row shapes. T2 #6 — FinanceBench eval harness (scripts/eval_financebench.py) Runs each of the 150 open-source ground-truth questions through /chat/message and scores the reply with three strategies: - numeric match (2% relative tolerance, handles $/bn/M/%) - substring match on the canonical answer - optional Claude-as-judge via --judge flag Writes a markdown report at docs/financebench_eval.md with headline accuracy, per-question-type breakdown, latency stats, sample failures. Usage: python scripts/eval_financebench.py --model-provider anthropic --model-name claude-opus-4-6 --limit 50 --judge T2 #7 — chunked re-ingest on vault edit (api/routes_files.py) _ingest_file_into_tenant_store now uses the shared ai/chunking module, deletes every existing {tenant}:{filename}#* vector under the file's prefix before inserting fresh chunks, and honors section-aware chunking. The vault PUT endpoint delegates to the same helper so edits match the loader's layout exactly. 
T2 #8: streaming responses (api/routes_chat.py)

New POST /chat/message/stream endpoint returning SSE events:

    event: delta   data: {"text": "..."}       (60-char chunks)
    event: done    data: {"latency_ms": N}
    event: error   data: {"detail": "..."}

Records usage + writes audit log + updates conversation history just like the non-streaming variant.

T2 #9: cleared all datetime.utcnow() deprecation warnings

Updated 8 callsites across api/routes_auth, db/{audit_log, conversation, file, rag_trace, settings, user}_repository, and ingestion/metadata_generator to use datetime.now(timezone.utc). Warning count dropped 83 -> 12 on pytest runs.

T3 #14: cross-tenant search (api/routes_admin.py + frontend page)

GET /admin/search?q=...&tenants=... runs the hybrid retriever across every (or a subset of) tenant's Chroma store and returns a ranked flat list with tenant tags. Super-admin only. Verified live: q=quokka across all 7 tenants returns 29 hits primarily from organization + company vaults.

T3 #15: audit log viewer (db/audit_log_repository + /admin/audit-log + frontend)

audit_log_repository gains list_logs(limit, offset, username, action) and count_logs(). GET /admin/audit-log returns paginated events. New pages/AuditLog.jsx renders a filterable table.

T3 #16: prompt-cache verification (ai/models/anthropic_model.py)

AnthropicModel.generate now logs

    anthropic usage: input=N cache_create=N cache_read=N output=N

on every response via a dedicated smartbaseai.anthropic logger, tagged with the request id from the middleware. Lets operators verify cache_control is actually being hit across turns.

T3 #21: metadata-aware chunking (new ai/chunking.py)

Two-tier chunker:
1. Split markdown on ## / ### headings (preserves 10-K sections like Risk Factors, MD&A, Balance Sheet in their own chunks)
2. Within each section, pack paragraphs with a MAX_CHUNK_CHARS=2000 hard cap; oversize paragraphs get whitespace-aligned hard-split.
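A rough sketch of the two-tier chunker just described, for readers who want the shape of it. This is not ai/chunking.py: the function names and the (section_title, chunk_text) return shape are assumptions, and only the heading levels and the MAX_CHUNK_CHARS=2000 cap come from the commit message.

```python
import re

MAX_CHUNK_CHARS = 2000  # hard cap named in the commit message
HEADING_RE = re.compile(r"^#{2,3}\s+(.*)$")  # matches ## and ### only


def hard_split(text: str, limit: int) -> list[str]:
    """Split oversize text at the last whitespace at or before the limit."""
    pieces = []
    while len(text) > limit:
        cut = max(text.rfind(ch, 0, limit + 1) for ch in " \n\t")
        if cut <= 0:  # no whitespace to align on: cut exactly at the limit
            cut = limit
        pieces.append(text[:cut].rstrip())
        text = text[cut:].lstrip()
    if text:
        pieces.append(text)
    return pieces


def chunk_markdown(md: str, limit: int = MAX_CHUNK_CHARS) -> list[tuple[str, str]]:
    """Return (section_title, chunk_text) pairs, each chunk at most `limit` chars."""
    # Tier 1: group lines into sections keyed by their ## / ### heading.
    sections: list[tuple[str, list[str]]] = [("", [])]
    for line in md.splitlines():
        m = HEADING_RE.match(line)
        if m:
            sections.append((m.group(1).strip(), []))
        else:
            sections[-1][1].append(line)

    chunks: list[tuple[str, str]] = []
    for title, lines in sections:
        paragraphs = [p.strip() for p in "\n".join(lines).split("\n\n") if p.strip()]
        current = ""
        # Tier 2: pack paragraphs greedily, hard-splitting any that exceed the cap.
        for para in paragraphs:
            for piece in hard_split(para, limit):
                candidate = f"{current}\n\n{piece}" if current else piece
                if len(candidate) <= limit:
                    current = candidate
                else:
                    chunks.append((title, current))
                    current = piece
        if current:
            chunks.append((title, current))
    return chunks
```

Keeping the section title alongside each chunk lets the caller decide how to surface it, for example as a text prefix or as vector metadata.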
Each chunk is prefixed with **Section Title** so semantic similarity can match on the section name. chunk_with_sections() returns rich dicts with section metadata for callers that want it.

T4 #17: structured logging + request IDs (new api/logging_config.py)

New RequestIdMiddleware assigns uuid4 per request (or honors an incoming X-Request-ID header), propagates via contextvars, and returns the id in the response header. Log format:

    HH:MM:SS INFO rid=abc123456789 smartbaseai.http: POST /chat/trace -> 200 (142.1ms)

Installed via configure_logging() in api/app.py.

T4 #18: rate limiting + cost tracking (new db/usage_repository.py)

llm_usage table records (tenant_id, username, provider, model_name, input_tokens, output_tokens, cache_read_tokens, cache_create_tokens, latency_ms, created_at) per request. chat_message + chat_message_stream both record on every call with a 4-chars-per-token heuristic for providers that don't expose usage.

Per-tenant daily cap: set "daily_token_cap" on the tenant config to enforce. Chat endpoint rejects with 429 when hit:

    "Daily token cap reached for tenant 'X' (N/CAP)"

GET /admin/usage returns a per-day x per-tenant x per-provider rollup with estimated cost using blended prices (Anthropic/OpenAI public rates). Ollama is free/local. New pages/UsageDashboard.jsx renders the rollup with totals across requests / tokens / estimated cost.

T4 #19: session timeout interceptor (frontend/src/api/api.js)

axios response interceptor catches 401 / "Token expired" / "Invalid token" and:
- clears localStorage (access_token, role, tenant_id, active_tenant, username)
- stashes the current path in sessionStorage.post_login_redirect
- window.location.assign('/login')

Login page reads post_login_redirect on success and bounces back.

# Frontend

T3 #10: markdown preview in Vault editor (pages/Vault.jsx)

New edit / split / preview toggle above the textarea. split mode shows raw markdown on the left and rendered output on the right.
Uses react-markdown (new dep).

T3 #11: export trace (pages/RagVisualizer.jsx)

Three new buttons above the pipeline diagram:
- 📋 Copy JSON: writes full trace to clipboard
- ⬇ .md: downloads a formatted markdown report (query, DB lookup, hybrid retrieval ranking, fusion block, LLM reply)
- ⬇ .json: downloads the raw trace JSON

Live alongside the existing "💾 Save trace" button.

T3 #12: bulk upload in Vault (pages/Vault.jsx)

File input becomes <input multiple>. upload() iterates every selected file, catches per-file errors, reports "Uploaded N/M files · K ingested." when multi.

T3 #13: error boundary (components/ErrorBoundary.jsx + App.jsx)

React class component wraps <AppRouter/>. Catches render errors with a recoverable fallback card (try again / reload app) instead of blanking the page.

# Login / app plumbing

Login.jsx now also stashes username in localStorage and honors the post_login_redirect bounce. components/Layout.jsx gets new sidebar links for super_admin (Cross-tenant Search / Usage / Audit Log / Settings).

# Tests

24/24 pytest green. Warning count dropped from 83 to 12 due to the datetime.utcnow() cleanup.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
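As an illustration of the T4 #18 cap check in this commit, here is a small in-memory sketch. It is not db/usage_repository.py: the UsageLedger class and the DailyCapExceeded exception are hypothetical stand-ins for the llm_usage table and the 429 response, and the daily window is assumed to be a UTC calendar day. The 4-chars-per-token heuristic and the error message format are taken from the commit message.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


def estimate_tokens(text: str) -> int:
    """4-chars-per-token heuristic for providers that report no usage."""
    return max(1, len(text) // 4) if text else 0


class DailyCapExceeded(Exception):
    """The caller is expected to translate this into an HTTP 429."""


@dataclass
class UsageLedger:
    """In-memory stand-in for the llm_usage table (hypothetical)."""

    totals: dict = field(default_factory=dict)  # (tenant_id, YYYY-MM-DD) -> tokens

    def _key(self, tenant_id: str) -> tuple:
        # Assumption: the cap resets at midnight UTC.
        return (tenant_id, datetime.now(timezone.utc).date().isoformat())

    def check_cap(self, tenant_id: str, daily_token_cap: Optional[int]) -> None:
        """Raise if the tenant has already used up today's cap."""
        used = self.totals.get(self._key(tenant_id), 0)
        if daily_token_cap is not None and used >= daily_token_cap:
            raise DailyCapExceeded(
                f"Daily token cap reached for tenant '{tenant_id}' "
                f"({used}/{daily_token_cap})"
            )

    def record(self, tenant_id: str, prompt: str, reply: str) -> int:
        """Add estimated tokens for one request and return the estimate."""
        tokens = estimate_tokens(prompt) + estimate_tokens(reply)
        key = self._key(tenant_id)
        self.totals[key] = self.totals.get(key, 0) + tokens
        return tokens
```

Checking before the model call and recording after it means a request that starts under the cap is allowed to finish, so the cap can be overshot by at most one request.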

The per-question progress line was printing unicode check/cross glyphs which crashed the loop on the very first question under Windows default cp1252 stdout encoding. Replaced with PASS/FAIL and force-encoded the truncated reply snippet as ascii with '?' for untranslatable chars. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
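A minimal sketch of the kind of ASCII-safe progress line this describes. The function names, the 60-character width, and the line layout are assumptions, not the code in scripts/eval_financebench.py.

```python
def ascii_snippet(reply: str, width: int = 60) -> str:
    """Truncate a reply and replace every non-ASCII character with '?'."""
    return reply[:width].encode("ascii", errors="replace").decode("ascii")


def progress_line(index: int, passed: bool, reply: str) -> str:
    """Build a status line that is safe to print on a cp1252 console."""
    status = "PASS" if passed else "FAIL"
    return f"[{index:03d}] {status} {ascii_snippet(reply)}"
```

Encoding with errors="replace" never raises, so a model reply containing arrows, check marks, or curly quotes cannot abort the loop.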

GitHub rejects pushes that touch .github/workflows/ from a PAT without workflow scope. Shipping the workflow as docs/ci-workflow.yml.template with activation instructions in the header so it is version-controlled but not at its final path yet. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Eight-step narrated walkthrough that covers: login gate, vault picker, the DB+RAG fusion killer query (AAPL price via daily_bars exact lookup), fundamentals query with vector store info strip, saved-trace zero-token replay, tree-search file browser, cross-tenant search, usage dashboard. Plus a 30-second short version for a 5-MB README GIF. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

First end-to-end run of scripts/eval_financebench.py against the live financebench tenant (506 companies, 197,201 Chroma vectors, 150 ground-truth Q&A from the Patronus open-source set).

## Headline

    Model:          claude-opus-4-6 (real API)
    Sample:         20 questions (--limit 20)
    Numeric match:  9 / 20 (45.0%, 2% tolerance)
    Per-type split: domain-relevant 5/10 (50%), novel-generated 4/10 (40%)
    Mean latency:   43 s / question (includes retrieval + Opus call)

## What the number actually means

This is a **lower bound**. The scorer looks for numeric matches with 2% tolerance plus substring matches on the canonical answer. Several failures are correct-in-spirit answers the scorer can't recognize:

- JPM gross-margin question: Opus correctly explains why gross margins are not a relevant metric for a bank (canonical answer says exactly the same thing in one sentence; scorer looks for the one sentence).
- Boeing cyclicality: Opus answers "yes, subject to cyclicality" with a multi-paragraph justification. Canonical is a one-word "Yes". Substring match fails.

A Claude-as-judge run (available via --judge flag, ~$3 extra spend) would typically raise the headline number by 5-10pp. Left for later.

## Where the system actually fails

Several domain-relevant questions (JPM Q1 2021 segment revenue, AMD FY22 quick ratio, Verizon FY22 quick ratio) show a different failure mode: the retriever is pulling chunks from the wrong ticker's 10-K. For AMD and Verizon, Opus literally reports that the retrieved context is about Citizens Financial Group and Pfizer instead. This is cross-ticker retrieval bleed, a side effect of the shared Chroma collection with 197k vectors from 506 companies and no tenant-level ticker filter on the query.

Fix direction (future work): (a) pass a ticker filter into the Chroma where= clause when a ticker is detected in the question, (b) use a per-ticker metadata boost in the hybrid retriever, or (c) ingest per-ticker sub-collections. This is now the top item in the retrieval-quality backlog.

## How to reproduce

    python scripts/eval_financebench.py --base http://127.0.0.1:8799 --model-provider anthropic --model-name claude-opus-4-6 --limit 20

Results are written to docs/financebench_eval.md with per-question type breakdown, latency stats, and sample failures.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
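To make the numeric-match strategy concrete, here is a sketch of a 2% relative-tolerance comparison. It is not the scorer in scripts/eval_financebench.py: the regex, the unit table, and the rule that any canonical number matched by any reply number counts as a pass are all assumptions.

```python
import re

# Hypothetical number pattern: digits with optional thousands separators and
# decimals, optionally followed by a magnitude suffix such as bn or M.
NUM_RE = re.compile(
    r"(?P<num>-?\d[\d,]*(?:\.\d+)?)"
    r"(?:\s*(?P<unit>bn|billion|m|million|k|thousand)(?![A-Za-z]))?",
    re.IGNORECASE,
)
SCALE = {"bn": 1e9, "billion": 1e9, "m": 1e6, "million": 1e6, "k": 1e3, "thousand": 1e3}


def extract_numbers(text: str) -> list[float]:
    """Pull every number out of the text, scaled by its magnitude suffix."""
    values = []
    for m in NUM_RE.finditer(text):
        value = float(m.group("num").replace(",", ""))
        unit = (m.group("unit") or "").lower()
        values.append(value * SCALE.get(unit, 1.0))
    return values


def numeric_match(reply: str, canonical: str, rel_tol: float = 0.02) -> bool:
    """True if some number in the reply is within rel_tol of a canonical number."""
    targets = extract_numbers(canonical)
    candidates = extract_numbers(reply)
    return any(
        abs(c - t) <= rel_tol * abs(t) if t != 0 else c == 0
        for t in targets
        for c in candidates
    )
```

A matcher like this only sees digits, which is why a correct prose answer to a canonical "Yes" or to a qualitative explanation scores as a failure without a judge model.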

danrixd · 2026-04-15T07:42:04Z

📊 First FinanceBench eval run — 45% auto-score with Claude Opus 4.6

Just ran scripts/eval_financebench.py --limit 20 against the live financebench tenant and pushed the report to docs/financebench_eval.md.

| Metric | Result |
| --- | --- |
| Numeric match (2% tol) | 9 / 20 (45.0%) |
| Domain-relevant questions | 5 / 10 (50.0%) |
| Novel-generated questions | 4 / 10 (40.0%) |
| Mean latency | 43 s/question (retrieval + Opus) |
| Total wall-clock | ~14 min for 20 questions |
| Approx. cost | ~$1.50 at Opus rates |

What the number means

Lower bound. The auto-scorer looks for numeric matches (2% tolerance) and substring matches on the canonical answer. Several failures are correct-in-spirit answers the scorer can't recognize — e.g. the JPM gross margins question, where Opus correctly explains why gross margin isn't a relevant metric for a bank (matches the canonical answer's intent exactly) but the substring matcher can't find the one-sentence canonical answer inside a 4-paragraph explanation. A Claude-as-judge run (--judge flag, ~$3 extra spend) would typically lift this by 5–10pp.

Where the system actually fails — interesting finding

Cross-ticker retrieval bleed. Several domain-relevant questions (AMD FY22 quick ratio, Verizon FY22 quick ratio, JPM Q1 2021 segment revenue) failed because the retriever pulled chunks from the wrong ticker's 10-K. Opus literally reports that the retrieved context is about Citizens Financial Group and Pfizer when asked about AMD and Verizon.

Root cause: the financebench tenant shares one Chroma collection with 197,201 vectors from 506 companies, and the hybrid retriever has no ticker-level filter on the where= clause. When the MiniLM embedder sees the phrase "quick ratio for FY22" the nearest neighbors by cosine distance happen to be from other companies' liquidity sections.

Fix direction (next iteration of retrieval quality)

Top item for the retrieval-quality backlog:

1. Detect ticker in the query (same pattern the DB exact-lookup path already uses), and when present, pass it as a where={"ticker": "<TICKER>"} filter into collection.query(...).
2. Fallback: if the ticker filter returns 0 hits (question is about a company that isn't in the vault), drop the filter and try the full collection.
3. Measure: re-run the 20-question eval and see how much the ticker-aware retrieval lifts the AMD/VZ/JPM-segment questions.

Predicted: cross-ticker bleed failures should drop to zero, lifting the headline from 45% → ~60%.
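The filter-then-fallback steps above could be sketched as follows. This is a proposal-level illustration, not code from the repo: the function name is hypothetical, and it assumes every chunk was ingested with a "ticker" metadata field. The `collection` argument is any object exposing Chroma's `query(query_texts=..., n_results=..., where=...)` method.

```python
from typing import Any, Optional


def query_with_ticker_filter(
    collection: Any, question: str, ticker: Optional[str], n_results: int = 8
) -> dict:
    """Query with a ticker filter first; fall back to the full collection on 0 hits."""
    if ticker:
        filtered = collection.query(
            query_texts=[question],
            n_results=n_results,
            where={"ticker": ticker},  # assumes chunks carry a "ticker" field
        )
        # Chroma returns one inner list of ids per query text.
        if filtered.get("ids") and filtered["ids"][0]:
            return filtered
    # No ticker detected, or the filter matched nothing: search everything.
    return collection.query(query_texts=[question], n_results=n_results)
```

The fallback costs one extra query only when the filtered search comes back empty, so questions about companies outside the vault still get an answer attempt.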

Status

📝 docs/financebench_eval.md committed with full per-question breakdown and failure samples
🟢 All other items in this PR verified live
🚧 CI workflow staged at docs/ci-workflow.yml.template — activates after gh auth refresh -s workflow

danrixd and others added 17 commits April 14, 2026 15:23

danrixd wants to merge 17 commits into main from upgrade
