upgrade: real LLM SDKs, multi-model tenants, RAG visualizer, 506-company FinanceBench demo, full production polish #34
The vector_store/ Chroma collections and data/*.db files are runtime state that changes on every query; committing them produced noisy diffs and risked leaking real tenant content through git history. Move them out of tracking and gitignore their parent directories going forward.
Also ignore frontend/.env.local (per-dev API base URL) and frontend/dist/ (vite build artifact).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
… vault CRUD
Replaces the mocked AnthropicModel/OpenAIModel/OllamaModel wrappers with
real SDK calls (anthropic, openai, requests) and threads them through
the orchestrator so tenants actually talk to live model providers.
Why: the previous wrappers returned deterministic "[Anthropic] Response
to: <prompt>" stubs; none of the chat or trace surfaces could produce a
grounded answer even with valid API keys. This lands the missing wire.
Major changes:
- api/config.py: new strict .env loader. Provider keys
(ANTHROPIC_API_KEY/OPENAI_API_KEY/OLLAMA_BASE_URL) come from .env ONLY;
shell env vars that are NOT in .env are wiped from os.environ so they
can't leak through. Fixes the "why does OpenAI show green when I didn't
add it to .env?" surprise.
- api/auth_middleware.py, api/routes_auth.py: stop hardcoding
SECRET_KEY="super_secret"; pull from api.config.
- ai/models/anthropic_model.py: real Anthropic SDK. Defaults to
claude-opus-4-6, applies cache_control: ephemeral on the stable
system prompt so repeated turns hit the prompt cache. Typed
exceptions for auth/rate-limit/api errors. Clearly-marked
unavailability stub when SDK missing or key unset.
- ai/models/openai_model.py: real OpenAI SDK, gpt-4o-mini default.
- ai/models/ollama_model.py: real /api/generate call (already had one)
plus /api/tags ping that actually tests connectivity and reads
base_url from the settings store.
- Each model class now exposes a .ping() classmethod used by the live
status indicator; replaces the hardcoded green-dot UI.
- chatbot/response_generator.py: register "anthropic" in MODELS so
tenants can actually select it. Add generate_response_trace() that
returns the full pipeline breakdown (history + db_lookup + hybrid
retrieval + fusion + prompt + llm) for the RAG visualizer.
- ai/vector_stores/chroma_store.py: add store_info() (collection size,
embedding model, dim, device) and hybrid_query_trace() that returns
the full ranked candidate list with L2 distances plus keyword hits
plus the post-fusion kept set. Enough metadata for the UI to render
a real vector-DB panel.
- api/routes_chat.py: ChatRequest gains optional model_provider +
model_name override so users can pick any of the tenant's configured
models per-request. New POST /chat/trace endpoint is read-only wrt
conversation history and returns the full trace for the visualizer.
- api/routes_files.py: new vault endpoints — GET /files/vault (list),
GET/PUT /files/vault/{filename} (read + edit with auto re-ingest into
Chroma under a stable doc_id), POST /files/upload now lands in the
vault dir and ingests the same way. Falls back to data/vaults/{tenant}
for non-seeded tenants. Proper path-traversal protection.
- api/routes_admin.py: GET/PUT /admin/settings, GET /admin/models/status
(real per-provider ping), POST /admin/models/test (test with override
key before saving). Duplicate tenant/user -> 409 Conflict instead of
500. Fix PATCH /admin/tenants/{id} which was calling manager.create()
and always raising -- now goes through TenantManager.update().
- tenants/tenant_manager.py: stateless — re-reads tenants.json on every
call. Fixes the bug where routes_chat and routes_admin each held
separate cached instances, so tenants created via admin were invisible
to chat until process restart (404 "tenant not found"). Also adds a
real update() method.
- db/settings_repository.py: new SQLite-backed kv store. Stored values
take precedence over env; empty string clears back to env. Secrets
are masked on read (GET /admin/settings).
- requirements.txt: + anthropic, + openai, + python-dotenv.
- .env.example: documents ANTHROPIC_API_KEY, OPENAI_API_KEY,
OLLAMA_BASE_URL; explains the Settings-UI override story.
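A minimal sketch of the strict .env behaviour described for api/config.py above, using only the standard library. The function names and the simplified KEY=VALUE parser are assumptions for illustration; the real loader may rely on python-dotenv and differ in detail.

```python
import os
from pathlib import Path
from typing import Dict, MutableMapping

# Provider keys named in the commit message; everything else here is hypothetical.
PROVIDER_KEYS = ("ANTHROPIC_API_KEY", "OPENAI_API_KEY", "OLLAMA_BASE_URL")


def parse_env_file(path: Path) -> Dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values: Dict[str, str] = {}
    if not path.exists():
        return values
    for raw in path.read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def load_provider_env(
    path: Path, environ: MutableMapping[str, str] = os.environ
) -> Dict[str, str]:
    """Make the .env file the only source of provider keys."""
    file_values = parse_env_file(path)
    for key in PROVIDER_KEYS:
        if file_values.get(key):
            environ[key] = file_values[key]
        else:
            # Key exists only in the shell (or is empty in .env): wipe it
            # so it cannot make a provider look configured.
            environ.pop(key, None)
    return {k: environ[k] for k in PROVIDER_KEYS if k in environ}
```

Non-provider variables such as PATH are left untouched; only the three provider keys are filtered.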
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ation / relativity / saas-ai / smartbase-docs)
Scripted corpus so the RAG visualizer has something meaningful to search. Every vault ships with hand-written content containing at least one "needle" fact that cannot be googled, so retrieval quality is verifiable.
Vaults:
- personal/ - single-user notes: habits, 2026 goals, contacts.
- demo/ (tenant_id: company) - Acme Analytics product docs (Pulse, Atlas, policies, FAQ) plus a market-data CSV. The FAQ file contains the fake "secret internal codeword" needle.
- organization/ - fictional Project Lunar Harbor moon program. A 7-step work plan (with owners, dates, gates), non-public contingency procedures C-7 "Quokka" / C-12 "Bramble" / C-19 "Kingfisher", and a lab reference code LH-ALT-7741-NX. Explicitly labelled as fictional in the docs so the RAG answers prove the content comes from the vault and not from the LLM's training data.
- relativity/ - GR research notebook: Hubble tension working model, ringdown overtone extraction (codename Kingfisher-R3), reading list, lab ID GR-TAU-26-2047.
- saas-ai/ - SaaS in the age of AI: unit economics (gross-margin staircase HIGHLAND-GM-2026), moats, pricing playbook.
- smartbase-docs/ - product docs for SmartBaseAI itself, so the tool can answer questions about how to use it.
Also:
- scripts/seed_demo.py: idempotent seeder. Creates tenants, creates one demo user per tenant (alice/demo/orion/einat/nadav/docs), ingests every .md/.txt/.csv into the tenant's Chroma collection with a stable doc_id, and loads the CSVs into sqlite for the exact_lookup path. Each tenant is pre-configured with a models[] array of [ollama, openai, anthropic] so the UI model picker is populated.
- tenants/tenants.json: the six vault entries with models[] arrays.
Run: python scripts/seed_demo.py
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
… status, vault-selection gate
Adds the three main UX surfaces that make the rewritten backend actually
usable from a browser, plus a subtle but important context-scoping fix.
New pages:
- pages/RagVisualizer.jsx: live view of a single query flowing through
the three-source orchestrator. Top row is a pipeline diagram
(query -> orchestrator[history/db/rag] -> fusion -> llm -> reply),
colored by stage. Below the diagram, a dedicated "Vector store search"
panel shows tenant/collection/vector-count/embedding-model/device,
then the keyword hits and the full semantic ranking with L2 distance
bars (green >=60% relevance / amber / red) and a "kept" badge on the
rows the fusion layer actually used. Per-vault preset queries pick
up the needle facts seeded into each corpus.
- pages/Vault.jsx: file browser + markdown editor for the current
tenant's vault. Click a file to load it, edit, "Save + re-ingest"
overwrites disk and replaces the vector under the same stable
doc_id. "+ Add file" upload flows through the same auto-ingest path
so uploaded files are immediately searchable.
- pages/Settings.jsx (super_admin only): provider credentials form
with live Test buttons (POST /admin/models/test), masked Save, plus
per-tenant multi-model editor — add/remove rows of
{provider, name, label}. First row is the default; users pick at
chat time via a dropdown.
New components:
- components/ApiStatus.jsx: polls /admin/models/status every 30s and
renders the sidebar's three-dot indicator with real colors instead of
the previous hardcoded green dots. Replaces the long-standing bug
where ollama/openai always showed "connected" even when down.
- components/VaultGate.jsx: blocking placeholder for pages that need an
active vault but don't have one yet. Super-admins see this on
first load and must explicitly pick a vault from the top-right
dropdown before chat / rag / vault load; regular users land directly
in their scoped tenant.
Routing + context fix (router.jsx + Layout.jsx):
- Lifted <Layout> out of every page and into PrivateRoute so Layout
wraps every authenticated route from above. Previously each page
rendered its own <Layout>, which put the AppContext.Provider BELOW
the page in the React tree — so useContext(AppContext) on each page
read the default empty context, not Layout's state. This was the
bug behind "super_admin picks a vault, Chat still shows the gate" —
activeTenant updates never reached the page.
- Layout: removed the old empty gear-icon modal (for non-super-admin
it was literally empty); the gear icon now navigates to /settings.
Added a top-right "Active vault" dropdown for super_admin that
starts at "— choose a vault —" with no auto-selection.
Chat.jsx:
- Multi-model picker above the textarea; per-request model override
via the new backend fields.
- VaultGate when no active tenant.
Tenants.jsx / Users.jsx: stripped their now-redundant <Layout>
wrappers after the router refactor.
pages/Files.jsx deleted: browse+upload functionality folded into
pages/Vault.jsx. Having two "files" pages was confusing.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ts, test adapts
README.md: full recruiter-facing rewrite (hero, elevator pitch, quickstart, mermaid architecture, tech stack, contact section, license). Fixes the stale ui/web path (real frontend is frontend/, Vite port 5173).
docs/architecture.md: one-page technical write-up covering the request lifecycle (auth -> tenant lookup -> session -> orchestration -> fusion -> generation -> persist), the DB-preferred fusion rationale, the hybrid retrieval rationale, the per-tenant isolation model, and a mermaid diagram of the pipeline.
docs/gh-repo-edit.sh: draft script for setting GitHub topics and description. Not executed, left for explicit approval.
docs/screenshots/README.md: placeholder for screenshot contributions.
CHANGELOG.md: retroactive milestone log grouped by theme, plus an "Unreleased" section for everything on this branch.
TODO.md: backlog that came up during the polish pass (deprecation warnings, unused directories, cap recommendations, etc.).
Smoke scripts:
- scripts/smoke_test.py: end-to-end backend smoke (auth -> tenant CRUD -> user CRUD -> file upload -> chat RAG -> chat DB lookup).
- scripts/ui_smoke.py: covers every interactive UI control by firing the exact endpoint it calls, producing a per-page PASS / FAIL / NEEDS_KEY table.
Test adaptations:
- tests/test_ai.py: the old tests asserted exact stub output strings like "[Anthropic] Response to: yo". With real SDKs that's no longer meaningful, so the tests are loosened to class-selection checks.
- tests/test_ingestion_rag.py: test_rag_pipeline_query_and_answer used to assert the OpenAI stub's literal output; replaced with a local StubModel so the test is independent of provider availability.
- tests/test_api.py: monkeypatch was poking routes_files.UPLOAD_DIR, which no longer exists; redirected to the new VAULT_FALLBACK symbol.
pytest: 24/24 green.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…loader
New resumable loader that builds a large-scale fundamental-analysis
tenant. Three scale tiers:
--scale small FinanceBench only (~60 files, ~10 min)
--scale medium FB + bars + profiles for all SP500 (~1200 files, ~25 min)
--scale large Medium + latest 10-K per non-FB SP500 (~1600 files, ~2h)
The FinanceBench GitHub repo ships ~368 10-K PDFs pre-bundled for 40
companies plus 150 open-source Q&A pairs with ground-truth answers.
Phase 1 unpacks those (no SEC EDGAR calls needed at this tier) and
drops the Q&A rows into data/financebench.db#ground_truth so the
orchestrator's DB-exact-lookup path has a real evaluation harness.
Phase 2 scrapes the current S&P 500 list from Wikipedia for the
universe. Phase 3 pulls 10 years of daily OHLCV for every ticker via
yfinance and writes them into data/financebench.db#daily_bars so the
DB lookup path handles price questions natively without embedding
~1.25M time-series rows.
Phase 4 writes one profile.md per company (name, sector, industry,
market cap, HQ, website, business description) from yfinance.Ticker.info.
These ARE embedded so semantic queries like "which companies are in
the semiconductor industry" work.
Phase 5 (large only) downloads the latest 10-K for each non-FB SP500
ticker via sec-edgar-downloader, strips HTML/XBRL tags, truncates to
400K chars, and writes as markdown.
Phase 6 walks every .md under data/financebench/ and embeds it into
the tenant's Chroma collection. Files are chunked in-memory at ~500
tokens with paragraph-boundary splitting; each chunk gets a stable
doc_id ({tenant}:{ticker}/{filename}#{chunk_idx}) and metadata
({ticker, filename, path, chunk_idx}) for filtering.
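The doc_id and metadata layout described for Phase 6 could look roughly like the sketch below. The helper name chunk_records and the exact value stored under "path" are assumptions; only the id pattern and the metadata keys come from the description above.

```python
from typing import Dict, List


def chunk_records(tenant: str, ticker: str, filename: str, chunks: List[str]) -> List[Dict]:
    """Build one record per chunk with a stable, reproducible id."""
    records = []
    for idx, text in enumerate(chunks):
        records.append(
            {
                # {tenant}:{ticker}/{filename}#{chunk_idx}
                "id": f"{tenant}:{ticker}/{filename}#{idx}",
                "text": text,
                "metadata": {
                    "ticker": ticker,
                    "filename": filename,
                    "path": f"{ticker}/{filename}",  # assumed: path relative to the vault root
                    "chunk_idx": idx,
                },
            }
        )
    return records
```

Because the id depends only on tenant, ticker, filename and chunk index, re-embedding the same file produces the same ids.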
Tenant 'financebench' is created with all three model providers
configured and a demo user (fbuser / FbUser123!). Tenant admin still
needs at least one LLM API key in .env to get real answers.
Idempotent: any file that already exists on disk is skipped, so a
rerun only fills in gaps. The ingest phase wipes the tenant's vectors
first and re-embeds everything to keep the collection consistent.
New deps: yfinance, pypdf, sec-edgar-downloader, tqdm, pandas.
Also adds data/financebench/ and data/_downloads/ to .gitignore —
the loader output is ~500 MB at the large tier, regenerable from
the script, and shouldn't live in git history.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…lback
Three fixes for issues hit during the first Large-tier run:
1. load_sp500_universe() — Wikipedia's scrape path blocked urllib with a
403 (default User-Agent). Primary source is now the
datasets/s-and-p-500-companies GitHub CSV; Wikipedia is the fallback,
now using requests with a Mozilla UA.
2. chunk_text() — the first ingest attempt segfaulted inside
sentence-transformers on the very first batch. Root cause was
unbounded "paragraphs" in pypdf-extracted 10-K text: tables and
figure-captions merge into one line-break-less blob and some files
produce single 50 KB paragraphs that blow the tokenizer. Added a
MAX_CHUNK_CHARS = 2000 hard cap, a _hard_split helper for oversized
paragraphs, and belt-and-braces slicing on the final chunks.
3. phase_ingest() — added resilient logging (chunk/read/embed errors
are caught per-file and reported, one bad file can't kill the phase)
and a force_cpu flag. Note: forcing CPU via CUDA_VISIBLE_DEVICES
inside the script does NOT work reliably because torch is already
imported transitively via yfinance->pandas earlier in the run.
Operators should set the env var at the shell level
(CUDA_VISIBLE_DEVICES= python scripts/load_financebench.py ...).
The tenant itself is fine — a direct TenantVectorStore('financebench')
probe loads cleanly and reports collection.count() correctly.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The main loader (scripts/load_financebench.py) reliably completes
Phases 1-5 (FB parse -> SP500 -> bars -> profiles -> SEC EDGAR) and
then segfaults at the Phase 6 transition with exit code 139. The tenant
itself is fine — a fresh Python process can load
TenantVectorStore('financebench') and report collection.count()
correctly, so the crash is most likely caused by native state left behind by something in
the long-running process (probably pyrate-limiter / requests inside
sec-edgar-downloader, which ran 466 rate-limited calls over 17 min
during Phase 5).
Rather than chase the crash inside a multi-GB-of-C-extension process,
split Phase 6 into its own entry point. scripts/ingest_financebench.py
runs in a fresh process with CUDA disabled, imports only the bare
minimum (TenantManager + TenantVectorStore + chunk_text), batches
32 chunks per Chroma add() call for speed, and falls back to one-by-one
adds if a batch fails.
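The batch-then-fallback behaviour can be sketched independently of Chroma. Here `add` is a hypothetical callable standing in for the store's add method, and add_in_batches is an illustrative name rather than the script's actual function.

```python
from typing import Callable, List, Sequence


def add_in_batches(
    ids: Sequence[str],
    texts: Sequence[str],
    add: Callable[[List[str], List[str]], None],
    batch_size: int = 32,
) -> List[str]:
    """Add chunks in batches; on a batch failure retry one by one.

    Returns the ids that could not be added even individually.
    """
    failed: List[str] = []
    for start in range(0, len(ids), batch_size):
        batch_ids = list(ids[start:start + batch_size])
        batch_texts = list(texts[start:start + batch_size])
        try:
            add(batch_ids, batch_texts)
        except Exception:
            # One bad chunk should not cost the other 31 in the batch.
            for one_id, one_text in zip(batch_ids, batch_texts):
                try:
                    add([one_id], [one_text])
                except Exception:
                    failed.append(one_id)
    return failed
```

With a Chroma collection, the callable could be something like `lambda i, t: collection.add(ids=i, documents=t)`, assuming the store exposes the collection directly.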
Workflow after this split:
python scripts/load_financebench.py --scale large ...
# (crashes at Phase 6 — that's OK, Phases 1-5 left 735 md + 503 csv
# + 506 profiles + 1.2M bar rows + 150 GT rows on disk)
CUDA_VISIBLE_DEVICES= python scripts/ingest_financebench.py
# (runs in a fresh process, reads the same data/financebench/ tree,
# embeds ~10K chunks into vector_store/financebench/)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ch mapping
Three coordinated fixes so the Vault page can browse tenants whose
content is organized under subdirectories (e.g. data/financebench/
holds one subdir per ticker with 10-Ks, bars.csv, profile.md under
each).
1. VAULT_ROOTS gains the financebench entry so the route doesn't fall
through to data/vaults/financebench/ which doesn't exist.
2. list_vault uses rglob("*") instead of glob("*") so nested files
show up. Each entry's filename is now the POSIX path relative to
the vault root (e.g. "AAPL/10k_2022.md"). For financebench this
surfaces 1,744 files; existing flat tenants are unaffected because
rglob on a vault with no subdirectories returns the same set as glob.
3. Route patterns become {filename:path} so FastAPI accepts literal
forward slashes in the filename. _safe_vault_path now validates
each path segment against SAFE_NAME_RE, rejects any part equal to
"" / "." / ".." to block traversal, and resolves the final path
under the vault root with a proper relative_to check.
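A minimal sketch of the validation described in fix 3. The SAFE_NAME_RE pattern shown here is an assumption (the real one is not quoted in this commit), and the function is written as a standalone helper rather than the actual _safe_vault_path.

```python
import re
from pathlib import Path

# Assumed pattern: letters, digits, dot, underscore, hyphen.
SAFE_NAME_RE = re.compile(r"^[A-Za-z0-9._-]+$")


def safe_vault_path(vault_root: Path, filename: str) -> Path:
    """Resolve a POSIX-style relative filename under vault_root or raise ValueError."""
    parts = filename.split("/")
    for part in parts:
        # Empty, "." and ".." segments are rejected outright to block traversal.
        if part in ("", ".", "..") or not SAFE_NAME_RE.match(part):
            raise ValueError(f"unsafe path segment: {part!r}")
    root = vault_root.resolve()
    candidate = (root / Path(*parts)).resolve()
    # relative_to raises ValueError if candidate escaped the root (e.g. via a symlink).
    candidate.relative_to(root)
    return candidate
```

The relative_to check is deliberately redundant with the per-segment check; it also covers symlinks inside the vault that point outside it.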
Verified live on the financebench tenant: GET /files/vault returns
1,744 files, GET /files/vault/AAPL/profile.md returns 2,117 chars of
real content, and the existing seeded tenants (personal / company /
organization / relativity / saas-ai / smartbase-docs) still list
their flat content the same as before.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
… Vault
Replaces the flat <ul> file list in the Vault page with a proper tree browser. Needed because the financebench tenant has 1,744 files across 506 per-ticker subdirectories and the old flat list was unusable: there was no way to find AAPL's 2022 10-K without scrolling through hundreds of unrelated filings.
components/FileTree.jsx (new reusable component):
- Groups files by first path segment. Flat tenants (personal/company) render their files in a root group at the top; nested tenants (financebench) render one collapsible folder per ticker with a count badge.
- Live search box filters files and folders by substring (case insensitive). Matches the folder name, the basename, or the full path. Matching substrings are highlighted inline with a yellow <mark>. Counter flips between "N files · M folders" and "X of N match" so you can tell when your filter is too broad.
- Folders start collapsed by default so a 506-ticker vault doesn't flood the sidebar on open. A query auto-expands every matching folder; clearing the query restores the user's manual expand/collapse state. Opening a file auto-expands its parent so the selected row stays visible.
- Expand-all / Collapse-all buttons for when the user wants to see everything at once (also toggle back on manual click).
- File icons per suffix (📄 for md/txt, 📊 for csv, 📜 for log), folder icon (📁), size shown in KB/MB.
Vault.jsx: trimmed the file-list panel down to <FileTree/> plus the existing "+ Add file" upload + footer caption. Everything else about the edit panel, save+re-ingest flow, and upload-and-select behavior is unchanged.
Verified in a fresh build against the financebench tenant (1,744 files, 506 folders): typing "AAPL" collapses everything except Apple's folder and highlights the matching letters; typing "10k_2023" surfaces every 2023 annual across all tickers in one pass.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
New workflow: after you run a query in the RAG Visualizer, click
"💾 Save trace" to persist the full trace JSON (pipeline stages, every
retrieved chunk with its score, the assembled prompt, the LLM reply)
into data/system.db. Saved traces show up in a new collapsible panel
directly below the preset-query buttons — click any row to reload the
whole visualization from storage, no backend LLM call, no API tokens
spent.
Why: demos and reviews often revisit the same "look at how the
retrieval handled this specific question" multiple times. With real
Anthropic/OpenAI keys in play, each rerun costs real money and
latency, and the retrieval result can also shift if the vault is
edited between runs. A saved trace is frozen — same chunks, same
scores, same reply — which makes it ideal for sharing, for
regression comparison, and for scripted portfolio demos.
Backend (db/rag_trace_repository.py + routes_chat.py):
POST /chat/traces - save {tenant_id, title?, query, reply, trace}
GET /chat/traces - list {id, title, query, created_by, created_at}
scoped to the requested tenant_id
GET /chat/traces/{id} - fetch the full trace JSON
DELETE /chat/traces/{id} - creator or super_admin only
New table rag_traces(id, tenant_id, title, query, reply, trace_json,
created_by, created_at). Indexed on tenant_id. Title defaults to the
first 80 chars of the query if blank.
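The table and the title default could be sketched with sqlite3 as below. Column types, the index name, and the function names init_db / save_trace are assumptions; only the column list, the tenant_id index, and the 80-character title rule come from the description above.

```python
import json
import sqlite3
from datetime import datetime, timezone
from typing import Optional

SCHEMA = """
CREATE TABLE IF NOT EXISTS rag_traces (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    tenant_id TEXT NOT NULL,
    title TEXT NOT NULL,
    query TEXT NOT NULL,
    reply TEXT,
    trace_json TEXT NOT NULL,
    created_by TEXT,
    created_at TEXT NOT NULL
);
CREATE INDEX IF NOT EXISTS idx_rag_traces_tenant ON rag_traces(tenant_id);
"""


def init_db(conn: sqlite3.Connection) -> None:
    conn.executescript(SCHEMA)


def save_trace(
    conn: sqlite3.Connection,
    tenant_id: str,
    query: str,
    reply: str,
    trace: dict,
    created_by: str,
    title: Optional[str] = None,
) -> int:
    """Insert a trace; a blank title falls back to the first 80 chars of the query."""
    title = (title or "").strip() or query[:80]
    cur = conn.execute(
        "INSERT INTO rag_traces "
        "(tenant_id, title, query, reply, trace_json, created_by, created_at) "
        "VALUES (?, ?, ?, ?, ?, ?, ?)",
        (
            tenant_id,
            title,
            query,
            reply,
            json.dumps(trace),
            created_by,
            datetime.now(timezone.utc).isoformat(),
        ),
    )
    conn.commit()
    return cur.lastrowid
```

Storing the trace as a JSON blob keeps the table schema stable even if the trace structure changes between versions.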
Access control mirrors /chat/message and /chat/trace: super_admin can
read across tenants, regular users can only read traces for their
own tenant.
Frontend (pages/RagVisualizer.jsx):
- New "Saved traces" section below the preset-query buttons, with a
collapsible header showing the count.
- Each row shows title, full query, creator, timestamp, and the
reply length. Clicking loads the trace into the visualizer — the
pipeline diagram, score bars, vector store info strip, full prompt
panel, and reply all render from the persisted blob.
- "💾 Save trace" button above the pipeline diagram, visible whenever
a trace is active and unsaved. A prompt() asks for a friendly title
defaulting to the query.
- After save, the status strip switches to "✓ Saved trace #id" and
the button disables to prevent double-save.
- Delete button (✕) on each row; only shown to the creator or
super_admin, matching the backend rule.
Verified live round-trip against financebench: run trace -> save ->
list -> reload -> delete, all 200.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Phase 5 of load_financebench.py (SEC EDGAR latest-10K downloader) was
writing the raw Path object into every 10k_latest.md header:
Source: E:\Program_Research\smartbaseai\data\_downloads\sec\...
466 files leaked the operator's absolute filesystem layout into the
vault corpus — visible in the Vault editor, ingested into the Chroma
chunks, and findable by retrieval. Bad look for a public repo.
Two fixes:
1. load_financebench.py -> phase_extra_10ks: rewrite the header to use
only the stable SEC accession number (the last-but-one path
segment, e.g. "0001234567-25-000001") instead of the full path.
Also hardened ensure_tenant() to store the repo-relative
"data/financebench.db" in tenants.json instead of str(STRUCTURED_DB)
which would have been absolute.
2. scripts/scrub_vault_paths.py: idempotent one-off cleanup that
walks data/financebench/**/*.md, matches the leaked "Source:"
header with a Windows-absolute-path regex, extracts the
accession number, and rewrites the header in place. Verifies
zero remaining leaks by re-scanning for "[A-Z]:\Program_Research".
Run: python scripts/scrub_vault_paths.py
Result on the current repo: scanned 1,241 files, rewrote 466,
remaining leaks 0.
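The scrub step can be sketched as a single regex substitution. The regex, the function name, and the wording of the rewritten header are assumptions for illustration; the only details taken from the commit are the "Source:" header, the Windows absolute path, and the accession number being the last-but-one path segment.

```python
import re
from typing import Tuple

# A "Source:" header whose value starts with a drive letter and a backslash.
LEAK_RE = re.compile(r"^Source:[ \t]*[A-Za-z]:\\(?P<path>[^\r\n]+)", re.MULTILINE)


def scrub_source_header(text: str) -> Tuple[str, int]:
    """Replace leaked absolute paths with the SEC accession number.

    Returns the rewritten text and the number of headers rewritten.
    """

    def _rewrite(match: "re.Match[str]") -> str:
        segments = [s for s in match.group("path").strip().split("\\") if s]
        # The accession number is the last-but-one path segment.
        accession = segments[-2] if len(segments) >= 2 else "unknown"
        return f"Source: SEC accession {accession}"  # assumed header wording

    return LEAK_RE.subn(_rewrite, text)
```

Running it twice is harmless: the rewritten header no longer starts with a drive letter, so the second pass finds nothing to change.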
After scrubbing the disk files, re-run
CUDA_VISIBLE_DEVICES= python scripts/ingest_financebench.py
to wipe and re-embed the cleaned text so the Chroma collection
matches. (The scrubber prints this as a reminder.)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Big coordinated batch covering Tier 2 (functional gaps), Tier 3 (polish), and Tier 4 (production readiness) from the earlier what's-missing audit. Only #20 (KMS/Vault secrets manager) is intentionally skipped as overkill for a single-operator dev box.
# Backend
T2 #5: generalize exact_lookup (db/query_engine.py)
Now dispatches by tenant and table. financebench queries with a ticker hint (sniffed from the user message) and a YYYY-MM-DD date land on daily_bars(ticker, date, open, high, low, close, volume) in financebench.db and return the exact row. Verified live: "AAPL closing price on 2024-06-14" -> {close: 212.49, volume: 70.1M}. Legacy market_data path still works for company/organization vaults.
ResponseGenerator._lookup_db now tries the intraday pattern first, then falls back to date-only; passes message= through so the ticker detector can see it. Shared _format_row helper handles both row shapes.
T2 #6: FinanceBench eval harness (scripts/eval_financebench.py)
Runs each of the 150 open-source ground-truth questions through /chat/message and scores the reply with three strategies:
- numeric match (2% relative tolerance, handles $/bn/M/%)
- substring match on the canonical answer
- optional Claude-as-judge via --judge flag
Writes a markdown report at docs/financebench_eval.md with headline accuracy, per-question-type breakdown, latency stats, sample failures.
Usage:
python scripts/eval_financebench.py --model-provider anthropic --model-name claude-opus-4-6 --limit 50 --judge
T2 #7: chunked re-ingest on vault edit (api/routes_files.py)
_ingest_file_into_tenant_store now uses the shared ai/chunking module, deletes every existing {tenant}:{filename}#* vector under the file's prefix before inserting fresh chunks, and honors section-aware chunking. The vault PUT endpoint delegates to the same helper so edits match the loader's layout exactly.
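The numeric-match strategy from T2 #6 could be approximated as below. This is a sketch under stated assumptions, not the harness's scorer: the regex, the suffix table, and the function names are illustrative, and it accepts a reply if any number in it is within 2% of any number in the canonical answer, which can over-match on years or dates.

```python
import re
from typing import List

_SCALE = {
    "k": 1e3, "thousand": 1e3,
    "m": 1e6, "million": 1e6,
    "bn": 1e9, "billion": 1e9,
}
_NUM_RE = re.compile(
    r"(-?)\$?(\d[\d,]*(?:\.\d+)?)\s*(billion|bn|million|m|thousand|k)?\b",
    re.IGNORECASE,
)


def extract_numbers(text: str) -> List[float]:
    """Pull numbers out of free text, applying k/M/bn style suffixes."""
    values = []
    for sign, digits, suffix in _NUM_RE.findall(text):
        value = float(digits.replace(",", "")) * _SCALE.get(suffix.lower(), 1.0)
        values.append(-value if sign else value)
    return values


def numeric_match(reply: str, canonical: str, rel_tol: float = 0.02) -> bool:
    """True if some number in the reply is within rel_tol of a canonical number."""
    targets = extract_numbers(canonical)
    if not targets:
        return False
    candidates = extract_numbers(reply)
    return any(
        abs(c - t) <= rel_tol * abs(t) for t in targets for c in candidates
    )
```

Percent signs are simply ignored, so "45%" and "45" compare equal; unit mismatches such as a canonical answer in millions against a reply in billions are only handled when both sides carry an explicit suffix.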
T2 #8: streaming responses (api/routes_chat.py)
New POST /chat/message/stream endpoint returning SSE events:
event: delta data: {"text": "..."} (60-char chunks)
event: done data: {"latency_ms": N}
event: error data: {"detail": "..."}
Records usage + writes audit log + updates conversation history just like the non-streaming variant.
T2 #9: cleared all datetime.utcnow() deprecation warnings
Updated 8 callsites across api/routes_auth, db/{audit_log, conversation, file, rag_trace, settings, user}_repository, and ingestion/metadata_generator to use datetime.now(timezone.utc). Warning count dropped 83 -> 12 on pytest runs.
T3 #14: cross-tenant search (api/routes_admin.py + frontend page)
GET /admin/search?q=...&tenants=... runs the hybrid retriever across every (or a subset of) tenant's Chroma store and returns a ranked flat list with tenant tags. Super-admin only. Verified live: q=quokka across all 7 tenants returns 29 hits primarily from organization + company vaults.
T3 #15: audit log viewer (db/audit_log_repository + /admin/audit-log + frontend)
audit_log_repository gains list_logs(limit, offset, username, action) and count_logs(). GET /admin/audit-log returns paginated events. New pages/AuditLog.jsx renders a filterable table.
T3 #16: prompt-cache verification (ai/models/anthropic_model.py)
AnthropicModel.generate now logs
anthropic usage: input=N cache_create=N cache_read=N output=N
on every response via a dedicated smartbaseai.anthropic logger, tagged with the request id from the middleware. Lets operators verify cache_control is actually being hit across turns.
T3 #21: metadata-aware chunking (new ai/chunking.py)
Two-tier chunker:
1. Split markdown on ## / ### headings (preserves 10-K sections like Risk Factors, MD&A, Balance Sheet in their own chunks)
2. Within each section, pack paragraphs with a MAX_CHUNK_CHARS=2000 hard cap; oversize paragraphs get whitespace-aligned hard-split.
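The two-tier chunker from T3 #21 can be sketched as follows. chunk_with_sections and MAX_CHUNK_CHARS are names from this commit message; split_sections, _hard_split, the returned dict keys, and the 200-character title cap are assumptions made for the sketch, which also applies the bold section-title prefix the commit describes.

```python
import re
from typing import Dict, List, Tuple

MAX_CHUNK_CHARS = 2000
_HEADING_RE = re.compile(r"^#{2,3}[ \t]+(.*)$", re.MULTILINE)


def split_sections(markdown: str) -> List[Tuple[str, str]]:
    """Tier 1: split on ## / ### headings into (title, body) pairs."""
    matches = list(_HEADING_RE.finditer(markdown))
    sections = []
    preamble = markdown[: matches[0].start()] if matches else markdown
    if preamble.strip():
        sections.append(("", preamble.strip()))
    for i, m in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(markdown)
        sections.append((m.group(1).strip(), markdown[m.end():end].strip()))
    return sections


def _hard_split(text: str, limit: int) -> List[str]:
    """Whitespace-aligned split of an oversized paragraph."""
    pieces = []
    while len(text) > limit:
        cut = text.rfind(" ", 0, limit + 1)
        if cut <= 0:
            cut = limit
        head = text[:cut].strip()
        if head:
            pieces.append(head)
        text = text[cut:].strip()
    if text:
        pieces.append(text)
    return pieces


def chunk_with_sections(markdown: str, limit: int = MAX_CHUNK_CHARS) -> List[Dict[str, str]]:
    """Tier 2: pack paragraphs per section, prefixing each chunk with its title."""
    chunks: List[Dict[str, str]] = []
    for title, body in split_sections(markdown):
        title = title[:200]  # assumed guard so the prefix cannot eat the budget
        prefix = f"**{title}**\n\n" if title else ""
        budget = limit - len(prefix)
        current = ""
        for para in (p.strip() for p in body.split("\n\n")):
            for piece in _hard_split(para, budget) if para else []:
                if current and len(current) + 2 + len(piece) > budget:
                    chunks.append({"section": title, "text": prefix + current})
                    current = piece
                else:
                    current = f"{current}\n\n{piece}" if current else piece
        if current:
            chunks.append({"section": title, "text": prefix + current})
    return chunks
```

Because the prefix is counted against the budget, every emitted chunk stays at or below the hard cap even for long sections.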
Each chunk is prefixed with **Section Title** so semantic similarity can match on the section name. chunk_with_sections() returns rich dicts with section metadata for callers that want it.
T4 #17: structured logging + request IDs (new api/logging_config.py)
New RequestIdMiddleware assigns uuid4 per request (or honors an incoming X-Request-ID header), propagates via contextvars, and returns the id in the response header. Log format:
HH:MM:SS INFO rid=abc123456789 smartbaseai.http: POST /chat/trace -> 200 (142.1ms)
Installed via configure_logging() in api/app.py.
T4 #18: rate limiting + cost tracking (new db/usage_repository.py)
llm_usage table records (tenant_id, username, provider, model_name, input_tokens, output_tokens, cache_read_tokens, cache_create_tokens, latency_ms, created_at) per request. chat_message + chat_message_stream both record on every call with a 4-chars-per-token heuristic for providers that don't expose usage.
Per-tenant daily cap: set "daily_token_cap" on the tenant config to enforce. Chat endpoint rejects with 429 when hit:
"Daily token cap reached for tenant 'X' (N/CAP)"
GET /admin/usage returns a per-day x per-tenant x per-provider rollup with estimated cost using blended prices (Anthropic/OpenAI public rates). Ollama is free/local.
New pages/UsageDashboard.jsx renders the rollup with totals across requests / tokens / estimated cost.
T4 #19: session timeout interceptor (frontend/src/api/api.js)
axios response interceptor catches 401 / "Token expired" / "Invalid token" and:
- clears localStorage (access_token, role, tenant_id, active_tenant, username)
- stashes the current path in sessionStorage.post_login_redirect
- window.location.assign('/login')
Login page reads post_login_redirect on success and bounces back.
# Frontend
T3 #10: markdown preview in Vault editor (pages/Vault.jsx)
New edit / split / preview toggle above the textarea. split mode shows raw markdown on the left and rendered output on the right.
Uses react-markdown (new dep).
T3 #11: export trace (pages/RagVisualizer.jsx)
Three new buttons above the pipeline diagram:
- 📋 Copy JSON: writes full trace to clipboard
- ⬇ .md: downloads a formatted markdown report (query, DB lookup, hybrid retrieval ranking, fusion block, LLM reply)
- ⬇ .json: downloads the raw trace JSON
Live alongside the existing "💾 Save trace" button.
T3 #12: bulk upload in Vault (pages/Vault.jsx)
File input becomes <input multiple>. upload() iterates every selected file, catches per-file errors, reports "Uploaded N/M files · K ingested." when multi.
T3 #13: error boundary (components/ErrorBoundary.jsx + App.jsx)
React class component wraps <AppRouter/>. Catches render errors with a recoverable fallback card (try again / reload app) instead of blanking the page.
# Login / app plumbing
Login.jsx now also stashes username in localStorage and honors the post_login_redirect bounce.
components/Layout.jsx gets new sidebar links for super_admin (Cross-tenant Search / Usage / Audit Log / Settings).
# Tests
24/24 pytest green. Warning count dropped from 83 to 12 due to the datetime.utcnow() cleanup.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
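For the streaming endpoint in this batch (T2 #8), the SSE framing with 60-character delta chunks might be produced by a generator like the one below. The function names are hypothetical; only the event names, payload keys, and chunk size come from the commit message.

```python
import json
from typing import Iterator


def sse_event(event: str, data: dict) -> str:
    """Format one Server-Sent Event: an event line, a data line, a blank line."""
    return f"event: {event}\ndata: {json.dumps(data)}\n\n"


def stream_reply(reply: str, latency_ms: int, chunk_chars: int = 60) -> Iterator[str]:
    """Yield the reply as delta events, then a final done event."""
    for start in range(0, len(reply), chunk_chars):
        yield sse_event("delta", {"text": reply[start:start + chunk_chars]})
    yield sse_event("done", {"latency_ms": latency_ms})
```

In a FastAPI route, a generator like this is typically returned through StreamingResponse with media_type="text/event-stream"; an error path would emit sse_event("error", {"detail": ...}) instead of the done event.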
The per-question progress line was printing unicode check/cross glyphs, which crashed the loop on the very first question under Windows default cp1252 stdout encoding. Replaced with PASS/FAIL and force-encoded the truncated reply snippet as ascii with '?' for untranslatable chars.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
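The ASCII-safe progress line from the fix above could look like this sketch. The function names, the line layout, and the 80-character truncation are assumptions; the PASS/FAIL labels and the '?' replacement are from the commit message.

```python
def ascii_snippet(reply: str, max_chars: int = 80) -> str:
    """Truncate the reply and replace anything outside ASCII with '?'."""
    snippet = reply[:max_chars].replace("\n", " ")
    return snippet.encode("ascii", errors="replace").decode("ascii")


def progress_line(index: int, total: int, passed: bool, reply: str) -> str:
    """Build a progress line that prints safely on a cp1252 console."""
    status = "PASS" if passed else "FAIL"
    return f"[{index}/{total}] {status} {ascii_snippet(reply)}"
```

Encoding with errors="replace" is what turns each untranslatable character into a single '?', so the output is safe regardless of the console's code page.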
GitHub rejects pushes that touch .github/workflows/ from a PAT without workflow scope. Shipping the workflow as docs/ci-workflow.yml.template with activation instructions in the header so it is version-controlled but not at its final path yet.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Eight-step narrated walkthrough that covers: login gate, vault picker, the DB+RAG fusion killer query (AAPL price via daily_bars exact lookup), fundamentals query with vector store info strip, saved-trace zero-token replay, tree-search file browser, cross-tenant search, usage dashboard. Plus a 30-second short version for a 5-MB README GIF.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
First end-to-end run of scripts/eval_financebench.py against the live financebench tenant (506 companies, 197,201 Chroma vectors, 150 ground-truth Q&A from the Patronus open-source set).
## Headline
Model: claude-opus-4-6 (real API)
Sample: 20 questions (--limit 20)
Numeric match: 9 / 20 (45.0%, 2% tolerance)
Per-type split: domain-relevant 5/10 (50%), novel-generated 4/10 (40%)
Mean latency: 43 s / question (includes retrieval + Opus call)
## What the number actually means
This is a **lower bound**. The scorer looks for numeric matches with 2% tolerance plus substring matches on the canonical answer. Several failures are correct-in-spirit answers the scorer can't recognize:
- JPM gross-margin question: Opus correctly explains why gross margins are not a relevant metric for a bank (canonical answer says exactly the same thing in one sentence; scorer looks for the one sentence).
- Boeing cyclicality: Opus answers "yes, subject to cyclicality" with a multi-paragraph justification. Canonical is a one-word "Yes". Substring match fails.
A Claude-as-judge run (available via --judge flag, ~$3 extra spend) would typically raise the headline number by 5-10pp. Left for later.
## Where the system actually fails
Several domain-relevant questions (JPM Q1 2021 segment revenue, AMD FY22 quick ratio, Verizon FY22 quick ratio) show a different failure mode: the retriever is pulling chunks from the wrong ticker's 10-K. For AMD and Verizon, Opus literally reports that the retrieved context is about Citizens Financial Group and Pfizer instead. This is cross-ticker retrieval bleed, a side effect of the shared Chroma collection with 197k vectors from 506 companies and no tenant-level ticker filter on the query.
Fix direction (future work):
(a) pass a ticker filter into the Chroma where= clause when a ticker is detected in the question,
(b) use a per-ticker metadata boost in the hybrid retriever, or
(c) ingest per-ticker sub-collections.
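Fix direction (a) could be sketched as below: detect a ticker in the question and, if found, add a metadata filter to the query arguments. The function names and the uppercase-token heuristic are assumptions, not the project's detector; the sketch only builds the keyword arguments, which follow the shape accepted by a Chroma collection's query method (query_texts, n_results, where).

```python
import re
from typing import Dict, Optional, Set


def detect_ticker(question: str, known_tickers: Set[str]) -> Optional[str]:
    """Return the first standalone uppercase token that is a known ticker."""
    for token in re.findall(r"\b[A-Z]{1,5}\b", question):
        if token in known_tickers:
            return token
    return None


def build_query_kwargs(question: str, known_tickers: Set[str], n_results: int = 8) -> Dict:
    """Build query arguments, restricted to one ticker when one is detected."""
    kwargs: Dict = {"query_texts": [question], "n_results": n_results}
    ticker = detect_ticker(question, known_tickers)
    if ticker:
        # Relies on the "ticker" metadata key written at ingest time.
        kwargs["where"] = {"ticker": ticker}
    return kwargs
```

This heuristic only catches questions that spell out the ticker symbol. A question that says "Verizon" rather than "VZ" would still run unfiltered, so a company-name lookup table would likely be needed alongside it.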
This is now the top item in the retrieval- quality backlog. ## How to reproduce python scripts/eval_financebench.py --base http://127.0.0.1:8799 --model-provider anthropic --model-name claude-opus-4-6 --limit 20 Results are written to docs/financebench_eval.md with per-question type breakdown, latency stats, and sample failures. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
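The auto-scoring rule described in this commit message (a numeric match within 2% tolerance, otherwise a substring match on the canonical answer) can be sketched roughly as below. This is a minimal illustration, not the code in scripts/eval_financebench.py: the function names `extract_numbers`, `numeric_match`, `substring_match`, and `score` are hypothetical, and the "any number in the answer against any number in the canonical answer" comparison is an assumption.

```python
import re

# Hypothetical helpers; the real scorer in scripts/eval_financebench.py may differ.
# Signs are ignored in this sketch, so "-5" and "5" are treated as the same magnitude.
_NUMBER = re.compile(r"\d[\d,]*\.?\d*")


def extract_numbers(text: str) -> list:
    """Pull plain decimal numbers out of free text, dropping thousands separators."""
    values = []
    for token in _NUMBER.findall(text):
        cleaned = token.replace(",", "").rstrip(".")
        if cleaned:
            values.append(float(cleaned))
    return values


def numeric_match(answer: str, expected: str, tolerance: float = 0.02) -> bool:
    """True if any number in the answer is within the relative tolerance of any expected number."""
    for want in extract_numbers(expected):
        for got in extract_numbers(answer):
            if want == 0:
                if got == 0:
                    return True
            elif abs(got - want) / abs(want) <= tolerance:
                return True
    return False


def substring_match(answer: str, expected: str) -> bool:
    """Case-insensitive check that the canonical answer appears verbatim in the model answer."""
    needle = expected.strip().lower()
    return bool(needle) and needle in answer.lower()


def score(answer: str, expected: str) -> bool:
    """A question counts as correct if either strategy matches."""
    return numeric_match(answer, expected) or substring_match(answer, expected)
```

Under these assumptions, an answer that paraphrases a one-sentence canonical answer without quoting it scores as a miss, which is why the headline number is best read as a lower bound.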
## 📊 First FinanceBench eval run: 45% auto-score with Claude Opus 4.6

Just ran
## What the number means

Lower bound. The auto-scorer looks for numeric matches (2% tolerance) and substring matches on the canonical answer. Several failures are correct-in-spirit answers the scorer can't recognize, e.g. the JPM gross margins question, where Opus correctly explains why gross margin isn't a relevant metric for a bank (matches the canonical answer's intent exactly) but the substring matcher can't find the one-sentence canonical answer inside a 4-paragraph explanation. A Claude-as-judge run (

## Where the system actually fails: interesting finding

Cross-ticker retrieval bleed. Several domain-relevant questions (AMD FY22 quick ratio, Verizon FY22 quick ratio, JPM Q1 2021 segment revenue) failed because the retriever pulled chunks from the wrong ticker's 10-K. Opus literally reports that the retrieved context is about Citizens Financial Group and Pfizer when asked about AMD and Verizon. Root cause: the financebench tenant shares one Chroma collection with 197,201 vectors from 506 companies, and the hybrid retriever has no ticker-level filter on the

## Fix direction (next iteration of retrieval quality)

Top item for the retrieval-quality backlog:
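One fix direction named in the eval commit message is passing a ticker filter into the Chroma `where=` clause when a ticker is detected in the question. A minimal sketch of the detection half is below. Everything here is an assumption for illustration: the helper names `sniff_ticker` and `build_where` are hypothetical, and the `"ticker"` metadata key presumes each chunk was tagged with its ticker at ingest time.

```python
import re
from typing import Optional

# Hypothetical helpers, not the project's retriever code.
# Matches standalone runs of 1-5 capital letters, e.g. "AMD" but not "FY22".
_TICKER_TOKEN = re.compile(r"\b[A-Z]{1,5}\b")


def sniff_ticker(question: str, known_tickers: set) -> Optional[str]:
    """Return the first all-caps token that is a known ticker, else None."""
    for token in _TICKER_TOKEN.findall(question):
        if token in known_tickers:
            return token
    return None


def build_where(question: str, known_tickers: set) -> Optional[dict]:
    """Build a metadata filter dict, or None when no ticker is detected."""
    ticker = sniff_ticker(question, known_tickers)
    # Assumes chunks carry a "ticker" metadata field.
    return {"ticker": ticker} if ticker else None


# With chromadb, the filter would then be passed through as:
#   collection.query(query_texts=[question], n_results=5, where=build_where(question, tickers))
```

A question that names the company rather than the symbol ("Verizon" instead of "VZ") yields no filter in this sketch, so a company-name-to-ticker map would likely be needed as well.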
Predicted: cross-ticker bleed failures should drop to zero, lifting the headline from 45% → ~60%.

## Status
Long-running `upgrade` branch: 19 commits that take SmartBaseAI from a half-documented starter kit with mocked model wrappers into a working multi-tenant knowledge platform with real LLM providers, interactive retrieval visualization, a 506-company fundamental-analysis demo, and a full operational surface (audit log, usage tracking, cross-tenant search, session timeout handling, rate limiting, structured logging).

## Headline highlights
- 6 seeded knowledge vaults (`personal`, `company` / Acme Analytics, `organization` / Project Lunar Harbor, `relativity`, `saas-ai`, `smartbase-docs`) plus one large-scale vault `financebench` with 506 S&P 500 companies, 1,241 markdown files, 197,201 Chroma vectors, 1.23 M daily bar rows, and 150 ground-truth Q&A pulled from the FinanceBench open-source benchmark.
- Real LLM providers: `AnthropicModel` / `OpenAIModel` / `OllamaModel` now call the live SDKs (previously mocked). Anthropic defaults to `claude-opus-4-6` with `cache_control: ephemeral` on the system prompt per the `claude-api` skill guidance, and logs `cache_read_input_tokens` per response for verification. Tenants can declare multiple providers; users pick at chat time via a dropdown, with per-request override threaded through `/chat/message` and `/chat/trace`.
- RAG Visualizer (new page): live view of a single query flowing through the three-source orchestrator. Shows the vector-store info strip (tenant, collection, vector count, embedding model, device), full semantic candidate ranking with L2 distance bars, keyword hits, fusion block, assembled prompt, and LLM reply. Persist traces with 💾 Save trace, reload later without re-running the LLM (zero-token replay for demos). Export to JSON / Markdown.
- Interactive Vault editor: 1,744-file tree browser with live search (`components/FileTree.jsx`), collapsible per-ticker folders, inline match highlighting, markdown edit/split/preview toggle, bulk upload, automatic section-aware re-chunk + re-embed on save.
- Generalised `exact_lookup`: the DB-preferred fusion fast path now dispatches by tenant and table. Financebench queries with a ticker hint (sniffed from the message) hit `daily_bars(ticker, date, open, high, low, close, volume)` and return the exact row. Verified live: "AAPL closing price on 2024-06-14" → `close=212.49, volume=70.1M` directly from SQLite; RAG retrieval runs as supplemental context only.
- FinanceBench eval harness (`scripts/eval_financebench.py`): runs the 150 ground-truth questions through `/chat/message`, scores with three strategies (numeric 2% tolerance, substring, optional Claude-as-judge), writes `docs/financebench_eval.md` with headline accuracy, per-question-type breakdown, latency stats, and sample failures.
- Production surface: structured logging with request-id middleware, per-tenant daily token cap with 429 on overage, usage rollup (`/admin/usage`), audit log viewer (`/admin/audit-log`), cross-tenant search (`/admin/search`), session-timeout redirect, error boundary, strict `.env` loader that refuses to leak shell env vars.

## Commits (newest first)

## Test state

`datetime.utcnow()` cleanup.

## CI (pending: needs a token with `workflow` scope)

A `.github/workflows/test.yml` is ready locally (backend pytest + frontend vite build, both on Python 3.12 / Node 20 with pip/npm caching). GitHub rejected the push because the current PAT lacks `workflow` scope. Run `gh auth refresh -s workflow`, then the file can be committed to complete CI setup.

## Verified live before opening this PR

- `close=212.49, volume=70,122,700` straight from `daily_bars`
- financebench: 1,744 files visible recursively

## What is still outstanding

- `docs/screenshots/` empty)
- `docs/financebench_eval.md` will be generated separately

## How to run locally

## Default logins

- `admin / ChangeThis123!` (cross-tenant)
- `alice / Alice123!`
- `demo / Demo123!`
- `orion / Orion123!`
- `einat / Einat123!`
- `nadav / Nadav123!`
- `docs / Docs1234!`
- `fbuser / FbUser123!`

🤖 Generated with Claude Code