RAG-powered chat system for agricultural technical support, with hallucination verification through semantic entropy.
Agricultural extension workers and agronomists need quick, reliable answers to technical questions about crop management, soil treatment, pest control, and planting schedules. Traditional search through dense PDF manuals is slow and error-prone.
SmartB100 indexes agricultural PDF documents into a vector database and uses a local LLM to generate answers grounded in the indexed content. The system adapts response complexity to the user's expertise level (beginner, intermediate, expert) and flags potentially hallucinated answers using semantic entropy scoring, so users know when to double-check the information.
SmartB100 is a modular monolith with composed deployment:
- One application process. api/main.py loads every domain module (api/routes/*, core/*, retrieval/*, memory/*, generation/*, verification/*, profiling/*, database/*) into a single FastAPI runtime. Inter-module communication is function calls inside the same Python interpreter: no RPC, no message broker, no queue.
- Eight internal layers, one binary. The folder boundary is a convention for testability and review; it is not a network boundary.
- External processes are limited to genuine third-party services. No domain code lives outside the API process.
External components (Qdrant and Ollama each run in their own process; SQLite is an embedded, file-backed database):
| Component | Role | Containerized? | Protocol |
|---|---|---|---|
| Qdrant | Vector DB (archives_v2 collection, 768-dim embeddings) | Yes (docker compose --profile infra) | HTTP REST :6333 + gRPC :6334 |
| Ollama | LLM chat (llama3.2:3b) + embeddings (nomic-embed-text) | No (runs on the host) | HTTP REST :11434 via OLLAMA_HOST |
| SQLite | Auth + conversation history | No (filesystem) | Bind-mount ./smartb100_v2.db:/app/smartb100_v2.db |
Client tier (two paths):
- Gradio UI (ui/chat_ui.py): stateless HTTP client containerized via docker compose --profile app. Calls only POST /chat. Does not import any domain module; it is a UI shell, not a microservice.
- Direct HTTP: curl, scripts, future mobile clients. Same endpoint, same JSON contract.
Why not microservices. The RAG pipeline (embed → search → generate → verify) shares the same ChatRequest/ChatResponse model and runs synchronously within a single request. Splitting any step into its own service would add network latency between calls that are currently in-process, plus contract-versioning overhead, without delivering independent scaling benefit at current load.
When to reconsider. If verification/ (entropy sampling, the slowest step) needs to scale independently of generation/, or if the workload grows beyond ~500 req/s, the verification gate is the natural extraction point — it already has a clean async-friendly interface (evaluate(question, context, answer)).
flowchart TD
subgraph CLIENT["Client"]
GRADIO["Gradio UI\n:7860"]
CURL["curl / HTTP"]
end
subgraph API["API Layer"]
ENDPOINT["POST /chat"]
AUTH["POST /auth/*"]
HEALTH["GET /health"]
end
subgraph PIPELINE["RAG Pipeline"]
EMBED["Embedder\nOllama nomic-embed-text\n768 dims"]
SEARCH["Vector Search\nCosine Similarity"]
MEMORY["ConversationBuffer\nFIFO deque (maxlen=10)"]
PROFILE["Profiling\nbeginner | intermediate | expert"]
LLM["LLM Generator\nOllama llama3.2:3b"]
end
subgraph VERIFY["Verification"]
ENTROPY["Semantic Entropy\nMulti-provider (Groq/Ollama/OpenRouter)"]
GATE["Verification Gate\nRetry + Fallback"]
end
subgraph DATA["Data Layer"]
QDRANT[("Qdrant\n:6333\narchives_v2")]
SQLITE[("SQLite\nusers / conversations")]
end
GRADIO -->|HTTP JSON| ENDPOINT
CURL -->|HTTP JSON| ENDPOINT
ENDPOINT --> EMBED
EMBED --> SEARCH
SEARCH --> QDRANT
ENDPOINT --> MEMORY
MEMORY -.->|history| LLM
SEARCH -->|context| PROFILE
PROFILE --> LLM
LLM --> GATE
GATE -->|verification_enabled| ENTROPY
ENTROPY -->|score| GATE
GATE -->|retry if high entropy| LLM
GATE --> RESPONSE["ChatResponse\n{answer, hallucination_score}"]
AUTH --> SQLITE
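The retry loop between the verification gate and the generator in the diagram above could look roughly like the sketch below. It is not the project's implementation: the `verification_gate` name, the `GateResult` type, the 0.5 threshold, the retry budget, and the fallback text are all assumptions chosen for illustration. The generator and the entropy scorer are passed in as plain callables so the sketch runs without Ollama.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class GateResult:
    answer: str
    hallucination_score: float


def verification_gate(
    question: str,
    context: str,
    generate: Callable[[str, str], str],
    score: Callable[[str, str, str], float],
    threshold: float = 0.5,  # assumed cut-off, not taken from the project
    max_retries: int = 2,    # assumed retry budget
    fallback: str = "I could not find a reliable answer in the indexed documents.",
) -> GateResult:
    """Regenerate while the entropy score is high; fall back when retries run out."""
    attempts = max(1, max_retries + 1)
    best: Optional[GateResult] = None
    for _ in range(attempts):
        answer = generate(question, context)
        entropy = score(question, context, answer)
        # Remember the lowest-entropy attempt seen so far.
        if best is None or entropy < best.hallucination_score:
            best = GateResult(answer, entropy)
        if entropy <= threshold:
            return best
    # Every attempt scored above the threshold: hand back the fallback text
    # together with the lowest score observed, so the caller can still flag it.
    return GateResult(fallback, best.hallucination_score)
```

Because both dependencies are injected, the same function can be exercised in unit tests with stub callables and later pointed at a real generator and scorer.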
RAG Pipeline Flow:
sequenceDiagram
participant C as Client
participant A as API /chat
participant E as Embedder
participant Q as Qdrant
participant G as LLM Generator
participant V as Verification Gate
C->>A: POST /chat {session_id, question, profile}
A->>E: generate_embedding(question)
E-->>A: vector[768]
A->>Q: search_context(vector, top_k=3)
Q-->>A: chunks[]
A->>G: generate(question, context, history, profile)
G-->>A: answer
alt verification_enabled
A->>V: evaluate(question, context, answer)
V-->>A: {answer, hallucination_score}
end
A-->>C: ChatResponse {answer, hallucination_score}
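Read as code, the sequence above is a chain of in-process function calls. The sketch below mirrors the diagram with injected callables; `handle_chat` and its parameter names are hypothetical, the `deque(maxlen=10)` stands in for the ConversationBuffer, and returning `None` as the score when verification is disabled is an assumption rather than documented behavior.

```python
from collections import deque
from typing import Callable, List, Optional, Sequence


def handle_chat(
    question: str,
    profile: str,
    history: deque,
    embed: Callable[[str], Sequence[float]],
    search: Callable[[Sequence[float], int], List[str]],
    generate: Callable[[str, List[str], list, str], str],
    evaluate: Optional[Callable[[str, List[str], str], float]] = None,
    top_k: int = 3,
) -> dict:
    """Embed, search, generate, then optionally verify, all in one process."""
    vector = embed(question)                 # Embedder: question -> vector
    chunks = search(vector, top_k)           # Qdrant: vector -> top_k chunks
    answer = generate(question, chunks, list(history), profile)
    # Verification is optional, matching the `alt verification_enabled` branch.
    score = evaluate(question, chunks, answer) if evaluate else None
    # A bounded deque drops the oldest turn automatically once it is full.
    history.append((question, answer))
    return {"answer": answer, "hallucination_score": score}
```

A caller would create the buffer once per session, for example `history = deque(maxlen=10)`, and pass the same object on every turn.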
Deployment Topology:
flowchart LR
subgraph CLIENTS["Clients"]
direction TB
BROWSER["Browser"]
SCRIPTS["curl / scripts"]
end
subgraph HOST["Developer host"]
OLLAMA["Ollama :11434<br/>llama3.2:3b + nomic-embed-text"]
end
subgraph COMPOSE["docker-compose stack"]
direction TB
subgraph INFRA["profile: infra"]
QDRANT[("Qdrant<br/>:6333 REST / :6334 gRPC")]
end
subgraph APP["profile: app"]
API["FastAPI :8000<br/>monolith binary"]
GRADIO["Gradio :7860"]
SQLITE[("SQLite<br/>bind-mount")]
end
end
BROWSER -->|HTTP| GRADIO
SCRIPTS -->|HTTP /chat| API
GRADIO -->|HTTP /chat| API
API -->|HTTP REST| QDRANT
API -->|HTTP /api/chat,<br/>/api/embeddings| OLLAMA
API -. SQLAlchemy .-> SQLITE
The two earlier diagrams are logical (what runs); this one is topological (where it runs). They complement each other rather than duplicate.
| Decision | Rationale |
|---|---|
| Ollama for all embeddings | Even when generation uses Groq or OpenRouter, embeddings for entropy clustering use Ollama (nomic-embed-text) locally. Free, fast, no external API dependency for embeddings. |
| Semantic entropy over binary classifiers | Generates N candidate responses, clusters by semantic similarity, computes Shannon entropy. Higher entropy = less agreement between candidates = higher hallucination risk. Produces a continuous score (0.0-1.0) instead of a binary flag. |
| Multi-provider verification | Replaced OpenAI-only verification with Groq/Ollama/OpenRouter dispatch. Removes hard dependency on paid API for hallucination checks. |
| Ollama embeddings with retries + backoff | Centralized in retrieval/ollama_embeddings.py: truncation at 8192 chars, 6 attempts, exponential backoff up to 60s. Handles ResponseError, ConnectionError, httpx errors, and OSError. Used by chunker, embedder, and entropy verification. |
| SQLite with pathlib + POSIX URLs | database/db.py uses Path.as_posix() for SQLite connection strings. On Windows with Docker bind mounts, the host may create smartb100_v2.db as a directory instead of a file; the API raises RuntimeError with a clear message if this happens. |
| Sync endpoint for /chat | def chat() instead of async def chat(). FastAPI runs sync handlers in a thread pool, which frees the event loop for /health and other concurrent requests while the LLM blocks. |
| mypy ignore_missing_imports=true | Ollama, qdrant-client, and other dependencies lack type stubs. Avoids false positives without compromising type checking on project code. |
| Profile-aware system prompts | Three expertise levels (beginner, intermediate, expert) select different system prompts. Same RAG context, different response complexity. No separate models or fine-tuning needed. |
| bcrypt + JWT gate on /chat | Passwords hashed with bcrypt (timing-safe verify via passlib); /chat requires Authorization: Bearer <JWT>. Rate-limit via slowapi: 5 logins / 15 min and 3 registrations / hour per IP. JWT_SECRET_KEY must be ≥32 chars (validated at startup). Breaking: users created before this gate (SHA-256) must be re-registered. |
| SQLite integrity hardening | NOT NULL on required columns, CASCADE on FKs, Boolean is_hallucinated, timezone-aware created_at, connect_args["timeout"]=10, and a PRAGMA foreign_keys=ON listener so CASCADE actually fires in SQLite. Breaking: old databases must be recreated — delete smartb100_v2.db and let Base.metadata.create_all regenerate the schema on next API startup. |
| Modular monolith over microservices | One FastAPI process loads all domain modules (retrieval/, memory/, generation/, verification/, profiling/) — inter-module calls are function calls, not RPC. External services are limited to genuine third-party (Qdrant, Ollama, SQLite). Microservices would add network latency between RAG steps that share the same ChatRequest/ChatResponse model, without isolation benefit at current scale. See § Architectural Style for full rationale. |
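The semantic entropy decision in the table can be illustrated with a small, dependency-free sketch. It assumes the N candidate answers have already been embedded (in the project that would be nomic-embed-text via Ollama) and makes three simplifying assumptions that may not match verification/: greedy clustering against each cluster's first member, a cosine threshold of 0.8, and normalization by ln(N) to keep the score in the 0.0 to 1.0 range.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def semantic_entropy(embeddings: Sequence[Sequence[float]], threshold: float = 0.8) -> float:
    """Normalized Shannon entropy over clusters of candidate-answer embeddings."""
    n = len(embeddings)
    if n < 2:
        return 0.0  # a single candidate cannot disagree with itself

    # Greedy clustering (assumed strategy): join the first cluster whose
    # representative is similar enough, otherwise open a new cluster.
    clusters: List[List[int]] = []
    for i, vec in enumerate(embeddings):
        for cluster in clusters:
            if cosine(vec, embeddings[cluster[0]]) >= threshold:
                cluster.append(i)
                break
        else:
            clusters.append([i])

    # Shannon entropy of the cluster-size distribution.
    probs = [len(cluster) / n for cluster in clusters]
    entropy = -sum(p * math.log(p) for p in probs)

    # Dividing by ln(n), the maximum possible entropy, maps the score to [0, 1].
    return max(0.0, entropy / math.log(n))
```

When all candidates land in one cluster the score is 0.0; when every candidate forms its own cluster the score approaches 1.0, which is the "less agreement, higher risk" reading described in the table.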
# 1. Pull models
ollama pull llama3.2:3b && ollama pull nomic-embed-text
# 2. Install dependencies
uv sync # or: python -m venv .venv && .venv/bin/pip install -e .
# 3. Configure environment
cp .env.example .env # defaults work for local dev
# 4. Start Qdrant
docker compose --profile infra up -d
# 5. Index documents (first run only)
.venv/bin/python database/semantic_chunker.py index ./archives/
# 6. Start API
.venv/bin/python -m uvicorn api.main:app --reload
# 7. (Optional) Start Gradio UI
.venv/bin/python ui/chat_ui.py

Windows users: replace .venv/bin/python with .venv\Scripts\python.exe, or use .\start.bat / .\start.ps1 after steps 1-3.
Full Docker deployment: docker compose --profile infra --profile app up -d
The compose stack uses a multi-stage Dockerfile.api (no build-essential
in the final image), healthchecks that gate depends_on ordering, and
log rotation (max-size: 10m, max-file: 3). On Linux the OLLAMA_HOST
override is required — see SETUP.md §9.1.
See SETUP.md for remote Qdrant configuration.
curl http://localhost:6333/healthz # Qdrant: "healthz check passed"
curl http://localhost:8000/health # API: {"status":"ok"}

| Endpoint | Description |
|---|---|
| POST /chat | RAG query (requires JWT); returns answer with hallucination score |
| POST /auth/register | Creates new user (rate-limit 3/hour per IP) |
| POST /auth/token | OAuth2 login; returns JWT (rate-limit 5 / 15min per IP) |
| GET /health | API health status |
POST /chat:
TOKEN=$(curl -s -X POST "http://localhost:8000/auth/token" \
-d "username=demo&password=long-enough-pw" | jq -r .access_token)
curl -X POST "http://localhost:8000/chat" \
-H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
-d '{
"session_id": "demo-session",
"question": "Qual a epoca ideal de plantio da soja?",
"profile": {"name": "User", "expertise": "beginner"}
}'
# {"answer": "...", "hallucination_score": 0.18}

Without the Authorization header the API returns 401 Unauthorized.
| Request Field | Type | Description |
|---|---|---|
| session_id | string | UUID for conversation continuity |
| question | string | User query |
| profile.expertise | enum | beginner \| intermediate \| expert |

| Response Field | Type | Description |
|---|---|---|
| answer | string | Generated response adapted to expertise level |
| hallucination_score | float | 0.0 (grounded) to 1.0 (likely hallucinated) |
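For scripted clients, the request and response fields above map onto a few lines of standard-library Python. This is a sketch only: `build_chat_request` and `parse_chat_response` are hypothetical helper names, the base URL is the local default from the quick-start steps, and the range check on the score simply restates the documented 0.0 to 1.0 contract.

```python
import json
import urllib.request
from typing import Tuple

API_URL = "http://localhost:8000"  # local default; adjust for other deployments
EXPERTISE_LEVELS = {"beginner", "intermediate", "expert"}


def build_chat_request(
    token: str, session_id: str, question: str, expertise: str, name: str = "User"
) -> urllib.request.Request:
    """Build the POST /chat request without sending it."""
    if expertise not in EXPERTISE_LEVELS:
        raise ValueError(f"unknown expertise level: {expertise!r}")
    payload = {
        "session_id": session_id,
        "question": question,
        "profile": {"name": name, "expertise": expertise},
    }
    return urllib.request.Request(
        f"{API_URL}/chat",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",  # JWT from POST /auth/token
            "Content-Type": "application/json",
        },
        method="POST",
    )


def parse_chat_response(body: bytes) -> Tuple[str, float]:
    """Extract (answer, hallucination_score) and check the documented range."""
    data = json.loads(body)
    score = float(data["hallucination_score"])
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"hallucination_score out of range: {score}")
    return data["answer"], score
```

Sending the request is then `with urllib.request.urlopen(req) as resp: answer, score = parse_chat_response(resp.read())`; a missing or invalid token surfaces as an `urllib.error.HTTPError` with status 401.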
sb100_agents/
├── api/ # FastAPI backend
│ ├── main.py # App entry (CORS + routers + lifespan)
│ └── routes/ # chat.py, auth.py, health.py
├── core/ # Pydantic schemas & configuration
├── retrieval/ # Embeddings + Qdrant vector search
├── generation/ # LLM response generation
├── memory/ # Conversation buffer (FIFO)
├── profiling/ # User expertise adaptation
├── verification/ # Semantic entropy + verification gate
├── database/ # SQLite + PDF semantic chunking
├── eval/ # 5-step evaluation pipeline
├── ui/ # Gradio chat interface
├── tests/ # Unit + integration tests
├── .claude/ # Agent workflow enforcement
│ ├── rules/ # Directive files (00-12)
│ ├── guia-configuracao-codex.md # Codex plugin setup guide
│ ├── registry.md # Project state & history
│ ├── tasks.md # Task registry
│ └── hooks/ # Git hooks (commit-msg, pre-commit, etc.)
├── .github/workflows/ # CI + Claude Code automation
├── .dockerignore # Shrinks build context (drops .git, tests/, eval/, .claude/, etc.)
├── Dockerfile.api # Multi-stage build (builder + runtime)
├── docker-compose.yml # Qdrant (infra) + API+Gradio (app) with healthchecks
└── pyproject.toml
| Feature | Description |
|---|---|
| Hybrid search | Dense + sparse vectors with RRF fusion |
| LangGraph migration | ReAct agent with agricultural intent filter |
| Claim Verification | Atomic decomposition + RAG-based fact checking |
| Streaming | SSE for incremental responses |
Issues labeled claude-auto are automatically implemented by Claude Code via GitHub Actions. Mention @claude in any issue or PR comment for interactive assistance.
Setup: add ANTHROPIC_API_KEY secret and create the claude-auto label.
See CONTRIBUTING.md. Quick summary: fork, branch (type/TASK-NNN-description), tests, Conventional Commits, PR.