SmartB100 — Agriculture RAG Agent

RAG-powered chat system for agricultural technical support, with hallucination verification through semantic entropy.

Why This Exists

Agricultural extension workers and agronomists need quick, reliable answers to technical questions about crop management, soil treatment, pest control, and planting schedules. Traditional search through dense PDF manuals is slow and error-prone.

SmartB100 indexes agricultural PDF documents into a vector database and uses a local LLM to generate answers grounded in the indexed content. The system adapts response complexity to the user's expertise level (beginner, intermediate, expert) and flags potentially hallucinated answers using semantic entropy scoring, so users know when to double-check the information.

Architecture

Architectural Style

SmartB100 is a modular monolith with composed deployment:

One application process. api/main.py loads every domain module (api/routes/*, core/*, retrieval/*, memory/*, generation/*, verification/*, profiling/*, database/*) into a single FastAPI runtime. Inter-module communication is function calls inside the same Python interpreter — no RPC, no message broker, no queue.
Eight internal layers, one binary. The folder boundary is a convention for testability and review; it is not a network boundary.
External processes are limited to genuine third-party services. No domain code lives outside the API process.

External components (each runs in its own process):

Component	Role	Containerized?	Protocol
Qdrant	Vector DB (`archives_v2` collection, 768-dim embeddings)	Yes — `docker compose --profile infra`	HTTP REST `:6333` + gRPC `:6334`
Ollama	LLM chat (`llama3.2:3b`) + embeddings (`nomic-embed-text`)	No — runs on the host	HTTP REST `:11434` via `OLLAMA_HOST`
SQLite	Auth + conversation history	No (filesystem)	Bind-mount `./smartb100_v2.db:/app/smartb100_v2.db`

Client tier (two paths):

Gradio UI (ui/chat_ui.py) — stateless HTTP client containerized via docker compose --profile app. Calls only POST /chat. Does not import any domain module — it is a UI shell, not a microservice.
Direct HTTP — curl, scripts, future mobile clients. Same endpoint, same JSON contract.

Why not microservices. The RAG pipeline (embed → search → generate → verify) shares the same ChatRequest/ChatResponse model and runs synchronously within a single request. Splitting any step into its own service would add network latency between calls that are currently in-process, plus contract-versioning overhead, without delivering independent scaling benefit at current load.

When to reconsider. If verification/ (entropy sampling, the slowest step) needs to scale independently of generation/, or if the workload grows beyond ~500 req/s, the verification gate is the natural extraction point — it already has a clean async-friendly interface (evaluate(question, context, answer)).

flowchart TD
    subgraph CLIENT["Client"]
        GRADIO["Gradio UI\n:7860"]
        CURL["curl / HTTP"]
    end

    subgraph API["API Layer"]
        ENDPOINT["POST /chat"]
        AUTH["POST /auth/*"]
        HEALTH["GET /health"]
    end

    subgraph PIPELINE["RAG Pipeline"]
        EMBED["Embedder\nOllama nomic-embed-text\n768 dims"]
        SEARCH["Vector Search\nCosine Similarity"]
        MEMORY["ConversationBuffer\nFIFO deque (maxlen=10)"]
        PROFILE["Profiling\nbeginner | intermediate | expert"]
        LLM["LLM Generator\nOllama llama3.2:3b"]
    end

    subgraph VERIFY["Verification"]
        ENTROPY["Semantic Entropy\nMulti-provider (Groq/Ollama/OpenRouter)"]
        GATE["Verification Gate\nRetry + Fallback"]
    end

    subgraph DATA["Data Layer"]
        QDRANT[("Qdrant\n:6333\narchives_v2")]
        SQLITE[("SQLite\nusers / conversations")]
    end

    GRADIO -->|HTTP JSON| ENDPOINT
    CURL -->|HTTP JSON| ENDPOINT

    ENDPOINT --> EMBED
    EMBED --> SEARCH
    SEARCH --> QDRANT

    ENDPOINT --> MEMORY
    MEMORY -.->|history| LLM
    SEARCH -->|context| PROFILE
    PROFILE --> LLM

    LLM --> GATE
    GATE -->|verification_enabled| ENTROPY
    ENTROPY -->|score| GATE
    GATE -->|retry if high entropy| LLM

    GATE --> RESPONSE["ChatResponse\n{answer, hallucination_score}"]

    AUTH --> SQLITE

RAG Pipeline Flow:

sequenceDiagram
    participant C as Client
    participant A as API /chat
    participant E as Embedder
    participant Q as Qdrant
    participant G as LLM Generator
    participant V as Verification Gate

    C->>A: POST /chat {session_id, question, profile}
    A->>E: generate_embedding(question)
    E-->>A: vector[768]
    A->>Q: search_context(vector, top_k=3)
    Q-->>A: chunks[]
    A->>G: generate(question, context, history, profile)
    G-->>A: answer
    alt verification_enabled
        A->>V: evaluate(question, context, answer)
        V-->>A: {answer, hallucination_score}
    end
    A-->>C: ChatResponse {answer, hallucination_score}

Deployment Topology:

flowchart LR
    subgraph CLIENTS["Clients"]
        direction TB
        BROWSER["Browser"]
        SCRIPTS["curl / scripts"]
    end

    subgraph HOST["Developer host"]
        OLLAMA["Ollama :11434<br/>llama3.2:3b + nomic-embed-text"]
    end

    subgraph COMPOSE["docker-compose stack"]
        direction TB
        subgraph INFRA["profile: infra"]
            QDRANT[("Qdrant<br/>:6333 REST / :6334 gRPC")]
        end
        subgraph APP["profile: app"]
            API["FastAPI :8000<br/>monolith binary"]
            GRADIO["Gradio :7860"]
            SQLITE[("SQLite<br/>bind-mount")]
        end
    end

    BROWSER -->|HTTP| GRADIO
    SCRIPTS -->|HTTP /chat| API
    GRADIO -->|HTTP /chat| API
    API -->|HTTP REST| QDRANT
    API -->|HTTP /api/chat,<br/>/api/embeddings| OLLAMA
    API -. SQLAlchemy .-> SQLITE

The two earlier diagrams are logical (what runs); this one is topological (where it runs). They complement, not duplicate.

Engineering Decisions

Decision	Rationale
Ollama for all embeddings	Even when generation uses Groq or OpenRouter, embeddings for entropy clustering use Ollama (`nomic-embed-text`) locally. Free, fast, no external API dependency for embeddings.
Semantic entropy over binary classifiers	Generates N candidate responses, clusters by semantic similarity, computes Shannon entropy. Higher entropy = less agreement between candidates = higher hallucination risk. Produces a continuous score (0.0-1.0) instead of a binary flag.
Multi-provider verification	Replaced OpenAI-only verification with Groq/Ollama/OpenRouter dispatch. Removes hard dependency on paid API for hallucination checks.
Ollama embeddings with retries + backoff	Centralized in `retrieval/ollama_embeddings.py`: truncation at 8192 chars, 6 attempts, exponential backoff up to 60s. Handles `ResponseError`, `ConnectionError`, `httpx` errors, and `OSError`. Used by chunker, embedder, and entropy verification.
SQLite with pathlib + POSIX URLs	`database/db.py` uses `Path.as_posix()` for SQLite connection strings. On Windows with Docker bind mounts, the host may create `smartb100_v2.db` as a directory instead of a file; the API raises `RuntimeError` with a clear message if this happens.
Sync endpoint for /chat	`def chat()` instead of `async def chat()`. FastAPI runs sync handlers in a thread pool, which frees the event loop for `/health` and other concurrent requests while the LLM blocks.
mypy `ignore_missing_imports=true`	Ollama, qdrant-client, and other dependencies lack type stubs. Avoids false positives without compromising type checking on project code.
Profile-aware system prompts	Three expertise levels (`beginner`, `intermediate`, `expert`) select different system prompts. Same RAG context, different response complexity. No separate models or fine-tuning needed.
bcrypt + JWT gate on `/chat`	Passwords hashed with bcrypt (timing-safe verify via passlib); `/chat` requires `Authorization: Bearer <JWT>`. Rate-limit via slowapi: 5 logins / 15 min and 3 registrations / hour per IP. `JWT_SECRET_KEY` must be ≥32 chars (validated at startup). Breaking: users created before this gate (SHA-256) must be re-registered.
SQLite integrity hardening	`NOT NULL` on required columns, `CASCADE` on FKs, `Boolean is_hallucinated`, timezone-aware `created_at`, `connect_args["timeout"]=10`, and a `PRAGMA foreign_keys=ON` listener so CASCADE actually fires in SQLite. Breaking: old databases must be recreated — delete `smartb100_v2.db` and let `Base.metadata.create_all` regenerate the schema on next API startup.
Modular monolith over microservices	One FastAPI process loads all domain modules (`retrieval/`, `memory/`, `generation/`, `verification/`, `profiling/`) — inter-module calls are function calls, not RPC. External services are limited to genuine third-party (Qdrant, Ollama, SQLite). Microservices would add network latency between RAG steps that share the same `ChatRequest`/`ChatResponse` model, without isolation benefit at current scale. See § Architectural Style for full rationale.

How to Run

Prerequisites

Docker Desktop (download)
Ollama (download)
Python 3.12+ (download)

Setup

# 1. Pull models
ollama pull llama3.2:3b && ollama pull nomic-embed-text

# 2. Install dependencies
uv sync                            # or: python -m venv .venv && .venv/bin/pip install -e .

# 3. Configure environment
cp .env.example .env               # defaults work for local dev

# 4. Start Qdrant
docker compose --profile infra up -d

# 5. Index documents (first run only)
.venv/bin/python database/semantic_chunker.py index ./archives/

# 6. Start API
.venv/bin/python -m uvicorn api.main:app --reload

# 7. (Optional) Start Gradio UI
.venv/bin/python ui/chat_ui.py

Windows users: replace .venv/bin/python with .venv\Scripts\python.exe, or use .\start.bat / .\start.ps1 after steps 1-3.

Full Docker deployment: docker compose --profile infra --profile app up -d

The compose stack uses a multi-stage Dockerfile.api (no build-essential in the final image), healthchecks that gate depends_on ordering, and log rotation (max-size: 10m, max-file: 3). On Linux the OLLAMA_HOST override is required — see SETUP.md §9.1.

See SETUP.md for remote Qdrant configuration.

Verify

curl http://localhost:6333/healthz           # Qdrant: "healthz check passed"
curl http://localhost:8000/health            # API: {"status":"ok"}

API Reference

Endpoint	Description
`POST /chat`	RAG query (requires JWT); returns answer with hallucination score
`POST /auth/register`	Creates new user (rate-limit 3/hour per IP)
`POST /auth/token`	OAuth2 login; returns JWT (rate-limit 5 / 15min per IP)
`GET /health`	API health status

POST /chat:

TOKEN=$(curl -s -X POST "http://localhost:8000/auth/token" \
  -d "username=demo&password=long-enough-pw" | jq -r .access_token)

curl -X POST "http://localhost:8000/chat" \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "session_id": "demo-session",
    "question": "Qual a epoca ideal de plantio da soja?",
    "profile": {"name": "User", "expertise": "beginner"}
  }'
# {"answer": "...", "hallucination_score": 0.18}

Without the Authorization header the API returns 401 Unauthorized.

Request Field	Type	Description
`session_id`	string	UUID for conversation continuity
`question`	string	User query
`profile.expertise`	enum	`beginner` \| `intermediate` \| `expert`

Response Field	Type	Description
`answer`	string	Generated response adapted to expertise level
`hallucination_score`	float	0.0 (grounded) to 1.0 (likely hallucinated)

Project Structure

sb100_agents/
├── api/                            # FastAPI backend
│   ├── main.py                     # App entry (CORS + routers + lifespan)
│   └── routes/                     # chat.py, auth.py, health.py
├── core/                           # Pydantic schemas & configuration
├── retrieval/                      # Embeddings + Qdrant vector search
├── generation/                     # LLM response generation
├── memory/                         # Conversation buffer (FIFO)
├── profiling/                      # User expertise adaptation
├── verification/                   # Semantic entropy + verification gate
├── database/                       # SQLite + PDF semantic chunking
├── eval/                           # 5-step evaluation pipeline
├── ui/                             # Gradio chat interface
├── tests/                          # Unit + integration tests
├── .claude/                        # Agent workflow enforcement
│   ├── rules/                      # Directive files (00-12)
│   ├── guia-configuracao-codex.md  # Codex plugin setup guide
│   ├── registry.md                 # Project state & history
│   ├── tasks.md                    # Task registry
│   └── hooks/                      # Git hooks (commit-msg, pre-commit, etc.)
├── .github/workflows/              # CI + Claude Code automation
├── .dockerignore                   # Shrinks build context (drops .git, tests/, eval/, .claude/, etc.)
├── Dockerfile.api                  # Multi-stage build (builder + runtime)
├── docker-compose.yml              # Qdrant (infra) + API+Gradio (app) with healthchecks
└── pyproject.toml

Roadmap

Feature	Description
Hybrid search	Dense + sparse vectors with RRF fusion
LangGraph migration	ReAct agent with agricultural intent filter
Claim Verification	Atomic decomposition + RAG-based fact checking
Streaming	SSE for incremental responses

Automated Issue Implementation

Issues labeled claude-auto are automatically implemented by Claude Code via GitHub Actions. Mention @claude in any issue or PR comment for interactive assistance.

Setup: add ANTHROPIC_API_KEY secret and create the claude-auto label.

Contributing

See CONTRIBUTING.md. Quick summary: fork, branch (type/TASK-NNN-description), tests, Conventional Commits, PR.

License

MIT License

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

SmartB100 — Agriculture RAG Agent

Why This Exists

Architecture

Architectural Style

Engineering Decisions

How to Run

Prerequisites

Setup

Verify

API Reference

Project Structure

Roadmap

Automated Issue Implementation

Contributing

License

About

Uh oh!

Releases

Packages

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 253 Commits
.claude		.claude
.github/workflows		.github/workflows
api		api
archives		archives
core		core
database		database
eval		eval
generation		generation
memory		memory
profiling		profiling
retrieval		retrieval
scripts		scripts
tests		tests
ui		ui
verification		verification
.dockerignore		.dockerignore
.env.example		.env.example
.gitignore		.gitignore
.python-version		.python-version
CONTRIBUTING.md		CONTRIBUTING.md
Dockerfile.api		Dockerfile.api
LICENSE		LICENSE
README.md		README.md
SETUP.md		SETUP.md
docker-compose.yml		docker-compose.yml
pyproject.toml		pyproject.toml
requirements.txt		requirements.txt
start.bat		start.bat
start.ps1		start.ps1
uv.lock		uv.lock

Folders and files

Latest commit

History

Repository files navigation

SmartB100 — Agriculture RAG Agent

Why This Exists

Architecture

Architectural Style

Engineering Decisions

How to Run

Prerequisites

Setup

Verify

API Reference

Project Structure

Roadmap

Automated Issue Implementation

Contributing

License

About

Topics

Resources

License

Contributing

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Packages