Skip to content

LukeSantossz/sb100_agents

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

253 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Python FastAPI Gradio Status CI Coverage

SmartB100 — Agriculture RAG Agent

RAG-powered chat system for agricultural technical support, with hallucination verification through semantic entropy.

Why This Exists

Agricultural extension workers and agronomists need quick, reliable answers to technical questions about crop management, soil treatment, pest control, and planting schedules. Traditional search through dense PDF manuals is slow and error-prone.

SmartB100 indexes agricultural PDF documents into a vector database and uses a local LLM to generate answers grounded in the indexed content. The system adapts response complexity to the user's expertise level (beginner, intermediate, expert) and flags potentially hallucinated answers using semantic entropy scoring, so users know when to double-check the information.

Architecture

Architectural Style

SmartB100 is a modular monolith with composed deployment:

  • One application process. api/main.py loads every domain module (api/routes/*, core/*, retrieval/*, memory/*, generation/*, verification/*, profiling/*, database/*) into a single FastAPI runtime. Inter-module communication is function calls inside the same Python interpreter — no RPC, no message broker, no queue.
  • Eight internal layers, one binary. The folder boundary is a convention for testability and review; it is not a network boundary.
  • External processes are limited to genuine third-party services. No domain code lives outside the API process.

External components (each runs in its own process):

Component Role Containerized? Protocol
Qdrant Vector DB (archives_v2 collection, 768-dim embeddings) Yes — docker compose --profile infra HTTP REST :6333 + gRPC :6334
Ollama LLM chat (llama3.2:3b) + embeddings (nomic-embed-text) No — runs on the host HTTP REST :11434 via OLLAMA_HOST
SQLite Auth + conversation history No (filesystem) Bind-mount ./smartb100_v2.db:/app/smartb100_v2.db

Client tier (two paths):

  • Gradio UI (ui/chat_ui.py) — stateless HTTP client containerized via docker compose --profile app. Calls only POST /chat. Does not import any domain module — it is a UI shell, not a microservice.
  • Direct HTTPcurl, scripts, future mobile clients. Same endpoint, same JSON contract.

Why not microservices. The RAG pipeline (embed → search → generate → verify) shares the same ChatRequest/ChatResponse model and runs synchronously within a single request. Splitting any step into its own service would add network latency between calls that are currently in-process, plus contract-versioning overhead, without delivering independent scaling benefit at current load.

When to reconsider. If verification/ (entropy sampling, the slowest step) needs to scale independently of generation/, or if the workload grows beyond ~500 req/s, the verification gate is the natural extraction point — it already has a clean async-friendly interface (evaluate(question, context, answer)).

flowchart TD
    subgraph CLIENT["Client"]
        GRADIO["Gradio UI\n:7860"]
        CURL["curl / HTTP"]
    end

    subgraph API["API Layer"]
        ENDPOINT["POST /chat"]
        AUTH["POST /auth/*"]
        HEALTH["GET /health"]
    end

    subgraph PIPELINE["RAG Pipeline"]
        EMBED["Embedder\nOllama nomic-embed-text\n768 dims"]
        SEARCH["Vector Search\nCosine Similarity"]
        MEMORY["ConversationBuffer\nFIFO deque (maxlen=10)"]
        PROFILE["Profiling\nbeginner | intermediate | expert"]
        LLM["LLM Generator\nOllama llama3.2:3b"]
    end

    subgraph VERIFY["Verification"]
        ENTROPY["Semantic Entropy\nMulti-provider (Groq/Ollama/OpenRouter)"]
        GATE["Verification Gate\nRetry + Fallback"]
    end

    subgraph DATA["Data Layer"]
        QDRANT[("Qdrant\n:6333\narchives_v2")]
        SQLITE[("SQLite\nusers / conversations")]
    end

    GRADIO -->|HTTP JSON| ENDPOINT
    CURL -->|HTTP JSON| ENDPOINT

    ENDPOINT --> EMBED
    EMBED --> SEARCH
    SEARCH --> QDRANT

    ENDPOINT --> MEMORY
    MEMORY -.->|history| LLM
    SEARCH -->|context| PROFILE
    PROFILE --> LLM

    LLM --> GATE
    GATE -->|verification_enabled| ENTROPY
    ENTROPY -->|score| GATE
    GATE -->|retry if high entropy| LLM

    GATE --> RESPONSE["ChatResponse\n{answer, hallucination_score}"]

    AUTH --> SQLITE
Loading

RAG Pipeline Flow:

sequenceDiagram
    participant C as Client
    participant A as API /chat
    participant E as Embedder
    participant Q as Qdrant
    participant G as LLM Generator
    participant V as Verification Gate

    C->>A: POST /chat {session_id, question, profile}
    A->>E: generate_embedding(question)
    E-->>A: vector[768]
    A->>Q: search_context(vector, top_k=3)
    Q-->>A: chunks[]
    A->>G: generate(question, context, history, profile)
    G-->>A: answer
    alt verification_enabled
        A->>V: evaluate(question, context, answer)
        V-->>A: {answer, hallucination_score}
    end
    A-->>C: ChatResponse {answer, hallucination_score}
Loading

Deployment Topology:

flowchart LR
    subgraph CLIENTS["Clients"]
        direction TB
        BROWSER["Browser"]
        SCRIPTS["curl / scripts"]
    end

    subgraph HOST["Developer host"]
        OLLAMA["Ollama :11434<br/>llama3.2:3b + nomic-embed-text"]
    end

    subgraph COMPOSE["docker-compose stack"]
        direction TB
        subgraph INFRA["profile: infra"]
            QDRANT[("Qdrant<br/>:6333 REST / :6334 gRPC")]
        end
        subgraph APP["profile: app"]
            API["FastAPI :8000<br/>monolith binary"]
            GRADIO["Gradio :7860"]
            SQLITE[("SQLite<br/>bind-mount")]
        end
    end

    BROWSER -->|HTTP| GRADIO
    SCRIPTS -->|HTTP /chat| API
    GRADIO -->|HTTP /chat| API
    API -->|HTTP REST| QDRANT
    API -->|HTTP /api/chat,<br/>/api/embeddings| OLLAMA
    API -. SQLAlchemy .-> SQLITE
Loading

The two earlier diagrams are logical (what runs); this one is topological (where it runs). They complement, not duplicate.

Engineering Decisions

Decision Rationale
Ollama for all embeddings Even when generation uses Groq or OpenRouter, embeddings for entropy clustering use Ollama (nomic-embed-text) locally. Free, fast, no external API dependency for embeddings.
Semantic entropy over binary classifiers Generates N candidate responses, clusters by semantic similarity, computes Shannon entropy. Higher entropy = less agreement between candidates = higher hallucination risk. Produces a continuous score (0.0-1.0) instead of a binary flag.
Multi-provider verification Replaced OpenAI-only verification with Groq/Ollama/OpenRouter dispatch. Removes hard dependency on paid API for hallucination checks.
Ollama embeddings with retries + backoff Centralized in retrieval/ollama_embeddings.py: truncation at 8192 chars, 6 attempts, exponential backoff up to 60s. Handles ResponseError, ConnectionError, httpx errors, and OSError. Used by chunker, embedder, and entropy verification.
SQLite with pathlib + POSIX URLs database/db.py uses Path.as_posix() for SQLite connection strings. On Windows with Docker bind mounts, the host may create smartb100_v2.db as a directory instead of a file; the API raises RuntimeError with a clear message if this happens.
Sync endpoint for /chat def chat() instead of async def chat(). FastAPI runs sync handlers in a thread pool, which frees the event loop for /health and other concurrent requests while the LLM blocks.
mypy ignore_missing_imports=true Ollama, qdrant-client, and other dependencies lack type stubs. Avoids false positives without compromising type checking on project code.
Profile-aware system prompts Three expertise levels (beginner, intermediate, expert) select different system prompts. Same RAG context, different response complexity. No separate models or fine-tuning needed.
bcrypt + JWT gate on /chat Passwords hashed with bcrypt (timing-safe verify via passlib); /chat requires Authorization: Bearer <JWT>. Rate-limit via slowapi: 5 logins / 15 min and 3 registrations / hour per IP. JWT_SECRET_KEY must be ≥32 chars (validated at startup). Breaking: users created before this gate (SHA-256) must be re-registered.
SQLite integrity hardening NOT NULL on required columns, CASCADE on FKs, Boolean is_hallucinated, timezone-aware created_at, connect_args["timeout"]=10, and a PRAGMA foreign_keys=ON listener so CASCADE actually fires in SQLite. Breaking: old databases must be recreated — delete smartb100_v2.db and let Base.metadata.create_all regenerate the schema on next API startup.
Modular monolith over microservices One FastAPI process loads all domain modules (retrieval/, memory/, generation/, verification/, profiling/) — inter-module calls are function calls, not RPC. External services are limited to genuine third-party (Qdrant, Ollama, SQLite). Microservices would add network latency between RAG steps that share the same ChatRequest/ChatResponse model, without isolation benefit at current scale. See § Architectural Style for full rationale.

How to Run

Prerequisites

Setup

# 1. Pull models
ollama pull llama3.2:3b && ollama pull nomic-embed-text

# 2. Install dependencies
uv sync                            # or: python -m venv .venv && .venv/bin/pip install -e .

# 3. Configure environment
cp .env.example .env               # defaults work for local dev

# 4. Start Qdrant
docker compose --profile infra up -d

# 5. Index documents (first run only)
.venv/bin/python database/semantic_chunker.py index ./archives/

# 6. Start API
.venv/bin/python -m uvicorn api.main:app --reload

# 7. (Optional) Start Gradio UI
.venv/bin/python ui/chat_ui.py

Windows users: replace .venv/bin/python with .venv\Scripts\python.exe, or use .\start.bat / .\start.ps1 after steps 1-3.

Full Docker deployment: docker compose --profile infra --profile app up -d

The compose stack uses a multi-stage Dockerfile.api (no build-essential in the final image), healthchecks that gate depends_on ordering, and log rotation (max-size: 10m, max-file: 3). On Linux the OLLAMA_HOST override is required — see SETUP.md §9.1.

See SETUP.md for remote Qdrant configuration.

Verify

curl http://localhost:6333/healthz           # Qdrant: "healthz check passed"
curl http://localhost:8000/health            # API: {"status":"ok"}

API Reference

Endpoint Description
POST /chat RAG query (requires JWT); returns answer with hallucination score
POST /auth/register Creates new user (rate-limit 3/hour per IP)
POST /auth/token OAuth2 login; returns JWT (rate-limit 5 / 15min per IP)
GET /health API health status

POST /chat:

TOKEN=$(curl -s -X POST "http://localhost:8000/auth/token" \
  -d "username=demo&password=long-enough-pw" | jq -r .access_token)

curl -X POST "http://localhost:8000/chat" \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "session_id": "demo-session",
    "question": "Qual a epoca ideal de plantio da soja?",
    "profile": {"name": "User", "expertise": "beginner"}
  }'
# {"answer": "...", "hallucination_score": 0.18}

Without the Authorization header the API returns 401 Unauthorized.

Request Field Type Description
session_id string UUID for conversation continuity
question string User query
profile.expertise enum beginner | intermediate | expert
Response Field Type Description
answer string Generated response adapted to expertise level
hallucination_score float 0.0 (grounded) to 1.0 (likely hallucinated)

Project Structure

sb100_agents/
├── api/                            # FastAPI backend
│   ├── main.py                     # App entry (CORS + routers + lifespan)
│   └── routes/                     # chat.py, auth.py, health.py
├── core/                           # Pydantic schemas & configuration
├── retrieval/                      # Embeddings + Qdrant vector search
├── generation/                     # LLM response generation
├── memory/                         # Conversation buffer (FIFO)
├── profiling/                      # User expertise adaptation
├── verification/                   # Semantic entropy + verification gate
├── database/                       # SQLite + PDF semantic chunking
├── eval/                           # 5-step evaluation pipeline
├── ui/                             # Gradio chat interface
├── tests/                          # Unit + integration tests
├── .claude/                        # Agent workflow enforcement
│   ├── rules/                      # Directive files (00-12)
│   ├── guia-configuracao-codex.md  # Codex plugin setup guide
│   ├── registry.md                 # Project state & history
│   ├── tasks.md                    # Task registry
│   └── hooks/                      # Git hooks (commit-msg, pre-commit, etc.)
├── .github/workflows/              # CI + Claude Code automation
├── .dockerignore                   # Shrinks build context (drops .git, tests/, eval/, .claude/, etc.)
├── Dockerfile.api                  # Multi-stage build (builder + runtime)
├── docker-compose.yml              # Qdrant (infra) + API+Gradio (app) with healthchecks
└── pyproject.toml

Roadmap

Feature Description
Hybrid search Dense + sparse vectors with RRF fusion
LangGraph migration ReAct agent with agricultural intent filter
Claim Verification Atomic decomposition + RAG-based fact checking
Streaming SSE for incremental responses

Automated Issue Implementation

Issues labeled claude-auto are automatically implemented by Claude Code via GitHub Actions. Mention @claude in any issue or PR comment for interactive assistance.

Setup: add ANTHROPIC_API_KEY secret and create the claude-auto label.

Contributing

See CONTRIBUTING.md. Quick summary: fork, branch (type/TASK-NNN-description), tests, Conventional Commits, PR.

License

MIT License

About

RAG-based agricultural Q&A system using FastAPI, Qdrant vector database, and Ollama LLM for semantic document retrieval and response generation from PDF knowledge bases.

Topics

Resources

License

Contributing

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages