Production-oriented, local-first GenAI copilot for healthcare device troubleshooting. The system is designed to answer operational questions from manuals and procedures, execute deterministic troubleshooting workflows, and stay portable from a local development stack to AWS without rewriting the business layer.
This repository is intentionally structured as an engineering foundation, not a chatbot demo. The emphasis is on retrieval quality, deterministic orchestration, observability, evaluation, and clean provider boundaries.
Field support and service teams often work with fragmented troubleshooting manuals, error code references, and service procedures. A useful copilot in this environment must do more than generate fluent text:
- retrieve the right evidence from the right document section
- reason through a deterministic workflow
- invoke structured tools when needed
- refuse unsupported or unsafe requests
- expose logs, metrics, and evaluation artifacts for inspection
The target outcome is a grounded troubleshooting assistant that behaves like an auditable system rather than a black box.
Key design choices:

- Deterministic orchestration over free-form autonomous behavior
- Hybrid retrieval with reranking instead of naive vector-only search
- Provider abstractions so local development and AWS deployment share the same core logic
- Structured outputs with explicit sources and confidence
- Guardrails for scope restriction, prompt injection resistance, and low-confidence handling
- First-class observability and evaluation from the start
In scope:

- Healthcare device troubleshooting guidance
- Manual and procedure-grounded question answering
- Error code lookup and procedural assistance
- Tool-assisted workflows such as ticket creation or structured lookups
- Voice-friendly interaction patterns for field usage
Out of scope:

- Medical diagnosis
- Clinical decision support
- Ungrounded answers without retrieved evidence
- Vendor-specific lock-in inside the business layer
The project follows a few non-negotiable design principles:
- System design matters more than a single model call.
- Retrieval quality is the primary accuracy lever.
- Workflows should be explicit, testable, and traceable.
- Every meaningful capability should be measurable.
- Local-first development should map cleanly to cloud deployment.
The application is organized as a layered system with clear seams between orchestration, retrieval, provider integrations, and operational concerns.
```mermaid
flowchart TD
U[Engineer / Operator] --> UI[API or Voice Interface]
UI --> API[FastAPI API Layer]
API --> ORCH[LangGraph Orchestrator]
ORCH --> GR[Input Guardrails]
GR --> PLAN[Planner / Router]
PLAN --> QR[Query Rewriter]
QR --> RET[Hybrid Retrieval]
RET --> VDB[(Vector Store)]
RET --> BM25[(Keyword Index)]
RET --> RR[Reranker]
PLAN --> TOOLS[Tool Executor]
TOOLS --> E1[Error Code Lookup]
TOOLS --> E2[Troubleshooting Generator]
TOOLS --> E3[Ticket Creator]
RR --> REASON[Reasoner]
TOOLS --> REASON
REASON --> CRITIC[Critic / Grounding Check]
CRITIC --> OGR[Output Guardrails]
OGR --> RESP[Structured Response]
API --> OBS[Observability]
ORCH --> OBS
RET --> OBS
TOOLS --> OBS
RESP --> OBS
OBS --> LOGS[Logs / Traces / Metrics]
OBS --> EVAL[Evaluation Dataset + Scores]
subgraph Providers
LLM[LLM Provider]
STORE[Storage Provider]
end
REASON --> LLM
RR --> LLM
API --> STORE
```
The user experience is designed around a grounded troubleshooting path rather than open-ended chat.
```mermaid
flowchart LR
A[User asks troubleshooting question] --> B[Validate request scope]
B --> C[Normalize / rewrite technical query]
C --> D[Retrieve relevant manual sections]
D --> E[Rerank top candidates]
E --> F[Decide whether a tool is needed]
F --> G[Run deterministic tool or reasoning step]
G --> H[Ground answer against retrieved evidence]
H --> I[Format structured response]
I --> J[Return steps, warnings, tools, sources, confidence]
H --> K{Confidence too low?}
K -->|Yes| L[Return NOT FOUND IN MANUAL]
K -->|No| I
```
A typical request moves through the following stages:

- A user submits a troubleshooting question through the API or voice interface.
- Input guardrails validate domain fit, reject disallowed requests, and sanitize the prompt.
- The orchestrator routes the request through a deterministic graph.
- The retrieval system rewrites the query, performs hybrid search, and reranks the results.
- The workflow decides whether tool execution is required.
- The reasoning layer produces a grounded response using retrieved context and tool outputs.
- Output guardrails verify grounding, confidence, and response structure.
- Observability captures logs, traces, latency, retrieval context, and evaluation signals.
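
As a sketch of the transport seam, a minimal FastAPI route that delegates to the orchestrator might look like the following. The route path, request fields, and the `run_troubleshooting_workflow` entry point are illustrative assumptions, not the final contract.

```python
# Hypothetical transport seam: FastAPI owns HTTP concerns and delegates the
# workflow to the orchestration layer; nothing below calls a model directly.
from fastapi import FastAPI
from pydantic import BaseModel, Field

app = FastAPI(title="Device Troubleshooting Copilot")

class TroubleshootRequest(BaseModel):
    query: str = Field(..., description="Operator question, error code, or symptom")
    device_model: str | None = None

async def run_troubleshooting_workflow(query: str, device_model: str | None = None) -> dict:
    # Placeholder for the LangGraph entry point that would live in app/agents.
    return {"steps": [], "warnings": [], "tools": [], "sources": [], "confidence": 0.0}

@app.post("/troubleshoot")
async def troubleshoot(req: TroubleshootRequest) -> dict:
    return await run_troubleshooting_workflow(req.query, device_model=req.device_model)
```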
The codebase is scaffolded to keep concerns separate from day one.
```text
app/
api/ # FastAPI routes, request/response contracts, transport concerns
core/ # Shared config, types, domain models, utilities, guardrails
agents/ # LangGraph nodes, state definitions, orchestration workflows
retrieval/ # Chunking, embedding, BM25, hybrid fusion, reranking
providers/ # LLM, vector store, storage, speech, and cloud/local adapters
tools/ # Deterministic tools and action executors
eval/ # Offline evaluation pipelines, datasets, scoring
observability/ # Logging, tracing, metrics, telemetry exporters
data/ # Manuals, evaluation sets, fixtures, ingestion outputs
scripts/ # CLI utilities for ingestion, indexing, evaluation, setup
  docs/          # Deep-dive architecture, ops notes, and future ADRs
```
The design goal is to support local-first development without sacrificing a clean production path to AWS.
| Concern | Local-first stack | AWS-aligned stack | Why this split works |
|---|---|---|---|
| API layer | FastAPI | FastAPI on Lambda or ECS | Keep transport contracts identical across environments |
| Workflow orchestration | LangGraph | LangGraph | Deterministic graph logic remains unchanged |
| LLM provider | Ollama | Amazon Bedrock | Provider abstraction isolates model vendor concerns |
| Embeddings | Local embedding model via Ollama or sentence-transformers | Bedrock embeddings or managed embedding endpoint | Swap implementation, not calling code |
| Vector retrieval | Chroma | OpenSearch | Same retrieval contract, different backing store |
| Keyword retrieval | Local BM25 / Whoosh-style index | OpenSearch BM25 | Hybrid retrieval remains consistent |
| Reranking | Local cross-encoder or lightweight reranker | Bedrock-compatible reranker or hosted rerank service | Accuracy improvements without changing workflow shape |
| Storage | Local filesystem | Amazon S3 | Ingestion and artifact storage map cleanly |
| Speech input/output | Local STT/TTS providers | Amazon Transcribe / Polly | Voice remains optional and replaceable |
| Observability | Structured JSON logs, OpenTelemetry, local dashboards | CloudWatch, X-Ray, OpenTelemetry collectors | Same telemetry model, different sink |
| Evaluation | Local scripts and notebooks | Scheduled evaluation jobs and persisted score history | Quality measurement works in both environments |
| Deployment | Local Python runtime, Docker | Lambda or ECS with managed AWS dependencies | Keeps infra choices flexible by workload |

The same mapping, condensed:

| Capability | Local implementation | AWS implementation |
|---|---|---|
| LLM | Ollama | Bedrock |
| Vector store | Chroma | OpenSearch |
| Keyword search | Local BM25 | OpenSearch |
| Object storage | Filesystem | S3 |
| API hosting | Local process / Docker | Lambda or ECS |
| Tracing and metrics | OpenTelemetry + local sink | CloudWatch + X-Ray + OpenTelemetry |
| Voice | Local provider | Transcribe + Polly |
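
One way to enforce that boundary is a small set of structural interfaces that the business layer depends on, with Ollama/Bedrock and Chroma/OpenSearch adapters implementing them. The names and signatures below are a sketch, not a committed API.

```python
# Hypothetical provider seams: orchestration and retrieval code depend only on
# these Protocols; concrete local and AWS adapters live in app/providers.
from typing import Protocol

class LLMProvider(Protocol):
    def generate(self, prompt: str, *, temperature: float = 0.0) -> str: ...

class EmbeddingProvider(Protocol):
    def embed(self, texts: list[str]) -> list[list[float]]: ...

class VectorStore(Protocol):
    def search(self, query_embedding: list[float], top_k: int) -> list[dict]: ...

class KeywordIndex(Protocol):
    def search(self, query: str, top_k: int) -> list[dict]: ...

class ObjectStorage(Protocol):
    def put(self, key: str, data: bytes) -> None: ...
    def get(self, key: str) -> bytes: ...
```

Swapping Chroma for OpenSearch, or Ollama for Bedrock, then means adding an adapter in app/providers rather than touching app/agents or app/retrieval.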
The README is based on the architecture intent in agent.md and the decision log in decisions.md. The most important decisions are:
- Use RAG instead of fine-tuning in the initial phase because manuals are dynamic and must stay updateable.
- Prefer deterministic workflow orchestration over open-ended autonomous agents for traceability and debugging.
- Invest in hybrid retrieval plus reranking because this use case depends on both semantic relevance and exact terminology.
- Keep provider boundaries explicit so local experimentation and AWS deployment do not fork the core design.
- Treat observability, guardrails, and evaluation as core product features rather than post-demo additions.
This is not intended to be a simple "embed and search" system. The retrieval layer is expected to include:
- section-aware chunking aligned to manual structure, procedures, and error code groupings
- chunk sizes in the roughly 300 to 800 token range with modest overlap
- hybrid retrieval across dense and sparse indexes
- weighted result fusion before reranking
- reranking from a broader candidate set to a narrow grounded context window
- query normalization and rewriting for technical terms, abbreviations, and error strings
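
As one illustration of the fusion step above, a weighted reciprocal-rank fusion over the dense and sparse result lists could look like this; the weights, damping constant, and candidate-set size are placeholders to be tuned against the evaluation set.

```python
# Hypothetical weighted reciprocal-rank fusion of dense (vector) and sparse (BM25)
# results. Each input is an ordered list of chunk IDs, best match first.
def fuse_results(
    dense_ids: list[str],
    sparse_ids: list[str],
    dense_weight: float = 0.6,
    sparse_weight: float = 0.4,
    k: int = 60,                 # standard RRF damping constant
    top_n: int = 20,             # candidate set handed to the reranker
) -> list[str]:
    scores: dict[str, float] = {}
    for weight, ranked in ((dense_weight, dense_ids), (sparse_weight, sparse_ids)):
        for rank, chunk_id in enumerate(ranked):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + weight / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]
```

The fused top-N candidates then go to the reranker, which narrows them down to the grounded context window.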
The orchestration layer should be implemented as pure, testable graph nodes with explicit state transitions.
Expected node responsibilities:
- planner: classify the request and determine the path
- retriever: fetch the best supporting context
- reasoner: synthesize an answer from context and tool outputs
- tool_executor: run deterministic actions
- critic: check grounding, completeness, and confidence
- human_approval: optional approval step for sensitive actions
The design intent is to minimize hidden state and make failures debuggable.
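
A minimal sketch of that shape in LangGraph, assuming a typed state dict and pure node functions; the state fields and linear wiring are illustrative, and conditional edges, tool_executor, and human_approval would be added the same way.

```python
# Hypothetical state contract and graph wiring for the deterministic workflow.
from typing import TypedDict
from langgraph.graph import StateGraph, END

class WorkflowState(TypedDict, total=False):
    query: str
    rewritten_query: str
    route: str                  # e.g. "retrieval" or "tool"
    retrieved_chunks: list[dict]
    tool_results: list[dict]
    draft_answer: str
    grounded: bool
    confidence: float

def planner(state: WorkflowState) -> WorkflowState:
    # Classify the request and choose a path; a pure function of the state.
    ...

def retriever(state: WorkflowState) -> WorkflowState: ...
def reasoner(state: WorkflowState) -> WorkflowState: ...
def critic(state: WorkflowState) -> WorkflowState: ...

graph = StateGraph(WorkflowState)
for name, node in [("planner", planner), ("retriever", retriever),
                   ("reasoner", reasoner), ("critic", critic)]:
    graph.add_node(name, node)
graph.set_entry_point("planner")
graph.add_edge("planner", "retriever")
graph.add_edge("retriever", "reasoner")
graph.add_edge("reasoner", "critic")
graph.add_edge("critic", END)
workflow = graph.compile()
```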
Responses should be consistent and operationally useful. The target output shape is:
- Steps: ordered troubleshooting actions
- Warnings: cautions or preconditions the operator should know
- Tools: tools invoked during the workflow
- Sources: the manual sections used as evidence
- Confidence: how strongly the retrieved evidence supports the answer
If the system cannot support a conclusion from retrieved material, it should say so explicitly rather than improvise. Low-confidence responses should degrade gracefully to NOT FOUND IN MANUAL.
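
A sketch of that output contract as a Pydantic model, with the low-confidence fallback made explicit; the threshold value is an assumption to be calibrated against evaluation data.

```python
# Hypothetical structured response schema with an explicit low-confidence fallback.
from pydantic import BaseModel

NOT_FOUND_IN_MANUAL = "NOT FOUND IN MANUAL"
CONFIDENCE_THRESHOLD = 0.5   # placeholder; tune against the evaluation set

class SourceRef(BaseModel):
    document: str
    section: str

class CopilotResponse(BaseModel):
    steps: list[str]
    warnings: list[str]
    tools: list[str]
    sources: list[SourceRef]
    confidence: float

def finalize(response: CopilotResponse) -> CopilotResponse:
    # Degrade gracefully instead of improvising when evidence is weak.
    if response.confidence < CONFIDENCE_THRESHOLD or not response.sources:
        return CopilotResponse(
            steps=[NOT_FOUND_IN_MANUAL], warnings=[], tools=[],
            sources=[], confidence=response.confidence,
        )
    return response
```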
Guardrails are a first-class requirement because the assistant operates in a sensitive domain. At a minimum, the system should:
- detect prompt injection attempts
- reject requests outside supported troubleshooting scope
- sanitize inputs before orchestration
- prevent use as a medical diagnosis assistant
- require grounding in retrieved evidence
- block unsupported troubleshooting steps
- return explicit low-confidence outcomes when evidence is weak
- preserve source attribution and confidence visibility
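
On the input side, part of this can be a deterministic pre-check that runs before the graph; the patterns below are illustrative placeholders, not a complete policy, and would be complemented by model-based checks.

```python
# Hypothetical input guardrail: scope check plus naive prompt-injection screening.
# Real policies would be configuration-driven and far more complete.
import re
from dataclasses import dataclass

INJECTION_PATTERNS = [
    r"ignore (all|previous) instructions",
    r"system prompt",
]
OUT_OF_SCOPE_PATTERNS = [
    r"\bdiagnos(e|is)\b",        # medical diagnosis requests are rejected
    r"\btreatment\b",
]

@dataclass
class GuardrailVerdict:
    allowed: bool
    reason: str = ""

def check_input(query: str) -> GuardrailVerdict:
    lowered = query.lower()
    for pattern in INJECTION_PATTERNS:
        if re.search(pattern, lowered):
            return GuardrailVerdict(False, "possible prompt injection")
    for pattern in OUT_OF_SCOPE_PATTERNS:
        if re.search(pattern, lowered):
            return GuardrailVerdict(False, "outside supported troubleshooting scope")
    return GuardrailVerdict(True)
```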
Every meaningful request should be traceable. At minimum, telemetry should capture:
- request identifier
- original and rewritten query
- retrieved documents and ranking metadata
- prompts and provider calls
- tool invocations and results
- response payload
- latency and token usage
- confidence and evaluation signals
The goal is to make failures diagnosable and improvement work measurable.
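
As a sketch of the telemetry shape, a single structured event emitted per request might carry the fields above; the field names are assumptions, and the real schema belongs in app/observability.

```python
# Hypothetical per-request telemetry record, serialized as one structured JSON log line.
import json, time, uuid

def log_request_event(
    query: str,
    rewritten_query: str,
    retrieved_ids: list[str],
    tool_calls: list[dict],
    response: dict,
    latency_ms: float,
    prompt_tokens: int,
    completion_tokens: int,
) -> str:
    event = {
        "request_id": str(uuid.uuid4()),
        "timestamp": time.time(),
        "query": query,
        "rewritten_query": rewritten_query,
        "retrieved_chunk_ids": retrieved_ids,
        "tool_calls": tool_calls,
        "response": response,
        "latency_ms": latency_ms,
        "token_usage": {"prompt": prompt_tokens, "completion": completion_tokens},
        "confidence": response.get("confidence"),
    }
    line = json.dumps(event)
    print(line)   # stdout locally; shipped to CloudWatch / OTel collectors on AWS
    return line
```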
The project should maintain a standing evaluation set rather than relying on anecdotal manual testing.
Planned evaluation focus:
- faithfulness
- answer relevance
- context precision
- retrieval hit quality
- tool execution success rate
- latency by workflow path
The baseline expectation is a representative dataset with edge cases, ambiguous phrasing, exact error codes, and incomplete-context scenarios.
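
A minimal sketch of the offline loop over such a dataset, assuming a JSONL file of cases and a callable pipeline; RAGAS-style faithfulness and relevance scorers could be plugged in alongside the simple retrieval-hit metric shown here.

```python
# Hypothetical offline evaluation loop: replay a JSONL dataset through the pipeline
# and aggregate per-metric scores for comparison across runs.
import json
from pathlib import Path
from statistics import mean

def retrieval_hit(expected_sources: list[str], retrieved_ids: list[str]) -> float:
    # 1.0 if any expected manual section appears in the retrieved set.
    return 1.0 if set(expected_sources) & set(retrieved_ids) else 0.0

def evaluate(dataset_path: Path, run_pipeline) -> dict[str, float]:
    hits, latencies = [], []
    for line in dataset_path.read_text().splitlines():
        case = json.loads(line)   # {"query": ..., "expected_sources": [...]}
        result = run_pipeline(case["query"])
        hits.append(retrieval_hit(case["expected_sources"], result["retrieved_chunk_ids"]))
        latencies.append(result["latency_ms"])
    return {
        "retrieval_hit_rate": mean(hits),
        "p50_latency_ms": sorted(latencies)[len(latencies) // 2],
    }
```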
The implementation is expected to follow these engineering practices:

- deterministic workflow paths
- graceful fallback behavior
- explicit failure states
- provider abstraction boundaries
- modular folder ownership
- typed interfaces and structured logging
- no vendor-specific logic in domain modules
- environment-driven configuration
- replaceable storage, model, and retrieval adapters
- cache embeddings where possible
- avoid unnecessary model invocations
- keep retrieval candidate sets bounded
- measure latency at each stage of the graph
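
For the environment-driven configuration practice above, the intent is that a single settings object selects providers and bounds, so local and AWS runs differ only by environment variables. A sketch assuming pydantic-settings, with placeholder variable names:

```python
# Hypothetical environment-driven configuration: the same code path selects
# local or AWS providers purely from environment variables.
from pydantic_settings import BaseSettings, SettingsConfigDict

class Settings(BaseSettings):
    model_config = SettingsConfigDict(env_prefix="COPILOT_")  # e.g. COPILOT_LLM_PROVIDER=bedrock

    llm_provider: str = "ollama"            # or "bedrock"
    vector_store: str = "chroma"            # or "opensearch"
    object_storage: str = "filesystem"      # or "s3"
    embedding_cache_dir: str = ".cache/embeddings"
    retrieval_candidate_limit: int = 20     # keeps retrieval candidate sets bounded

settings = Settings()
```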
The repository currently contains the architecture scaffold and decision baseline. The folder layout is in place, and the next implementation phases are expected to focus on:
- provider interfaces and configuration models
- ingestion, chunking, and indexing pipelines
- LangGraph workflow nodes and state contracts
- API endpoints and structured response schemas
- evaluation harness and observability integration
This status is intentional: the repo is being shaped around a durable system architecture before feature sprawl begins.
Known limitations:

- no real-time device integration yet
- no clinical validation
- local model quality will vary by hardware and model choice
- retrieval quality depends heavily on document quality and chunking discipline
- voice features add operational complexity and latency
Related documents:

- Architecture intent: agent.md
- Architectural decisions: decisions.md
The bar for this project is simple: build it the way you would if it needed to survive its first production incident.