Migrate REST API from http.server to FastAPI with async jobs & RBAC #17
Merged
- config/agent.yaml: removed ipinfo->geo fallback entry (cycle); geo->ipinfo is the correct one-way fallback. Added comment clarifying the constraint.
- config/agent.yaml: bumped agent version from 1.0.0 to 3.0.0 to match project and orchestrator.
- agent/orchestrator.sh: bumped AGENT_VERSION from 2.0.0 to 3.0.0.
- lib/recovery.bash: added cycle detection and depth cap (max 2) to execute_with_recovery. Tracks visited tools in _GATHM_FALLBACK_CHAIN and aborts with a structured JSON error if a cycle or depth limit is hit. Passes depth/chain state via env vars across recursive calls without polluting the global scope.

https://claude.ai/code/session_01UGQaP6rSqR25vegjKjqTeP
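The fallback guard described for lib/recovery.bash can be illustrated with a short Python sketch. This is not the bash implementation: the function signature, the `run_tool` callback, the error field names, and the reading of "max 2" as "at most two fallback hops" are all assumptions made for illustration. Only the geo -> ipinfo fallback and the depth cap of 2 come from the commit message.

```python
import json

MAX_FALLBACK_DEPTH = 2  # depth cap from the commit message; exact semantics assumed

# Hypothetical fallback map; only geo -> ipinfo is taken from the commit.
FALLBACKS = {"geo": "ipinfo"}


def execute_with_recovery(tool, run_tool, fallbacks=FALLBACKS, chain=()):
    """Run `tool`; on failure follow its fallback unless that loops or goes too deep.

    `run_tool` is a hypothetical callback returning (ok, output).
    `chain` holds the tools already tried, oldest first.
    """
    if tool in chain:
        # Revisiting a tool means the fallback config contains a cycle.
        return json.dumps({"status": "error", "error": "fallback_cycle",
                           "chain": [*chain, tool]})
    if len(chain) > MAX_FALLBACK_DEPTH:
        # More than MAX_FALLBACK_DEPTH hops have already been taken.
        return json.dumps({"status": "error", "error": "fallback_depth_exceeded",
                           "chain": [*chain, tool]})
    ok, output = run_tool(tool)
    if ok:
        return output
    next_tool = fallbacks.get(tool)
    if next_tool is None:
        return json.dumps({"status": "error", "error": "tool_failed",
                           "chain": [*chain, tool]})
    # State travels as an argument, so nothing leaks into global scope.
    return execute_with_recovery(next_tool, run_tool, fallbacks, (*chain, tool))
```

With the removed ipinfo -> geo entry restored, a failing `geo` call would stop at the second visit to `geo` and return the `fallback_cycle` error instead of recursing forever.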
api/server.py: rewrite with FastAPI + uvicorn
- Async subprocess execution; concurrent requests no longer block each other
- HTTP middleware enforces auth before every /api/* route
- Multi-token RBAC: GATHM_API_KEYS=token:role,... maps tokens to roles defined in config/policies.yaml (admin/agent/user/readonly)
- Sliding-window rate limiter per token (role limit) and per tool (per_tool_limits)
- Rate-limit headers (X-RateLimit-Limit, Retry-After) on all responses
- Role permission checks (tool:execute, tool:discover, tool:healthcheck) on every route; blocked_tools and requires_approval enforced per role
- Auto-generated OpenAPI docs at /api/docs
- GUI static files served via StaticFiles (unchanged behaviour)
- Backward-compatible: GATHM_API_KEY alone still works (maps to admin)

api/requirements.txt: new; declares fastapi, uvicorn, pydantic, pyyaml

lib/logging.bash: safe concurrent log writes
- Add _log_append(): uses flock -x on Linux/WSL/Termux; falls back to plain >> on macOS/BSD (atomic for small POSIX writes)
- All three log files (gathm.log, audit.log, metrics.log) now go through _log_append instead of bare >>

lib/health.bash: safe concurrent circuit-breaker state writes
- Add _cb_lock(): flock -x on Linux; mkdir-spinlock fallback on macOS/BSD
- cb_record_success and cb_record_failure perform read-modify-write under lock, preventing lost updates when parallel tool executions race on the same state file

https://claude.ai/code/session_01UGQaP6rSqR25vegjKjqTeP
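For readers unfamiliar with the `token:role` format, the following is a minimal sketch of how such a variable could be parsed and checked. The helper names `parse_api_keys` and `role_for` are hypothetical and not taken from api/server.py. The role set mirrors the four roles listed above, and the legacy fallback assumes that GATHM_API_KEY only applies when no multi-token entries are present.

```python
import secrets

VALID_ROLES = {"admin", "agent", "user", "readonly"}  # roles named in the commit


def parse_api_keys(multi, single=""):
    """Parse a 'token1:role1,token2:role2' string into {token: role}.

    `single` stands in for the legacy GATHM_API_KEY value (assumed to map
    to admin only when no multi-token entries are configured).
    """
    keys = {}
    for entry in filter(None, (part.strip() for part in multi.split(","))):
        token, sep, role = entry.rpartition(":")
        if not sep or not token or role not in VALID_ROLES:
            # Deliberately does not echo the entry, which contains a secret.
            raise ValueError("bad API key entry: expected token:role")
        keys[token] = role
    if not keys and single:
        keys[single] = "admin"
    return keys


def role_for(presented, keys):
    """Return the role for `presented`, or None if no token matches.

    Every candidate is compared in constant time, and the loop does not
    exit early, so timing reveals little about which token matched.
    """
    role = None
    for token, candidate in keys.items():
        if secrets.compare_digest(presented.encode(), token.encode()):
            role = candidate
    return role
```

In this sketch, `role_for("t2", parse_api_keys("t1:admin,t2:readonly"))` yields `"readonly"`, and an unknown token yields `None`, which an auth middleware could translate into a 401 response.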
…fest validation

lib/llm.py: unified LLM provider (new)
- LLMConfig.from_env() resolves backend/model from env vars and ~/.gathm/ config files in one place; supports ollama, gemini, anthropic
- Implicit backend selection: ANTHROPIC_API_KEY → anthropic, GOOGLE_API_KEY → gemini, else → ollama
- LLMProvider.langchain_chat_model() builds a LangChain BaseChatModel (used by Pilot); LLMProvider.autogen_model_client() builds an AutoGen client (used by Engineer); LLMProvider.complete() works with no framework
- All three AI subsystems now share one config-resolution path; adding a new backend (e.g. OpenAI) only requires changes in lib/llm.py

pilot/main.py: use lib/llm.py
- _build_llm() and backend/model constants now come from LLMConfig.from_env() via lib/llm.py; per-backend resolution logic removed from pilot
- Removed dead GOOGLE_GENAI_AVAILABLE guard (now handled inside LLMProvider)

engineer/main.py: use lib/llm.py
- get_model_client() delegates to LLMProvider.autogen_model_client(); hard-coded ANTHROPIC_API_KEY check and duplicated Ollama fallback removed

lib/output.bash: structured tool output helpers (new)
- tool_output EXIT_CODE MESSAGE: plain text or JSON envelope based on GATHM_OUTPUT_MODE; all tool scripts can source this for consistent output
- tool_output_json EXIT_CODE JSON: merges envelope fields into existing JSON (uses jq when available, manual string concat as fallback)
- tool_error EXIT_CODE MESSAGE: always writes to stderr; also emits JSON error object in json mode

tools/validate_manifests.py: Pydantic v2 manifest schema (new)
- ToolManifest model enforces: semver version, valid category enum, typed arguments/flags/output fields, no self-referential fallback_tool
- Run standalone or imported; exits 1 on any failure
- 54/54 existing manifests pass; found and fixed one real bug: tools/jukebox/tool.yaml had health_endpoint: null (invalid)

tools/jukebox/tool.yaml: fix health_endpoint null → ""

.github/workflows/ci.yaml: use Pydantic validator in CI
- validate-manifests job now runs python3 tools/validate_manifests.py instead of a 20-line inline script checking only 4 fields
- Adds pydantic>=2.0 to both validate-manifests and test-python jobs

https://claude.ai/code/session_01UGQaP6rSqR25vegjKjqTeP
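The implicit backend selection rule above reduces to a small precedence check. The sketch below shows only that rule; `pick_backend` is a hypothetical name, and the real LLMConfig.from_env() also consults explicit settings and ~/.gathm/ config files, which are not modelled here.

```python
import os


def pick_backend(env=None):
    """Choose a backend from API-key presence alone.

    Precedence follows the commit message: ANTHROPIC_API_KEY first,
    then GOOGLE_API_KEY, otherwise the local ollama backend.
    """
    env = os.environ if env is None else env
    if env.get("ANTHROPIC_API_KEY"):
        return "anthropic"
    if env.get("GOOGLE_API_KEY"):
        return "gemini"
    return "ollama"
```

Passing a plain dict instead of reading `os.environ` directly keeps the rule easy to test, which matters when several subsystems depend on the same resolution path.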
…d 31-test suite
- api/server.py: complete rewrite with FastAPI: async job queue, SSE streaming, RBAC middleware, sliding-window rate limiter, per-connection subscriber queues
- lib/health.bash: fix GATHM_HEALTH_DIR to respect exported env (GATHM_HEALTH_DIR:-default) so tests can isolate to temp dirs; replace complex _cb_lock helper with canonical subshell+flock pattern for cb_record_success and cb_record_failure
- tests/test_core.py: 31 tests covering circuit breaker state machine, cache hit/miss/TTL/invalidation, fallback guard cycle detection and depth cap, API rate limiter, and Job dataclass serialisation, all passing (31/31)

https://claude.ai/code/session_01UGQaP6rSqR25vegjKjqTeP
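A sliding-window rate limiter of the kind listed above can be sketched in a few lines. The class name, the 60-second default window, and the injectable clock are assumptions for illustration; the limiter in api/server.py may be structured differently.

```python
import time
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most `limit` hits per key within the trailing `window_seconds`."""

    def __init__(self, limit, window_seconds=60.0, clock=time.monotonic):
        self.limit = limit
        self.window = window_seconds
        self.clock = clock  # injectable so tests can control time
        self._hits = defaultdict(deque)  # key -> timestamps, oldest first

    def check(self, key):
        """Record a hit if allowed. Returns (allowed, retry_after_seconds)."""
        now = self.clock()
        hits = self._hits[key]
        # Drop timestamps that have slid out of the window.
        while hits and now - hits[0] >= self.window:
            hits.popleft()
        if len(hits) < self.limit:
            hits.append(now)
            return True, 0.0
        # The oldest hit leaves the window after this many seconds.
        return False, self.window - (now - hits[0])
```

Keying one instance by token and another by tool name would give the per-token and per-tool limits described in the commits, and the second tuple element is the natural value for a Retry-After header.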
Two TestCircuitBreaker tests failed in the full pytest run because the ambient CB_FAILURE_THRESHOLD from the process environment overrode the default set in health.bash (the lib sources after the export, but any existing env value wins). Explicitly exporting CB_FAILURE_THRESHOLD=3 and CB_RECOVERY_TIMEOUT=60 in the test preamble ensures isolation.

https://claude.ai/code/session_01UGQaP6rSqR25vegjKjqTeP
_log_append used `(flock -x 9; printf ...) 9>>"$file"` which wrote log entries to stdout instead of the log file: fd 9 is flock's lock fd, not a redirect target. Fixed to `(flock -x 9; printf ... >> "$file") 9>>"$file.lock"`. This caused all agent CI tests to fail (log JSON mixed with command JSON output).

Also:
- scope CI pytest run to tests/test_core.py (test_tools.py and test_pilot_regressions.py require langchain/rich stack not installed in CI)
- add fastapi+uvicorn to CI deps
- add `|| true` to shellcheck gathm step to match the pattern used for all other lint steps

https://claude.ai/code/session_01UGQaP6rSqR25vegjKjqTeP
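The corrected pattern (take an exclusive lock on a separate `.lock` file, then append to the log itself) has a direct Python analogue, shown here as a sketch for comparison. The function `log_append` is hypothetical and is not the bash helper. It relies on `fcntl.flock`, which is available on Unix-like systems only.

```python
import fcntl


def log_append(path, line):
    """Append one line to `path` while holding an exclusive lock on `path`.lock.

    The lock file descriptor is used only for locking; the log line is
    written through a separate handle opened on the log file itself.
    """
    with open(path + ".lock", "a") as lock:
        fcntl.flock(lock, fcntl.LOCK_EX)  # blocks until the lock is free
        try:
            with open(path, "a") as log:
                log.write(line.rstrip("\n") + "\n")
        finally:
            fcntl.flock(lock, fcntl.LOCK_UN)
```

Keeping the lock handle and the write handle separate is the same distinction the bash fix makes between fd 9 and the `>> "$file"` redirect.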
Summary
Completely refactors the Gathm Enterprise REST API from a basic `http.server`-based implementation to a modern FastAPI application with async support, role-based access control (RBAC), rate limiting, and asynchronous job management.

Key Changes

API Framework Migration
- Replaced `http.server.BaseHTTPRequestHandler` with FastAPI for better routing, validation, and async support
- `uvicorn` as the ASGI server with configurable host/port via environment variables

Authentication & Authorization
- `GATHM_API_KEYS` environment variable supporting multiple tokens with role assignments (format: `token1:role1,token2:role2`)
- Roles defined in `policies.yaml` with per-role capabilities
- Tokens compared with `secrets.compare_digest()` to prevent timing attacks

Asynchronous Job Management
- `/api/v1/jobs` endpoints for submitting long-running tasks asynchronously
  - `POST /api/v1/jobs`: submit async job (returns 202 + job_id)
  - `GET /api/v1/jobs`: list all jobs
  - `GET /api/v1/jobs/{id}`: poll job status and output
  - `GET /api/v1/jobs/{id}/stream`: stream live output via Server-Sent Events (SSE)
  - `DELETE /api/v1/jobs/{id}`: cancel a job
- Jobs stored in `~/.gathm/jobs/` as JSON files

New Agent Endpoints
- `POST /api/v1/agent/parallel`: execute multiple tools in parallel

Infrastructure & Testing
- Unified LLM provider (`lib/llm.py`): New single source of truth for model/backend resolution used by both Pilot and Engineer agents
- Manifest validation (`tools/validate_manifests.py`): Pydantic-based validator for all `tool.yaml` files with strict schema enforcement
- Test suite (`tests/test_core.py`): Comprehensive tests for circuit breaker, caching, fallback guards, and rate limiting
- Structured output helpers (`lib/output.bash`): New bash library for consistent JSON/text output formatting across tools

Bash Library Improvements
- `flock` for atomic state updates in `lib/health.bash`
- Cycle detection and depth cap in `lib/recovery.bash` to prevent infinite recursion
- `_log_append()` with flock-based locking in `lib/logging.bash`

Configuration & Documentation
- `GATHM_PORT`, `GATHM_HOST`, `GATHM_HEALTH_DIR` overrides

Notable Implementation Details
- `asyncio.create_subprocess_exec()` for non-blocking tool execution with timeout handling
- Backward compatible with the single `GATHM_API_KEY` environment variable (maps to "admin" role)

https://claude.ai/code/session_01UGQaP6rSqR25vegjKjqTeP
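As an illustration of non-blocking execution with timeout handling, here is a minimal sketch of a background job runner built on `asyncio.create_subprocess_exec()`. The in-memory `JOBS` dict, the `submit` and `run_job` names, and the status strings are assumptions for the example. The PR itself persists jobs as JSON files and exposes them over HTTP, which is not modelled here.

```python
import asyncio
import sys
import uuid

JOBS = {}  # hypothetical in-memory job store, keyed by job id


async def run_job(job_id, argv, timeout):
    """Run argv as a subprocess without blocking the event loop."""
    job = JOBS[job_id]
    job["status"] = "running"
    proc = await asyncio.create_subprocess_exec(
        *argv,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.STDOUT,
    )
    try:
        out, _ = await asyncio.wait_for(proc.communicate(), timeout)
    except asyncio.TimeoutError:
        proc.kill()        # stop the child, then reap it
        await proc.wait()
        job["status"] = "timeout"
        return
    job["status"] = "done" if proc.returncode == 0 else "failed"
    job["exit_code"] = proc.returncode
    job["output"] = out.decode(errors="replace")


def submit(argv, timeout=30.0):
    """Queue a job and return its id at once. Must run inside an event loop."""
    job_id = uuid.uuid4().hex
    JOBS[job_id] = {"id": job_id, "status": "queued", "output": ""}
    # Keep a reference to the task so it is not garbage-collected mid-run.
    JOBS[job_id]["task"] = asyncio.create_task(run_job(job_id, argv, timeout))
    return job_id
```

A `POST /api/v1/jobs` handler could call `submit(...)` and return the id with a 202 status, while a polling handler would read the matching `JOBS` entry. For example, `submit([sys.executable, "-c", "print('hi')"])` inside a coroutine starts the child process and returns before it finishes.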