feat(npc): NPCs as agents with tools (optionally LLM-backed) #217
Merged
Closes #74. Lifts NPCs from scripted-loop "smarter cron jobs" to agent loops with a persona, a tool surface bound over the runtime backing's interface, and an opt-in LLM. Strands SDK (already an optional dep) owns the loop, so we don't re-invent tool dispatch, retries, or streaming.
Core:
* `NPC.requires_llm: ClassVar[bool] = False` -- opt-in flag.
* `AgentNPC(NPC)` base -- system prompt + cadence + abstract `_build_tools(interface)`. Lazy strands import; per-tick failures are silent; build failures mark the NPC broken so we don't retry every tick.
* `EpisodeService(npc_llm_model=...)` injects the model id into the per-NPC `start()` context for NPCs that opt in. Plain NPCs see no `llm` key and pay nothing.
* `RunConfig.npc_llm_model` threads it through `OpenRangeRun`.
Reference NPC:
* `cyber.curious_employee` -- a ~50 LOC `AgentNPC` subclass. Wraps `interface["http_get"]` as a strands `@tool` and acts as a casual internal employee browsing the company webapp. Goes silent if strands isn't installed; the episode is unaffected.
Tests cover: requires_llm flag default + override, AgentNPC cadence + build failures + invocation errors + stop/cleanup, runtime LLM injection (opt-in vs. not, configured vs. not), and the new pack NPC's factory + entry-point registration.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
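The cadence and broken-state behaviour described in this commit can be pictured with a minimal sketch. It is not the code from the PR: the `step()` signature, the `_build_agent` hook shape, the `broken_reason` attribute layout, and the `CuriousEmployeeSketch` class with its stand-in "agent" (a plain lambda instead of a strands agent) are all assumptions made for illustration.

```python
from __future__ import annotations

from abc import ABC, abstractmethod
from typing import Any, Callable, ClassVar, Mapping, Optional


class NPC(ABC):
    """Plain NPC: no LLM unless a subclass opts in."""

    requires_llm: ClassVar[bool] = False

    @abstractmethod
    def step(self, interface: Mapping[str, Any]) -> None: ...


class AgentNPC(NPC):
    """Opt-in base: acts every `cadence_ticks` ticks, never retries a failed build."""

    requires_llm: ClassVar[bool] = True

    def __init__(self, system_prompt: str, cadence_ticks: int = 3) -> None:
        self.system_prompt = system_prompt
        self.cadence_ticks = cadence_ticks
        self.broken_reason: Optional[str] = None
        self._agent: Optional[Callable[[str], Any]] = None
        self._ticks = 0

    @abstractmethod
    def _build_tools(self, interface: Mapping[str, Any]) -> list: ...

    @abstractmethod
    def _build_agent(self, tools: list) -> Callable[[str], Any]: ...

    def step(self, interface: Mapping[str, Any]) -> None:
        if self.broken_reason is not None:
            return  # broken NPCs stay quiet instead of retrying every tick
        self._ticks += 1
        if self._ticks % self.cadence_ticks != 0:
            return  # not an acting tick
        if self._agent is None:
            try:
                self._agent = self._build_agent(self._build_tools(interface))
            except Exception as exc:
                self.broken_reason = f"agent build failed: {exc}"
                return
        try:
            self._agent("Take your next action.")
        except Exception:
            pass  # per-tick failures are silent; the episode is unaffected


class CuriousEmployeeSketch(AgentNPC):
    """Hypothetical subclass; the lambda below stands in for an LLM agent."""

    def __init__(self, cadence_ticks: int = 3) -> None:
        super().__init__("You are a casual internal employee.", cadence_ticks)
        self.visited: list[str] = []

    def _build_tools(self, interface: Mapping[str, Any]) -> list:
        http_get = interface["http_get"]  # KeyError here counts as a build failure

        def browse(path: str) -> str:
            self.visited.append(path)
            return http_get(path)

        return [browse]

    def _build_agent(self, tools: list) -> Callable[[str], Any]:
        browse = tools[0]
        return lambda prompt: browse("/")
```

With a cadence of 3, nine ticks produce three tool calls; an interface without `http_get` marks the NPC broken on its first acting tick and it stays silent afterwards.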
…end, not a model id
Lifts the NPC->LLM seam from a strands-shaped model id string to a provider-neutral ``AgentBackend`` protocol. NPCs no longer reach into strands directly; they ask their backend to ``build_agent``, and the backend handles the provider details.
Two implementations ship:
* ``StrandsAgentBackend`` -- canonical, wraps ``strands.Agent``. Lazy-imports strands so the optional extra is only required if this backend is actually instantiated.
* ``CodexAgentBackend`` -- wraps the existing ``openrange.llm.CodexBackend`` for tool-less agent prompts. Same Codex binary the builder uses, no strands install needed. Errors loudly if handed any tools (Codex's tool surface isn't exposed for arbitrary callable injection).
Both backends implement a ``preflight()`` method so the broken-state machinery surfaces missing deps (strands not installed, codex CLI not on PATH) at episode start rather than on the first acting tick.
Wire-up:
* ``EpisodeService(npc_agent_backend=...)`` -- pass any ``AgentBackend``. The legacy ``npc_llm_model="..."`` string still works; it auto-promotes to ``StrandsAgentBackend(model=...)``. Passing both is rejected.
* ``RunConfig`` mirrors the same pair of knobs.
* The per-NPC context delivered to ``AgentNPC.start()`` now carries ``agent_backend`` (an ``AgentBackend`` instance or ``None``), replacing the previous ``llm`` (model id string) key.
* ``AgentNPC(agent_backend=...)`` -- explicit per-NPC override always wins over the runtime backend. The strands-only ``model=`` knob is gone (use ``agent_backend=StrandsAgentBackend(model=...)``).
* ``cyber.curious_employee``: per-NPC ``model: str`` config still works as a YAML-friendly convenience (auto-promotes to a per-NPC ``StrandsAgentBackend``).
Top-level ``openrange`` re-exports ``AgentBackend``, ``AgentBackendError``, ``StrandsAgentBackend``, ``CodexAgentBackend`` alongside the existing ``LLMBackend`` family.
Tests cover: both backends' preflight + build paths, Strands tool-rejection, Codex tool-rejection, AgentNPC with constructor backend vs. runtime-supplied backend, broken-on-no-backend, broken on backend preflight failure, EpisodeService backend-vs-model knob mutual exclusion, and the cyber pack's per-NPC model promotion.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
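A rough sketch of the backend seam and the knob resolution described above, under stated assumptions: the `build_agent(system_prompt, tools)` signature, the `ToollessBackend` and `ModelBackend` stand-ins, and the `resolve_backend` helper are hypothetical and only illustrate the protocol shape, the "errors loudly if handed tools" rule, and the mutual exclusion between the two knobs.

```python
from __future__ import annotations

from typing import Any, Callable, Optional, Protocol, Sequence


class AgentBackendError(RuntimeError):
    """Raised when a backend cannot build or run an agent."""


class AgentBackend(Protocol):
    def preflight(self) -> None:
        """Raise AgentBackendError if a dependency is missing."""

    def build_agent(
        self, system_prompt: str, tools: Sequence[Callable[..., Any]]
    ) -> Callable[[str], Any]:
        """Return a callable that takes a user prompt."""


class ToollessBackend:
    """Stand-in for a backend that, like the Codex one, rejects tools."""

    def preflight(self) -> None:
        return None  # nothing to check in this sketch

    def build_agent(self, system_prompt, tools):
        if tools:
            raise AgentBackendError("this backend does not accept tools")
        return lambda prompt: f"[{system_prompt}] {prompt}"


class ModelBackend(ToollessBackend):
    """Stand-in for the backend a legacy model-id string is promoted to."""

    def __init__(self, model: str) -> None:
        self.model = model


def resolve_backend(
    npc_agent_backend: Optional[AgentBackend] = None,
    npc_llm_model: Optional[str] = None,
) -> Optional[AgentBackend]:
    """Pick the runtime backend; passing both knobs is rejected."""
    if npc_agent_backend is not None and npc_llm_model is not None:
        raise ValueError("pass npc_agent_backend or npc_llm_model, not both")
    if npc_llm_model is not None:
        return ModelBackend(model=npc_llm_model)  # legacy string auto-promotes
    return npc_agent_backend
```

Running `preflight()` once at episode start, rather than inside the first acting tick, is what lets a missing dependency show up as a broken NPC immediately.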
No other core module declares one — package exports are managed via ``openrange/core/__init__.py``. Removing keeps the convention. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Public-facing protocols pack authors implement against don't belong under ``openrange.core`` -- that namespace is for internal building blocks (manifest parsing, episode service internals, runtime backings, snapshot store, graph machinery). Pack authors should import from ``openrange`` (or top-level submodules), never from ``.core``.
Moves:
* ``openrange.core.agent_backend`` -> ``openrange.agent_backend``
* ``openrange.core.npc`` -> ``openrange.npc``
Both now sit at the same tier as ``openrange.llm`` and ``openrange.runtime`` -- public, top-level, owned by users and pack authors.
Top-level ``openrange`` re-exports the NPC family (``NPC``, ``AgentNPC``, ``NPCRegistry``, ``NPCError``, ``NPCS``) alongside the existing ``LLMBackend`` / ``AgentBackend`` exports, so the typical pack import becomes ``from openrange import NPC, AgentNPC, ...`` -- no submodule reach-through. The cyber_webapp pack imports were updated to match.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…enrange.core`
Same convention call as the npc + agent_backend relocations: ``openrange.core`` is internal building blocks; pack authors should import from the top-level ``openrange`` package, where everything public is already re-exported.
All cyber_webapp imports of ``openrange.core.errors``, ``openrange.core.graph``, ``openrange.core.pack``, ``openrange.core.builder_protocol``, ``openrange.core.builder``, and ``openrange.core.manifest`` are rewritten to ``from openrange import ...``. Functionally identical (those names already lived at the top level via ``openrange/__init__.py`` re-exports); the diff is just the import paths.
The cyber pack now has zero ``openrange.core.*`` imports, which matches the rule: packs talk to ``openrange``, not into its internals.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pure cosmetic reformat — multi-line function calls collapsed onto one line where they fit, trailing whitespace removed, no semantic changes. Pulled out as a standalone commit so the dashboard work that follows stays diffable. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…projection
`uv run python -m openrange dashboard` (no args) now serves a
tensorboard-style viewer that watches `./or-runs/`, lists every run
that has dashboard artifacts, and lets the operator switch between
them live. The previous CLI required `--run-root <path>` and only
viewed one run at a time.
Architecture (env-owned, pack-agnostic):
- `dashboard/runs.py` — `RunsRegistry` discovers run subdirs that
carry `dashboard.events.jsonl` + `dashboard.json`, lazily mints
a `DashboardView` per run, and caches it. Path-traversal guard
rejects `?run=../../etc` style ids.
- `dashboard/server.py` — server holds either a registry (multi-run)
or a single view (embedded `OpenRangeRun.serve_dashboard()`).
All per-run routes resolve via `?run=<id>` (falls back to the
registry's newest). New `GET /api/runs` returns `{runs, default}`.
- `dashboard/topology.py` — graph -> view projection lives here, in
the dashboard module (not the cyber pack). Reads standard v1
ontology nodes/edges directly from `snapshot.world_graph` and
surfaces services / endpoints / vulns / zones / users. Other
packs can still ship their own `topology.json` artifact or
`world.topology` to override.
- `dashboard/static/` — left-side collapsible "RUNS" drawer with a
list view (cards per run, active run highlighted in accent),
"Follow latest" toggle (default on; auto-switches when newer
runs land), Esc closes drawer. SPA polls `/api/runs` every 5s
for live discovery.
CLI:
- `--runs-dir or-runs/` (default) — multi-run mode
- `--run-root <dir>` — single-run mode
- `--snapshot-id <id>` — explicit snapshot-store mode (no implicit
fallback when `./snapshots/` happens to exist; surprising
behaviour fixed)
- Adds `if __name__ == "__main__": main()` at module bottom — the
bare `python -m openrange dashboard` was previously a silent
no-op because `main()` was never invoked.
`OpenRangeRun.serve_dashboard()` stays as the embedded primitive
for callers that want an in-process server. Default mode just writes
events to disk; the standalone dashboard process is the canonical
viewer.
example/codex_eval.py prints a hint pointing at the dashboard CLI
after build instead of trying to wire its own server.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
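The run discovery and path-traversal guard described for `RunsRegistry` could look roughly like the sketch below. The function name `resolve_run_dir` and the exact rejection rules are assumptions; only the two required artifact names and the `?run=../../etc` example come from the commit message.

```python
from __future__ import annotations

from pathlib import Path
from typing import Optional

# A run only counts as a dashboard run if both artifacts exist.
REQUIRED_ARTIFACTS = ("dashboard.events.jsonl", "dashboard.json")


def resolve_run_dir(runs_dir: Path, run_id: str) -> Optional[Path]:
    """Map a `?run=<id>` value to a run directory, or None if it is unsafe or unknown."""
    # Reject ids that are empty, relative markers, or contain path separators.
    if not run_id or run_id in {".", ".."} or "/" in run_id or "\\" in run_id:
        return None
    root = runs_dir.resolve()
    candidate = (root / run_id).resolve()
    # After resolving symlinks the run must still be a direct child of the runs dir.
    if candidate.parent != root:
        return None
    if not all((candidate / name).is_file() for name in REQUIRED_ARTIFACTS):
        return None
    return candidate
```

Checking the resolved parent, in addition to rejecting separators, also covers a symlinked run directory that points outside the runs root; whether the real guard is that strict is not stated in the commit.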
Eight cleanups found during the PR review:
1. NPC base-class docstring corrected -- was "llm" key in context, is actually "agent_backend".
2. EpisodeService._start_npcs comment clarified -- manifest-shape errors propagate; per-NPC SDK failures are caught and surfaced via broken_reason.
3. RunConfig mutual-exclusivity (npc_agent_backend vs npc_llm_model) validated at OpenRangeRun.__init__ instead of waiting for episode_service() to surface it.
4. _mark_broken: exc_info=exc directly. The previous "fallback to True" asked logging.warning to introspect sys.exc_info outside any except block -- would either grab unrelated in-flight exceptions or print "NoneType: None".
5. auto_evolve emits an "auto_evolve_chosen" event before forwarding to evolve(), so the dashboard lineage view gets the full direction + note + parent narrative instead of seeing two snapshots appear back-to-back.
6. topology_from_world_graph docstring is honest about its cyber-pack-ontology coupling (was "pack-agnostic", which overstated the reusability).
7. Dashboard CSS title positioning consolidated through three custom properties (--sim-title-offset, --sim-toggle-width, --sim-title-left). No more cascade-override hack between two .sim-title rules.
8. LLMBackend gains a preflight() protocol method (default no-op); CodexBackend overrides to check the codex CLI is on PATH; CodexAgentBackend now delegates to it instead of skipping the check for caller-supplied custom backends. Test updated to match the new "every backend self-describes" contract.
177 tests + 2 skipped, ruff clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
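Item 4 is easy to reproduce with the standard `logging` module. The sketch below is an illustration only: `mark_broken` and `render` are hypothetical helpers, not the PR's `_mark_broken`. It shows that passing the exception instance (or `None`) as `exc_info` logs exactly that exception, whereas `exc_info=True` outside an `except` block makes the formatter render an empty `sys.exc_info()`.

```python
from __future__ import annotations

import io
import logging
from typing import Optional, Union


def mark_broken(
    logger: logging.Logger,
    name: str,
    reason: str,
    exc: Union[BaseException, bool, None] = None,
) -> None:
    # exc_info accepts an exception instance; None means "no traceback at all".
    logger.warning("NPC %s marked broken: %s", name, reason, exc_info=exc)


def render(exc: Union[BaseException, bool, None]) -> str:
    """Capture what a single mark_broken call writes to a stream handler."""
    stream = io.StringIO()
    handler = logging.StreamHandler(stream)
    logger = logging.getLogger("npc-broken-sketch")
    logger.propagate = False
    logger.addHandler(handler)
    try:
        mark_broken(logger, "curious_employee", "agent build failed", exc)
    finally:
        logger.removeHandler(handler)
    return stream.getvalue()


with_exception = render(ValueError("boom"))  # logs the ValueError itself
without_exception = render(None)             # logs the message only
fallback_to_true = render(True)              # called outside any except block
```

In the last call there is no exception in flight, so the captured text ends with the "NoneType: None" line the commit message mentions.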
…chatter, meet
Three pieces, all on the path to the physical-simulation vision (#219): office NPCs walking around a floor and chatting, the dashboard actually streaming their events live, and CI mypy clean across the tree.
Core wiring:
* ``NPC.actor_id`` (default ``<ClassName>-<short hash>``; subclasses set ``self._actor_id`` for friendly names) plus a per-NPC ``record_action`` callable injected into ``start()`` context. NPCs use it to publish ``ActorTurn``-shaped events tagged with their own actor_id. Errors swallowed -- recording is observational.
Reference NPC (cyber.office_chatter):
* Scripted, deterministic via seed. Each acting tick is either ``{"speak": phrase}`` (fading bubble in the dashboard) or ``{"move": "wandering"}`` (walks to a colleague's desk). Initial cooldown staggered by seed so 6 chatters don't fire in lock-step on tick 0. ``start()`` emits a ``{"present": True}`` event so characters appear at their desks immediately rather than waiting out the first cadence window.
Dashboard:
* Speech bubbles (3s hold + 1s fade) above characters on ``action.speak`` events, no movement.
* Decoupled office desks from cyber services -- 8-desk grid laid out on the south side of the floor, NPCs anchored to a stable home desk via ``homeDeskFor(actor_id)`` hash. ``move`` events walk visitor to a colleague's desk; host fires a one-line reply shortly after; visitor returns home 9-13s later.
* Agent / runtime characters no longer render as walking bodies in the office -- service ring colors flash for HTTP traffic instead.
* Station + desk labels removed; cleaner scenery.
* Fingerprint stability: ``simulationFingerprint`` is now topology-only, so every new event no longer triggers a full world rebuild that wiped mid-walk characters.
* RUNS toggle moved to a tiny ``›`` chevron in the upper-left corner; clock back at the top-right.
Live streaming (the main bug):
* ``DashboardView`` gained a ``tail=True`` kwarg that polls ``dashboard.events.jsonl`` every 250ms and pushes new lines into the bridge. Single-run mode (``openrange dashboard --run-root``), multi-run mode (``RunsRegistry``), and the SSE pipeline all use it now. Embedded writer mode (``OpenRangeRun.serve_dashboard``) keeps ``tail=False`` so it doesn't re-publish its own writes.
* ``_stored_section`` re-reads ``dashboard.json`` on each request in reader-mode so a snapshot the writer lands AFTER the view was constructed surfaces immediately -- no need to restart the dashboard server when an eval starts mid-run.
* ``/api/runs`` synthesizes a non-null default in single-run mode (using either the live snapshot or the stored topology id, or the literal ``"single"``) so the SPA actually sets ``activeRun`` and ``openStreams`` opens the SSE connection. Without this fix single-run mode showed events frozen at boot -- events flowed to disk, the tail pushed them into the bridge, but the SPA never opened the SSE listener.
* SPA also subscribes to ``builder_step`` events (in addition to ``env_turn`` / ``agent_step`` / ``note``) so the empty-state placeholder hands off to the live world the moment the builder lands ``snapshot_created``, with no env_turn yet to trigger a refresh.
Tests:
* New ``tests/test_dashboard_runs.py`` -- tail picks up appended events within 2s, preserves history-on-open, ignores partial trailing lines, handles truncation, joins cleanly on close.
* ``tests/test_cyber_npcs.py`` extended for ``OfficeChatter`` factory / cadence / speech / walk / silent-without-recorder / presence-on-start / staggered cooldown.
* ``tests/test_v1_episode.py`` -- episode runtime injects ``record_action`` into NPC context and the resulting events surface tagged with the NPC's actor_id.
* CI cleanup: ``tests/test_curriculum.py`` / ``tests/test_cyber_auto_curriculum.py`` / runtime.py / test_agent_backend.py / test_v1_episode.py mypy fixes (was 26 errors blocking CI).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
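The tail behaviours listed in the tests (ignore a partial trailing line, recover from truncation) reduce to a small offset-tracking read. The sketch below is an assumption about how such a poll step might be written, not the PR's `_EventLogTail`; `read_new_lines` is a hypothetical name, and a real tail would call it from a 250ms polling loop.

```python
from __future__ import annotations

from pathlib import Path


def read_new_lines(path: Path, offset: int) -> tuple[list[str], int]:
    """Return complete lines appended since `offset`, plus the new offset."""
    size = path.stat().st_size
    if size < offset:
        offset = 0  # file was truncated or replaced: start again from the top
    with path.open("rb") as handle:
        handle.seek(offset)
        chunk = handle.read()
    end = chunk.rfind(b"\n")
    if end == -1:
        return [], offset  # only a partial line so far; wait for the writer
    complete = chunk[: end + 1]
    lines = [raw.decode("utf-8") for raw in complete.splitlines() if raw.strip()]
    # Advance past complete lines only, so a half-written tail is re-read later.
    return lines, offset + len(complete)
```

Working on bytes keeps the offset arithmetic exact even when an event contains multi-byte characters.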
… about no-any-return
CI-blocking: src/openrange/agent_backend.py:138 was returning Any from a function declared to return Callable[[str], Any]. Adding an AgentSession-typed local satisfies the no-any-return check without changing behavior.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
User feedback: chatters not visibly moving around. Three knobs:
* OfficeChatter: cadence_ticks 9 -> 6 (act every 4s), walk_probability 0.3 -> 0.5 (half of acts are walks). Each NPC initiates a visit every ~8s on average; across 6 chatters that is one walk every ~1.3s globally.
* JS walk speed: 4.2 -> 8.0 units/s plus stronger leg swing. Crossing the office takes ~1-2s instead of 3-5s.
* Visit duration: 9-13s -> 4-6s; host reply delay 2.5-4s -> 1.5-2.5s so the reply lands while the visitor is actually there.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
User-reported: NPCs stand still and speak way too often.
Root cause: ``OfficeChatter.step`` only emitted ``move`` events when ``self._known_targets`` was non-empty, and that set was only populated in ``start()`` when a ``home`` config field was passed. The example manifest never set ``home``, so chatters never had targets, so every act fell through to the speech branch. Result: zero motion, all chatter, exactly the symptom the user kept seeing.
Fix: walk vs. speak is now a clean ``random() < walk_probability`` flip. The dashboard picks the destination desk client-side from the visitor home; the backend does not need to know any service ids to emit a walk. ``home`` is still honored as the optional ``target`` field on the move event for callers that want to override the dashboard pick.
Also tuning: cadence_ticks 6 -> 8, walk_probability 0.5 -> 0.8. Gives ~1 walk/sec and ~1 speech/4s across the office.
Smoke (6 chatters, 30 ticks each): 20 walks vs 4 speaks. Was 0 walks.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
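The rate figures quoted in the two tuning commits follow from simple arithmetic. The sketch below assumes a tick of 4/6 s, which is inferred from the earlier "cadence_ticks 9 -> 6 (act every 4s)" note and is not stated anywhere as a constant; `office_rates` is a hypothetical helper.

```python
def office_rates(
    cadence_ticks: int,
    walk_probability: float,
    chatters: int = 6,
    tick_seconds: float = 4 / 6,  # assumption: 6 ticks were described as 4 seconds
) -> dict:
    """Expected office-wide walk and speech rates, in events per second."""
    act_interval = cadence_ticks * tick_seconds  # seconds between acts for one NPC
    acts_per_second = chatters / act_interval    # across the whole office
    return {
        "walks_per_second": acts_per_second * walk_probability,
        "speaks_per_second": acts_per_second * (1 - walk_probability),
    }


earlier = office_rates(cadence_ticks=6, walk_probability=0.5)
retuned = office_rates(cadence_ticks=8, walk_probability=0.8)
```

Under that assumption the earlier settings give one walk roughly every 1.3s office-wide, and the retuned settings give about 0.9 walks per second and one speech roughly every 4.4s, in line with the "~1 walk/sec and ~1 speech/4s" figure.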
User-reported: post-build the dashboard never transitions out of the "no admitted world" empty state, and after a server restart the backlog dumps in a burst then nothing else updates.
Two robustness changes on the SPA side:
* Debounce SSE event handlers via ``scheduleRefresh`` (150 ms coalesce). The backlog burst on reconnect was firing up to 200 parallel refresh() calls, each doing 6 concurrent API fetches. That race can land the model on stale snapshots and pegs the server during the burst. Coalescing to one refresh per 150 ms keeps the UI accurate without the thundering herd.
* 2s polling refresh as an SSE fallback. SSE is still primary, but if the connection ever drops silently (browser disconnect, intermediary timeout, etc.) the poll keeps the UI in sync with on-disk state -- including the post-build snapshot transition.
Backend was already correct end-to-end (verified via curl: tail picks up appended events, /api/topology returns snapshot_id, SSE delivers live). This is purely SPA-side robustness.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two related issues found while end-to-end-testing the eval's auto_evolve loop:
1. ``auto_evolve`` previously called ``evolve()`` on the single highest-relevance candidate and let any ``BuildFailed`` propagate. A pack proposing a mutation tag that doesn't admit (e.g. the cyber pack's ``add`` placing a vuln off the oracle path) crashes the whole eval mid-curriculum. Now it walks candidates by descending relevance, surfaces each skip via the ``event_sink`` (``auto_evolve_skipped``), and returns ``None`` only when every candidate fails. Eval continues with the previous snapshot.
2. The cyber pack's ``add`` mutation in ``cyber_webapp.mutation._add_vulns_by_kind`` placed the new vuln on whatever endpoint came first in iteration order, with no awareness of the oracle path. When the prior ``patch`` had stripped the only vuln on the oracle service, the next ``add`` would land elsewhere and the resulting graph would fail ``OraclePathExistsConstraint``. Added ``_oracle_path_targets`` that walks ``flag → record → data_store → service → endpoint``, and re-ordered the ``add`` candidate list so oracle-path endpoints / services come first.
End-to-end verified: a pass→harden→fail→soften→fail→soften→pass→harden walk now produces 4 distinct snapshots with the world genuinely mutating each step (vulns added/removed in line with the curriculum direction), no crashes, no skipped events.
191 tests + 2 skipped, ruff clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
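The candidate walk from item 1 can be sketched as follows. This is an illustration under assumptions: the `Candidate` dataclass, the event payload fields, and the `auto_evolve` signature are hypothetical; only the event names `auto_evolve_chosen` and `auto_evolve_skipped`, the descending-relevance order, and the "return `None` only when every candidate fails" rule come from the commit messages.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Any, Callable, Iterable, Optional


class BuildFailed(Exception):
    """Stand-in for the build error a non-admitting mutation raises."""


@dataclass(frozen=True)
class Candidate:
    tag: str
    relevance: float


def auto_evolve(
    candidates: Iterable[Candidate],
    evolve: Callable[[Candidate], Any],
    event_sink: Callable[[dict], None],
) -> Optional[Any]:
    """Try candidates from most to least relevant; None means keep the old snapshot."""
    for candidate in sorted(candidates, key=lambda c: c.relevance, reverse=True):
        event_sink({"type": "auto_evolve_chosen", "tag": candidate.tag})
        try:
            return evolve(candidate)
        except BuildFailed as exc:
            # Surface the skip and fall through to the next-best candidate.
            event_sink(
                {"type": "auto_evolve_skipped", "tag": candidate.tag, "reason": str(exc)}
            )
    return None
```

A caller that receives `None` simply continues the curriculum with the previous snapshot instead of crashing mid-eval.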
User-reported: hitting Ctrl+C on the eval leaves the cyber webapp subprocess (uvicorn / generated app.py) reparented to PID 1. After many interrupted sessions there are 20+ orphan ports bound and file descriptors leaked.
Three changes:
* ``start_runtime_process`` uses ``start_new_session=True`` so the runtime subprocess gets its own session/process group. Without it, Ctrl+C in the harness terminal sends SIGINT to every child too -- some of those (uvicorn, HTTPServer) handle SIGINT via graceful-shutdown paths that race with the parent cleanup and leak the process when reparented.
* ``stop_process`` SIGTERMs the whole process group when the subprocess actually has its own group (different pgid from the caller) -- catches uvicorn workers, request threads, anything spawned downstream. Critically: when the subprocess shares the caller pgid (bare Popen, e.g. test fixtures), it falls back to process.terminate() instead of killing the whole group, which would otherwise SIGTERM the test runner itself.
* ``EpisodeService`` registers an ``atexit`` hook that walks any still-running episode artifacts and stops them. Backstop for the case where a try/finally somewhere upstream missed cleanup -- KeyboardInterrupt fired during cleanup, an unrelated exception, etc. Idempotent and best-effort.
Verified end-to-end: starting an eval and sending SIGINT 4s in now leaves zero orphan app.py processes (was leaking one per interrupted run).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
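The first two changes map onto a handful of standard-library calls. The sketch below reuses the function names from the commit message but is a POSIX-only illustration written from that description, not the PR's implementation; the timeout value and the SIGKILL escalation are assumptions.

```python
from __future__ import annotations

import os
import signal
import subprocess
from typing import Sequence


def start_runtime_process(argv: Sequence[str]) -> subprocess.Popen:
    # A new session gives the child its own process group, so a terminal
    # Ctrl+C aimed at the harness is not delivered to the child as well.
    return subprocess.Popen(list(argv), start_new_session=True)


def stop_process(process: subprocess.Popen, timeout: float = 5.0) -> None:
    if process.poll() is not None:
        return  # already exited
    try:
        child_pgid = os.getpgid(process.pid)
    except ProcessLookupError:
        return
    own_group = child_pgid != os.getpgid(0)
    try:
        if own_group:
            os.killpg(child_pgid, signal.SIGTERM)  # child plus anything it spawned
        else:
            process.terminate()  # shared group: never signal our own group
    except ProcessLookupError:
        return
    try:
        process.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        # Assumed escalation path for a child that ignores SIGTERM.
        if own_group:
            os.killpg(child_pgid, signal.SIGKILL)
        else:
            process.kill()
        process.wait()
```

The pgid comparison is the important guard: a process started with a bare `Popen` shares the caller's group, and signalling that group would also terminate the test runner.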
User-reported (10th time): events do not stream to the SPA, even though server-side polling /api/state shows event_count growing live (29 -> 34 -> 38 -> 43 -> 48 over 10s). The hang must be somewhere in the browser layer that I cannot see from here, but the SPA can be made resilient regardless.
Three changes:
* Polling refresh moved to 1s (was 2s). Polling is now the *primary* live-update path; SSE is a nice-to-have optimization. If SSE drops silently for any reason (browser quirk, proxy timeout, connection-pool exhaustion) the 1s poll keeps the UI in sync.
* ``safeRefresh`` / ``safeRefreshRuns`` wrap fetch loops so a single failed poll does not kill the interval chain.
* Visible freshness indicator: when a refresh has not landed in >5s, the subtitle suffixes "· last update Ns ago" so the operator can immediately tell whether the SPA is stuck (with no need to open dev tools). Cleared back to the bare title when fresh.
Server-side was already correct (verified end-to-end via curl against a live eval). This commit is purely client-side resilience for the case where the user cannot see why the page appears frozen.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
User-reported: chatter speech was disconnected one-liners -- visitor says "did the prod deploy go out?", host replies "huh" -- that did not read as a workplace exchange.
Rebuilt around a JS-side ``EXCHANGES`` table of 16 short coherent opener-and-reply pairs. When an NPC walks to a colleague, the dashboard:
1. picks one exchange,
2. has the visitor speak the opener ~1.6s after starting the walk (right when they arrive at the host desk),
3. has the host speak one of the matching replies ~1.6-2.2s after that.
Result: every visit reads as a real exchange -- "deploy went out?" → "yeah, just now"; "build is red on main" → "ill take a look" -- with sensible diversity across visits.
Backend simplification: ``OfficeChatter`` is now walks-only. It no longer emits standalone ``speak`` events; speech is entirely dashboard-orchestrated as part of a visit. Drops the per-NPC ``lines`` config (the example manifest no longer carries 4 phrases per staff member -- every chatter draws from the shared 16-exchange pool, which is far more diverse than 4 hand-coded mutterings).
Tests updated: walks-only emission, presence event still fires, cadence still respected, walk_probability=0 means quiet ticks.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Four cleanups for the merge-to-main pass:
* Move the EXCHANGES dialogue corpus from ``dashboard.js`` into ``cyber.office_chatter``. The chatter now picks an opener-and-reply pair per walk and ships them on the move event; the dashboard just reads ``action.opener`` and ``action.reply`` and times the bubbles. Scenario content lives in the pack; the renderer stays generic.
* Tighten history-shaped comments. ``.rules`` reserves PR/commit context for PRs/commits -- comments should explain *why* (subtle invariants, non-obvious constraints), not what changed.
* Drop boilerplate docstrings from internal helpers (_user_prompt, _build_agent, _invoke_agent) -- name + body speak for themselves. Keep the one on _mark_broken because the ``exc_info=exc`` choice is non-obvious; collapse from a four-paragraph rationale to two lines.
* Tighten _EventLogTail / DashboardView.close docstrings to the same standard.
194 tests pass; ruff and mypy clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…enders
OfficeChatter now ships home_index (stable SHA1 of name) and target_name (chosen colleague) on its events. The dashboard reads them straight off the event stream instead of recomputing seating with its own per-name hash and inventing who-visits-whom with a random pick over desks. SHA1 (not Python's randomized hash) so the same chatter lands on the same desk across runs.
Drops homeDeskFor and pickColleagueDesk from dashboard.js, plus the rebuild-time NPC pre-spawn -- chatters announce themselves via their present event in start(), which carries the same home_index.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`_seed_for_state` was mixing the configured seed with `hash(pack.id)`, but Python's built-in `hash` is randomized per process — different PYTHONHASHSEED values yielded different graphs for the same manifest, and a handful of seeds (PYTHONHASHSEED=17, 20, 31 reproduce locally) generated worlds that failed feasibility, flaking `test_evolve_adds_new_vulns` in CI. Swap to SHA1 so the same pack id always derives the same offset, matching the `_stable_home_index` pattern used in the office-chatter NPC. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
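A minimal sketch of the SHA1-based derivation, assuming a 32-bit offset and simple additive mixing (the PR does not say how the seed and offset are actually combined). `stable_offset`, `seed_for_state`, and `stable_home_index` here are illustrative helpers, not the PR's `_seed_for_state` or `_stable_home_index`.

```python
import hashlib


def stable_offset(text: str, modulus: int = 2**32) -> int:
    """Derive an integer from text that is identical in every process."""
    digest = hashlib.sha1(text.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % modulus


def seed_for_state(seed: int, pack_id: str) -> int:
    # Assumed mixing: configured seed plus a per-pack offset, wrapped to 32 bits.
    return (seed + stable_offset(pack_id)) % 2**32


def stable_home_index(name: str, desks: int = 8) -> int:
    # Same idea for seating: the same name always maps to the same desk.
    return stable_offset(name, desks)
```

Unlike the built-in `hash()` on strings, which is salted per process unless `PYTHONHASHSEED` is fixed, the digest depends only on the input bytes, so the same manifest derives the same graph on every run.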
The eval scripts are about agent + builder behavior, not the office demo scenery. Remove the cyber.office_chatter spam (and the shared _office_demo helper) from codex_eval and strands_eval; users who want the populated 3D office can add chatters to their own manifest. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
npc.py: the module/class/method docstrings were each restating the lifecycle, the requires_llm contract, and the AgentBackend injection flow. Collapse to one statement per concept. Drop the empty "Default: no-op" docstring on stop(), the empty NPCError docstring, and the six-line preamble explaining why we don't isinstance-check the runtime backend. runtime.py: drop the "validate up front rather than waiting" preamble that just narrates the next two lines, and tighten the RunConfig field comments — both already say what's needed without repeating the obvious. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Closes #74. Lifts NPCs from scripted-loop "smarter cron jobs" to agent loops with a persona, a tool surface, and an opt-in LLM. Reuses the optional `strands-agents` SDK so we don't re-invent tool dispatch / retries / streaming.
The pack-author surface for an LLM-backed NPC is now ~50 lines: persona string, a `_build_tools(interface)` hook that wraps interface methods as `@strands.tool`-decorated callables, an entry point. See `cyber.curious_employee` as the reference.
What changed
Core (`src/openrange/core/npc.py`):
- `NPC.requires_llm: ClassVar[bool] = False` -- opt-in flag. Plain NPCs ignore it; LLM-backed NPCs set it `True`.
- `AgentNPC(NPC)` base -- `requires_llm = True`, owns the cadence machine, lazy-imports strands, swallows per-tick LLM errors, marks itself broken on build failure so we don't retry every tick. Subclasses provide `_build_tools(interface)` and (optionally) `_user_prompt(interface)`.
Episode runtime (`src/openrange/core/episode.py`, `src/openrange/runtime.py`):
- `EpisodeService(npc_llm_model=None)` constructor kwarg.
- `_start_npcs` builds a per-NPC context: NPCs with `requires_llm = True` additionally receive an `llm` key (string model id or `None`); plain NPCs see no `llm` key and pay nothing.
- `RunConfig.npc_llm_model` threads through `OpenRangeRun.episode_service()`.
Reference NPC (`packs/cyber_webapp/cyber_webapp/npcs/curious_employee.py`):
- `cyber.curious_employee` -- wraps `interface["http_get"]` as a strands tool and acts as a casual internal employee browsing the company webapp. Goes silent if `strands-agents` isn't installed (the episode is unaffected). Registered via the existing `openrange.npcs` entry-point group.
Tooling:
- `pyproject.toml` -- added `strands` / `strands.*` / `strands_tools.*` to mypy `ignore_missing_imports` so core type-checks cleanly without the optional extra installed.
Tests
All 141 tests pass under `uv run --extra strands pytest`.
New coverage:
- `requires_llm` flag default + opt-in override.
- `AgentNPC`: cadence math, agent reuse across acting ticks, model override precedence (constructor wins over runtime), broken-state on build failure (no per-tick retries), invocation-error swallowing, stop() clears the agent reference, default `_build_agent` raises a clear `NPCError` when strands is missing.
- `llm` key injected only for opt-in NPCs; `llm = None` when `npc_llm_model` isn't configured; plain NPCs see no `llm` key.
- `cyber.curious_employee` factory (defaults, overrides, bad-config rejection) and entry-point registration.
Test plan
- `uv run ruff check src tests packs` -- clean
- `uv run mypy src packs/cyber_webapp` -- no new errors (6 pre-existing on this branch are unrelated dashboard/codegen work)
- `uv run --extra strands pytest -q` -- 141 passed
- Add `cyber.curious_employee` to a manifest and confirm it issues HTTP requests against the live cyber webapp. (Out of scope for CI; requires a live API key.)
Notes
This supersedes the prior phasing of #74 (memory → schedule → reactive → LLM). Memory, scheduling, and reactivity all fall out of "NPCs are agents with tools" -- an agent loop with state and tool calls covers them without per-feature primitives in the NPC ABC.
The two existing scripted NPCs (`cyber.browsing_user`, `cyber.admin_audit`) remain as the "no LLM, hand-written `step()`" reference and continue to pass tests unchanged.
🤖 Generated with Claude Code