
feat(npc): NPCs as agents with tools (optionally LLM-backed)#217

Merged
larstalian merged 23 commits into main from feat/agent-npcs-strands
May 4, 2026

@larstalian
Collaborator

Summary

Closes #74. Lifts NPCs from scripted-loop "smarter cron jobs" to agent loops with a persona, a tool surface, and an opt-in LLM. Reuses the optional strands-agents SDK so we don't re-invent tool dispatch / retries / streaming.

The pack-author surface for an LLM-backed NPC is now ~50 lines: persona string, an _build_tools(interface) hook that wraps interface methods as @strands.tool-decorated callables, an entry point. See cyber.curious_employee as the reference.

What changed

Core (src/openrange/core/npc.py):

  • NPC.requires_llm: ClassVar[bool] = False — opt-in flag. Plain NPCs ignore it; LLM-backed NPCs set it True.
  • AgentNPC(NPC) base — requires_llm = True, owns the cadence machine, lazy-imports strands, swallows per-tick LLM errors, marks itself broken on build failure so we don't retry every tick. Subclasses provide _build_tools(interface) and (optionally) _user_prompt(interface).
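
The cadence and broken-state behaviour described above can be sketched roughly as follows. This is a minimal stand-in, not the openrange class: `SketchAgentNPC`, `build_agent`, `cadence_ticks`, and `build_attempts` are hypothetical names chosen for illustration, and the real base class also deals with tools, prompts, and the lazy strands import.

```python
from typing import Any, Callable, Optional


class SketchAgentNPC:
    """Hypothetical stand-in for AgentNPC; not the openrange implementation."""

    def __init__(
        self,
        build_agent: Callable[[], Callable[[str], Any]],
        cadence_ticks: int = 3,
    ) -> None:
        self._build_agent = build_agent
        self._cadence_ticks = cadence_ticks
        self._ticks = 0
        self._agent: Optional[Callable[[str], Any]] = None
        self.broken_reason: Optional[str] = None
        self.build_attempts = 0

    def step(self) -> None:
        self._ticks += 1
        if self.broken_reason is not None:
            return  # a build failed earlier: stay silent and never retry
        if self._ticks % self._cadence_ticks != 0:
            return  # not an acting tick
        if self._agent is None:
            self.build_attempts += 1
            try:
                self._agent = self._build_agent()
            except Exception as exc:
                self.broken_reason = str(exc)  # mark broken once
                return
        try:
            self._agent("What do you do next?")
        except Exception:
            pass  # per-tick invocation errors are swallowed
```

The agent is built once on the first acting tick and reused afterwards, so a missing optional dependency costs one failed build rather than one per tick.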

Episode runtime (src/openrange/core/episode.py, src/openrange/runtime.py):

  • EpisodeService(npc_llm_model=None) constructor kwarg.
  • _start_npcs builds a per-NPC context: NPCs with requires_llm = True additionally receive an llm key (string model id or None); plain NPCs see no llm key and pay nothing.
  • RunConfig.npc_llm_model threads through OpenRangeRun.episode_service().
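
As an illustration of the opt-in context described above, a minimal sketch might look like this. `PlainNPC`, `LLMNPC`, and `build_npc_context` are hypothetical names, not the actual `_start_npcs` code, and the sketch uses the `llm` key as this description states it (a later commit in this PR replaces that key with `agent_backend`).

```python
from typing import Any, ClassVar, Dict, Optional


class PlainNPC:
    requires_llm: ClassVar[bool] = False  # default: no LLM wiring


class LLMNPC(PlainNPC):
    requires_llm: ClassVar[bool] = True  # opt in


def build_npc_context(
    npc: PlainNPC,
    base: Dict[str, Any],
    npc_llm_model: Optional[str],
) -> Dict[str, Any]:
    """Copy the shared context and add an llm key only for opt-in NPCs."""
    context = dict(base)
    if type(npc).requires_llm:
        # May be None when no model is configured on the runtime.
        context["llm"] = npc_llm_model
    return context
```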

Reference NPC (packs/cyber_webapp/cyber_webapp/npcs/curious_employee.py):

  • cyber.curious_employee — wraps interface["http_get"] as a strands tool and acts as a casual internal employee browsing the company webapp. Goes silent if strands-agents isn't installed (the episode is unaffected). Registered via the existing openrange.npcs entry-point group.

Tooling:

  • pyproject.toml — added strands / strands.* / strands_tools.* to mypy ignore_missing_imports so core type-checks cleanly without the optional extra installed.

Tests

All 141 tests pass under uv run --extra strands pytest.

New coverage:

  • requires_llm flag default + opt-in override.
  • AgentNPC: cadence math, agent reuse across acting ticks, model override precedence (constructor wins over runtime), broken-state on build failure (no per-tick retries), invocation-error swallowing, stop() clears the agent reference, default _build_agent raises a clear NPCError when strands is missing.
  • Episode runtime: llm key injected only for opt-in NPCs; llm = None when npc_llm_model isn't configured; plain NPCs see no llm key.
  • Pack: cyber.curious_employee factory (defaults, overrides, bad-config rejection) and entry-point registration.

Test plan

  • uv run ruff check src tests packs — clean
  • uv run mypy src packs/cyber_webapp — no new errors (6 pre-existing on this branch are unrelated dashboard/codegen work)
  • uv run --extra strands pytest -q — 141 passed
  • End-to-end smoke with a real Anthropic key: drop cyber.curious_employee into a manifest and confirm it issues HTTP requests against the live cyber webapp. (Out of scope for CI; requires a live API key.)

Notes

This supersedes the prior phasing of #74 (memory → schedule → reactive → LLM). Memory, scheduling, and reactivity all fall out of "NPCs are agents with tools" — an agent loop with state and tool calls covers them without per-feature primitives in the NPC ABC.

The two existing scripted NPCs (cyber.browsing_user, cyber.admin_audit) remain as the "no LLM, hand-written step()" reference and continue to pass tests unchanged.

🤖 Generated with Claude Code

larstalian and others added 8 commits May 4, 2026 10:36
Closes #74. Lifts NPCs from scripted-loop "smarter cron jobs" to
agent loops with a persona, a tool surface bound over the runtime
backing's interface, and an opt-in LLM. Strands SDK (already an
optional dep) owns the loop, so we don't re-invent tool dispatch,
retries, or streaming.

Core:
* `NPC.requires_llm: ClassVar[bool] = False` -- opt-in flag.
* `AgentNPC(NPC)` base -- system prompt + cadence + abstract
  `_build_tools(interface)`. Lazy strands import; per-tick failures
  are silent; build failures mark the NPC broken so we don't retry
  every tick.
* `EpisodeService(npc_llm_model=...)` injects the model id into the
  per-NPC `start()` context for NPCs that opt in. Plain NPCs see
  no `llm` key and pay nothing.
* `RunConfig.npc_llm_model` threads it through `OpenRangeRun`.

Reference NPC:
* `cyber.curious_employee` -- a ~50 LOC `AgentNPC` subclass. Wraps
  `interface["http_get"]` as a strands `@tool` and acts as a casual
  internal employee browsing the company webapp. Goes silent if
  strands isn't installed; the episode is unaffected.

Tests cover: requires_llm flag default + override, AgentNPC cadence
+ build failures + invocation errors + stop/cleanup, runtime LLM
injection (opt-in vs. not, configured vs. not), and the new pack
NPC's factory + entry-point registration.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…end, not a model id

Lifts the NPC->LLM seam from a strands-shaped model id string to a
provider-neutral ``AgentBackend`` protocol. NPCs no longer reach
into strands directly; they ask their backend to ``build_agent``,
and the backend handles the provider details. Two implementations
ship:

* ``StrandsAgentBackend`` -- canonical, wraps ``strands.Agent``.
  Lazy-imports strands so the optional extra is only required if
  this backend is actually instantiated.
* ``CodexAgentBackend`` -- wraps the existing
  ``openrange.llm.CodexBackend`` for tool-less agent prompts. Same
  Codex binary the builder uses, no strands install needed. Errors
  loudly if handed any tools (Codex's tool surface isn't exposed
  for arbitrary callable injection).

Both backends implement a ``preflight()`` method so the broken-state
machinery surfaces missing deps (strands not installed, codex CLI
not on PATH) at episode start rather than on the first acting tick.

Wire-up:

* ``EpisodeService(npc_agent_backend=...)`` -- pass any
  ``AgentBackend``. The legacy ``npc_llm_model="..."`` string still
  works; it auto-promotes to ``StrandsAgentBackend(model=...)``.
  Passing both is rejected.
* ``RunConfig`` mirrors the same pair of knobs.
* The per-NPC context delivered to ``AgentNPC.start()`` now carries
  ``agent_backend`` (an ``AgentBackend`` instance or ``None``),
  replacing the previous ``llm`` (model id string) key.
* ``AgentNPC(agent_backend=...)`` -- explicit per-NPC override
  always wins over the runtime backend. The strands-only ``model=``
  knob is gone (use ``agent_backend=StrandsAgentBackend(model=...)``).
* ``cyber.curious_employee``: per-NPC ``model: str`` config still
  works as a YAML-friendly convenience (auto-promotes to a per-NPC
  ``StrandsAgentBackend``).
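
The knob resolution described in this list could be sketched as below. ``SketchStrandsBackend``, ``resolve_backend``, and ``pick_backend`` are hypothetical stand-ins for illustration; the real ``EpisodeService`` and ``AgentNPC`` signatures may differ, and ``ValueError`` is an assumed error type.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass(frozen=True)
class SketchStrandsBackend:
    """Hypothetical stand-in for StrandsAgentBackend."""

    model: Optional[str] = None


def resolve_backend(
    npc_agent_backend: Optional[Any] = None,
    npc_llm_model: Optional[str] = None,
) -> Optional[Any]:
    """Reject both knobs at once; promote a bare model id to a backend."""
    if npc_agent_backend is not None and npc_llm_model is not None:
        raise ValueError("pass npc_agent_backend or npc_llm_model, not both")
    if npc_llm_model is not None:
        return SketchStrandsBackend(model=npc_llm_model)
    return npc_agent_backend


def pick_backend(constructor_backend: Optional[Any], runtime_backend: Optional[Any]) -> Optional[Any]:
    """An explicit per-NPC backend wins over the runtime-supplied one."""
    return constructor_backend if constructor_backend is not None else runtime_backend
```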

Top-level ``openrange`` re-exports ``AgentBackend``,
``AgentBackendError``, ``StrandsAgentBackend``, ``CodexAgentBackend``
alongside the existing ``LLMBackend`` family.

Tests cover: both backends' preflight + build paths, Strands
tool-rejection, Codex tool-rejection, AgentNPC with constructor
backend vs. runtime-supplied backend, broken-on-no-backend, broken
on backend preflight failure, EpisodeService backend-vs-model knob
mutual exclusion, and the cyber pack's per-NPC model promotion.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
No other core module declares one — package exports are managed via
``openrange/core/__init__.py``. Removing keeps the convention.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Public-facing protocols pack authors implement against don't belong
under ``openrange.core`` — that namespace is for internal building
blocks (manifest parsing, episode service internals, runtime
backings, snapshot store, graph machinery). Pack authors should
import from ``openrange`` (or top-level submodules), never from
``.core``.

Moves:
* ``openrange.core.agent_backend`` -> ``openrange.agent_backend``
* ``openrange.core.npc`` -> ``openrange.npc``

Both now sit at the same tier as ``openrange.llm`` and
``openrange.runtime`` — public, top-level, owned by users and pack
authors.

Top-level ``openrange`` re-exports the NPC family
(``NPC``, ``AgentNPC``, ``NPCRegistry``, ``NPCError``, ``NPCS``)
alongside the existing ``LLMBackend`` / ``AgentBackend`` exports,
so the typical pack import becomes ``from openrange import NPC,
AgentNPC, ...`` — no submodule reach-through.

The cyber_webapp pack imports were updated to match.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…enrange.core`

Same convention call as the npc + agent_backend relocations:
``openrange.core`` is internal building blocks; pack authors should
import from the top-level ``openrange`` package, where everything
public is already re-exported.

All cyber_webapp imports of ``openrange.core.errors``,
``openrange.core.graph``, ``openrange.core.pack``,
``openrange.core.builder_protocol``, ``openrange.core.builder``, and
``openrange.core.manifest`` are rewritten to ``from openrange import
...``. Functionally identical (those names already lived at the top
level via ``openrange/__init__.py`` re-exports); the diff is just
the import paths.

The cyber pack now has zero ``openrange.core.*`` imports, which
matches the rule: packs talk to ``openrange``, not into its
internals.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pure cosmetic reformat — multi-line function calls collapsed onto one
line where they fit, trailing whitespace removed, no semantic changes.
Pulled out as a standalone commit so the dashboard work that follows
stays diffable.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…projection

`uv run python -m openrange dashboard` (no args) now serves a
tensorboard-style viewer that watches `./or-runs/`, lists every run
that has dashboard artifacts, and lets the operator switch between
them live. The previous CLI required `--run-root <path>` and only
viewed one run at a time.

Architecture (env-owned, pack-agnostic):

  - `dashboard/runs.py` — `RunsRegistry` discovers run subdirs that
    carry `dashboard.events.jsonl` + `dashboard.json`, lazily mints
    a `DashboardView` per run, and caches it. Path-traversal guard
    rejects `?run=../../etc` style ids.
  - `dashboard/server.py` — server holds either a registry (multi-run)
    or a single view (embedded `OpenRangeRun.serve_dashboard()`).
    All per-run routes resolve via `?run=<id>` (falls back to the
    registry's newest). New `GET /api/runs` returns `{runs, default}`.
  - `dashboard/topology.py` — graph -> view projection lives here, in
    the dashboard module (not the cyber pack). Reads standard v1
    ontology nodes/edges directly from `snapshot.world_graph` and
    surfaces services / endpoints / vulns / zones / users. Other
    packs can still ship their own `topology.json` artifact or
    `world.topology` to override.
  - `dashboard/static/` — left-side collapsible "RUNS" drawer with a
    list view (cards per run, active run highlighted in accent),
    "Follow latest" toggle (default on; auto-switches when newer
    runs land), Esc closes drawer. SPA polls `/api/runs` every 5s
    for live discovery.
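
A path-traversal guard of the kind described for `RunsRegistry` might look roughly like this. `resolve_run_dir` is a hypothetical helper, not the actual registry code; only the two artifact file names are taken from the description above.

```python
from pathlib import Path
from typing import Optional

REQUIRED_ARTIFACTS = ("dashboard.events.jsonl", "dashboard.json")


def resolve_run_dir(runs_dir: Path, run_id: str) -> Optional[Path]:
    """Map a ?run=<id> value to a run directory, or None if it is not acceptable."""
    # Reject empty ids, dot entries, and anything containing a path separator.
    if not run_id or run_id in {".", ".."} or "/" in run_id or "\\" in run_id:
        return None
    root = runs_dir.resolve()
    candidate = (root / run_id).resolve()
    # After resolving symlinks the run must still be a direct child of the root.
    if candidate.parent != root:
        return None
    # Only directories that carry both dashboard artifacts count as runs.
    if not all((candidate / name).is_file() for name in REQUIRED_ARTIFACTS):
        return None
    return candidate
```

Resolving before comparing means a symlinked run id that points outside the runs directory is rejected as well.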

CLI:
  - `--runs-dir or-runs/` (default) — multi-run mode
  - `--run-root <dir>` — single-run mode
  - `--snapshot-id <id>` — explicit snapshot-store mode (no implicit
    fallback when `./snapshots/` happens to exist; surprising
    behaviour fixed)
  - Adds `if __name__ == "__main__": main()` at module bottom — the
    bare `python -m openrange dashboard` was previously a silent
    no-op because `main()` was never invoked.

`OpenRangeRun.serve_dashboard()` stays as the embedded primitive
for callers that want an in-process server. Default mode just writes
events to disk; the standalone dashboard process is the canonical
viewer.

example/codex_eval.py prints a hint pointing at the dashboard CLI
after build instead of trying to wire its own server.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Eight cleanups found during the PR review:

1. NPC base-class docstring corrected — was "llm" key in context, is
   actually "agent_backend".
2. EpisodeService._start_npcs comment clarified — manifest-shape
   errors propagate; per-NPC SDK failures are caught and surfaced via
   broken_reason.
3. RunConfig mutual-exclusivity (npc_agent_backend vs npc_llm_model)
   validated at OpenRangeRun.__init__ instead of waiting for
   episode_service() to surface it.
4. _mark_broken: exc_info=exc directly. The previous "fallback to True"
   asked logging.warning to introspect sys.exc_info outside any
   except block — would either grab unrelated in-flight exceptions or
   print "NoneType: None".
5. auto_evolve emits an "auto_evolve_chosen" event before forwarding
   to evolve(), so the dashboard lineage view gets the full
   direction + note + parent narrative instead of seeing two snapshots
   appear back-to-back.
6. topology_from_world_graph docstring is honest about its
   cyber-pack-ontology coupling (was "pack-agnostic", which overstated
   the reusability).
7. Dashboard CSS title positioning consolidated through three custom
   properties (--sim-title-offset, --sim-toggle-width, --sim-title-left).
   No more cascade-override hack between two .sim-title rules.
8. LLMBackend gains a preflight() protocol method (default no-op);
   CodexBackend overrides to check the codex CLI is on PATH;
   CodexAgentBackend now delegates to it instead of skipping the
   check for caller-supplied custom backends. Test updated to match
   the new "every backend self-describes" contract.
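
Item 4 relies on the standard-library behaviour that exc_info accepts an exception instance. A minimal sketch, assuming a module-level logger and a hypothetical mark_broken helper (the real _mark_broken also records state on the NPC):

```python
import logging
from typing import Optional

logger = logging.getLogger("npc_sketch")  # hypothetical logger name


def mark_broken(reason: str, exc: Optional[BaseException] = None) -> None:
    """Log why an NPC went silent, attaching the traceback only when one exists."""
    # Passing the instance keeps logging from consulting sys.exc_info(),
    # which is empty (or unrelated) outside an except block.
    # With exc=None no traceback is attached at all.
    logger.warning("NPC marked broken: %s", reason, exc_info=exc)
```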

177 tests + 2 skipped, ruff clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@larstalian larstalian marked this pull request as ready for review May 4, 2026 18:11
larstalian and others added 14 commits May 4, 2026 13:51
…chatter, meet

Three pieces, all on the path to the physical-simulation vision (#219):
office NPCs walking around a floor and chatting, the dashboard
actually streaming their events live, and CI mypy clean across the
tree.

Core wiring:

* ``NPC.actor_id`` (default ``<ClassName>-<short hash>``; subclasses
  set ``self._actor_id`` for friendly names) plus a per-NPC
  ``record_action`` callable injected into ``start()`` context. NPCs
  use it to publish ``ActorTurn``-shaped events tagged with their
  own actor_id. Errors swallowed — recording is observational.

Reference NPC (cyber.office_chatter):

* Scripted, deterministic via seed. Each acting tick is either
  ``{"speak": phrase}`` (fading bubble in the dashboard) or
  ``{"move": "wandering"}`` (walks to a colleague's desk). Initial
  cooldown staggered by seed so 6 chatters don't fire in lock-step
  on tick 0. ``start()`` emits a ``{"present": True}`` event so
  characters appear at their desks immediately rather than waiting
  out the first cadence window.

Dashboard:

* Speech bubbles (3s hold + 1s fade) above characters on
  ``action.speak`` events, no movement.
* Decoupled office desks from cyber services — 8-desk grid laid out
  on the south side of the floor, NPCs anchored to a stable home
  desk via ``homeDeskFor(actor_id)`` hash. ``move`` events walk
  visitor to a colleague's desk; host fires a one-line reply
  shortly after; visitor returns home 9-13s later.
* Agent / runtime characters no longer render as walking bodies in
  the office — service ring colors flash for HTTP traffic instead.
* Station + desk labels removed; cleaner scenery.
* Fingerprint stability: ``simulationFingerprint`` is now
  topology-only, so every new event no longer triggers a full world
  rebuild that wiped mid-walk characters.
* RUNS toggle moved to a tiny ``›`` chevron in the upper-left
  corner; clock back at the top-right.

Live streaming (the main bug):

* ``DashboardView`` gained a ``tail=True`` kwarg that polls
  ``dashboard.events.jsonl`` every 250ms and pushes new lines into
  the bridge. Single-run mode (``openrange dashboard --run-root``),
  multi-run mode (``RunsRegistry``), and the SSE pipeline all use
  it now. Embedded writer mode (``OpenRangeRun.serve_dashboard``)
  keeps ``tail=False`` so it doesn't re-publish its own writes.
* ``_stored_section`` re-reads ``dashboard.json`` on each request
  in reader-mode so a snapshot the writer lands AFTER the view was
  constructed surfaces immediately — no need to restart the
  dashboard server when an eval starts mid-run.
* ``/api/runs`` synthesizes a non-null default in single-run mode
  (using either the live snapshot or the stored topology id, or
  the literal ``"single"``) so the SPA actually sets ``activeRun``
  and ``openStreams`` opens the SSE connection. Without this fix
  single-run mode showed events frozen at boot — events flowed to
  disk, the tail pushed them into the bridge, but the SPA never
  opened the SSE listener.
* SPA also subscribes to ``builder_step`` events (in addition to
  ``env_turn`` / ``agent_step`` / ``note``) so the empty-state
  placeholder hands off to the live world the moment the builder
  lands ``snapshot_created``, with no env_turn yet to trigger a
  refresh.
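
The core of a tail like the one described for ``DashboardView`` can be sketched as a single polling step. ``read_new_lines`` is a hypothetical helper, not the actual ``_EventLogTail`` code; a poller would call it every 250ms and keep the returned offset between calls.

```python
import os
from typing import List, Tuple


def read_new_lines(path: str, offset: int) -> Tuple[List[str], int]:
    """Return complete lines appended since offset, plus the new offset."""
    size = os.path.getsize(path)
    if size < offset:
        offset = 0  # the file shrank: treat it as truncated and start over
    with open(path, "rb") as handle:
        handle.seek(offset)
        chunk = handle.read()
    end = chunk.rfind(b"\n")
    if end == -1:
        return [], offset  # only a partial trailing line so far
    complete = chunk[: end + 1]
    lines = [raw.decode("utf-8") for raw in complete.splitlines() if raw.strip()]
    # Advance past complete lines only, so a half-written line is re-read later.
    return lines, offset + len(complete)
```

This sketch detects truncation only when the file becomes shorter than the stored offset; a rewrite that ends up longer than the old offset would need an inode or mtime check on top.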

Tests:

* New ``tests/test_dashboard_runs.py`` — tail picks up appended
  events within 2s, preserves history-on-open, ignores partial
  trailing lines, handles truncation, joins cleanly on close.
* ``tests/test_cyber_npcs.py`` extended for ``OfficeChatter``
  factory / cadence / speech / walk / silent-without-recorder /
  presence-on-start / staggered cooldown.
* ``tests/test_v1_episode.py`` — episode runtime injects
  ``record_action`` into NPC context and the resulting events
  surface tagged with the NPC's actor_id.
* CI cleanup: ``tests/test_curriculum.py`` /
  ``tests/test_cyber_auto_curriculum.py`` / runtime.py /
  test_agent_backend.py / test_v1_episode.py mypy fixes (was 26
  errors blocking CI).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… about no-any-return

CI-blocking: src/openrange/agent_backend.py:138 was returning Any
from a function declared to return Callable[[str], Any]. Adding
an AgentSession-typed local satisfies the no-any-return check
without changing behavior.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
User feedback: chatters not visibly moving around. Three knobs:

* OfficeChatter: cadence_ticks 9 -> 6 (act every 4s), walk_probability
  0.3 -> 0.5 (half of acts are walks). Each NPC initiates a visit
  every ~8s on average; across 6 chatters that is one walk every
  ~1.3s globally.
* JS walk speed: 4.2 -> 8.0 units/s plus stronger leg swing. Crossing
  the office takes ~1-2s instead of 3-5s.
* Visit duration: 9-13s -> 4-6s; host reply delay 2.5-4s -> 1.5-2.5s
  so the reply lands while the visitor is actually there.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
User-reported: NPCs stand still and speak way too often.

Root cause: ``OfficeChatter.step`` only emitted ``move`` events when
``self._known_targets`` was non-empty, and that set was only populated
in ``start()`` when a ``home`` config field was passed. The example
manifest never set ``home``, so chatters never had targets, so every
act fell through to the speech branch. Result: zero motion, all
chatter, exactly the symptom the user kept seeing.

Fix: walk vs. speak is now a clean ``random() < walk_probability``
flip. The dashboard picks the destination desk client-side from the
visitor home; the backend does not need to know any service ids to
emit a walk. ``home`` is still honored as the optional ``target``
field on the move event for callers that want to override the
dashboard pick.
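
A minimal sketch of the flip, assuming a seeded ``random.Random`` per NPC. ``choose_action`` is a hypothetical helper rather than the real ``OfficeChatter.step``; the event shapes follow the ``speak`` / ``move`` / ``target`` fields named in these commit messages.

```python
import random
from typing import Dict, Optional


def choose_action(
    rng: random.Random,
    walk_probability: float,
    phrase: str,
    home: Optional[str] = None,
) -> Dict[str, str]:
    """Decide walk vs. speak with one draw; no target list is needed."""
    if rng.random() < walk_probability:
        action = {"move": "wandering"}
        if home is not None:
            action["target"] = home  # optional override of the dashboard pick
        return action
    return {"speak": phrase}
```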

Also tuning: cadence_ticks 6 -> 8, walk_probability 0.5 -> 0.8.
Gives ~1 walk/sec and ~1 speech/4s across the office.

Smoke (6 chatters, 30 ticks each): 20 walks vs 4 speaks. Was 0 walks.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
User-reported: post-build the dashboard never transitions out of the
"no admitted world" empty state, and after a server restart the
backlog dumps in a burst then nothing else updates.

Two robustness changes on the SPA side:

* Debounce SSE event handlers via ``scheduleRefresh`` (150 ms
  coalesce). The backlog burst on reconnect was firing up to 200
  parallel refresh() calls, each doing 6 concurrent API fetches.
  That race can land the model on stale snapshots and pegs the
  server during the burst. Coalescing to one refresh per 150 ms
  keeps the UI accurate without the thundering herd.
* 2s polling refresh as an SSE fallback. SSE is still primary, but
  if the connection ever drops silently (browser disconnect,
  intermediary timeout, etc.) the poll keeps the UI in sync with
  on-disk state — including the post-build snapshot transition.

Backend was already correct end-to-end (verified via curl: tail
picks up appended events, /api/topology returns snapshot_id, SSE
delivers live). This is purely SPA-side robustness.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two related issues found while end-to-end-testing the eval's
auto_evolve loop:

1. ``auto_evolve`` previously called ``evolve()`` on the single
   highest-relevance candidate and let any ``BuildFailed`` propagate.
   A pack proposing a mutation tag that doesn't admit (e.g. the
   cyber pack's ``add`` placing a vuln off the oracle path) crashes
   the whole eval mid-curriculum. Now it walks candidates by
   descending relevance, surfaces each skip via the ``event_sink``
   (``auto_evolve_skipped``), and returns ``None`` only when every
   candidate fails. Eval continues with the previous snapshot.

2. The cyber pack's ``add`` mutation in
   ``cyber_webapp.mutation._add_vulns_by_kind`` placed the new vuln
   on whatever endpoint came first in iteration order, with no
   awareness of the oracle path. When the prior ``patch`` had
   stripped the only vuln on the oracle service, the next ``add``
   would land elsewhere and the resulting graph would fail
   ``OraclePathExistsConstraint``. Added ``_oracle_path_targets``
   that walks ``flag → record → data_store → service → endpoint``,
   and re-ordered the ``add`` candidate list so oracle-path
   endpoints / services come first.
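
The candidate walk in item 1 might be sketched as below. ``auto_evolve_sketch``, the ``(relevance, tag)`` tuple shape, and the event payload fields are assumptions made for illustration; only the ``auto_evolve_skipped`` event name and the ``BuildFailed`` error come from the description above.

```python
from typing import Any, Callable, Dict, List, Optional, Tuple


class BuildFailed(Exception):
    """Stand-in for the BuildFailed error named above."""


def auto_evolve_sketch(
    candidates: List[Tuple[float, str]],
    evolve: Callable[[str], Any],
    event_sink: Callable[[Dict[str, Any]], None],
) -> Optional[Any]:
    """Try candidates by descending relevance; return None only if all fail."""
    for relevance, tag in sorted(candidates, key=lambda item: item[0], reverse=True):
        try:
            return evolve(tag)
        except BuildFailed as exc:
            # Surface the skip instead of crashing the whole eval.
            event_sink(
                {"type": "auto_evolve_skipped", "tag": tag, "relevance": relevance, "reason": str(exc)}
            )
    return None  # caller keeps the previous snapshot
```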

End-to-end verified: a pass→harden→fail→soften→fail→soften→pass→harden
walk now produces 4 distinct snapshots with the world genuinely
mutating each step (vulns added/removed in line with the curriculum
direction), no crashes, no skipped events.

191 tests + 2 skipped, ruff clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
User-reported: hitting Ctrl+C on the eval leaves the cyber webapp
subprocess (uvicorn / generated app.py) reparented to PID 1. After
many interrupted sessions there are 20+ orphan ports bound and
file descriptors leaked.

Three changes:

* ``start_runtime_process`` uses ``start_new_session=True`` so the
  runtime subprocess gets its own session/process group. Without it,
  Ctrl+C in the harness terminal sends SIGINT to every child too —
  some of those (uvicorn, HTTPServer) handle SIGINT via graceful-
  shutdown paths that race with the parent cleanup and leak the
  process when reparented.
* ``stop_process`` SIGTERMs the whole process group when the
  subprocess actually has its own group (different pgid from the
  caller) — catches uvicorn workers, request threads, anything
  spawned downstream. Critically: when the subprocess shares the
  caller pgid (bare Popen, e.g. test fixtures), it falls back to
  process.terminate() instead of killing the whole group, which
  would otherwise SIGTERM the test runner itself.
* ``EpisodeService`` registers an ``atexit`` hook that walks any
  still-running episode artifacts and stops them. Backstop for the
  case where a try/finally somewhere upstream missed cleanup —
  KeyboardInterrupt fired during cleanup, an unrelated exception,
  etc. Idempotent and best-effort.
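
The first two changes can be sketched with the standard library alone (POSIX only). The function names mirror the ones above, but the bodies are an illustrative guess rather than the actual implementation, and the timeout value is an assumption.

```python
import os
import signal
import subprocess
from typing import List


def start_runtime_process(argv: List[str]) -> subprocess.Popen:
    # New session => new process group, so a terminal Ctrl+C is not
    # delivered to the child alongside the parent.
    return subprocess.Popen(argv, start_new_session=True)


def stop_process(process: subprocess.Popen, timeout: float = 5.0) -> None:
    if process.poll() is not None:
        return  # already exited; calling again is harmless
    try:
        pgid = os.getpgid(process.pid)
        if pgid != os.getpgid(0):
            os.killpg(pgid, signal.SIGTERM)  # child owns its group: stop all of it
        else:
            process.terminate()  # shared group: never signal ourselves
    except ProcessLookupError:
        return  # the process vanished between the checks
    try:
        process.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        process.kill()
        process.wait()
```

A backstop in the spirit of the third change could register a cleanup with ``atexit.register`` that calls ``stop_process`` on anything still running.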

Verified end-to-end: starting an eval and sending SIGINT 4s in now
leaves zero orphan app.py processes (was leaking one per
interrupted run).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
User-reported (10th time): events do not stream to the SPA, even
though server-side polling /api/state shows event_count growing
live (29 -> 34 -> 38 -> 43 -> 48 over 10s). The hang must be
somewhere in the browser layer that I cannot see from here, but
the SPA can be made resilient regardless.

Three changes:

* Polling refresh moved to 1s (was 2s). Polling is now the
  *primary* live-update path; SSE is a nice-to-have optimization.
  If SSE drops silently for any reason (browser quirk, proxy
  timeout, connection-pool exhaustion) the 1s poll keeps the UI
  in sync.
* ``safeRefresh`` / ``safeRefreshRuns`` wrap fetch loops so a
  single failed poll does not kill the interval chain.
* Visible freshness indicator: when a refresh has not landed in
  >5s, the subtitle suffixes "· last update Ns ago" so the
  operator can immediately tell whether the SPA is stuck (with
  no need to open dev tools). Cleared back to the bare title
  when fresh.

Server-side was already correct (verified end-to-end via curl
against a live eval). This commit is purely client-side
resilience for the case where the user cannot see why the page
appears frozen.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
User-reported: chatter speech was disconnected one-liners — visitor
says "did the prod deploy go out?", host replies "huh" — that did
not read as a workplace exchange.

Rebuilt around a JS-side ``EXCHANGES`` table of 16 short coherent
opener-and-reply pairs. When an NPC walks to a colleague, the
dashboard:

1. picks one exchange,
2. has the visitor speak the opener ~1.6s after starting the walk
   (right when they arrive at the host desk),
3. has the host speak one of the matching replies ~1.6-2.2s after
   that.

Result: every visit reads as a real exchange — "deploy went out?"
→ "yeah, just now"; "build is red on main" → "ill take a look"
— with sensible diversity across visits.

Backend simplification: ``OfficeChatter`` is now walks-only. It no
longer emits standalone ``speak`` events; speech is entirely
dashboard-orchestrated as part of a visit. Drops the per-NPC
``lines`` config (the example manifest no longer carries 4 phrases
per staff member — every chatter draws from the shared 16-exchange
pool, which is far more diverse than 4 hand-coded mutterings).

Tests updated: walks-only emission, presence event still fires,
cadence still respected, walk_probability=0 means quiet ticks.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Four cleanups for the merge-to-main pass:

* Move the EXCHANGES dialogue corpus from ``dashboard.js`` into
  ``cyber.office_chatter``. The chatter now picks an opener-and-reply
  pair per walk and ships them on the move event; the dashboard just
  reads ``action.opener`` and ``action.reply`` and times the bubbles.
  Scenario content lives in the pack; the renderer stays generic.
* Tighten history-shaped comments. ``.rules`` reserves PR/commit
  context for PRs/commits — comments should explain *why* (subtle
  invariants, non-obvious constraints), not what changed.
* Drop boilerplate docstrings from internal helpers (\_user_prompt,
  \_build_agent, \_invoke_agent) — name + body speak for themselves.
  Keep the one on \_mark_broken because the ``exc_info=exc`` choice
  is non-obvious; collapse from a four-paragraph rationale to two
  lines.
* Tighten _EventLogTail / DashboardView.close docstrings to the
  same standard.

194 tests pass; ruff and mypy clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…enders

OfficeChatter now ships home_index (stable SHA1 of name) and
target_name (chosen colleague) on its events. The dashboard reads
them straight off the event stream instead of recomputing seating
with its own per-name hash and inventing who-visits-whom with a
random pick over desks. SHA1 (not Python's randomized hash) so the
same chatter lands on the same desk across runs.

Drops homeDeskFor and pickColleagueDesk from dashboard.js, plus the
rebuild-time NPC pre-spawn — chatters announce themselves via their
present event in start(), which carries the same home_index.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`_seed_for_state` was mixing the configured seed with `hash(pack.id)`,
but Python's built-in `hash` is randomized per process — different
PYTHONHASHSEED values yielded different graphs for the same manifest,
and a handful of seeds (PYTHONHASHSEED=17, 20, 31 reproduce locally)
generated worlds that failed feasibility, flaking
`test_evolve_adds_new_vulns` in CI. Swap to SHA1 so the same pack id
always derives the same offset, matching the `_stable_home_index`
pattern used in the office-chatter NPC.
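
A minimal sketch of a process-independent derivation, assuming the offset is simply added to the configured seed. How `_seed_for_state` actually mixes the two values is not shown here, and `stable_offset` / `seed_for_state` are hypothetical names.

```python
import hashlib


def stable_offset(text: str, modulus: int) -> int:
    """Derive an integer from text that is identical in every process."""
    digest = hashlib.sha1(text.encode("utf-8")).digest()
    # Unlike the built-in hash(), this does not depend on PYTHONHASHSEED.
    return int.from_bytes(digest[:8], "big") % modulus


def seed_for_state(seed: int, pack_id: str) -> int:
    """Combine the configured seed with a stable per-pack offset (assumed mixing)."""
    return seed + stable_offset(pack_id, 2**32)
```

The same helper works for seat assignment, for example `stable_offset(name, 8)` for an 8-desk grid.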

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The eval scripts are about agent + builder behavior, not the office
demo scenery. Remove the cyber.office_chatter spam (and the shared
_office_demo helper) from codex_eval and strands_eval; users who want
the populated 3D office can add chatters to their own manifest.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
npc.py: the module/class/method docstrings were each restating the
lifecycle, the requires_llm contract, and the AgentBackend injection
flow. Collapse to one statement per concept. Drop the empty "Default:
no-op" docstring on stop(), the empty NPCError docstring, and the
six-line preamble explaining why we don't isinstance-check the runtime
backend.

runtime.py: drop the "validate up front rather than waiting" preamble
that just narrates the next two lines, and tighten the RunConfig field
comments — both already say what's needed without repeating the obvious.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@larstalian larstalian merged commit 1eb1805 into main May 4, 2026
1 check passed
@larstalian larstalian deleted the feat/agent-npcs-strands branch May 4, 2026 23:51