
fix(creator): raise max_history_messages + repair refresh-cache CI#91

Merged
houko merged 3 commits into main from fix/creator-hand-max-history
May 11, 2026

Conversation

@houko
Contributor

@houko houko commented May 11, 2026

Two related fixes shipped together because the second blocks the first from reaching users.

Summary

1. hands/creator/HAND.toml — max_history_messages = 80 on [agents.main]

Creator Hand's async video_generate flow polls video_status every 15-20s until completion (1-3 min typical), consuming 5-15 turns per video request. Combined workflows (video + TTS + music) plus normal back-and-forth cross the kernel default of 40 messages quickly.

User log (real-world observation):

WARN run_agent_loop: Trimming old messages at safe turn boundary
  agent=creator:creator-hand total_messages=41 trimming=2
INFO run_agent_loop: prompt cache metrics for turn
  hit_ratio=0.0 creation=0 read=0

Every turn was hitting the trim cap and invalidating the prompt-cache prefix. 80 covers ~30 polling iterations plus a comfortable pre-context window without runaway memory growth. Other hands keep the default 40.
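The change itself is a one-line override. Shown here in context (the comment text is illustrative; the real file carries its own):

```toml
# hands/creator/HAND.toml
[agents.main]
# Kernel default is 40; async video polling alone can consume 5-15
# turns per request, so raise the cap to keep the prompt-cache prefix
# stable: ~30 polling iterations plus pre-context headroom.
max_history_messages = 80
```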

2. .github/workflows/refresh-cache.yml — open PR instead of git push to main

Branch protection on main started rejecting the workflow's auto-commit with GH006 "Changes must be made through a pull request" — see run 25632824585 on 2026-05-10. Direct push is precisely what the file's own security comment (mitigation #1) warns against ("Compromised maintainer pushes a malicious plugins-index.json directly to main. Mitigation: GitHub branch protection on main requires PR review"), so this PR preserves that gate rather than working around it.

The workflow now:

  • creates a short-lived automation/refresh-indexes-<sha> branch
  • commits the regen there
  • pushes
  • opens a PR back to main via gh pr create

Maintainers see a one-click squash-merge.

Permissions: added pull-requests: write to the existing contents: write so gh pr create is authorised through the default GITHUB_TOKEN.

No loop risk: the post-merge run on the index PR is a no-op (no diff under hands/**, plugins/**, etc. between consecutive states), so no [skip ci] marker is needed.
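A minimal sketch of what the PR-based flow could look like. Step names, the regen command, and the exact script are illustrative, not the actual workflow diff; the branch pattern and permissions come from this PR's description:

```yaml
# Hypothetical sketch; the real step lives in
# .github/workflows/refresh-cache.yml.
permissions:
  contents: write
  pull-requests: write   # required so `gh pr create` works with GITHUB_TOKEN

steps:
  - name: Open PR with regenerated indexes if changed
    env:
      GH_TOKEN: ${{ github.token }}
    run: |
      # No diff means the indexes are already current; exit quietly.
      if git diff --quiet; then
        echo "indexes unchanged, nothing to do"
        exit 0
      fi
      branch="automation/refresh-indexes-${GITHUB_SHA::8}"
      git checkout -b "$branch"
      git commit -am "chore: regenerate plugin/registry indexes"
      git push origin "$branch"
      gh pr create --base main --head "$branch" \
        --title "chore: regenerate indexes" \
        --body "Automated index refresh; squash-merge when green."
```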

Why bundled

Without the workflow fix, the creator-hand max_history_messages change would land on main but plugins-index.json would stay stale (last successful regen: 2026-05-04). Daemons fetching the index wouldn't see the new value. Same blast radius applies to every other content PR currently waiting on main.

Verification

  • python3 -c "import yaml; yaml.safe_load(open('.github/workflows/refresh-cache.yml'))" → OK
  • Diff of hands/creator/HAND.toml: 7-line addition (config + comment), no other changes
  • Workflow diff: only the "Commit regenerated indexes if changed" step (renamed → "Open PR with regenerated indexes if changed") and the permissions: block

Test plan

  • On merge, the next push that touches hands/** etc. triggers refresh-cache; verify it opens a fresh automation/refresh-indexes-<sha> PR instead of failing red with GH006.
  • Auto-opened PR is mergeable and the post-merge run is a no-op.
  • After auto-PR merges, plugins-index.json and registry-index.json reflect the new content, and https://stats.librefang.ai/api/registry/refresh returns 200.

Once this lands and a fresh refresh-cache run succeeds, the user-side fix is for daemons running Creator Hand to pull the regenerated index (or restart). No code change in librefang is required for this half of the PR.

houko added 2 commits May 11, 2026 15:19
Creator Hand's async video_generate path polls video_status every 15-20s
until completion (1-3 min typical), consuming ~5-15 turns per video
request. Combined workflows (video + TTS + music) plus normal back-and-
forth cross the kernel default of 40 messages quickly, which surfaced
in user logs as:

  WARN run_agent_loop: Trimming old messages at safe turn boundary
    agent=creator:creator-hand total_messages=41 trimming=2
  INFO run_agent_loop: prompt cache metrics for turn
    hit_ratio=0.0 creation=0 read=0

Every turn was hitting the trim cap and invalidating the prompt-cache
prefix. 80 covers ~30 polling iterations plus a comfortable pre-context
window without runaway memory growth. Other hands keep the default 40.
Branch protection on `main` started rejecting the workflow's auto-commit
with GH006 "Changes must be made through a pull request" — see run
25632824585 on 2026-05-10 against commit 6785807 (the first push that
hit the tightened protection). Direct push is precisely what the file's
own security comment (#1) warns against ("Compromised maintainer pushes
a malicious plugins-index.json directly to main. Mitigation: GitHub
branch protection on main requires PR review"), so the fix preserves
that gate rather than working around it.

The workflow now creates a short-lived `automation/refresh-indexes-<sha>`
branch, commits the regen there, pushes, and opens a PR back to main
via `gh pr create`. Maintainers see a one-click squash-merge.

Permissions: add `pull-requests: write` to the existing `contents: write`
so `gh pr create` can be authorised through the default GITHUB_TOKEN.

The post-merge run on the index PR is a no-op (no diff under
`hands/**`, `plugins/**`, etc. between consecutive states), so no
`[skip ci]` marker is needed and no loop is possible.

Without this fix, every content PR landing on main leaves
plugins-index.json + registry-index.json stale, blocking new agents and
hands from reaching daemons until a maintainer manually regenerates.

Three hand coordinators have workflows that routinely exceed the kernel
default history cap on a single user turn:

- researcher (max_iterations=80) — deep web_search → web_fetch →
  summarize loops with multi-source synthesis. 80 iterations × ~4
  messages each → 200+ messages per user turn. Set to 120.
- devops    (max_iterations=60) — incident response and CI/CD fan out
  into long shell_exec chains (logs, retries, post-mortems). Set to 80.
- predictor (max_iterations=60) — long reasoning chains accumulating
  signals across many web/knowledge queries, with scheduled re-checks
  referring back. Set to 80.

Creator's existing override is rephrased "raise above the kernel
default" so the comment stays correct regardless of the order this PR
and the upstream kernel-default bump (librefang side) land in.

Other hands (lead/linkedin/reddit/clip/analytics/apitester/browser/
collector/strategist) stay on the kernel default; the upstream bump
covers them.
@houko houko enabled auto-merge (squash) May 11, 2026 06:47
@houko houko disabled auto-merge May 11, 2026 23:52
@houko houko enabled auto-merge (squash) May 11, 2026 23:52
@houko houko disabled auto-merge May 11, 2026 23:55
@houko houko merged commit 651ff1b into main May 11, 2026
3 checks passed
@houko houko deleted the fix/creator-hand-max-history branch May 11, 2026 23:55
houko added a commit that referenced this pull request May 12, 2026
…#87)

All 32 agent manifests and 17 hands shipped with empty mcp_servers /
skills lists, which the kernel interprets as "no filter" — every
globally-configured MCP server's tools and every installed skill get
injected into the prompt on every LLM call. On a typical instance (9
MCP servers, ~85 MCP tools + ~82 built-in tools) that's ~50k input
tokens per turn spent on definitions the agent never uses.
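The ~50k figure is back-of-envelope; a sketch of the arithmetic, assuming a rough average of ~300 tokens per tool definition (the per-tool cost is an assumption here, not a measured value):

```python
# Rough per-turn overhead from injecting every tool definition into
# the prompt. tokens_per_tool is an assumed average, not measured.
mcp_tools = 85       # ~85 MCP tools on a typical 9-server instance
builtin_tools = 82   # ~82 built-in tools
tokens_per_tool = 300  # assumption: JSON schema + description per tool

overhead = (mcp_tools + builtin_tools) * tokens_per_tool
print(overhead)  # on the order of 50k input tokens per LLM call
```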

Changes
-------

32 agents/*/agent.toml:
  - mcp_servers: 1-4 per agent. memory wherever state persists across
    turns; fetch / exa-search / brave-search only where the prompt
    actually calls for web; git / github / filesystem on engineering
    agents; gmail / google-calendar / linear / jira on productivity
    agents whose prompts mention them.
  - skills: per-role allowlist driven by what the system_prompt names
    (e.g. coder → rust/python/typescript/git/shell-scripting; devops-
    lead → docker/kubernetes/terraform/ansible/ci-cd/helm/prometheus/
    sysadmin). Generalists (assistant) and low-surface agents keep
    skills = [].
  - max_history_messages: tiered by workload shape —
      40  hello-world, recipe-assistant, health-tracker, home-automation
          (short conversational agents)
      60  writer, translator, doc-writer, email-assistant, customer-
          support, sales-assistant, recruiter, social-media, personal-
          finance, tutor, travel-planner, meeting-assistant, ops,
          devops-lead, planner (single-turn task agents)
      80  coder, debugger, architect, code-reviewer, test-engineer,
          security-auditor, analyst, data-scientist, academic-
          researcher, researcher, legal-assistant (multi-step, tool-
          heavy work)
      120 assistant, orchestrator (long multi-agent coordination —
          prompt-cache continuity is critical)
    All four tiers sit at or above the rising kernel default (40
    today, rising upstream). Pinning lower would thrash the prompt
    cache; #91 already demonstrated this on the creator hand.
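An illustrative agent.toml fragment following the pattern above. This is a hypothetical coder-style manifest, not a verbatim file from the diff; the exact server and skill names per agent are in the diff itself:

```toml
# Hypothetical example of the allowlist shape for an engineering agent.
mcp_servers = ["memory", "git", "github", "filesystem"]
skills = ["rust", "python", "typescript", "git", "shell-scripting"]
max_history_messages = 80  # multi-step / tool-heavy tier
```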

17 hands/*/HAND.toml:
  - hand-level mcp_servers / skills now declared on every hand, so
    every [agents.*] inside inherits a sensible allowlist.
  - skills_disabled = true on clip and creator (pure media pipelines
    that don't benefit from any skill).
  - devteam: expand existing mcp_servers = ["github"] to include
    memory / git / filesystem; populate skills with the expected
    dev-team expertise (replacing the placeholder skills = []).
  - wiki: replace placeholder mcp_servers = [] with [memory, fetch,
    filesystem]. Hand-level skills stays [] so the existing per-
    agent override (agents.main.skills = ["wiki-librarian"]) keeps
    its meaning.

schema.toml: register mcp_servers / skills / max_history_messages on
the agent field schema so machine consumers (RegistrySchema in
librefang-types) see the new top-level fields.

agents/README.md: update example block to show the allowlists, plus
a max_history_messages override commented out with a prompt-cache
caveat. "Adding a New Agent" checklist now calls out the allowlists.

Why not adopt PR #89's approach
-------------------------------

PR #89 covers similar ground but with three issues this PR avoids:

1. PR #89 uses mcp_servers = ["_none"] as a sentinel for "no MCPs",
   explicitly pending upstream librefang#4808 (mcp_disabled). Shipping
   a magic-string today means coming back later to clean it up after
   the upstream field lands. This PR omits "_none" and uses real
   allowlists where MCPs are needed.
2. PR #89 sets max_history_messages to 8 / 12 / 15 / 20 — at or
   below the kernel's MIN_HISTORY_MESSAGES floor (4) and far below
   today's default (40). Every turn that hits the cap invalidates
   the cached prompt prefix, costing more than the smaller history
   saves. The values in this PR (40-120) align with #91's direction
   for long-workflow hands.
3. PR #89 also doubles several max_llm_tokens_per_hour caps (coder
   200k→500k, assistant 300k→500k). That widens the per-agent
   budget — the opposite direction from #87's "reduce per-call cost"
   goal — and is left to the operator's instance-specific tuning.

Refs #87, #89
houko added a commit that referenced this pull request May 12, 2026
…#87)

All 32 agent manifests and 17 hands shipped with empty mcp_servers /
skills lists, which the kernel interprets as "no filter" — every
globally-configured MCP server's tools and every installed skill get
injected into the prompt on every LLM call. On a typical instance (9
MCP servers, ~85 MCP tools + ~82 built-in tools) that's ~50k input
tokens per turn spent on definitions the agent never uses.

Changes
-------

32 agents/*/agent.toml:
  - mcp_servers: 1-4 per agent. memory wherever state persists across
    turns; fetch / exa-search / brave-search only where the prompt
    actually calls for web; git / github / filesystem on engineering
    agents; gmail / google-calendar / linear / jira on productivity
    agents whose prompts mention them.
  - skills: per-role allowlist driven by what the system_prompt names
    (e.g. coder → rust/python/typescript/git/shell-scripting; devops-
    lead → docker/kubernetes/terraform/ansible/ci-cd/helm/prometheus/
    sysadmin). Generalists (assistant) keep skills = [] (see "Open
    items" below).
  - skills_disabled = true on the four short-conversational agents
    (hello-world, recipe-assistant, health-tracker, home-automation).
    Their system prompts never instruct the LLM to consult any skill,
    so loading all 60 was pure waste. They also drop the explicit
    max_history_messages override and inherit the kernel default (60).
  - max_history_messages tiered by workload shape:
      60  short conversational (hello-world, recipe, health-tracker,
          home-automation) — inherits the rising kernel default
          (`DEFAULT_MAX_HISTORY_MESSAGES = 60`); no override needed.
      60  single-turn task agents (writer, translator, doc-writer,
          email-assistant, customer-support, sales-assistant, recruit-
          er, social-media, personal-finance, tutor, travel-planner,
          meeting-assistant, ops, devops-lead, planner) — explicit
          override at the same value to lock the cap if the kernel
          default moves again.
      80  multi-step / tool-heavy (coder, debugger, architect, code-
          reviewer, test-engineer, security-auditor, analyst, data-
          scientist, academic-researcher, researcher, legal-assistant)
      120 coordinators (assistant, orchestrator) — long multi-agent
          sessions where prompt-cache continuity is critical
    All values sit at or above the kernel default. Pinning lower
    would thrash the prompt cache (the failure mode #91 fixed for
    the creator hand by *raising* the cap, not lowering it).

17 hands/*/HAND.toml:
  - hand-level mcp_servers / skills now declared on every hand, so
    every [agents.*] inside inherits a sensible allowlist.
  - skills_disabled = true placed on each [agents.*] inside clip and
    creator (pure media pipelines that don't benefit from any skill).
    HandDefinitionRaw in librefang-hands does NOT have a top-level
    skills_disabled field — declaring it at the hand top level would
    be silently dropped by serde, so the setting must live on the
    AgentManifest of each sub-agent role.
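Because HandDefinitionRaw has no top-level skills_disabled, the flag has to repeat per sub-agent; a sketch (role names are illustrative):

```toml
# hands/creator/HAND.toml — sketch; role names are illustrative.
# A top-level `skills_disabled = true` would be silently dropped by
# serde, so the flag lives on each [agents.*] table instead.
[agents.main]
skills_disabled = true

[agents.editor]
skills_disabled = true
```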
  - devteam: expand existing mcp_servers = ["github"] to include
    memory / git / filesystem; populate skills with the expected
    dev-team expertise (replacing the placeholder skills = []).
  - wiki: replace placeholder mcp_servers = [] with [memory, fetch,
    filesystem]. Hand-level skills stays [].
  - lead: hand-level skills was originally [email-writer, writing-
    coach, interview-prep]; interview-prep is for job-interview
    preparation, not lead generation. Replaced with data-analyst
    (used by the qualification-scoring step in the prompt).

schema.toml: register mcp_servers / skills / max_history_messages on
the agent field schema so machine consumers (RegistrySchema in
librefang-types) see the new top-level fields. The
max_history_messages description now points at
librefang_runtime::agent_loop::DEFAULT_MAX_HISTORY_MESSAGES (60
today) by name, so the schema doesn't go stale when the constant
moves again.

agents/README.md: example block + "Adding a New Agent" checklist
mention the allowlists; max_history_messages example is shown
commented out with a prompt-cache caveat.

Open items
----------

`assistant` (the default user-facing agent) keeps `skills = []`
deliberately. It is the generalist entry point — capping its skill
surface at a small allowlist would defeat its "delegate to any
specialist" job. The trade-off is that this single agent still pays
the full skill-definition load on every turn; operators who want a
strict allowlist for `assistant` can override it after install.

Why not adopt PR #89's approach
-------------------------------

#89 covers similar ground but with three issues this PR avoids:

1. mcp_servers = ["_none"] sentinel. #89's body explicitly notes
   it's pending upstream librefang#4808 (mcp_disabled). Shipping a
   magic-string today means coming back later to clean it up. This
   PR uses real allowlists.
2. max_history_messages = 8 / 12 / 15 / 20. Far below today's
   kernel default (60) and #91's direction for long-workflow hands
   (80–120). Every turn that hits the cap invalidates the cached
   prompt prefix; the cost of cache misses exceeds the saving from
   shorter history. This PR uses 60–120.
3. Doubling max_llm_tokens_per_hour (coder 200k→500k, assistant
   300k→500k) widens the per-agent budget — the opposite direction
   from #87's "reduce per-call cost" goal. Left to the operator's
   instance-specific tuning.

Refs #87, #89
houko added a commit that referenced this pull request May 12, 2026
…#87) (#92)
