See also: Parent: docs/README.md, architecture.md · Axis: the outer loop moves θ (prompt artifacts); its y-axis counterpart, which refines a single run's deliverable, is refinement.md · Loss function: evaluation.md + critique.md + gate rejections from runtime.md, weighted per `LossBreakdown` in this doc · Never active inside `awp run`: entered only via `awp optimize` / `awp optimize-rollback` / `awp optimize-inspect` · Autonomy mapping: compliance.md (A5 sits outside the 7 layers of layer-model.md) · Related: iterative-optimization.md (inner-run feedback loop, again an orthogonal axis)
The outer loop treats AWP runs as a training problem. Prompt artifacts are the model parameters, a completed run is one forward pass, a deterministic scalar loss quantifies how well the run went, and an LLM-as-optimizer acts as a text-gradient to nudge the artifacts between runs. Over many runs across a suite of tasks, the artifacts that drive manager planning, worker pitfalls, critique rubric, and pattern priming learn from experience — in the stochastic-gradient-descent sense of that phrase.
Status: A5 is experimental. It ships fully wired in the runtime and the UI, but is opt-in at the CLI (`--with-textgrad`) and deliberately outside the compliance spec (§2.5 of `spec/versions/1.0/compliance.md`). The autonomy levels A0–A4 continue to describe how one run behaves; A5 describes how the system learns across runs.
Five intuitions that make the rest of this document click into place.
AWP's inner delegation loop (A2–A4) is search under uncertainty over one task. The manager explores subtask decompositions, the workers try code, the critique gate rejects bad completions, repair spawns fresh attempts. Every run starts cold and ends when the task is done or the budget is hit. The inner loop has no memory of previous tasks — each run is a one-shot episode with its own random seed.
The outer loop adds the missing axis: it runs the inner loop many times across many tasks, observes a scalar per run, and edits the prompts that bias the inner loop's search. It is not doing any work on the task itself. It is tuning the priors — the worker pitfalls, the manager planning preamble, the critique rubric — that all future inner loops will inherit. This is what "learning" means in an SGD sense: the parameters change shape so the next forward pass is, on average, better.
Key consequence: the outer loop does not attempt to improve the current run. A run is already complete by the time its loss is computed. The only thing the outer loop can change is the starting condition of future runs.
The central abstraction is:
```
run(task, θ) = inner_loop(task, prompts_resolved_through(θ))
L = loss(run_completion.json, metrics.jsonl)
```
Think of a run the way you would think of a forward pass through a deep network: a deterministic (up to LLM sampling noise) mapping from (input, parameters) to output, and a scalar loss on top.
- input `task`: a string describing what to build.
- parameters `θ`: the six prompt artifacts (§4). Each is a chunk of text, versioned in the registry.
- output: the run's artifacts (`run_completion.json`, deliverables, events) plus the derived scalar `L ∈ [0, 1]`.
The function is black-box (no derivatives available), noisy (LLMs sample at temperatures > 0), and expensive per evaluation (one call ≈ 10⁴–10⁶ tokens). Those three properties alone tell you what kind of optimizer is viable: something that tolerates noise, reuses evaluations, and updates sparingly.
In classical SGD the gradient ∂L/∂θ is computed analytically via the chain rule. Here there is no chain rule: we have a black-box L and a discrete θ that lives in a textual space.
What TextGrad / OPRO / DSPy realised — and what the AWP optimizer implements — is that an LLM can serve as a finite-difference oracle over text. Given
- the current artifact content (current θ),
- the aggregated defect / failure summary from the recent epoch (∇L evidence: which of the declared deliverables were missing?, which critique defects recurred?, which subtasks hit repeated gate rejections?),
- a learning rate η,
the LLM is asked to produce new artifact content that it expects would
reduce the loss, together with a self-reported `expected_loss_reduction`
and `confidence`. The optimizer ranks candidates by that numeric
self-report. It is not a calibrated gradient in the mathematical
sense (the LLM cannot compute one). It is a rough, biased, but cheap
directional hint, and a direction that is right more often than wrong
is the property an SGD-style loop relies on to improve on average.
The concrete optimizer prompt and parsing rules are in §7; here it is enough to hold in mind that "gradient" means "structured hint from an LLM that consumes the current artifact, evidence, and η, and returns a candidate update."
A concrete numerical example that runs end-to-end in your head.
Start of epoch 1. All six artifacts at v0 (default). The suite has
three tasks. parent_artifacts = {worker_pitfalls: 0, manager_planning_preamble: 0, …}.
Inner runs (epoch 1). Three runs produce losses L = [0.40, 0.53, 0.31]. mean_loss_1 = 0.413. The epochs row is written with
mean_loss = 0.413 and the all-zeros parent_artifacts_json.
Text-gradient (epoch 1 → epoch 2). The optimizer is called with the
three candidate names ["worker_pitfalls", "manager_planning_preamble", "critique_rubric"] and η=0.5. Three LLM calls later, the replies are:
| candidate | expected_loss_reduction | confidence | product |
|---|---|---|---|
| worker_pitfalls | 0.18 | 0.55 | 0.099 |
| manager_planning_preamble | 0.32 | 0.68 | 0.218 |
| critique_rubric | 0.12 | 0.60 | 0.072 |
Winner: manager_planning_preamble (argmax of the product). A new row is
written to artifact_versions as (manager_planning_preamble, v1, <new text>, parent_version=0, epoch_id=…, is_active=1). The v0 row's
is_active is flipped to 0. child_artifacts becomes
{manager_planning_preamble: 1, others: 0}.
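The ranking step above can be reproduced in a few lines. This is a minimal sketch using the three replies from the table; the `Proposal` class and `pick_winner` helper are hypothetical names for illustration, not the runtime's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Proposal:
    """Hypothetical container for one optimizer reply."""
    artifact_name: str
    expected_loss_reduction: float
    confidence: float

    @property
    def score(self) -> float:
        # Ranking key from the worked example: reduction x confidence.
        return self.expected_loss_reduction * self.confidence


def pick_winner(proposals):
    """Return the highest-scoring proposal, or None if none survived."""
    return max(proposals, key=lambda p: p.score, default=None)


proposals = [
    Proposal("worker_pitfalls", 0.18, 0.55),
    Proposal("manager_planning_preamble", 0.32, 0.68),
    Proposal("critique_rubric", 0.12, 0.60),
]
winner = pick_winner(proposals)
print(winner.artifact_name, round(winner.score, 3))
# prints: manager_planning_preamble 0.218
```

Returning `None` for an empty list mirrors the documented behaviour that an epoch with no surviving candidate is a no-op.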
Inner runs (epoch 2). Every manager call now resolves
manager_planning_preamble → v1 content. Runs produce
L = [0.045, 0.62, 0.365]. mean_loss_2 = 0.343.
Regression check (epoch 2 → epoch 3). 0.343 < 0.413 → no regression.
The next epoch may propose another update at η=0.5.
Counterfactual. Had mean_loss_2 come out 0.48 instead, the regression
check would fire: rollback_to("manager_planning_preamble", 0) flips
active back to v0, η halves to 0.25, the optimizer is not called for
epoch 3 (the current epoch's probe already cost three runs). Epoch 3 simply
re-measures the baseline with η remembered for the next non-regressive
step.
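The regression check and the halving of η can be sketched as a pure function. The name `regression_step` and its return shape are assumptions made for this illustration; the actual runner also flips the registry pointer and records an event.

```python
def regression_step(mean_loss_prev, mean_loss_curr, learning_rate,
                    rollback_on_regression=True):
    """Return (rolled_back, next_learning_rate) for one epoch boundary.

    mean_loss_prev is None for the first epoch, which can never regress.
    """
    regressed = mean_loss_prev is not None and mean_loss_curr > mean_loss_prev
    if regressed and rollback_on_regression:
        # Step made things worse: undo it and take a smaller step next time.
        return True, learning_rate / 2
    return False, learning_rate


print(regression_step(0.413, 0.343, 0.5))  # observed case: (False, 0.5)
print(regression_step(0.413, 0.48, 0.5))   # counterfactual: (True, 0.25)
```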
Holds cleanly:
- Batch × epoch × learning-rate structure carries over one-to-one.
- Monotone-in-expectation decrease of the training signal (§13 shows an empirical 17 % drop after one non-trivial step).
- Regularisation via sparsity (one artifact per epoch — analogue to L0 regularisation) and via rollback (analogue to line search / trust region).
Strained but workable:
- The parameter space is discrete and textual, not a continuous vector. SGD's convergence proofs do not apply directly; what we actually have is an MCMC-style acceptance loop with an LLM-proposed move kernel.
- The "gradient" is a single-step noisy hint, not derived from a smooth loss landscape. Variance across candidates is real (§13's per-task numbers swing from −89 % to +18 %).
- The minibatch is tiny (3–10 tasks). Variance reduction that comes for free in classical SGD at batch sizes of thousands must be replaced here by rollback + small η.
Breaks:
- There is no chain rule: updates are atomic per artifact, not a gradient through a computational graph. You cannot blame a specific token in `worker_pitfalls` for a specific defect in a specific worker's output.
- There is no guarantee of differentiability, Lipschitz continuity, or even monotonicity: a proposed artifact rewrite can make losses worse on some tasks while improving others. The mean-loss criterion is the only thing that makes this tolerable.
- LLM sampling noise contaminates both the forward pass (the run's own randomness) and the gradient estimate (the optimizer's sampling). This is why `rollback_on_regression=True` is the default: without it the optimizer can chase its own noise.
The practical upshot: treat A5 as stochastic search with an LLM-biased proposal distribution, not as a proof-of-convergence optimizer. The SGD frame is the right pedagogical picture because it tells you what to tune (η, batch size, sparsity), but the implementation is closer in spirit to Bayesian optimisation with a learned proposal than to actual gradient descent.
| Element | Classical SGD | AWP Outer Loop |
|---|---|---|
| Parameters θ | Weight vector | Six versioned prompt artifacts (§4) |
| Batch | x_b from data distribution | TaskSuite of N tasks (typically 3–10) |
| Forward pass | y = f(x_b; θ) | AgentWorkflow.run(task, artifacts) → run_completion.json |
| Loss L | ‖y − y*‖² etc., continuous | Deterministic scalar ∈ [0, 1] from eval, critique, gate, budget, status (§6) |
| Gradient ∂L/∂θ | Chain rule through the network | LLM call that proposes a full rewrite of exactly one artifact |
| Update step | θ ← θ − η · grad | registry.put_version(...) + set_active(...) (a discrete hop) |
| Learning rate η | Step size | learning_rate ∈ [0, 1] passed into the optimizer prompt |
| Epoch | Full pass over training set | Full pass over all tasks in the suite |
| Regularization | Weight decay, dropout | One artifact updated per epoch + rollback-on-regression |
| Early stopping | Validation loss plateau | mean_loss regresses → rollback + η halves |
The analogy is intentional, not decorative. The outer loop is literally
a gradient-free optimizer over a discrete, textual parameter space, using
an LLM as the ∂L/∂θ black-box — akin to TextGrad / DSPy / OPRO. The
discrete step size is managed by the learning_rate parameter that the
LLM prompt consumes: at η=1 the optimizer may rewrite the whole artifact;
at η=0.2 it is instructed to change only a narrow section.
AWP's inner loop is a rejection-sampling process on top of a single task: plan → delegate → repair → complete. The inner loop correctly treats each run as a one-shot search problem. What the inner loop cannot do is learn across runs — every run starts from the same hard-coded prompt library. Every time a manager made the same planning mistake, every time a worker missed the same pitfall, every time critique caught the same defect class, that signal was discarded the moment the run ended.
The outer loop closes that gap. It harvests the signal that is already there (eval, critique, gates, budget envelope), reduces it to a scalar, and uses an LLM to propose a concrete textual delta to the artifact most likely to be responsible. Over 5–10 epochs on a moderately diverse suite, this compounds: the system gets better at its own meta-tasks — planning, pitfall awareness, rubric calibration — without any human editing the prompt library.
| # | Artifact name | What it controls |
|---|---|---|
| 1 | worker_pitfalls | Hard-won pitfalls injected into every worker system prompt |
| 2 | manager_planning_preamble | Preamble shown to the manager before the PLAN decision |
| 3 | experiment_context_hint_template | Cross-run "Previous Runs" context section |
| 4 | pattern_library | Header framing for the pattern registry rendered into prompts |
| 5 | tool_description_templates | Boilerplate around induced tools in worker prompts |
| 6 | critique_rubric | Rubric text shown to the critique LLM |
All six live under
packages/awp-runtime/src/awp/outer_loop/defaults/. The runtime fetches
each one through ArtifactRegistry.get_active(name).content — the default
implementation falls back to the hard-coded v0 string when the outer-loop
DB is absent, so turning A5 off is a no-op (A1 invariant).
Version 0 is synthetic: it is never written to the DB, cannot be deleted,
and is the guaranteed floor the registry can always fall back to. Versions
1, 2, … are rows in the artifact_versions table of
~/.awp/outer_loop.db (path overridable via $AWP_OUTER_LOOP_DB). At any
moment, at most one version of each artifact has is_active = 1.
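To make the versioning rules concrete, here is a toy in-memory registry that follows the same invariants: a synthetic v0 that is never stored, stored versions starting at 1, and at most one active version per artifact. `ToyRegistry` is a hypothetical stand-in; the real `ArtifactRegistry` is SQLite-backed and its `put_version` takes additional arguments (`parent_version`, `epoch_id`) that are omitted here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ArtifactVersion:
    name: str
    version: int
    content: str


class ToyRegistry:
    """In-memory illustration only; not the runtime's ArtifactRegistry."""

    def __init__(self, defaults):
        self._defaults = dict(defaults)  # synthetic v0, never stored as a row
        self._rows = {}                  # (name, version >= 1) -> content
        self._active = {}                # name -> active version; absent means v0

    def get_active(self, name):
        version = self._active.get(name, 0)
        if version == 0:
            return ArtifactVersion(name, 0, self._defaults[name])
        return ArtifactVersion(name, version, self._rows[(name, version)])

    def put_version(self, name, content):
        # Simplified signature (assumption): next free version number.
        version = 1 + max((v for (n, v) in self._rows if n == name), default=0)
        self._rows[(name, version)] = content
        return version

    def set_active(self, name, version):
        if version != 0 and (name, version) not in self._rows:
            raise KeyError(f"{name} v{version} does not exist")
        self._active[name] = version

    def rollback_to(self, name, version):
        # A rollback is a pointer flip; stored content is never deleted.
        self.set_active(name, version)


reg = ToyRegistry({"worker_pitfalls": "default pitfalls"})
v1 = reg.put_version("worker_pitfalls", "revised pitfalls")
reg.set_active("worker_pitfalls", v1)
reg.rollback_to("worker_pitfalls", 0)
print(reg.get_active("worker_pitfalls").content)  # prints: default pitfalls
```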
Concretely, `SuiteRunner.optimize(suite, n_epochs, learning_rate, optimizer, rollback_on_regression)` walks the pipeline as follows:

1. Read parent state. Snapshot the currently active version of each of the six artifacts → `parent_artifacts: dict[str, int]`. Insert a row in `epochs` with `started_at`, `parent_artifacts_json`, `mean_loss = NULL`.
2. Run the inner loop N_tasks times. For each task in the suite, invoke `AgentWorkflow.run(task, budget=task.budget)`, a full AWP A2+ delegation-loop run with critique + eval + metrics emission. The run produces `run_completion.json`, `events.jsonl`, and `metrics.jsonl`.
3. Compute per-task loss. `compute_run_loss(run_dir)` reads the run directory and returns a `LossBreakdown(total, eval_component, …)`. Insert one row in `epoch_runs` with `loss` and `scores_json`.
4. Aggregate. `mean_loss_e = mean(per_task_losses)`. Update the `epochs` row with `completed_at` + `mean_loss`.
5. Regression check. If `e > 1` and `mean_loss_e > mean_loss_prev` and `rollback_on_regression=True`: pop the last update off the artifact stack (`registry.rollback_to(name, parent_version)`), halve the learning rate, record a `rollback` event in `child_artifacts_json`, and skip the optimizer call for this epoch. The next epoch is then a fresh probe with a smaller step.
6. Text-gradient (non-regressive path). Call `TextGradOptimizer.propose_update(epoch_result, candidate_artifacts, learning_rate)`. For each candidate (§7) the optimizer emits one LLM call; the winner is the proposal with the highest `expected_loss_reduction × confidence`.
7. Apply. `registry.put_version(name, new_content, parent_version, epoch_id)` writes the new version; `registry.set_active(name, new_ver)` flips the active flag. The update is recorded in `epochs.child_artifacts_json.events` with the full metadata (rationale, expected_loss_reduction, confidence, learning_rate).
8. Next epoch. `mean_loss_prev = mean_loss_e`; repeat from step 1 until `n_epochs` is exhausted or the user interrupts the CLI with `Ctrl-C`.
awp optimize (without --with-textgrad) stops at step 4 and iterates: it
runs the suite N times against unchanged artifacts, so the user can observe
the natural variance of the inner loop before turning on the optimizer.
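The control flow of the pipeline can be condensed into a toy loop. This is a sketch under stated assumptions, not the `SuiteRunner` implementation: every callable (`run_task`, `propose`, `apply_update`, `rollback`) is an injected, hypothetical stand-in, persistence is omitted, and the guard for "nothing to roll back" is a choice made for this sketch.

```python
from statistics import mean


def optimize(tasks, run_task, propose, apply_update, rollback,
             n_epochs, learning_rate, rollback_on_regression=True):
    """Toy epoch loop returning (epoch, mean_loss, event, learning_rate) rows."""
    history = []
    mean_prev = None
    last_update = None  # handle returned by apply_update, needed for rollback
    for epoch in range(1, n_epochs + 1):
        losses = [run_task(task) for task in tasks]       # steps 2-3
        mean_loss = mean(losses)                          # step 4
        regressed = mean_prev is not None and mean_loss > mean_prev
        if regressed and rollback_on_regression:          # step 5
            event = "none"
            if last_update is not None:  # assumption: roll back only a real update
                rollback(last_update)
                last_update = None
                event = "rollback"
            learning_rate /= 2           # optimizer call is skipped this epoch
        else:
            proposal = propose(losses, learning_rate)     # step 6
            if proposal is not None:
                last_update = apply_update(proposal)      # step 7
                event = "update"
            else:
                last_update = None
                event = "none"
        history.append((epoch, mean_loss, event, learning_rate))
        mean_prev = mean_loss                             # step 8
    return history
```

Passing a `propose` that always returns `None` gives the baseline behaviour described above: the suite is re-run against unchanged artifacts and only the loss history is recorded.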
```
L = w_e · (1 − eval_score)
  + w_c · (1 − critique_score)
  + w_g · min(1, gate_rejection_count / max_rejections)
  + w_b · max(0, 1 − budget_remaining_pct / 100)
  + w_s · status_penalty      (complete=0, partial=0.5, failed=1, aborted=1)
```
Default weights (w_e, w_c, w_g, w_b, w_s) = (0.4, 0.3, 0.15, 0.05, 0.1) sum
to 1.0, so L ∈ [0, 1]. Each per-component value is clamped into [0, 1]
independently. Missing signals default to a neutral 0.5 — the loss is
always defined, even on a partial artifact set.
Implementation: packages/awp-runtime/src/awp/outer_loop/loss.py
(LossWeights, LossBreakdown, compute_run_loss).
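A minimal sketch of the formula, assuming the inputs have already been read from the run directory. `toy_run_loss` and `Weights` are hypothetical names, `max_rejections=3` is an assumed default (the document does not state the value), and treating an unknown status as the neutral 0.5 is likewise an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Weights:
    """Default weights from the text; they sum to 1.0."""
    w_e: float = 0.4
    w_c: float = 0.3
    w_g: float = 0.15
    w_b: float = 0.05
    w_s: float = 0.1


STATUS_PENALTY = {"complete": 0.0, "partial": 0.5, "failed": 1.0, "aborted": 1.0}
NEUTRAL = 0.5  # substituted for any missing signal


def clamp01(x):
    return max(0.0, min(1.0, x))


def toy_run_loss(eval_score=None, critique_score=None,
                 gate_rejection_count=None, budget_remaining_pct=None,
                 status=None, weights=Weights(), max_rejections=3):
    """Weighted sum of five clamped components; always defined."""
    e = NEUTRAL if eval_score is None else clamp01(1 - eval_score)
    c = NEUTRAL if critique_score is None else clamp01(1 - critique_score)
    g = (NEUTRAL if gate_rejection_count is None
         else clamp01(gate_rejection_count / max_rejections))
    b = (NEUTRAL if budget_remaining_pct is None
         else clamp01(1 - budget_remaining_pct / 100))
    s = STATUS_PENALTY.get(status, NEUTRAL)  # unknown status -> neutral (assumption)
    return (weights.w_e * e + weights.w_c * c + weights.w_g * g
            + weights.w_b * b + weights.w_s * s)


print(toy_run_loss(1.0, 1.0, 0, 100, "complete"))  # perfect run, prints: 0.0
```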
Why five components:
- `eval` and `critique` carry the domain-specific quality signal.
- `gate_rejection_count` penalises completion attempts the deterministic gate chain had to reject, which is the dominant failure mode the optimizer can directly attack by rewriting `manager_planning_preamble` or `critique_rubric`.
- `budget_remaining_pct` lightly rewards runs that finish under budget, so the optimizer gets a second-order push toward more efficient plans.
- `status_penalty` ensures `failed` and `aborted` runs cannot hide behind partial good signals.
`packages/awp-runtime/src/awp/outer_loop/textgrad.py`:

- System prompt (`_OPTIMIZER_SYSTEM_PROMPT`) is hard-coded. It is not a learnable artifact by design; making it one would need a second-order stabiliser.
- For each `candidate_name` in the candidate list, the optimizer emits one `chat_text` call with:
  - the current artifact content (`registry.get_active(name).content`),
  - the aggregated defect summary from `EpochResult.task_results[*].scores` and (if available) critique output,
  - the explicit `learning_rate` (copied verbatim into the prompt so the LLM can scale its proposal accordingly).
- The reply is parsed as strict JSON with a lenient fallback (markdown fences are unwrapped; brace-matching recovers from trailing prose). Expected fields:

  ```json
  {
    "artifact_name": "<one of the candidates, or null>",
    "proposed_content": "<full new artifact content>",
    "rationale": "<1–2 sentences>",
    "expected_loss_reduction": 0.0,
    "confidence": 0.0
  }
  ```
- Proposal rejection. A candidate reply is dropped from the ranking if any of these hold: `artifact_name` is null or not in the candidate set, `proposed_content` is identical to the current content, content length exceeds 20 000 characters, the JSON fails to parse after the fallback, or the LLM call raised an exception.
- Winner selection. `argmax(expected_loss_reduction × confidence)`. If no candidate survives, `propose_update` returns `None` and the outer loop treats the epoch as a no-op.
This design keeps the optimizer sparse: at most one artifact is updated per epoch. Sparse updates are easier to diagnose, trivial to roll back, and deliberately slower-converging — which is the right trade-off when each epoch costs multiple LLM-heavy runs.
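The lenient parsing and rejection rules described above might look roughly like the following. This is an illustrative sketch, not the code in `textgrad.py`: `extract_json_object` and `accept_proposal` are hypothetical names, and the exact fallback order is an assumption.

```python
import json
import re

MAX_CONTENT_CHARS = 20_000
FENCE = "`" * 3  # markdown code fence, built indirectly to keep this block fence-safe
_FENCED = re.compile(re.escape(FENCE) + r"(?:json)?\s*(.*?)" + re.escape(FENCE),
                     re.DOTALL)


def extract_json_object(reply):
    """Strict parse first, then unwrap a fence, then brace-match one object."""
    text = reply.strip()
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        pass
    fenced = _FENCED.search(text)
    if fenced:
        text = fenced.group(1).strip()
    start = text.find("{")
    if start == -1:
        return None
    depth, in_string, escaped = 0, False, False
    for i in range(start, len(text)):
        ch = text[i]
        if in_string:  # braces inside JSON strings must not count
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                try:
                    return json.loads(text[start:i + 1])
                except json.JSONDecodeError:
                    return None
    return None


def accept_proposal(reply, candidates, current_content):
    """Return the parsed proposal dict, or None if a rejection rule applies."""
    data = extract_json_object(reply)
    if not isinstance(data, dict):
        return None
    name = data.get("artifact_name")
    content = data.get("proposed_content")
    if name not in candidates or not isinstance(content, str):
        return None
    if content == current_content[name] or len(content) > MAX_CONTENT_CHARS:
        return None
    return data
```

Tracking string state during brace matching matters because proposed artifact text can itself contain `{` and `}` characters.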
At every epoch boundary the outer loop makes exactly one of three moves
per artifact: leave unchanged, promote to a new version, or (if a
regression is detected) roll back to the parent version. The
artifact_versions table keeps the full (name, version, content, parent_version, epoch_id, is_active) history, so every rollback is a
deterministic pointer flip — no content is ever deleted.
ArtifactRegistry.rollback_to(name, version) is the public surface. The
runner uses it automatically when rollback_on_regression=True (the
default) and mean_loss_e > mean_loss_prev. The CLI exposes it via
awp optimize-rollback ARTIFACT_NAME VERSION for manual intervention.
After every rollback, the runner halves the next epoch's learning rate. This is the outer-loop analogue to SGD's line search: when a step made things worse, step more cautiously next time.
```yaml
name: research_writeup_v1          # required, unique per suite
description: "Short summary"       # optional
baseline_artifacts:                # optional, pinned at suite start
  worker_pitfalls: 0               # 0 = default, n = specific version
  manager_planning_preamble: 0
tasks:
  - name: summarise_paper          # required, stored in epoch_runs.task_name
    task: "Write a 200-word summary..."
    workflow: path/to/workflow.awp.yaml   # optional, uses a default if omitted
    model: "openai/gpt-5-mini"     # optional per-task override
    budget:                        # optional per-task budget overrides
      max_loops: 6
      max_total_workers: 5
      max_total_tokens: 400000
      max_wall_time: 240
weights:                           # optional LossWeights override
  eval: 0.5
  critique: 0.3
  gate_rejections: 0.15
  budget: 0.02
  status: 0.03
```

Full schema: `packages/awp-runtime/src/awp/outer_loop/suite.py`
(`TaskSuiteSpec`, `SuiteTask`, `load_suite`).
Example:
examples/outer_loop/research_writeup.suite.yaml
and examples/e2e/fixtures/fact_card.suite.yaml.
```bash
# Run suite once, no updates (A2 mode, good for baselining):
awp optimize examples/outer_loop/research_writeup.suite.yaml

# Run 5 epochs with TextGrad and auto-rollback (A3 mode):
awp optimize examples/outer_loop/research_writeup.suite.yaml \
  --epochs 5 --learning-rate 0.5 --with-textgrad

# Disable rollback (study natural variance without the safety net):
awp optimize suite.yaml --epochs 5 --with-textgrad --no-rollback

# Inspect every epoch of a suite:
awp optimize-inspect research_writeup_v1

# Inspect a single artifact's full version history with unified diffs:
awp optimize-inspect --artifact worker_pitfalls

# Manually roll one artifact back to a specific version:
awp optimize-rollback worker_pitfalls 2
```

All commands respect `$AWP_OUTER_LOOP_DB` for an alternative DB path.
The UI server exposes a read-only view onto ~/.awp/outer_loop.db via the
Optimizer top-nav tab (OptimizerPanel). Two modes:
- Suite list — every suite in the DB with its epoch count, latest mean loss, and two actions per row: Charts (opens analytics) and Graph (pushes the chained-epoch graph into Graph-Vis).
- Charts view — three analytics charts side-by-side:
| Chart | What it shows |
|---|---|
| LossCurve | Mean loss per epoch (line) + dashed 3-epoch moving average. Dot color: green = update, red = rollback, grey = no event. Tooltip shows v{from}→v{to} transition. |
| ArtifactDeltaTimeline | One row per artifact × one column per epoch. Circle = update, red × = rollback. Clicking a marker opens a side drawer with a line-based LCS diff. |
| PerTaskLossBoxplot | Per epoch: min–max range as a vertical bar, dot at the median. Tooltip lists every task's loss. |
Endpoints (all degrade to empty-state when the DB is absent):
| Method | Path | Purpose |
|---|---|---|
| GET | /api/runs/{run_id}/epoch | Per-run epoch context (shown in Graph-Vis) |
| GET | /api/suites | Suite list for the picker |
| GET | /api/suites/{suite_id}/graph | Chained React-Flow graph for Graph-Vis |
| GET | /api/suites/{suite_id}/epochs | Per-epoch details + per-task losses + artifact events |
| GET | /api/artifacts/{name}/versions | Full version history of one artifact (incl. v0) |
In a normal run, the Manager node in Graph-Vis additionally renders an
artifact-version pill under the model line
(e.g. pitfalls v2 · rubric v1) whenever the run belongs to an epoch with
non-default artifacts. Worker nodes get a small confidence dot
(green/amber/red) in the top-right corner, derived from metric.confidence.
SqliteArtifactStore owns four tables in ~/.awp/outer_loop.db:
| Table | Key columns |
|---|---|
| artifact_versions | (id, artifact_name, version, content, parent_version, created_at, epoch_id, is_active), unique on (artifact_name, version) |
| task_suites | (id, name, tasks_json, baseline_artifacts_json, created_at) |
| epochs | (id, suite_id, epoch_num, started_at, completed_at, mean_loss, parent_artifacts_json, child_artifacts_json) |
| epoch_runs | (epoch_id, run_id, task_name, loss, scores_json) |
child_artifacts_json holds both the active-version map and the event
log:
```json
{
  "artifacts": {"worker_pitfalls": 0, "manager_planning_preamble": 1, ...},
  "events": [
    {"type": "update", "artifact": "manager_planning_preamble",
     "from_version": 0, "to_version": 1,
     "rationale": "Add explicit required-output checklist",
     "expected_loss_reduction": 0.32,
     "confidence": 0.68, "learning_rate": 0.5},
    {"type": "rollback", "artifact": "manager_planning_preamble",
     "from_version": 1, "to_version": 0,
     "mean_loss_prev": 0.34, "mean_loss_current": 0.45,
     "new_learning_rate": 0.25}
  ]
}
```

The store runs SQLite in WAL mode with `synchronous=NORMAL` and a
5 s `busy_timeout`, so concurrent `SuiteRunner.run_epoch` workers can
record `epoch_runs` rows without blocking readers; writers still take
turns, and the busy timeout makes them wait rather than fail. If WAL is
not supported on the host, the store logs a warning and falls back to
the default journal.
DB location: ~/.awp/outer_loop.db by default, overridable via
$AWP_OUTER_LOOP_DB. The path is intentionally separate from the UI
DB (~/.awp/awp_ui.db) so the outer loop never competes with the UI's
event store for locks.
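The connection setup described above can be sketched with the standard-library `sqlite3` module. This is an illustration, not `SqliteArtifactStore` itself: `open_store` is a hypothetical helper, only one of the four tables is created, and the column types are assumptions since the document lists column names only.

```python
import sqlite3


def open_store(path):
    """Open the DB with WAL, synchronous=NORMAL and a 5 s busy timeout."""
    conn = sqlite3.connect(path)
    mode = conn.execute("PRAGMA journal_mode=WAL").fetchone()[0]
    if mode.lower() != "wal":
        # SQLite reports the journal mode actually in effect.
        print(f"warning: WAL unavailable, using journal_mode={mode}")
    conn.execute("PRAGMA synchronous=NORMAL")
    conn.execute("PRAGMA busy_timeout=5000")  # milliseconds
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS artifact_versions (
            id             INTEGER PRIMARY KEY,
            artifact_name  TEXT    NOT NULL,
            version        INTEGER NOT NULL,
            content        TEXT    NOT NULL,
            parent_version INTEGER,
            created_at     TEXT    DEFAULT CURRENT_TIMESTAMP,
            epoch_id       INTEGER,
            is_active      INTEGER NOT NULL DEFAULT 0,
            UNIQUE (artifact_name, version)
        )
        """
    )
    conn.commit()
    return conn
```

The `UNIQUE (artifact_name, version)` constraint is what lets a rollback be a pointer flip: a version number, once written, always refers to the same content.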
The E2E test at examples/e2e/outer_loop_full_coverage.py runs a
3-task × 2-epoch suite (octopi · neutron stars · silk road) against
openai/gpt-5-mini. A representative run produced:
| Task | Epoch 1 (v0 baseline) | Epoch 2 (after update) | Δ loss |
|---|---|---|---|
| octopi | failed, loss 0.4038 | complete, loss 0.0454 | −89 % |
| neutron_stars | partial, loss 0.5250 | failed, loss 0.6200 | +18 % |
| silk_road | partial, loss 0.3110 | failed, loss 0.3650 | +17 % |
| mean_loss | 0.4133 | 0.3435 | −17 % |
The TextGrad update picked `manager_planning_preamble` with
`expected_loss_reduction = 0.32`, `confidence = 0.68`. The observed mean-loss
reduction (0.070 absolute, or 17 % relative to the epoch-1 mean) is smaller
than the expected 0.32 but has the same sign. The per-task variance
(−89 % vs +18 %) is the canonical SGD minibatch-noise pattern; the mean
over tasks is what the optimizer is actually following.
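The percentages in the table are relative changes against the epoch-1 loss, which is easy to confirm from the raw numbers. The snippet below only redoes that arithmetic; the variable and function names are illustrative.

```python
epoch1 = {"octopi": 0.4038, "neutron_stars": 0.5250, "silk_road": 0.3110}
epoch2 = {"octopi": 0.0454, "neutron_stars": 0.6200, "silk_road": 0.3650}


def relative_change(before, after):
    """Signed change as a fraction of the starting value."""
    return (after - before) / before


per_task = {name: relative_change(epoch1[name], epoch2[name]) for name in epoch1}
mean1 = sum(epoch1.values()) / len(epoch1)
mean2 = sum(epoch2.values()) / len(epoch2)

for name, delta in per_task.items():
    print(f"{name}: {delta:+.0%}")
print(f"mean_loss: {mean1:.4f} -> {mean2:.4f} ({relative_change(mean1, mean2):+.0%})")
# octopi: -89%
# neutron_stars: +18%
# silk_road: +17%
# mean_loss: 0.4133 -> 0.3435 (-17%)
```

Two of the three tasks got worse while the mean improved, which is why the regression check operates on the mean rather than on any single task.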
- No LLM judge for the loss. Every loss component is deterministic and derived from artifacts the inner loop already writes (`run_completion.json`, `metrics.jsonl`). This keeps the optimization signal cheap, reproducible, and immune to judge-drift.
- No multi-artifact updates per epoch. One artifact, one step. This keeps rollback clean and diagnosis tractable.
- No automatic deployment. Version promotion happens inside the outer loop's own DB. Pinning an improved artifact into `prompts.py` (i.e. promoting it from "learned" to "shipped") is an explicit human step.
- No learnable optimizer prompt. `_OPTIMIZER_SYSTEM_PROMPT` is hard-coded on purpose; making the meta-prompt learnable would require a stabiliser we have not built yet.
- No conformance claim. The outer loop lives outside `compliance.md` §2. A5 will be promoted into the spec only after A3 shows stable behaviour on a wider set of suites.
When awp optimize <suite>.yaml --target <experiment_id>:<task_id> is
invoked, the CLI overrides --db to point to an isolated DB under the
experiment hierarchy: <experiment>/outer_loop.db. This implements spec
decision β (isolation) — each experiment/task pair has its own artifact
store, preventing cross-task contamination and enabling independent
optimization campaigns. Epoch-runs are persisted under
<experiment>/tasks/<task_id>/optimizations/ as timestamped directories.
The per-experiment DB allows an optimization suite to be scoped and
re-run against a specific task, with loss computed using that task's
specific workflow artifacts and metrics. Implementation lives in
packages/awp-core/src/awp/experiment/cli_handlers.py::optimize_task_aware.
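The DB-path rules scattered through this document (default location, environment override, per-experiment isolation) can be summarised in one small resolver. This is a sketch only: `resolve_outer_loop_db` is a hypothetical helper, and the precedence of `--target` over `$AWP_OUTER_LOOP_DB` is an assumption, since the text only states that `--target` overrides `--db`.

```python
import os
from pathlib import Path


def resolve_outer_loop_db(experiment_dir=None, env=None):
    """Pick the outer-loop DB path for one CLI invocation."""
    env = os.environ if env is None else env
    if experiment_dir is not None:
        # --target given: isolated store under the experiment (decision β).
        return Path(experiment_dir) / "outer_loop.db"
    override = env.get("AWP_OUTER_LOOP_DB")
    if override:
        return Path(override)
    return Path.home() / ".awp" / "outer_loop.db"


print(resolve_outer_loop_db("experiments/demo", env={}))
```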
- `docs/refinement.md`: y-axis optimisation (a single run's deliverable).
- `docs/continuation.md`: y-axis carry-over across tasks.
- `packages/awp-runtime/src/awp/outer_loop/runner.py`: `SuiteRunner`.
- After Plan 4: use `awp optimize --target <exp>:<task> SUITE.yaml` to attach optimizations under `<task>/optimizations/suite_<ts>/`, using the per-experiment `<experiment>/outer_loop.db` (decision β).