reasoning-bank

A framework-agnostic implementation of ReasoningBank (Ouyang et al., ICLR 2026): agents that distill lessons from past trajectories and retrieve them on new tasks. Includes MaTTS (memory-aware test-time scaling).

What this is

ReasoningBank is a memory mechanism for LLM agents. After each task, it judges the trajectory, distills generalizable reasoning strategies from both successes and failures, and indexes them. On future tasks it retrieves the relevant strategies and injects them into the agent's context.

This package is a clean, framework-agnostic implementation. It works with any agent loop, the raw Anthropic or OpenAI SDK, or your own ReAct loop. It does not depend on any agent framework. (It is used as the memory layer in bottensor-fleet, but does not require it.)

The reference implementation from the paper is welded to specific benchmark harnesses. This one is a standalone library you can attach to anything.

Install

pip install reasoning-bank

How it plugs into any agent loop

Three calls:

from reasoning_bank import ReasoningBank, Turn

async def my_llm(prompt, *, system=None):
    # wrap your provider however you like
    ...

bank = ReasoningBank(llm=my_llm, scope="my-project")

# 1. retrieve relevant past lessons before your agent runs
memories = await bank.retrieve(task)
system_block = bank.format_as_system_block(memories) if memories else None

# 2. run your agent however you want, optionally prepending system_block
answer = await my_agent(task, system=system_block)

# 3. ingest the trajectory so the bank learns from it
trajectory = [Turn("user", task), Turn("assistant", answer)]
await bank.ingest_trajectory(trajectory, task=task)

The bank accepts any trajectory shape (its own Turn type, plain dicts, or duck-typed objects with .role/.content) and any async LLM callable. Pluggable embedder (MiniLM default) and store (SQLite+vec default, in-memory for tests).

MaTTS

Memory-aware test-time scaling: run k rollouts in parallel, contrast them, distill higher-quality memories.

from reasoning_bank import matts_run

async def rollout():
    answer = await my_agent(task)
    return [Turn("user", task), Turn("assistant", answer)]

trajectories, memories = await matts_run(rollout, task=task, bank=bank, k=3)

Does it actually work?

This is the honest part, and the reason the repo includes a full experiment suite.

The machinery works end-to-end: it judges trajectories, distills sensible transferable lessons, retrieves them, and injects them, with no framework dependency. That is verified.

Whether it produces a measurable performance lift is a separate question, and the answer depends heavily on the task distribution and the model. I ran four controlled experiments to find out, including a SWE-bench-lite harness with a no-retry control, a naive-retry control, a per-instance bank, and a persistent cross-instance bank, scored by the official SWE-bench Docker evaluator.

Headline findings (Haiku 4.5, full methodology in experiments/):

On task suites where the base model is already at its capability ceiling, the bank cannot help, because there is no failure to learn from. Two early experiments hit this.
On SWE-bench-lite (a real spread of difficulty, ~50% baseline), the persistent cross-instance bank produced no measurable lift over a no-retry baseline. The cross-task transfer hypothesis was not supported in this setup (n=45 clean cells on the persistent arm; late instances did not outperform early ones).
The one consistently positive observation was defensive: naive retry (re-prompting with the raw error) hurt performance, and structured-reflection retry recovered it. The value was in how a failure is framed to the model, not in accumulating a memory bank.

In short: the implementation is faithful and the machinery is sound, but I did not find a regime in these experiments where cross-task memory accumulation produced net positive value. The full data, including infrastructure caveats and threats to validity, is in the experiments directory. Negative results are results.

Design

Framework-agnostic: neutral trajectory type and a plain async LLM callable at every boundary. No agent framework required.
Pluggable store: SQLite + sqlite-vec by default; in-memory for tests.
Pluggable embedder: sentence-transformers MiniLM by default.
Memory items are distilled strategies, not raw traces, per the paper: a title, a description, and an actionable lesson, derived from both successes and failures.

Paper

Ouyang et al., ReasoningBank: Scaling Agent Self-Evolving with Reasoning Memory, ICLR 2026. arXiv:2509.25140

This package is an independent implementation and is not affiliated with the paper's authors.

Name		Name	Last commit message	Last commit date
Latest commit History 10 Commits
.github/workflows		.github/workflows
docs		docs
examples		examples
experiments		experiments
src/reasoning_bank		src/reasoning_bank
tests		tests
.gitignore		.gitignore
.python-version		.python-version
CHANGELOG.md		CHANGELOG.md
LICENSE		LICENSE
README.md		README.md
pyproject.toml		pyproject.toml
uv.lock		uv.lock

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

reasoning-bank

What this is

Install

How it plugs into any agent loop

MaTTS

Does it actually work?

Design

Paper

License

About

Uh oh!

Releases 1

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

reasoning-bank

What this is

Install

How it plugs into any agent loop

MaTTS

Does it actually work?

Design

Paper

License

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases 1

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages