Maabarium is a Rust-native, local-first continuous improvement engine inspired by Karpathy's Autoresearch pattern. It implements a keep-winner loop: propose → apply → evaluate → keep-or-revert, generalised beyond ML training to arbitrary optimisation domains.
- Local-first, private, free — native Rust orchestration with strong support for local runtimes such as Ollama on Apple Silicon; no cloud required by default
- Pure Rust control plane — Tokio async runtime, no Python in the orchestration layer
- Autoresearch keep-winner loop — propose → apply → evaluate → keep/revert
- Generalized domains — pluggable Evaluator trait, not ML-only
- Beautiful desktop UX — Tauri desktop console with live dashboards
- Future-proof — explicit extension points, documented trade-offs, and OSS-ready structure
```
maabarium/
├── crates/
│   ├── maabarium-core/      # Engine, agents, git, LLM, evaluator, persistence
│   ├── maabarium-cli/       # Terminal CLI binary (Phase 1)
│   └── maabarium-desktop/   # Tauri desktop console
```
The workspace is split so that maabarium-core can be built and tested independently of the Tauri desktop shell.
```
for iteration in 1..=max_iterations {
    branch = experiment_branch_name(iteration)
    proposal = council.propose(context, metrics)
    workspace = git.apply_proposal(branch, proposal, reusable_workspace)
    result = timeout(evaluator.evaluate(proposal, EvaluationContext { workspace_path: workspace }))
    if result.weighted_total > baseline + min_improvement:
        git.create_branch_at_workspace_head(workspace, branch)
        git.promote_branch(branch)   // fast-forward main
        baseline = result.weighted_total
        outcome = promoted
    else:
        git.detach_experiment_workspace(workspace)
        outcome = rejected
    persistence.log_experiment(result, outcome)
}
```
Key design decisions:
- `CancellationToken` (from `tokio-util`) drives graceful shutdown on Ctrl-C
- Every fallible step uses `continue` with a `tracing::warn!` — no panics in production paths
- `tokio::time::timeout` enforces per-experiment wall-clock limits (see the sketch after this list)
- All results persist to SQLite with the engine's explicit promotion outcome
- Detached experiment worktrees are reused across iterations when safe, then cleaned up once at the end of the run
- Experiment branch refs are materialized only on promoted iterations; rejected runs stay as detached worktree state and never create branch history
- The CLI prints an end-of-run timing summary aggregated from per-phase engine instrumentation
- Sandbox snapshot materialization uses a dedicated workspace materializer with macOS clone-on-write support where available and a portable copy fallback everywhere else
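A minimal sketch of how the timeout and cancellation pieces can be combined around a single experiment; the helper name `run_with_limits` is illustrative and the engine's actual wiring differs:

```rust
use std::time::Duration;
use tokio::time::timeout;
use tokio_util::sync::CancellationToken;

// Illustrative only: races a shutdown signal against a wall-clock-limited
// future, logging and skipping instead of panicking in either failure case.
async fn run_with_limits<F, T>(token: &CancellationToken, limit: Duration, fut: F) -> Option<T>
where
    F: std::future::Future<Output = T>,
{
    tokio::select! {
        _ = token.cancelled() => {
            tracing::warn!("shutdown requested; abandoning experiment");
            None
        }
        res = timeout(limit, fut) => match res {
            Ok(value) => Some(value),
            Err(_) => {
                tracing::warn!("experiment exceeded wall-clock limit; skipping");
                None
            }
        },
    }
}
```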
| Module | Responsibility |
|---|---|
| `blueprint` | TOML config parsing + validation |
| `engine` | Keep-winner loop orchestration |
| `agent` | Single `Agent` + `Council` (multi-agent debate) |
| `git_manager` | `git2` operations, reusable detached experiment worktrees, all wrapped in `spawn_blocking` |
| `llm/` | `LLMProvider` trait, Ollama backend, OpenAI-compat backend, `ModelPool` with routing + rate limiting |
| `evaluator/` | `Evaluator` trait, `ExperimentResult`, `PromptEvaluator` |
| `metrics` | Weighted scoring, improvement detection, normalization |
| `persistence` | SQLite read/write (WAL mode, parameterised queries) |
| `error` | Typed error enums via `thiserror` |
git2 (libgit2 bindings) is synchronous and not designed for async Tokio tasks. All git2 calls in git_manager.rs are wrapped in tokio::task::spawn_blocking to prevent stalling the Tokio executor. This is the standard pattern for calling blocking code from async Rust.
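For illustration, a sketch of that pattern around a single `git2` call; the function name and the exact operation are examples, not the `git_manager` API:

```rust
use std::path::PathBuf;

// Example of the spawn_blocking pattern: move owned data into the closure,
// run the synchronous git2 work on the blocking pool, then await the result.
async fn head_commit_id(repo_path: PathBuf) -> Result<String, git2::Error> {
    tokio::task::spawn_blocking(move || {
        let repo = git2::Repository::open(&repo_path)?;
        let head = repo.head()?;
        Ok(head.target().map(|oid| oid.to_string()).unwrap_or_default())
    })
    .await
    .expect("blocking git task panicked")
}
```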
The LLMProvider trait decouples the engine from any specific LLM backend:
```rust
#[async_trait]
pub trait LLMProvider: Send + Sync {
    async fn complete(&self, request: &CompletionRequest) -> Result<CompletionResponse, LLMError>;
    fn provider_name(&self) -> &str;
    fn model_name(&self) -> &str;
}
```

Implementations:

- `OllamaProvider` — calls the Ollama REST API (`POST /api/generate`) via `reqwest`
- `OpenAICompatProvider` — generic OpenAI-compatible endpoint (OpenAI, Groq, OpenRouter, DeepSeek, xAI, compatible gateways)
- `AnthropicProvider` — native Anthropic Messages API client
- `GeminiProvider` — native Gemini `generateContent` API client
- `ModelPool` — wraps one or more providers, enforces per-model request pacing, and supports `explicit` or `round_robin` blueprint assignment
No external ollama-rs crate is used; the Ollama REST API is called directly.
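As a sketch, assuming a non-streaming request and ignoring options the real provider likely sets, the call looks roughly like this; `ollama_generate` is an illustrative helper, not the `OllamaProvider` API:

```rust
use serde_json::{json, Value};

// Minimal non-streaming call to Ollama's generate endpoint via reqwest.
async fn ollama_generate(base_url: &str, model: &str, prompt: &str) -> Result<String, reqwest::Error> {
    let body = json!({ "model": model, "prompt": prompt, "stream": false });
    let resp: Value = reqwest::Client::new()
        .post(format!("{base_url}/api/generate"))
        .json(&body)
        .send()
        .await?
        .error_for_status()?
        .json()
        .await?;
    // Ollama returns the completion text in the "response" field.
    Ok(resp["response"].as_str().unwrap_or_default().to_string())
}
```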
When blueprints use assignment = "explicit", each agent receives a pool containing just its configured model. When they use assignment = "round_robin", the pool rotates across the entire configured model list.
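A condensed sketch of that routing behaviour, assuming an atomic cursor for rotation; the struct, field, and method names are illustrative, not the actual `ModelPool` API:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// With explicit assignment the pool holds exactly one provider; with
// round_robin it rotates through the full configured list on each request.
struct Pool<P> {
    providers: Vec<P>,
    cursor: AtomicUsize,
    round_robin: bool,
}

impl<P> Pool<P> {
    fn next_provider(&self) -> &P {
        if self.round_robin {
            let i = self.cursor.fetch_add(1, Ordering::Relaxed) % self.providers.len();
            &self.providers[i]
        } else {
            &self.providers[0]
        }
    }
}
```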
```rust
#[async_trait]
pub trait Evaluator: Send + Sync {
    async fn evaluate(&self, proposal: &Proposal, iteration: u64, context: &EvaluationContext) -> Result<ExperimentResult, EvalError>;
}
```

`ExperimentResult` carries multi-dimensional scores, the weighted total, duration, and the original proposal — not just a bare `f64`.
Three tables:
- `experiments` — one row per experiment run
- `metrics` — one row per metric dimension per experiment
- `proposals` — proposal metadata
SQLite runs in WAL mode for concurrent reads from a future dashboard while the engine writes.
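A minimal sketch of that setup with `rusqlite`, assuming illustrative column names rather than the actual schema:

```rust
use rusqlite::{params, Connection};

// Opens the database, switches it to WAL mode, and writes one experiment
// row using parameterised bindings (column names are illustrative).
fn log_experiment(db_path: &str, iteration: i64, weighted_total: f64, outcome: &str) -> rusqlite::Result<()> {
    let conn = Connection::open(db_path)?;
    conn.pragma_update(None, "journal_mode", "WAL")?;
    conn.execute(
        "INSERT INTO experiments (iteration, weighted_total, outcome) VALUES (?1, ?2, ?3)",
        params![iteration, weighted_total, outcome],
    )?;
    Ok(())
}
```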
The default database and log paths are:
- `data/maabarium.db`
- `data/maabarium.log`
The Tauri desktop console reads both sources to render live score, duration, and token-usage cards.
- Agent writes to arbitrary paths: reused git worktrees or sandbox snapshots plus path sanitization and Wasmtime-backed policy validation
- Untrusted code execution: subprocess-based evaluator execution inside isolated worktrees or fallback sandbox roots
- API key leakage: `keyring` crate → OS keychain; never logged, never serialized to disk (sketched after this list)
- Runaway resource usage: per-experiment timeout via `tokio::time::timeout` + `max_iterations` cap in the blueprint
- Supply chain attacks: `deny.toml` for `cargo-deny`, auditing CVEs, licenses, and duplicate crates
- Git history pollution: experiment branches under the `experiment/` prefix with explicit cleanup paths
- SQL injection: all queries use `rusqlite::params![]` parameterised binding
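A sketch of the keychain lookup with the `keyring` crate; the service and user names here are placeholders, not the identifiers the engine actually uses:

```rust
// Resolve an API key from the OS keychain instead of config files or logs.
fn load_api_key(provider: &str) -> Result<String, keyring::Error> {
    let entry = keyring::Entry::new("maabarium", provider)?;
    entry.get_password()
}
```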
The original phase model is now mostly complete, covered by the working core runtime and the Tauri desktop console.
Implemented in the current repository:
- workspace split across `maabarium-core`, `maabarium-cli`, and `maabarium-desktop`
- council-driven proposal generation and engine loop orchestration
- git-backed experiment isolation and branch promotion/revert flow
- SQLite persistence and export
- live Tauri desktop cards, history, diff, and logs backed by persisted runtime data
- blueprint-driven multi-model routing with per-model pacing
- tracing spans on engine, pool, evaluator, and sandbox hot paths
- Wasmtime-backed sandbox policy validation and subprocess-based code evaluation
- reusable experiment worktrees plus CLI run timing summaries for profiling and operator visibility
- APFS-friendly sandbox workspace materialization for macOS plus a portable fallback path for Linux and Windows
Portable optimised local builds can use:
```
cargo build --profile release-lto
```

Machine-specific local benchmarking can opt into native CPU tuning explicitly:

```
RUSTFLAGS="-C target-cpu=native" cargo build --profile release-lto
```

The native-tuned command is intentionally separate from portable release builds so distributed artefacts do not assume the build host's CPU feature set.
The historical closure items are now explicitly resolved in code and docs:
- Desktop packaging/distribution is documented for the Tauri desktop app
- Evaluator selection is routed through an internal built-in registry
- OSS launch artefacts exist and match the repository
- The LoRA path is explicitly scoped to external artefact validation with reproducibility manifests
The supported desktop distribution path is the Tauri app bundle built from the workspace.
Current packaging expectation:
- build with `cd crates/maabarium-desktop && pnpm tauri build`
- distribute the generated platform bundle from the Tauri output directory
- keep runtime data outside the app binary at `data/maabarium.db` and `data/maabarium.log`
The desktop stack is Tauri-based. A manual signing/notarisation process is documented, but not yet automated.
The detailed packaging/release expectations are documented in DESKTOP_PACKAGING.md.
The following are not active implementation commitments in the current roadmap:
- No second desktop shell is planned; the supported desktop shell is the Tauri app.
- No runtime shared-library evaluator plugin ABI is promised; external plugins remain deferred behind the built-in evaluator registry because ABI stability and supply-chain trust are not solved yet.
- No native Rust MLX-first path is promised; the supported LoRA path validates externally produced artefacts and reproducibility metadata instead of claiming in-engine training.
- No CI-backed signing/notarisation automation is promised yet.
- No return to the old broad phase-table format is planned for active docs.
Evaluator choice is now resolved through EvaluatorRegistry in maabarium-core.
- `evaluator.kind = "process"` selects `ProcessPluginEvaluator`
- `evaluator.kind = "builtin"` with `evaluator.builtin = "code" | "prompt" | "research" | "lora"` selects the matching built-in evaluator directly
- built-in template metadata is the next routing signal when there is no explicit evaluator override
- language, metric names, blueprint name, and target-path patterns remain as backward-compatible fallback heuristics
This keeps evaluator selection deterministic and typed without exposing a dynamic shared-library plugin surface.
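A condensed sketch of that precedence order; the enum and function below are illustrative and do not match the real `EvaluatorRegistry` types:

```rust
// Resolution order: explicit blueprint override, then built-in template
// metadata, then the backward-compatible heuristics.
enum Selected {
    Process,
    Builtin(String), // "code" | "prompt" | "research" | "lora"
}

fn resolve_evaluator(
    kind: Option<&str>,
    builtin: Option<&str>,
    template_builtin: Option<&str>,
    heuristic_guess: &str,
) -> Selected {
    match kind {
        Some("process") => Selected::Process,
        Some("builtin") => Selected::Builtin(builtin.unwrap_or("prompt").to_string()),
        _ => match template_builtin {
            Some(name) => Selected::Builtin(name.to_string()),
            None => Selected::Builtin(heuristic_guess.to_string()),
        },
    }
}
```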
The supported LoRA workflow is intentionally narrow:
- training or fine-tuning happens outside the engine,
- proposals carry adapter artefacts plus a `maabarium-lora-run.json` manifest,
- `LoraEvaluator` scores artefact completeness, metadata hygiene, and reproducibility hints from that manifest.
This closes the roadmap item without overstating native MLX support.
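For illustration only, a hypothetical shape such a manifest could take; the actual `maabarium-lora-run.json` fields are defined by `LoraEvaluator`, and every field name below is an assumption:

```rust
use serde::Deserialize;

// Hypothetical manifest fields covering the three things the evaluator
// scores: artefact completeness, metadata hygiene, reproducibility hints.
#[derive(Deserialize)]
struct LoraRunManifest {
    adapter_path: String,             // where the externally trained adapter lives
    base_model: String,               // model the adapter was trained against
    training_command: Option<String>, // reproducibility hint
    dataset_hash: Option<String>,     // reproducibility hint
    seed: Option<u64>,                // reproducibility hint
}
```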
- The live desktop app is Tauri-based and lives in `crates/maabarium-desktop`.
- The runtime does not use a pure WASI-only execution model for evaluator execution; it uses a hybrid sandboxing approach.