AlphaEvo finds short-term stock return predictors — alpha factors — by having an LLM propose formulas that an evolutionary optimizer then breeds and stress-tests across multiple time windows. The LLM never sees returns; it only shapes the search space. The optimizer decides what survives.
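Market data ──► Stage I: LLM seeds ──► Stage II: GP search ──► Stage III: EoH refine ──► Deployment bundle

| Stage I: LLM seeds | Stage II: GP search | Stage III: EoH refine |
|---|---|---|
| regime-aware retrieval | 20 generations | 8 generations |
| 10 candidate formulas in safe expression DSL | multi-window IC fitness | 5 evolution operators |
| | train/test split | per-window IC feedback |

A decorrelation filter is applied before the deployment bundle is produced.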
Three design decisions that matter: (1) seeding adapts to the current market regime, not a fixed library; (2) every candidate is scored on four non-overlapping sub-windows, so overfitting to one period does not survive; (3) EoH operators mutate structure guided by which time windows are weakest — a feedback loop that deterministic GP cannot replicate.
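A minimal sketch of decisions (2) and (3): score a signal on four non-overlapping sub-windows and find the weakest one. It assumes IC means the Spearman rank correlation between the signal and next-period returns; the helper names (`spearman_ic`, `window_ics`, `weakest_window`) are illustrative and not taken from the repository, and how AlphaEvo aggregates the per-window scores into one fitness value is not modeled here.

```python
from statistics import mean


def _ranks(values):
    """Average ranks (1-based); tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman_ic(signal, returns):
    """Spearman rank correlation between a signal and next-period returns."""
    rs, rr = _ranks(signal), _ranks(returns)
    ms, mr = mean(rs), mean(rr)
    cov = sum((a - ms) * (b - mr) for a, b in zip(rs, rr))
    var_s = sum((a - ms) ** 2 for a in rs)
    var_r = sum((b - mr) ** 2 for b in rr)
    if var_s == 0 or var_r == 0:
        return 0.0  # a constant series carries no ranking information
    return cov / (var_s * var_r) ** 0.5


def window_ics(signal, returns, n_windows=4):
    """IC on each of n_windows contiguous, non-overlapping sub-windows.

    Any trailing remainder (len % n_windows observations) is dropped.
    """
    size = len(signal) // n_windows
    return [
        spearman_ic(signal[w * size:(w + 1) * size], returns[w * size:(w + 1) * size])
        for w in range(n_windows)
    ]


def weakest_window(ics):
    """Index of the sub-window with the lowest IC."""
    return min(range(len(ics)), key=lambda w: ics[w])


# Toy data: the signal ranks returns perfectly in windows 0-2 and is inverted in window 3.
toy_signal = list(range(16))
toy_returns = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 15, 14, 13, 12]
toy_ics = window_ics(toy_signal, toy_returns)
print(toy_ics)                  # -> [1.0, 1.0, 1.0, -1.0]
print(weakest_window(toy_ics))  # -> 3
```

A formula that only works in some windows is exposed immediately here: the full-sample IC of the toy signal is still strongly positive, while the per-window view shows the last window is inverted.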
| Method | IC (test) | IC gap | Sharpe | Description |
|---|---|---|---|---|
| Buy & Hold | — | — | 0.65 | Passive benchmark |
| RSI Rule | 0.05 ± 0.15 | — | 0.45 | Standard momentum rule |
| Random GP | 0.31 ± 0.09 | 0.10 | 0.58 | GP without LLM seeding |
| LLM-Only | 0.01 ± 0.15 | — | 0.41 | LLM formula, no evolution |
| GP + Anchors | 0.15 ± 0.10 | 0.12 | 0.56 | GP with fixed seed library |
| AlphaEvo (ours) | 0.52 ± 0.08 | 0.09 | 4.78 | LLM seed + GP + EoH |
IC gap = train IC − test IC. Lower is better. Paired t-test vs. Random GP: p < 0.01, n = 120.
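The two quantities in the note above can be computed as follows. This is a sketch on made-up numbers, not the paper's evaluation code: `ic_gap` follows the definition given here, and `paired_t_statistic` is the textbook paired t statistic (mean of the per-instance differences divided by their standard error). Converting it to a p-value needs a t-distribution CDF, which is left out.

```python
from math import sqrt
from statistics import mean, stdev


def ic_gap(train_ic, test_ic):
    """IC gap as defined above: train IC minus test IC (lower is better)."""
    return train_ic - test_ic


def paired_t_statistic(a, b):
    """t statistic of a paired t-test on per-instance scores a[i] vs. b[i].

    Raises ZeroDivisionError if every difference is identical.
    """
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equally long samples with at least 2 pairs")
    diffs = [x - y for x, y in zip(a, b)]
    return mean(diffs) / (stdev(diffs) / sqrt(len(diffs)))


# Illustrative per-instance test ICs for two methods (not paper data).
method_a = [0.50, 0.55, 0.48, 0.60]
method_b = [0.30, 0.33, 0.31, 0.35]
print(f"t = {paired_t_statistic(method_a, method_b):.2f}, df = {len(method_a) - 1}")
print(f"gap = {ic_gap(0.6, 0.5):.3f}")
```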
| Method | Daily IC (mean) | ICIR | L/S Sharpe | L/S MDD | Hit rate |
|---|---|---|---|---|---|
| RSI Rule | 0.008 | 0.14 | 0.38 | −28.7% | 51.9% |
| Random GP | 0.031 | 0.49 | 1.24 | −18.4% | 54.2% |
| LLM-Only | 0.019 | 0.31 | 0.72 | −24.1% | 52.5% |
| GP + Anchors | 0.044 | 0.68 | 1.71 | −14.3% | 56.8% |
| AlphaEvo (ours) | 0.052 | 0.82 | 1.89 | −13.1% | 58.7% |
| SPY (passive) | — | — | 0.68 | −33.9% | — |
L/S = equal-weighted long top decile, short bottom decile, daily rebalance. 3,780 trading days.
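One rebalance step of the long-short construction described above might look like this. It is a sketch under stated assumptions: the function name is hypothetical, the return is taken as the plain spread between the two legs, and transaction costs are ignored; the paper may normalize or cost the portfolio differently.

```python
def long_short_return(signal, next_returns, fraction=0.1):
    """One-period return of an equal-weighted long-short portfolio.

    Longs the top `fraction` of assets by signal and shorts the bottom
    `fraction`; returns mean(long returns) - mean(short returns).
    """
    if len(signal) != len(next_returns):
        raise ValueError("signal and next_returns must align")
    k = max(1, int(len(signal) * fraction))
    if 2 * k > len(signal):
        raise ValueError("not enough assets for disjoint long and short legs")
    order = sorted(range(len(signal)), key=lambda i: signal[i])
    short_leg, long_leg = order[:k], order[-k:]
    long_ret = sum(next_returns[i] for i in long_leg) / k
    short_ret = sum(next_returns[i] for i in short_leg) / k
    return long_ret - short_ret


# Toy cross-section of 20 assets whose next-day return rises with the signal.
toy_signal = list(range(20))
toy_returns = [i * 0.001 for i in range(20)]
print(f"{long_short_return(toy_signal, toy_returns):.4f}")
```

With daily rebalancing, this function would be called once per trading day on that day's cross-section, and the resulting series of daily returns feeds the Sharpe and drawdown columns.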
| Configuration | IC (test) | Sharpe | IC gap |
|---|---|---|---|
| AlphaEvo full | 0.524 | 4.78 | 0.092 |
| − EoH (GP only, same budget) | 0.483 | 4.42 | 0.112 |
| − LLM seeding (anchors only) | 0.422 | 4.04 | 0.096 |
| − multi-window fitness | 0.507 | 4.65 | 0.144 |
| − decorrelation filter | 0.602 | 4.61 | 0.285 |
Removing decorrelation raises IC but more than triples IC gap — the filter is essential for robustness.
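To make the idea of a decorrelation filter concrete, here is one common greedy variant: walk the candidates from best to worst score and keep one only if its signal is not too correlated with anything already kept. This is an illustration, not the logic in `alphaevo/selection.py`; the function names, the use of Pearson correlation, and the 0.7 threshold are all assumptions.

```python
from statistics import mean


def pearson(x, y):
    """Pearson correlation of two equally long sequences (0.0 if either is constant)."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / (vx * vy) ** 0.5


def decorrelate(candidates, max_abs_corr=0.7):
    """Greedy filter over (name, score, signal_values) tuples.

    Keeps the best-scoring candidates whose signals have absolute
    correlation <= max_abs_corr with every candidate kept so far.
    """
    kept = []
    for name, score, values in sorted(candidates, key=lambda c: c[1], reverse=True):
        if all(abs(pearson(values, v)) <= max_abs_corr for _, _, v in kept):
            kept.append((name, score, values))
    return [name for name, _, _ in kept]


toy_candidates = [
    ("a", 0.50, [1, 2, 3, 4, 5]),
    ("b", 0.45, [2, 4, 6, 8, 10.5]),  # near-duplicate of "a"
    ("c", 0.30, [1, -1, 1, -1, 1]),   # uncorrelated with "a"
]
print(decorrelate(toy_candidates))  # -> ['a', 'c']
```

The near-duplicate `b` scores well on its own but adds no independent information, which is the kind of candidate such a filter is meant to drop.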
All three free-tier LLM configurations reach a final IC within 3% of each other, so no paid API is required.
| Provider | Model | Final IC | Seed IC | Lift |
|---|---|---|---|---|
| Groq (default) | llama-3.1-8b-instant | 0.521 | 0.336 | +0.185 |
| Groq | llama-3.3-70b-versatile | 0.535 | 0.293 | +0.243 |
| Google Gemini | gemini-2.0-flash | 0.528 | 0.294 | +0.234 |
| None | deterministic anchors | 0.452 | 0.260 | +0.192 |
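The Lift column and the "within 3%" statement can be checked with a few lines of arithmetic. The numbers below are copied from the table above, and Lift is assumed to be final IC minus seed IC.

```python
# (final IC, seed IC) copied from the provider table above.
rows = {
    "llama-3.1-8b-instant": (0.521, 0.336),
    "llama-3.3-70b-versatile": (0.535, 0.293),
    "gemini-2.0-flash": (0.528, 0.294),
    "deterministic anchors": (0.452, 0.260),
}

# Lift is assumed to be final IC minus seed IC.
lifts = {model: round(final - seed, 3) for model, (final, seed) in rows.items()}

free_tier = ["llama-3.1-8b-instant", "llama-3.3-70b-versatile", "gemini-2.0-flash"]
finals = [rows[m][0] for m in free_tier]
relative_spread = (max(finals) - min(finals)) / min(finals)

print(lifts)
print(f"free-tier spread: {relative_spread:.1%}")  # -> free-tier spread: 2.7%
```

Computed from the rounded columns, the 70B row gives 0.242 rather than the reported +0.243; the table value was presumably derived before rounding.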
git clone https://github.com/Zangir/AlphaEvo.git
cd AlphaEvo
python3.11 -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt

No API key needed. Reproduce the paper experiments deterministically (E1 shown here):

python scripts/run_paper_experiments.py --experiments E1 --out-dir paper/results/generated

One instance takes ~10 seconds on a laptop. The full 120-instance E1 run takes ~4 CPU hours.
With a free Groq key (500k tokens/day — enough for dozens of sessions):
cp .env.example .env # add GROQ_API_KEY=your_key
python scripts/run_paper_experiments.py --experiments E1,E2,E3 --out-dir paper/results/generated

Regenerate all paper figures:
python scripts/generate_paper_figures.py \
--results-dir paper/results \
--out-dir paper/figures/generated

from alphaevo import AlphaEvoPipeline
from trading.data import fetch_historical, row_to_market_data
# Fetch 6 months of AAPL data (free, no API key)
df = fetch_historical("AAPL", "2024-01-01", "2024-06-30")
history = [row_to_market_data("AAPL", row, ts) for ts, row in df.iterrows()]
# Deterministic mode — no LLM calls, fully reproducible
pipeline = AlphaEvoPipeline(use_llm=False, use_eoh=True)
result = pipeline.run_cycle(history, symbol="AAPL", position_state="flat")
print(result.selected_alpha.formula)
# e.g.: ts_zscore_scale(cwise_mul(ts_delta(close, 3), ts_zscore_scale(volume_ratio_10, 10)), 15)
print(f"Test IC: {result.selected_alpha.metrics.test_ic:.3f}")
print(f"Sharpe: {result.selected_alpha.metrics.sharpe_ratio:.2f}")

The PipelineResult exposes the full search trace: every candidate evaluated, the GP search history, EoH iteration summaries, the analyst review, and a deployment bundle, all as Pydantic models.
alphaevo/
├── alphaevo/
│ ├── dsl.py # Safe expression DSL — 44 operators, sandboxed eval
│ ├── evaluator.py # IC / Sharpe / fitness computation
│ ├── catalog.py # Field + operator specs, seed examples
│ ├── search.py # GP search engine (deterministic, seeded RNG)
│ ├── eoh.py # EoH refinement (E1, E2, M1, M2, M3 operators)
│ ├── pipeline.py # Full 3-stage LangGraph workflow
│ ├── knowledge.py # Regime-aware knowledge compiler
│ ├── selection.py # Qualified-alpha filtering + decorrelation
│ └── prompts/ # System + user prompt templates
├── evaluation/
│ └── metrics.py # Sharpe, max drawdown, total return
├── trading/
│ └── data.py # yfinance OHLCV + RSI / MA20 / MA50
├── scripts/
│ ├── run_paper_experiments.py # Reproduce E1–E9 (Table 1, ablation, …)
│ ├── run_large_scale_experiment.py # Reproduce E10 (503 stocks, 15 years)
│ └── generate_paper_figures.py # Regenerate all paper figures
├── data/
│ └── sp500_universe.csv # 503-stock universe snapshot (as of 2024-12)
├── paper/
│ ├── results/ # Included paper artifact JSONs (E1–E10)
│ └── figures/ # Included paper figures (PDF)
├── tests/ # Smoke tests — run with pytest
├── requirements.txt
├── pyproject.toml
└── .env.example
Add a new DSL operator: define a function in alphaevo/dsl.py, add it to FUNCTION_ENV and the allowed-node whitelist, then add an OperatorSpec entry in alphaevo/catalog.py. The search engine and EoH optimizer pick it up automatically.
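The registration pattern described above (a function environment plus a whitelist of allowed syntax nodes) can be illustrated with a toy evaluator built on Python's `ast` module. This is a self-contained miniature, not the code in `alphaevo/dsl.py`: the name `FUNCTION_ENV` is borrowed from the text, while `ALLOWED_NODES`, `evaluate`, the toy `ts_delta`, and the new `cwise_add` operator are hypothetical. The `OperatorSpec` catalog step is not modeled.

```python
import ast


def ts_delta(series, n):
    """Toy stand-in: difference between each value and the value n steps earlier."""
    return [None] * n + [series[i] - series[i - n] for i in range(n, len(series))]


# Toy versions of the two registries named in the text.
FUNCTION_ENV = {"ts_delta": ts_delta}
ALLOWED_NODES = (ast.Expression, ast.Call, ast.Name, ast.Load, ast.Constant)


def _eval(node, fields):
    if isinstance(node, ast.Constant):
        return node.value
    if isinstance(node, ast.Name):
        if node.id not in fields:
            raise ValueError(f"unknown field: {node.id}")
        return fields[node.id]
    # Only ast.Call remains after validation.
    args = [_eval(arg, fields) for arg in node.args]
    return FUNCTION_ENV[node.func.id](*args)


def evaluate(formula, fields):
    """Evaluate a formula built only from registered calls, field names and numbers."""
    tree = ast.parse(formula, mode="eval")
    for node in ast.walk(tree):
        if not isinstance(node, ALLOWED_NODES):
            raise ValueError(f"disallowed syntax: {type(node).__name__}")
        if isinstance(node, ast.Call):
            if not isinstance(node.func, ast.Name) or node.func.id not in FUNCTION_ENV:
                raise ValueError("unknown operator")
        if isinstance(node, ast.Constant) and not isinstance(node.value, (int, float)):
            raise ValueError("only numeric constants are allowed")
    return _eval(tree.body, fields)


# Adding a (hypothetical) operator: define it, then register it.
def cwise_add(a, b):
    """Element-wise sum that propagates missing values."""
    return [None if x is None or y is None else x + y for x, y in zip(a, b)]


FUNCTION_ENV["cwise_add"] = cwise_add

print(evaluate("cwise_add(ts_delta(close, 1), close)", {"close": [10.0, 11.0, 13.0]}))
# -> [None, 12.0, 15.0]
```

Because anything outside the whitelist is rejected before evaluation, an input such as `__import__('os')` or `close.__class__` raises `ValueError` instead of executing.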
Add a new EoH evolution operator: subclass or extend EoHRefinementEngine in alphaevo/eoh.py. The operator receives the current population and should return a mutated formula string.
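As a rough illustration of the contract described above (population in, mutated formula string out), here is a standalone mutation function. Everything about its shape is an assumption: the real interface of `EoHRefinementEngine` is not shown in this document, the population is taken to be a list of `(formula, fitness)` pairs, and `mutate_window` is a hypothetical name.

```python
import random
import re


def mutate_window(population, rng):
    """Take the fittest formula and nudge one integer argument by +/-1.

    `population` is assumed to be a list of (formula, fitness) pairs;
    `rng` is a random.Random instance so that runs are reproducible.
    """
    best_formula, _ = max(population, key=lambda p: p[1])
    # Standalone integers only: the "10" in volume_ratio_10 is not matched,
    # because "_" counts as a word character and blocks the \b boundary.
    matches = list(re.finditer(r"\b\d+\b", best_formula))
    if not matches:
        return best_formula
    m = rng.choice(matches)
    new_value = max(1, int(m.group()) + rng.choice([-1, 1]))
    return best_formula[:m.start()] + str(new_value) + best_formula[m.end():]


toy_population = [
    ("ts_delta(close, 3)", 0.40),
    ("ts_delta(close, 9)", 0.10),
]
print(mutate_window(toy_population, random.Random(0)))
```

Passing a seeded `random.Random` keeps the operator compatible with a deterministic search, which matches the seeded-RNG design noted for `search.py`.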
Use a different LLM provider: set LLM_PROVIDER in .env. All providers listed below work with the existing prompt templates.
| Provider | Free? | LLM_PROVIDER= | Key env var |
|---|---|---|---|
| Groq: Llama 3.1 8B (paper default) | ✅ 500k tokens/day | groq | GROQ_API_KEY |
| Google Gemini: Gemini 2.0 Flash | ✅ 1,500 req/day | gemini | GEMINI_API_KEY |
| OpenRouter: many free models | ✅ free tier | openrouter | OPENROUTER_API_KEY |
| Ollama: runs locally, offline | ✅ unlimited | ollama | none |
| OpenAI GPT-4o | 💳 paid | openai | OPENAI_API_KEY |
| Anthropic Claude | 💳 paid | anthropic | ANTHROPIC_API_KEY |
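A small pre-flight check can catch a missing key before a run starts. The helper below is hypothetical and not part of the repository; it only encodes the provider table above. Falling back to `groq` when `LLM_PROVIDER` is unset is an assumption based on Groq being the paper default.

```python
import os

# Mirrors the provider table: LLM_PROVIDER value -> required key variable.
PROVIDER_KEY_VARS = {
    "groq": "GROQ_API_KEY",
    "gemini": "GEMINI_API_KEY",
    "openrouter": "OPENROUTER_API_KEY",
    "ollama": None,  # runs locally, no key required
    "openai": "OPENAI_API_KEY",
    "anthropic": "ANTHROPIC_API_KEY",
}


def check_provider_env(env=None):
    """Return (provider, missing_key_var), where missing_key_var is None if all is set.

    `env` defaults to os.environ; pass a dict to check a parsed .env file.
    """
    env = os.environ if env is None else env
    # Assumption: an unset LLM_PROVIDER falls back to the paper default, groq.
    provider = env.get("LLM_PROVIDER", "groq").strip().lower()
    if provider not in PROVIDER_KEY_VARS:
        raise ValueError(f"unknown LLM_PROVIDER: {provider!r}")
    key_var = PROVIDER_KEY_VARS[provider]
    missing = key_var if key_var and not env.get(key_var) else None
    return provider, missing


print(check_provider_env({"LLM_PROVIDER": "ollama"}))  # -> ('ollama', None)
print(check_provider_env({"LLM_PROVIDER": "gemini"}))  # -> ('gemini', 'GEMINI_API_KEY')
```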
Run the test suite:
pip install pytest
pytest tests/ -v

| ID | Description | Command |
|---|---|---|
| E1 | Per-asset IC benchmark (Table 1) | --experiments E1 |
| E2 | Ablation study | --experiments E2 |
| E3 | Regime analysis | --experiments E3 |
| E4 | LLM provider sensitivity | --experiments E4 |
| E5 | Operator analysis | --experiments E5 |
| E6 | Diversity / efficiency curves | --experiments E6 |
| E7 | Qualitative alpha examples | --experiments E7 |
| E8 | Per-symbol breakdown | --experiments E8 |
| E9 | IC gap distribution | --experiments E9 |
| E10 | Large-scale cross-sectional | python scripts/run_large_scale_experiment.py |
E1–E9 are all deterministic (no LLM calls). E10 fetches the 503-ticker universe from yfinance; plan for ~2–4 hours on a 4-core machine.
Fresh reruns write to paper/results/generated/ — the included paper/results/*.json files are the fixed paper artifacts and should not be overwritten.
If you use AlphaEvo in your research, please cite:
@inproceedings{alphaevo2026,
title = {{AlphaEvo}: {LLM}-Seeded Evolutionary Discovery of Robust Quantitative Trading Signals},
author = {Machavariani, Temiko and Tsourekas, Kyriakos and Rotte, Matvei and Tsulaia, Luka and Tazhibaev, Iskhak and Nwadike, Munachiso Samuel and Lahlou, Salem and Jamwal, Prashant Kumar and Inui, Kentaro and Tak{\'a}{\v{c}}, Martin and Iklassov, Zangir},
booktitle = {Advances in Neural Information Processing Systems},
year = {2026},
note = {Preprint. Under review.}
}

Code: MIT. Market data sourced from Yahoo Finance under their terms of use. This repository is a research artifact, not financial advice. Past backtest performance does not guarantee future results.