Autonomous prediction market strategy discovery. You are a meta-agent that improves a trading strategy agent harness.
Your job is NOT to trade directly. Your job is to improve the harness in
agent.py so the agent discovers better trading strategies on its own.
Build an agent that discovers profitable strategies for Kalshi KXHIGH weather prediction markets using only weather observation data and market price data.
The agent has access to:
- Historical weather observations (Therminal API / local parquet files)
- OHLCV market data (1-min candles with bracket definitions)
- A backtester that simulates fills and computes PnL after fees
- Python execution for custom analysis
The agent does NOT have access to any pre-built models, features, or existing trading strategies. It must discover edges from raw data.
- KXHIGH markets: daily high temperature for 20 US cities
- Brackets: 2°F wide, e.g., "56° to 57°" → YES wins if actual high is 56 or 57
- Bracket alignment (even/odd) VARIES per event — never assume
- Settlement: NWS CLI daily high, rounded to nearest integer °F
- Boundaries: [floor, cap] inclusive on both sides
- Tail brackets: "X° or below" (lower), "X° or above" (upper)
- Taker: ceil(7 * contracts * price * (1 - price)) / 100
- Maker: ceil(1.75 * contracts * price * (1 - price)) / 100
- Fees eat into edge — the strategy needs enough edge to overcome them
- Identifies mispricings: market price ≠ true probability
- Has an information advantage: better temperature forecast than the market
- Manages risk: doesn't bet the farm on one bracket
- Works across cities and seasons (not overfit to one event)
Everything above the FIXED ADAPTER BOUNDARY comment in agent.py:
SYSTEM_PROMPT— agent instructions and strategy guidanceCUSTOM_TOOLS— add analysis tools, data loading helpers- Tool implementations — improve data access, add features
- Agent orchestration — multi-step reasoning, sub-agents
MODEL,MAX_TURNS,THINKING— agent configuration
The Harbor adapter and container entrypoint sections (below the boundary).
The backtester.py settlement logic (this is ground truth).
Maximize score from the backtester (= net PnL after fees).
Secondary metrics:
win_rate— fraction of trades that are profitablesharpe_annual— risk-adjusted returnsmax_drawdown— worst peak-to-trough decline
docker build -f Dockerfile.base -t autoagent-base .
rm -rf jobs; mkdir -p jobs && uv run harbor run -p tasks/ -n 4 --agent-import-path agent:AutoAgent -o jobs --job-name latest > run.log 2>&1Log every experiment to results.tsv:
commit score win_rate n_trades sharpe status description
commit: short git hashscore: net PnL ($)win_rate: fraction (0-1)n_trades: total trades executedsharpe: annualized Sharpe ratiostatus:keep,discard, orcrashdescription: what changed
- Read the latest
run.logand task results - Analyze which cities/dates the strategy works on vs fails
- Look for patterns: does it fail on certain weather types? Price ranges?
- Choose one improvement to the harness:
- Better prompt engineering for the strategy agent
- New tools (e.g., forecast features, market microstructure analysis)
- Better signal generation logic
- Improved risk management
- Edit
agent.py - Commit, rebuild, rerun
- Record results
- Keep if score improved, discard if not
High-leverage improvements to try:
- Temperature forecasting tools: Add tools that compute simple forecasts (e.g., persistence model, climatology, recent trend extrapolation)
- Market inefficiency detection: Tools that compare market-implied probabilities against simple forecast models
- Multi-bracket analysis: Look at the full bracket distribution, not just one bracket — find where markets are most mispriced
- Timing: The agent can look at pre-prediction prices and post-prediction prices — earlier signals may get better fills
- Weather feature engineering: Temperature lags, trends, seasonality, humidity, wind — what predicts the daily high?
- Cross-city signals: Does NYC weather predict Philly? Do coastal cities behave differently from inland?
- If
scoreimproved → keep - If
scoresame and simpler harness → keep - Otherwise → discard
Do not add city-specific or date-specific hacks.
Test: "If these exact dates disappeared, would this still work?" If no, it's overfitting.
Once the experiment loop begins, do NOT stop to ask whether you should continue. Keep iterating until the human explicitly interrupts.