SkyPlan is a multi-agent OpenEnv environment for autonomous product planning. Six specialized agents collaborate to turn a product idea into a research summary, PRD, technical design, roadmap, task breakdown, validation report, and final strategy approval.
Product planning is a real workflow teams perform before implementation. SkyPlan models that work directly: market research, requirement definition, architecture design, execution planning, validation, and executive approval. The environment rewards useful partial progress, tracks document quality, and now includes explicit feedback and document-status transitions across the workflow.
Actions are represented by SkyPlanAction in models.py. Each action includes:
- agent_id: one of maya, elon, jordan, robert, taylor, sam
- action_type: a role-valid action such as SEARCH_MARKET, WRITE_PRD, or APPROVE_STRATEGY
- reasoning: the agent's rationale
- content: the document content or review output
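The four fields above can be pictured with a small stand-in class. This is only a sketch, not the real SkyPlanAction from models.py: the class name ActionSketch, the ROLE_ACTIONS mapping, and the validation behavior are all assumptions made for illustration, and the mapping covers only the three action types this README names.

```python
from dataclasses import dataclass

# Agent ids as listed in this README.
AGENT_IDS = ("maya", "elon", "jordan", "robert", "taylor", "sam")

# Hypothetical role-to-action mapping, inferred from the workflow roles.
# The real environment may define more actions and different pairings.
ROLE_ACTIONS = {
    "maya": {"SEARCH_MARKET"},
    "elon": {"WRITE_PRD"},
    "sam": {"APPROVE_STRATEGY"},
}


@dataclass(frozen=True)
class ActionSketch:
    """Illustrative stand-in for SkyPlanAction (not the real class)."""

    agent_id: str
    action_type: str
    reasoning: str
    content: str

    def __post_init__(self) -> None:
        if self.agent_id not in AGENT_IDS:
            raise ValueError(f"unknown agent_id: {self.agent_id!r}")
        allowed = ROLE_ACTIONS.get(self.agent_id)
        # Only check agents that appear in the illustrative mapping.
        if allowed is not None and self.action_type not in allowed:
            raise ValueError(
                f"{self.action_type!r} is not role-valid for {self.agent_id!r}"
            )


action = ActionSketch(
    agent_id="maya",
    action_type="SEARCH_MARKET",
    reasoning="Start by sizing the market for the idea.",
    content="Initial research notes.",
)
```

Rejecting a mismatched agent and action at construction time is one way to read "role-valid"; the actual environment might instead accept the action and penalize it through the reward.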
Observations are represented by SkyPlanObservation in models.py. Agents receive:
- task description and current phase
- shared planning documents with statuses
- feedback history and unresolved feedback
- last action result
- document status summary and documents awaiting review
- reward and done signals
The workflow order is defined in workflow.py:
- Maya researches the market and problem space.
- Elon writes the PRD.
- Jordan produces the TRD and architecture.
- Robert creates the roadmap and task plan.
- Taylor validates the package and issues structured feedback.
- Sam provides strategic approval or requests revision.
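The hand-off order above can be sketched as an ordered list with a lookup helper. The names WORKFLOW_ORDER and next_agent, and the short stage labels, are hypothetical and are not taken from workflow.py.

```python
from typing import Optional

# Agent order as listed in this README; the stage labels are illustrative.
WORKFLOW_ORDER = [
    ("maya", "research"),
    ("elon", "prd"),
    ("jordan", "trd_and_architecture"),
    ("robert", "roadmap_and_tasks"),
    ("taylor", "validation"),
    ("sam", "strategy_approval"),
]


def next_agent(current: str) -> Optional[str]:
    """Return the agent that follows `current`, or None after the last one."""
    ids = [agent_id for agent_id, _ in WORKFLOW_ORDER]
    index = ids.index(current)  # raises ValueError for an unknown agent
    return ids[index + 1] if index + 1 < len(ids) else None
```

This sketch models a single forward pass only. It does not cover Sam requesting a revision, which would send work back to an earlier agent.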
Documents move through draft -> in_review -> approved/rejected, and feedback can be generated, targeted, and later resolved by downstream actions.
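A minimal sketch of the status lifecycle as a transition table follows. The names ALLOWED_TRANSITIONS and transition are hypothetical. It also assumes that approved and rejected are terminal, because this README does not say what follows a rejection.

```python
# Status names come from this README; treating approved and rejected as
# terminal states is an assumption of this sketch.
ALLOWED_TRANSITIONS = {
    "draft": {"in_review"},
    "in_review": {"approved", "rejected"},
    "approved": set(),
    "rejected": set(),
}


def transition(current: str, target: str) -> str:
    """Return `target` if the move is allowed, otherwise raise ValueError."""
    allowed = ALLOWED_TRANSITIONS.get(current)
    if allowed is None:
        raise ValueError(f"unknown status: {current!r}")
    if target not in allowed:
        raise ValueError(f"cannot move from {current!r} to {target!r}")
    return target


status = transition("draft", "in_review")
status = transition(status, "approved")
```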
The benchmark ships with three graded tasks in tasks.py:
- easy_user_authentication: simple authentication planning
- medium_chat_app: real-time chat application planning
- hard_saas_platform: multi-tenant SaaS platform planning
Each task includes deterministic grading inputs such as required keywords, required sections, and difficulty-specific expectations.
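One way to picture deterministic grading from required keywords and sections is a simple coverage score. This sketch is not the grader in tasks.py: the function name grade_document, the case-insensitive substring matching, and the equal 50/50 weighting are all assumptions.

```python
from typing import Sequence


def coverage(text: str, required: Sequence[str]) -> float:
    """Fraction of required terms found in `text` (case-insensitive)."""
    if not required:
        return 1.0
    lowered = text.lower()
    hits = sum(1 for term in required if term.lower() in lowered)
    return hits / len(required)


def grade_document(
    text: str,
    required_keywords: Sequence[str],
    required_sections: Sequence[str],
) -> float:
    """Average keyword and section coverage (hypothetical equal weighting)."""
    keyword_score = coverage(text, required_keywords)
    section_score = coverage(text, required_sections)
    return round(0.5 * keyword_score + 0.5 * section_score, 4)


doc = "## Overview\nUsers log in with a password and a JWT token."
score = grade_document(
    doc,
    required_keywords=["password", "jwt", "oauth"],
    required_sections=["Overview", "Security"],
)
```

Because the score depends only on the document text and fixed inputs, the same document always receives the same grade. That property is what makes the local smoke-policy scores reproducible without an LLM reward.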
Install the dependencies:

    cd AgentEnv
    uv sync --extra dev

Run the server locally:

    uv run --project AgentEnv server --port 8000

Validate the OpenEnv package:

    cd AgentEnv
    openenv validate

Run the baseline inference script from the repo root:

    set HF_TOKEN=...
    python inference.py

The set syntax is for the Windows command prompt; on macOS or Linux, use export HF_TOKEN=... instead.

By default, inference.py runs all three tasks and emits the required [START], [STEP], and [END] lines for each episode. Set SKYPLAN_TASK to a specific task id to run a single task.
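The task selection described above could look roughly like the following. This is a sketch under assumptions, not the code in inference.py: the helper name select_tasks is hypothetical, and raising an error for an unknown task id is a guess at reasonable behavior.

```python
import os
from typing import List, Mapping, Optional

# Task ids as listed in this README.
TASK_IDS = ["easy_user_authentication", "medium_chat_app", "hard_saas_platform"]


def select_tasks(env: Optional[Mapping[str, str]] = None) -> List[str]:
    """Return all tasks, or only the one named by SKYPLAN_TASK if set."""
    env = os.environ if env is None else env
    task_id = env.get("SKYPLAN_TASK", "").strip()
    if not task_id:
        return list(TASK_IDS)
    if task_id not in TASK_IDS:
        # Assumed behavior: fail loudly rather than silently run nothing.
        raise ValueError(f"unknown SKYPLAN_TASK: {task_id!r}")
    return [task_id]


# Passing a mapping explicitly avoids touching the real environment.
single = select_tasks({"SKYPLAN_TASK": "medium_chat_app"})
everything = select_tasks({})
```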
Feedback integration, grading checks, and inference-contract coverage live in three test files. Run them with:

    uv run --project AgentEnv pytest test_feedback_integration.py test_grading_quality.py test_inference_contract.py -q

Current deterministic local smoke-policy scores (use_llm_reward=False, one task-aligned workflow pass):
| Task | Final Score |
|---|---|
| easy_user_authentication | 0.7214 |
| medium_chat_app | 0.7214 |
| hard_saas_platform | 0.6878 |
The token-backed inference baseline is reproducible through inference.py. Re-run it with HF_TOKEN to record model-specific baseline scores for your chosen MODEL_NAME.