| title | Ambiguity & Constraint-Aware Decision Environment |
|---|---|
| emoji | 🧠 |
| colorFrom | blue |
| colorTo | purple |
| sdk | docker |
| pinned | false |
| license | mit |
Evaluating the robustness of AI decision-making under uncertainty and logical constraints.
- 🎯 Live Benchmark API: Hugging Face Space
- 🖥️ Interactive Demo: Gradio Demo
- 🐙 Source Code: GitHub Repository
This environment evaluates whether AI agents can effectively handle ambiguity and adhere to complex logical constraints. In a realistic scheduling scenario, agents must:
- Detect missing information in user instructions.
- Navigate constraints such as time conflicts (unavailability) and hard deadlines.
- Exercise multi-step reasoning to clarify parameters before taking action.
It focuses on decision-making under uncertainty, where guessing leads to penalties and clarification is the optimal path.
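For illustration, a hypothetical Medium-difficulty instance is sketched below; the field names and action format are assumptions made for this example, not the environment's actual schema.

```python
# Hypothetical task instance (illustrative schema, not the environment's actual one).
instruction = "Schedule a sync with the design team before the end of the day."

observation = {
    "known": {"participants": "design team", "deadline": "17:00"},
    "missing": ["time"],                        # guessing a time here incurs a penalty
    "constraints": ["no meetings 12:00-13:00"],
}

# The rewarded behaviour is a targeted clarification rather than a guess:
action = {"type": "clarify", "question": "What time should the meeting start?"}
```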
Most real-world AI failures occur because agents:
- Act on incomplete information (e.g., scheduling a meeting without knowing the time).
- Ignore logical constraints (e.g., scheduling during a forbidden time slot).
- Hallucinate solutions that satisfy part of the prompt while violating hidden boundaries.
This project provides a realistic benchmark to quantify these failure modes.
- Dynamic Ambiguity: Env secrets (times/teams) are randomized per session.
- Constraint-Aware Reasoning: Handles hard-coded unavailability and temporal deadlines (e.g., "before 3 PM").
- Multi-Step Interaction: Detailed observation -> reasoning -> action feedback loop (see the sketch after this list).
- Partial Reward Scoring: Non-binary grading rewards partial correctness while penalizing violations and inefficiency.
- Deterministic Evaluation: A pure observation-based baseline ensures reproducibility.
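As a rough sketch of how the feedback loop and the deterministic baseline fit together (the method names, observation keys, and step signature are assumptions, not the project's actual interface):

```python
# Minimal observation -> reasoning -> action loop with a deterministic,
# purely observation-based policy: clarify while information is missing,
# then schedule. Interface and key names are illustrative assumptions.
def run_episode(env):
    obs = env.reset()
    total_reward, done = 0.0, False
    while not done:
        if obs.get("missing"):                       # reasoning: an information gap remains
            action = {"type": "clarify", "field": obs["missing"][0]}
        else:                                        # all parameters known: act
            action = {"type": "schedule", "params": obs["known"]}
        obs, reward, done, info = env.step(action)   # partial rewards accumulate per step
        total_reward += reward
    return total_reward
```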
The evaluation suite contains tasks of increasing cognitive load:
| Difficulty | Description |
|---|---|
| Easy | No ambiguity. All parameters are explicit. Tests basic execution logic. |
| Medium | One missing field (Time or Participants). Requires a single clarification step. |
| Hard | Multiple missing fields + constraints. Requires context retention and logical satisfaction. |
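To make the tiers concrete, the snippet below sketches one hypothetical prompt per difficulty level; the wording and field names are illustrative, not taken from the benchmark tasks.

```python
# Illustrative (not actual) task instances for each difficulty tier.
EXAMPLE_TASKS = {
    "easy": {
        "prompt": "Schedule a meeting with Alice at 10:00 tomorrow.",
        "missing_fields": [],                  # everything is explicit
    },
    "medium": {
        "prompt": "Schedule a meeting with Alice tomorrow.",
        "missing_fields": ["time"],            # one clarification step needed
    },
    "hard": {
        "prompt": "Set up a review before 3 PM, avoiding the team's blocked slots.",
        "missing_fields": ["time", "participants"],
        "constraints": ["before 15:00", "respect unavailability windows"],
    },
}
```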
Across the benchmarked tasks, the current baseline achieves an Average Score of ~0.70.
- Easy: 0.99 (near-perfect execution).
- Medium: ~0.65 (successful targeted clarification).
- Hard: ~0.53 (reflects the cost of multi-step retrieval plus constraint satisfaction).
This downward trend confirms that the environment captures increasing difficulty as ambiguity and constraints rise.
```bash
git clone <repo_url>
pip install -r requirements.txt
```

```bash
docker build -t ambiguity-env .
```
```bash
docker run -p 7860:7860 ambiguity-env
```

```bash
python inference.py
```
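With the container running, the app is served on http://localhost:7860. One possible way to query it programmatically is via gradio_client; the endpoint name and input payload below are assumptions and should be checked against the demo's API docs.

```python
# Hypothetical client call against the locally running demo.
# The api_name ("/predict") and the input format are assumptions.
from gradio_client import Client

client = Client("http://localhost:7860")
result = client.predict(
    "Schedule a meeting with the design team tomorrow.",  # example instruction
    api_name="/predict",
)
print(result)
```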
Mohamed Yaser
- Solo Participant
- LinkedIn: mohamedyaser08
- Email: 1ammar.yaser@gmail.com
Built for OpenEnv Hackathon 🚀