| title | ShopOps Environment Server | |
|---|---|---|
| emoji | 🛒 | |
| colorFrom | indigo | |
| colorTo | blue | |
| sdk | docker | |
| pinned | false | |
| app_port | 8000 | |
| tags | | |

ShopOps is an OpenEnv environment for real customer-operations work. Rather than picking a final label from a tiny action set, the agent has to operate a support queue, inspect policies and customer history, manage scarce replacement inventory, wait for delayed carrier or evidence responses, and close cases without creating downstream damage.
This is designed to evaluate long-horizon business operations behavior:
- tool use instead of single-shot classification
- coupled state across multiple cases
- delayed consequences from premature closure
- real tradeoffs between SLA, budget, fraud loss, and stock availability
Each episode is a deterministic task scenario exposed through the standard OpenEnv API:
- `reset()` returns the first observation for the selected task
- `step(action)` applies one tool/action and returns observation, reward, done, and info
- `state` returns the current `episode_id` and `step_count`
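
The loop below is a minimal sketch of how an agent might drive one episode through this API. `StubEnv` and `StepResult` are hypothetical in-memory stand-ins so that the example runs without a server; they are not the ShopOps client, and the observation contents and action payload are placeholders.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class StepResult:
    """Hypothetical container for what step() returns."""
    observation: dict[str, Any]
    reward: float
    done: bool
    info: dict[str, Any]


class StubEnv:
    """Hypothetical stand-in that mimics reset()/step()/state, not the real client."""

    def __init__(self, max_steps: int = 3) -> None:
        self.max_steps = max_steps
        self.step_count = 0
        self.episode_id = "stub-episode"

    def reset(self) -> dict[str, Any]:
        self.step_count = 0
        return {"step_index": 0}

    def step(self, action: dict[str, Any]) -> StepResult:
        self.step_count += 1
        done = self.step_count >= self.max_steps
        return StepResult({"step_index": self.step_count}, 0.0, done, {})

    @property
    def state(self) -> dict[str, Any]:
        return {"episode_id": self.episode_id, "step_count": self.step_count}


def run_episode(env: StubEnv) -> float:
    """Drive one episode to completion and return the summed reward."""
    observation = env.reset()
    total_reward = 0.0
    done = False
    while not done:
        # A real agent would choose the action from the observation.
        action = {"action_type": "inspect_order", "case_id": "case-1"}
        result = env.step(action)
        observation = result.observation
        total_reward += result.reward
        done = result.done
    return total_reward
```

Calling `run_episode(StubEnv(max_steps=3))` steps the stub three times; with the real environment, the same loop shape would apply, with the action chosen from each observation.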
The environment is implemented with typed Pydantic models and ships with:
- `openenv.yaml`
- deterministic graders for all tasks
- `inference.py` in the repo root
- a Dockerfile for local and Hugging Face Space deployment
`ShopopsAction` is a typed tool invocation with these fields:
- `action_type`
- `case_id`
- `refund_amount_usd`
- `expedite`
- `escalation_reason`
- `note_code`
Supported `action_type` values:
- `inspect_order`
- `inspect_policy`
- `inspect_inventory`
- `inspect_customer_history`
- `request_evidence`
- `contact_carrier`
- `issue_refund`
- `ship_replacement`
- `escalate_risk`
- `add_internal_note`
- `close_case`
- `switch_case`
Key constraints:
- `issue_refund` requires `refund_amount_usd`
- `ship_replacement` may set `expedite`
- `escalate_risk` requires `escalation_reason`
- `add_internal_note` requires `note_code`
- non-switch actions must target the active case
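
The constraints above can be sketched as a small validator. This is an illustration only: the real `ShopopsAction` is a Pydantic model, whereas `ActionSketch` here is a hypothetical standard-library dataclass, and the field types and the `validation_errors` helper are assumptions.

```python
from dataclasses import dataclass
from typing import Optional

ACTION_TYPES = {
    "inspect_order", "inspect_policy", "inspect_inventory",
    "inspect_customer_history", "request_evidence", "contact_carrier",
    "issue_refund", "ship_replacement", "escalate_risk",
    "add_internal_note", "close_case", "switch_case",
}

# Field that must be set for a given action_type, per the listed constraints.
REQUIRED_FIELDS = {
    "issue_refund": "refund_amount_usd",
    "escalate_risk": "escalation_reason",
    "add_internal_note": "note_code",
}


@dataclass
class ActionSketch:
    """Hypothetical stand-in for ShopopsAction; field types are assumptions."""
    action_type: str
    case_id: str
    refund_amount_usd: Optional[float] = None
    expedite: Optional[bool] = None
    escalation_reason: Optional[str] = None
    note_code: Optional[str] = None


def validation_errors(action: ActionSketch, active_case_id: str) -> list[str]:
    """Return the constraint violations for an action (empty if valid)."""
    errors = []
    if action.action_type not in ACTION_TYPES:
        errors.append(f"unknown action_type: {action.action_type}")
    required = REQUIRED_FIELDS.get(action.action_type)
    if required is not None and getattr(action, required) is None:
        errors.append(f"{action.action_type} requires {required}")
    if action.action_type != "switch_case" and action.case_id != active_case_id:
        errors.append("non-switch actions must target the active case")
    return errors
```

Checking an action locally before sending it is one way to avoid the penalty for invalid tool calls.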
Each `ShopopsObservation` contains:
- `active_case`: full working view of the active case
- `queue`: visible queue summary for all cases
- `latest_tool_result`: persistent result from the last action
- `resources`: time, budget, and inventory state
- `metrics`: resolved cases, reopened cases, SLA breaches, fraud loss, satisfaction, stockouts
- `unresolved_blockers`: blockers still preventing safe closure
- `current_task`, `difficulty`, `step_index`, `episode_id`, `env_schema_version`
The active case includes persistent tool-discovered summaries:
- `order_summary`
- `policy_summary`
- `history_summary`
- `inventory_summary`
This lets an agent build working memory from prior inspection actions instead of re-querying everything every step.
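
As a rough sketch of that working-memory idea, an agent could pick its next inspection from whichever summary is still empty. The pairing of each summary with an inspection action is an assumption inferred from the names, `next_inspection` is a hypothetical helper, and the active case is treated as a plain dict for illustration.

```python
from typing import Any, Optional

# Assumed pairing of summaries with the inspection actions that populate them.
INSPECTION_FOR_SUMMARY = {
    "order_summary": "inspect_order",
    "policy_summary": "inspect_policy",
    "history_summary": "inspect_customer_history",
    "inventory_summary": "inspect_inventory",
}


def next_inspection(active_case: dict[str, Any]) -> Optional[str]:
    """Return the first inspection whose summary is still empty, else None."""
    for summary_field, action_type in INSPECTION_FOR_SUMMARY.items():
        if not active_case.get(summary_field):
            return action_type
    return None
```

Because a populated summary is never re-requested, this also sidesteps duplicate inspections. Not every case necessarily needs all four inspections, so a real agent would likely filter this list further.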
`refund_policy_recovery`: single-case recovery task.
The agent must:
- inspect order facts
- inspect policy
- choose a compliant partial refund
- add the required internal note
- close the case cleanly
`sla_queue_juggle`: five-case queue with mixed urgency.
The agent must:
- switch cases intentionally
- prioritize SLA-critical work
- inspect inventory and history where needed
- avoid wasting budget
- close all five cases
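
One simple way to prioritize SLA-critical work is to switch to the open case with the least SLA slack. The sketch below assumes each queue entry is a dict with `case_id`, `status`, and `sla_steps_remaining` keys; those key names and the `pick_next_case` helper are hypothetical, since the queue schema is not spelled out here.

```python
from typing import Any, Optional


def pick_next_case(queue: list[dict[str, Any]]) -> Optional[str]:
    """Return the open case with the least SLA slack, or None if none are open.

    The keys "case_id", "status", and "sla_steps_remaining" are hypothetical.
    """
    open_cases = [case for case in queue if case["status"] != "closed"]
    if not open_cases:
        return None
    most_urgent = min(open_cases, key=lambda case: case["sla_steps_remaining"])
    return most_urgent["case_id"]
```

The returned id would then be the `case_id` of a `switch_case` action.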
`fraud_stockout_cascade`: seven-case scenario with coupled consequences.
The agent must:
- preserve scarce inventory for the right case
- avoid refunding suspicious orders before evidence arrives
- handle fraud escalation correctly
- juggle delayed carrier/evidence events
- prevent reopen cascades and fraud loss
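
A conservative agent might hold back refunds and closures while blockers remain. The guard below is a sketch under two assumptions: that `unresolved_blockers` is a list which is empty when closure is safe, and that refunds should be treated as risky alongside closure. `guard_action` and `RISKY_WHILE_BLOCKED` are hypothetical names.

```python
from typing import Any

# Assumption: these actions are unsafe while any blocker is unresolved.
RISKY_WHILE_BLOCKED = {"issue_refund", "close_case"}


def guard_action(proposed_action_type: str, observation: dict[str, Any]) -> bool:
    """Return True if the proposed action may go ahead."""
    blockers = observation.get("unresolved_blockers") or []
    return not (blockers and proposed_action_type in RISKY_WHILE_BLOCKED)
```

When the guard returns `False`, the agent could switch to another case or wait for the pending carrier or evidence response instead.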
Reward is dense over the trajectory and combines:
- information gain from useful inspections
- workflow progress from moving a case forward correctly
- business outcome from the quality of the chosen resolution
Undesirable behavior is penalized:
- invalid tool calls
- duplicate inspections
- unnecessary external requests
- refunds without required review
- premature closure that causes reopen or fraud loss
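
To make the shape of the reward concrete, the sketch below adds the three positive components and subtracts a penalty term. The unweighted sum, the `StepSignals` field names, and the scales are all illustrative assumptions; the environment's actual weighting is not documented in this section.

```python
from dataclasses import dataclass


@dataclass
class StepSignals:
    """Per-step signals; names and scales are illustrative assumptions."""
    information_gain: float = 0.0
    workflow_progress: float = 0.0
    business_outcome: float = 0.0
    penalty: float = 0.0  # e.g. an invalid call or a duplicate inspection


def step_reward(signals: StepSignals) -> float:
    """Sum the positive components and subtract the penalty."""
    return (
        signals.information_gain
        + signals.workflow_progress
        + signals.business_outcome
        - signals.penalty
    )
```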
The terminal episode summary tracks:
- `final_score`
- `closed_cases`
- `reopened_cases`
- `sla_breaches`
- `fraud_loss_usd`
- `stockouts`
- `customer_satisfaction`
`shopOps.eval` contains a deterministic rule-based baseline that uses the same public observation space as an agent.
10-seed baseline results:
| Task | Avg final score | Avg total reward |
|---|---|---|
| `refund_policy_recovery` | 0.9840 | 1.5920 |
| `sla_queue_juggle` | 0.9360 | 5.0384 |
| `fraud_stockout_cascade` | 0.9246 | 7.1421 |

Reproduce:

    ./venv/bin/python -m shopOps.eval --task all --total-seeds 10

The required root-level `inference.py` uses the OpenAI client and emits strict:
- `[START]`
- `[STEP]`
- `[END]`
Required environment variables:
- `API_BASE_URL`
- `MODEL_NAME`
- `HF_TOKEN`

Optional:
- `ENV_URL` (default `http://localhost:8000`)
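
A script could read these variables along the following lines. This is a sketch rather than the code in `inference.py`: `load_settings`, `REQUIRED_VARS`, and `DEFAULT_ENV_URL` are hypothetical names, and raising `RuntimeError` on a missing variable is an assumed design choice.

```python
import os
from typing import Mapping

REQUIRED_VARS = ("API_BASE_URL", "MODEL_NAME", "HF_TOKEN")
DEFAULT_ENV_URL = "http://localhost:8000"


def load_settings(environ: Mapping[str, str] = os.environ) -> dict[str, str]:
    """Read the required variables and apply the ENV_URL default."""
    missing = [name for name in REQUIRED_VARS if not environ.get(name)]
    if missing:
        raise RuntimeError(f"missing environment variables: {', '.join(missing)}")
    settings = {name: environ[name] for name in REQUIRED_VARS}
    settings["ENV_URL"] = environ.get("ENV_URL", DEFAULT_ENV_URL)
    return settings
```

Failing fast on a missing variable keeps configuration errors from surfacing later as opaque request failures.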
Example:

    export API_BASE_URL="https://router.huggingface.co/v1"
    export MODEL_NAME="Qwen/Qwen2.5-72B-Instruct"
    export HF_TOKEN="<your_token>"
    export ENV_URL="http://localhost:8000"
    python inference.py

Set up a local environment:

    python3 -m venv venv
    source venv/bin/activate
    pip install -r server/requirements.txt
    pip install -e .

Run tests:

    ../venv/bin/python -m pytest -q

Run local server:

    uvicorn server.app:app --host 0.0.0.0 --port 8000

Validate OpenEnv packaging:

    ../venv/bin/openenv validate

Build:

    docker build -t shopops-env:latest .

Run:

    docker run -p 8000:8000 shopops-env:latest

Important entrypoints:
- `server/shopOps_environment.py`
- `models.py`
- `graders.py`
- `eval.py`
- `inference.py`
- `openenv.yaml`