This folder is a starter for a planner + executor Amazon flow, modeled after
amazon_shopping_with_assertions (snapshots + AgentRuntime assertions per step).
- Planner: Qwen 3.5 9B (MLX 4-bit quantized) — produces a step plan
- Executor: Qwen 3.5 4B (MLX 4-bit quantized) — executes each step deterministically
Upgrade note: We now use Qwen 3.5 models (9B planner, 4B executor) which provide better JSON output and reasoning compared to the older Qwen 2.5 models (7B/3B). The MLX 4-bit quantization runs efficiently on Apple Silicon.
Default: MLX with 4-bit quantized Qwen 3.5 models.
- Apple Silicon (recommended): Uses MLX backend with
mlx-community/Qwen3.5-9B-MLX-4bit(planner) andmlx-community/Qwen3.5-4B-MLX-4bit(executor). Fast and memory-efficient. - CUDA GPU available: You can use HuggingFace Transformers with 4-bit quantization (
load_in_4bit=True). - CPU-only: Not recommended for 9B models; expect very slow runtimes.
Install the MLX language model library:
pip install mlx-lm>=0.31.1Important: Version 0.31.1+ is required for Qwen 3.5 model support.
MLX is Apple Silicon only. For Linux or Windows, use one of the following approaches:
Use HuggingFace Transformers with 4-bit quantization via bitsandbytes:
Prerequisites:
# Install PyTorch with CUDA support (adjust cuda version as needed)
pip install torch --index-url https://download.pytorch.org/whl/cu121
# Install transformers and quantization dependencies
pip install transformers>=4.40.0 accelerate bitsandbytes>=0.43.0Run with HuggingFace provider:
export PLANNER_PROVIDER=hf \
PLANNER_MODEL=Qwen/Qwen2.5-7B-Instruct \
EXECUTOR_PROVIDER=hf \
EXECUTOR_MODEL=Qwen/Qwen2.5-3B-Instruct
python main.pyNote: The HuggingFace provider automatically enables 4-bit quantization (
load_in_4bit=True) whenbitsandbytesis installed. For Qwen 3.5 models on HuggingFace, check model availability as MLX-specific quantizations won't work.
VRAM requirements (approximate):
- Qwen 2.5 7B (4-bit): ~6GB VRAM
- Qwen 2.5 3B (4-bit): ~3GB VRAM
- Combined: ~10GB VRAM minimum
If running on Windows, we recommend using WSL2 (Windows Subsystem for Linux) with Ubuntu:
- Install WSL2:
wsl --install - Install CUDA drivers for WSL2 from NVIDIA
- Follow the Linux CUDA instructions above inside WSL2
Running 7B+ models on CPU is extremely slow (expect 10-30+ seconds per inference).
export PLANNER_PROVIDER=hf \
PLANNER_MODEL=Qwen/Qwen2.5-3B-Instruct \
EXECUTOR_PROVIDER=hf \
EXECUTOR_MODEL=Qwen/Qwen2.5-1.5B-Instruct
python main.pyWarning: CPU inference for the planner will be very slow. Consider using smaller models (1.5B-3B) or a cloud API for the planner while running the executor locally.
For best results on limited hardware, use a cloud API for the planner (reasoning-heavy) and run only the executor locally:
export PLANNER_PROVIDER=openai \
PLANNER_MODEL=gpt-4o-mini \
EXECUTOR_PROVIDER=hf \
EXECUTOR_MODEL=Qwen/Qwen2.5-3B-Instruct
python main.pyThis hybrid approach gives you cloud-quality planning with local execution (no screenshots sent to cloud).
Run the scaffold:
python main.pyThis scaffold includes:
- Planner feedback loop: executor failures are summarized back to the planner for a revised plan.
- JSON schema validation: plan output is validated against the advanced plan format.
- Planner feedback JSONL: per-run file in
planner_feedback/<run_id>.jsonl. - Summary JSON: compact summary at
planner_feedback/<run_id>.summary.json. - Vision fallback (optional): set
ENABLE_VISION_FALLBACK=1.
The script defaults to Qwen 3.5 MLX models on Apple Silicon:
# Default (no env vars needed on Apple Silicon)
python main.pyThis uses:
- Planner:
mlx-community/Qwen3.5-9B-MLX-4bit - Executor:
mlx-community/Qwen3.5-4B-MLX-4bit
To override models or use different providers:
export PLANNER_PROVIDER=mlx \
PLANNER_MODEL=mlx-community/Qwen3.5-9B-MLX-4bit \
EXECUTOR_PROVIDER=mlx \
EXECUTOR_MODEL=mlx-community/Qwen3.5-4B-MLX-4bit
python main.pyFor HuggingFace Transformers (CUDA):
export PLANNER_PROVIDER=hf \
PLANNER_MODEL=Qwen/Qwen2.5-7B-Instruct \
EXECUTOR_PROVIDER=hf \
EXECUTOR_MODEL=Qwen/Qwen2.5-3B-Instruct
python main.pyBy default, the executor is text-only. To enable vision fallback:
ENABLE_VISION_FALLBACK=1 \
VISION_PROVIDER=local \
VISION_MODEL=Qwen/Qwen3-VL-8B-Instruct \
python main.pyOn Apple Silicon, you can use MLX-VLM:
ENABLE_VISION_FALLBACK=1 \
VISION_PROVIDER=mlx \
VISION_MODEL=mlx-community/Qwen3-VL-8B-Instruct-3bit \
python main.pyVision fallback behavior:
- If executor cannot produce a
CLICK(<id>), vision selects an element ID from the snapshot list. - If required verification fails after a click, vision can re-select a better element ID and retry.
- Vision responses are logged as
vision_selectevents in the JSONL feedback.
If you want, I can add:
- A planner/executor scaffold script
- JSON step schema + validator
- Executor loop that maps plan steps to
AgentRuntimeassertions - Planner feedback channel (executor writes assertion outcomes back to planner)