Add sandbox agent example with Docker and Daytona support by Jerryguan777 · Pull Request #14 · NVIDIA/NeMo-Agent-Toolkit-Examples

Jerryguan777 · 2026-01-29T19:05:58Z

Summary

This PR adds Sandbox Agent to examples/. It is a general-purpose AI agent that executes tasks within secure, isolated Docker containers or Daytona cloud sandboxes.

Motivation

Why Sandboxing Matters

Sandbox isolation provides security and consistency for agents that need to execute shell commands, Python code, or file operations, which are common requirements for agent types such as coding/testing agents, data analysis agents, and security-oriented agents.

Task Examples

Example 1: Data Analysis Workflow

User: "Analyze sales_data.csv and create visualizations"

Agent Actions:
  ✓ Read file using file_read tool
  ✓ Execute Python analysis using python tool
  ✓ Generate matplotlib chart
  ✓ Save output using file_write tool
  
Result: Interactive analysis with saved visualizations

Example 2: Multi-Step Research

User: "Find information about NVIDIA NIMs and create a comparison document"

Agent Actions:
  ✓ Search web using web_search tool
  ✓ Browse multiple pages using web_browse tool
  ✓ Analyze and synthesize using python tool
  ✓ Generate formatted document
  
Result: Comprehensive research document with sources

Features

1. Dual Sandbox Backend Support

| Backend | Use Case |
| --- | --- |
| Docker | Local development |
| Daytona | Cloud-scale deployments, on-demand resources |
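
One way to picture the dual-backend design is a shared interface that each backend implements. The sketch below is illustrative only: `Sandbox`, `CommandResult`, `LocalProcessSandbox`, and `make_sandbox` are hypothetical names that do not come from this PR, and a plain local subprocess (with no isolation) stands in for a real Docker or Daytona backend.

```python
import asyncio
import sys
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class CommandResult:
    exit_code: int
    stdout: str
    stderr: str


class Sandbox(ABC):
    """Hypothetical backend-agnostic interface (not the PR's actual API)."""

    @abstractmethod
    async def run(self, argv: list[str], timeout: float = 30.0) -> CommandResult:
        """Run a command and return its exit code and captured output."""


class LocalProcessSandbox(Sandbox):
    """Stand-in backend: a local subprocess, which provides NO isolation."""

    async def run(self, argv: list[str], timeout: float = 30.0) -> CommandResult:
        proc = await asyncio.create_subprocess_exec(
            *argv,
            stdout=asyncio.subprocess.PIPE,
            stderr=asyncio.subprocess.PIPE,
        )
        try:
            out, err = await asyncio.wait_for(proc.communicate(), timeout)
        except asyncio.TimeoutError:
            # Do not leave a runaway process behind when the deadline passes.
            proc.kill()
            await proc.wait()
            raise
        return CommandResult(proc.returncode, out.decode(), err.decode())


# A real registry would map "docker" and "daytona" to their own classes.
BACKENDS: dict[str, type[Sandbox]] = {"local": LocalProcessSandbox}


def make_sandbox(backend: str) -> Sandbox:
    """Pick a backend by name, as a config file might."""
    try:
        return BACKENDS[backend]()
    except KeyError:
        raise ValueError(f"unknown sandbox backend: {backend!r}") from None


if __name__ == "__main__":
    sandbox = make_sandbox("local")
    result = asyncio.run(sandbox.run([sys.executable, "-c", "print(2 + 2)"]))
    print(result.exit_code, result.stdout.strip())
```

With this shape, tool code depends only on the interface, so switching backends would be a configuration change rather than a code change.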

2. Comprehensive Tool Suite (8 tools)

| Tool | Location | Description |
| --- | --- | --- |
| `shell` | Sandbox | Execute bash commands |
| `python` | Sandbox | Execute Python code with data science libraries |
| `file_read` | Sandbox | Read file contents |
| `file_write` | Sandbox | Write files |
| `web_browse` | Sandbox | Browse URLs with Playwright (headless Chromium) |
| `web_search` | Host | Tavily AI search API |
| `web_fetch` | Host | Fetch URLs and convert HTML to clean Markdown |
| `image_describe` | Host | Analyze images using a vision LLM (optional) |

Architecture insight: Host-side tools keep API keys secure and reduce latency. Sandbox-side tools provide isolation for untrusted operations.
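
The host/sandbox split in the table can be expressed as a small routing table. This is a minimal sketch, assuming a dictionary-based lookup; `Location`, `TOOL_LOCATIONS`, and `tools_at` are hypothetical names, and only the tool names and their locations come from the table above.

```python
from enum import Enum


class Location(Enum):
    SANDBOX = "sandbox"
    HOST = "host"


# Tool names and locations copied from the tool table; the structure is illustrative.
TOOL_LOCATIONS: dict[str, Location] = {
    "shell": Location.SANDBOX,
    "python": Location.SANDBOX,
    "file_read": Location.SANDBOX,
    "file_write": Location.SANDBOX,
    "web_browse": Location.SANDBOX,
    "web_search": Location.HOST,
    "web_fetch": Location.HOST,
    "image_describe": Location.HOST,
}


def tools_at(location: Location) -> list[str]:
    """Return the tools that run at the given location, in table order."""
    return [name for name, loc in TOOL_LOCATIONS.items() if loc is location]


if __name__ == "__main__":
    print("host:", tools_at(Location.HOST))
    print("sandbox:", tools_at(Location.SANDBOX))
```

Under this split, only the host-side entries would ever need access to API keys, which matches the stated goal of keeping credentials out of the sandbox.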

3. GAIA Benchmark Results (February 2026, GPT-5.2 + GPT-5.2 Vision)

| Level | Tasks | Accuracy |
| --- | --- | --- |
| Level 1 | 53 | 69.8% |
| Level 2 | 86 | 60.5% |
| Level 3 | 26 | 38.5% |

These results demonstrate the agent's capability on real-world tasks from the GAIA benchmark, including web research, file analysis, mathematical reasoning, and image understanding.
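
The PR reports per-level accuracy only. As a rough cross-check, an overall figure can be derived from the table by recovering the number of correct answers per level. The sketch below assumes each reported percentage is the correct count divided by the task count, rounded to one decimal place; the overall number it prints is derived arithmetic, not a result stated in this PR.

```python
# (tasks, accuracy in percent) per level, copied from the table above.
results = {
    "Level 1": (53, 69.8),
    "Level 2": (86, 60.5),
    "Level 3": (26, 38.5),
}


def implied_correct(tasks: int, accuracy_pct: float) -> int:
    """Recover the whole number of correct answers behind a rounded percentage."""
    return round(tasks * accuracy_pct / 100)


total_tasks = sum(tasks for tasks, _ in results.values())
total_correct = sum(implied_correct(tasks, acc) for tasks, acc in results.values())
overall_pct = 100 * total_correct / total_tasks

for level, (tasks, acc) in results.items():
    print(f"{level}: {implied_correct(tasks, acc)}/{tasks} correct")
print(f"Overall (derived): {total_correct}/{total_tasks} = {overall_pct:.1f}%")
```

Because Level 2 has the most tasks, it dominates the weighted total, so a simple average of the three percentages would understate the overall figure.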

4. Integration with NAT

- Uses `@register_function` for workflow registration
- Compatible with `nat run` and `nat serve`
- Works with all NAT LLMs (NIM, OpenAI, Anthropic, etc.)
- Integrates with NAT evaluation framework
- Supports Phoenix observability
- All new code in isolated example directory

Code Quality

- Test Coverage: ~90% (141 unit tests, 14 integration tests, 155 total, all passing)
- Follows NAT code style (ruff-formatted)
- Complete type annotations (Python 3.10+)
- Comprehensive docstrings (Google style)
- Proper error handling and logging
- Async/await throughout

Project Structure

examples/sandbox_agent/
├── README.md                     # Comprehensive documentation
├── Dockerfile                    # Sandbox image (pandas, playwright, etc.)
├── pyproject.toml               # Dependencies with NAT entry point
├── configs/
│   ├── config.yaml              # Basic configuration
│   ├── config_daytona.yaml      # Daytona cloud configuration
│   ├── config_gaia.yaml         # GAIA evaluation configuration
│   └── config_daytona_gaia.yaml # Daytona + GAIA configuration
├── src/nat_sandbox_agent/
│   ├── register.py              # NAT workflow registration
│   ├── sandbox/                 # Docker & Daytona implementations
│   ├── tools/                   # Modular tool system
│   │   ├── host/                # web_search, web_fetch, image_describe
│   │   └── sandbox/             # shell, python, file_read, file_write, web_browse
│   ├── prompts/                 # System prompts
│   └── utils/                   # LLM-based answer cleaning
├── scripts/
│   └── enrich_gaia_dataset.py   # GAIA dataset preprocessing
├── data/
│   ├── gaia_validation_enriched.parquet
│   └── attachments/             # GAIA task attachments
└── tests/                       # 155 unit & integration tests

Description

Closes

By Submitting this PR I confirm:

I am familiar with the Contributing Guidelines.
We require that all contributors "sign-off" on their commits. This certifies that the contribution is your original work, or you have rights to submit it under the same license, or a compatible license.
- Any contribution which contains commits that are not Signed-Off will not be accepted.
When the PR is ready for review, new or existing tests cover these changes.
When the PR is ready for review, the documentation is up to date with these changes.

- General-purpose AI agent with sandboxed code execution
- Supports Docker containers and Daytona cloud sandboxes
- Tools: shell, python, file_read, file_write, web_browse, web_search, youtube_transcript
- GAIA benchmark evaluation support
- Comprehensive test suite (159 tests)

Signed-off-by: Jerry Guan <jerryguan777@gmail.com>

copy-pr-bot · 2026-01-29T19:06:02Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

willkill07 · 2026-01-29T19:28:44Z

/ok to test 56b43de

- Add technical terms to Vale accept.txt: Daytona, httpx, matplotlib, openpyxl, pyyaml, reportlab, sandbox variants, seaborn
- Fix NumPy capitalization in README.md

Signed-off-by: Jerry Guan <jerryguan777@gmail.com>

willkill07 · 2026-02-02T14:04:50Z

/ok to test 4e2bd17

willkill07 · 2026-02-05T20:39:55Z

@Jerryguan777 it looks like not all of the code is formatted according to the CI Pipeline Check

Signed-off-by: Jerry Guan <jerryguan777@gmail.com>

Jerryguan777 · 2026-02-06T01:06:09Z

@willkill07 Thanks for catching this. I've fixed the formatting.

willkill07 · 2026-02-17T19:32:26Z

/ok to test 3ad8836

Tool changes:
- Remove youtube_transcript, add web_fetch (lightweight HTTP GET)
- Add web_fetch vs web_browse decision rule with 403 fallback

New guidance (not in previous version):
- Python execution rules: print() requirement, no variable persistence, empty stdout handling, with code examples
- Input file handling: attached file path detection and file extension-to-tool mapping (.xlsx, .pdf, .mp3, .pptx, etc.)
- Format Verification checklist (number, unit, delimiter, case, date)
- Calculation and Reasoning Verification strategies
- Multi-Step Web Research strategies
- Data Extraction Best Practices
- Environment: root privileges, pip/apt-get install, error handling

Removed redundant sections:
- Guidelines (covered by Rules), Response Format (covered by §4), Problem-Solving Strategy (split into Rules and Environment), trailing "ALWAYS use tools" reminder

Structure: 10+ nested sections → 5 flat (Tools, Environment, Rules, Strategies). Generalized GAIA-specific language for broader use.

Signed-off-by: Jerry Guan <jerryguan777@gmail.com>
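
The web_fetch vs web_browse decision rule in this commit is guidance written into the system prompt for the model to follow, not code. Expressed as code, the same "try the lightweight fetch first, fall back to the browser on 403" idea might look like the sketch below. `HttpStatusError` and `fetch_with_fallback` are hypothetical names, the two tools are passed in as plain callables, and the sketch is synchronous for brevity even though the example itself is described as async throughout.

```python
from typing import Callable


class HttpStatusError(Exception):
    """Hypothetical error carrying the HTTP status of a failed fetch."""

    def __init__(self, status: int) -> None:
        super().__init__(f"HTTP {status}")
        self.status = status


def fetch_with_fallback(
    url: str,
    web_fetch: Callable[[str], str],
    web_browse: Callable[[str], str],
    fallback_statuses: frozenset[int] = frozenset({403}),
) -> str:
    """Try the cheap HTTP GET first; use the headless browser only when blocked."""
    try:
        return web_fetch(url)
    except HttpStatusError as exc:
        if exc.status not in fallback_statuses:
            raise  # other failures (404, 500, ...) are not retried in the browser
        return web_browse(url)


if __name__ == "__main__":
    def blocked_fetch(url: str) -> str:
        raise HttpStatusError(403)

    page = fetch_with_fallback(
        "https://example.com", blocked_fetch, lambda url: "rendered page"
    )
    print(page)
```

Keeping the fallback set explicit makes the trade-off visible: the browser is slower and heavier, so it is reserved for responses that suggest the plain request was rejected.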

willkill07 · 2026-02-18T16:10:11Z

/ok to test c25015d

…em prompt

- Add image_describe host tool: reads images from sandbox via read_file_bytes, sends base64 to a configurable vision LLM, returns text description. Supports png/jpg/jpeg/gif/webp/bmp/tiff.
- Add read_file_bytes abstract method to BaseSandbox with Docker and Daytona implementations.
- Add vision_llm_name config option to SandboxAgentWorkflowConfig.
- Enhance answer cleaning prompt: add extraction/formatting/output grouping, embedded scale handling, case sensitivity rules.
- Remove rule-based clean_answer function (only LLM-based remains).
- Streamline system prompt: merge Rules and Strategies sections, fold web tool decision rule into tool descriptions, condense Python examples, reduce prompt size by 35% with no accuracy loss.
- Update configs to support vision_llm configuration.
- Add tests for image_describe tool and update existing tests.

Signed-off-by: Jerry Guan <jerryguan777@gmail.com>
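
The image_describe flow described in this commit (read bytes from the sandbox, base64-encode them, send them to a vision LLM) can be sketched as follows. Only the list of supported extensions comes from the commit message. The `to_data_url` helper, the `MIME_BY_EXT` table, and the choice of a `data:` URL as the payload format are assumptions made for illustration, and the raw bytes stand in for whatever `read_file_bytes` would return.

```python
import base64
from pathlib import PurePosixPath

# Extensions listed in the commit message, mapped to standard MIME types.
MIME_BY_EXT = {
    ".png": "image/png",
    ".jpg": "image/jpeg",
    ".jpeg": "image/jpeg",
    ".gif": "image/gif",
    ".webp": "image/webp",
    ".bmp": "image/bmp",
    ".tiff": "image/tiff",
}


def to_data_url(path: str, data: bytes) -> str:
    """Encode image bytes as a base64 data URL (one possible payload format)."""
    ext = PurePosixPath(path).suffix.lower()
    if ext not in MIME_BY_EXT:
        raise ValueError(f"unsupported image type: {ext or '(none)'}")
    encoded = base64.b64encode(data).decode("ascii")
    return f"data:{MIME_BY_EXT[ext]};base64,{encoded}"


if __name__ == "__main__":
    # Placeholder bytes; a real call would pass the file contents read from the sandbox.
    url = to_data_url("/workspace/chart.png", b"\x89PNG\r\n\x1a\n")
    print(url[:40])
```

Validating the extension before encoding avoids sending a large, unsupported file to the model only to have the request rejected.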

willkill07 · 2026-02-19T20:13:03Z

/ok to test 18d850d

Signed-off-by: Jerry Guan <jerryguan777@gmail.com>

willkill07 · 2026-03-16T23:41:02Z

/ok to test eafbf33

willkill07 · 2026-03-18T13:35:25Z

@Jerryguan777 CI is still failing. Can you please try to ensure CI passes locally before your next update? I do want to get this merged in :)

Formatting
[Formatting](https://github.com/NVIDIA/NeMo-Agent-Toolkit-Examples/actions/runs/23171189318/job/67323197199?pr=14#step:5:209)
https://github.com/NVIDIA/NeMo-Agent-Toolkit-Examples/actions/runs/23171189318/job/67323197199?pr=14#step:5:538
Vocabulary -- if you are listing Python packages, escape them with backticks. For FFmpeg, you can add it to accept.txt

I will work on a PR that handles the copyright errors.

willkill07 added the external, feature request, and non-breaking labels Jan 29, 2026

willkill07 approved these changes Jan 29, 2026

Fix Vale spelling check errors for sandbox_agent

4e2bd17

fixing formatting

3ad8836

Signed-off-by: Jerry Guan <jerryguan777@gmail.com>

Jerryguan777 added 2 commits February 17, 2026 21:55

fix vale check error, add 6 words in vale vocab

eafbf33

Signed-off-by: Jerry Guan <jerryguan777@gmail.com>

willkill07 self-assigned this Mar 18, 2026

Jerryguan777 wants to merge 7 commits into NVIDIA:main from Jerryguan777:feat/sandbox_agent
