Add sandbox agent example with Docker and Daytona support #14

Open
Jerryguan777 wants to merge 7 commits into NVIDIA:main from Jerryguan777:feat/sandbox_agent

Jerryguan777 commented Jan 29, 2026

Summary

This PR adds Sandbox Agent, a general-purpose AI agent that executes tasks within isolated Docker containers or Daytona cloud sandboxes, to examples/.

Motivation

Why Sandboxing Matters

Sandbox isolation provides security and consistency for agents that need to execute shell commands, Python code, or file operations, which are common requirements for agent types such as coding/testing agents, data analysis agents, and security-oriented agents.

Task Examples

Example 1: Data Analysis Workflow

User: "Analyze sales_data.csv and create visualizations"

Agent Actions:
  ✓ Read file using file_read tool
  ✓ Execute Python analysis using python tool
  ✓ Generate matplotlib chart
  ✓ Save output using file_write tool
  
Result: Interactive analysis with saved visualizations

Example 2: Multi-Step Research

User: "Find information about NVIDIA NIMs and create a comparison document"

Agent Actions:
  ✓ Search web using web_search tool
  ✓ Browse multiple pages using web_browse tool
  ✓ Analyze and synthesize using python tool
  ✓ Generate formatted document
  
Result: Comprehensive research document with sources

Features

1. Dual Sandbox Backend Support
| Backend | Use Case |
|---------|----------|
| Docker | Local development |
| Daytona | Cloud-scale deployments, on-demand resources |
2. Comprehensive Tool Suite (8 tools)
| Tool | Location | Description |
|------|----------|-------------|
| shell | Sandbox | Execute bash commands |
| python | Sandbox | Execute Python code with data science libraries |
| file_read | Sandbox | Read file contents |
| file_write | Sandbox | Write files |
| web_browse | Sandbox | Browse URLs with Playwright (headless Chromium) |
| web_search | Host | Tavily AI search API |
| web_fetch | Host | Fetch URLs and convert HTML to clean Markdown |
| image_describe | Host | Analyze images using a vision LLM (optional) |

Architecture insight: Host-side tools keep API keys secure and reduce latency. Sandbox-side tools provide isolation for untrusted operations.
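The host/sandbox split can be pictured as a small lookup table. The sketch below is only an illustration: the tool names and locations come from the table above, while `Location`, `TOOL_LOCATIONS`, `tools_at`, and `runs_on_host` are hypothetical names that do not appear in this PR.

```python
from enum import Enum


class Location(Enum):
    """Where a tool executes (hypothetical enum for illustration)."""

    SANDBOX = "sandbox"
    HOST = "host"


# Tool names and locations mirror the tool table in the PR description.
# The registry structure itself is an assumption, not the PR's code.
TOOL_LOCATIONS = {
    "shell": Location.SANDBOX,
    "python": Location.SANDBOX,
    "file_read": Location.SANDBOX,
    "file_write": Location.SANDBOX,
    "web_browse": Location.SANDBOX,
    "web_search": Location.HOST,
    "web_fetch": Location.HOST,
    "image_describe": Location.HOST,
}


def tools_at(location: Location) -> list[str]:
    """Return the sorted names of all tools that run at the given location."""
    return sorted(name for name, loc in TOOL_LOCATIONS.items() if loc is location)


def runs_on_host(tool_name: str) -> bool:
    """True if the tool runs on the host, so any API key it uses never enters the sandbox."""
    return TOOL_LOCATIONS[tool_name] is Location.HOST
```

Under this reading, a dispatcher would hand sandbox tools a sandbox handle and host tools only the host-side credentials they need.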

3. GAIA Benchmark Results (February 2026, GPT-5.2 + GPT-5.2 Vision)
| Level | Tasks | Accuracy |
|-------|-------|----------|
| Level 1 | 53 | 69.8% |
| Level 2 | 86 | 60.5% |
| Level 3 | 26 | 38.5% |

These results demonstrate the agent's capability on real-world tasks from the GAIA benchmark, including web research, file analysis, mathematical reasoning, and image understanding.
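The PR reports per-level accuracy only. As a rough cross-check, the snippet below infers the number of correct tasks per level by rounding and derives a task-weighted overall figure. The inferred counts and the overall percentage are derived here, not reported by the author, so treat them as an estimate.

```python
# (level, task count, reported accuracy in percent), copied from the results table.
RESULTS = [(1, 53, 69.8), (2, 86, 60.5), (3, 26, 38.5)]


def inferred_correct(tasks: int, accuracy_pct: float) -> int:
    """Nearest whole number of correct tasks implied by a rounded percentage."""
    return round(tasks * accuracy_pct / 100)


total_tasks = sum(tasks for _, tasks, _ in RESULTS)
total_correct = sum(inferred_correct(tasks, acc) for _, tasks, acc in RESULTS)
overall_pct = 100 * total_correct / total_tasks

print(f"{total_correct}/{total_tasks} tasks, about {overall_pct:.1f}% overall")
```

Weighting by task count matters because Level 2 contributes more than half of the tasks; a plain average of the three percentages would understate its influence.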

4. Integration with NAT
  • Uses @register_function for workflow registration
  • Compatible with nat run and nat serve
  • Works with all NAT LLMs (NIM, OpenAI, Anthropic, etc.)
  • Integrates with NAT evaluation framework
  • Supports Phoenix observability
  • All new code in isolated example directory

Code Quality

  • Test Coverage: ~90% (141 unit tests, 14 integration tests, 155 total — all pass)
  • Follows NAT code style (ruff-formatted)
  • Complete type annotations (Python 3.10+)
  • Comprehensive docstrings (Google style)
  • Proper error handling and logging
  • Async/await throughout

Project Structure

examples/sandbox_agent/
├── README.md                     # Comprehensive documentation
├── Dockerfile                    # Sandbox image (pandas, playwright, etc.)
├── pyproject.toml               # Dependencies with NAT entry point
├── configs/
│   ├── config.yaml              # Basic configuration
│   ├── config_daytona.yaml      # Daytona cloud configuration
│   ├── config_gaia.yaml         # GAIA evaluation configuration
│   └── config_daytona_gaia.yaml # Daytona + GAIA configuration
├── src/nat_sandbox_agent/
│   ├── register.py              # NAT workflow registration
│   ├── sandbox/                 # Docker & Daytona implementations
│   ├── tools/                   # Modular tool system
│   │   ├── host/                # web_search, web_fetch, image_describe
│   │   └── sandbox/             # shell, python, file_read, file_write, web_browse
│   ├── prompts/                 # System prompts
│   └── utils/                   # LLM-based answer cleaning
├── scripts/
│   └── enrich_gaia_dataset.py   # GAIA dataset preprocessing
├── data/
│   ├── gaia_validation_enriched.parquet
│   └── attachments/             # GAIA task attachments
└── tests/                       # 155 unit & integration tests

Description

Closes

By Submitting this PR I confirm:

  • I am familiar with the Contributing Guidelines.
  • We require that all contributors "sign-off" on their commits. This certifies that the contribution is your original work, or you have rights to submit it under the same license, or a compatible license.
    • Any contribution which contains commits that are not Signed-Off will not be accepted.
  • When the PR is ready for review, new or existing tests cover these changes.
  • When the PR is ready for review, the documentation is up to date with these changes.

- General-purpose AI agent with sandboxed code execution
- Supports Docker containers and Daytona cloud sandboxes
- Tools: shell, python, file_read, file_write, web_browse, web_search, youtube_transcript
- GAIA benchmark evaluation support
- Comprehensive test suite (159 tests)

Signed-off-by: Jerry Guan <jerryguan777@gmail.com>

copy-pr-bot bot commented Jan 29, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

willkill07 added the external (This issue was filed by someone outside of the NeMo Agent toolkit team), feature request (New feature or request), and non-breaking (Non-breaking change) labels Jan 29, 2026
@willkill07
Member

/ok to test 56b43de

- Add technical terms to Vale accept.txt: Daytona, httpx, matplotlib,
  openpyxl, pyyaml, reportlab, sandbox variants, seaborn
- Fix NumPy capitalization in README.md

Signed-off-by: Jerry Guan <jerryguan777@gmail.com>
@willkill07
Member

/ok to test 4e2bd17

@willkill07
Member

@Jerryguan777 it looks like not all of the code is formatted according to the CI Pipeline Check

Signed-off-by: Jerry Guan <jerryguan777@gmail.com>
@Jerryguan777
Author

@willkill07 Thanks for catching this. I've fixed the formatting.

@willkill07
Member

/ok to test 3ad8836

Tool changes:
- Remove youtube_transcript, add web_fetch (lightweight HTTP GET)
- Add web_fetch vs web_browse decision rule with 403 fallback
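In the PR, the web_fetch vs web_browse rule is guidance in the system prompt for the LLM, not code. Purely as an illustration of the rule's shape, the sketch below tries the lightweight fetch first and falls back to the browser only on HTTP 403. `HttpStatusError` and `fetch_with_fallback` are hypothetical names, and the two tools are passed in as plain callables.

```python
from typing import Callable


class HttpStatusError(Exception):
    """Hypothetical error carrying an HTTP status code."""

    def __init__(self, status: int) -> None:
        super().__init__(f"HTTP {status}")
        self.status = status


def fetch_with_fallback(
    url: str,
    web_fetch: Callable[[str], str],
    web_browse: Callable[[str], str],
) -> tuple[str, str]:
    """Try the lightweight fetch first; use the browser only when the fetch gets a 403.

    Returns (name of the tool that produced the content, content).
    Any other HTTP error is re-raised unchanged.
    """
    try:
        return "web_fetch", web_fetch(url)
    except HttpStatusError as err:
        if err.status != 403:
            raise
        return "web_browse", web_browse(url)
```

The ordering reflects the commit's description of web_fetch as a lightweight HTTP GET: the headless browser is the more expensive path, so it is reserved for pages that reject the plain request.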

New guidance (not in previous version):
- Python execution rules: print() requirement, no variable
  persistence, empty stdout handling, with code examples
- Input file handling: attached file path detection and file
  extension-to-tool mapping (.xlsx, .pdf, .mp3, .pptx, etc.)
- Format Verification checklist (number, unit, delimiter, case, date)
- Calculation and Reasoning Verification strategies
- Multi-Step Web Research strategies
- Data Extraction Best Practices
- Environment: root privileges, pip/apt-get install, error handling
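The Python execution rules (print() requirement, no variable persistence, empty stdout handling) follow naturally if each snippet runs in a fresh interpreter. The sketch below assumes that model and uses a local subprocess as a stand-in for the sandbox; `run_python` and `NO_OUTPUT_HINT` are hypothetical names, and the PR's actual execution mechanism is not shown in this conversation.

```python
import subprocess
import sys

NO_OUTPUT_HINT = "(no output; use print() to show results)"


def run_python(code: str, timeout: float = 30.0) -> str:
    """Run a snippet in a fresh interpreter and return what it printed.

    Each call starts a new process, so variables from earlier calls are gone,
    and only text written to stdout comes back to the caller.
    """
    proc = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    if proc.returncode != 0:
        return f"ERROR:\n{proc.stderr.strip()}"
    return proc.stdout if proc.stdout else NO_OUTPUT_HINT
```

For example, `run_python("x = 21 * 2")` returns the hint because nothing was printed, and a later `run_python("print(x)")` returns an error containing `NameError` because `x` did not survive the first call.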

Removed redundant sections:
- Guidelines (covered by Rules), Response Format (covered by §4),
  Problem-Solving Strategy (split into Rules and Environment),
  trailing "ALWAYS use tools" reminder

Structure: 10+ nested sections → 5 flat (Tools, Environment, Rules,
Strategies). Generalized GAIA-specific language for broader use.

Signed-off-by: Jerry Guan <jerryguan777@gmail.com>
@willkill07
Member

/ok to test c25015d

…em prompt

- Add image_describe host tool: reads images from sandbox via
  read_file_bytes, sends base64 to a configurable vision LLM, returns
  text description. Supports png/jpg/jpeg/gif/webp/bmp/tiff.
- Add read_file_bytes abstract method to BaseSandbox with Docker and
  Daytona implementations.
- Add vision_llm_name config option to SandboxAgentWorkflowConfig.
- Enhance answer cleaning prompt: add extraction/formatting/output
  grouping, embedded scale handling, case sensitivity rules.
- Remove rule-based clean_answer function (only LLM-based remains).
- Streamline system prompt: merge Rules and Strategies sections,
  fold web tool decision rule into tool descriptions, condense Python
  examples, reduce prompt size by 35% with no accuracy loss.
- Update configs to support vision_llm configuration.
- Add tests for image_describe tool and update existing tests.
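The image_describe flow described above (read bytes from the sandbox, base64-encode, send to a vision LLM) might look roughly like the sketch below. Only `BaseSandbox`, `read_file_bytes`, and the extension list come from the commit message. `InMemorySandbox`, `MIME_TYPES`, and `image_to_data_url` are hypothetical, the data-URL payload format is an assumption about what the vision LLM accepts, and the code is synchronous for brevity even though the PR states it uses async/await throughout.

```python
import base64
from abc import ABC, abstractmethod
from pathlib import PurePosixPath

# Extensions listed in the commit message; the MIME mapping is an assumption.
MIME_TYPES = {
    "png": "image/png",
    "jpg": "image/jpeg",
    "jpeg": "image/jpeg",
    "gif": "image/gif",
    "webp": "image/webp",
    "bmp": "image/bmp",
    "tiff": "image/tiff",
}


class BaseSandbox(ABC):
    """Minimal stand-in for the PR's sandbox base class."""

    @abstractmethod
    def read_file_bytes(self, path: str) -> bytes:
        """Return the raw bytes of a file inside the sandbox."""


class InMemorySandbox(BaseSandbox):
    """Hypothetical test double; the PR implements Docker and Daytona backends."""

    def __init__(self, files: dict[str, bytes]) -> None:
        self._files = dict(files)

    def read_file_bytes(self, path: str) -> bytes:
        return self._files[path]


def image_to_data_url(sandbox: BaseSandbox, path: str) -> str:
    """Read an image from the sandbox and encode it as a base64 data URL."""
    ext = PurePosixPath(path).suffix.lstrip(".").lower()
    if ext not in MIME_TYPES:
        raise ValueError(f"unsupported image type: {ext or '(none)'}")
    encoded = base64.b64encode(sandbox.read_file_bytes(path)).decode("ascii")
    return f"data:{MIME_TYPES[ext]};base64,{encoded}"
```

Keeping the encoding on the host side means the vision LLM credentials never need to be present inside the sandbox; the sandbox only has to hand over file bytes.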

Signed-off-by: Jerry Guan <jerryguan777@gmail.com>
@willkill07
Member

/ok to test 18d850d

Signed-off-by: Jerry Guan <jerryguan777@gmail.com>
@willkill07
Member

/ok to test eafbf33

@willkill07
Member

@Jerryguan777 CI is still failing. Can you please try to ensure CI passes locally before your next update? I do want to get this merged in :)

  • Formatting
  • [Formatting](https://github.com/NVIDIA/NeMo-Agent-Toolkit-Examples/actions/runs/23171189318/job/67323197199?pr=14#step:5:209)
    https://github.com/NVIDIA/NeMo-Agent-Toolkit-Examples/actions/runs/23171189318/job/67323197199?pr=14#step:5:538
  • Vocabulary -- if you are listing Python packages, escape them with backticks. For FFmpeg, you can add it to accept.txt

I will work on a PR that handles the copyright errors.

willkill07 self-assigned this Mar 18, 2026