Automatically transform multi-source scientific data into AI-Ready data assets
NanoData is built on top of HKUDS/NanoBot, specializing in automated processing of multi-source scientific data. It completes the full pipeline of data acquisition, analysis, cleaning, format conversion, and multimodal export. Through integration with the Easy Dataset MCP server, it exports high-quality SFT fine-tuning datasets ready for use with LLaMA Factory.
It also retains NanoBot's core capabilities: persistent memory, multi-platform connections, SubAgent parallel execution, Cron scheduled tasks, and the Skill system.
| Dimension | NanoBot | NanoData |
|---|---|---|
| Positioning | General personal AI assistant | AI4Science data engineering agent |
| New Tools | — | data_profile, python_exec, file_reader, image_generate |
| New Skills | — | data-cleaning, data-to-text, scientific-data-parser, sft-dataset, multimodal-augmentation |
| MCP Server | External configuration only | Built-in Easy Dataset MCP Server (676 lines, 18 tools) |
| Agent Role | General assistant | Senior AI Data Engineer |
Multi-source scientific data (CSV/Parquet/HDF5/NetCDF/FITS/CIF/PDB/...)
│
▼
┌─────────────┐
│ data_profile │ ← Auto-analyze schema, missing values, data types
└──────┬──────┘
▼
┌─────────────┐
│ data-cleaning│ ← Cleaning, type fixing, deduplication, standardization
└──────┬──────┘
▼
┌─────────────┐
│ data-to-text │ ← Structured data → Semantic Markdown document
└──────┬──────┘
▼
┌─────────────┐
│ sft-dataset │ ← Easy Dataset MCP → QA pair generation → Alpaca/ShareGPT export
└──────┬──────┘
▼
┌──────────────────────┐
│ multimodal-augmentation│ ← SubAgent parallel image generation → Multimodal dataset
└──────────┬───────────┘
▼
LLaMA Factory / Other training frameworks
git clone https://github.com/oNya685/NanoData.git
cd nanobot
pip install -e .nanodata onboard{
"providers": {
"openrouter": { "apiKey": "sk-or-v1-xxx" }
},
"agents": {
"defaults": { "model": "openrouter/claude-opus-4-6" }
}
}nanodata agent> Help me analyze the data quality of experiments.csv
> Clean the data and export as Parquet
> Convert the cleaned data into an experiment report document
> Use Easy Dataset to generate SFT training data, export in Alpaca format
Automatically identifies CSV/Excel/Parquet/JSON/JSONL formats, outputs schema, missing value ratios, data type distributions, and sample data. Supports automatic multi-encoding detection (UTF-8, GBK, Latin1, etc.).
Executes Python scripts in isolated subprocesses for pandas/numpy data processing. Each call is completely isolated with configurable timeout (default 60s, max 300s). Returns full traceback on failure for Agent self-correction.
Supports reading text files by line range, automatically detects binary files and prompts for use of specialized libraries (h5py, netCDF4, etc.). Used for preliminary exploration of scientific data formats.
Calls OpenAI-compatible APIs to generate images, supports batch concurrent requests (configurable max_parallel), automatically saves to disk. Used for multimodal dataset augmentation.
Standardized data cleaning workflow: Profile → Diagnosis → Script Generation → Execution & Verification → Data Card Generation. Follows immutability principle (never overwrites original files), prefers Parquet output to preserve type information.
Converts tabular data into semantic natural language documents. Includes four phases: semantic inference, template design, script generation, and quality check. Outputs AI-Ready Markdown documents ready for RAG, knowledge bases, or SFT dataset generation.
Handles specialized scientific formats: HDF5, NetCDF, FITS, CIF, PDB, etc. Follows zero-data-loss principle, automatically detects unparsed content and emits [UNPARSED_WARNING], iteratively refines until complete extraction.
Implements complete SFT dataset generation pipeline via built-in Easy Dataset MCP Server:
- Create project → Configure model → Upload documents
- Semantic chunking → Generate questions → Generate answers
- Data cleaning → Quality evaluation → Export
Supports Alpaca, ShareGPT, and multilingual-thinking format exports. Includes Cron async polling orchestration template for automatic long-running task scheduling.
Based on SubAgent parallel architecture: dataset sharding → multiple SubAgents parallel image generation → merge output. Adds image_description, image_prompt, and image_path fields to each QA pair.
Built-in MCP Server (nanodata/mcp_servers/easy_dataset_server.py), providing 18 tools:
| Tool | Function |
|---|---|
create_project / list_projects / delete_project |
Project management |
upload_file / list_files / delete_file |
File management |
split_text / list_chunks |
Semantic chunking |
generate_questions / list_questions |
Question generation |
generate_answer / generate_answers_batch |
Answer generation |
clean_data |
Data cleaning |
evaluate_datasets |
Quality evaluation |
list_datasets / export_dataset |
Dataset management & export |
configure_model / list_model_configs |
Model configuration |
Auto-injected on startup, no manual configuration needed. Supports automatic selection of available model configurations.
- Persistent Memory:
memory/MEMORY.md+memory/HISTORY.md - Multi-platform Support: Telegram, Discord, WhatsApp, Feishu, Slack, Email, QQ, DingTalk, Mochat
- SubAgent: Background parallel task execution
- Cron Scheduled Tasks: Supports cron expressions and interval scheduling
- Heartbeat: Auto-wakes every 30 minutes to execute periodic tasks
- Skill System: Defined with Markdown + YAML frontmatter, dynamically loaded
- Multi LLM Provider: OpenRouter, Anthropic, OpenAI, DeepSeek, Gemini, vLLM, etc.
- MCP Protocol: Supports both Stdio and HTTP transport modes
nanodata/
├── agent/
│ ├── loop.py # Agent main loop (register new tools)
│ ├── context.py # Prompt building (inject data engineering instructions)
│ ├── memory.py # Persistent memory
│ ├── subagent.py # SubAgent management (supports multimodal tasks)
│ └── tools/
│ ├── data_profile.py # [NEW] Dataset analysis (235 lines)
│ ├── python_exec.py # [NEW] Sandboxed Python execution (134 lines)
│ ├── file_reader.py # [NEW] Multi-format file reading (123 lines)
│ ├── image_generate.py# [NEW] Image generation (239 lines)
│ ├── mcp.py # MCP tool integration (enhanced type hints)
│ └── ... # Original tools (shell, filesystem, web, spawn, cron)
├── skills/
│ ├── data-cleaning/ # [NEW] Data cleaning skill
│ ├── data-to-text/ # [NEW] Data-to-text skill
│ ├── scientific-data-parser/ # [NEW] Scientific data parsing skill
│ ├── sft-dataset/ # [NEW] SFT dataset generation skill
│ ├── multimodal-augmentation/ # [NEW] Multimodal augmentation skill
│ └── ... # Original skills (github, weather, tmux, summarize, clawhub)
├── mcp_servers/
│ └── easy_dataset_server.py # [NEW] Easy Dataset MCP Server (676 lines)
├── templates/
│ ├── AGENTS.md # [REWRITTEN] Data engineering agent instructions
│ └── TOOLS.md # [REWRITTEN] Tool usage documentation
├── config/
│ └── loader.py # [MODIFIED] Auto-inject built-in MCP Server
├── channels/ # Multi-platform support (retained)
├── providers/ # LLM Provider (retained)
└── ...
Changes from NanoBot to NanoData:
- New Code: ~3,500 lines
- New Tools: 4 (
data_profile,python_exec,file_reader,image_generate) - New Skills: 5 (including script templates and reference docs)
- New MCP Server: 1 (18 tools, 676 lines)
- Modified Files: 114
- Package Refactoring:
nanobot.*→nanodata.*
docker build -t nanodata .
docker run -v ~/.nanodata:/root/.nanodata -p 18790:18790 nanodata gateway- HKUDS/NanoBot — Base framework
- Easy Dataset — SFT dataset generation
- LiteLLM — Multi-provider support
- Model Context Protocol — Tool extension protocol
MIT