
A Semantically Consistent Dataset for Data-Efficient Query-Based Universal Sound Separation


Kai Li*, Jintao Cheng*, Chang Zeng, Zijun Yan, Helin Wang, Zixiong Su, Bo Zheng, Xiaolin Hu
Tsinghua University, Shanda AI, Johns Hopkins University
*Equal contribution
Completed during Kai Li's internship at Shanda AI.
📜 arXiv 2026 | 🎶 Demo | 🤗 Dataset | 🤗 Space



💥 News

  • [2026-03-09] We release infer_audiosep.py for one-command AudioSep inference in Hive. The script automatically downloads config and checkpoints from ShandaAI/AudioSep-hive. 🚀
  • [2026-03-09] We release infer_flowsep.py for one-command FlowSep inference in Hive, with automatic config and checkpoint download from ShandaAI/FlowSep-hive. 🚀
  • [2026-03-09] We release app.py, a unified Gradio demo that supports both AudioSep-hive and FlowSep-hive in a single interface. 🚀
  • [2026-03-09] Community Hugging Face Space is available at JusperLee/Hive for quick interactive demo. 🚀
  • [2026-02-09] Thanks to @faiteamartaliius for using this codebase to synthesize data and publicly sharing a third-party Hive-style dataset: faiteamartaliius/Hive. 🙏

📌 Table of Contents

  • 📄 Abstract
  • 📂 Repository Structure
  • 🍯 Hive Dataset
  • ⚙️ Data Collection Pipeline
  • 🎧 Inference
  • 🖥️ Gradio App
  • 📖 Citation
  • ⚖️ License
  • 🙏 Acknowledgments

📄 Abstract

Query-based universal sound separation is fundamental to intelligent auditory systems, aiming to isolate specific sources from unconstrained mixtures. Despite recent advances, existing methods continue to suffer from residual interference in complex acoustic scenes. This performance limitation stems largely from a data bottleneck: ubiquitous in-the-wild datasets contain weak labels and severe event co-occurrence. These flaws induce models to learn spurious correlations between background noise and target categories instead of robust acoustic features. To address this, we propose an automated pipeline that eliminates co-occurrence noise by mining high-purity single-event segments from unconstrained recordings and synthesizing mixtures via semantically consistent strategies. Utilizing this pipeline, we constructed Hive, a high-quality synthetic dataset comprising 2k hours of audio. Experimental results demonstrate that, despite using only ~0.2% of the data scale of million-hour baselines, models trained on Hive achieve competitive separation accuracy and perceptual quality. Moreover, these models exhibit remarkable zero-shot generalization on out-of-distribution evaluation benchmarks such as MUSDB18-HQ and USS-Bench. These findings highlight that prioritizing supervision purity enables significant data efficiency, offering a new paradigm for training robust auditory foundation models with reduced computational costs.

📂 Repository Structure

.
├── hive_dataset/                           # Hive Dataset generation and curation
│   ├── mix_from_metadata/                  # Generate mixtures from metadata
│   │   ├── mix_from_metadata.py
│   │   └── dataset_paths.json
│   ├── mix_curation/                       # Data curation for mix audio
│   │   ├── mix_data_curation.py
│   │   └── ontology.json
│   ├── README.md                           # Dataset documentation
│   ├── requirements.txt
│   └── LICENSE
├── pipeline/                               # Single-Event Data Collection Pipeline
│   ├── code/                               # Pipeline scripts
│   │   ├── 01_audio_chunking.py
│   │   ├── 02_filter_single_label.py
│   │   ├── 03_filter_single_event_qwen.py
│   │   ├── 04_audioset_label_audiotag.py
│   │   ├── 05_leaf_label_qwen.py
│   │   └── 06_superres_apollo.py
│   ├── data/                               # Pipeline data directories
│   ├── ontology/                           # AudioSet ontologies
│   ├── icefall/                            # AudioTag model repository
│   ├── Apollo/                             # Apollo model repository
│   ├── requirements.txt                    # Pipeline dependencies
│   └── README.md                           # Pipeline documentation
├── LICENCE                                 # Project license
└── README.md                              

🍯 Hive Dataset

Hive is a high-quality synthetic dataset with 2,442 hours of raw audio and 19.6M mixtures for Universal Sound Separation.

Features:

  • 283 sound categories from AudioSet ontology
  • Semantically consistent mixing logic
  • 44.1kHz sample rate

Please refer to hive_dataset/ for details.
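The semantically consistent mixing above boils down to combining a target source with interference at a controlled level. The sketch below shows one common way to mix two waveforms at a chosen SNR; `mix_at_snr` is a hypothetical helper for illustration, not the repository's actual mixing code (which lives in hive_dataset/mix_from_metadata/).

```python
import numpy as np

def mix_at_snr(target: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Mix two mono waveforms at a given SNR (target relative to noise).

    Hypothetical helper -- the real mixing logic in
    hive_dataset/mix_from_metadata/mix_from_metadata.py may differ.
    """
    n = min(len(target), len(noise))
    target, noise = target[:n], noise[:n]
    # Scale the noise so the mixture reaches the requested SNR.
    target_rms = np.sqrt(np.mean(target ** 2) + 1e-12)
    noise_rms = np.sqrt(np.mean(noise ** 2) + 1e-12)
    gain = target_rms / (noise_rms * 10 ** (snr_db / 20))
    mixture = target + gain * noise
    # Peak-normalize only when needed, to avoid clipping on export.
    peak = np.max(np.abs(mixture))
    return mixture / peak if peak > 1.0 else mixture
```

A metadata entry would then supply the source paths and the SNR value to pass into such a function.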

⚙️ Data Collection Pipeline

An automated 6-step pipeline for mining high-purity single-event audio from weakly-labeled sources.

Pipeline Stages:

  1. Audio Chunking - Split long audio into segments
  2. Single Label Filtering - Remove multi-label samples
  3. Single Event Filtering - Verify acoustic purity with Qwen3-Omni
  4. AudioSet Label Tagging - Assign ontology labels with AudioTag
  5. Leaf Label Classification - Refine to leaf nodes with Qwen3-Omni
  6. Audio Super-Resolution - Upsample to 44.1kHz with Apollo

Please refer to pipeline/ for details.
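As a minimal illustration of stage 1 above, a fixed-length chunker might look like the following sketch. The function name, parameters, and trailing-chunk policy are assumptions for illustration; the actual behavior is defined by pipeline/code/01_audio_chunking.py.

```python
import numpy as np

def chunk_audio(wave: np.ndarray, sr: int, chunk_sec: float = 10.0) -> list:
    """Split a mono waveform into non-overlapping fixed-length chunks.

    Illustrative only; 01_audio_chunking.py may use different chunk
    lengths, overlap, or silence handling.
    """
    chunk_len = int(sr * chunk_sec)
    chunks = [wave[i:i + chunk_len] for i in range(0, len(wave), chunk_len)]
    # Drop a trailing chunk shorter than the target length.
    if chunks and len(chunks[-1]) < chunk_len:
        chunks.pop()
    return chunks
```

Stages 2-5 then filter and re-label these chunks before super-resolution in stage 6.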

🎧 Inference

Hive provides two inference scripts with automatic checkpoint/config download from Hugging Face:

1) Install dependencies

cd Hive
pip install torch torchaudio librosa pyyaml pytorch-lightning huggingface_hub gradio

2) AudioSep inference

python infer_audiosep.py \
  --audio_file /path/to/mixture.wav \
  --text "acoustic guitar" \
  --output_file /path/to/audiosep_output.wav

3) FlowSep inference

python infer_flowsep.py \
  --audio_file /path/to/mixture.wav \
  --text "acoustic guitar" \
  --output_file /path/to/flowsep_output.wav

🖥️ Gradio App

app.py launches an interactive local demo with both models in one UI:

  • Model choices: AudioSep-hive, FlowSep-hive
  • Input: mixed audio + text query
  • Output: separated waveform

Run:

cd Hive
python app.py

Then open the local Gradio URL printed in the terminal.

📖 Citation

If you use this code or the Hive Dataset, please cite:

@article{li2026semantically,
  title={A Semantically Consistent Dataset for Data-Efficient Query-Based Universal Sound Separation},
  author={Li, Kai and Cheng, Jintao and Zeng, Chang and Yan, Zijun and Wang, Helin and Su, Zixiong and Zheng, Bo and Hu, Xiaolin},
  journal={arXiv preprint arXiv:2601.22599},
  year={2026}
}

⚖️ License

Project License

This project is licensed under the Apache License 2.0. See LICENSE for details.

Model Licenses

  • Qwen3-Omni: Apache 2.0
  • AudioTag: Apache 2.0
  • Apollo: Check model repository for specific license

🙏 Acknowledgments

The Hive dataset is a collaborative achievement built upon the foundation of the open-source audio community. We extend our deepest gratitude to the researchers and organizations who curated the twelve foundational datasets. Their work provides the essential long-tailed acoustic space for advancing Universal Sound Separation.

🏛️ Foundational Data Sources

We gratefully acknowledge the following core datasets which provided the majority of our high-fidelity clips:

  • BBC Sound Effects (369,603 clips, 1,020.62h) - Professional-grade recordings with broadcast-level fidelity under Remix License
  • AudioSet (326,890 clips, 896.61h) - Large-scale benchmark from YouTube under CC BY (Google)
  • VGGSound (115,191 clips, 319.10h) - Real-world acoustic diversity under CC BY 4.0 (University of Oxford)
  • FreeSound (17,451 clips, 46.90h) - Rich crowdsourced soundscapes under CC0/BY/BY-NC (MTG-UPF)

🎯 Specialized Domain Contributors

Our sincere thanks go to the following datasets for providing the raw source audio that forms the specialized domains of the Hive Dataset:

Music & Speech:

  • MUSIC21 (32,701 clips, 90.28h) - Solo and ensemble instruments for harmonic structure modeling
  • Voicebank-DEMAND (12,376 clips, 9.94h) - Clean speech signals under CC BY 4.0
  • FSD50K (636 clips, 0.80h) - Finely annotated subset based on AudioSet ontology

Environmental & Events:

  • ClothoV2 (14,759 clips, 38.19h) - Audio captioning dataset with rich temporal evolution
  • AVE (3,054 clips, 6.91h) - Audio-visual event localization under CC BY-NC-SA
  • SoundBible (2,501 clips, 5.78h) - Curated short clips under CC BY 4.0
  • DCASE (1,969 clips, 5.46h) - Acoustic scene detection challenges
  • ESC50 (1,433 clips, 1.99h) - Environmental sound classification benchmark under CC BY-NC 3.0

⚖️ License & Ethical Compliance

All source data were processed in strict accordance with their respective licenses (e.g., CC BY, CC0, Remix License). An automated data collection pipeline was employed to ensure that only semantically aligned and single-label pure segments were extracted, respecting the original intent of the data contributors while enhancing their utility for sound separation tasks.

Important Note: This repository releases only the metadata (JSON files containing mixing parameters and source references) for reproducibility. We do not redistribute the original audio files from the source datasets. Users must independently download and prepare the source datasets according to their respective licenses and terms of use.
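Because only metadata is released, reconstruction requires resolving each entry's source references against locally prepared datasets. A hedged sketch of such a pre-flight check is shown below; the field names ("sources", "dataset", "path") are assumptions for illustration, so consult the released JSON metadata and hive_dataset/mix_from_metadata/dataset_paths.json for the real schema.

```python
from pathlib import Path

def missing_sources(metadata_entry: dict, dataset_roots: dict) -> list:
    """Return source files referenced by a mixture entry that are absent locally.

    Field names are illustrative assumptions, not the actual Hive schema.
    """
    missing = []
    for src in metadata_entry.get("sources", []):
        root = Path(dataset_roots.get(src["dataset"], "."))
        if not (root / src["path"]).exists():
            missing.append(src["path"])
    return missing
```

Running a check like this over the metadata before synthesis makes it easy to see which source datasets still need to be downloaded.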

We thank all original contributors for their invaluable service to the scientific community.