AudioX: Diffusion Transformer for Anything-to-Audio Generation
Abstract
Audio and music generation have emerged as crucial tasks in many applications, yet existing approaches face significant limitations: they operate in isolation without unified capabilities across modalities, suffer from a scarcity of high-quality multi-modal training data, and struggle to integrate diverse inputs effectively. In this work, we propose AudioX, a unified Diffusion Transformer model for Anything-to-Audio and Music Generation. Unlike previous domain-specific models, AudioX can generate both general audio and music with high quality, while offering flexible natural-language control and seamless processing of various modalities, including text, video, image, music, and audio. Its key innovation is a multi-modal masked training strategy that masks inputs across modalities during training, forcing the model to learn from incomplete conditioning signals and yielding robust, unified cross-modal representations. To address data scarcity, we curate two comprehensive datasets: vggsound-caps, with 190K audio captions based on the VGGSound dataset, and V2M-caps, with 6 million music captions derived from the V2M dataset. Extensive experiments demonstrate that AudioX not only matches or outperforms state-of-the-art specialized models, but also offers remarkable versatility in handling diverse input modalities and generation tasks within a unified architecture. The code and datasets will be available at https://zeyuet.github.io/AudioX/.
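The abstract's core mechanism is a masked training strategy over multi-modal conditioning. Below is a minimal PyTorch sketch of that idea, assuming one projection per modality and modality-level masking; all module names, dimensions, and the masking ratio are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch only: masks whole modalities during training so the
# model must rely on the remaining conditioning signals.
import torch
import torch.nn as nn


class MaskedMultiModalConditioner(nn.Module):
    """Projects per-modality features to a shared width and randomly
    replaces entire modalities with a learned mask token while training."""

    def __init__(self, dims, hidden=512, mask_prob=0.3):
        super().__init__()
        # One linear projection per modality (e.g. text, video, audio).
        self.proj = nn.ModuleDict(
            {name: nn.Linear(d, hidden) for name, d in dims.items()}
        )
        # Learned placeholder used in place of a masked modality.
        self.mask_token = nn.Parameter(torch.zeros(1, 1, hidden))
        self.mask_prob = mask_prob

    def forward(self, feats):
        tokens = []
        for name, x in feats.items():
            h = self.proj[name](x)  # (batch, seq, hidden)
            if self.training and torch.rand(()) < self.mask_prob:
                # Drop the whole modality: substitute the mask token.
                h = self.mask_token.expand(h.shape[0], h.shape[1], -1)
            tokens.append(h)
        # Concatenate along the sequence axis as conditioning for a DiT.
        return torch.cat(tokens, dim=1)


# Hypothetical usage with random tensors standing in for real encoders.
cond = MaskedMultiModalConditioner({"text": 768, "video": 1024})
cond.train()
out = cond({"text": torch.randn(2, 16, 768), "video": torch.randn(2, 32, 1024)})
print(out.shape)  # torch.Size([2, 48, 512])
```

Masking at the modality level (rather than per token) is one plausible reading of "masks inputs across modalities"; the paper itself should be consulted for the exact granularity and schedule.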
Community
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- UniForm: A Unified Diffusion Transformer for Audio-Video Generation (2025)
- DiffGAP: A Lightweight Diffusion Module in Contrastive Space for Bridging Cross-Model Gap (2025)
- RealGeneral: Unifying Visual Generation via Temporal In-Context Learning with Video Models (2025)
- TA-V2A: Textually Assisted Video-to-Audio Generation (2025)
- HiTVideo: Hierarchical Tokenizers for Enhancing Text-to-Video Generation with Autoregressive Large Language Models (2025)
- MACS: Multi-source Audio-to-image Generation with Contextual Significance and Semantic Alignment (2025)
- DeepResonance: Enhancing Multimodal Music Understanding via Music-centric Multi-way Instruction Tuning (2025)