Awesome OpenSound

Various Audio Processing Baselines

Task 1: Sound Event Detection and Localization

An improved Event-Independent Network for Polyphonic Sound Event Localization and Detection (2020). Github
SELD-Net: Sound Event Localization and Detection of overlapping sources using convolutional recurrent neural network, IEEE Journal of Selected Topics in Signal Processing (JSTSP 2018). Github

Task 2: Room Acoustics

Sebastian Thrun, Affine Structure from Sound. NIPS 2005, paper link
Zhoutong Zhang et al. Shape and Material from Sound. NIPS 2017. paper link
Miranda et al. Structure from Sound with Incomplete Data. ICASSP 2018. paper link
Arun Balajee Vasudevan et al. Semantic Object Prediction and Spatial Sound Super-Resolution with Binaural Sounds. ECCV 2020. paper link
Changan Chen et al. Audio-Visual Waypoints for Navigation.
F. Ribeiro, D. Florencio, D. Ba and C. Zhang, "Geometrically Constrained Room Modeling With Compact Microphone Arrays," in IEEE Transactions on Audio, Speech, and Language Processing, vol. 20, no. 5, pp. 1449-1460, July 2012, doi: 10.1109/TASL.2011.2180897.
F. Antonacci et al., "Inference of Room Geometry From Acoustic Impulse Responses," in IEEE Transactions on Audio, Speech, and Language Processing, vol. 20, no. 10, pp. 2683-2695, Dec. 2012, doi: 10.1109/TASL.2012.2210877.
Ivan Dokmanić, Acoustic echoes reveal room shape. PNAS, 2013. paper link.

Task 3: Sound Generation

Wei Ping et al. WaveFlow: A Compact Flow-based Model for Raw Audio. ICML 2020. paper link

Task 4: Sound Object Discussion

Daniel Arteaga, et al., Multichannel-based learning for audio object extraction. ICASSP 21. paper link

Dataset

Google AudioSet, link
GWA, link
STARSS23: Audio-Visual Dataset, link. Paper: Kazuki Shimada et al., STARSS23: An Audio-Visual Dataset of Spatial Recordings of Real Scenes with Spatiotemporal Annotations of Sound Events. NeurIPS 2023.
Clotho Dataset, link

Tools

SoundSpaces, link
Pyroomacoustics, link

Position Encoding

On Position Embeddings in BERT. paper link

Permutation Invariant Training

Dong Yu et al. Permutation Invariant Training of Deep Models for Speaker-Independent Multi-talker Speech Separation. paper link

Learning from Sound Raw Waveforms

Neil Zeghidour et al., Learning Filterbanks from Raw Speech for Phone Recognition. ICASSP 2018. paper link
Neil Zeghidour et al., LEAF: A Learnable Frontend for Audio Classification. ICLR 2021. paper link
Paul-Gauthier Noe et al., CGCNN: Complex Gabor Convolutional Neural Network on Raw Speech. 2020. paper link
Yi Luo, Nima Mesgarani, Conv-TasNet: Surpassing Ideal Time-Frequency Magnitude Masking for Speech Separation. IEEE/ACM Transactions on Audio, Speech and Language Processing. 2019. paper link, code.
Yuhang He et al. SoundDet: Polyphonic Moving Sound Event Detection and Localization from Raw Waveform. ICML 2021. paper link
Yuhang He et al. SoundDoA: Learn Sound Source Direction of Arrival and Semantics from Sound Raw Waveforms. Interspeech 2022.

Sound + Vision Cross-Modality Perception

R. Gao and K. Grauman, 2.5D Visual Sound. CVPR, 2019. paper link
Valentina Sanguineti, et al., Audio-Visual Localization by Synthetic Acoustic Image Generation. AAAI, 2021. paper link
Triantafyllos Afouras, et al., Self-Supervised Learning of Audio-Visual Objects from Video. ECCV, 2020. paper link
Senthil Purushwalkam, et al., Audio-Visual Floorplan Reconstruction, ICCV 2021. paper link
Hu Di, Lichao Mou, Qingzhong Wang, Junyu Gao, Yuansheng Hua, Dejing Dou and Xiao Xiang Zhu, Ambient Sound Helps: Audiovisual Crowd Counting in Extreme Conditions. arXiv preprint. 2020. paper link.
Zhenyu Tang et al., GWA: A Large High-Quality Acoustic Dataset for Audio Processing. SIGGRAPH 2022. project site
Changan Chen et al., Visual Acoustic Matching. CVPR 2022. paper link
Abdelrahman Younes et al., Catch Me If You Hear Me: Audio-Visual Navigation in Complex Unmapped Environments with Moving Sounds. arXiv:2111.14843. paper link
Yudong Guo et al., AD-NeRF: Audio Driven Neural Radiance Fields for Talking Head Synthesis. ICCV 2021. Github
Arda Senocak, et al., Less Can Be More: Sound Source Localization With a Classification Model. WACV 2022. paper link.
Rishit Dagli, et al., SEE-2-SOUND: Zero-shot Spatial Environment-to-Spatial Sound. project page.

Dynamic NeRF

Chen Gao et al., Dynamic View Synthesis from Dynamic Monocular Video. ICCV, 2021. project page
BARF 🤮: Bundle-Adjusting Neural Radiance Fields. ICCV 21. Github

Binaural and Spatial Sound Generation

Yichong Leng, et al., BinauralGrad: A Two-Stage Conditional Diffusion Probabilistic Model for Binaural Audio Synthesis. arXiv. paper link
Alexander Richard, et al., Neural Synthesis of Binaural Speech From Mono Audio. ICLR 2021. paper link
Sijia Li et al., Binaural Audio Generating via Multi-Task Learning. ACM SIGGRAPH Asia. 2021. project site
Arun Balajee Vasudevan et al., Sound and Visual Representation Learning with Multiple Pretraining Tasks, CVPR 22. paper link
Alon Levkovitch et al., Zero-Shot Mono-to-Binaural Speech Synthesis. 2024. paper link
Mingfei Chen et al., SoundVista: Novel-View Ambient Sound Synthesis via Visual-Acoustic Binding. CVPR 2025. project site
Huadai Liu et al., OmniAudio: Generating Spatial Audio from 360-Degree Video. ICML 2025. project site.

Neural Audio Effect

Christian Steinmetz et al. Style transfer of audio effects with differentiable signal processing. Journal of the Audio Engineering Society (JAES). 2022. paper link

Sound Propagation Process

Samuel Siltanen et al., The room acoustic rendering equation. The Journal of the Acoustical Society of America. 2007. paper link

Embodied-AI research

Vincent Cartillier, et al., Semantic MapNet: Building Allocentric Semantic Maps and Representations from Egocentric Views.
Karen Yang, et al., Camera Pose Estimation and Localization with Active Audio Sensing. ECCV 2022.

Audio + Transformer

Sanyuan Chen et al., BEATs: Audio Pre-Training with Acoustic Tokenizers. 2022. paper link.

Large Model on Audio Synthesis

Andrea Agostinelli et al., MusicLM: Generating Music From Text. arXiv:2301.11325. 2023.

Audio-Driven Task

See link

Audio-involved LLM

See link
