yuhanghe01/OpenSound

Awesome OpenSound

Various Audio Process Baselines

Task 1: Sound Event Detection and Localization

  1. An improved Event-Independent Network for Polyphonic Sound Event Localization and Detection (2020). Github
  2. SELD-Net: Sound Event Localization and Detection of overlapping sources using convolutional recurrent neural network, IEEE Journal of Selected Topics in Signal Processing (JSTSP 2018). Github

Task 2: Room Acoustics

  1. Sebastian Thrun, Affine Structure from Sound. NIPS 2005, paper link
  2. Zhoutong Zhang et al. Shape and Material from Sound. NIPS 2017. paper link
  3. Miranda et al. Structure from Sound with Incomplete Data. ICASSP 2018. paper link
  4. Arun Balajee Vasudevan et al. Semantic Object Prediction and Spatial Sound Super-Resolution with Binaural Sounds. ECCV 2020. paper link
  5. Changan Chen et al. Audio-Visual Waypoints for Navigation.
  6. F. Ribeiro, D. Florencio, D. Ba and C. Zhang, "Geometrically Constrained Room Modeling With Compact Microphone Arrays," in IEEE Transactions on Audio, Speech, and Language Processing, vol. 20, no. 5, pp. 1449-1460, July 2012, doi: 10.1109/TASL.2011.2180897.
  7. F. Antonacci et al., "Inference of Room Geometry From Acoustic Impulse Responses," in IEEE Transactions on Audio, Speech, and Language Processing, vol. 20, no. 10, pp. 2683-2695, Dec. 2012, doi: 10.1109/TASL.2012.2210877.
  8. Ivan Dokmanić, Acoustic echoes reveal room shape. PNAS, 2013. paper link.

Task 3: Sound Generation

  1. Wei Ping et al. WaveFlow: A Compact Flow-based Model for Raw Audio. ICML2020. paper link

Task 4: Sound Object Discussion

  1. Daniel Arteaga, et al., Multichannel-based learning for audio object extraction. ICASSP 21. paper link

Dataset

  1. Google AudioSet, link
  2. GWA, link
  3. STARSS23: Audio-Visual Dataset. link. The paper publication: Kazuki Shimada, et al., STARSS23: An Audio-Visual Dataset of Spatial Recordings of Real Scenes with Spatiotemporal Annotations of Sound Events. NeurIPS 2023.
  4. Clotho Dataset Link

Tools

  1. SoundSpaces, link
  2. Pyroomacoustics. link

Position Encoding

  1. On Position Embeddings in BERT. paper link

Permutation Invariant Training

  1. Dong Yu et al. Permutation Invariant Training of Deep Models for Speaker-Independent Multi-talker Speech Separation. paper link

Learning from Sound Raw Waveforms

  1. Neil Zeghidour et al., Learning Filterbanks from Raw Speech for Phone Recognition. ICASSP 2018. paper link
  2. Neil Zeghidour et al., LEAF: A Learnable Frontend for Audio Classification. ICLR 2021. paper link
  3. Paul-Gauthier Noe et al., CGCNN: Complex Gabor Convolutional Neural Network on Raw Speech. 2020. paper link
  4. Yi Luo, Nima Mesgarani, Conv-TasNet: Surpassing Ideal Time-Frequency Magnitude Masking for Speech Separation. IEEE/ACM Transactions on Audio, Speech and Language Processing. 2019. paper link, code.
  5. Yuhang He et al. SoundDet: Polyphonic Moving Sound Event Detection and Localization from Raw Waveform. ICML2021. paper link
  6. Yuhang He et al. SoundDoA: Learn Sound Source Direction of Arrival and Semantics from Sound Raw Waveforms. Interspeech 2022.

Sound + Vision Cross-Modality Perception

  1. R. Gao and K. Grauman, 2.5D Visual Sound. CVPR, 2019. paper link
  2. Valentina Sanguineti, et al., Audio-Visual Localization by Synthetic Acoustic Image Generation. AAAI, 2021. paper link
  3. Triantafyllos Afouras, et al., Self-Supervised Learning of Audio-Visual Objects from Video. ECCV, 2020. paper link
  4. Senthil Purushwalkam, et al., Audio-Visual Floorplan Reconstruction, ICCV 2021. paper link
  5. Hu Di, Lichao Mou, Qingzhong Wang, Junyu Gao, Yuansheng Hua, Dejing Dou and Xiao Xiang Zhu, Ambient Sound Helps: Audiovisual Crowd Counting in Extreme Conditions. arXiv preprint. 2020. paper link.
  6. Zhenyu Tang et al., GWA: A Large High-Quality Acoustic Dataset for Audio Processing. SIGGRAPH 2022. project site
  7. Changan Chen et al., Visual Acoustic Matching. CVPR 2022. paper link
  8. Abdelrahman Younes et al., Catch Me If You Hear Me: Audio-Visual Navigation in Complex Unmapped Environments with Moving Sounds. arXiv:2111.14843. paper link
  9. Yudong Guo et al., AD-NeRF: Audio Driven Neural Radiance Fields for Talking Head Synthesis. ICCV 2021. Github
  10. Arda Senocak, et al., Less Can Be More: Sound Source Localization With a Classification Model. WACV2022. paper link.
  11. Rishit Dagli, et al., SEE-2-SOUND: Zero-shot Spatial Environment-to-Spatial Sound. project page.

Dynamic NeRF

  1. Chen Gao et al., Dynamic View Synthesis from Dynamic Monocular Video. ICCV, 2021. project page
  2. BARF 🤮: Bundle-Adjusting Neural Radiance Fields. ICCV 21. Github

Binaural and Spatial Sound Generation

  1. Yichong Leng, et al., BinauralGrad: A Two-Stage Conditional Diffusion Probabilistic Model for Binaural Audio Synthesis. ArXiv paper link
  2. Alexander Richard, et al., Neural Synthesis of Binaural Speech From Mono Audio. ICLR 2021. paper link
  3. Sijia Li et al., Binaural Audio Generating via Multi-Task Learning. ACM SIGGRAPH Asia. 2021. project site
  4. Arun Balajee Vasudevan et al., Sound and Visual Representation Learning with Multiple Pretraining Tasks, CVPR 22. paper link
  5. Alon Levkovitch et al., Zero-Shot Mono-to-Binaural Speech Synthesis. 2024. paper link
  6. Mingfei Chen et al., SoundVista: Novel-View Ambient Sound Synthesis via Visual-Acoustic Binding. CVPR 2025. project site
  7. Huadai Liu et al., OmniAudio: Generating Spatial Audio from 360-Degree Video. ICML2025. project site.

Neural Audio Effect

  1. Christian Steinmetz et al. Style transfer of audio effects with differentiable signal processing. Journal of the Audio Engineering Society (JAES). 2022. paper link

Sound Propagation Process

  1. Samuel Siltanen et al., The room acoustic rendering equation. The Journal of the Acoustical Society of America. 2007. paper link

Embodied-AI research

  1. Vincent Cartillier, et al., Semantic MapNet: Building Allocentric Semantic Maps and Representations from Egocentric Views.
  2. Karen Yang, et al., Camera Pose Estimation and Localization with Active Audio Sensing. ECCV 2022.

Audio + Transformer

  1. Sanyuan Chen et al., BEATs: Audio Pre-Training with Acoustic Tokenizers. 2022. paper link.

Large Model on Audio Synthesis

  1. Andrea Agostinelli et al., MusicLM: Generating Music From Text. Arxiv 2301.11325. 2023.

Audio-Driven Task

See link

Audio-involved LLM

See link
