Various Audio Processing Baselines
- An improved Event-Independent Network for Polyphonic Sound Event Localization and Detection (2020). Github
- SELD-Net: Sound Event Localization and Detection of overlapping sources using convolutional recurrent neural network, IEEE Journal of Selected Topics in Signal Processing (JSTSP 2018). Github
- Sebastian Thrun, Affine Structure from Sound. NIPS 2005, paper link
- Zhoutong Zhang et al. Shape and Material from Sound. NIPS 2017. paper link
- Miranda et al. Structure from Sound with Incomplete Data. ICASSP 2018. paper link
- Arun Balajee Vasudevan et al. Semantic Object Prediction and Spatial Sound Super-Resolution with Binaural Sounds. ECCV 2020. paper link
- Changan Chen et al. Audio-Visual Waypoints for Navigation.
- F. Ribeiro, D. Florencio, D. Ba and C. Zhang, "Geometrically Constrained Room Modeling With Compact Microphone Arrays," in IEEE Transactions on Audio, Speech, and Language Processing, vol. 20, no. 5, pp. 1449-1460, July 2012, doi: 10.1109/TASL.2011.2180897.
- F. Antonacci et al., "Inference of Room Geometry From Acoustic Impulse Responses," in IEEE Transactions on Audio, Speech, and Language Processing, vol. 20, no. 10, pp. 2683-2695, Dec. 2012, doi: 10.1109/TASL.2012.2210877.
- Ivan Dokmanić, Acoustic echoes reveal room shape. PNAS, 2013. paper link.
- Wei Ping et al. WaveFlow: A Compact Flow-based Model for Raw Audio. ICML 2020. paper link
- Daniel Arteaga, et al., Multichannel-based learning for audio object extraction. ICASSP 21. paper link
- Google AudioSet, link
- GWA, link
- STARSS23: Audio-Visual Dataset. link. Paper: Kazuki Shimada et al., STARSS23: An Audio-Visual Dataset of Spatial Recordings of Real Scenes with Spatiotemporal Annotations of Sound Events. NeurIPS 2023.
- Clotho Dataset Link
- On Position Embeddings in BERT. paper link
- Dong Yu et al. Permutation Invariant Training of Deep Models for Speaker-Independent Multi-talker Speech Separation. paper link
- Neil Zeghidour et al., Learning Filterbanks from Raw Speech for Phone Recognition. ICASSP 2018. paper link
- Neil Zeghidour et al., LEAF: A Learnable Frontend for Audio Classification. ICLR 2021. paper link
- Paul-Gauthier Noe et al., CGCNN: Complex Gabor Convolutional Neural Network on Raw Speech. 2020. paper link
- Yi Luo, Nima Mesgarani, Conv-TasNet: Surpassing Ideal Time-Frequency Magnitude Masking for Speech Separation. IEEE/ACM Transactions on Audio, Speech and Language Processing. 2019. paper link, code.
- Yuhang He et al. SoundDet: Polyphonic Moving Sound Event Detection and Localization from Raw Waveform. ICML 2021. paper link
- Yuhang He et al. SoundDoA: Learn Sound Source Direction of Arrival and Semantics from Sound Raw Waveforms. Interspeech 2022.
- R. Gao and K. Grauman, 2.5D Visual Sound. CVPR, 2019. paper link
- Valentina Sanguineti, et al., Audio-Visual Localization by Synthetic Acoustic Image Generation. AAAI, 2021. paper link
- Triantafyllos Afouras, et al., Self-Supervised Learning of Audio-Visual Objects from Video. ECCV, 2020. paper link
- Senthil Purushwalkam, et al., Audio-Visual Floorplan Reconstruction, ICCV 2021. paper link
- Di Hu, Lichao Mou, Qingzhong Wang, Junyu Gao, Yuansheng Hua, Dejing Dou, and Xiao Xiang Zhu, Ambient Sound Helps: Audiovisual Crowd Counting in Extreme Conditions. arXiv preprint. 2020. paper link.
- Zhenyu Tang et al., GWA: A Large High-Quality Acoustic Dataset for Audio Processing. SIGGRAPH 2022. project site
- Changan Chen et al., Visual Acoustic Matching. CVPR 2022. paper link
- Abdelrahman Younes et al., Catch Me If You Hear Me: Audio-Visual Navigation in Complex Unmapped Environments with Moving Sounds. arXiv:2111.14843. paper link
- Yudong Guo et al., AD-NeRF: Audio Driven Neural Radiance Fields for Talking Head Synthesis. ICCV 2021. Github
- Arda Senocak, et al., Less Can Be More: Sound Source Localization With a Classification Model. WACV 2022. paper link.
- Rishit Dagli, et al., SEE-2-SOUND: Zero-shot Spatial Environment-to-Spatial Sound. project page.
- Chen Gao et al., Dynamic View Synthesis from Dynamic Monocular Video. ICCV, 2021. project page
- BARF 🤮: Bundle-Adjusting Neural Radiance Fields. ICCV 21. Github
- Yichong Leng, et al., BinauralGrad: A Two-Stage Conditional Diffusion Probabilistic Model for Binaural Audio Synthesis. arXiv. paper link
- Alexander Richard, et al., Neural Synthesis of Binaural Speech From Mono Audio. ICLR 2021. paper link
- Sijia Li et al., Binaural Audio Generating via Multi-Task Learning. ACM SIGGRAPH Asia. 2021. project site
- Arun Balajee Vasudevan et al., Sound and Visual Representation Learning with Multiple Pretraining Tasks, CVPR 22. paper link
- Alon Levkovitch et al., Zero-Shot Mono-to-Binaural Speech Synthesis. 2024. paper link
- Mingfei Chen et al., SoundVista: Novel-View Ambient Sound Synthesis via Visual-Acoustic Binding. CVPR 2025. project site
- Huadai Liu et al., OmniAudio: Generating Spatial Audio from 360-Degree Video. ICML 2025. project site.
- Christian Steinmetz et al. Style transfer of audio effects with differentiable signal processing. Journal of the Audio Engineering Society (JAES). 2022. paper link
- Samuel Siltanen et al., The room acoustic rendering equation. The Journal of the Acoustical Society of America. 2007. paper link
- Vincent Cartillier, et al., Semantic MapNet: Building Allocentric Semantic Maps and Representations from Egocentric Views.
- Karen Yang, et al., Camera Pose Estimation and Localization with Active Audio Sensing. ECCV 2022.
- Sanyuan Chen et al., BEATs: Audio Pre-Training with Acoustic Tokenizers. 2022. paper link.
- Andrea Agostinelli et al., MusicLM: Generating Music From Text. arXiv:2301.11325. 2023.