Modeling Multimodal Social Interactions: New Challenges and Baselines with Densely Aligned Representations. CVPR 2024. project page.
Emu3. Emu3: Next-Token Prediction is All You Need. 2024. project page.
Simran Khanuja et al., An image speaks a thousand words, but can everyone listen? On image transcreation for cultural relevance. EMNLP 2024 Best Paper. paper.
Janus Series: Unified Multimodal Understanding and Generation Models. GitHub.