Sounds That Shape: Audio-Driven 3D Mesh Generation with Attribute-Decoupled Score Distillation Sampling

Anonymous Author(s)

"Imagine What Shape Sound Could Take"


tl;dr: We introduce an audio-driven 3D mesh and texture generation system that leverages pretrained 2D diffusion models to produce a textured mesh from a single audio input within 2 minutes.


Method Overview



SFX-to-3D Generation Results


Audio: 🔊 (Fire Crackling)

Audio: 🔊 (Forest)

Audio: 🔊 (Forest)

Audio: 🔊 (Underwater Bubbling)

A23D Result:

Audio: 🔊 (Snow)

A23D Result:

Our system performs audio-driven 3D mesh and texture generation with pretrained 2D diffusion models from only a single audio file.


Ablation: Cross-Modality 3D Generation


Audio: 🔊 (Null)
Text: 💬 "A chair with fire crackling effect"

Text-to-2D:

Audio: 🔊 (Null)
Text: 💬 "A chair with fire crackling effect"

Text-to-3D:

Audio: 🔊 (Fire Crackling)
Text: 💬 "A Chair"

(Naïve) Audio-Driven Text-to-3D:

Audio: 🔊 (Fire Crackling)
Text: 💬 "A Chair"

Ours:

Audio embeddings express the SFX of ambient sounds better than text embeddings for 3D generation.
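The comparison contrasts conditioning the pretrained 2D diffusion model on a text embedding (CLIP) versus an audio embedding (CLAP); in both cases the condition typically enters the sampler through classifier-free guidance. Below is a minimal sketch of that guidance step; the unet callable and embedding arguments are placeholder names for illustration, not a specific released model.

def cfg_noise_pred(unet, x_t, t, cond_emb, null_emb, guidance_scale=7.5):
    """Classifier-free guidance: extrapolate from the unconditional noise
    prediction toward the conditional one."""
    eps_uncond = unet(x_t, t, null_emb)  # prediction with a null/empty embedding
    eps_cond = unet(x_t, t, cond_emb)    # prediction with the CLIP text or CLAP audio embedding
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

The same routine serves both settings of the ablation: a CLIP text embedding drives the text-only baselines, and a CLAP audio embedding drives the audio-conditioned result.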


Additional Results


Audio: 🔊 (Fire Crackling)
Text: 💬 "A vase"

Ours:

Audio: 🔊 (Underwater)
Text: 💬 "A shoes"

Ours:

Audio: 🔊 (Forest)
Text: 💬 "A cup"

Ours:

Audio: 🔊 (Splashing water)
Text: 💬 "A Chair"

Ours:

Combining a single ambient audio file with a simple text prompt, our system generates 3D meshes whose shape follows the prompt and whose texture reflects the sound.


Q&A


  1. Why Gaussian Splatting instead of NeRFs?

    Gaussian Splatting offers a good balance of quality and efficiency in training time, interpretability, and compute. No one wants to wait 5+ hours to create a simple 3D object. Using 3D-GS as the 3D representation allows fast optimization and easy manipulation of 3D objects, within about 2 minutes and 12 GB of VRAM.

    At the same time, the explicit nature of 3D-GS lets us update attribute-specific parameters with separate objective functions. Shape attributes can therefore be varied by external information (here, the text prompt) while the visual effect representing the audio condition is preserved. A minimal sketch of this idea follows.
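    The sketch below assumes a PyTorch-style setup; GaussianParams, render, sds_loss_text, and sds_loss_audio are hypothetical names for illustration, not our released implementation.

import torch


class GaussianParams(torch.nn.Module):
    """Explicit 3D-GS attributes, grouped into shape and appearance parameters."""

    def __init__(self, n: int):
        super().__init__()
        # Shape attributes: positions, log-scales, rotations (quaternions), opacities.
        self.xyz = torch.nn.Parameter(0.1 * torch.randn(n, 3))
        self.log_scale = torch.nn.Parameter(torch.full((n, 3), -3.0))
        self.rotation = torch.nn.Parameter(
            torch.cat([torch.ones(n, 1), torch.zeros(n, 3)], dim=1))
        self.opacity = torch.nn.Parameter(torch.zeros(n, 1))
        # Appearance attribute: per-Gaussian color (SH degree 0).
        self.color = torch.nn.Parameter(torch.rand(n, 3))

    def shape_params(self):
        return [self.xyz, self.log_scale, self.rotation, self.opacity]

    def appearance_params(self):
        return [self.color]


def decoupled_step(gs, render, sds_loss_text, sds_loss_audio,
                   opt_shape, opt_app, camera):
    """One step: text guidance updates shape, audio guidance updates appearance."""
    # Shape update: score a rendering with the text-conditioned SDS objective
    # and step only the geometry parameters.
    opt_shape.zero_grad()
    opt_app.zero_grad()
    sds_loss_text(render(gs, camera)).backward()
    opt_shape.step()

    # Appearance update: score a fresh rendering with the audio-conditioned SDS
    # objective (e.g., a CLAP embedding fed to an audio-to-image diffusion model)
    # and step only the color parameters, so geometry stays fixed.
    opt_shape.zero_grad()
    opt_app.zero_grad()
    sds_loss_audio(render(gs, camera)).backward()
    opt_app.step()

    In practice, opt_shape and opt_app would be built as torch.optim.Adam(gs.shape_params(), ...) and torch.optim.Adam(gs.appearance_params(), ...), so each SDS objective can only move its own attribute group.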


  2. Why SDS?

    At the time of this research, SDS-based 3D generation methods (e.g., DreamFusion, DreamGaussian) offered substantial flexibility: they enable 3D content creation from a single condition using pretrained 2D diffusion models, without 3D-aware training or paired cross-modal data. Thanks to this property, we can realize an audio-to-3D system on top of a pretrained audio-to-image diffusion model to obtain a 3D mesh.
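    For reference, the SDS gradient introduced in DreamFusion is

        \nabla_{\theta}\, \mathcal{L}_{\mathrm{SDS}}(\phi,\, x = g(\theta))
            = \mathbb{E}_{t,\epsilon}\Big[\, w(t)\, \big(\hat{\epsilon}_{\phi}(x_t;\, y,\, t) - \epsilon\big)\, \tfrac{\partial x}{\partial \theta} \Big],

    where x = g(θ) is a rendering of the 3D representation from a random camera, x_t its noised version, ε̂_φ the frozen 2D diffusion model's noise prediction, w(t) a timestep weighting, and y the conditioning embedding. Because y enters only through the frozen 2D model, it can be a CLIP text embedding or a CLAP audio embedding, with no 3D-aware or paired audio-3D training required.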


References


  1. (SDS) DreamFusion: https://dreamfusion3d.github.io/
  2. (3DGS) 3D Gaussian Splatting: https://repo-sam.inria.fr/fungraph/3d-gaussian-splatting/
  3. (SDS + 3DGS) DreamGaussian: https://dreamgaussian.github.io/
  4. (Text-to-Multi-view Image) MVDream: https://mv-dream.github.io/
  5. (Audio-to-Image) SonicDiffusion: https://cyberiada.github.io/SonicDiffusion/
  6. (Conditional Sampling) Classifier-free Guidance: https://arxiv.org/abs/2207.12598
  7. (Audio Embedding) CLAP: https://arxiv.org/pdf/2206.04769
  8. (Text Embedding) CLIP: https://arxiv.org/pdf/2103.00020