Sounds That Shape: Audio-Driven 3D Mesh Generation with Attribute-Decoupled Score Distillation Sampling

Anonymous Author(s)

"Imagine What Shape Sound Could Take"


tl;dr: We introduce an audio-driven 3D mesh and texture generation system that leverages pretrained 2D diffusion models to produce a textured mesh from a single audio input within 2 minutes.


Method Overview



SFX-to-3D Generation Results


Audio: 🔊 (Fire Crackling)

Audio: 🔊 (Forest)

Audio: 🔊 (Forest)

Audio: 🔊 (Underwater Bubbling)

A23D Result:

Audio: 🔊 (Snow)

A23D Result:

Our system performs audio-driven 3D mesh and texture generation with pretrained 2D diffusion models from only a single audio file.


Ablation: Cross-Modality 3D Generation


Audio: 🔊 (Null)
Text: 💬 "A chair with fire crackling effect"

Text-to-2D:

Audio: 🔊 (Null)
Text: 💬 "A chair with fire crackling effect"

Text-to-3D:

Audio: 🔊 (Fire Crackling)
Text: 💬 "A Chair"

(Naïve) Audio-Driven Text-to-3D:

Audio: 🔊 (Fire Crackling)
Text: 💬 "A Chair"

Ours:

Audio embeddings express the SFX of ambient sounds better than text embeddings for 3D generation.
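The comparison contrasts conditioning the pretrained 2D diffusion model on a text embedding (CLIP) versus an audio embedding (CLAP); in both cases the condition typically enters the sampler through classifier-free guidance. Below is a minimal sketch of that guidance step; the unet callable and embedding arguments are placeholder names for illustration, not a specific released model.

def cfg_noise_pred(unet, x_t, t, cond_emb, null_emb, guidance_scale=7.5):
    """Classifier-free guidance: extrapolate from the unconditional noise
    prediction toward the conditional one."""
    eps_uncond = unet(x_t, t, null_emb)  # prediction with a null/empty embedding
    eps_cond = unet(x_t, t, cond_emb)    # prediction with the CLIP text or CLAP audio embedding
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

The same routine serves both settings of the ablation: a CLIP text embedding drives the text-only baselines, and a CLAP audio embedding drives the audio-conditioned result.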


Additional Results


Audio: 🔊 (Fire Crackling)
Text: 💬 "A vase"

Ours:

Audio: 🔊 (Underwater)
Text: 💬 "A shoes"

Ours:

Audio: 🔊 (Forest)
Text: 💬 "A cup"

Ours:

Audio: 🔊 (Splashing water)
Text: 💬 "A Chair"

Ours:

Combining a single ambient audio file with a simple text prompt, our system generates 3D meshes whose shape follows the prompt and whose texture reflects the sound.


Q&A


  1. Why Gaussian Splatting instead of NeRFs?

    Gaussian Splatting offers a good balance of quality and efficiency in training time, interpretability, and compute. No one wants to wait 5+ hours to create a simple 3D object. Using 3D-GS as the 3D representation allows fast optimization and easy manipulation of 3D objects, within about 2 minutes and 12 GB of VRAM.

    At the same time, the explicit nature of 3D-GS lets us update attribute-specific parameters with separate objective functions. Shape attributes can therefore be varied by external information (here, the text prompt) while the visual effect representing the audio condition is preserved. A minimal sketch of this idea follows.
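    The sketch below assumes a PyTorch-style setup; GaussianParams, render, sds_loss_text, and sds_loss_audio are hypothetical names for illustration, not our released implementation.

import torch


class GaussianParams(torch.nn.Module):
    """Explicit 3D-GS attributes, grouped into shape and appearance parameters."""

    def __init__(self, n: int):
        super().__init__()
        # Shape attributes: positions, log-scales, rotations (quaternions), opacities.
        self.xyz = torch.nn.Parameter(0.1 * torch.randn(n, 3))
        self.log_scale = torch.nn.Parameter(torch.full((n, 3), -3.0))
        self.rotation = torch.nn.Parameter(
            torch.cat([torch.ones(n, 1), torch.zeros(n, 3)], dim=1))
        self.opacity = torch.nn.Parameter(torch.zeros(n, 1))
        # Appearance attribute: per-Gaussian color (SH degree 0).
        self.color = torch.nn.Parameter(torch.rand(n, 3))

    def shape_params(self):
        return [self.xyz, self.log_scale, self.rotation, self.opacity]

    def appearance_params(self):
        return [self.color]


def decoupled_step(gs, render, sds_loss_text, sds_loss_audio,
                   opt_shape, opt_app, camera):
    """One step: text guidance updates shape, audio guidance updates appearance."""
    # Shape update: score a rendering with the text-conditioned SDS objective
    # and step only the geometry parameters.
    opt_shape.zero_grad()
    opt_app.zero_grad()
    sds_loss_text(render(gs, camera)).backward()
    opt_shape.step()

    # Appearance update: score a fresh rendering with the audio-conditioned SDS
    # objective (e.g., a CLAP embedding fed to an audio-to-image diffusion model)
    # and step only the color parameters, so geometry stays fixed.
    opt_shape.zero_grad()
    opt_app.zero_grad()
    sds_loss_audio(render(gs, camera)).backward()
    opt_app.step()

    In practice, opt_shape and opt_app would be built as torch.optim.Adam(gs.shape_params(), ...) and torch.optim.Adam(gs.appearance_params(), ...), so each SDS objective can only move its own attribute group.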


  2. Why SDS?

    At the time of this research, SDS-based 3D generation methods (e.g., DreamFusion, DreamGaussian) offered substantial flexibility: they enable 3D content creation from a single condition using pretrained 2D diffusion models, without 3D-aware training or paired cross-modal data. Thanks to this property, we can realize an audio-to-3D system on top of a pretrained audio-to-image diffusion model to obtain a 3D mesh.
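    For reference, the SDS gradient introduced in DreamFusion is

        \nabla_{\theta}\, \mathcal{L}_{\mathrm{SDS}}(\phi,\, x = g(\theta))
            = \mathbb{E}_{t,\epsilon}\Big[\, w(t)\, \big(\hat{\epsilon}_{\phi}(x_t;\, y,\, t) - \epsilon\big)\, \tfrac{\partial x}{\partial \theta} \Big],

    where x = g(θ) is a rendering of the 3D representation from a random camera, x_t its noised version, ε̂_φ the frozen 2D diffusion model's noise prediction, w(t) a timestep weighting, and y the conditioning embedding. Because y enters only through the frozen 2D model, it can be a CLIP text embedding or a CLAP audio embedding, with no 3D-aware or paired audio-3D training required.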


References


  1. (SDS) DreamFusion: https://dreamfusion3d.github.io/
  2. (3DGS) 3D Gaussian Splatting: https://repo-sam.inria.fr/fungraph/3d-gaussian-splatting/
  3. (SDS + 3DGS) DreamGaussian: https://dreamgaussian.github.io/
  4. (Text-to-Multi-view Image) MVDream: https://mv-dream.github.io/
  5. (Audio-to-Image) SonicDiffusion: https://cyberiada.github.io/SonicDiffusion/
  6. (Conditional Sampling) Classifier-free Guidance: https://arxiv.org/abs/2207.12598
  7. (Audio Embedding) CLAP: https://arxiv.org/pdf/2206.04769
  8. (Text Embedding) CLIP: https://arxiv.org/pdf/2103.00020