Revolutionizing 3D Model Generation with MVDream AI

Published on: 12-09-2023
Author: Product Minting
Category: How to
https://cdn.aisys.pro/stories/1694484054333-2655.jpg

In this video, we're diving deep into the groundbreaking technology of MVDream, a game-changing AI model that's redefining what's possible in the realm of 3D modeling.


Before we jump right into this exploration, let's set the stage. Text-to-3D is a cutting-edge field where AI takes ordinary text descriptions and transforms them into astonishing 3D objects, bridging the imaginative power of words with the visual impact of 3D modeling. It's like bringing your wildest ideas to life in a tangible form (not just images!), and MVDream is leading the way in this transformative arena.


Artificial Intelligence (AI) has been making giant strides in recent years, from generating text to crafting lifelike images and even videos. But the real challenge lies in turning simple textual descriptions into intricate 3D models, complete with all the fine details and realistic features that our world is composed of. That's where MVDream steps in, not as just another initial step but as a monumental leap forward in 3D model generation from text.


In this video, we'll unveil the astounding capabilities of MVDream. You'll witness how this AI understands the physics of 3D modeling like never before, creating high-quality, realistic 3D objects from nothing more than a short sentence. We'll showcase its prowess in comparison with other approaches, highlighting its ability to produce spatially coherent, real-world objects, free from the quirks of past models.


But how does MVDream work its magic? We'll delve into the inner workings of this AI powerhouse, exploring its architecture and the ingenious techniques it employs. You'll discover its evolution from a 2D image diffusion model to a multi-view diffusion model, enabling it to generate multiple views of an object and, most importantly, to ensure the consistency of the 3D models it creates.


Of course, no technology is without its limitations, and MVDream is no exception. We'll also discuss its current constraints, including resolution and dataset size. These are crucial factors that impact the applicability of this impressive technology.

So, join us on this exciting journey as we unravel the mysteries behind MVDream, a model that's rewriting the rules of 3D modeling, and explore the incredible possibilities it offers. If you're curious about the intersection of AI, 3D modeling, and the future of creativity, this video is a must-watch. Let's dive right in!


I'm super excited to share this new AI model with you. We've seen so many new approaches to generating text, then to generating images, which only kept getting better. After that, we've seen other amazing initial works for generating videos and even 3D models out of text. Just imagine the complexity of such a task: all you have is a sentence, and you need to generate something that could look like a real object in our real world, with all its details. Well, here's a new model that is not merely an initial step; it's a huge step forward in 3D model generation from just text: MVDream. As you can see, MVDream seems to understand physics compared to previous approaches. It knows that the views should be realistic, with only two ears in total rather than two ears added for every possible viewpoint, and it ends up creating a very high-quality 3D model out of just this simple line of text. How cool is this? But what's even cooler is how it works, so let's dive into it.

But before doing so, let me introduce a super cool company sponsoring the video with another application of artificial intelligence: voice synthesis. Introducing Kits.AI, a platform for artists, producers, and fans to create AI voice models with ease and even create monetizable work with licensed AI voice models of your favorite artists. Kits.AI offers a library of licensed artist voices, a royalty-free library, and a community library with voice models of characters and celebrities created by the users. You can even train your own voice with one click: simply provide audio files of the voice you want to replicate, and Kits.AI will create an AI voice model for you to use, with no back-end knowledge required. Generate voice model conversions by providing an a cappella file, recording audio manually, or even inputting a YouTube link for easy vocal separation, which is pretty cool since I can do it pretty easily. Get started with Kits.AI using the first link in the description right now.

[00:02:06] : [00:02:10]

Now, let's get back to the 3D world. If you look at 3D models, the biggest challenge is that you need to generate realistic, high-quality images for every view from which the object might be seen, and those views have to be spatially coherent with each other, not like the four-eared Yoda we previously saw or the multi-faced subjects we often see. Since we rarely have people photographed from the back in any image dataset, the model tends to put faces everywhere. That's because one of the main approaches to generating 3D models is to simulate a viewing angle from a camera and then generate what it should be seeing from that viewpoint. This is called 2D lifting, since we generate regular 2D images and combine them into a full 3D scene, generating all possible views from around the object. That is why we are used to seeing weird artifacts like these: the model is just trying to generate one view at a time and doesn't understand the overall object well enough in 3D space.

Well, MVDream made a huge step in this direction. They tackled what we call the 3D consistency problem, and even claim to have solved it, using a technique called score distillation sampling, introduced by DreamFusion, another text-to-3D method published in late 2022 that I covered on the channel. By the way, if you enjoy this video and these kinds of new technologies, you should definitely subscribe; I cover new approaches like this one every week on the channel.

Before getting into the score distillation sampling technique, we need to know about the architecture they are using. In short, it's yet another 2D image diffusion model, like DALL-E, Midjourney, or Stable Diffusion. More specifically, they started with a pre-trained DreamBooth model, a powerful open-source model for generating images based on Stable Diffusion that I already covered on the channel. The change they made was to render a set of multi-view images directly, instead of only one image, thanks to training on a 3D dataset of various objects. Here, we take multiple views of the 3D objects in our dataset and use them to train the model to generate them back. This is done by swapping the self-attention block you see here in blue for a 3D one, meaning that we simply add a dimension so the model reconstructs multiple images at a time instead of one. Below, you can see the camera parameters and timestep that are also fed into the model for each view, helping the model understand which image goes where and what kind of view needs to be generated. Now all the images are connected and generated together, so they can share information and better understand the global content. Then you feed it your text and train the model to reconstruct the objects from the dataset accurately.
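Here is a minimal PyTorch sketch of that idea: self-attention is applied across all views at once, and a camera plus timestep embedding is added per view. The module name, dimensions, and conditioning details are illustrative assumptions, not the authors' implementation.

```python
# A minimal sketch of multi-view ("3D") self-attention: instead of
# attending only within one image, tokens from all views attend to each
# other, and a camera + timestep embedding is added per view. Shapes and
# names are assumptions for illustration only.
import torch
import torch.nn as nn

class MultiViewSelfAttention(nn.Module):
    def __init__(self, dim: int = 320, heads: int = 8, cam_dim: int = 16):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cam_proj = nn.Linear(cam_dim, dim)   # embeds per-view camera parameters
        self.time_proj = nn.Linear(1, dim)        # embeds the diffusion timestep

    def forward(self, x, cameras, t):
        # x:       (batch, views, tokens, dim)  latent image tokens per view
        # cameras: (batch, views, cam_dim)      flattened camera parameters
        # t:       (batch, 1)                   diffusion timestep
        b, v, n, d = x.shape
        cond = self.cam_proj(cameras) + self.time_proj(t).unsqueeze(1)  # (b, v, dim)
        x = x + cond.unsqueeze(2)                 # tell each view which viewpoint it is
        x = x.reshape(b, v * n, d)                # merge views: cross-view attention
        x, _ = self.attn(x, x, x)                 # every token sees every view
        return x.reshape(b, v, n, d)

# Tiny smoke test with 4 views of 64 tokens each.
block = MultiViewSelfAttention()
out = block(torch.randn(2, 4, 64, 320), torch.randn(2, 4, 16), torch.rand(2, 1))
print(out.shape)  # torch.Size([2, 4, 64, 320])
```

Because every token can attend to tokens from every other view, the generated views share information, which is what keeps them spatially consistent with each other.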

This is where they apply the multi-view score distillation sampling process I mentioned. They now have a multi-view diffusion model which can generate, well, multiple views of an object, but they need to reconstruct consistent 3D models, not just views. This is often done using NeRF, or Neural Radiance Fields, as in DreamFusion, which we mentioned earlier. It basically takes the trained multi-view diffusion model we have and freezes it, meaning that it is just being used and not trained further. We start by generating an initial image version guided by our caption and an initial rendering with added noise, using our multi-view diffusion model. We add noise so that the model knows it needs to generate a different version of the image while still receiving context for it. Then we use the model to generate a higher-quality image, add the image used to generate it, and remove the noise we manually added, using this result to guide and improve our NeRF model for the next step. We do all of that so the NeRF model better understands where in the image it should focus to produce better results at the next step, and we repeat this until the 3D model is satisfying enough.
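A condensed sketch of this optimization loop, with placeholder networks standing in for the real NeRF and the frozen multi-view diffusion model (text conditioning and the proper noise schedule are omitted for brevity), looks roughly like this:

```python
# A rough sketch of a score-distillation-sampling (SDS) loop: render views
# from a NeRF, add noise, let the frozen diffusion model predict that noise,
# and use the difference as a gradient signal for the NeRF. All modules here
# are placeholders, not the real networks.
import torch
import torch.nn as nn

nerf = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, 3))  # stand-in 3D model
diffusion = nn.Linear(3, 3)                      # stand-in frozen multi-view diffusion model
for p in diffusion.parameters():
    p.requires_grad_(False)                      # frozen: used, never trained

optimizer = torch.optim.Adam(nerf.parameters(), lr=1e-3)

def render_views(model, num_views=4):
    # Placeholder "renderer": a real pipeline ray-marches the NeRF from several
    # camera poses; here we just map random 3D points to colors.
    points = torch.rand(num_views, 1024, 3)
    return model(points)

for step in range(100):
    rendering = render_views(nerf)               # current guess of the 3D object
    noise = torch.randn_like(rendering)
    noisy = rendering + noise                    # noise forces a "different version"
    with torch.no_grad():
        predicted_noise = diffusion(noisy)       # frozen model; conditioned on the caption in the real system
    # SDS signal: push the rendering toward what the diffusion model expects.
    grad = (predicted_noise - noise).detach()
    loss = (grad * rendering).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```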

And voilà! This is how they took a 2D text-to-image model, adapted it for multi-view synthesis, and finally used it iteratively to create a text-to-3D model. Of course, they added many technical improvements to the approaches they built on, which I did not go into for simplicity, but if you are curious, I definitely invite you to read their great paper for more information. There are also still some limitations to this new approach, mainly that the generations are only 256 by 256 pixels, which is quite a low resolution even though the results look incredible. They also mention that the size of the dataset for this task is a real limitation for the generalizability of the approach. This was an overview of MVDream, and thank you for watching. I will see you next time with another amazing paper!


References:

►Read the full article: https://www.louisbouchard.ai/mvdream/

►Shi et al., 2023: MVDream, https://arxiv.org/abs/2308.16512

►Project with more examples: https://mv-dream.github.io/

►Code (to come): https://github.com/MV-Dream/MVDream

►Twitter: https://twitter.com/Whats_AI

►My Newsletter (A new AI application explained weekly to your emails!): https://www.louisbouchard.ai/newsletter/

►Support me on Patreon: https://www.patreon.com/whatsai

►Join Our AI Discord: https://discord.gg/learnaitogether
