AI Stack for Text-to-Video Generation

In the recent past with the genesis of large language models, one thing that we critically debate around is whether AI-generated content like AI art, AI videos, etc will destroy the creativity of content creators or can enhance the quality and assist them. In my previous blog, I talked about how AI code-generation tools can help add value to the software development cycle, in this post I will mainly be highlighting how by using AI tools creating video content is not only getting easy and fast but also creative.

Before understanding how one can leverage current video and image models in generating creative and engaging content, it is important to understand what the current state looks like, we can layer the current AI video generation landscape as below:

Layers in AI Video Generation

Existing video editors trying to integrate AI into their workflow like Adobe, Canva, etc.
AI-based new-age video editing tools like Fliki.ai, unscreen.com, synthesia.ai, hourone.ai, etc.
Abstraction layer dedicated to single use case in video generation workflow, eg Midjourney helps in creating realistic images for videos, RunwayML provides a platform to convert image to video or image to image, Did helps in adding animation to image, and so on.
Model Layer which forms the base of the entire landscape, software teams can leverage this layer to customize for their use cases.

As we move above in layer flexibility to customize decreases while ease of use increases, for the context of this article we will be diving deep to understand how we can use the Abstraction layer in generating creative videos since this layer falls in the middle of flexibility and ease of use, and as an artist one needs the best of both worlds.

Building text-to-video pipeline

Before we learn how we can create a pipeline to generate text-to-video using the tools in the abstraction layer it is important to chalk down what will be steps to generate the video and what tools can be used in each layer.

AI Video Generation Workflow

AI text-to-video generation involves the following steps:

1. Generating scenes from the script using GPT prompts

The first step I did was to few-shot-prompt GPT to give out the Hindi script in the form of dialogues between the characters involved in every scene.

Input

Input

Output

Output

2. Generating images from the scenes

This is the crucial step and involves creating the images for the scene which was broken down from the script, it all boils down to how creatively we can express ourselves using the prompt guidelines of Midjourney, the example below mentions the prompt given to Midjourney to generate an image for a scene.

A cartoon of scene where Indian old hindu saint is asking for a help with Lord Cloud; Lord cloud is personified and have happiness on his face, the environment around is full of trees with dark clouds and lightening all around

Generating Images

3. Adding animation to the image

In case you need to add animation to the image you can use DiD or RunwayML to add character motion and scene animation.

4. Generating AI voice for the scene narration

In this step, you can generate the AI voice for the narration using eleven labs, generally, these are Text to speech narration models using behind which may sound a bit robotic but solve the purpose of generating voice, one can make it more expressive and realistic from eleven labs paid version, for this story I needed hindi voice narration for which Ai4Bharat Text to speech narration does a great job.

5. Stitching the video clips and syncing the voice

This is the last and simplest step to add the images in a video editor and sync the voice as per scene and narration timeline, tools like Canva, and Adobe Express do a great job here.

Rough Cost of Video Production

Above is the simplest breakdown of how you can quickly generate video from text using a few basic tools, for my example, I generated an almost ~ 3-minute video with 16 unique scenes, interesting would be to see the time and money I paid to generate this video:

Midjourney cost ~ $0.05/image - 16*0.05 = $0.8

RunwayML ~ $0.02/image - 16*0.02 = 0.32

Canva ~ Free of cost since not used their premium artifacts

Total Cost ~ $1 /video

Comparing it with the new-age AI video editors like Fliki which charges almost $28/month for 180 minutes of creation, which would cost ~ $0.5 for a video length mentioned above.

Need to bundle the offering

Although the final cost of generating the video in the AI-based video editors seems less compared to the total cost incurred by using tools like Midjourney, RunwayML, etc, with added cost these tools provide flexibility and creativity to a video content creator and can help in generating some amazing videos which can be comparable to an amazing scene of Hollywood movie, it seems that if these AI tools can be bundled and integrated with the workflow of video agency or video production houses they can produce maximum value in video production, as Justine Moore, Partner @a16z in this thread also reflects the same.

Justine Moore

Discussion (20)

Not yet any reply