Lumiere is a cutting-edge text-to-video model that uses a new technique to create realistic videos from short text inputs.
Google has unveiled a new text-to-video model capable of generating lifelike videos from short text inputs.
Lumiere creates videos that showcase realistic motion and can even use images and other videos as inputs to improve results. Unveiled in a paper titled ‘A Space-Time Diffusion Model for Video Generation,’ Lumiere works differently from existing video generation models: it generates the entire temporal duration of the video at once, whereas existing models synthesize distant keyframes and then fill in the intervening frames with temporal super-resolution.
Put simply, Lumiere models the motion of objects in the scene directly, whereas prior systems patch together a video from keyframes in which the movement has already happened, filling in the gaps afterward.
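To make the distinction concrete, here is a minimal Python sketch of the two strategies. It is a toy illustration, not Google's code: the function names and the keyframe stride are invented for this example, and "frames" are just integer timestamps rather than images.

```python
# Toy contrast between the keyframe + temporal super-resolution cascade
# and Lumiere-style single-pass generation. Frames are integer
# timestamps here; real models emit images.

NUM_FRAMES = 80  # Lumiere's reported clip length

def cascaded_keyframes(num_frames: int, keyframe_stride: int = 10):
    """Prior approach: synthesize distant keyframes, then a separate
    temporal super-resolution stage interpolates the frames between."""
    keyframes = list(range(0, num_frames, keyframe_stride))
    interpolated = [t for t in range(num_frames) if t not in keyframes]
    return keyframes, interpolated

def single_pass(num_frames: int):
    """Lumiere-style approach: every frame comes out of one pass of the
    model, so motion is modeled jointly across the whole clip."""
    return list(range(num_frames)), []

key, interp = cascaded_keyframes(NUM_FRAMES)
print(f"cascade: {len(key)} frames generated, {len(interp)} interpolated")
direct, _ = single_pass(NUM_FRAMES)
print(f"single pass: {len(direct)} frames generated jointly")
```

In the cascade, motion between keyframes is a by-product of interpolation; in the single-pass approach, it is modeled from the start, which is why the paper argues the latter yields more coherent movement.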
The model is capable of generating videos comprising 80 frames, roughly five seconds at 16 frames per second. For comparison, Stability’s Stable Video Diffusion clocks in at 14 and 25 frames. The more frames, the smoother the motion of the video.
Lumiere outperforms rival video generation models from the likes of Pika, Meta and Runway across various tests, including zero-shot trials, according to Google's team.
The researchers also contend that Lumiere's alternative approach yields state-of-the-art generation results. They claim Lumiere's outputs could be used in content creation and video editing tasks, including video inpainting and stylized generation (mimicking an artistic style it is shown), the latter by using fine-tuned text-to-image model weights.
To achieve its results, Lumiere leverages a new architecture called Space-Time U-Net (STUNet), which generates the entire temporal duration of the video at once, in a single pass through the model.
The Google team wrote that the novel approach improves consistency in outputs. “By deploying both spatial and (importantly) temporal down- and up-sampling and leveraging a pre-trained text-to-image diffusion model, our model learns to directly generate a full-framerate, low-resolution video by processing it in multiple space-time scales,” the paper reads.
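The following PyTorch sketch illustrates that down- and up-sampling idea in both time and space. It is a heavily simplified toy under stated assumptions (arbitrary channel counts, no attention blocks, no diffusion machinery, no pre-trained text-to-image backbone), not the STUNet described in the paper.

```python
# Minimal sketch of space-time down/up-sampling: the bottleneck sees a
# short, low-resolution clip, and the whole clip is produced in one pass.
import torch
import torch.nn as nn

class ToySpaceTimeUNet(nn.Module):
    def __init__(self, channels: int = 3, hidden: int = 32):
        super().__init__()
        # Downsample in BOTH time and space (stride 2 on T, H and W).
        self.down = nn.Conv3d(channels, hidden, kernel_size=3,
                              stride=2, padding=1)
        self.mid = nn.Conv3d(hidden, hidden, kernel_size=3, padding=1)
        # Upsample back to full frame rate and resolution.
        self.up = nn.ConvTranspose3d(hidden, channels, kernel_size=4,
                                     stride=2, padding=1)

    def forward(self, video: torch.Tensor) -> torch.Tensor:
        # video: (batch, channels, time, height, width)
        h = torch.relu(self.down(video))
        h = torch.relu(self.mid(h))
        return self.up(h)

clip = torch.randn(1, 3, 16, 64, 64)   # 16 frames of 64x64 RGB noise
out = ToySpaceTimeUNet()(clip)
print(out.shape)  # torch.Size([1, 3, 16, 64, 64]) — whole clip, one pass
```

The key design choice mirrored here is that the temporal axis is compressed and re-expanded inside the network itself, rather than being handled by a separate interpolation stage after generation.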
The goal of the Lumiere project was to build a system that makes it easier for novice users to create video content.
However, the paper acknowledges the risk of potential misuse, specifically warning that models like Lumiere could be used to create fake or harmful content.
“We believe that it is crucial to develop and apply tools for detecting biases and malicious use cases in order to ensure a safe and fair use,” the paper reads.
Google has not made the model available to the public at the time of writing. However, you can explore various example generations on the showcase page on GitHub.
Lumiere follows VideoPoet, a Google multimodal model that creates videos from text, video and image inputs. Unveiled last December, VideoPoet uses a decoder-only transformer architecture, which Google says lets it generate content it has not explicitly been trained on.
Google has developed several video generation models, including Phenaki and Imagen Video, and plans to extend its detection tool, SynthID, to cover AI-generated videos.
Google’s video work complements its Gemini foundation model, specifically the Pro Vision multimodal endpoint, which can take images and video as input and generate text as output.
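For developers, a minimal sketch of calling that endpoint through the google-generativeai Python SDK might look like the following. It assumes the SDK is installed (`pip install google-generativeai`), that a key is available in the GOOGLE_API_KEY environment variable, and it sends a single still frame; the file path is a placeholder, and video input is handled through the Vertex AI version of the endpoint rather than this one.

```python
# Sketch: describing an image with the gemini-pro-vision endpoint.
import os
import google.generativeai as genai
from PIL import Image

genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
model = genai.GenerativeModel("gemini-pro-vision")

frame = Image.open("frame.png")  # placeholder: e.g. one frame of a clip
response = model.generate_content(
    ["Describe what is happening in this frame.", frame]
)
print(response.text)
```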