Google's Lumiere: New AI Model that Creates Realistic Videos

Lumiere is a cutting-edge text-to-video model that uses a new technique to create realistic videos from short text inputs.

Ben Wodecki, Jr. Editor

January 29, 2024

[Image: Screen capture of an AI-generated video of a boat sailing on water. Credit: Google]

At a Glance

  • Google researchers came up with a new AI model that can generate realistic videos from short text inputs.
  • Lumiere works differently from existing video generation models, focusing on the movement of objects in the image.
  • It generates more frames per video than Stable Video Diffusion and can reimagine content from short prompts.

Google has unveiled a new text-to-video model capable of generating lifelike videos from short text inputs.

Lumiere creates videos that showcase realistic motion and can even use images and other videos as inputs to improve results. Unveiled in a paper titled ‘Lumiere: A Space-Time Diffusion Model for Video Generation,’ the model works differently from existing video generation models: it generates the entire temporal duration of the video at once, whereas existing models synthesize distant keyframes and then apply temporal super-resolution.

Put simply, Lumiere models the movement of objects across the whole clip, whereas prior systems patch together a video from keyframes in which the movement has already happened, filling in the frames between them afterward.
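To make the contrast concrete, here is a toy numpy sketch of the two strategies. It is purely illustrative, not Lumiere's code: the random arrays stand in for a diffusion model's outputs, and the naive linear interpolation stands in for a learned temporal super-resolution stage.

```python
import numpy as np

rng = np.random.default_rng(0)
H = W = 8  # tiny frames, just for shape bookkeeping

# Keyframe cascade: synthesize a few distant keyframes, then "temporal
# super-resolution" fills in the frames between them (linear interpolation
# here stands in for the learned upsampler).
keyframes = rng.random((5, H, W))      # 5 sparse keyframes
t_sparse = np.linspace(0.0, 1.0, 5)
t_dense = np.linspace(0.0, 1.0, 80)
cascade = np.empty((80, H, W))
for i in range(H):
    for j in range(W):
        cascade[:, i, j] = np.interp(t_dense, t_sparse, keyframes[:, i, j])

# Lumiere-style: all 80 frames come out of one pass, so the model decides
# the motion directly rather than reconstructing it between keyframes.
full_clip = rng.random((80, H, W))

print(cascade.shape, full_clip.shape)  # (80, 8, 8) (80, 8, 8)
```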

The model is capable of generating videos composed of 80 frames. For comparison, Stability AI’s Stable Video Diffusion clocks in at 14 or 25 frames, depending on the version. The more frames, the smoother the motion of the video.

Lumiere outperforms rival video generation models from the likes of Pika, Meta and Runway across various tests, including zero-shot trials, according to Google's team.

The researchers also contend that Lumiere produces state-of-the-art outputs as a result of its alternative approach. They claim Lumiere could be applied to content creation and video editing tasks, including video inpainting and stylized generation (mimicking artistic styles it is shown), by using fine-tuned text-to-image model weights.

Related: Google’s VideoPoet: A Multimodal Model that Generates Video, Audio


To achieve its results, Lumiere leverages a new architecture, Space-Time U-Net, which generates the entire temporal duration of the video at once, in a single pass through the model.

The Google team wrote that the novel approach improves consistency in outputs. “By deploying both spatial and (importantly) temporal down- and up-sampling and leveraging a pre-trained text-to-image diffusion model, our model learns to directly generate a full-framerate, low-resolution video by processing it in multiple space-time scales,” the paper reads.
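To illustrate the idea, below is a minimal PyTorch sketch of joint space-time down- and up-sampling, the mechanism the paper describes. It is not Google's implementation; the layer choices, channel counts and tensor sizes are illustrative assumptions.

```python
# Conceptual sketch of space-time down/up-sampling (not Lumiere's code).
import torch
import torch.nn as nn

class SpaceTimeDown(nn.Module):
    """Downsample a video tensor in time and space simultaneously."""
    def __init__(self, channels):
        super().__init__()
        # Stride 2 in all three dims halves temporal and spatial sizes at once.
        self.conv = nn.Conv3d(channels, channels * 2, kernel_size=3,
                              stride=2, padding=1)

    def forward(self, x):  # x: (batch, channels, time, height, width)
        return self.conv(x)

class SpaceTimeUp(nn.Module):
    """Upsample a video tensor in time and space simultaneously."""
    def __init__(self, channels):
        super().__init__()
        self.conv = nn.ConvTranspose3d(channels, channels // 2, kernel_size=4,
                                       stride=2, padding=1)

    def forward(self, x):
        return self.conv(x)

# The whole 80-frame clip is processed in one pass rather than frame by frame.
video = torch.randn(1, 64, 80, 32, 32)   # (batch, C, 80 frames, H, W)
coarse = SpaceTimeDown(64)(video)        # -> (1, 128, 40, 16, 16)
restored = SpaceTimeUp(128)(coarse)      # -> (1, 64, 80, 32, 32)
print(coarse.shape, restored.shape)
```

Working at these coarser space-time scales is what lets the model reason about motion across the entire clip in a single pass.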

The goal of the Lumiere project was to build a system that enables novice users to more easily create video content.

However, the paper acknowledges the risk of potential misuse, specifically warning that models like Lumiere could be used to create fake or harmful content.

“We believe that it is crucial to develop and apply tools for detecting biases and malicious use cases in order to ensure a safe and fair use,” the paper reads.

Google has not made the model available to the public at the time of writing. However, you can explore various example generations on the showcase page on GitHub.

Related: Text-to-Video Generative AI Models: The Definitive List

Google steps up video work

Lumiere follows VideoPoet, a Google-produced multimodal model that creates videos from text, video and image inputs. Unveiled last December, VideoPoet uses a decoder-only transformer architecture, making it capable of generating content it was not explicitly trained on.

Google has developed several video generation models, including Phenaki and Imagen Video, and plans to cover AI-generated videos with SynthID, its watermark-based detection tool.

Google’s video work complements its Gemini foundation model, specifically the Gemini Pro Vision multimodal endpoint, which can handle images and video as input while generating text as output.
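For a sense of what that endpoint looks like in practice, here is a minimal sketch using the google-generativeai Python SDK; the API key and image path are placeholders, and exact SDK details may vary by version.

```python
# Minimal sketch of calling the Gemini Pro Vision endpoint (illustrative;
# the API key and image path are placeholders).
import google.generativeai as genai
import PIL.Image

genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemini-pro-vision")

image = PIL.Image.open("frame.png")  # e.g., a still frame from a video
response = model.generate_content(["Describe what is happening here.", image])
print(response.text)
```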


About the Author(s)

Ben Wodecki

Jr. Editor

Ben Wodecki is the Jr. Editor of AI Business, covering a wide range of AI content. Ben joined the team in March 2021 as assistant editor and was promoted to Jr. Editor. He has written for The New Statesman, Intellectual Property Magazine, and The Telegraph India, among others. He holds an MSc in Digital Journalism from Middlesex University.
