On Tuesday, Google announced Lumiere, an AI video generator that it calls “a space-time diffusion model for realistic video generation” in the accompanying preprint paper. But let’s not kid ourselves: It does a great job of creating videos of cute animals in ridiculous scenarios, such as wearing roller skates, driving a car, or playing a piano. Sure, it can do more, but it is perhaps the most advanced text-to-animal AI video generator yet demonstrated.
According to Google, Lumiere uses a distinct architecture to generate a video’s entire temporal duration in one go. Or, as the company put it, “We introduce a Space-Time U-Net architecture that generates the entire temporal duration of the video at once, through a single pass in the model. This is in contrast to existing video models which synthesize distant keyframes followed by temporal super-resolution—an approach that inherently makes global temporal consistency difficult to achieve.”
In layperson’s terms, Google’s tech is designed to handle both the space (where things are in the video) and time (how things move and change throughout the video) aspects simultaneously. So, rather than generating a handful of widely spaced keyframes and then filling in the frames between them, it can create the entire video, from start to finish, in one pass.
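To make the contrast concrete, here is a toy Python sketch of which frames each approach produces at each stage. It is only an illustration of the scheduling idea, not Lumiere’s code: the function names and the keyframe stride of 16 are assumptions made up for this example, and the 80-frame clip length is the figure Google’s paper gives for its training videos.

```python
def cascaded_schedule(total_frames, keyframe_stride):
    """Toy model of a keyframe-first pipeline (hypothetical helper).

    Stage 1 produces sparse keyframes; stage 2 (temporal super-resolution)
    fills in every frame between them.
    """
    keyframes = list(range(0, total_frames, keyframe_stride))
    keyframe_set = set(keyframes)
    in_between = [i for i in range(total_frames) if i not in keyframe_set]
    return {"keyframes": keyframes, "temporal_super_resolution": in_between}


def single_pass_schedule(total_frames):
    """Toy model of generating the whole clip in one pass (hypothetical helper)."""
    return {"single_pass": list(range(total_frames))}


# 80 frames matches the clip length quoted from the paper;
# the stride of 16 is an arbitrary assumption for illustration.
cascaded = cascaded_schedule(total_frames=80, keyframe_stride=16)
single = single_pass_schedule(total_frames=80)

print(cascaded["keyframes"])                       # frames made in stage 1
print(len(cascaded["temporal_super_resolution"]))  # frames filled in later
print(len(single["single_pass"]))                  # frames made together
```

Under these assumed numbers, the cascaded schedule produces only five of the 80 frames in its first stage and fills in the remaining 75 afterward, while the single-pass schedule produces all 80 together, which is the property Google credits for better global temporal consistency.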
Lumiere can also do plenty of party tricks, which are laid out quite well with examples on Google’s demo page. For example, it can perform text-to-video generation (turning a written prompt into a video), convert still images into videos, generate videos in specific styles using a reference image, apply consistent video editing using text-based prompts, create cinemagraphs by animating specific regions of an image, and offer video inpainting capabilities (for example, it can change the type of dress a person is wearing).
In the Lumiere research paper, the Google researchers state that the AI model outputs five-second-long 1024×1024 pixel videos, which they describe as “low-resolution.” Despite those limitations, the researchers conducted a user study and claim that Lumiere’s outputs were preferred over those of existing AI video synthesis models.
As for training data, Google doesn’t say where it got the videos it fed into Lumiere, writing, “We train our T2V [text to video] model on a dataset containing 30M videos along with their text caption. [sic] The videos are 80 frames long at 16 fps (5 seconds). The base model is trained at 128×128.”
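The numbers in that passage are easy to check. The short Python sketch below only redoes the arithmetic from figures quoted in this article (80 frames, 16 fps, a 128×128 base model, and 1024×1024 outputs); the variable names are ours, and it says nothing about how Lumiere actually gets from the base resolution to the final one.

```python
# Figures quoted from the Lumiere paper in this article.
frames_per_clip = 80
frames_per_second = 16
base_side = 128      # base model training resolution (pixels per side)
output_side = 1024   # reported output resolution (pixels per side)

# Clip duration: frames divided by frame rate.
duration_seconds = frames_per_clip / frames_per_second

# Gap between base and output resolution, per side and in total pixels.
upscale_per_side = output_side // base_side
pixel_ratio = (output_side * output_side) // (base_side * base_side)

print(duration_seconds)   # 5.0
print(upscale_per_side)   # 8
print(pixel_ratio)        # 64
```

So the quoted five-second duration follows directly from 80 frames at 16 fps, and each output frame holds 64 times as many pixels as a frame at the base model’s training resolution.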
AI-generated video is still in a primitive state, but it has been improving in quality over the past two years. In October 2022, we covered Google’s first publicly unveiled video synthesis model, Imagen Video. It could generate short 1280×768 video clips from a written prompt at 24 fps, but the results weren’t always coherent. Before that, Meta debuted its AI video generator, Make-A-Video. In June of last year, Runway’s Gen-2 video synthesis model enabled the creation of two-second video clips from text prompts, fueling the creation of surrealistic parody commercials. And in November, we covered Stable Video Diffusion, which can generate short clips from still images.
AI companies often demonstrate video generators with cute animals because generating coherent, non-deformed humans is currently difficult, especially since we, as humans (you are human, right?), are adept at noticing any flaws in human bodies or how they move. Just look at AI-generated Will Smith eating spaghetti.
Judging by Google’s examples (and not having used it ourselves), Lumiere appears to surpass these other AI video generation models. But since Google tends to keep its AI research models close to its chest, we’re not sure when, if ever, the public may have a chance to try it for themselves.
As always, whenever we see text-to-video synthesis models getting more capable, we can’t help but think of the future implications for our Internet-connected society, which is centered on sharing media artifacts and which generally assumes that “realistic” video depicts real objects in real situations captured by a camera. Future video synthesis tools more capable than Lumiere will make deceptive deepfakes trivially easy to create.
To that end, in the “Societal Impact” section of the Lumiere paper, the researchers write, “Our primary goal in this work is to enable novice users to generate visual content in an creative and flexible way. [sic] However, there is a risk of misuse for creating fake or harmful content with our technology, and we believe that it is crucial to develop and apply tools for detecting biases and malicious use cases in order to ensure a safe and fair use.”