In the rapidly evolving landscape of AI, generating high-quality video content has become increasingly critical for creators. This fall, Adobe released the new Firefly Video Model, a beta technology developed with major contributions from Adobe Research that enables the generation of creative, relevant, and high-quality video content. This work empowers creators to produce imaginative videos with cinematic movements, thanks to advancements in camera dynamics.
In addition, a joint effort between Adobe Research and the Firefly team developed the new Generative Extend (GenExtend) feature in Premiere Pro, which generates novel video frames to add to the beginning or end of an existing video. In keeping with Adobe’s commitment to ethical AI, Firefly Video is the first publicly available video model designed to be safe for commercial use.
Novel architecture, improved text prompt matching, and lifelike scene dynamics
A significant milestone in this journey is the text-to-video generation work of Adobe Research’s Foundation Model Team. The team introduced a novel transformer architecture, redesigned the variational autoencoder (VAE), and enhanced the learning of scene dynamics. These innovations have greatly improved text-to-visual alignment—that is, the system’s capacity to match text prompts to the visual outcome in a video—and the ability to generate videos with realistic camera motions and lifelike scene dynamics. This model now serves as the foundation for the beta release of the Firefly Video Model and GenExtend.
The team’s new transformer architecture directly addresses the challenges of text-to-video generation by significantly improving text-to-visual alignment. This diffusion transformer architecture features a unique design that enhances the fusion of information across different modalities, which allows text prompts to be reflected in more relevant video content. This enhancement enables the model to produce videos that are coherent, realistic, and closely aligned with even long and detailed input prompts. The result is content that is both accurate and visually engaging.
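Adobe has not published the architecture’s details, but as a rough illustration, here is a minimal sketch of one common way a diffusion transformer block can fuse information across modalities: self-attention over video tokens followed by cross-attention to the text-prompt tokens. All names, dimensions, and design choices below are illustrative assumptions, not the Firefly model’s actual design.

```python
import torch
import torch.nn as nn

class CrossModalBlock(nn.Module):
    """Hypothetical transformer block fusing video and text tokens.

    Self-attention mixes the video tokens spatio-temporally; cross-attention
    then lets every video token attend to the text-prompt tokens, which is
    where text-to-visual alignment is enforced.
    """

    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm3 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, video_tokens: torch.Tensor, text_tokens: torch.Tensor):
        h = self.norm1(video_tokens)
        x = video_tokens + self.self_attn(h, h, h, need_weights=False)[0]
        # Video tokens (queries) attend to text tokens (keys/values).
        h = self.norm2(x)
        x = x + self.cross_attn(h, text_tokens, text_tokens, need_weights=False)[0]
        return x + self.mlp(self.norm3(x))

# Toy usage: a clip flattened into 1,024 patch tokens, a 77-token prompt.
block = CrossModalBlock()
out = block(torch.randn(1, 1024, 512), torch.randn(1, 77, 512))
```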
The redesigned VAE—the system’s internal representation of video data—further enhances the model’s capabilities. With its simplified structure, the novel VAE seamlessly represents both image and video data while maintaining high fidelity and preserving intricate details. This design also facilitates efficient knowledge transfer from the image domain to the video domain, ensuring consistency and quality in generated videos. Paired with a novel training strategy proposed by the team, the model efficiently learns visual concepts and captures high-quality dynamics.
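As an illustration of how a single latent space can serve both modalities, here is a minimal sketch of a VAE that treats an image as a one-frame video, so knowledge learned on images transfers directly to video. The layer choices and shapes are hypothetical, and a production model would also mix information across time.

```python
import torch
import torch.nn as nn

class JointVideoVAE(nn.Module):
    """Hypothetical sketch: one VAE for images and videos.

    3D convolutions with a temporal kernel of 1 accept both a one-frame
    "video" (an image) and a full clip; a real model would also include
    temporal mixing to capture dynamics.
    """

    def __init__(self, latent_channels: int = 8):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv3d(3, 64, (1, 4, 4), stride=(1, 2, 2), padding=(0, 1, 1)),
            nn.SiLU(),
            nn.Conv3d(64, 2 * latent_channels, (1, 4, 4), stride=(1, 2, 2), padding=(0, 1, 1)),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose3d(latent_channels, 64, (1, 4, 4), stride=(1, 2, 2), padding=(0, 1, 1)),
            nn.SiLU(),
            nn.ConvTranspose3d(64, 3, (1, 4, 4), stride=(1, 2, 2), padding=(0, 1, 1)),
        )

    def forward(self, x: torch.Tensor):  # x: (batch, 3, frames, height, width)
        mean, logvar = self.encoder(x).chunk(2, dim=1)
        z = mean + torch.randn_like(mean) * torch.exp(0.5 * logvar)  # reparameterize
        return self.decoder(z), mean, logvar

vae = JointVideoVAE()
image = torch.randn(1, 3, 1, 64, 64)   # an image is a one-frame "video"
video = torch.randn(1, 3, 16, 64, 64)  # a 16-frame clip
for x in (image, video):
    recon, _, _ = vae(x)
    assert recon.shape == x.shape
```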
In addition, this technology is supported by custom-made infrastructure that manages and serves data at scale. This infrastructure enables high-throughput pipelines for training the system, which has led to important gains in training efficiency.
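The infrastructure itself is not described in the post, but one common high-throughput pattern is to shard pre-packed data across loader workers so that decoding overlaps with GPU training. A minimal sketch of that general pattern, with all names and shapes assumed:

```python
import torch
from torch.utils.data import DataLoader, IterableDataset, get_worker_info

class ShardedClipStream(IterableDataset):
    """Hypothetical sketch: each worker streams a disjoint subset of
    pre-packed video shards, so no two workers decode the same file."""

    def __init__(self, shard_paths):
        self.shard_paths = shard_paths

    def __iter__(self):
        info = get_worker_info()
        wid = info.id if info else 0
        nworkers = info.num_workers if info else 1
        for path in self.shard_paths[wid::nworkers]:
            yield from self._decode_shard(path)

    def _decode_shard(self, path):
        # Placeholder: a real pipeline would decode video clips from `path`.
        for _ in range(4):
            yield torch.randn(3, 16, 64, 64)

# Iterating this loader yields batches of shape (2, 3, 16, 64, 64),
# with four workers decoding shards in parallel with training.
loader = DataLoader(
    ShardedClipStream([f"shard-{i}.tar" for i in range(8)]),
    batch_size=2, num_workers=4,
)
```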
By innovating both the transformer architecture and the VAE, and introducing an improved training methodology, the Foundation Model Team has advanced the capabilities of text-to-video generation. Their work not only enhances current possibilities but also paves the way for creative and professional applications in video production.
Seamless GenExtend in Premiere Pro
Building on the novel transformer architecture, the new Generative Extend feature in Premiere Pro is the first video generation feature Adobe has shipped that integrates into existing creative video workflows. This feature extends the duration of a video clip by two seconds at either its start or end by generating novel video frames. Unlike text-to-video generation, GenExtend faces the challenge of seamlessly juxtaposing generated and captured content. Even the slightest difference between the two can cause temporal coherency artifacts such as popping or jitter.
The team, a collaboration between Firefly Video and Adobe Research’s Video AI Lab, took up the challenge. Video generation models are typically trained by first heavily compressing video into a latent space and then decoding it back. However, the difference between original and generated content after compression can be noticeable when the two are directly juxtaposed. An additional issue arises because the technology down-samples the original video, generates the extra frames, and then up-samples them using a generative super-resolution algorithm. This down- and up-sampling can cause the loss of high-frequency texture, which can look like sudden blurring when transitioning between original and generated frames.
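The mismatch is easy to demonstrate with a stand-in for the lossy roundtrip. The sketch below substitutes plain down- and up-sampling for a real latent codec (a simplifying assumption) and measures the residual that would show up as a blur "pop" at the seam:

```python
import torch
import torch.nn.functional as F

def psnr(a: torch.Tensor, b: torch.Tensor) -> float:
    """Peak signal-to-noise ratio between two frames with values in [0, 1]."""
    mse = torch.mean((a - b) ** 2)
    return (10 * torch.log10(1.0 / mse)).item()

frame = torch.rand(1, 3, 256, 256)  # stand-in for a captured frame

# Stand-in for the lossy compression roundtrip: shrink 8x, then restore.
roundtrip = F.interpolate(
    F.interpolate(frame, scale_factor=0.125, mode="bilinear"),
    scale_factor=8, mode="bilinear",
)

# The residual is the high-frequency texture lost in the roundtrip; at a
# seam between captured and generated frames it appears as a sudden blur.
residual = (frame - roundtrip).abs().mean().item()
print(f"PSNR after roundtrip: {psnr(frame, roundtrip):.1f} dB")
print(f"mean texture residual: {residual:.4f}")
```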
The team that created GenExtend addressed this problem with reference-guided decoding and super-resolution. They modified the models that convert latent representations to video frames and that super-resolve generated frames so that each considers an additional input: a reference video frame. During these decoding and super-resolving steps, the method uses the nearest original video frame as the reference, significantly increasing the visual quality and seamlessness of the generated video. As a final step, the technology uses traditional contrast and brightness normalization to further align generated and original video.
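The reference-conditioned decoder itself has not been released, but the final normalization step is straightforward to illustrate. Below is a minimal sketch of matching the generated frames’ per-channel brightness (mean) and contrast (standard deviation) to the nearest original frame; the function name, shapes, and clamping are our assumptions, not Adobe’s exact method.

```python
import torch

def match_brightness_contrast(generated: torch.Tensor,
                              reference: torch.Tensor,
                              eps: float = 1e-6) -> torch.Tensor:
    """Align generated frames to a reference frame by matching per-channel
    mean (brightness) and standard deviation (contrast).

    generated: (frames, 3, H, W) in [0, 1]; reference: (3, H, W).
    """
    gen_mean = generated.mean(dim=(0, 2, 3), keepdim=True)
    gen_std = generated.std(dim=(0, 2, 3), keepdim=True)
    ref_mean = reference.mean(dim=(1, 2)).view(1, 3, 1, 1)
    ref_std = reference.std(dim=(1, 2)).view(1, 3, 1, 1)
    aligned = (generated - gen_mean) / (gen_std + eps) * ref_std + ref_mean
    return aligned.clamp(0.0, 1.0)

# For an end-extend, the last captured frame is the nearest reference.
reference = torch.rand(3, 128, 128)
generated = torch.rand(48, 3, 128, 128) * 0.8 + 0.1  # slightly off in tone
aligned = match_brightness_contrast(generated, reference)
```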
After addressing seamlessness, the team encountered an additional problem: sometimes the generated two seconds were too novel, introducing dramatic new elements and motion. Video creators usually want to extend video just to add a beat or a cross-fade, without adding new elements. To avoid too much novelty in generated video, the team fine-tuned the generative model on carefully filtered data in which the last (or first) few seconds of each clip were as consistent and predictable as possible. This filtering taught the model to avoid novel content and produce extensions that blend seamlessly with the original video.
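The exact filtering criteria are not public, but a simple heuristic in this spirit is to keep only training clips whose final frames change gradually. A sketch, with the window and threshold chosen arbitrarily:

```python
import torch

def is_predictable_ending(clip: torch.Tensor, threshold: float = 0.05) -> bool:
    """Hypothetical filter: keep a training clip only if its final frames
    change gradually (no cuts, no sudden new elements).

    clip: (frames, 3, H, W) in [0, 1]; threshold is an assumed value.
    """
    tail = clip[-48:]  # roughly the last two seconds at 24 fps
    # Mean absolute difference between consecutive frames; a large spike
    # indicates a cut or dramatic motion the model should not imitate.
    diffs = (tail[1:] - tail[:-1]).abs().mean(dim=(1, 2, 3))
    return bool(diffs.max() < threshold)

# A slowly drifting clip passes; a clip with a hard cut near the end fails.
steady = torch.linspace(0, 0.5, 96).view(96, 1, 1, 1).expand(96, 3, 32, 32)
cut = steady.clone()
cut[80:] = 1.0 - cut[80:]
print(is_predictable_ending(steady), is_predictable_ending(cut))  # True False
```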
Adobe’s advancements in text-to-video generation and the introduction of GenExtend technology empower users to create in new ways. As the AI revolution continues to gather speed, Adobe’s pioneering work ensures that creators have the tools they need to produce high-quality, imaginative video content, while adhering to ethical AI practices.
Interested in trying these technologies out? Join the waitlist for the new Adobe Firefly Video Model here and learn how to access Generative Extend in Premiere Pro here.