Adobe Researchers first revealed Project Dub Dub Dub at Adobe MAX 2023. The groundbreaking technology uses generative AI to instantly translate a video while matching the speaker’s voice, tone, and cadence. In the two years since that first Sneak, the team has refined the technology and helped bring it to users of Firefly, Adobe’s family of creative generative AI models. The research evolved into what is now known as the Firefly Voice Model, which powers Translate Audio, Translate Video, and the Translate and Lip Sync API. Here’s the research story behind the technology.
How Project Dub Dub Dub went from an idea to the MAX stage
Back in 2020, Adobe Research’s Speech AI Team had a vision. They wanted to use emerging LLM technology to build a tool that could automatically translate videos while preserving non-speech sounds and the voice of the speaker.
The goal, according to Zeyu Jin, Senior Research Scientist and one of the key contributors to Project Dub Dub Dub, was to empower video creators: “We wanted to make sure that the content people create can reach audiences all over the world. We knew that preserving their voices needed to be a core part of the technology because that allows creators to preserve their identities and create content that feels natural and authentic—even across languages.”
The team, including Jin’s group at Adobe Research and their collaborators at Princeton University, Adam Finkelstein and Yunyun Wang, began exploring and experimenting, and once the technology was up and running, their work was selected for a Sneak at Adobe MAX. Jin, who already had three MAX Sneak presentations under his belt, was delighted to take the stage again. “MAX feels like a conversation between Researchers and Adobe users. You go on stage, you show what you’ve created, and you get massive reactions from the audience. For Project Dub Dub Dub, I remember that the biggest ‘wow’ moment was when we showed one woman’s voice being rapidly translated across five different languages. Everyone really loved that.”
After a successful Sneak, the team began working with the Firefly product team to take the technology from prototype to Adobe product. By then the group included Research Scientists Rithesh Kumar, Jiaqi Su, and Ke Chen, the latter two former interns who had joined Adobe Research full time. “An impressive demo is a completely different thing from a product that’s reliable and works in all the corner cases. To achieve that, we had to innovate the generative AI models,” says Jin.
Tackling accents, datasets, safety, and more
Refining Dub Dub Dub’s model posed challenging questions, including what to do about accents. As Kumar, an expert on audio generation algorithms, explains, “One thing that was really important to us from the beginning was the preservation of a speaker’s identity—so at first, we weren’t sure how to handle a speaker’s accent. If we translated a voice from Spanish to English, for example, it would generate English with a Spanish accent. We thought that might be preferable because it was more natural to the speaker, but users let us know that they wanted the accent to match the language. They really wanted translations to be true to the language and the way it’s spoken locally.”
Considering accents led the team even deeper into the question of localization. Spanish, for example, varies considerably from region to region. So the researchers decided that the model needed to let users choose both a language and a region.
To train the AI model, the Research team needed a collection of recordings of speakers across different languages. And to follow Adobe’s policies for safe and ethical AI, they needed to assemble that data from scratch. Su led the process, which involved auditioning 7,000 voices and narrowing them down to 480 speakers for the initial dataset. This selective process had the added benefit of letting the team control for specific dialects and accents.
In alignment with Adobe’s work as a founding member of the Content Authenticity Initiative (CAI), the team also built safeguards that make it easy to detect when a new voice has been added to a video and when a celebrity’s face is being used. These safeguards help prevent the tool’s voice cloning capabilities from being used to create deepfakes.
Throughout the development process, Researchers collaborated closely with the Digital Video and Audio (DVA) and Firefly product teams to define requirements for speed, efficiency, and quality, and to determine the best way to weave the product into the Adobe ecosystem. It was a joint effort: Adobe Research led the audio innovations, and the video component was developed in collaboration with product engineering partners to deliver a cohesive user experience across both modalities.

By the time the Project Dub Dub Dub technology was ready to be released as an official Adobe feature in Firefly, the Research team had completely rebuilt the model that originally powered the Sneak—and boosted support from an initial six languages to more than 20 languages and dialects. The new Translate Audio and Translate Video features—which became available at the same time as Firefly’s text-to-video capability—were among the very first audio and video features released for Firefly users.
“This is pioneering work. It’s extremely fast and efficient with good quality results,” says Jin. “The technology really brings Firefly to the forefront among creative AI tools.”
Charting the future of AI-powered voice translation
As Researchers look ahead, they’re already planning to add support for more languages and dialects, and to develop new features that will give Adobe users even more control over their translations.
“I think our goal is always to be the best sounding model in the world. And we want to support more languages, including languages that are less represented because they’re not as commonly spoken,” says Kumar. “And, as always, we want our users to have more controls. We strongly believe it’s not just about plugging a video in and getting a video out. Because we’re Adobe, we want creativity to be there. We’re always exploring new ways to put more control into users’ hands, whether it’s editing a transcript, or having more ways to control duration and expressivity.”
“We’re so excited about how far the technology we pioneered with Project Dub Dub Dub has come—and about everything that’s ahead,” adds Jin.
