Research Scientist Valentina Shin combines AI and machine learning technologies with intuitive human-computer interaction (HCI) to help users create audio and visual media.
We talked to Shin about her research and inspirations, imagining new creative tools that are easier to use, and what she loves about Adobe Research.
How did you get interested in interaction tools and techniques for audio and visual media?
When I started studying computer graphics, I was excited to invent new algorithms. But I also wanted to see how end-users interacted with the tools I made, so I got interested in HCI. Now I help create new ways for people to interact intuitively with visual and audio media. For example, we’ve built tools that let people edit videos using a transcript, which is similar enough to editing a document that people already know exactly what to do, even if they’re not video experts. And I’ve helped create technology that lets a novice user animate a character with sounds.
These are just a couple of ways that we’re taking familiar interactions and extending them so people can focus on content creation while taking advantage of powerful technology in the background.
You recently collaborated with a big team of researchers, designers, and engineers on Project Blink. Can you tell us about some of the problems you helped solve for Project Blink, and how they relate to your research?
Project Blink lets people edit video in a completely new way. They can search for words, images, people, and moments in a video transcript and then cut and paste, just like they do with a text document. This is very different from the traditional method of video editing, where people work with a timeline that might show audio waveforms or thumbnails.
When we started, one of the most interesting questions was how to visualize multiple types of information extracted with ML algorithms. I was going back to fundamental questions I’d asked in my PhD research, trying to figure out how to extract and visualize useful structure from videos. Of course, we can transcribe speech, but what about the things happening visually, and non-speech sounds like music, noise, and silence? We worked on bringing all of that into a structured transcript that users can easily skim and interact with.
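For a rough sense of what a structured transcript could look like, here is a minimal Python sketch; the Segment type, its fields, and the sample data are purely illustrative and not Project Blink's actual implementation:

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Segment:
    """One span of a structured transcript (illustrative only)."""
    start: float                 # seconds into the video
    end: float
    kind: str                    # e.g. "speech", "music", "noise", "silence"
    text: Optional[str] = None   # transcribed words, for speech segments

def skim(transcript: List[Segment], kind: str) -> List[Segment]:
    """Return only the segments of one kind, e.g. all speech or all silences."""
    return [seg for seg in transcript if seg.kind == kind]

transcript = [
    Segment(0.0, 2.5, "music"),
    Segment(2.5, 7.8, "speech", "Welcome back to the channel."),
    Segment(7.8, 9.0, "silence"),
    Segment(9.0, 14.2, "speech", "Today we're looking at three quick edits."),
]

print([seg.text for seg in skim(transcript, "speech")])
```

Marking non-speech spans like silence and music alongside the words is what lets a user skim or cut a video the way they would skim or cut a document.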
In a recent project with a summer intern, we also worked on visualizing multiple tracks in video transcripts so we could help users add effects like background music or b-roll. For inspiration, we talked to video editors and looked to other disciplines, including music scores for multiple instruments and screenplays that include what the actors need to act out, directions for who walks in, and lots of visual and audio things that happen in the background.
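The score analogy suggests one possible way to model this: parallel, time-aligned tracks that can be read "vertically" at any moment. The sketch below is a hypothetical simplification, not the interface built in that project:

```python
# A hypothetical multi-track transcript: each track is a list of
# (start_seconds, end_seconds, label) entries, time-aligned with the others.
tracks = {
    "dialogue": [(2.5, 7.8, "Welcome back to the channel."),
                 (9.0, 14.2, "Today we're looking at three quick edits.")],
    "music":    [(0.0, 9.0, "upbeat intro theme")],
    "b-roll":   [(9.0, 14.2, "close-up of the editing timeline")],
}

def events_at(tracks, t):
    """What is happening on every track at time t (like reading a score vertically)?"""
    return {name: next((label for start, end, label in entries if start <= t < end), None)
            for name, entries in tracks.items()}

print(events_at(tracks, 10.0))
# {'dialogue': "Today we're looking at three quick edits.", 'music': None, 'b-roll': 'close-up of the editing timeline'}
```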
In all this work, we’re drawing on existing interfaces and interaction models and augmenting them with emerging ML technologies.
You’ve also worked on live animation. Can you tell us about that?
Traditionally, when you want to animate a character, you do it by keyframing and then matching the animation to prerecorded audio. This is a tedious and time-consuming task. But one of my interns and I recently worked on a project that allows users to control animation through sound. For example, we wanted to let people just sing in front of a character and make that character sing. Or we wanted to let them make a sneezing sound to make a character sneeze.
So, we created an interface where users can provide examples — they can sing or make other sounds — and then decide how they want the animation to play along. Then, behind the scenes, our algorithm infers the relationship between the audio and the animation and takes care of synchronization. It was such an exciting project, and we presented it at IUI, a conference on intelligent user interfaces.
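As a toy illustration of the general idea of driving animation from audio (and not the algorithm presented at IUI), one could map a simple audio feature such as per-frame loudness to a character pose:

```python
import numpy as np

def loudness_envelope(samples: np.ndarray, sample_rate: int, fps: int = 24) -> np.ndarray:
    """Root-mean-square loudness of the audio, one value per animation frame."""
    hop = sample_rate // fps
    n_frames = len(samples) // hop
    return np.array([np.sqrt(np.mean(samples[i * hop:(i + 1) * hop] ** 2))
                     for i in range(n_frames)])

def drive_mouth(samples: np.ndarray, sample_rate: int, threshold: float = 0.05) -> list:
    """Pick a pose per frame: open the character's mouth when the user sings loudly enough."""
    return ["mouth_open" if level > threshold else "mouth_closed"
            for level in loudness_envelope(samples, sample_rate)]

# One second of fake audio: half a second of silence, then a loud sung note.
sr = 16_000
audio = np.concatenate([np.zeros(sr // 2),
                        0.3 * np.sin(2 * np.pi * 220 * np.arange(sr // 2) / sr)])
print(drive_mouth(audio, sr))
```

The actual system learns the mapping from the user's own examples rather than relying on a fixed loudness threshold, but the sketch shows the shape of the problem: audio features in, synchronized animation decisions out.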
In all these projects there are complex algorithms behind the scenes, but we wrap them in simple user interactions. We want the experience to feel familiar, so users can just use the tool without necessarily diving into its complexity, just like we use our hands without having to understand their anatomy.
What role do you think new AI and machine learning technology will play in building more intuitive creative tools?
The challenge is embedding increasingly powerful machine learning and AI technology into the context of the user's workflow so that it enhances their creativity without erasing their personal style.
For example, we hear a lot about generative, text-to-image technology. You can give it a prompt and generate an image. And there are large language models that can carry out conversations. So how do we embed these technologies in a video editing tool? One idea is to use them to suggest edits, such as titles or content for b-roll footage. But generic or wrong suggestions would hinder the user's creativity and be counterproductive. We want the suggestions to be aware of the context, such as the theme of the video and the user's editing style. We also want the suggestions to be customizable.
So the question is, how do we refine these technologies and wrap them in an intuitive interaction that’s useful — one that relates to familiar processes, cuts out the boring repetitive tasks while still letting people really use their skills and creativity, and has a very small learning curve?
You’re also interested in accessibility for people who are hard of hearing or deaf. What are you working on there?
One example is the work I did with my colleague Dingzeyu Li and a summer intern to bring more life into captions. We wanted to look beyond speech to other sounds — there’s so much information in sound that can be translated into different modalities. For example, we can use graphical elements like emojis to convey sound events detected in the background.
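As a rough illustration of that idea (not the system built in that project), detected sound-event labels could be merged into captions as small graphical cues:

```python
# Hypothetical mapping from detected sound-event labels to caption cues.
SOUND_EMOJI = {
    "dog_bark": "🐶",
    "applause": "👏",
    "door_knock": "🚪",
    "music": "🎵",
}

def caption_with_events(speech: str, detected_events: list) -> str:
    """Append emoji cues for non-speech sounds detected in the same time window."""
    cues = "".join(SOUND_EMOJI.get(event, "") for event in detected_events)
    return f"{speech} {cues}".strip() if cues else speech

print(caption_with_events("I think someone's at the door.", ["door_knock", "dog_bark"]))
# I think someone's at the door. 🚪🐶
```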
With new vision-language models that understand what's happening in the scene and translate it into text, I think there's a lot of potential to create bridges across multimodal experiences.
You started some of your work as an Adobe Research intern and then decided to join full-time. What made you want to stay?
My favorite part of Adobe is the people. I’m always learning from my colleagues — sometimes I feel a little bit out of breath because they are all so good! And I really love the opportunity to work with talented interns each summer. I get to grow with them, and it’s almost like reliving my PhD. It’s such a beautiful experience.
To learn more about Adobe Research’s work in Human Computer Interaction, check out our HCI research team, publications, and news here!