Publications

A Hybrid Rigid and Non-Rigid Motion Approximation for Generating Realistic Listening Behavior Videos

Indian Conference on Computer Vision, Graphics and Image Processing (ICVGIP)

Publication date: December 8, 2022

Swasti Shreya Mishra, Kumar Shubham, Dinesh Babu Jayagopi

The use of generated behavioral videos, particularly in a controlled fashion, is an emerging direction in multi-modal interaction research. Alongside speaking behavior, understanding listening behavior is an important problem: listening behaviors play a crucial role in many dyadic conversations in day-to-day life. In a job interview, for instance, the interviewer uses cues from the listener's behavior to judge a candidate. A recent study demonstrated the use of one-shot deep-fake generation models such as the first-order motion model (FOMM) to transfer the behavior of an actor onto any single facial image for listening-behavior understanding. Such an experiment requires creating multiple motion-transfer videos for different participants. Compared to other DeepFake models, FOMM offers a unique advantage: it transfers any actor's motion to a target source image without training a separate model for every participant. In addition, its unsupervised keypoint learning process allows it to generalize to different kinds of motion. However, FOMM approximates every motion to first order, which limits its application to transformations that combine multiple rigid bodies with non-rigid motion, as facial movements do. As a result, these models often generate artifacts around the lip and eye regions. To circumvent this issue, we propose a hybrid model that can approximate a given motion using both first-order and zero-order terms. Compared to FOMM, our approach gives the network the flexibility to decide which type of approximation to make for each keypoint. Results show that our model not only prevents distortion for non-rigid motion but also generates better-quality output than FOMM.
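
To make the contrast concrete, the sketch below illustrates one plausible reading of the hybrid idea; it is not code from the paper. Near a keypoint, a first-order (FOMM-style) warp maps a pixel z as kp_source + J(z - kp_driving), a zero-order warp keeps only the translation kp_source + (z - kp_driving), and a hybrid blends the two with a per-keypoint weight. The function name hybrid_local_motion and the blending weight alpha are hypothetical, introduced here only for illustration.

    import numpy as np

    def hybrid_local_motion(z, kp_driving, kp_source, jacobian, alpha):
        # Offset of the query point from the driving-frame keypoint.
        offset = z - kp_driving
        # First-order (FOMM-style) term: local affine motion via the Jacobian.
        first_order = kp_source + jacobian @ offset
        # Zero-order term: pure translation of the keypoint neighborhood.
        zeroth_order = kp_source + offset
        # Per-keypoint blend; in the hybrid model alpha would be network-predicted.
        return alpha * first_order + (1.0 - alpha) * zeroth_order

    # Toy usage with normalized image coordinates.
    z = np.array([0.10, -0.05])               # query pixel
    kp_d = np.array([0.08, -0.02])            # keypoint in the driving frame
    kp_s = np.array([0.12, 0.01])             # matching keypoint in the source frame
    J = np.array([[1.1, 0.0], [0.0, 0.9]])    # estimated local Jacobian
    print(hybrid_local_motion(z, kp_d, kp_s, J, alpha=0.3))

Under this reading, alpha = 1 for every keypoint recovers a purely first-order approximation as in FOMM, while alpha = 0 reduces a keypoint to zero-order translation; letting the network choose per keypoint is the flexibility the abstract describes.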
