The emergence of commercial tools for real-time performance-based 2D animation has enabled 2D characters to appear on live broadcasts and streaming platforms. A key requirement for live animation is fast and accurate lip sync that allows characters to respond naturally to other actors or the audience through the voice of a human performer. In this work, we present a deep learning-based interactive system that automatically generates live lip sync for layered 2D characters using a Long Short Term Memory (LSTM) model. Our system takes streaming audio as input and produces viseme sequences with less than 200ms of latency (including processing time). Our contributions include specific design decisions for our feature definition and LSTM configuration that provide a small but useful amount of lookahead to produce accurate lip sync. We also describe a data augmentation procedure that allows us to achieve good results with a very small amount of hand-animated training data (13-20 minutes). Extensive human judgement experiments show that our results are preferred over several competing methods, including those that only support offline (non-live) processing. A video summary and supplementary results are available at: https://github.com/deepalianeja/CharacterLipSync2D
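To make the streaming setup concrete, below is a minimal sketch (not the authors' exact architecture) of an LSTM that maps chunks of streaming audio features to per-frame viseme logits, carrying recurrent state across chunks and emitting each prediction a few frames late to realize a small lookahead. All sizes here (26-dim features, 12 viseme classes, 2 frames of lookahead) are hypothetical placeholders, not values taken from the paper.

```python
# Minimal streaming viseme classifier sketch, assuming PyTorch.
import torch
import torch.nn as nn

class StreamingVisemeLSTM(nn.Module):
    def __init__(self, feat_dim=26, hidden=200, n_visemes=12, lookahead=2):
        super().__init__()
        self.lookahead = lookahead                 # frames of future context
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_visemes)   # per-frame viseme logits

    def forward(self, feats, state=None):
        # feats: (batch, time, feat_dim) chunk of streaming audio features.
        # The caller reads the prediction for frame t only after frame
        # t + lookahead has been consumed, trading a small fixed latency
        # for more accurate lip sync.
        out, state = self.lstm(feats, state)
        return self.head(out), state

# Usage: feed fixed-size feature chunks as audio arrives, carrying the
# recurrent state across calls so the model sees one continuous stream.
model = StreamingVisemeLSTM()
chunk = torch.randn(1, 8, 26)          # 8 frames of (hypothetical) features
logits, state = model(chunk, None)     # pass `state` back in on the next chunk
visemes = logits.argmax(dim=-1)        # per-frame viseme indices
```

The fixed lookahead keeps total latency bounded and predictable, which is what allows the system to stay under the stated 200ms budget while still using a little future context.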