As videos progressively take a central role in conveying information on the Web, current \textit{linear-consumption} methods, which require spending time proportional to the duration of the video, need to be revisited. In this work, we present NoVoExp, a method that enables a Non-linear Video Consumption Experience by generating a sequence of multimodal fragments that succinctly represents the content of different segments of a video. These fragments help viewers understand the content of the video without watching it in its entirety, and they also serve as pointers to different segments of the video, enabling a new mechanism for interacting with and consuming the video. To better understand the relative advantages and disadvantages of NoVoExp, we also design several baselines that build on video captioning and video summarization works, and we compare performance across different video durations (short, medium, long) and categories (entertainment, lectures, tutorials). We find that the sequences of multimodal fragments generated by NoVoExp have higher relevance to the video and are more diverse, yet remain coherent. Our extensive evaluation using several automated metrics as well as human studies shows that our multimodal fragments are not only good at representing the contents of the video but also align well with targeted viewer preferences.