Multimodal content is central to digital communication and has been shown to increase user engagement, making it indispensable in today's digital economy. Image-text combinations are a common multimodal format seen across digital channels, e.g., banners, online ads, and social posts. The choice of a specific image-text combination is dictated by the information to be conveyed, the relative strengths of the image and text modalities in conveying that information, and the needs of the reader consuming the content. Given an input document, representing its information as a multimodal fragment and creating variants that account for these factors is a non-trivial and tedious task, calling for automation. In this paper, we propose a holistic approach to automatically create multimodal image-text fragments derived from unstructured input content and tailored to a target need. The proposed approach aligns the fragment with the target need in terms of both content and style. Through metric-based and human evaluations, we show that the proposed approach is effective in generating multimodal fragments aligned with target needs while faithfully capturing the information to be presented.