Publications

DocEditAgent: Document Structure Editing Via Multimodal LLM Grounding

EMNLP 2024

Publication date: November 16, 2024

Manan Suri, Puneet Mathur, Franck Dernoncourt, Rajiv Jain, Vlad Morariu, Ramit Sawhney, Preslav Nakov, Dinesh Manocha

Document structure editing involves manipulating localized textual, visual, and layout components in document images in response to user requests. Prior work has shown that two key challenges remain for this task: multimodal grounding of user requests in the document image, and accurate identification of the structural components and their associated attributes. To address these challenges, we introduce DocEditAgent. Extensive experiments on the DocEdit dataset show that DocEditAgent significantly outperforms strong baselines on edit command generation (2-33%), region-of-interest (RoI) bounding box detection (12-31%), and overall document editing (1-12%) tasks.