We present a novel task of document-level script event prediction, which aims to predict the next event given a candidate list of narrative events in long-form documents. To enable this, we introduce DocSEP, a challenging dataset of over five million samples drawn from two new domains, contractual documents and Wikipedia articles, in which timeline events may lie paragraphs apart and may require multi-hop temporal and causal reasoning. We propose DocScript, an architecture that applies optimal transport to event-pair representations to learn the sequential ordering of events via three event-aware instruction fine-tuning tasks. Our experimental results on DocSEP demonstrate that DocScript learns longer-range dependencies between events and outperforms existing state-of-the-art script event prediction methods by 6-8\% on the proposed datasets. We further show that contemporary LLMs such as ChatGPT, GPT-4, and LLaMA struggle on this task, indicating limited ability to reason about causal relationships and temporal sequences within long texts.