Curricular Next Conversation Prediction Pretraining for Transcript Segmentation

Findings of EACL 2023

Publication date: June 6, 2023

Anvesh Rao Vijjini, Hanieh Deilamsalehy, Franck Dernoncourt, Snigdha Chaturvedi

Transcript segmentation is the task of dividing a single continuous transcript into multiple segments. While document segmentation is a popular task, transcript segmentation poses significant challenges because the data is relatively noisy and sporadic. We propose pretraining strategies to address these challenges. The strategies are based on "Next Conversation Prediction" (NCP), whose underlying idea is to pretrain a model to identify consecutive conversations. We further introduce "Advanced NCP" to make the pretraining task more relevant to the downstream task of segmentation break prediction while being significantly easier. Finally, we introduce a curriculum to Advanced NCP (Curricular NCP) based on the similarity between pretraining and downstream task samples. Curricular NCP applied to a state-of-the-art model for text segmentation outperforms prior results. We also show that our pretraining strategies make the model robust to speech recognition errors commonly found in automatically generated transcripts.