Video topic segmentation unveils the coarse-grained semantic structure underlying videos and therefore plays an essential role in a variety of downstream video understanding tasks. With the explosion of multi-modal content in recent years, relying solely on a single modality, such as text or visual frames, is arguably insufficient. Moreover, previously proposed solutions for related tasks such as video scene/shot segmentation typically target short videos with distinct visual changes and are not robust enough for long videos with subtler visual transitions (e.g., livestream videos). In this paper, we introduce a multi-modal video topic segmenter that leverages both video transcripts and frames for segment prediction, enhanced by a cross-modal attention mechanism. Furthermore, we propose a dual-contrastive learning framework adhering to the unsupervised domain adaptation paradigm, which allows our model to generalize to longer videos with far more complex semantics than those seen during training. Experimental results on one short video corpus and two long video corpora demonstrate that our multi-modal solution, augmented by the dual-contrastive domain adaptation strategy, significantly surpasses baseline methods in both accuracy and transferability, in both intra- and cross-domain settings.
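The cross-modal attention mentioned above can be sketched as scaled dot-product attention in which transcript embeddings act as queries over frame embeddings. This is a minimal illustrative sketch: the feature dimensions, the single-head formulation, and the function names are assumptions for demonstration, not the paper's exact architecture.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_attention(text_feats, frame_feats):
    """Transcript embeddings (queries) attend over frame embeddings
    (keys/values), yielding visually-grounded text representations.

    text_feats:  (num_sentences, d) transcript sentence embeddings
    frame_feats: (num_frames, d) video frame embeddings
    """
    d = text_feats.shape[-1]
    scores = text_feats @ frame_feats.T / np.sqrt(d)  # (num_sentences, num_frames)
    weights = softmax(scores, axis=-1)                # rows sum to 1
    return weights @ frame_feats                      # (num_sentences, d)

# Toy example with random features (dimensions are arbitrary).
rng = np.random.default_rng(0)
text = rng.standard_normal((5, 16))    # 5 transcript sentences, dim 16
frames = rng.standard_normal((8, 16))  # 8 sampled frames, dim 16
fused = cross_modal_attention(text, frames)
print(fused.shape)  # (5, 16)
```

In a full segmenter, such fused per-sentence representations would feed a boundary classifier that predicts topic transitions; the sketch only shows how the two modalities can be combined.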