Training Full-Page Handwritten Text Recognition Models without Annotated Line Breaks

International Conference on Document Analysis and Recognition (ICDAR)

Published November 11, 2019

Chris Tensmeyer, Curtis Wigington


Training Handwritten Text Recognition (HTR) models typically requires large amounts of labeled data, usually line or page images paired with line-level ground truth (GT) transcriptions. Many digital collections provide a page-level transcription for each image, but the transcription is unformatted, i.e., line breaks are not annotated. Can line-based HTR models be trained from such data? In this work, we present a novel alignment technique for segmenting page-level GT text into text lines during HTR model training. This text segmentation problem is formulated as an optimization problem that minimizes the cost of aligning predicted lines with the GT text. Using both simulated and real HTR model predictions, we show that the alignment method identifies line breaks accurately, even when the predicted lines have a high character error rate (CER). We removed the GT line breaks from the ICDAR-2017 READ dataset and trained an HTR model using the proposed alignment method to predict line breaks on the fly. This model achieves a CER comparable to that of the same model trained with the GT line breaks. Additionally, we downloaded an online digital collection of 50K English journal pages (not curated for HTR research) whose transcriptions do not contain line breaks, and achieved 11% CER.
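The abstract does not spell out the exact cost function or search procedure, but the core idea (choose line-break positions in the page-level GT that minimize the cost of aligning each predicted line with its GT segment) can be illustrated with a minimal dynamic-programming sketch. Everything below is an assumption for illustration: the function names (`segment_gt`, `levenshtein`), the use of plain Levenshtein distance as the alignment cost, and the exhaustive O(n·m²) segmentation search are not taken from the paper, which may use a different cost and a more efficient alignment.

```python
def levenshtein(a: str, b: str) -> int:
    """Standard edit distance between strings a and b (insert/delete/substitute, cost 1 each)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,            # delete ca
                           cur[j - 1] + 1,         # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute (free if equal)
        prev = cur
    return prev[-1]


def segment_gt(pred_lines: list[str], gt_text: str) -> list[str]:
    """Split gt_text into len(pred_lines) contiguous segments so that the summed
    edit distance between each predicted line and its GT segment is minimal.

    cost[i][j] = cheapest alignment of the first i predicted lines
                 with the first j characters of gt_text.
    """
    n, m = len(pred_lines), len(gt_text)
    INF = float("inf")
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    back = [[0] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0
    for i in range(1, n + 1):
        for j in range(m + 1):
            for k in range(j + 1):  # k = break position before segment i
                if cost[i - 1][k] == INF:
                    continue
                c = cost[i - 1][k] + levenshtein(pred_lines[i - 1], gt_text[k:j])
                if c < cost[i][j]:
                    cost[i][j] = c
                    back[i][j] = k
    # Backtrack the chosen break positions, then slice gt_text into segments.
    breaks, j = [m], m
    for i in range(n, 0, -1):
        j = back[i][j]
        breaks.append(j)
    breaks.reverse()
    return [gt_text[breaks[i]:breaks[i + 1]] for i in range(n)]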