An end-to-end pipeline for historical censuses processing

Bibliographic Details
Published in:	International journal on document analysis and recognition Vol. 26; no. 4; pp. 419 - 432
Main Authors:	Petitpierre, Rémi, Kramer, Marion, Rappo, Lucas
Format:	Journal Article
Language:	English
Published:	Berlin/Heidelberg: Springer Berlin Heidelberg, 01.12.2023; Springer Nature B.V
Subjects:	Algorithms; Census; Censuses; Columnar structure; Computer Science; Computer vision; Deep learning; Dictionaries; Documents; Image Processing and Computer Vision; Neural networks; Original Paper; Pattern Recognition; Performance enhancement; Recognition; Semantic segmentation; Handwritten text recognition; OCR post-correction; Tabular document understanding; Historical document processing; Layout analysis
ISSN:	1433-2833, 1433-2825

Description
Summary:	Censuses are structured documents of great value for social and demographic history, and they became widespread from the nineteenth century onward. However, the plurality of formats and the natural variability of historical data make their extraction arduous and often lead to recognition algorithms that do not generalize. We propose an end-to-end processing pipeline, based on optimization, in an attempt to reduce the number of free parameters. The layout analysis relies on semantic segmentation with neural networks for generic recognition of the explicit column structure. The implicit row structure is deduced directly from the position of the text segments. Handwritten text detection is complemented by an intelligent framing method that significantly improves the quality of the handwritten text recognition (HTR). Finally, we propose combining several post-correction approaches, neural networks and language models, to further improve performance. Our flexible methods accurately detect more than 98% of the columns and 88% of the rows, despite the lack of graphical separators and the diversity of formats. Thanks to various reframing and post-correction strategies, HTR reaches a character error rate of 3.44% on these noisy nineteenth-century data. In total, more than 18,831 pages were extracted from 72 censuses spanning a century. This large historical dataset, as well as the training data, is released open-access along with this article.
DOI:	10.1007/s10032-023-00428-9
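
The summary reports HTR quality as a character error rate (CER). As a reading aid, here is a minimal sketch of how CER is commonly computed: the Levenshtein edit distance between the recognized text and the reference transcription, summed over all samples and divided by the total number of reference characters. The function names and the sample strings are hypothetical, and the record does not state the exact evaluation protocol used in the article (for example, text normalization or per-line versus corpus-level averaging), so this should be read as an illustration of the metric rather than the authors' implementation.

```python
from typing import Iterable, Tuple


def levenshtein(ref: str, hyp: str) -> int:
    """Minimum number of insertions, deletions and substitutions turning ref into hyp."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to the empty string
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # delete r from ref
                curr[j - 1] + 1,     # insert h into ref
                prev[j - 1] + cost,  # substitute r by h (or match)
            ))
        prev = curr
    return prev[-1]


def character_error_rate(pairs: Iterable[Tuple[str, str]]) -> float:
    """Corpus-level CER over (reference, hypothesis) pairs.

    Assumption: edits and reference lengths are pooled over all pairs,
    which is one common convention among several.
    """
    total_edits = 0
    total_chars = 0
    for ref, hyp in pairs:
        total_edits += levenshtein(ref, hyp)
        total_chars += len(ref)
    if total_chars == 0:
        raise ValueError("reference text is empty; CER is undefined")
    return total_edits / total_chars


# Hypothetical sample: one dropped character in an 8-character reference.
samples = [("Lausanne", "Lausane")]
cer = character_error_rate(samples)
```

Under this convention, a CER of 3.44% corresponds to roughly 3 to 4 character edits per 100 reference characters.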