
Transforming Physical Archives into Searchable Digital Libraries with Optical Character Recognition.


Bibliographic Details
Title:	Transforming Physical Archives into Searchable Digital Libraries with Optical Character Recognition.
Authors:	Sivankalai, Sivankalai, Balachandran, Shanmugam
Source:	Preservation, Digital Technology & Culture; Dec2025, Vol. 54 Issue 4, p263-284, 22p
Subject Terms:	OPTICAL character recognition, DIGITIZATION of archival materials, SPELLING errors, TEXT recognition, IMAGE enhancement (Imaging systems), GRAPHICAL user interfaces, DIGITAL libraries, DIGITAL preservation
Abstract:	This study presents an integrated and extensible OCR framework that transforms physical documents into searchable digital archives, emphasizing system-level engineering over algorithmic novelty. While the core components, such as contrast enhancement, PaddleOCR, and BERT-based spelling correction, are individually established, their synthesis into a unified pipeline for multilingual and multimodal archival digitization represents a practical contribution to digital preservation. The architecture is implemented in Python and incorporates preprocessing steps such as grayscale conversion, CLAHE-based contrast adjustment, Gaussian blurring, and adaptive thresholding to enhance image quality. High-accuracy multilingual text recognition is performed with PaddleOCR, while a hybrid spelling correction module combines lexical and BERT-based contextual corrections. A straightforward graphical user interface (GUI) supports interaction and visual diagnostics based on confidence heatmaps and pixel intensity histograms. Digitized outputs are stored in a central library repository for efficient retrieval and organization. The system achieves high OCR accuracy across input types: 99.12 % on screenshots and 94.81 % on scanned documents. Preprocessing greatly improves text readability in degraded images. The spelling correction module enhances text readability by more than 95 %. Visualization tools offer useful insights into OCR performance and preprocessing effects. The paper demonstrates a modular pipeline that integrates preprocessing, OCR, error correction, visualization, and repository integration. The application of PaddleOCR with domain-specific preprocessing and contextual correction capabilities brings novelty, particularly for rich archival content. [ABSTRACT FROM AUTHOR]
	Copyright of Preservation, Digital Technology & Culture is the property of De Gruyter and its content may not be copied or emailed to multiple sites without the copyright holder's express written permission. Additionally, content may not be used with any artificial intelligence tools or machine learning technologies. However, users may print, download, or email articles for individual use. This abstract may be abridged. No warranty is given about the accuracy of the copy. Users should refer to the original published version of the material for the full abstract. (Copyright applies to all Abstracts.)
Database:	Complementary Index


Description
ISSN:	21952957
DOI:	10.1515/pdtc-2025-0025
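
Note:	The abstract describes a hybrid spelling correction module that combines lexical and BERT-based contextual corrections, but this record gives no implementation details. The sketch below illustrates only the lexical half, under stated assumptions: it uses Python's standard-library difflib for fuzzy matching, and the names LEXICON, correct_token, and correct_text, the tiny word list, and the 0.8 similarity cutoff are all hypothetical choices made for illustration. It is not the authors' method, and the contextual (BERT) step is omitted entirely.

```python
import difflib
import re

# Hypothetical mini-lexicon for illustration; a real system would load
# a full dictionary for each target language.
LEXICON = {"a", "archive", "digital", "library", "of",
           "preservation", "searchable", "the"}


def correct_token(token, lexicon=LEXICON, cutoff=0.8):
    """Return the closest lexicon word for an unknown token, else the token itself."""
    lower = token.lower()
    # Leave known words, numbers, punctuation, and whitespace untouched.
    if lower in lexicon or not any(ch.isalpha() for ch in lower):
        return token
    # sorted() keeps the candidate order deterministic across runs.
    matches = difflib.get_close_matches(lower, sorted(lexicon), n=1, cutoff=cutoff)
    if not matches:
        return token  # no sufficiently similar word; keep the OCR output
    fixed = matches[0]
    # Preserve an initial capital from the OCR token.
    return fixed.capitalize() if token[0].isupper() else fixed


def correct_text(text, lexicon=LEXICON):
    """Apply lexical correction word by word, preserving spacing and punctuation."""
    parts = re.findall(r"\w+|\W+", text)
    return "".join(correct_token(part, lexicon) for part in parts)


if __name__ == "__main__":
    print(correct_text("Digita1 libraxy, searchab1e."))
```

A purely lexical pass like this can only repair tokens that resemble a dictionary entry; choosing between several plausible candidates from the surrounding sentence is the role the abstract assigns to the contextual model.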