Transforming text into discovery: OCR enrichment of digital collections in the University of Galway Library

Bibliographic details
Title: Transforming text into discovery: OCR enrichment of digital collections in the University of Galway Library
Authors: Dereza, Oksana; Rouget, Marie-Louise; Egwu, Chidi; Joy, Cillian
Other contributors: Department of Tourism, Culture, Arts, Gaeltacht, Sport and Media, Ireland; Carnegie Corporation; University of Galway Research Repository
Publisher information: IEEE, 2025.
Publication year: 2025
Keywords: OCR, Collections as Data, IIIF, Optical Character Recognition, OCR workflow, Digital Collections, Heritage Collections
Description: The University of Galway Library has been working on an Optical Character Recognition (OCR) pipeline to transform scanned archival materials into machine-readable text at scale. This significantly enhances the accessibility and searchability of the University’s digitised heritage collections, supporting diverse areas of research interest and fostering deeper engagement with the Library’s holdings. The paper discusses key aspects of building an OCR pipeline, including the performance of available OCR software on heritage data, pre-processing of digitised images, quality assurance, and converting the OCR engine outputs for DAMS upload and seamless IIIF integration. The pipeline aims at balancing automation with quality control for successful extraction of printed, typewritten and handwritten text. We believe that our experience may help other GLAM institutions that are considering incorporating automatic text extraction into their digital collections workflow.
Publication type: Conference object
File description: application/pdf
Language: English
Access URL: https://hdl.handle.net/10379/19264
Rights: CC BY-NC-ND
Document code: edsair.od......1513..b9b4b0aa475f3749048b58b51a73c881
Database: OpenAIRE