Transforming text into discovery: OCR enrichment of digital collections in the University of Galway Library

Bibliographic details
Title: Transforming text into discovery: OCR enrichment of digital collections in the University of Galway Library
Authors: Dereza, Oksana; Rouget, Marie-Louise; Egwu, Chidi; Joy, Cillian
Contributors: Department of Tourism, Culture, Arts, Gaeltacht, Sport and Media, Ireland; Carnegie Corporation; University of Galway Research Repository
Publisher information: IEEE, 2025.
Publication year: 2025
Subjects: OCR, Collections as Data, IIIF, Optical Character Recognition, OCR workflow, Digital Collections, Heritage Collections
Description: The University of Galway Library has been working on an Optical Character Recognition (OCR) pipeline to transform scanned archival materials into machine-readable text at scale. This significantly enhances the accessibility and searchability of the University’s digitised heritage collections, supporting diverse areas of research interest and fostering deeper engagement with the Library’s holdings. The paper discusses key aspects of building an OCR pipeline, including the performance of available OCR software on heritage data, pre-processing of digitised images, quality assurance, and converting the OCR engine outputs for DAMS upload and seamless IIIF integration. The pipeline aims at balancing automation with quality control for successful extraction of printed, typewritten and handwritten text. We believe that our experience may help other GLAM institutions that are considering incorporating automatic text extraction into their digital collections workflow.
Document type: Conference object
File description: application/pdf
Language: English
Access URL: https://hdl.handle.net/10379/19264
Rights: CC BY-NC-ND
Accession number: edsair.od......1513..b9b4b0aa475f3749048b58b51a73c881
Database: OpenAIRE