Transforming text into discovery: OCR enrichment of digital collections in the University of Galway Library

Detailed bibliography
Title: Transforming text into discovery: OCR enrichment of digital collections in the University of Galway Library
Authors: Dereza, Oksana; Rouget, Marie-Louise; Egwu, Chidi; Joy, Cillian
Contributors: Department of Tourism, Culture, Arts, Gaeltacht, Sport and Media, Ireland; Carnegie Corporation; University of Galway Research Repository
Publisher information: IEEE, 2025.
Publication year: 2025
Subjects: OCR, Collections as Data, IIIF, Optical Character Recognition, OCR workflow, Digital Collections, Heritage Collections
Description: The University of Galway Library has been working on an Optical Character Recognition (OCR) pipeline to transform scanned archival materials into machine-readable text at scale. This significantly enhances the accessibility and searchability of the University’s digitised heritage collections, supporting diverse areas of research interest and fostering deeper engagement with the Library’s holdings. The paper discusses key aspects of building an OCR pipeline, including the performance of available OCR software on heritage data, pre-processing of digitised images, quality assurance, and converting the OCR engine outputs for DAMS upload and seamless IIIF integration. The pipeline aims at balancing automation with quality control for successful extraction of printed, typewritten and handwritten text. We believe that our experience may help other GLAM institutions that are considering incorporating automatic text extraction into their digital collections workflow.
Document type: Conference object
File description: application/pdf
Language: English
Access URL: https://hdl.handle.net/10379/19264
Rights: CC BY NC ND
Accession number: edsair.od......1513..b9b4b0aa475f3749048b58b51a73c881
Database: OpenAIRE
Publication date: 9 June 2025