Evaluating BERT's Encoding of Intrinsic Semantic Features of OCR'd Digital Library Collections

The uncertainty caused by optical character recognition (OCR) noise has been a primary barrier for digital libraries (DL) to promote their curated datasets for research purposes, particularly when the datasets are fed into advanced language models with less transparency. To shed some light on this i...

Celý popis

Uloženo v:
Podrobná bibliografie
Vydáno v:2021 ACM/IEEE Joint Conference on Digital Libraries (JCDL) s. 308 - 309
Hlavní autoři: Jiang, Ming, Hu, Yuerong, Worthey, Glen, Dubnicek, Ryan C, Underwood, Ted, Downie, J Stephen
Médium: Konferenční příspěvek
Jazyk:angličtina
Vydáno: IEEE 01.09.2021
Témata:
On-line přístup:Získat plný text
Tagy: Přidat tag
Žádné tagy, Buďte první, kdo vytvoří štítek k tomuto záznamu!
Popis
Shrnutí:The uncertainty caused by optical character recognition (OCR) noise has been a primary barrier for digital libraries (DL) to promote their curated datasets for research purposes, particularly when the datasets are fed into advanced language models with less transparency. To shed some light on this issue, this study evaluates the impacts of OCR noise on BERT models for encoding the intrinsic semantic features of OCR'd texts. Specifically, we encoded chapterwise paired OCR'd texts and their cleaned counterparts extracted from books in six domains using BERT pre-trained and fine-tune models respectively. Given the encoded text features, we further calculated the cosine similarity between any two chapters and used normalized discounted cumulative gain (NDCG) [1] to measure BERT variants' capabilities to preserve narrative coherence and semantic relevance among texts. Our empirical results show that (1) BERT embeddings can encode and preserve texts' intrinsic semantic features (i.e., relevance and coherence); and (2) such capabilities are comparatively robust against OCR noise. This should help alleviate some DL users' concerns regarding applying contextualized word embeddings to encode chapter-level or even document-level OCR'd text information, which benefits promoting scholarly use of DL collections. Our research also demonstrates how texts' intrinsic semantic features can be used for evaluating the impacts of OCR noise on advanced language models, which is an underdeveloped and promising direction for future work.
DOI:10.1109/JCDL52503.2021.00045