Preparing Big Manuscript Data for Hierarchical Clustering with Minimal HTR Training

Detailed bibliography
Published in: Journal of data mining and digital humanities; Volume: Historical Documents and...; Issue: Sciences of Antiquity and...
Main author: Perdiki, Elpida
Medium: Journal Article
Language: English
Publication details: INRIA; Nicolas Turenne, 20 December 2023
ISSN: 2416-5999
Abstract HTR (Handwritten Text Recognition) technologies have progressed enough to offer high-accuracy results in recognising handwritten documents, even on a synchronous level. Despite the state-of-the-art algorithms and software, historical documents (especially those written in Greek) remain a real-world challenge for researchers. A large number of unedited or under-edited works of Greek Literature (ancient or Byzantine, especially the latter) exist to this day due to the complexity of producing critical editions. To critically edit a literary text, scholars need to pinpoint text variations on several manuscripts, which requires fully (or at least partially) transcribed manuscripts. For a large manuscript tradition (i.e., a large number of manuscripts transmitting the same work), such a process can be a painstaking and time-consuming project. To that end, HTR algorithms that train AI models can significantly assist, even when not resulting in entirely accurate transcriptions. Deep learning models, though, require a quantum of data to be effective. This, in turn, intensifies the same problem: big (transcribed) data require heavy loads of manual transcriptions as training sets. In the absence of such transcriptions, this study experiments with training sets of various sizes to determine the minimum amount of manual transcription needed to produce usable results. HTR models are trained through the Transkribus platform on manuscripts from multiple works of a single Byzantine author, John Chrysostom. By gradually reducing the number of manually transcribed texts and by training mixed models from multiple manuscripts, economic transcriptions of large bodies of manuscripts (in the hundreds) can be achieved. Results of these experiments show that if the right combination of manuscripts is selected, and with the transfer-learning tools provided by Transkribus, the required training sets can be reduced by up to 80%. Certain peculiarities of Greek manuscripts, which lead to easy automated cleaning of resulting transcriptions, could further improve these results. The ultimate goal of these experiments is to produce a transcription with the minimum required accuracy (and therefore the minimum manual input) for text clustering. If we can accurately assess HTR learning and outcomes, we may find that less data could be enough. This case study proposes a solution for researching/editing authors and works that were popular enough to survive in hundreds (if not thousands) of manuscripts and are, therefore, unfeasible to be evaluated by humans.
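The abstract's stated end goal is clustering imperfect transcriptions so that related manuscripts can be grouped without fully accurate HTR output. Purely as an illustration of that idea, and not the paper's own pipeline (which works from Transkribus output), the sketch below clusters hypothetical noisy transcriptions hierarchically using a normalised edit-similarity distance; the manuscript sigla, text strings, and cut-off threshold are all made-up placeholders.

# Illustrative sketch only (not the paper's code): hierarchical clustering of
# noisy HTR transcriptions by normalised edit distance. The manuscript sigla
# and text snippets below are hypothetical placeholders.
from difflib import SequenceMatcher

import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

# Hypothetical HTR output, one (imperfect) transcription per manuscript.
transcriptions = {
    "ms_A": "εν αρχη ην ο λογος και ο λογος ην προς τον θεον",
    "ms_B": "εν αρχη ην ο λογοσ και ο λογος ην προσ τον θεον",
    "ms_C": "εν αρχηι ην ο λογος και ο λογος ην μετα του θεου",
}

names = list(transcriptions)
n = len(names)
dist = np.zeros((n, n))
for i in range(n):
    for j in range(i + 1, n):
        # 1 - similarity ratio as a crude distance; tolerant of HTR noise
        # because isolated character errors only shift the ratio slightly.
        ratio = SequenceMatcher(None, transcriptions[names[i]],
                                transcriptions[names[j]]).ratio()
        dist[i, j] = dist[j, i] = 1.0 - ratio

# Average-linkage hierarchical clustering on the condensed distance matrix;
# the 0.2 cut-off is an arbitrary example threshold.
Z = linkage(squareform(dist), method="average")
labels = fcluster(Z, t=0.2, criterion="distance")
print(dict(zip(names, labels)))

With distances of this kind, manuscripts whose HTR transcriptions differ only by recognition noise fall into the same cluster, while genuinely divergent text variants separate, which is the property the study relies on when it accepts less-than-perfect accuracy.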
Author Perdiki, Elpida
ContentType Journal Article
Copyright Attribution
DOI 10.46298/jdmdh.10419
Discipline Computer Science
EISSN 2416-5999
IsDoiOpenAccess true
IsOpenAccess true
IsPeerReviewed true
IsScholarly true
Issue Sciences of Antiquity and...
Keywords deep learning
big data
HTR models
Byzantine manuscripts
Transkribus
Language English
License Attribution (CC BY 4.0): https://creativecommons.org/licenses/by/4.0
ORCID 0000-0002-0762-1577
OpenAccessLink https://doaj.org/article/375e74ed01684f0a87070b4a387c94c4
PublicationCentury 2000
PublicationDate 2023-12-20
PublicationTitle Journal of data mining and digital humanities
PublicationYear 2023
Publisher INRIA
Nicolas Turenne
SubjectTerms [info.info-tt]computer science [cs]/document and text processing
[shs.stat]humanities and social sciences/methods and statistics
[shs]humanities and social sciences
big data
byzantine manuscripts
Computer Science
deep learning
Document and Text Processing
htr models
Humanities and Social Sciences
Methods and statistics
transkribus
Title Preparing Big Manuscript Data for Hierarchical Clustering with Minimal HTR Training
URI https://hal.science/hal-03880102
https://doaj.org/article/375e74ed01684f0a87070b4a387c94c4
Volume Historical Documents and...