Combining Morphological and Histogram based Text Line Segmentation in the OCR Context

Bibliographic details
Published in: Journal of data mining and digital humanities, Volume 2021, Issue HistoInformatics
Main author: Schneider, Pit
Format: Journal Article
Language: English
Published: Nicolas Turenne, 4 November 2021
ISSN: 2416-5999
Abstract: Text line segmentation is one of the preprocessing stages of modern optical character recognition (OCR) systems. The algorithmic approach proposed in this paper was designed for exactly this purpose. Its main characteristic is the combination of two different techniques: morphological image operations and horizontal histogram projections. The method was developed for a historic data collection that commonly exhibits quality issues such as degraded paper, blurred text, or noise. For that reason, the segmenter could be of particular interest to cultural institutions that want robust line bounding boxes for historic documents. Because of its promising segmentation results and low computational cost, the algorithm was incorporated into the OCR pipeline of the National Library of Luxembourg as part of the initiative to reprocess its historic newspaper collection. The general contribution of this paper is to outline the approach and to evaluate its gains in accuracy and speed compared with the segmentation algorithm bundled with the open-source OCR software used.
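The abstract names two techniques, morphological image operations and horizontal histogram projections, without giving details in this record. The following is only a minimal sketch of how such a combination can work in general, not the paper's algorithm: it assumes a binary image stored as nested lists (1 = ink, 0 = background), and the function names, the `radius` parameter, and the `min_ink` threshold are all hypothetical choices made for illustration.

```python
from typing import List, Tuple

Image = List[List[int]]  # binary image: 1 = ink, 0 = background (assumed format)


def dilate_horizontal(img: Image, radius: int) -> Image:
    """Morphological dilation with a 1 x (2*radius+1) horizontal structuring element."""
    out = []
    for row in img:
        width = len(row)
        out.append([
            1 if any(row[max(0, x - radius):min(width, x + radius + 1)]) else 0
            for x in range(width)
        ])
    return out


def horizontal_projection(img: Image) -> List[int]:
    """Count ink pixels in each row (the horizontal histogram)."""
    return [sum(row) for row in img]


def segment_lines(img: Image, radius: int = 2, min_ink: int = 1) -> List[Tuple[int, int]]:
    """Return (top, bottom) row ranges, bottom exclusive, for each detected line."""
    # Dilate first so glyphs smear into line-shaped blobs, then project.
    profile = horizontal_projection(dilate_horizontal(img, radius))
    lines, start = [], None
    for y, count in enumerate(profile):
        if count >= min_ink and start is None:
            start = y  # a run of inked rows begins
        elif count < min_ink and start is not None:
            lines.append((start, y))  # the run ends at the first empty row
            start = None
    if start is not None:
        lines.append((start, len(profile)))
    return lines


# Hypothetical 7 x 8 page with two text lines separated by blank rows.
page = [
    [0, 0, 0, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 1, 0, 1, 0],
    [0, 1, 1, 0, 1, 0, 1, 0],
    [0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 1, 0, 0, 1, 1, 0],
    [0, 0, 0, 0, 0, 0, 0, 0],
]
print(segment_lines(page))  # [(1, 3), (5, 6)]
```

On real scans, `min_ink` would typically need to sit above the noise level, and the row ranges would be paired with column extents to form bounding boxes; both steps are left out of this sketch.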
DOI: 10.46298/jdmdh.7277
Discipline: Computer Science
Open access: yes
Peer reviewed: yes
Scholarly: yes
License: https://creativecommons.org/licenses/by/4.0
Author ORCID: 0000-0001-9034-1551
Open access link: https://doaj.org/article/a17a5ed1a8304276a9c7e6a7803b9528
Subjects: computer science - computer vision and pattern recognition; i.4.6