Combining Morphological and Histogram based Text Line Segmentation in the OCR Context
| Published in: | Journal of data mining and digital humanities, Vol. 2021, issue HistoInformatics |
|---|---|
| Main author: | Schneider, Pit |
| Format: | Journal Article |
| Language: | English |
| Published: | Nicolas Turenne, 04.11.2021 |
| Subjects: | |
| ISSN: | 2416-5999 |
| Online access: | Get full text |
| Abstract | Text line segmentation is one of the preliminary stages of modern optical character recognition systems. The algorithmic approach proposed in this paper has been designed for this exact purpose. Its main characteristic is the combination of two different techniques: morphological image operations and horizontal histogram projections. The method was developed to be applied to a historic data collection that commonly features quality issues, such as degraded paper, blurred text, or the presence of noise. For that reason, the segmenter in question could be of particular interest to cultural institutions that want access to robust line bounding boxes for a given historic document. Because of the promising segmentation results, combined with low computational cost, the algorithm was incorporated into the OCR pipeline of the National Library of Luxembourg as part of the initiative to reprocess its historic newspaper collection. The general contribution of this paper is to outline the approach and to evaluate the gains in accuracy and speed, comparing it to the segmentation algorithm bundled with the open source OCR software used. |
|---|---|
| Author | Schneider, Pit |
| ContentType | Journal Article |
| DOI | 10.46298/jdmdh.7277 |
| Discipline | Computer Science |
| EISSN | 2416-5999 |
| ISSN | 2416-5999 |
| IsDoiOpenAccess | true |
| IsOpenAccess | true |
| IsPeerReviewed | true |
| IsScholarly | true |
| Issue | HistoInformatics |
| Language | English |
| License | https://creativecommons.org/licenses/by/4.0 |
| ORCID | 0000-0001-9034-1551 |
| OpenAccessLink | https://doaj.org/article/a17a5ed1a8304276a9c7e6a7803b9528 |
| PublicationDate | 2021-11-04 |
| PublicationTitle | Journal of data mining and digital humanities |
| PublicationYear | 2021 |
| Publisher | Nicolas Turenne |
| SubjectTerms | Computer Science - Computer Vision and Pattern Recognition; I.4.6 |
| Title | Combining Morphological and Histogram based Text Line Segmentation in the OCR Context |
| URI | https://doaj.org/article/a17a5ed1a8304276a9c7e6a7803b9528 |
| Volume | 2021 |
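
The abstract names two building blocks, morphological image operations and horizontal histogram projections, without giving details in this record. The following is only a minimal sketch of how such a combination can work in general, not the algorithm from the paper. It assumes a binarized page given as a list of rows of 0/1 values (1 = ink); the function names, the `radius` and `threshold` parameters, and the toy page are all hypothetical.

```python
def dilate_horizontal(img, radius):
    """Binary dilation with a 1 x (2*radius + 1) structuring element.

    Smears ink sideways so that sparse rows still produce a strong
    row sum in the projection profile.
    """
    width = len(img[0])
    return [
        [1 if any(row[max(0, x - radius):min(width, x + radius + 1)]) else 0
         for x in range(width)]
        for row in img
    ]


def row_profile(img):
    """Horizontal projection: number of ink pixels in each row."""
    return [sum(row) for row in img]


def find_bands(profile, threshold=0):
    """Return (top, bottom) row ranges whose profile exceeds threshold."""
    bands, start = [], None
    for y, value in enumerate(profile):
        if value > threshold and start is None:
            start = y
        elif value <= threshold and start is not None:
            bands.append((start, y - 1))
            start = None
    if start is not None:
        bands.append((start, len(profile) - 1))
    return bands


def segment_lines(img, radius=2, threshold=0):
    """Return line boxes as (left, top, right, bottom), inclusive."""
    smeared = dilate_horizontal(img, radius)
    boxes = []
    for top, bottom in find_bands(row_profile(smeared), threshold):
        # Measure the horizontal extent on the original, unsmeared image.
        cols = [x for y in range(top, bottom + 1)
                for x, v in enumerate(img[y]) if v]
        if cols:
            boxes.append((min(cols), top, max(cols), bottom))
    return boxes


# Hypothetical toy page: two text lines separated by an empty row.
page = [
    [0, 0, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0, 1, 1, 0],
    [0, 1, 0, 0, 0, 0, 1, 0],
    [0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 1, 1, 1, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0],
]
boxes = segment_lines(page)
print(boxes)
```

In this sketch the dilation matters once a noise threshold is used: with `threshold=2`, the sparse row 2 of the toy page would fall below the threshold on the raw profile, but survives after smearing. A real pipeline would more likely rely on an image library for the morphology step and would need to handle skew and multi-column layouts, which this sketch ignores.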