An End-to-End Approach for Extracting and Segmenting High-Variance References from PDF Documents
| Published in: | 2019 ACM/IEEE Joint Conference on Digital Libraries (JCDL), pp. 186-195 |
|---|---|
| Main Authors: | Boukhers, Zeyd; Ambhore, Shriharsh; Staab, Steffen |
| Format: | Conference Proceeding |
| Language: | English |
| Published: | IEEE, 01.06.2019 |
| Subjects: | Conditional Random Fields; Random Forest; Reference Extraction; Reference Segmentation |
| Abstract | This paper addresses the problem of extracting and segmenting references from PDF documents. The novelty of the presented approach lies in its capability to discover highly varying references mainly in terms of content, length and location in the document. Unlike existing works, the proposed method does not follow the classical pipeline that consists of sequential phases. It rather learns the different characteristics of references to be used in a coherent scheme that reduces the error accumulation by following a probabilistic approach. Contrary to conventional references, mentioning the sources of information in some publications, such as those of social science, is not subject to the same specifications such as being located in a unique reference section. Therefore, the proposed method aims to extract references of highly varying reference characteristics by relaxing the restrictions of existing methods. Additionally, we present in this paper a new challenging dataset of annotated references in German social science publications. The main purpose of this work is to serve the indexation of missing references by extracting them from challenging publications such as those of German social science. The effectiveness of the presented methods in terms of both extraction and segmentation is evaluated on different datasets, including the German social science set. |
|---|---|
| Author | 1. Boukhers, Zeyd (University of Koblenz-Landau); 2. Ambhore, Shriharsh (University of Koblenz-Landau); 3. Staab, Steffen (University of Koblenz-Landau) |
| CODEN | IEEPAD |
| ContentType | Conference Proceeding |
| DOI | 10.1109/JCDL.2019.00035 |
| EISBN | 1728115477 9781728115474 |
| EndPage | 195 |
| ExternalDocumentID | 8791225 |
| Genre | orig-research |
| ISICitedReferencesCount | 5 |
| IsPeerReviewed | false |
| IsScholarly | false |
| Language | English |
| PageCount | 10 |
| PublicationCentury | 2000 |
| PublicationDate | 2019-Jun |
| PublicationDateYYYYMMDD | 2019-06-01 |
| PublicationDate_xml | – month: 06 year: 2019 text: 2019-Jun |
| PublicationDecade | 2010 |
| PublicationTitle | 2019 ACM/IEEE Joint Conference on Digital Libraries (JCDL) |
| PublicationTitleAbbrev | JCDL |
| PublicationYear | 2019 |
| Publisher | IEEE |
| Publisher_xml | – name: IEEE |
| StartPage | 186 |
| SubjectTerms | Conditional Random Fields; Random Forest; Reference Extraction; Reference Segmentation |
| Title | An End-to-End Approach for Extracting and Segmenting High-Variance References from PDF Documents |
| URI | https://ieeexplore.ieee.org/document/8791225 |