An End-to-End Approach for Extracting and Segmenting High-Variance References from PDF Documents


Bibliographic Details
Published in:	2019 ACM/IEEE Joint Conference on Digital Libraries (JCDL), pp. 186-195
Main Authors:	Boukhers, Zeyd; Ambhore, Shriharsh; Staab, Steffen
Format:	Conference Proceeding
Language:	English
Published:	IEEE 01.06.2019
Subjects:	Conditional Random Fields; Random Forest; Reference Extraction; Reference Segmentation

Abstract	This paper addresses the problem of extracting and segmenting references from PDF documents. The novelty of the presented approach lies in its ability to discover references that vary widely, mainly in content, length, and location in the document. Unlike existing work, the proposed method does not follow the classical pipeline of sequential phases. Instead, it learns the different characteristics of references and uses them in a coherent probabilistic scheme that reduces error accumulation. In some publications, such as those in social science, sources are not cited according to conventional specifications, for example in a single dedicated reference section. The proposed method therefore aims to extract references with highly varying characteristics by relaxing the restrictions of existing methods. Additionally, the paper presents a new, challenging dataset of annotated references in German social science publications. The main purpose of this work is to support the indexing of missing references by extracting them from challenging publications such as those of German social science. The effectiveness of the presented methods, in terms of both extraction and segmentation, is evaluated on different datasets, including the German social science set.
Author	Ambhore, Shriharsh; Staab, Steffen; Boukhers, Zeyd
Author_xml	– sequence: 1 givenname: Zeyd surname: Boukhers fullname: Boukhers, Zeyd organization: University of Koblenz-Landau – sequence: 2 givenname: Shriharsh surname: Ambhore fullname: Ambhore, Shriharsh organization: University of Koblenz-Landau – sequence: 3 givenname: Steffen surname: Staab fullname: Staab, Steffen organization: University of Koblenz-Landau
CODEN	IEEPAD
ContentType	Conference Proceeding
DBID	6IE 6IL CBEJK RIE RIL
DOI	10.1109/JCDL.2019.00035
DatabaseName	IEEE Electronic Library (IEL) Conference Proceedings IEEE Xplore POP ALL IEEE Xplore All Conference Proceedings IEEE/IET Electronic Library IEEE Proceedings Order Plans (POP All) 1998-Present
DatabaseTitleList
Database_xml	– sequence: 1 dbid: RIE name: IEEE Electronic Library (IEL) url: https://ieeexplore.ieee.org/ sourceTypes: Publisher
DeliveryMethod	fulltext_linktorsrc
EISBN	1728115477 9781728115474
EndPage	195
ExternalDocumentID	8791225
Genre	orig-research
GroupedDBID	6IE 6IL ACM ALMA_UNASSIGNED_HOLDINGS APO CBEJK GUFHI LHSKQ RIE RIL
ID	FETCH-LOGICAL-a247t-3d9991f5f8f0d6eb26e76a3570668b3f95c4a9f2ce71d69caff4fa708c69f4123
IEDL.DBID	RIE
ISICitedReferencesCount	5
ISICitedReferencesURI	http://www.webofscience.com/api/gateway?GWVersion=2&SrcApp=Summon&SrcAuth=ProQuest&DestLinkType=CitingArticles&DestApp=WOS_CPL&KeyUT=000555928200028&url=https%3A%2F%2Fcvtisr.summon.serialssolutions.com%2F%23%21%2Fsearch%3Fho%3Df%26include.ft.matches%3Dt%26l%3Dnull%26q%3D
IngestDate	Wed Aug 27 02:54:29 EDT 2025
IsPeerReviewed	false
IsScholarly	false
Language	English
LinkModel	DirectLink
MergedId	FETCHMERGED-LOGICAL-a247t-3d9991f5f8f0d6eb26e76a3570668b3f95c4a9f2ce71d69caff4fa708c69f4123
PageCount	10
ParticipantIDs	ieee_primary_8791225
PublicationCentury	2000
PublicationDate	2019-Jun
PublicationDateYYYYMMDD	2019-06-01
PublicationDate_xml	– month: 06 year: 2019 text: 2019-Jun
PublicationDecade	2010
PublicationTitle	2019 ACM/IEEE Joint Conference on Digital Libraries (JCDL)
PublicationTitleAbbrev	JCDL
PublicationYear	2019
Publisher	IEEE
Publisher_xml	– name: IEEE
SSID	ssib040743256
Score	1.787703
SourceID	ieee
SourceType	Publisher
StartPage	186
SubjectTerms	Conditional Random Fields; Random Forest; Reference Extraction; Reference Segmentation
Title	An End-to-End Approach for Extracting and Segmenting High-Variance References from PDF Documents
URI	https://ieeexplore.ieee.org/document/8791225
WOSCitedRecordID	wos000555928200028&url=https%3A%2F%2Fcvtisr.summon.serialssolutions.com%2F%23%21%2Fsearch%3Fho%3Df%26include.ft.matches%3Dt%26l%3Dnull%26q%3D
hasFullText	1
inHoldings	1
isFullTextHit
isPrint
linkProvider	IEEE