An End-to-End Approach for Extracting and Segmenting High-Variance References from PDF Documents

Bibliographic Details
Published in: 2019 ACM/IEEE Joint Conference on Digital Libraries (JCDL), pp. 186-195
Main Authors: Boukhers, Zeyd; Ambhore, Shriharsh; Staab, Steffen
Format: Conference Proceeding
Language: English
Published: IEEE 01.06.2019
Abstract: This paper addresses the problem of extracting and segmenting references from PDF documents. The novelty of the presented approach lies in its ability to discover references that vary widely, mainly in content, length, and location in the document. Unlike existing work, the proposed method does not follow the classical pipeline of sequential phases. Instead, it learns the different characteristics of references and uses them in a coherent probabilistic scheme that reduces error accumulation. In some publications, such as those in the social sciences, sources are not cited according to the usual conventions, for example in a single dedicated reference section. The proposed method therefore aims to extract references with highly varying characteristics by relaxing the restrictions of existing methods. Additionally, we present a new, challenging dataset of annotated references from German social science publications. The main purpose of this work is to support the indexing of missing references by extracting them from challenging publications such as those of German social science. The effectiveness of the presented method in both extraction and segmentation is evaluated on several datasets, including the German social science set.
Author_xml – sequence: 1
  givenname: Zeyd
  surname: Boukhers
  fullname: Boukhers, Zeyd
  organization: University of Koblenz-Landau
– sequence: 2
  givenname: Shriharsh
  surname: Ambhore
  fullname: Ambhore, Shriharsh
  organization: University of Koblenz-Landau
– sequence: 3
  givenname: Steffen
  surname: Staab
  fullname: Staab, Steffen
  organization: University of Koblenz-Landau
CODEN IEEPAD
ContentType Conference Proceeding
DOI 10.1109/JCDL.2019.00035
EISBN 1728115477
9781728115474
EndPage 195
ExternalDocumentID 8791225
Genre orig-research
ISICitedReferencesCount 5
IsPeerReviewed false
IsScholarly false
Language English
LinkModel DirectLink
PageCount 10
PublicationCentury 2000
PublicationDate 2019-Jun
PublicationDateYYYYMMDD 2019-06-01
PublicationDate_xml – month: 06
  year: 2019
  text: 2019-Jun
PublicationDecade 2010
PublicationTitle 2019 ACM/IEEE Joint Conference on Digital Libraries (JCDL)
PublicationTitleAbbrev JCDL
PublicationYear 2019
Publisher IEEE
Publisher_xml – name: IEEE
SourceID ieee
SourceType Publisher
StartPage 186
SubjectTerms Conditional Random Fields
Random Forest
Reference Extraction
Reference Segmentation
Title An End-to-End Approach for Extracting and Segmenting High-Variance References from PDF Documents
URI https://ieeexplore.ieee.org/document/8791225