A Method of Web Information Automatic Extraction Based on XML

Bibliographic details
Published in: Applied Mechanics and Materials, Vol. 20-23, pp. 178-183
Main authors: Zhang, Na; Gu, Jun Hua; Song, Jie; Liu, Yan Liu
Format: Journal Article
Language: English
Publication details: Zurich: Trans Tech Publications Ltd, 01.01.2010
ISBN: 0878492879, 9780878492879
ISSN: 1660-9336, 1662-7482
Abstract: As the internet grows faster and the amount of data it holds increases, users find it more and more difficult to obtain useful information from the Web. Extracting accurate information from the Web efficiently has become an urgent problem, and Web information extraction technology has emerged to address it. The proposed XML-based method for automatic Web information extraction standardizes the HTML document using a data translation algorithm, builds an extraction rule base by learning the XPath expressions of sample pages, and uses that rule base to automatically extract data from pages of the same kind. The results show that this approach should lead to higher recall and precision ratios, and that the extracted result should be self-describing, which makes it convenient to build data extraction systems for individual domains.
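The three steps named in the abstract (standardizing HTML, learning XPath expressions from samples, and applying the rule base to pages of the same kind) can be illustrated with a minimal Python sketch. This is not the authors' algorithm: the record does not describe their data translation or learning procedures, so every name here (`HtmlToXml`, `standardize`, `learn_xpath`, `extract`, the `document` and `record` element names, and the sample pages) is a hypothetical stand-in, and the "learning" is reduced to recording the positional path of a node whose text matches a labeled sample value.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Assumption: a small set of HTML void elements that never get a closing tag.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class HtmlToXml(HTMLParser):
    """Hypothetical stand-in for the HTML standardization step:
    builds a well-formed element tree from possibly sloppy HTML."""

    def __init__(self):
        super().__init__()
        self.root = ET.Element("document")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        elem = ET.SubElement(self.stack[-1], tag, {k: v or "" for k, v in attrs})
        if tag not in VOID_TAGS:
            self.stack.append(elem)

    def handle_endtag(self, tag):
        # Close back to the nearest matching open tag; ignore stray end tags.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        current = self.stack[-1]
        if len(current):  # text after a child element belongs to its tail
            current[-1].tail = (current[-1].tail or "") + text
        else:
            current.text = (current.text or "") + text


def standardize(html):
    """Parse HTML and return the root of a well-formed XML tree."""
    parser = HtmlToXml()
    parser.feed(html)
    parser.close()
    return parser.root


def learn_xpath(root, sample_text):
    """Return a positional path to the first element whose text equals
    the labeled sample value, or None if no element matches."""
    parent_of = {child: parent for parent in root.iter() for child in parent}
    for elem in root.iter():
        if (elem.text or "").strip() == sample_text:
            steps = []
            while elem is not root:
                parent = parent_of[elem]
                position = parent.findall(elem.tag).index(elem) + 1
                steps.append(f"{elem.tag}[{position}]")
                elem = parent
            return "./" + "/".join(reversed(steps))
    return None


def extract(root, rules):
    """Apply the rule base (field name -> path) and return a
    self-describing XML record."""
    record = ET.Element("record")
    for field, xpath in rules.items():
        node = root.find(xpath)
        value = node.text.strip() if node is not None and node.text else ""
        ET.SubElement(record, field).text = value
    return record


# Hypothetical sample page with labeled values, and a second page of the same kind.
sample_page = "<html><body><div class='item'><h2>Blue Kettle</h2><span>19.90</span><br></div></body></html>"
other_page = "<html><body><div class='item'><h2>Red Toaster</h2><span>34.50</span><br></div></body></html>"

sample_root = standardize(sample_page)
rules = {
    "name": learn_xpath(sample_root, "Blue Kettle"),
    "price": learn_xpath(sample_root, "19.90"),
}
record = extract(standardize(other_page), rules)
print(rules)
print(ET.tostring(record, encoding="unicode"))
```

The sketch uses only the Python standard library, whose `xml.etree.ElementTree` supports a limited XPath subset including positional predicates such as `div[1]`. Purely positional paths are brittle when page layouts vary, which is presumably why a learned rule base over several samples is needed in practice.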
Author_xml – givenname: Na
  surname: Zhang
  fullname: Zhang, Na
  email: zhang_na00@163.com
  organization: Hebei University of Technology : School of Computer Science and Engineering
– givenname: Jun Hua
  surname: Gu
  fullname: Gu, Jun Hua
  email: jhgu@hebut.edu.cn
  organization: Hebei University of Technology : School of Computer Science and Engineering
– givenname: Jie
  surname: Song
  fullname: Song, Jie
  email: songjie@scse.hebut.edu.cn
  organization: Hebei University of Technology : School of Computer Science and Engineering
– givenname: Yan Liu
  surname: Liu
  fullname: Liu, Yan Liu
  email: hbsdliuyanliu@163.com
  organization: Hebei University of Technology : School of Computer Science and Engineering
Cites_doi 10.1109/icdsc.2001.918966
ContentType Journal Article
Copyright 2010 Trans Tech Publications Ltd
Copyright Trans Tech Publications Ltd. Jan 2010
Copyright_xml – notice: 2010 Trans Tech Publications Ltd
– notice: Copyright Trans Tech Publications Ltd. Jan 2010
DOI 10.4028/www.scientific.net/AMM.20-23.178
Discipline Engineering
EISSN 1662-7482
EndPage 183
ISBN 0878492879
9780878492879
ISICitedReferencesCount 0
ISSN 1660-9336
1662-7482
IsPeerReviewed true
IsScholarly true
Keywords XSL
Information Extraction
XPath Learning
XML
Language English
License https://www.scientific.net/PolicyAndEthics/PublishingPolicies
https://www.scientific.net/license/TDM_Licenser.pdf
Notes Selected, peer reviewed papers from the 2010 International Conference on Information Technology for Manufacturing Systems (ITMS 2010), Macao, China, Jan. 30-31, 2010
PageCount 6
PublicationCentury 2000
PublicationDate 2010-01-01
PublicationDateYYYYMMDD 2010-01-01
PublicationDate_xml – month: 01
  year: 2010
  text: 2010-01-01
  day: 01
PublicationDecade 2010
PublicationPlace Zurich
PublicationPlace_xml – name: Zurich
PublicationTitle Applied Mechanics and Materials
PublicationYear 2010
Publisher Trans Tech Publications Ltd
Publisher_xml – name: Trans Tech Publications Ltd
StartPage 178
Title A Method of Web Information Automatic Extraction Based on XML
URI https://www.scientific.net/AMM.20-23.178
https://www.proquest.com/docview/1443807553
Volume 20-23