The Ethicality of Web Crawlers

Search engines largely rely on web crawlers to collect information from the web. This has led to an enormous amount of web traffic generated by crawlers alone. To minimize negative aspects of this traffic on websites, the behaviors of crawlers may be regulated at an individual web server by implemen...

Bibliographic Details
Published in: 2010 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology Vol. 1; pp. 668-675
Main Authors: Sun, Yang; Councill, Isaac G.; Giles, C. Lee
Format: Conference Proceeding
Language: English
Published: IEEE 01.08.2010
ISBN: 9781424484829, 1424484820
Abstract: Search engines largely rely on web crawlers to collect information from the web. This has led to an enormous amount of web traffic generated by crawlers alone. To minimize the negative effects of this traffic on websites, the behavior of crawlers may be regulated at an individual web server by implementing the Robots Exclusion Protocol in a file called "robots.txt". Although not an official standard, the Robots Exclusion Protocol has been adopted to a greater or lesser extent by nearly all commercial search engines and popular crawlers. As many website administrators and policy makers have come to rely on the informal contract set forth by the Robots Exclusion Protocol, the degree to which web crawlers respect robots.txt policies has become an important issue of computer ethics. In this research, we investigate and define rules to measure crawler ethics, that is, the extent to which web crawlers respect the regulations set forth in robots.txt configuration files. We test the ethical behavior of web crawlers by deploying a crawler honeypot: a set of websites, each configured with a distinct regulation specification using the Robots Exclusion Protocol, in order to capture specific behaviors of web crawlers. We propose a vector space model to represent crawler behavior and a set of models to measure the ethics of web crawlers based on their behaviors. The results show that ethicality scores vary significantly among crawlers. Most commercial web crawlers receive fairly low ethicality violation scores, which means that most of their behavior is ethical; however, many commercial crawlers still consistently violate or misinterpret certain robots.txt rules.
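Illustrative note: the abstract describes checking crawler requests against robots.txt rules and summarizing the behavior as a vector. The record does not give the authors' actual rules or scoring models, so the following is only a minimal sketch under stated assumptions. The robots.txt content, the crawler names ("politebot", "examplebot"), and the helper functions are hypothetical, and the score shown (the fraction of requests that hit a disallowed path) is a simple stand-in, not the paper's measure. It uses Python's standard urllib.robotparser module.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for one honeypot site (not taken from the paper).
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Disallow: /tmp/

User-agent: examplebot
Disallow: /
"""


def build_parser(robots_txt):
    """Parse robots.txt text into a RobotFileParser without any network access."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def violation_vector(parser, user_agent, requested_paths):
    """Return a 0/1 list: 1 where the crawler requested a disallowed path."""
    return [
        0 if parser.can_fetch(user_agent, path) else 1
        for path in requested_paths
    ]


def violation_score(vector):
    """Fraction of requests that broke a rule (an assumed, illustrative measure)."""
    return sum(vector) / len(vector) if vector else 0.0


if __name__ == "__main__":
    parser = build_parser(ROBOTS_TXT)
    # Paths a hypothetical crawler was observed requesting in the server log.
    observed = ["/index.html", "/private/data.html", "/tmp/x"]
    vector = violation_vector(parser, "politebot", observed)
    print(vector, violation_score(vector))
```

In a honeypot setting, the observed paths would come from each site's access log, grouped by the user-agent string of the requesting crawler; one such vector per crawler is the kind of representation a vector space model could then compare.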
Author Sun, Yang
Councill, Isaac G.
Giles, C. Lee
Author_xml – sequence: 1
  givenname: Yang
  surname: Sun
  fullname: Sun, Yang
  email: yang.sun@corp.aol.com
  organization: AOL Res., Mountain View, CA, USA
– sequence: 2
  givenname: Isaac G.
  surname: Councill
  fullname: Councill, Isaac G.
  email: icouncill@gmail.com
  organization: Google Inc., New York, NY, USA
– sequence: 3
  givenname: C. Lee
  surname: Giles
  fullname: Giles, C. Lee
  email: giles@ist.psu.edu
  organization: Coll. of Inf. Sci. & Technol., Pennsylvania State Univ., University Park, PA, USA
ContentType Conference Proceeding
DOI 10.1109/WI-IAT.2010.316
DatabaseName IEEE Electronic Library (IEL) Conference Proceedings
IEEE Xplore POP ALL
IEEE Xplore All Conference Proceedings
IEEE/IET Electronic Library (IEL) (UW System Shared)
IEEE Proceedings Order Plans (POP All) 1998-Present
DatabaseTitleList
Database_xml – sequence: 1
  dbid: RIE
  name: IEEE/IET Electronic Library
  url: https://ieeexplore.ieee.org/
  sourceTypes: Publisher
DeliveryMethod fulltext_linktorsrc
Discipline Computer Science
EISBN 9780769541914
0769541917
EndPage 675
ExternalDocumentID 5616518
Genre orig-research
IEDL.DBID RIE
ISBN 9781424484829
1424484820
IngestDate Wed Sep 03 07:11:07 EDT 2025
IsPeerReviewed false
IsScholarly false
Language English
LinkModel DirectLink
PageCount 8
ParticipantIDs ieee_primary_5616518
PublicationCentury 2000
PublicationDate 2010-Aug.
PublicationDateYYYYMMDD 2010-08-01
PublicationDate_xml – month: 08
  year: 2010
  text: 2010-Aug.
PublicationDecade 2010
PublicationTitle 2010 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology
PublicationTitleAbbrev wi-iat
PublicationYear 2010
Publisher IEEE
Publisher_xml – name: IEEE
SourceID ieee
SourceType Publisher
StartPage 668
SubjectTerms Costs
Crawlers
ethicality
Ethics
Guidelines
Privacy
Regulation
Robots
robots.txt
Search engines
Vectors
web crawler ethics
Title The Ethicality of Web Crawlers
URI https://ieeexplore.ieee.org/document/5616518
Volume 1