The Ethicality of Web Crawlers

Bibliographic Details
Published in:	2010 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology Vol. 1; pp. 668 - 675
Main Authors:	Sun, Yang; Councill, Isaac G.; Giles, C. Lee
Format:	Conference Proceeding
Language:	English
Published:	IEEE, 1 August 2010
Subjects:	Costs; Crawlers; ethicality; Ethics; Guidelines; Privacy; Regulation; Robots; robots.txt; Search engines; Vectors; web crawler ethics
ISBN:	9781424484829, 1424484820

Abstract	Search engines largely rely on web crawlers to collect information from the web. This has led to an enormous amount of web traffic generated by crawlers alone. To minimize negative aspects of this traffic on websites, the behaviors of crawlers may be regulated at an individual web server by implementing the Robots Exclusion Protocol in a file called "robots.txt". Although not an official standard, the Robots Exclusion Protocol has been adopted to a greater or lesser extent by nearly all commercial search engines and popular crawlers. As many website administrators and policy makers have come to rely on the informal contract set forth by the Robots Exclusion Protocol, the degree to which web crawlers respect robots.txt policies has become an important issue of computer ethics. In this research, we investigate and define rules to measure crawler ethics, referring to the extent to which web crawlers respect the regulations set forth in robots.txt configuration files. We test the behaviors of web crawlers in terms of ethics by deploying a crawler honeypot: a set of websites where each site is configured with a distinct regulation specification using the Robots Exclusion Protocol in order to capture specific behaviors of web crawlers. We propose a vector space model to represent crawler behavior and a set of models to measure the ethics of web crawlers based on their behaviors. The results show that ethicality scores vary significantly among crawlers. Most commercial web crawlers receive fairly low ethicality violation scores, which means most of the crawlers' behaviors are ethical; however, many commercial crawlers still consistently violate or misinterpret certain robots.txt rules.
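The abstract describes representing crawler behavior as a vector and scoring violations of robots.txt rules, but it does not give the formulas. The following is only a rough illustration of that idea, not the authors' model: the two vector components (disallowed fetches, crawl-delay violations), the weights, the sample robots.txt, the user-agent name "ExampleBot", and the access log are all assumptions made up for this sketch. It uses Python's standard urllib.robotparser module.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for one honeypot site (not taken from the paper).
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 10
"""


def violation_vector(robots_txt, user_agent, access_log):
    """Count rule violations for one crawler.

    access_log is a list of (timestamp_in_seconds, path) tuples sorted by time.
    Returns [disallowed_fetches, crawl_delay_violations]; this two-component
    layout is an assumption for illustration only.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())

    # crawl_delay() returns None when no Crawl-delay rule applies.
    delay = parser.crawl_delay(user_agent) or 0

    # Requests for paths that robots.txt disallows for this user agent.
    disallowed = sum(
        1 for _, path in access_log if not parser.can_fetch(user_agent, path)
    )

    # Consecutive requests that arrive sooner than the requested delay.
    too_fast = sum(
        1
        for (t0, _), (t1, _) in zip(access_log, access_log[1:])
        if t1 - t0 < delay
    )
    return [disallowed, too_fast]


def violation_score(vector, weights):
    """Weighted sum of violation counts; the weights are hypothetical."""
    return sum(w * v for w, v in zip(weights, vector))


if __name__ == "__main__":
    log = [(0, "/index.html"), (3, "/private/a.html"), (20, "/about.html")]
    vec = violation_vector(ROBOTS_TXT, "ExampleBot", log)
    print(vec, violation_score(vec, [1.0, 0.5]))
```

Under these assumptions, a higher score means more violations, which matches the abstract's statement that low ethicality violation scores indicate mostly ethical behavior. How violations are actually weighted in the paper is not stated in this record.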
Author	Sun, Yang Councill, Isaac G. Giles, C. Lee
Author_xml	– sequence: 1 givenname: Yang surname: Sun fullname: Sun, Yang email: yang.sun@corp.aol.com organization: AOL Res., Mountain View, CA, USA – sequence: 2 givenname: Isaac G. surname: Councill fullname: Councill, Isaac G. email: icouncill@gmail.com organization: Google Inc., New York, NY, USA – sequence: 3 givenname: C. Lee surname: Giles fullname: Giles, C. Lee email: giles@ist.psu.edu organization: Coll. of Inf. Sci. & Technol., Pennsylvania State Univ., University Park, PA, USA
ContentType	Conference Proceeding
DBID	6IE 6IL CBEJK RIE RIL
DOI	10.1109/WI-IAT.2010.316
DatabaseName	IEEE Electronic Library (IEL) Conference Proceedings IEEE Xplore POP ALL IEEE Xplore All Conference Proceedings IEEE/IET Electronic Library (IEL) (UW System Shared) IEEE Proceedings Order Plans (POP All) 1998-Present
DatabaseTitleList
Database_xml	– sequence: 1 dbid: RIE name: IEEE/IET Electronic Library url: https://ieeexplore.ieee.org/ sourceTypes: Publisher
DeliveryMethod	fulltext_linktorsrc
Discipline	Computer Science
EISBN	9780769541914 0769541917
EndPage	675
ExternalDocumentID	5616518
Genre	orig-research
GroupedDBID	6IE 6IL ACM ALMA_UNASSIGNED_HOLDINGS APO CBEJK GUFHI LHSKQ RIB RIC RIE RIL
ID	FETCH-LOGICAL-a162t-f3a9f99a53729e6c49998546f6a31c717784d26bc7a33505580c9636b985a2523
IEDL.DBID	RIE
ISBN	9781424484829 1424484820
IngestDate	Wed Sep 03 07:11:07 EDT 2025
IsPeerReviewed	false
IsScholarly	false
Language	English
LinkModel	DirectLink
MergedId	FETCHMERGED-LOGICAL-a162t-f3a9f99a53729e6c49998546f6a31c717784d26bc7a33505580c9636b985a2523
PageCount	8
ParticipantIDs	ieee_primary_5616518
PublicationCentury	2000
PublicationDate	2010-Aug.
PublicationDateYYYYMMDD	2010-08-01
PublicationDate_xml	– month: 08 year: 2010 text: 2010-Aug.
PublicationDecade	2010
PublicationTitle	2010 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology
PublicationTitleAbbrev	wi-iat
PublicationYear	2010
Publisher	IEEE
Publisher_xml	– name: IEEE
SSID	ssj0001120458 ssj0000452489
Score	1.5662067
SourceID	ieee
SourceType	Publisher
StartPage	668
SubjectTerms	Costs Crawlers ethicality Ethics Guidelines Privacy Regulation Robots robots.txt Search engines Vectors web crawler ethics
Title	The Ethicality of Web Crawlers
URI	https://ieeexplore.ieee.org/document/5616518
Volume	1
hasFullText	1
inHoldings	1
isFullTextHit
isPrint
linkProvider	IEEE