The Ethicality of Web Crawlers
| Published in: | 2010 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology, Vol. 1, pp. 668-675 |
|---|---|
| Main Authors: | Sun, Yang; Councill, Isaac G.; Giles, C. Lee |
| Format: | Conference Proceeding |
| Language: | English |
| Published: | IEEE, 01.08.2010 |
| ISBN: | 9781424484829, 1424484820 |
| Abstract | Search engines largely rely on web crawlers to collect information from the web. This has led to an enormous amount of web traffic generated by crawlers alone. To minimize negative aspects of this traffic on websites, the behaviors of crawlers may be regulated at an individual web server by implementing the Robots Exclusion Protocol in a file called "robots.txt". Although not an official standard, the Robots Exclusion Protocol has been adopted to a greater or lesser extent by nearly all commercial search engines and popular crawlers. As many website administrators and policy makers have come to rely on the informal contract set forth by the Robots Exclusion Protocol, the degree to which web crawlers respect robots.txt policies has become an important issue of computer ethics. In this research, we investigate and define rules to measure crawler ethics, referring to the extent to which web crawlers respect the regulations set forth in robots.txt configuration files. We test the behaviors of web crawlers in terms of ethics by deploying a crawler honeypot: a set of websites where each site is configured with a distinct regulation specification using the Robots Exclusion Protocol in order to capture specific behaviors of web crawlers. We propose a vector space model to represent crawler behavior and a set of models to measure the ethics of web crawlers based on their behaviors. The results show that ethicality scores vary significantly among crawlers. Most commercial web crawlers receive fairly low ethicality violation scores, which means that most of the crawlers' behaviors are ethical; however, many commercial crawlers still consistently violate or misinterpret certain robots.txt rules. |
|---|---|
| Author | Sun, Yang; Councill, Isaac G.; Giles, C. Lee |
| Author affiliations | Yang Sun (yang.sun@corp.aol.com), AOL Res., Mountain View, CA, USA; Isaac G. Councill (icouncill@gmail.com), Google Inc., New York, NY, USA; C. Lee Giles (giles@ist.psu.edu), Coll. of Inf. Sci. & Technol., Pennsylvania State Univ., University Park, PA, USA |
| DOI | 10.1109/WI-IAT.2010.316 |
| Discipline | Computer Science |
| EISBN | 9780769541914 0769541917 |
| SubjectTerms | Costs; Crawlers; ethicality; Ethics; Guidelines; Privacy; Regulation; Robots; robots.txt; Search engines; Vectors; web crawler ethics |
| URI | https://ieeexplore.ieee.org/document/5616518 |
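
The abstract describes measuring how far a crawler's requests respect robots.txt rules and representing that behavior as a vector. This record does not reproduce the paper's actual models, so the following is only a minimal illustrative sketch, not the authors' method. It assumes a hypothetical robots.txt policy, a hypothetical crawler name (`ExampleBot`), a made-up request log, and a simple weighted-share score chosen here for illustration. It uses Python's standard `urllib.robotparser` module to decide whether each request was permitted.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for one honeypot site (not taken from the paper).
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Disallow: /tmp/
"""


def violation_vector(robots_txt, user_agent, requested_paths):
    """Return a 0/1 list: 1 where a request hit a path the policy disallows."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [0 if parser.can_fetch(user_agent, path) else 1
            for path in requested_paths]


def violation_score(vector, weights=None):
    """Weighted share of violating requests (an assumed, illustrative measure).

    0.0 means every request complied; 1.0 means every request violated a rule.
    """
    if not vector:
        return 0.0
    if weights is None:
        weights = [1.0] * len(vector)
    return sum(v * w for v, w in zip(vector, weights)) / sum(weights)


if __name__ == "__main__":
    # Made-up access log for the hypothetical crawler "ExampleBot".
    log = ["/index.html", "/private/data.html", "/tmp/cache", "/about.html"]
    vec = violation_vector(ROBOTS_TXT, "ExampleBot", log)
    print(vec, violation_score(vec))
```

With the sample policy and log above, two of the four requests fall under a `Disallow` rule, so the sketch reports a score of 0.5. Passing per-request `weights` is one way to treat some violations as more serious than others; how the paper itself weights rules is not stated in this record.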

