Automatically Crawling Dynamic Web Applications via Proxy-Based JavaScript Injection and Runtime Analysis

Bibliographic Details
Published in:	2018 IEEE Third International Conference on Data Science in Cyberspace (DSC) pp. 242 - 249
Main Authors:	Yan Li, Peiyi Han, Chuanyi Liu, Binxing Fang
Format:	Conference Proceeding
Language:	English
Published:	IEEE 01.06.2018
Subjects:	Browsers; Cascading style sheets; Crawlers; dynamic websites; JavaScript analysis; proxy server; Reactive power; Runtime; Servers; User interfaces; web crawler

Description
Summary:	According to BrightPlanet statistics from 2012, the Deep Web contains 400-500 times more information than the Surface Web. Automatic data collection from the Deep Web has become a research hotspot in the field of web crawlers. While exploring automatic crawling of dynamic web applications, we identify two unavoidable situations that existing tools do not handle properly. First, some websites use CSS pseudo-classes to locate clickable elements, which prevents existing techniques from simulating user actions. Second, pop-up windows can be triggered during automated element clicking, which causes existing web crawlers to terminate. To address the CSS pseudo-class problem, we use a proxy server to analyze the JavaScript code of target websites both statically and dynamically. We also propose a checking mechanism to handle pop-up windows. Finally, we implement an automatic web crawler, AutoCrawler, with the aid of the proxy server. Evaluation results indicate that AutoCrawler not only covers more dynamic interactions but also obtains more valuable data from real-world web applications.
DOI:	10.1109/DSC.2018.00042
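
Illustrative note: the summary does not say how the proxy modifies pages, so the following is only a minimal sketch of what proxy-side JavaScript injection can look like in general, not AutoCrawler's implementation. It assumes the proxy already has the HTML response body as a string. The function name `inject_script`, the `HOOK_JS` snippet, and the `window.__clickTargets` variable are all hypothetical names introduced here for illustration.

```python
import re

# Hypothetical instrumentation snippet (an assumption, not taken from the paper):
# it wraps addEventListener so that every target registering a "click" handler
# is recorded in window.__clickTargets for later inspection at runtime.
HOOK_JS = """
(function () {
  window.__clickTargets = [];
  var original = EventTarget.prototype.addEventListener;
  EventTarget.prototype.addEventListener = function (type, listener, options) {
    if (type === "click") { window.__clickTargets.push(this); }
    return original.call(this, type, listener, options);
  };
})();
"""

# Matches an opening <head> tag (with optional attributes) but not <header>.
_HEAD_OPEN = re.compile(r"<head\b[^>]*>", re.IGNORECASE)


def inject_script(html: str, script: str = HOOK_JS) -> str:
    """Return html with an inline <script> inserted at the start of <head>."""
    tag = "<script>" + script + "</script>"
    match = _HEAD_OPEN.search(html)
    if match is None:
        # No <head> element found: prepend so the hook still runs first.
        return tag + html
    return html[: match.end()] + tag + html[match.end():]
```

In a real proxy, a function like this would presumably be applied only to responses whose content type is HTML, so that the hook is in place before the page's own scripts register their handlers.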