Automatically Crawling Dynamic Web Applications via Proxy-Based JavaScript Injection and Runtime Analysis

Bibliographic details
Published in: 2018 IEEE Third International Conference on Data Science in Cyberspace (DSC), pp. 242-249
Main authors: Yan Li, Peiyi Han, Chuanyi Liu, Binxing Fang
Format: Conference proceedings
Language: English
Published: IEEE, 1 June 2018
Online access: Full text
Description
Abstract: According to BrightPlanet statistics from 2012, the Deep Web contains 400 to 500 times more information than the Surface Web. Automatic data collection from the Deep Web has become one of the research hotspots in the domain of web crawlers. During the exploration of automatic dynamic web crawlers, we discover two unavoidable situations that existing tools do not handle properly. One is that some websites use CSS pseudo-classes to locate clickable elements, which prevents existing technologies from simulating user actions. The other is that pop-up windows can be triggered during automated element clicking, which terminates existing web crawler procedures. To address the CSS pseudo-class problem, we adopt a proxy server in our system to analyze the JavaScript code of target websites both statically and dynamically. In addition, we propose a checking mechanism to handle pop-up windows. Finally, we implement an automatic web crawler, AutoCrawler, with the aid of the proxy server. Evaluation results indicate that AutoCrawler not only covers more dynamic interactions but also obtains more valuable data from real-world web applications.
DOI:10.1109/DSC.2018.00042