Automatically Crawling Dynamic Web Applications via Proxy-Based JavaScript Injection and Runtime Analysis

Bibliographic details
Published in: 2018 IEEE Third International Conference on Data Science in Cyberspace (DSC), pp. 242-249
Main authors: Yan Li, Peiyi Han, Chuanyi Liu, Binxing Fang
Format: Conference paper
Language: English
Published: IEEE, 1 June 2018
Description
Summary: According to BrightPlanet statistics from 2012, the Deep Web contains 400 to 500 times more information than the Surface Web. Automatic data collection from the Deep Web has become a research hotspot in the field of web crawlers. While exploring automatic crawlers for dynamic web applications, we identified two unavoidable situations that existing tools do not handle properly. First, some websites use CSS pseudo-classes to locate clickable elements, which prevents existing techniques from simulating user actions. Second, pop-up windows can be triggered during automated element clicking, which terminates existing crawler procedures. To address the CSS pseudo-class problem, we add a proxy server to our system that analyzes the JavaScript code of target websites both statically and dynamically. We also propose a checking mechanism to handle pop-up windows. Finally, we implement an automatic web crawler, AutoCrawler, with the aid of the proxy server. Evaluation results indicate that AutoCrawler not only covers more dynamic interactions but also obtains more valuable data from real-world web applications.
DOI: 10.1109/DSC.2018.00042
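
Note: The summary says a proxy server is used to analyze the target site's JavaScript dynamically, but this record does not say how. As a rough illustration only, one common way to do runtime analysis from a proxy is to insert an instrumentation script into each HTML response before the page's own scripts run. The sketch below is an assumption and not AutoCrawler's actual implementation: the names `HOOK_JS`, `inject_hook`, and `window.__clickTargets` are hypothetical, and the payload simply records elements that register click listeners.

```python
import re

# Hypothetical instrumentation payload (an assumption, not taken from the paper):
# it wraps addEventListener so every target that registers a click handler
# is recorded in window.__clickTargets for later inspection by a crawler.
HOOK_JS = """
(function () {
  var original = EventTarget.prototype.addEventListener;
  window.__clickTargets = [];
  EventTarget.prototype.addEventListener = function (type, listener, options) {
    if (type === 'click') { window.__clickTargets.push(this); }
    return original.call(this, type, listener, options);
  };
})();
"""

# Matches the opening <head> tag (with optional attributes), but not <header>.
_HEAD_OPEN = re.compile(r"<head\b[^>]*>", re.IGNORECASE)


def inject_hook(html: str, payload: str = HOOK_JS) -> str:
    """Return html with the payload inserted as the first script in <head>."""
    tag = "<script>" + payload + "</script>"
    match = _HEAD_OPEN.search(html)
    if match is None:
        # No <head> element found: prepend so the hook still runs first.
        # (A real proxy would want to keep any doctype at the very top.)
        return tag + html
    return html[:match.end()] + tag + html[match.end():]
```

A proxy would call `inject_hook` on the body of each HTML response before forwarding it to the browser. Placing the hook first in `<head>` matters because listeners registered before the wrapper is installed would not be recorded.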