文档详情

计算机外文翻译---基于网络爬虫的有效URL缓存.docx

发布:2019-01-12约4.19万字共25页下载文档
文本预览下载声明
外文资料原文 Efficient URL Caching for World Wide Web Crawling Andrei Z. Broder IBM TJ Watson Research Center 19 Skyline Dr Hawthorne, NY 10532 abroder@ Marc Najork Microsoft Research 1065 La Avenida Mountain View, CA 94043 najork@ Janet L. Wiener Hewlett Packard Labs 1501 Page Mill Road Palo Alto, CA 94304 janet.wiener@ ABSTRACT Crawling the web is deceptively simple: the basic algorithm is (a)Fetch a page (b) Parse it to extract all linked URLs (c) For all the URLs not seen before, repeat (a)–(c). However, the size of the web (estimated at over 4 billion pages) and its rate of change (estimated at 7% per week) move this plan from a trivial programming exercise to a serious algorithmic and system design challenge. Indeed, these two factors alone imply that for a reasonably fresh and complete crawl of the web, step (a) must be executed about a thousand times per second, and thus the membership test (c) must be done well over ten thousand times per second against a set too large to store in main memory. This requires a distributed architecture, which further complicates the membership test. A crucial way to speed up the test is to cache, that is, to store in main memory a (dynamic) subset of the “seen” URLs. The main goal of this paper is to carefully investigate several URL caching techniques for web crawling. We consider both practical algorithms: random replacement, static cache, LRU, and CLOCK, and theoretical limits: clairvoyant caching and infinite cache. We performed about 1,800 simulations using these algorithms with various cache sizes, using actual log data extracted from a massive 33 day web crawl that issued over one billion HTTP requests. Our main conclusion is that caching is very effective – in our setup, a cache of roughly 50,000 entries can achieve a hit rate of almost 80%. Interestingly, this cache size falls at a critical point: a substantially smaller cache is much less effective while a substantially larger cache brings little additional benefit. We
显示全部
相似文档