Tractable near-optimal policies for crawling

Yossi Azar, Eric Horvitz, Eyal Lubetzky, Yuval Peres*, Dafna Shahaf

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

20 Scopus citations

Abstract

The problem of maintaining a local cache of n constantly changing pages arises in multiple mechanisms such as web crawlers and proxy servers. In these, the resources for polling pages for possible updates are typically limited. The goal is to devise a polling and fetching policy that maximizes the utility of served pages that are up to date. Cho and Garcia-Molina [(2003) ACM Trans Database Syst 28:390–426] formulated this as an optimization problem, which can be solved numerically for small values of n, but appears intractable in general. Here, we show that the optimal randomized policy can be found exactly in O(n log n) operations. Moreover, using the optimal probabilities to define in linear time a deterministic schedule yields a tractable policy that in experiments attains 99% of the optimum.

Original language: English
Pages (from-to): 8099-8103
Number of pages: 5
Journal: Proceedings of the National Academy of Sciences of the United States of America
Volume: 115
Issue number: 32
State: Published - 7 Aug 2018

Bibliographical note

Publisher Copyright:
© 2018 National Academy of Sciences. All rights reserved.

Keywords

  • Caching policies
  • Scheduling optimization
  • Web crawling
