Backfilling using system-generated predictions rather than user runtime estimates

Dan Tsafrir*, Yoav Etsion, Dror G. Feitelson

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review


Abstract

The most commonly used scheduling algorithm for parallel supercomputers is FCFS with backfilling, as originally introduced in the EASY scheduler. Backfilling means that short jobs are allowed to run ahead of their time provided they do not delay previously queued jobs (or at least the first queued job). To make such determinations possible, users are required to provide estimates of how long jobs will run, and jobs that violate these estimates are killed. Empirical studies have repeatedly shown that user estimates are inaccurate, and that system-generated predictions based on history may be significantly better. However, predictions have not been incorporated into production schedulers, partially due to a misconception (that we resolve) claiming inaccuracy actually improves performance, but mainly because underprediction is technically unacceptable: Users will not tolerate jobs being killed just because system predictions were too short. We solve this problem by divorcing kill-time from the runtime prediction and correcting predictions adaptively as needed if they are proved wrong. The end result is a surprisingly simple scheduler, which requires minimal deviations from current practices (e.g., using FCFS as the basis) and behaves exactly like EASY as far as users are concerned; nevertheless, it achieves significant improvements in performance, predictability, and accuracy. Notably, this is based on a very simple runtime predictor that just averages the runtimes of the last two jobs by the same user; counterintuitively, our results indicate that using recent data is more important than mining the history for similar jobs. All the techniques suggested in this paper can be used to enhance any backfilling algorithm and are not limited to EASY.
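
The two mechanisms the abstract names, a predictor that averages the same user's last two runtimes and a prediction that is corrected upward when a job outlives it while the kill-time stays at the user estimate, can be illustrated with a minimal sketch. This is not the authors' implementation: the names `RecentAveragePredictor` and `correct_prediction`, the fallback to the user estimate when no history exists, the cap at the user estimate, and the fixed 60-second `bump` are all assumptions made for illustration.

```python
from collections import defaultdict, deque


class RecentAveragePredictor:
    """Predict a runtime as the mean of the same user's most recent runtimes.

    Hypothetical helper for illustration; window=2 mirrors the
    "last two jobs by the same user" rule described in the abstract.
    """

    def __init__(self, window=2):
        # Per-user history that keeps only the `window` most recent runtimes.
        self.history = defaultdict(lambda: deque(maxlen=window))

    def predict(self, user, user_estimate):
        past = self.history[user]
        if not past:
            # Assumption: with no history, fall back to the user estimate.
            return user_estimate
        # Assumption: never predict beyond the user estimate, because the
        # job is killed at that point anyway.
        return min(sum(past) / len(past), user_estimate)

    def record(self, user, runtime):
        # Store the actual runtime once the job has finished.
        self.history[user].append(runtime)


def correct_prediction(prediction, elapsed, user_estimate, bump=60.0):
    """Raise a prediction that has proved too short.

    The kill-time is still the user estimate, so the job keeps running;
    only the scheduler's planning value changes. The fixed `bump` is an
    assumed policy, not one taken from the paper.
    """
    if elapsed < prediction:
        return prediction  # prediction still holds, nothing to correct
    return min(elapsed + bump, user_estimate)


predictor = RecentAveragePredictor()
predictor.record("alice", 100.0)
predictor.record("alice", 300.0)
planned = predictor.predict("alice", user_estimate=3600.0)  # mean of 100 and 300
planned = correct_prediction(planned, elapsed=250.0, user_estimate=3600.0)
```

In this sketch the scheduler would use `planned` for backfilling decisions only; the job itself is killed only when it reaches `user_estimate`, which is what keeps the behavior unchanged from the user's point of view.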

Original language: American English
Pages (from-to): 789-803
Number of pages: 15
Journal: IEEE Transactions on Parallel and Distributed Systems
Volume: 18
Issue number: 6
State: Published - Jun 2007

Bibliographical note

Funding Information:
This research was supported in part by the Israel Science Foundation (grant no. 167/03). The authors would like to thank the people and organizations who deposited their workload logs in the Parallel Workloads Archive and made this research possible.

Keywords

  • Backfilling
  • Dynamic prediction correction
  • EASY
  • EASY++
  • History-based predictions
  • Parallel job scheduling
  • Performance metrics
  • Runtime estimates
  • SJBF
  • System-generated predictions
