Abstract
The most commonly used scheduling algorithm for parallel supercomputers is FCFS with backfilling, as originally introduced in the EASY scheduler. Backfilling means that short jobs are allowed to run ahead of their time provided they do not delay previously queued jobs (or at least the first queued job). To make such determinations possible, users are required to provide estimates of how long jobs will run, and jobs that violate these estimates are killed. Empirical studies have repeatedly shown that user estimates are inaccurate, and that system-generated predictions based on history may be significantly better. However, predictions have not been incorporated into production schedulers, partially due to a misconception (that we resolve) claiming inaccuracy actually improves performance, but mainly because underprediction is technically unacceptable: Users will not tolerate jobs being killed just because system predictions were too short. We solve this problem by divorcing kill-time from the runtime prediction and correcting predictions adaptively as needed if they are proved wrong. The end result is a surprisingly simple scheduler, which requires minimal deviations from current practices (e.g., using FCFS as the basis) and behaves exactly like EASY as far as users are concerned; nevertheless, it achieves significant improvements in performance, predictability, and accuracy. Notably, this is based on a very simple runtime predictor that just averages the runtimes of the last two jobs by the same user; counterintuitively, our results indicate that using recent data is more important than mining the history for similar jobs. All the techniques suggested in this paper can be used to enhance any backfilling algorithm and are not limited to EASY.
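The two mechanisms named in the abstract, a predictor that averages the same user's most recent runtimes and a prediction that is corrected once a job outlives it while the kill-time stays tied to the user estimate, can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the class and function names, the fallback to the user estimate when no history exists, the capping of predictions at the user estimate, and the fixed 60-second correction increment are all assumptions made for the example.

```python
from collections import defaultdict, deque


class RecentAveragePredictor:
    """Hypothetical predictor: average of a user's most recent job runtimes."""

    def __init__(self, window=2):
        # Keep only the last `window` runtimes per user (two, per the abstract).
        self._history = defaultdict(lambda: deque(maxlen=window))

    def predict(self, user, user_estimate):
        past = self._history[user]
        if not past:
            # Assumption: with no history, fall back to the user's own estimate.
            return user_estimate
        prediction = sum(past) / len(past)
        # Assumption: the job is killed at the user estimate, so a prediction
        # longer than the estimate is capped at it.
        return min(prediction, user_estimate)

    def record(self, user, runtime):
        """Store the actual runtime of a job that has just terminated."""
        self._history[user].append(runtime)


def corrected_prediction(elapsed, user_estimate, increment=60.0):
    """Hypothetical correction for a job that has outlived its prediction.

    The prediction is pushed forward by a fixed increment (an assumed policy),
    but never beyond the user estimate, which remains the kill-time.
    """
    return min(elapsed + increment, user_estimate)


predictor = RecentAveragePredictor()
predictor.record("alice", 100.0)
predictor.record("alice", 300.0)
print(predictor.predict("alice", user_estimate=3600.0))
print(corrected_prediction(elapsed=200.0, user_estimate=3600.0))
```

In this sketch the scheduler would use `predict` for backfilling decisions only; the job is never killed before `user_estimate`, so from the user's point of view the behavior matches EASY.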
Original language | English |
---|---|
Pages (from-to) | 789-803 |
Number of pages | 15 |
Journal | IEEE Transactions on Parallel and Distributed Systems |
Volume | 18 |
Issue number | 6 |
State | Published - Jun 2007 |
Bibliographical note
Funding Information: This research was supported in part by the Israel Science Foundation (grant no. 167/03). The authors would like to thank the people and organizations who deposited their workload logs in the Parallel Workloads Archive and made this research possible.
Keywords
- Backfilling
- Dynamic prediction correction
- EASY
- EASY++
- History-based predictions
- Parallel job scheduling
- Performance metrics
- Runtime estimates
- SJBF
- System-generated predictions