Utilization, predictability, workloads, and user runtime estimates in scheduling the IBM SP2 with backfilling

Ahuva W. Mu'alem*, Dror G. Feitelson

*Corresponding author for this work

Research output: Contribution to journal, Article, peer-review

538 Scopus citations

Abstract

Scheduling jobs on the IBM SP2 system and many other distributed-memory MPPs is usually done by giving each job a partition of the machine for its exclusive use. Allocating such partitions in the order in which the jobs arrive (FCFS scheduling) is fair and predictable, but suffers from severe fragmentation, leading to low utilization. This situation led to the development of the EASY scheduler which uses aggressive backfilling: Small jobs are moved ahead to fill in holes in the schedule, provided they do not delay the first job in the queue. We compare this approach with a more conservative approach in which small jobs move ahead only if they do not delay any job in the queue and show that the relative performance of the two schemes depends on the workload: For workloads typical on SP2 systems, the aggressive approach is indeed better, but, for other workloads, both algorithms are similar. In addition, we study the sensitivity of backfilling to the accuracy of the runtime estimates provided by the users and find a very surprising result: Backfilling actually works better when users overestimate the runtime by a substantial factor.
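The aggressive (EASY) rule described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names `Job`, `easy_backfill`, `shadow`, and `extra` are hypothetical, and the sketch assumes a single scheduling decision at time `now`, with runtime estimates treated as exact upper bounds. Only the first blocked job receives a reservation. A conservative scheduler would instead hold a reservation for every queued job, which is not shown here.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    procs: int        # processors requested
    estimate: float   # user-supplied runtime estimate


def easy_backfill(now, free, running, queue):
    """Return names of queued jobs allowed to start at time `now`.

    running: list of (estimated_end_time, procs) for executing jobs.
    queue:   waiting jobs in arrival (FCFS) order.
    Illustrative sketch only; all names here are assumptions.
    """
    started = []
    running = sorted(running)
    queue = list(queue)

    # FCFS step: start jobs from the head of the queue while they fit.
    while queue and queue[0].procs <= free:
        job = queue.pop(0)
        free -= job.procs
        running.append((now + job.estimate, job.procs))
        started.append(job.name)
    if not queue:
        return started

    # Reservation for the blocked head job: find the "shadow time" at
    # which enough processors are expected to be free for it.
    head = queue[0]
    running.sort()
    avail, shadow = free, None
    for end, procs in running:
        avail += procs
        if avail >= head.procs:
            shadow = end
            break
    if shadow is None:      # head job can never fit; nothing to protect
        return started
    extra = avail - head.procs  # processors the head job will not need

    # Backfill step: a later job may start now if it fits and either
    # finishes by the shadow time or uses only the extra processors.
    for job in queue[1:]:
        if job.procs > free:
            continue
        ends_in_time = now + job.estimate <= shadow
        if ends_in_time or job.procs <= extra:
            free -= job.procs
            if not ends_in_time:
                extra -= job.procs
            started.append(job.name)
    return started
```

As a usage example under the same assumptions: with 4 free processors, one running job holding 6 processors until time 100, and a queue of A (8 processors), B (3 processors, estimate 80), C (2 processors, estimate 200), and D (1 processor, estimate 500), the sketch starts B and D. B finishes before A's reservation at time 100, and D fits within the 2 processors A leaves unused, so neither delays A.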

Original language: American English
Pages (from-to): 529-543
Number of pages: 15
Journal: IEEE Transactions on Parallel and Distributed Systems
Volume: 12
Issue number: 6
State: Published - Jun 2001

Bibliographical note

Funding Information:
This research was supported by the Ministry of Science and Technology and by the Israel Science Foundation founded by the Israel Academy of Sciences and Humanities. The workload log from the CTC SP2 was graciously provided by the Cornell Theory Center, a high-performance computing center at Cornell University, Ithaca, New York. The workload log from the KTH SP2 was graciously provided by Lars Malinowsky, who also helped with background information and interpretation. The workload log from the SDSC SP2 was graciously provided by Victor Hazlewood of the HPC Systems group of the San Diego Supercomputer Center (SDSC), which is the leading-edge site of the National Partnership for Advanced Computational Infrastructure (NPACI), and is available from the NPACI JOBLOG repository at http://joblog.npaci.edu. The code for the Jann workload model was graciously provided by Joefon Jann of IBM Research. This paper supersedes the

Keywords

  • Backfilling
  • Parallel job scheduling
  • Performance metrics
  • Runtime estimates
  • Workload modeling
