Network contention frequently dominates the run time of parallel algorithms and limits scaling performance. Most previous studies mitigate or eliminate contention by utilizing one of several approaches: communication-minimizing algorithms; hotspot-avoiding routing schemes; topology-aware task mapping; or improving global network properties, such as bisection bandwidth, edge-expansion, partitioning, and network diameter. In practice, parallel jobs often use only a fraction of a host system. How do processor allocation policies affect contention within a partition? We utilize edge-isoperimetric analysis of network graphs to determine whether a network partition defined by a processor allocation has optimal internal bisection. Increasing the bisection allows a more efficient use of the network resources, decreasing or completely eliminating the link contention. We study torus networks and characterize partition geometries that maximize internal bisection bandwidth, and examine the allocation policies of Mira and JUQUEEN, the two largest publicly accessible Blue Gene/Q torus-based supercomputers. Our analysis shows that the bisection bandwidth of their partitions can often be improved by changing the partitions' geometries, yielding up to a 2x speedup for contention-bound workloads. Benchmark experiments validate the predictions. Our analysis applies to allocation policies of other networks.
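As a rough illustration of why partition geometry matters, the following minimal sketch compares two partitions with the same node count. It is not the authors' method: it assumes each partition is itself a torus with wrap-around links in every dimension, and it uses the common estimate that a torus is bisected by cutting perpendicular to its longest dimension, which severs two planes of links. The function name `torus_bisection_links` and both geometries are hypothetical and chosen only for illustration.

```python
from math import prod


def torus_bisection_links(dims):
    """Estimate the number of links crossing a bisection of a torus partition.

    Assumptions (not taken from the paper): the partition has wrap-around
    links in every dimension, and its longest dimension is even and at
    least 4, so a cut perpendicular to it severs two planes of links.
    """
    longest = max(dims)
    if longest < 4 or longest % 2:
        raise ValueError("sketch assumes an even longest dimension >= 4")
    return 2 * prod(dims) // longest


# Two hypothetical 5-D geometries with the same number of nodes.
elongated = (4, 4, 4, 16, 2)
balanced = (4, 4, 8, 8, 2)

for name, dims in (("elongated", elongated), ("balanced", balanced)):
    print(name, dims, "nodes:", prod(dims), "bisection links:", torus_bisection_links(dims))
```

Under these assumptions, shortening the longest dimension while keeping the node count fixed raises the estimated bisection, which is the kind of geometry change the abstract refers to.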
|Original language||American English|
|Title of host publication||SPAA 2020 - Proceedings of the 32nd ACM Symposium on Parallelism in Algorithms and Architectures|
|Publisher||Association for Computing Machinery|
|Number of pages||3|
|State||Published - 6 Jul 2020|
|Event||32nd ACM Symposium on Parallelism in Algorithms and Architectures, SPAA 2020 - Virtual, Online, United States|
|Duration||15 Jul 2020 → 17 Jul 2020|
|Name||Annual ACM Symposium on Parallelism in Algorithms and Architectures|
|Conference||32nd ACM Symposium on Parallelism in Algorithms and Architectures, SPAA 2020|
|Period||15/07/20 → 17/07/20|
Bibliographical note
Funding Information:
Research is supported by grants 1878/14 and 1901/14 from the Israel Science Foundation (founded by the Israel Academy of Sciences and Humanities) and grant 3-10891 from the Ministry of Science and Technology, Israel. Research is also supported by the Einstein Foundation and the Minerva Foundation. This work was supported by the PetaCloud industry-academia consortium. This research was supported by a grant from the United States-Israel Binational Science Foundation (BSF), Jerusalem, Israel. This project has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement No 818252). This work was supported by the Federmann Cyber Security Center in conjunction with the Israel National Cyber Directorate.
The authors gratefully acknowledge the Gauss Centre for Supercomputing e.V. (www.gauss-centre.eu) for funding this project by providing computing time through the John von Neumann Institute for Computing (NIC) on the GCS Supercomputer JUQUEEN at Jülich Supercomputing Centre (JSC). This research used resources of the Argonne Leadership Computing Facility, which is a DOE Office of Science User Facility supported under Contract DE-AC02-06CH11357.
© 2020 Owner/Author.
Keywords
- high-performance computing
- network topologies
- parallel computing
- torus networks