The ParPar system is a high-performance cluster environment supporting a multiuser parallel workload. Its design follows a master-nodes structure, where the master controls all aspects of system activity using a dedicated control network. As nearly all control messages are multicast to a set of nodes, we implemented a reliable multicast protocol for this network based on UDP. This protocol was then used to pre-load executable files to the nodes, rather than using demand paging via NFS. Such pre-loading leads to significant reductions in job startup times in most cases. It is also more scalable than the alternative of asymmetrical hardware that gives the master higher bandwidth, an approach that can be used for small clusters.
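The abstract does not spell out how the reliable multicast protocol works. As a rough illustration only, the sketch below assumes a positive-acknowledgment scheme: the master numbers each multicast message, records which nodes have not yet acknowledged it, and on a timeout retransmits only to those nodes. The class `ReliableMulticastSender`, its methods, and the retry limit are hypothetical names chosen for this example and are not taken from ParPar; the network layer (UDP sockets, timers) is deliberately left out so that only the bookkeeping is shown.

```python
from dataclasses import dataclass, field


@dataclass
class ReliableMulticastSender:
    """Ack bookkeeping for a master multicasting to a set of nodes.

    Hypothetical design for illustration; not the ParPar protocol itself.
    """
    nodes: frozenset
    max_retries: int = 5  # assumed limit, not from the source
    next_seq: int = 0
    pending: dict = field(default_factory=dict)  # seq -> nodes still to ack
    retries: dict = field(default_factory=dict)  # seq -> retransmissions so far

    def send(self, targets=None):
        """Register a new multicast message and return its sequence number."""
        seq = self.next_seq
        self.next_seq += 1
        self.pending[seq] = set(self.nodes if targets is None else targets)
        self.retries[seq] = 0
        return seq

    def on_ack(self, seq, node):
        """Record an acknowledgment; duplicates and unknown seqs are ignored."""
        waiting = self.pending.get(seq)
        if waiting is None:
            return
        waiting.discard(node)
        if not waiting:
            # Every target has acknowledged: the message is delivered.
            del self.pending[seq]
            del self.retries[seq]

    def on_timeout(self, seq):
        """Return the nodes that still need a retransmission of seq."""
        if seq not in self.pending:
            return set()
        if self.retries[seq] >= self.max_retries:
            raise TimeoutError(f"message {seq} not acknowledged by all nodes")
        self.retries[seq] += 1
        return set(self.pending[seq])

    def is_delivered(self, seq):
        """True once every target node has acknowledged message seq."""
        return 0 <= seq < self.next_seq and seq not in self.pending
```

In a real deployment the set returned by `on_timeout` would drive a UDP retransmission to just the lagging nodes, which keeps the master's outgoing traffic roughly independent of cluster size when losses are rare.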
Bibliographical note
Funding Information:
This research was supported in part by The Ministry of Science Basic Infrastructure Fund, Project 9762, and by The Israel Science Foundation founded by the Israel Academy of Sciences & Humanities. The LANL job log data was graciously provided by Curt Canada, who also helped with its interpretation. The LLNL job log data was graciously provided by Moe Jette, who also helped with background information and interpretation.