TY - CHAP
T1 - FFMK
T2 - A fast and fault-tolerant microkernel-based system for exascale computing
AU - Weinhold, Carsten
AU - Lackorzynski, Adam
AU - Bierbaum, Jan
AU - Küttler, Martin
AU - Planeta, Maksym
AU - Weisbach, Hannes
AU - Hille, Matthias
AU - Härtig, Hermann
AU - Margolin, Alexander
AU - Sharf, Dror
AU - Levy, Ely
AU - Gak, Pavel
AU - Barak, Amnon
AU - Gholami, Masoud
AU - Schintke, Florian
AU - Schütt, Thorsten
AU - Reinefeld, Alexander
AU - Lieber, Matthias
AU - Nagel, Wolfgang E.
N1 - Publisher Copyright: © The Author(s) 2020.
PY - 2020
Y1 - 2020
N2 - The FFMK project designs, builds, and evaluates a system-software architecture to address the challenges expected in exascale systems. In particular, these challenges include performance losses caused by the much larger impact of runtime variability within applications, hardware, and the operating system (OS), as well as increased vulnerability to failures. The FFMK OS platform is built upon a multi-kernel architecture, which combines the L4Re microkernel and a virtualized Linux kernel into a noise-free, yet feature-rich execution environment. It further includes global, distributed platform management and system-level optimization services that transparently minimize checkpoint/restart overhead for applications. The project also researched algorithms to make collective operations fault-tolerant in the presence of failing nodes. In this paper, we describe the basic components, algorithms, and services we developed in Phase 2 of the project.
UR - http://www.scopus.com/inward/record.url?scp=85089615234&partnerID=8YFLogxK
U2 - 10.1007/978-3-030-47956-5_16
DO - 10.1007/978-3-030-47956-5_16
M3 - Chapter
AN - SCOPUS:85089615234
T3 - Lecture Notes in Computational Science and Engineering
SP - 483
EP - 516
BT - Lecture Notes in Computational Science and Engineering
PB - Springer
ER -