FFMK: A fast and fault-tolerant microkernel-based system for exascale computing

Carsten Weinhold*, Adam Lackorzynski, Jan Bierbaum, Martin Küttler, Maksym Planeta, Hermann Härtig, Amnon Shiloh, Ely Levy, Tal Ben-Nun, Amnon Barak, Thomas Steinke, Thorsten Schütt, Jan Fajerski, Alexander Reinefeld, Matthias Lieber, Wolfgang E. Nagel

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

3 Scopus citations

Abstract

In this paper, we describe the hardware and application-inherent challenges that future exascale systems pose to high-performance computing (HPC) and propose a system architecture that addresses them. This architecture is based on proven building blocks and a few principles: (1) a fast light-weight kernel that is supported by a virtualized Linux for tasks that are not performance critical, (2) decentralized load and health management using fault-tolerant gossip-based information dissemination, (3) a maximally parallel checkpoint store for cheap checkpoint/restart in the presence of frequent component failures, and (4) a runtime that enables applications to interact with the underlying system platform through new interfaces. The paper discusses the vision behind FFMK and the current state of a prototype implementation of the system, which is based on a microkernel and an adapted MPI runtime.

Original language: English
Title of host publication: Software for Exascale Computing - SPPEXA 2013-2015
Editors: Wolfgang E. Nagel, Hans-Joachim Bungartz, Philipp Neumann
Publisher: Springer Verlag
Pages: 405-426
Number of pages: 22
ISBN (Print): 9783319405261
State: Published - 2016
Event: International Conference on Software for Exascale Computing, SPPEXA 2015 - Munich, Germany
Duration: 25 Jan 2016 to 27 Jan 2016

Publication series

Name: Lecture Notes in Computational Science and Engineering
Volume: 113
ISSN (Print): 1439-7358

Conference

Conference: International Conference on Software for Exascale Computing, SPPEXA 2015
Country/Territory: Germany
City: Munich
Period: 25/01/16 to 27/01/16

Bibliographical note

Publisher Copyright:
© Springer International Publishing Switzerland 2016.
