A fault tolerance framework in a concurrent programming environment

Abstract

As CMOS technology continues to scale, multi-core processors have become mainstream in both research and industry. However, system vulnerability is increasing because of tighter design margins and greater susceptibility to interference, both of which result from smaller feature sizes, lower supply voltages, higher frequencies, greater hardware complexity, and more transistors per processor. Meanwhile, the concurrent programming environment has emerged as a general designation for the norms of exploiting systems with multi-core processors; it is widely believed to be the main approach to obtaining scalable performance improvements from multi-core systems through parallelism exploitation and resource scheduling. In this dissertation, we explore the construction of a fault tolerance framework in a concurrent programming environment. As part of this work, we investigate the features of a concurrent programming environment. With this knowledge, we design a cross-layer, flexible, low-overhead fault tolerance framework that comprises fault detection and recovery, together with fault injection for its evaluation. The proposed framework targets the general paradigm of concurrent programming environments and is implemented and evaluated on a specific platform, i.e., the Microgrids.