FlexStep: Enabling Flexible Error Detection in Multi/Many-core Real-time Systems

Bibliographic Details
Title: FlexStep: Enabling Flexible Error Detection in Multi/Many-core Real-time Systems
Authors: Wang, Tinglue; Li, Yiming; Tang, Wei; Guan, Jiapeng; Guo, Zhenghui; Jiang, Renshuang; Wei, Ran; Li, Jing; Jiang, Zhe
Source: 2025 62nd ACM/IEEE Design Automation Conference (DAC), pp. 1-7
Publication Status: Preprint
Publisher Information: IEEE, 2025.
Publication Year: 2025
Subject Terms: FOS: Computer and information sciences, Computer Science - Distributed, Parallel, and Cluster Computing, Hardware Architecture (cs.AR), Distributed, Parallel, and Cluster Computing (cs.DC), Computer Science - Hardware Architecture
Description: Reliability and real-time responsiveness in safety-critical systems have traditionally been achieved using error detection mechanisms, such as LockStep, which require pre-configured checker cores, strict synchronisation between main and checker cores, static error detection regions, or limited preemption capabilities. However, these core-bound hardware mechanisms often lead to significant resource over-provisioning and diminished real-time responsiveness, particularly in modern systems where tasks with varying reliability requirements are consolidated on shared processors to improve efficiency, reduce costs, and save power. To address these challenges, this work presents FlexStep, a systematic solution that integrates hardware and software across the SoC, ISA, and OS scheduling layers. FlexStep features a novel microarchitecture that supports dynamic core configuration and asynchronous, preemptive error detection. The FlexStep architecture naturally allows for flexible task scheduling and error detection, enabling new scheduling algorithms that enhance both resource efficiency and real-time schedulability. We publicly release FlexStep's source code at https://anonymous.4open.science/r/FlexStep-DAC25-7B0C.
Document Type: Article
DOI: 10.1109/dac63849.2025.11132561
DOI: 10.48550/arxiv.2503.13848
Access URL: http://arxiv.org/abs/2503.13848
Rights: STM Policy #29; arXiv Non-Exclusive Distribution
Accession Number: edsair.doi.dedup.....7fa79fc8c53b4dde414d917365864329
Database: OpenAIRE