Dynamic reliability management for near-threshold dark silicon processors
| Published in: | Digest of technical papers - IEEE/ACM International Conference on Computer-Aided Design, pp. 1-7 |
|---|---|
| Format: | Conference Proceeding |
| Language: | English |
| Published: | ACM, 01.11.2016 |
| ISSN: | 1558-2434 |
| Summary: | In this article, we propose a new dynamic reliability management (DRM) technique at the system level for emerging low-power dark silicon manycore microprocessors operating in the near-threshold region. We mainly consider electromigration (EM) failures. To leverage EM recovery effects, which were ignored in the past, at the system level, we propose a new equivalent DC current model that accounts for recovery effects in general time-varying current waveforms, so that existing compact EM models can be applied. The new equivalent DC current is calculated in two steps: first, an equivalent square waveform is calculated so that the peak and terminal stresses are matched; second, the parameterized equivalent DC current is derived in terms of the parameters of the periodic fitted square waveform from the first step. The new recovery-aware EM model allows EM-induced lifetime to be better managed at the system level. The system-level energy optimization problem, which considers EM lifetime subject to power and performance constraints, is framed as a search for the best voltage and on/off status of the dark silicon cores. The resulting problem is solved by the State-Action-Reward-State-Action (SARSA) reinforcement learning algorithm. Experimental results on a 64-core near-threshold dark silicon processor show that the new equivalent EM DC currents fully capture the recovery effects at the system level, so that the trade-off between EM lifetime and energy/performance can be made easily. We further show that the proposed learning-based energy optimization can effectively manage and optimize energy subject to reliability, a given power budget, and performance limits. When the recovery effects are considered, the new optimization method can achieve 8.6× longer lifetime at the cost of 2.0× more energy and 3.3× more performance degradation. |
| DOI: | 10.1145/2966986.2980080 |
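
The summary names SARSA as the solver for the energy optimization problem. As a rough illustration only, the generic on-policy SARSA update, Q(s,a) ← Q(s,a) + α[r + γQ(s',a') − Q(s,a)], can be sketched as below. This is the textbook rule, not the authors' implementation: the function names, the dictionary-based Q-table, the default `alpha` and `gamma` values, and the example state and action labels are all assumptions made for this sketch, and the record does not describe how the paper encodes states, actions, or rewards.

```python
import random


def epsilon_greedy(q, state, actions, epsilon, rng):
    """Pick a random action with probability epsilon, else the best-valued one.

    `q` maps (state, action) pairs to estimated values; unseen pairs count as 0.
    """
    if rng.random() < epsilon:
        return rng.choice(actions)
    return max(actions, key=lambda a: q.get((state, a), 0.0))


def sarsa_update(q, state, action, reward, next_state, next_action,
                 alpha=0.1, gamma=0.9):
    """Apply one SARSA step: Q(s,a) += alpha * (r + gamma * Q(s',a') - Q(s,a)).

    The target uses the action actually chosen in the next state, which is
    what makes the update on-policy.
    """
    current = q.get((state, action), 0.0)
    target = reward + gamma * q.get((next_state, next_action), 0.0)
    q[(state, action)] = current + alpha * (target - current)
    return q[(state, action)]


if __name__ == "__main__":
    # Hypothetical labels: an action here stands for one voltage/on-off setting.
    rng = random.Random(0)
    q = {}
    actions = ["low_voltage", "high_voltage", "core_off"]
    a = epsilon_greedy(q, "s0", actions, epsilon=0.1, rng=rng)
    a_next = epsilon_greedy(q, "s1", actions, epsilon=0.1, rng=rng)
    sarsa_update(q, "s0", a, reward=1.0, next_state="s1", next_action=a_next)
    print(q)
```

In a setting like the one the summary describes, the reward would presumably combine energy, lifetime, and performance terms, but that design is not given in this record and is left out of the sketch.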