Analyzing and Increasing the Reliability of Convolutional Neural Networks on GPUs


Bibliographic details
Published in: IEEE Transactions on Reliability, Vol. 68, No. 2, pp. 663-677
Main authors: Santos, Fernando Fernandes dos; Pimenta, Pedro Foletto; Lunardi, Caio; Draghetti, Lucas; Carro, Luigi; Kaeli, David; Rech, Paolo
Format: Journal Article
Language: English
Publication details: New York, IEEE, 1 June 2019
The Institute of Electrical and Electronics Engineers, Inc. (IEEE)
ISSN: 0018-9529, 1558-1721
Description
Summary: Graphics processing units (GPUs) are playing a critical role in convolutional neural networks (CNNs) for image detection. As GPU-enabled CNNs move into safety-critical environments, reliability is becoming a growing concern. In this paper, we evaluate and propose strategies to improve the reliability of object detection algorithms, as run on three NVIDIA GPU architectures. We consider three algorithms: 1) You Only Look Once (YOLO); 2) a faster region-based CNN (Faster R-CNN); and 3) a residual network, exposing live hardware to neutron beams. We complement our beam experiments with fault injection to better characterize fault propagation in CNNs. We show that a single fault occurring in a GPU tends to propagate to multiple active threads, significantly reducing the reliability of a CNN. Moreover, relying on error correcting codes dramatically reduces the number of silent data corruptions (SDCs), but does not reduce the number of critical errors (i.e., errors that could potentially impact safety-critical applications). Based on observations on how faults propagate on GPU architectures, we propose effective strategies to improve CNN reliability. We also consider the benefits of using an algorithm-based fault-tolerance technique for matrix multiplication, which can correct more than 87% of the critical SDCs in a CNN, while redesigning maxpool layers of the CNN to detect up to 98% of critical SDCs.
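The algorithm-based fault-tolerance idea for matrix multiplication mentioned in the summary can be illustrated with the classic row/column checksum scheme. What follows is a minimal sketch in plain Python under stated assumptions, not the authors' GPU implementation: the names `matmul` and `abft_matmul`, the `inject` parameter (a stand-in for a hardware fault), and the tolerance `tol` are all hypothetical and chosen only for illustration.

```python
def matmul(a, b):
    """Plain matrix product of two lists of lists."""
    n, k, m = len(a), len(b), len(b[0])
    return [[sum(a[i][p] * b[p][j] for p in range(k)) for j in range(m)]
            for i in range(n)]


def abft_matmul(a, b, inject=None, tol=1e-9):
    """Multiply a by b and use checksums to detect/correct one bad element.

    inject: optional (row, col, delta) that corrupts one output element;
            a hypothetical stand-in for a fault, used only for demonstration.
    Returns (c, status) with status "ok", "corrected", or "uncorrectable".
    """
    n, k, m = len(a), len(b), len(b[0])

    # Checksums derived from the inputs only.
    col_sum_a = [sum(a[i][p] for i in range(n)) for p in range(k)]
    row_sum_b = [sum(b[p][j] for j in range(m)) for p in range(k)]
    # Expected column sums of C: (column sums of A) times B.
    expected_col = [sum(col_sum_a[p] * b[p][j] for p in range(k))
                    for j in range(m)]
    # Expected row sums of C: A times (row sums of B).
    expected_row = [sum(a[i][p] * row_sum_b[p] for p in range(k))
                    for i in range(n)]

    c = matmul(a, b)
    if inject is not None:
        r, col, delta = inject
        c[r][col] += delta  # simulate a corrupted output element

    # Compare the actual row/column sums of C against the expected ones.
    bad_rows = [i for i in range(n)
                if abs(sum(c[i]) - expected_row[i]) > tol]
    bad_cols = [j for j in range(m)
                if abs(sum(c[i][j] for i in range(n)) - expected_col[j]) > tol]

    if not bad_rows and not bad_cols:
        return c, "ok"
    if len(bad_rows) == 1 and len(bad_cols) == 1:
        # One mismatching row and one mismatching column locate the error;
        # the row-sum difference gives the amount to subtract.
        i, j = bad_rows[0], bad_cols[0]
        c[i][j] -= sum(c[i]) - expected_row[i]
        return c, "corrected"
    return c, "uncorrectable"


result, status = abft_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]],
                             inject=(0, 1, 100))
print(status, result)
```

In this sketch a single corrupted element shows up as exactly one mismatching row sum and one mismatching column sum, which is enough to locate and repair it; errors spread across several rows and columns are only detected. With floating-point data the tolerance would need to account for rounding.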
DOI: 10.1109/TR.2018.2878387