Analyzing and Increasing the Reliability of Convolutional Neural Networks on GPUs

Bibliographic details
Published in:	IEEE Transactions on Reliability, Vol. 68, No. 2, pp. 663-677
Main authors:	Santos, Fernando Fernandes dos; Pimenta, Pedro Foletto; Lunardi, Caio; Draghetti, Lucas; Carro, Luigi; Kaeli, David; Rech, Paolo
Format:	Journal Article
Language:	English
Publication details:	New York: IEEE, 1 June 2019; The Institute of Electrical and Electronics Engineers, Inc. (IEEE)
Subjects:	Algorithm-based fault tolerance (ABFT); Algorithms; Artificial neural networks; Convolutional neural networks (CNNs); Embedded systems; Error correcting codes; Error correction; Error correction codes; Fault tolerance; Fault tolerant systems; Graphics processing units; Hardware; Image detection; Multiplication; Network reliability; Neural networks; Neutron beams; Object recognition; Reliability; Reliability analysis; Safety critical; Soft errors
ISSN:	0018-9529, 1558-1721

Description
Summary:	Graphics processing units (GPUs) are playing a critical role in convolutional neural networks (CNNs) for image detection. As GPU-enabled CNNs move into safety-critical environments, reliability is becoming a growing concern. In this paper, we evaluate and propose strategies to improve the reliability of object detection algorithms, as run on three NVIDIA GPU architectures. We consider three algorithms: 1) you only look once; 2) a faster region-based CNN (Faster R-CNN); and 3) a residual network, exposing live hardware to neutron beams. We complement our beam experiments with fault injection to better characterize fault propagation in CNNs. We show that a single fault occurring in a GPU tends to propagate to multiple active threads, significantly reducing the reliability of a CNN. Moreover, relying on error correcting codes dramatically reduces the number of silent data corruptions (SDCs), but does not reduce the number of critical errors (i.e., errors that could potentially impact safety-critical applications). Based on observations on how faults propagate on GPU architectures, we propose effective strategies to improve CNN reliability. We also consider the benefits of using an algorithm-based fault-tolerance technique for matrix multiplication, which can correct more than 87% of the critical SDCs in a CNN, while redesigning maxpool layers of the CNN to detect up to 98% of critical SDCs.
DOI:	10.1109/TR.2018.2878387
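
Illustrative note: the summary mentions an algorithm-based fault-tolerance (ABFT) technique for matrix multiplication. The sketch below shows the general idea behind classic checksum-based ABFT, assuming a single corrupted output element; it is not the authors' GPU implementation, and the function names (`abft_matmul`, `check_and_correct`) and the `corrupt` fault-injection hook are hypothetical, introduced here only for illustration.

```python
def matmul(a, b):
    """Plain matrix product for lists of lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def abft_matmul(a, b, corrupt=None):
    """Multiply a and b with checksums appended to the operands.

    corrupt is a hypothetical (row, col, delta) hook that emulates a
    single faulty result element; it is not part of any real library.
    """
    # Append a column-checksum row to A and a row-checksum column to B.
    a_c = a + [[sum(col) for col in zip(*a)]]
    b_r = [row + [sum(row)] for row in b]
    c_full = matmul(a_c, b_r)  # (m+1) x (n+1) full-checksum product
    if corrupt is not None:
        i, j, delta = corrupt
        c_full[i][j] += delta
    return c_full


def check_and_correct(c_full, tol=1e-9):
    """Verify checksums; fix one corrupted data element if it can be located.

    Returns (data, status) with status in {"clean", "corrected", "detected"}.
    """
    m, n = len(c_full) - 1, len(c_full[0]) - 1
    # Residual = stored checksum minus the sum recomputed from the data.
    row_res = [c_full[i][n] - sum(c_full[i][:n]) for i in range(m)]
    col_res = [c_full[m][j] - sum(c_full[i][j] for i in range(m))
               for j in range(n)]
    bad_rows = [i for i, r in enumerate(row_res) if abs(r) > tol]
    bad_cols = [j for j, r in enumerate(col_res) if abs(r) > tol]
    data = [row[:n] for row in c_full[:m]]
    if not bad_rows and not bad_cols:
        return data, "clean"
    if len(bad_rows) == 1 and len(bad_cols) == 1:
        # One failing row and one failing column pinpoint the element.
        i, j = bad_rows[0], bad_cols[0]
        data[i][j] += row_res[i]
        return data, "corrected"
    # Any other pattern is reported but left uncorrected.
    return data, "detected"
```

A single failing row checksum together with a single failing column checksum identifies the corrupted element, and the residual gives the correction. Faults that spread to several elements, as the summary reports for GPUs, may only be detectable under this simple scheme.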