A Differential Testing Framework to Identify Critical AV Failures Leveraging Arbitrary Inputs

Bibliographic Details
Published in: Proceedings / International Conference on Software Engineering, pp. 360-372
Main Authors: Woodlief, Trey; Hildebrandt, Carl; Elbaum, Sebastian
Format: Conference Proceeding
Language: English
Published: IEEE, 26.04.2025
ISSN: 1558-1225
Description
Summary: The proliferation of autonomous vehicles (AVs) has made their failures increasingly evident. Testing efforts aimed at identifying the inputs leading to those failures are challenged by the input's long-tail distribution, whose area under the curve is dominated by rare scenarios. We hypothesize that leveraging emerging open-access datasets can accelerate the exploration of long-tail inputs. Having access to diverse inputs, however, is not sufficient to expose failures; an effective test also requires an oracle to distinguish between correct and incorrect behaviors. Current datasets lack such oracles, and developing them is notoriously difficult. In response, we propose DiffTest4AV, a differential testing framework designed to address the unique challenges of testing AV systems: 1) for any given input, many outputs may be considered acceptable; 2) the long tail contains an insurmountable number of inputs to explore; and 3) the AV's continuous execution loop requires failures to persist in order to affect the system. DiffTest4AV integrates statistical analysis to identify meaningful behavioral variations, judges their importance by the severity of those differences, and incorporates sequential analysis to detect persistent errors indicative of potential system-level failures. Our study on 5 versions of the commercially available, road-deployed comma.ai OpenPilot system, using 3 available image datasets, demonstrates the framework's ability to detect high-severity, high-confidence, long-running test failures.
DOI: 10.1109/ICSE55347.2025.00163
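
The abstract describes three mechanisms: comparing the outputs of multiple system versions in place of a hand-built oracle, judging importance by the severity of the disagreement, and applying a sequential check so that only persistent disagreements count as failures. The sketch below is a minimal, hypothetical illustration of that loop in Python. It is not DiffTest4AV's actual implementation: the names (`differential_test`, `severity`), the thresholds, and the stand-in predictors are all assumptions, and the paper's statistical analysis is reduced here to a simple spread-and-run-length check.

```python
import random
from typing import Callable, Sequence

# One predictor per system version: maps a camera frame to a steering angle.
Predictor = Callable[[object], float]

def severity(outputs: Sequence[float]) -> float:
    """Spread of the versions' outputs for one frame; a larger spread
    means the implementations disagree more strongly."""
    return max(outputs) - min(outputs)

def differential_test(
    versions: Sequence[Predictor],
    frames: Sequence[object],
    disagreement_threshold: float = 2.0,  # assumed tolerance, in degrees
    persistence: int = 5,                 # consecutive frames required to flag
) -> list[tuple[int, float]]:
    """Flag frame indices where the versions disagree persistently."""
    run = 0
    failures: list[tuple[int, float]] = []
    for i, frame in enumerate(frames):
        # Differential step: no ground-truth oracle, just cross-version comparison.
        outputs = [predict(frame) for predict in versions]
        s = severity(outputs)
        if s > disagreement_threshold:
            run += 1
            # Sequential step: only a disagreement that persists long enough
            # to affect the continuous control loop counts as a failure.
            if run >= persistence:
                failures.append((i, s))
        else:
            run = 0
    return failures

# Toy usage with stand-in predictors; real use would wrap each OpenPilot
# version's steering model and feed it dataset images.
if __name__ == "__main__":
    stand_ins = [lambda f, b=bias: b + random.gauss(0, 0.5) for bias in (0.0, 0.1, 3.0)]
    print(differential_test(stand_ins, frames=range(100)))
```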