1. Introduction
Autonomous systems use a suite of algorithms to understand the environment in which they are deployed and to make independent decisions. These algorithms typically solve one or more classic problems, such as classification and prediction. Artificial neural networks (ANNs) are one such class of algorithms, and they have shown great promise owing to their ability to learn complicated patterns underlying high-dimensional data. However, the decision boundary approximated by such networks is highly non-linear and difficult to interpret, which is particularly problematic where these decisions can compromise the safety of the system itself or of people. Furthermore, the choice of data used to prepare and test the network can have a dramatic impact on performance and, in consequence, on safety.
Verification and validation (V&V) are vital parts of the development and deployment of any engineering system. V&V processes are well established in more mature sectors of engineering such as aerospace and automotive. However, they are not as well developed in areas such as autonomy and machine learning (ML), and the broader field of artificial intelligence (AI). As ML technologies become more widely adopted, it is increasingly important that they behave as expected and interact safely with people. Our focus is on the verification of ANNs used for image classification in safety-critical systems.
Systems are verified with respect to specified requirements. One such requirement for a classifier might state a necessary level of classification performance, and this requirement can be verified by dynamic testing. However, such a requirement might not specify any properties of the test dataset. If a test dataset poses only a modest classification challenge to a network, then a high level of classification performance does not mean that the network will perform well in operation. An additional condition therefore needs to be specified, i.e. the properties of the test dataset used to evaluate classification performance. For example, the test dataset might be characterized in terms of its relation to the dataset used to train the classifier, its noise content, or the intrinsic separability of its component classes. System requirements addressing discriminative capability could then state the permitted form of a function mapping test dataset properties to classifier performance. If these requirements are specified and verified, we can have a degree of confidence that the classifier will perform at a certain level in operation when applied to input instances of a certain type.
This paper introduces a measure, and its variants, that can be used to quantify the dissimilarity between a test dataset and a training dataset; this dissimilarity will henceforth be termed ‘dataset dissimilarity’. Classifier performance for a particular test dataset might itself be measured in terms of accuracy, for example. If so, classifier accuracy can then be given as a function of this dataset dissimilarity measure, i.e. each test dataset is assigned a dataset dissimilarity value, and this quantity maps to an accuracy value. This in turn allows system-level requirements to be formulated in terms of the required relationship between performance and the test dataset dissimilarity measure. If such a requirement is verified, evidence has been gathered that a classifier will perform at a certain level when applied to test datasets of a given dissimilarity; there will then be greater confidence that the classifier will generalise as required to data which is dissimilar to the training dataset.
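The relationship described above can be illustrated with a minimal sketch. The dissimilarity scores and accuracies below are entirely hypothetical, as is the specific requirement form (a bound on the slope of a fitted linear model); the paper itself does not prescribe these values, and in practice the permitted functional form would be set by the system requirements.

```python
import numpy as np

# Hypothetical measurements: each of five test datasets has a
# dissimilarity score (relative to the training set) and an
# observed classification accuracy for the trained network.
dissimilarity = np.array([0.1, 0.3, 0.5, 0.7, 0.9])
accuracy = np.array([0.97, 0.94, 0.90, 0.84, 0.78])

# Fit a simple linear model: accuracy ~ a + b * dissimilarity.
# np.polyfit returns coefficients highest-degree first.
b, a = np.polyfit(dissimilarity, accuracy, 1)

# A system-level requirement might then bound how quickly accuracy
# is permitted to degrade with dissimilarity, e.g. by at most 0.3
# accuracy per unit of dissimilarity (an illustrative threshold).
requirement_met = abs(b) <= 0.3
```

Verifying such a requirement over a range of test datasets, rather than a single accuracy figure, is what licenses the confidence claim about generalisation.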
The contributions of the study reported in this paper are as follows. Firstly, we introduce a novel measure which gauges the dissimilarity between a test dataset and a training dataset. This measure adopts and extends some of the concepts on testing criteria reported in DeepGauge (Ma et al., 2018). Secondly, we demonstrate that the measure can be used to determine the relationship between test dataset dissimilarity and classifier performance. Thirdly, we investigate the suitability of the maximum mean discrepancy (MMD), an established measure, for gauging test dataset dissimilarity and thereby predicting classifier performance. Finally, we propose an integrated process for the verification of ANN classifier generalisation performance, within which dissimilarity measures play a key role. The outputs of the verification process presented in this paper have cross-domain usage across many industries, including maritime, transportation, and aviation.
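For concreteness, the MMD mentioned above can be estimated empirically from two samples using a kernel. The following is a minimal sketch of the standard biased estimate of the squared MMD with an RBF kernel; the sample sizes, dimensionality, bandwidth, and the synthetic Gaussian data are illustrative choices, not those used in the paper's experiments.

```python
import numpy as np

def rbf_kernel(X, Y, sigma=1.0):
    """RBF (Gaussian) kernel matrix between the rows of X and Y."""
    sq_dists = (
        np.sum(X**2, axis=1)[:, None]
        + np.sum(Y**2, axis=1)[None, :]
        - 2.0 * X @ Y.T
    )
    return np.exp(-sq_dists / (2.0 * sigma**2))

def mmd2_biased(X, Y, sigma=1.0):
    """Biased empirical estimate of the squared MMD between samples X and Y."""
    return (
        rbf_kernel(X, X, sigma).mean()
        + rbf_kernel(Y, Y, sigma).mean()
        - 2.0 * rbf_kernel(X, Y, sigma).mean()
    )

rng = np.random.default_rng(0)
train = rng.normal(0.0, 1.0, size=(200, 2))      # stand-in for training data
test_near = rng.normal(0.0, 1.0, size=(200, 2))  # drawn from the same distribution
test_far = rng.normal(3.0, 1.0, size=(200, 2))   # shifted distribution
```

Under this sketch, a test set drawn from the training distribution yields an MMD estimate near zero, while the shifted test set yields a markedly larger value, which is the behaviour required of any dataset dissimilarity measure used for performance prediction.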