
AI systems are increasingly used in mission- and safety-critical applications. One of the more high-profile uses of AI is in the development of autonomous and semi-autonomous vehicles. A key component of these systems is a model that performs image classification: it takes an input image and predicts a label. Such models help classify objects like the type of sign on the side of a road: does it say to stop or yield, or is it simply a warning about a turtle crossing? As we have discussed previously, these systems have experienced several high-profile failures that appear to have come about in naturally occurring scenarios (see Figure 1) rather than through intentional manipulation (e.g., a hacker accessing and attacking the system). While these particular failures are specific to autonomous vehicles, any AI system that performs image classification is similarly vulnerable. Such failures rightly jeopardize users’ trust in these systems. If we want to establish justified confidence in these systems, we must start by quantifying how robust they are to these naturally occurring corruptions, as well as to intentional manipulation.

Verification and validation tests are at the heart of CalypsoAI’s VESPR platform. Every image classification model developed through the platform is evaluated using a battery of assessments, including a suite of tests for naturally occurring image corruptions. These assessments can also be used to evaluate externally trained TensorFlow (or Keras) models.

As a quick example, we used the VESPR platform to evaluate an off-the-shelf image classification model published by DeepMind that achieves near-perfect performance (99% top-1 accuracy) on in-sample images. The model uses a VGG-style architecture and was trained on the popular CIFAR10 dataset of 32 x 32-pixel images. However, when this model is evaluated using our suite of 25 different corruption tests (5 different corruption types with 5 levels each), its top-1 accuracy drops to between 7.8% and 11.3%. That is roughly what the model would achieve by randomly guessing one of the 10 possible labels (10%).
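To make the shape of such an evaluation concrete, here is a minimal sketch of a sweep over corruption types and levels that reports top-1 accuracy for each combination. It is not the VESPR implementation: the function names (`top1_accuracy`, `corruption_sweep`, `chance_accuracy`), the `predict` callable, and the `corruptions` dictionary are all hypothetical and chosen only for illustration.

```python
import numpy as np


def top1_accuracy(predict, images, labels):
    """Fraction of images whose highest-scoring class equals the true label."""
    scores = predict(images)  # assumed shape: (n_images, n_classes)
    return float(np.mean(np.argmax(scores, axis=1) == labels))


def corruption_sweep(predict, images, labels, corruptions, levels=(1, 2, 3, 4, 5)):
    """Return {(corruption_name, level): top-1 accuracy} for every combination.

    `corruptions` maps a name to a function (images, level) -> corrupted images.
    With 5 corruption types and 5 levels this yields 25 results.
    """
    results = {}
    for name, corrupt in corruptions.items():
        for level in levels:
            corrupted = corrupt(images, level)
            results[(name, level)] = top1_accuracy(predict, corrupted, labels)
    return results


def chance_accuracy(n_classes):
    """Expected top-1 accuracy of guessing uniformly at random."""
    return 1.0 / n_classes
```

For a Keras model, `predict` could simply be `model.predict`, which returns an array of per-class scores. Comparing each entry of the result against `chance_accuracy(10)` shows whether a corrupted-input accuracy is meaningfully better than guessing.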

These tests effectively simulate things like digital noise, an out-of-focus camera, and distortions caused by up-sampling a low-resolution image. While these tests do not increase the robustness of the pre-trained model to naturally occurring corruptions, they do provide important diagnostic information. At the most basic level, this information can be used to compare multiple candidate models and select the best among them. However, this information is also critical for decision-makers who must assess whether the model will perform adequately in real-world conditions.
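The three kinds of corruption mentioned above can be approximated in a few lines of NumPy. The sketch below is illustrative only: the function names, the mapping from level to corruption strength, and the choice of a box filter to stand in for defocus are all assumptions, not the tests used in the platform. Images are assumed to be float arrays in [0, 1] with shape (n, height, width, channels).

```python
import numpy as np


def gaussian_noise(images, level, rng=None):
    """Digital noise: add zero-mean Gaussian noise, then clip back to [0, 1]."""
    rng = np.random.default_rng(0) if rng is None else rng
    sigma = 0.04 * level  # assumed scaling of noise strength with level
    noisy = images + rng.normal(0.0, sigma, size=images.shape)
    return np.clip(noisy, 0.0, 1.0)


def box_blur(images, level):
    """Out-of-focus camera: average each pixel over a (2*level+1)^2 window."""
    k = level  # blur radius in pixels (assumed)
    padded = np.pad(images, ((0, 0), (k, k), (k, k), (0, 0)), mode="edge")
    h, w = images.shape[1], images.shape[2]
    out = np.zeros_like(images, dtype=float)
    for dy in range(2 * k + 1):
        for dx in range(2 * k + 1):
            out += padded[:, dy:dy + h, dx:dx + w, :]
    return out / (2 * k + 1) ** 2


def pixelate(images, level):
    """Up-sampled low-resolution image: subsample, then repeat pixels."""
    factor = level + 1  # assumed down-sampling factor
    h, w = images.shape[1], images.shape[2]
    small = images[:, ::factor, ::factor, :]
    up = np.repeat(np.repeat(small, factor, axis=1), factor, axis=2)
    return up[:, :h, :w, :]  # crop in case the repeat overshoots
```

Each function takes a batch of images and a level from 1 to 5 and returns a batch of the same shape, so any of them can be applied to a held-out test set before measuring accuracy.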

Combining these results with the context in which the model will be deployed informs the deployment decision. There are a few things we should ask ourselves: Should a model be deployed in isolation? Should it be deployed with a human-in-the-loop to review decisions in certain circumstances? Or should the model not be deployed at all? Answering these questions in an evidence-based way is necessary to enable confidence, trust, and transparency in any AI system.