Adversarial Machine Learning - Overview
Overview
The use of machine learning models has become ubiquitous. Their predictions are used to make decisions about healthcare, security, investments, and many other critical applications. It is therefore not surprising that bad actors want to manipulate such systems for nefarious purposes. All machine learning systems are trained on datasets that are assumed to be representative and valid for the subject matter in question. However, malicious actors can affect how an artificial intelligence system functions by poisoning its training data. This threat is exacerbated when the machine learning pipeline, which includes data collection, curation, labeling, and training, is not controlled completely by the model owner.
In this project, we seek to answer these questions: How can you tell when the training data has been poisoned? Can you repair a model that has been poisoned?
Generally speaking, malicious actors poison training data to:
- Misclassify inputs - Here, the adversary aims to shift the model's decision boundary so that a specific input is misclassified as a targeted class. For example, such attacks might cause certain pollutants to be categorized as innocuous, a sick person as healthy, or an anomaly as normal. A particularly insidious attack in this category is the backdoor or trojan attack, in which the adversary carefully poisons the model by inserting a backdoor key, so that the model performs well on standard training data and validation samples but misbehaves only when the backdoor key is present. An attacker can thus selectively make a deployed model misbehave by introducing backdoor keys (a minimal poisoning sketch follows this list). In one traffic-sign example, a backdoor causes the model to misclassify a stop sign as a speed limit sign whenever a post-it note has been placed on the stop sign. However, the model performs as expected on stop signs without the post-it note, making the backdoor difficult to detect, since users do not know the backdoor key (a post-it note in this case) a priori. Clearly, such a backdoor could have disastrous consequences for autonomous vehicles.
- Reduce model performance - Here, the adversary aims to limit the system's usefulness by reducing the model's overall accuracy, resulting in widespread misclassifications.
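To make the backdoor mechanism concrete, below is a minimal sketch of how an attacker might poison an image classification training set: a small trigger patch is stamped onto a fraction of the images, which are then relabeled to the attacker's target class. The array shapes, patch size, and poison_fraction parameter are illustrative assumptions, not a description of any specific attack from our work.

```python
import numpy as np

def poison_with_backdoor(images, labels, target_class, trigger_value=1.0,
                         poison_fraction=0.1, rng=None):
    """Illustrative backdoor poisoning: stamp a small trigger patch onto a
    fraction of training images and relabel them as the attacker's target class.

    images: float array of shape (N, H, W); labels: int array of shape (N,).
    """
    rng = np.random.default_rng() if rng is None else rng
    images, labels = images.copy(), labels.copy()
    n_poison = int(poison_fraction * len(images))
    idx = rng.choice(len(images), size=n_poison, replace=False)
    # The "backdoor key": a small bright square in the lower-right corner,
    # standing in for the post-it note in the stop-sign example.
    images[idx, -4:, -4:] = trigger_value
    labels[idx] = target_class
    return images, labels, idx
```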
Poisoning threats are particularly relevant when training data is obtained from untrusted sources, such as crowdsourced data or customer behavior data. The risk also increases when the model requires frequent retraining or customization. Lastly, the ability to detect when models have been poisoned or tampered with is vital when they are trained by untrusted third parties (e.g., models obtained from a model marketplace).
To date (2018), most research has focused on demonstrating and categorizing the types of malicious attacks against the training data of machine learning systems. However, few defenses have been proposed that proactively detect and revert poisoning attacks. Our work aims directly at identifying and correcting malicious attacks on training data. Thus far, we have proposed innovations based on two different methodologies to detect and repair different types of poisoning attacks:
- The provenance-based RONI approach is appropriate when the dataset includes a trusted provenance feature.
- The activation clustering approach is appropriate for detecting backdoors, which manifest as distinct decision pathways that lead to a common classification.
Provenance-Based RONI
Recently, a number of secure provenance frameworks have been developed for Internet of Things environments. These frameworks use modern cryptographic methodologies to ensure that provenance data, which describes the origin and lineage of collected datapoints, cannot be modified by adversaries. In our work, we take advantage of these frameworks to help detect poisonous data inserted to reduce model performance.
Intuitively, it is difficult for an adversary to compromise every data source, due to time and resource constraints. For this reason, we can often expect poisonous data to originate from a limited number of sources. Our method segments the training data into groups according to their provenance, so that the probability of poisoning is highly correlated across the samples in each group. Once the training data has been segmented, the data points in each segment are evaluated together by comparing the performance of the classifier trained with and without that group. The figure below depicts this process, and a minimal sketch of the evaluation loop follows.
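The following is a rough sketch of that segment-and-evaluate loop, assuming scikit-learn-style models; the names train_fn, provenance, and tolerance are illustrative and not part of an actual implementation.

```python
import numpy as np

def provenance_group_filter(X, y, provenance, X_val, y_val, train_fn,
                            tolerance=0.0):
    """Sketch of provenance-based filtering: drop any provenance group whose
    removal improves validation accuracy by more than `tolerance`.

    provenance: array of shape (N,) giving the trusted source id of each point.
    train_fn(X, y) must return a fitted model with a .score(X, y) method
    (scikit-learn style).
    """
    keep = np.ones(len(X), dtype=bool)
    baseline = train_fn(X, y).score(X_val, y_val)
    for group in np.unique(provenance):
        mask = provenance != group  # training set with this group held out
        score_without = train_fn(X[mask], y[mask]).score(X_val, y_val)
        if score_without - baseline > tolerance:
            # Removing this group helps: flag it as likely poisonous.
            keep &= mask
    return X[keep], y[keep]
```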
In contrast, a prior approach called Reject on Negative Impact (RONI) evaluates the effect of individual data points on the performance of the final classifier. However, single data points often have minimal impact on overall performance, and, as a result, poisonous data may escape detection. Additionally, evaluating each data point individually incurs significant computational and time costs. By using provenance data, our method groups datapoints appropriately and evaluates their cumulative effect on the classifier, thereby increasing detection rates and reducing computational costs.
Finally, we note that our provenance-based defense can be applied to any setting with a trusted feature that indicates where poisonous data might be concentrated. For example, if an adversary tried to poison a machine learning model that detects fraudulent credit card transactions, the account number can be used as a trusted feature: adversaries may falsely report transactions as fraudulent or legitimate, but they cannot manipulate the account to which a transaction is posted, and they can compromise only a limited number of credit cards. For more information, please see the Publications section.
Neural Network Activation Clusters
Our team has developed a method to detect backdoor attacks by analyzing differences in how a neural network arrives at its classifications. "Activations" are the intermediate computations made by the network before it produces its final classification. Our approach segments the training data according to its labels and clusters the corresponding activations from the last hidden layer of the neural network. Poisonous and legitimate data immediately separate into distinct clusters, akin to the way different areas of the brain light up on scans when subjected to different stimuli. This can be readily seen in the figure below, in which a backdoor trigger, a post-it note on a stop sign, has caused the sign to be categorized as a speed limit sign.

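As a rough illustration of this clustering step, the sketch below collects last-hidden-layer activations from a Keras classifier, reduces their dimensionality, and splits them into two clusters per class label. The choice of Keras, the ICA projection, and the fixed two-cluster split are assumptions made for this example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import FastICA
from tensorflow import keras

def cluster_activations_by_label(model, X, y, n_components=10):
    """Cluster last-hidden-layer activations separately for each class label.

    Assumes `model` is a Keras classifier whose second-to-last layer is a
    dense last hidden layer with at least `n_components` units.
    """
    # Build a model that outputs the last hidden layer instead of the softmax.
    feature_model = keras.Model(inputs=model.input,
                                outputs=model.layers[-2].output)
    activations = feature_model.predict(X)

    clusters = {}
    for label in np.unique(y):
        acts = activations[y == label]
        # Project to a low-dimensional space, then split into two clusters.
        reduced = FastICA(n_components=n_components).fit_transform(acts)
        clusters[label] = KMeans(n_clusters=2).fit_predict(reduced)
    return clusters
```

A class whose activations split into two well-separated clusters, one of them unexpectedly small, is a candidate for containing backdoored samples.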
Each cluster can then be examined for poison. Our results show over 99% detection accuracy on both the MNIST and LISA image datasets. To examine the clusters, we created an averaged image for each one; in both cases the image trigger was apparent. Similarly, we have tested the approach on a text dataset, the Rotten Tomatoes sentiment analysis corpus, into which we injected a backdoor signature. Here we achieved 97% accuracy and identified the poisonous signature. Once the backdoor has been identified, we demonstrate that it can be effectively removed by further training the neural network on the poisoned data after it has been relabeled correctly.
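The cluster-inspection and repair steps could look roughly like the sketch below; the averaged-image helper and the relabel-and-retrain function assume a compiled Keras image classifier and are illustrative rather than a faithful reproduction of our pipeline.

```python
import numpy as np

def average_cluster_image(images, cluster_assignments, cluster_id):
    """Average the images assigned to one cluster; for a poisonous cluster,
    the backdoor trigger tends to stand out in the averaged image."""
    return images[cluster_assignments == cluster_id].mean(axis=0)

def repair_model(model, X_poison, y_corrected, epochs=5):
    """Continue training on the identified poisonous samples after their
    labels have been corrected (assumes a compiled Keras model)."""
    model.fit(X_poison, y_corrected, epochs=epochs)
    return model
```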
Publications
- Detecting Backdoor Attacks on Deep Neural Networks by Activation Clustering. Submitted to NIPS, December 2018.