The number of papers on adversarial attacks posted to arXiv shows how active the field is.

If you want to enter the field of adversarial attacks now, there are three routes to explore: 1. find new application scenarios; 2. design new algorithms; and 3. study how to circumvent existing defense schemes.

To support the third route, this article summarizes and analyzes the characteristics of existing defense schemes.

An adversarial example (also called an adversarial sample) is formed by deliberately adding a subtle perturbation to a sample from the original dataset. The perturbation is either invisible to the naked eye or visible without changing the overall content, yet it causes a trained model to output, with high confidence, a classification different from the one it gives for the original sample.

The perturbation is the key ingredient of adversarial sample generation. In general, it must meet two requirements. First, it must be small, so that it is either invisible to the naked eye or, if visible, does not affect the overall image. Second, once it is added to specific pixels of the original image, the resulting image must fool the original deep classification model.
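To make the two requirements concrete, here is a minimal sketch on a toy linear classifier. The weights, the input, and the helper names (`linear_score`, `sign_perturbation`) are hypothetical and chosen only for illustration; the perturbation moves each feature by at most `eps` (the "small" requirement) in the direction that lowers the score of the true class (the "fooling" requirement).

```python
import numpy as np

def linear_score(w, b, x):
    """Score of a toy linear classifier; the predicted class is the sign of the score."""
    return float(np.dot(w, x) + b)

def sign_perturbation(w, x, y, eps):
    """Shift every feature by at most eps in the direction that lowers the margin y * score."""
    return x - eps * y * np.sign(w)

# Hypothetical weights and input, used only for illustration.
w = np.array([0.5, -0.25, 0.75, -0.5])
b = 0.0
x = np.array([0.2, -0.1, 0.1, -0.1])   # clean input with true label y = +1
y = 1

x_adv = sign_perturbation(w, x, y, eps=0.15)
print("clean score:", linear_score(w, b, x))            # positive: classified as +1
print("adversarial score:", linear_score(w, b, x_adv))  # negative: classified as -1
print("largest change per feature:", np.max(np.abs(x_adv - x)))
```

On a real image model the gradient of the loss plays the role of `w` here, which is the idea behind FGSM-style attacks.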

Feeding adversarial samples into a deep learning model can cause it to misclassify. Existing models are vulnerable to adversarial samples, and an adversarial attack involves two key components: the input sample and the model. Defenses can therefore start from the same two directions. For the sample, one can detect whether it is adversarial, or transform it to remove the perturbation that is specific to adversarial samples. For the model, one can modify the model structure or the training process to make the model more robust to adversarial samples. Next, we summarize and analyze representative defense schemes from both the model side and the sample side.

Adversarial training

Adversarial training adds adversarial samples to the training set to improve the robustness of the model.

Adversarial training comes in two forms. In standard adversarial training [1], the added adversarial samples are generated by attacking the model being trained. In the second form [2], they are generated by attacking other models, which increases the diversity of the adversarial samples and improves the robustness of the model under other adversarial attacks.

Specifically, [1] studies the adversarial robustness of neural networks from the perspective of robust optimization. It uses a min-max formulation to place attack and defense in a common theoretical framework. The formulation is essentially a saddle point problem, and it can be optimized directly through adversarial training. The problem is as follows
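The formula itself is missing from the text. As a sketch, assuming the usual notation for this kind of robust optimization (θ for the model parameters, D for the data distribution, S for the set of allowed perturbations δ, L for the training loss, and ρ for the adversarial loss that the text writes as p), the saddle point problem is commonly written as:

```latex
\min_{\theta} \rho(\theta),
\qquad
\rho(\theta) = \mathbb{E}_{(x,y)\sim\mathcal{D}}
\Big[ \max_{\delta \in \mathcal{S}} L(\theta,\, x + \delta,\, y) \Big]
```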

Here ρ is the adversarial loss of the network. The problem combines an inner maximization with an outer minimization. The inner maximization looks for an adversarial sample with high loss for a given data point x, which is the goal of the adversarial attack. The outer minimization looks for model parameters that minimize the adversarial loss given by the inner problem, which is the goal of adversarial robustness. [1] solves this saddle point problem in the neural network setting, so that adversarial training improves the adversarial robustness of the model. Some of the experimental results are as follows

The figure shows the cross-entropy loss on adversarial samples during training. The adversarial loss declines steadily when training against PGD attacks on MNIST and CIFAR10, which means that the model becomes more and more robust through adversarial training.
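The inner maximization and outer minimization described above can be sketched in a few lines. This is not the setup of [1]: it assumes a toy logistic regression model on synthetic two-dimensional data instead of a neural network on MNIST or CIFAR10, and the function names, step sizes, and the L-infinity budget `eps` are illustrative choices.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss_and_grads(w, b, x, y):
    """Mean logistic loss for labels y in {0, 1}, with gradients w.r.t. w, b and the inputs x."""
    p = sigmoid(x @ w + b)
    loss = -np.mean(y * np.log(p + 1e-12) + (1 - y) * np.log(1 - p + 1e-12))
    err = (p - y) / len(y)
    return loss, x.T @ err, err.sum(), np.outer(err, w)

def pgd(w, b, x, y, eps, step, iters):
    """Inner maximization: projected gradient ascent on the loss inside an L-infinity ball."""
    delta = np.zeros_like(x)
    for _ in range(iters):
        _, _, _, grad_x = loss_and_grads(w, b, x + delta, y)
        delta = np.clip(delta + step * np.sign(grad_x), -eps, eps)
    return x + delta

def adversarial_train(x, y, eps=0.3, lr=0.5, epochs=200):
    """Outer minimization: gradient descent on the loss measured at the PGD points."""
    w, b = np.zeros(x.shape[1]), 0.0
    for _ in range(epochs):
        x_adv = pgd(w, b, x, y, eps, step=eps / 4, iters=8)
        _, grad_w, grad_b, _ = loss_and_grads(w, b, x_adv, y)
        w, b = w - lr * grad_w, b - lr * grad_b
    return w, b

# Synthetic data: two Gaussian clusters around (-1, -1) and (+1, +1).
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=200).astype(float)
x = rng.normal(size=(200, 2)) * 0.3 + np.where(y[:, None] == 1, 1.0, -1.0)

w, b = adversarial_train(x, y)
x_adv = pgd(w, b, x, y, eps=0.3, step=0.075, iters=8)
robust_acc = np.mean((sigmoid(x_adv @ w + b) > 0.5) == (y == 1))
print("accuracy under the PGD attack:", robust_acc)
```

Each training step first attacks the current model and then updates the parameters on the attacked points, which is why the adversarial loss is the quantity tracked during training.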

[2] notes that, because adversarial samples transfer between models, perturbations generated on other models are a good approximation of the maximization problem in the formulation proposed in [1]. The model being trained cannot influence the strength of these adversarial samples, so minimizing the training loss amounts to increasing robustness to black-box attacks from other models. Some of the experimental results are as follows

Here Step-LL is a variant of FGSM, and the adversarial samples it generates can be written as
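The expression is missing from the text. As a sketch, assuming the usual definition of the single-step least-likely class attack (h(x) is the model output, L the loss, ε the perturbation size, and y_LL the class the model considers least likely), Step-LL is commonly written as:

```latex
x^{\mathrm{adv}} = x - \epsilon \cdot \operatorname{sign}\!\big( \nabla_x L(h(x),\, y_{LL}) \big),
\qquad
y_{LL} = \arg\min \{ h(x) \}
```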

Max. Black-Box is the highest error rate among several black-box attacks (Step-LL, R+Step-LL, FGSM, I-FGSM, PGD, etc.). The models whose subscript is adv-ens* in the first column are those obtained by ensemble adversarial training. The table shows that these models have the lowest error rate, which indicates the best robustness.

[16] uses a GAN to train two networks alternately: one generates adversarial samples and the other tries to distinguish them. By playing against each other, they yield a classifier network that is more robust and can also detect adversarial samples. The two networks in a GAN play a min-max game over the value function V(G, D):
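The value function is missing from the text. For reference, the standard GAN objective, with generator G, discriminator D, data distribution p_data and noise distribution p_z, is:

```latex
\min_{G} \max_{D} V(G, D) =
\mathbb{E}_{x \sim p_{\mathrm{data}}(x)} \big[ \log D(x) \big]
+ \mathbb{E}_{z \sim p_{z}(z)} \big[ \log \big( 1 - D(G(z)) \big) \big]
```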

This game forces both models to improve until the discriminator cannot distinguish generated samples from original ones. [16] applies this scheme to improve the adversarial robustness of the model, as shown in the following diagram

The generator G produces perturbations that use the gradient of each image to fool the classifier network, while the classifier network F is trained to distinguish the adversarial samples generated by G from the original samples.

The generator is the key component, and its loss function is defined as follows

The above equation has two parts. The first is a loss function, used to find a perturbed image that reduces the probability of being correctly classified by F. The second is a cost function that limits the power of the perturbation so that it is not too large. c_g is a hyperparameter that adjusts the ratio between the two. The loss function of classifier F is based on the adversarial objective function and is expressed as follows

Some of the experimental results are as follows

In the figure, panels a and b show experiments on different datasets. The black line is the accuracy of the GAN-trained model under adversarial attack, the red line is basic adversarial training, and the blue line is normal training. The adversarial robustness obtained by GAN training is higher than that of basic adversarial training, which in turn is higher than that of normal training.

Gradient hiding

Most attacks (FGSM, BIM, etc.) use the model’s gradient information to generate adversarial samples. Gradient-based attacks are known to be ineffective against non-differentiable models (decision trees, KNN, random forests), so a defense can be built by hiding the gradient information of the model.

[3] proposes the Deep Contractive Network (DCN), which introduces the smoothness penalty of the contractive autoencoder (CAE) during training, so that the output of the model is less sensitive to changes in the input, thereby hiding the gradient information. Ideally, the DCN model penalizes the following objective

However, the above objective is expensive to compute, because the partial derivatives must be calculated at every layer during backpropagation. It is therefore simplified as follows

The contractive penalty is now applied layer by layer, so the partial derivatives are computed in the same way as in a CAE and are easily incorporated into backpropagation.
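To illustrate why a layer-by-layer penalty is cheap, here is a minimal sketch of the CAE-style penalty for a single sigmoid layer. It assumes the squared Frobenius norm of the layer Jacobian as the penalty, which has a closed form for sigmoid units; the exact norm and weighting used in [3] may differ, and the function names are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def layer_contractive_penalty(W, b, h_prev):
    """Squared Frobenius norm of dh/dh_prev for h = sigmoid(W @ h_prev + b), in closed form."""
    h = sigmoid(W @ h_prev + b)
    # Jacobian entry (j, i) is h_j * (1 - h_j) * W[j, i], so the squared norm factorizes per unit.
    return float(np.sum((h * (1.0 - h)) ** 2 * np.sum(W ** 2, axis=1)))

def numerical_penalty(W, b, h_prev, step=1e-6):
    """Same quantity from a finite-difference Jacobian, used only as a sanity check."""
    base = sigmoid(W @ h_prev + b)
    jac = np.empty((W.shape[0], W.shape[1]))
    for i in range(W.shape[1]):
        bumped = h_prev.copy()
        bumped[i] += step
        jac[:, i] = (sigmoid(W @ bumped + b) - base) / step
    return float(np.sum(jac ** 2))

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))
b = rng.normal(size=3)
h_prev = rng.normal(size=4)

closed_form = layer_contractive_penalty(W, b, h_prev)
finite_diff = numerical_penalty(W, b, h_prev)
print(closed_form, finite_diff)
```

The closed form needs only the layer activations and weights that backpropagation already has, so summing it over layers adds little cost compared with differentiating the whole network output with respect to the input.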

Some of the results are as follows

As the table shows, an adversarial sample needs more distortion to attack a DCN than to attack the original model, which indicates that the DCN is more robust than the ordinary model.

[4] proposes defensive distillation, which uses the distillation method without changing the size of the model, simply to make the output of the model smoother and the model more robust. A schematic diagram of the scheme is shown below

First, an initial network F is trained on the training set X with a softmax at temperature T. Its prediction vectors F(X) are then obtained, and the class probability knowledge they contain is used to train the distilled network Fd on the same data X, also at temperature T.
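The role of the temperature can be shown with a short sketch of a softmax at temperature T, which divides the logits by T before normalizing. The logits below are made up for illustration; higher temperatures give softer probability vectors, which is the form in which the knowledge is passed to the distilled network.

```python
import numpy as np

def softmax_with_temperature(logits, T=1.0):
    """Softmax of logits / T; a larger T gives a smoother probability vector."""
    z = np.asarray(logits, dtype=float) / T
    z = z - z.max()                 # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

logits = [8.0, 2.0, 0.5]            # hypothetical outputs of the last layer
hard = softmax_with_temperature(logits, T=1.0)
soft = softmax_with_temperature(logits, T=40.0)
print("T = 1 :", hard)              # almost all mass on the first class
print("T = 40:", soft)              # much closer to uniform
```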

The core idea is that the knowledge extracted by distillation is transferred to another network in the form of probability vectors. This keeps accuracy comparable to the original network while improving the generalization of the network outside the training set, which strengthens its resistance to perturbations, that is, its robustness. Some of the experimental results are as follows

As the figure shows, without distillation the success rate of adversarial attacks is 95.89% on MNIST and 87.89% on CIFAR10. With distillation, robustness increases with temperature up to a point: on MNIST robustness is strongest at temperatures 40 and 100, and on CIFAR10 it is strongest at temperature 100.

[5] trains with input gradient regularization: a term in the objective function penalizes how strongly the output changes with respect to the input. To a certain extent, this prevents small adversarial perturbations from significantly changing the final predictions of the model, which improves robustness. The scheme trains by minimizing the energy of the network together with its rate of change with respect to the input features, expressed below

This can be expressed more precisely as

where λ is a hyperparameter that specifies the strength of the penalty.
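As a sketch of what such an objective looks like in code, assume the commonly cited compact form, cross-entropy plus λ times the squared L2 norm of the gradient of the cross-entropy with respect to the input. The example uses a toy logistic regression model, for which that input gradient is available in closed form; the model, the data, and the averaging over samples are assumptions made for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def regularized_objective(w, b, x, y, lam):
    """Cross-entropy plus lam times the mean squared L2 norm of its gradient w.r.t. the inputs."""
    p = sigmoid(x @ w + b)
    ce = -np.mean(y * np.log(p + 1e-12) + (1 - y) * np.log(1 - p + 1e-12))
    # For logistic regression, d(per-sample cross-entropy)/dx = (p - y) * w.
    input_grad = np.outer(p - y, w)
    penalty = np.mean(np.sum(input_grad ** 2, axis=1))
    return ce + lam * penalty, ce, penalty

# Hypothetical parameters and data, for illustration only.
w = np.array([1.5, -2.0])
b = 0.1
x = np.array([[0.5, -0.5], [-1.0, 0.3], [0.2, 0.8]])
y = np.array([1.0, 0.0, 0.0])

total, ce, penalty = regularized_objective(w, b, x, y, lam=0.1)
print("cross-entropy:", ce, "gradient penalty:", penalty, "objective:", total)
```

For a deep network the input gradient has no closed form, so the penalty is usually obtained by differentiating through the gradient computation with an automatic differentiation framework.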

The goal of this scheme is to ensure that, for any small change in the input, the KL divergence between predictions and labels does not change significantly. Applying it during training can improve model robustness. Some of the experimental results are as follows

As the figure above shows, gradient regularization does greatly improve the robustness of the model. In addition, combining gradient regularization with adversarial training makes the model more robust still.

Other schemes that are gradient hiding in nature include a new framework that produces gradient update information that is statistically indistinguishable, so that, through gradient regularization, salient information that could lead to adversarial samples is removed. In addition, [11] uses a high-level representation guided denoiser (HGD), training a neural-network-based denoiser to remove adversarial perturbations, which also enhances the adversarial robustness of the model.

Input transformation

[6] applies five of the most commonly used image preprocessing and transformation methods directly to the input samples, improving the prediction accuracy of the model on adversarial samples.

These five methods fall into three categories. Cropping-rescaling, bit-depth reduction, and compression form the first category. Schemes of this type use various transformations to reduce the perturbation that may be present in the sample to be predicted, and then feed the transformed sample into the model for prediction. Image cropping-rescaling changes the spatial position of the adversarial perturbation, which matters a great deal for the success of an attack. The researchers cropped and rescaled images during training as part of data augmentation; at test time they took random crops of the input sample and averaged the predictions. Bit-depth reduction is a simple quantization operation that removes small perturbations of pixel values from the image. Compression removes perturbations in a similar way.
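Bit-depth reduction is the easiest of these to show in code. The sketch below assumes images with values in [0, 1] and a hypothetical helper `reduce_bit_depth`; the pixel values and the noise are made up. A perturbation is removed only when it does not push a pixel across a quantization boundary, so this is an illustration of the mechanism, not a guarantee.

```python
import numpy as np

def reduce_bit_depth(image, bits):
    """Quantize an image with values in [0, 1] to 2**bits levels per channel."""
    levels = 2 ** bits - 1
    return np.round(np.clip(image, 0.0, 1.0) * levels) / levels

# Hypothetical clean pixels and a small perturbation.
clean = np.array([[0.1, 0.3, 0.6], [0.7, 0.9, 0.45]])
noise = np.array([[0.01, -0.02, 0.015], [-0.01, -0.02, 0.02]])

quantized_clean = reduce_bit_depth(clean, bits=3)
quantized_perturbed = reduce_bit_depth(clean + noise, bits=3)
print(np.array_equal(quantized_clean, quantized_perturbed))  # the perturbation is gone
```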

The second category is total variation minimization. This method randomly selects a small set of pixels and reconstructs the simplest image consistent with the selected pixels. The reconstructed image does not contain the adversarial perturbations, because these tend to be small and localized. First, a random set of pixels is selected by sampling a Bernoulli random variable X(i, j, k) for each pixel position (i, j, k); a pixel is kept when X(i, j, k) = 1. Next, total variation minimization is used to construct an image z that is similar to the (perturbed) input image x on the selected set of pixels and is also the simplest in terms of total variation, by solving the equation

TVp in the above equation is the total variation, calculated by the following equation

Small perturbations can be removed by minimizing the total variation.
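The total variation itself is simple to compute. The sketch below assumes the usual definition for an image z of shape (rows, columns, channels): the p-norms of the differences between neighboring rows plus those between neighboring columns, summed over channels. It only evaluates TVp and does not implement the minimization; the test images are synthetic.

```python
import numpy as np

def total_variation(z, p=2):
    """TV_p of an image z with shape (rows, cols, channels)."""
    row_diff = z[1:, :, :] - z[:-1, :, :]    # differences between neighboring rows
    col_diff = z[:, 1:, :] - z[:, :-1, :]    # differences between neighboring columns
    tv_rows = np.sum(np.linalg.norm(row_diff, ord=p, axis=1))
    tv_cols = np.sum(np.linalg.norm(col_diff, ord=p, axis=0))
    return float(tv_rows + tv_cols)

rng = np.random.default_rng(0)
smooth = np.tile(np.linspace(0.0, 1.0, 8)[:, None, None], (1, 8, 1))  # vertical gradient
noisy = smooth + rng.normal(scale=0.05, size=smooth.shape)            # small pixel noise

print("TV of the smooth image:", total_variation(smooth))
print("TV of the noisy image: ", total_variation(noisy))
```

Pixel-level noise raises the total variation, which is why an image with low total variation that still matches the kept pixels tends to drop the perturbation.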

The third category is image quilting, which removes adversarial perturbations using a patch database built only from “clean” images (without adversarial perturbations). The patches used to synthesize an image are selected by finding, in the patch database, the K nearest neighbors (in pixel space) of the corresponding patch from the adversarial image, and then picking one of those neighbors uniformly at random. The idea behind this defense is that the resulting image consists only of pixels that have not been modified by the attacker, because the database of real patches is unlikely to contain the structures that appear in the attacker’s image. Some of the experimental results are as follows

The figure compares the five transformation schemes with the undefended case. The accuracy of all five defenses is higher than with no defense, and the scheme based on cropping-rescaling has the best defensive effect.

[7] adds a randomization layer (random resizing followed by random padding, that is, padding zeros around the input image in a random manner) before the sample to be predicted is fed into the original model, and then predicts with the original model. The process is shown below

The input image is first randomly resized and then randomly padded, and the model predicts on the padded sample. Some of the experimental results are as follows

The randomization layer effectively mitigates all of the adversarial attacks, and the defense is better still when combined with adversarial training.
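The randomization layer can be sketched with NumPy alone. The version below is an illustration under stated assumptions: it uses nearest-neighbor resizing, a square output of side `out_size`, and small made-up image sizes, none of which are taken from [7].

```python
import numpy as np

def random_resize_and_pad(image, out_size, rng):
    """Nearest-neighbor resize to a random size, then zero-pad at a random offset to out_size."""
    h, w = image.shape[:2]
    new_size = int(rng.integers(max(h, w), out_size))      # random side length below out_size
    rows = np.arange(new_size) * h // new_size              # nearest-neighbor source rows
    cols = np.arange(new_size) * w // new_size              # nearest-neighbor source columns
    resized = image[rows][:, cols]
    top = int(rng.integers(0, out_size - new_size + 1))     # random amount of padding on top
    left = int(rng.integers(0, out_size - new_size + 1))    # random amount of padding on the left
    padded = np.zeros((out_size, out_size) + image.shape[2:], dtype=image.dtype)
    padded[top:top + new_size, left:left + new_size] = resized
    return padded

rng = np.random.default_rng(0)
image = np.ones((8, 8, 3))
randomized = random_resize_and_pad(image, out_size=12, rng=rng)
print(randomized.shape)   # always (12, 12, 3), whatever random size was drawn
```

Because the size and the offset change on every call, an attacker cannot know in advance where the perturbation will land in the image the model actually sees.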

[8] designs PixelDefend, which “purifies” an image by moving the adversarial sample toward the distribution seen in the training set.

The idea is to transform the input: a small transformation of the input sample can move it back to the normal distribution of the training data, that is, move the image toward a region of high probability. The process can be represented by the following equation

where p is the distribution of the training set, X is the input sample, and X* is the purified sample we want to obtain. The constraint ε_defend controls a trade-off: if it is too large, X* may differ semantically from X; if it is too small, it may not be enough to return the sample to the normal distribution. The researchers used the PixelCNN distribution p_CNN(X) to approximate p(X).
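The equation is missing from the text. As a sketch consistent with the description above, assuming the constraint is an L-infinity bound on the change to the input, the purification step can be written as:

```latex
\max_{X^{*}} \; p(X^{*})
\quad \text{subject to} \quad
\lVert X^{*} - X \rVert_{\infty} \le \epsilon_{\mathrm{defend}}
```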

The table reports experiments on the Fashion-MNIST dataset. The upper half shows the baseline defense schemes and the lower half shows the scheme proposed in [8]. Each cell has the format x/y, where x is the accuracy on attacked images when the attack constraint is 8 and y is the accuracy when the constraint is 5. The best result in each column is in bold, and the table shows that the scheme is optimal in most cases.

Also in this category, [17] converts the bitmap input image into vector image space and back again before classification, to avoid being fooled by adversarial structures. [9] discretizes continuous input samples using thermometer encoding, and uses thermometer-encoded samples in both the training and prediction phases. For reasons of space, these are not covered in detail here.
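Thermometer encoding is easy to show in a few lines. The sketch below uses one common convention, assumed here for illustration: a value in [0, 1] becomes a vector of `levels` bits, and bit k is set when the value reaches the threshold k / levels. The exact thresholds and bit order used in [9] may differ.

```python
import numpy as np

def thermometer_encode(x, levels=10):
    """Encode values in [0, 1] as `levels` bits; bit k is 1 when x >= k / levels (k = 1..levels)."""
    x = np.asarray(x, dtype=float)
    thresholds = np.arange(1, levels + 1) / levels
    return (x[..., None] >= thresholds).astype(np.float32)

print(thermometer_encode(0.5, levels=4))    # bits: 1, 1, 0, 0
print(thermometer_encode(1.0, levels=4))    # bits: 1, 1, 1, 1

image = np.array([[0.0, 0.3], [0.6, 0.9]])
encoded = thermometer_encode(image, levels=4)
print(encoded.shape)                         # (2, 2, 4): one code per pixel
```

Small changes to a pixel usually leave its code unchanged, and the encoding step is not differentiable, which is what makes gradient-based attacks harder to apply directly.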

Input detection

[12] assumes that adversarial samples do not lie on the manifold of non-adversarial data, and proposes two detection methods: kernel density estimation and Bayesian uncertainty estimation. Adversarial samples are classified correctly by humans but misclassified by models, and this phenomenon can be understood from the perspective of the manifold of the training data. Images are thought to lie on low-dimensional manifolds, and studies have shown that traversing the data manifold can change the true class of an image. The key idea is that adversarial perturbations (which make no meaningful change to the input) must push samples off the data manifold, and that adversarial samples lie near class boundaries at the edge of the data manifold. Based on these studies, adversarial samples are assumed not to lie on the data manifold. Suppose the original sample is x with class cx, and the resulting adversarial sample is x*, which is not on the manifold and has class cx*.

In the case of binary classification, a simplified schematic is as follows: the adversarial sample x* is pushed off the “-” submanifold, and the black dotted line is the decision boundary. Three cases are possible:

In a, x* is far from the “+” submanifold. In b, the “+” submanifold has a pocket in which x* lies. In c, x* is close to the boundaries of both submanifolds and to the decision boundary.

The researchers modeled the submanifold of each class by kernel density estimation in the feature space of the last hidden layer. The density estimate, given below, measures how far x is from the submanifold of class t

For a point x with predicted class t, the density estimate is defined as:

Here φ(x) is the activation vector of the last hidden layer for point x, and Xt is the set of training points of class t.
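A density estimate of this kind can be sketched as follows. The code assumes a Gaussian kernel with bandwidth `sigma` averaged over the training points of class t, and it takes the last-hidden-layer activations as given arrays; the synthetic features stand in for a real network and are for illustration only.

```python
import numpy as np

def kernel_density(phi_x, phi_train_t, sigma=1.0):
    """Mean Gaussian kernel between phi(x) and the features of the class-t training points."""
    sq_dist = np.sum((phi_train_t - phi_x) ** 2, axis=1)
    return float(np.mean(np.exp(-sq_dist / sigma ** 2)))

rng = np.random.default_rng(0)
phi_train_t = rng.normal(size=(500, 16))   # synthetic features of class t, clustered at the origin

on_manifold = np.zeros(16)                  # a point inside the cluster
off_manifold = np.full(16, 3.0)             # a point far from the cluster

print("density near the class-t features:", kernel_density(on_manifold, phi_train_t, sigma=4.0))
print("density far from them:            ", kernel_density(off_manifold, phi_train_t, sigma=4.0))
```

A low density for the predicted class suggests that the point lies far from that class's submanifold, which is the signal the detector relies on.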

For the MNIST dataset under the BIM attack, the density estimates are plotted as follows

As the figure shows, the density estimate for the original class falls (blue), the estimate for the wrong class rises (red), and the two curves cross roughly where the predicted class changes (black line). This method can therefore detect adversarial samples, but density estimation only easily detects adversarial samples that are far from the cx* submanifold. Hence the second method, Bayesian uncertainty estimation, which can be calculated by the following equation
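The equation is missing from the text. A common estimate, assumed here, runs T stochastic forward passes with dropout left on, collects the prediction vectors, and takes the mean of their squared norms minus the squared norm of their mean. The sketch below computes that quantity from an array of predictions; the prediction values are made up.

```python
import numpy as np

def bayesian_uncertainty(stochastic_preds):
    """Mean of y_i . y_i over T stochastic passes, minus mean(y) . mean(y)."""
    preds = np.asarray(stochastic_preds, dtype=float)   # shape (T, num_classes)
    mean_pred = preds.mean(axis=0)
    return float(np.mean(np.sum(preds ** 2, axis=1)) - mean_pred @ mean_pred)

# Hypothetical softmax outputs from T = 4 dropout passes.
stable = [[0.9, 0.1]] * 4                                        # passes agree
unstable = [[0.9, 0.1], [0.2, 0.8], [0.7, 0.3], [0.1, 0.9]]      # passes disagree

print("uncertainty when passes agree:   ", bayesian_uncertainty(stable))
print("uncertainty when passes disagree:", bayesian_uncertainty(unstable))
```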

The following figure compares the Bayesian uncertainty distributions of adversarial samples, normal samples, and noisy samples, with experiments on both BIM and JSMA. The results show that the uncertainty distribution of adversarial samples is statistically different from that of the other two types, so they can be detected by this scheme.

Other schemes of this type include [13], which uses an ADN as a binary detector network to distinguish normal from adversarial samples, and [10], which uses the property that the LID (local intrinsic dimensionality) of adversarial samples is much larger than that of normal samples to tell the two apart.

The road to defending against adversarial samples is long and arduous.