Adversarial Machine Learning Defenses

Deep networks are brittle: a perturbation imperceptible to humans can flip a classifier's decision with high confidence. These adversarial examples threaten autonomous driving, malware detection and biometric systems, making adversarial defence a core security topic.

How attacks work

Most attacks follow the gradient of the loss with respect to the input. The Fast Gradient Sign Method (FGSM) takes a single step of size ε along the sign of that gradient; Projected Gradient Descent (PGD) repeats smaller signed steps, projecting back into the ε-ball around the original input after each one, and is widely treated as the standard strong first-order attack. Defences must therefore withstand adaptive, white-box adversaries, not just weak ones.
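The two attacks can be sketched on a toy model. The example below is a minimal illustration, not a reference implementation: it assumes a two-feature logistic classifier with hand-picked weights, and the names loss_grad_wrt_input, fgsm, pgd and predict are hypothetical helpers introduced only for this sketch. For a linear model a single FGSM step is already the worst case in the L∞ ball; the iteration in PGD matters for non-linear networks.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sign(v):
    return (v > 0) - (v < 0)

def loss_grad_wrt_input(w, b, x, y):
    """Gradient of binary cross-entropy w.r.t. the input of a logistic model.

    With p = sigmoid(w.x + b) and label y in {0, 1}, dL/dx = (p - y) * w.
    """
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
    return [(p - y) * wi for wi in w]

def fgsm(w, b, x, y, eps):
    """One step of size eps along the sign of the input gradient."""
    g = loss_grad_wrt_input(w, b, x, y)
    return [xi + eps * sign(gi) for xi, gi in zip(x, g)]

def pgd(w, b, x, y, eps, step, n_steps):
    """Repeated signed steps, each projected back into the L-inf eps-ball around x."""
    x_adv = list(x)
    for _ in range(n_steps):
        g = loss_grad_wrt_input(w, b, x_adv, y)
        x_adv = [xa + step * sign(gi) for xa, gi in zip(x_adv, g)]
        # Projection: clip every coordinate to [x_i - eps, x_i + eps].
        x_adv = [min(max(xa, xi - eps), xi + eps) for xa, xi in zip(x_adv, x)]
    return x_adv

def predict(w, b, x):
    return int(sum(wi * xi for wi, xi in zip(w, x)) + b > 0)

# Assumed toy model and input: w.x + b = 0.6 - 0.2 = 0.4 > 0, so class 1.
w, b = [2.0, -1.0], 0.0
x, y = [0.3, 0.2], 1
x_fgsm = fgsm(w, b, x, y, eps=0.2)
x_pgd = pgd(w, b, x, y, eps=0.2, step=0.05, n_steps=10)
print(predict(w, b, x), predict(w, b, x_fgsm), predict(w, b, x_pgd))
```

Both perturbed inputs stay within 0.2 of the original in every coordinate, yet the predicted class changes from 1 to 0.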

Figure 1. An adversarial example is built by nudging the input along the loss gradient — small in pixel space, large in the model's decision space.

Defence strategies

Adversarial training — train on PGD-generated examples; the strongest empirical defence, formulated as a min-max (robust optimisation) problem
Certified defences — randomised smoothing gives a provable robustness radius, trading some clean accuracy for guarantees
Input transformation / purification — denoise or use a diffusion model to project inputs back to the data manifold before inference
Detection — flag inputs whose statistics look adversarial
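The min-max formulation behind adversarial training can be written as: minimise over θ the expected value of max over ‖δ‖∞ ≤ ε of L(θ; x + δ, y). The inner maximisation is approximated by PGD and the outer minimisation by ordinary gradient descent. The sketch below illustrates that loop under strong simplifying assumptions: a logistic model, a four-point symmetric toy dataset, and full-batch updates. All function names, the data and the hyperparameters are made up for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def logit(w, b, x):
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def sign(v):
    return (v > 0) - (v < 0)

def pgd(w, b, x, y, eps, step, n_steps):
    """Inner maximisation: approximate the worst-case input in the L-inf eps-ball."""
    x_adv = list(x)
    for _ in range(n_steps):
        p = sigmoid(logit(w, b, x_adv))
        grad_x = [(p - y) * wi for wi in w]  # d(cross-entropy)/dx
        x_adv = [xa + step * sign(g) for xa, g in zip(x_adv, grad_x)]
        x_adv = [min(max(xa, xi - eps), xi + eps) for xa, xi in zip(x_adv, x)]
    return x_adv

def adversarial_train(data, eps, lr=0.5, epochs=200):
    """Outer minimisation: full-batch gradient descent on the loss at PGD points."""
    w, b = [0.0] * len(data[0][0]), 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for x, y in data:
            x_adv = pgd(w, b, x, y, eps, step=eps / 2, n_steps=4)
            err = sigmoid(logit(w, b, x_adv)) - y  # d(loss)/d(logit)
            grad_w = [gw + err * xa for gw, xa in zip(grad_w, x_adv)]
            grad_b += err
        w = [wi - lr * gw / len(data) for wi, gw in zip(w, grad_w)]
        b -= lr * grad_b / len(data)
    return w, b

def robust_accuracy(w, b, data, eps):
    """Fraction of points still classified correctly after a PGD attack."""
    hits = 0
    for x, y in data:
        x_adv = pgd(w, b, x, y, eps, step=eps / 2, n_steps=4)
        hits += int((logit(w, b, x_adv) > 0) == bool(y))
    return hits / len(data)

# Hypothetical, well-separated toy data: two points per class.
data = [([1.0, 0.8], 1), ([0.8, 1.0], 1), ([-1.0, -0.8], 0), ([-0.8, -1.0], 0)]
w, b = adversarial_train(data, eps=0.3)
acc = robust_accuracy(w, b, data, eps=0.3)
print(w, b, acc)
```

The inner PGD call roughly multiplies the cost of each training step by the number of attack iterations, which is why adversarial training is expensive on real networks.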

Table 1. Defence approaches and trade-offs
Defence	Guarantee	Cost
Adversarial training	Empirical	Expensive training, lower clean accuracy
Randomised smoothing	Certified radius	Many forward passes at inference
Diffusion purification	Empirical	Heavy inference compute
Detection	None (filter only)	Can be evaded by adaptive attacks
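The randomised smoothing row can be made concrete with a short sketch. It follows the general shape of the certification procedure in Cohen et al.: classify many Gaussian-noised copies of the input, lower-bound the probability p of the majority class, and report an L2 radius of σ·Φ⁻¹(p). Several parts are assumptions for illustration: base_classifier is a hypothetical stand-in for a trained network, the sample counts are arbitrary, and a simple Hoeffding bound replaces the tighter Clopper-Pearson interval used in the paper.

```python
import math
import random
from statistics import NormalDist

def base_classifier(x):
    """Hypothetical stand-in for a trained network: a fixed linear rule."""
    return int(2.0 * x[0] - 1.0 * x[1] > 0)

def noisy_votes(f, x, sigma, n, rng):
    """Count class-1 votes of f on n Gaussian-perturbed copies of x."""
    ones = 0
    for _ in range(n):
        ones += f([xi + rng.gauss(0.0, sigma) for xi in x])
    return ones

def certify(f, x, sigma, n0=100, n=2000, alpha=0.001, seed=0):
    """Return (class, certified L2 radius), or (None, 0.0) to abstain."""
    rng = random.Random(seed)
    # Step 1: guess the majority class from a small batch of samples.
    guess = int(noisy_votes(f, x, sigma, n0, rng) * 2 > n0)
    # Step 2: estimate its probability from a fresh, larger batch.
    ones = noisy_votes(f, x, sigma, n, rng)
    p_hat = (ones if guess == 1 else n - ones) / n
    # One-sided Hoeffding lower bound, valid with probability >= 1 - alpha
    # (a simplification of the interval used in the paper).
    p_lower = p_hat - math.sqrt(math.log(1.0 / alpha) / (2.0 * n))
    if p_lower <= 0.5:
        return None, 0.0
    return guess, sigma * NormalDist().inv_cdf(p_lower)

cls, radius = certify(base_classifier, [1.0, 0.0], sigma=0.25)
print(cls, radius)
```

Each certificate here costs n0 + n = 2100 forward passes, which is the inference overhead noted in the table. For this linear rule the true distance from [1.0, 0.0] to the decision boundary is 2/√5 ≈ 0.894, and the certified radius is necessarily smaller because it is a conservative lower bound.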

Critical caveat: Many published defences were later broken because they produced gradient masking, which hides useful gradients from the attacker, rather than true robustness. Always evaluate against adaptive attacks designed with full knowledge of the defence.
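A toy illustration of this failure mode, assuming a hypothetical "defence" that snaps inputs to a 0.1 grid before a fixed logistic classifier: the quantiser makes the loss piecewise constant, so a naive gradient attack sees a zero gradient and reports the model as robust. An adaptive attacker who knows about the quantiser can treat it as the identity on the backward pass (the idea behind straight-through or BPDA-style approximations) and succeeds with the same budget. Every name and constant below is made up for the sketch.

```python
import math

W, B = [2.0, -1.0], 0.0  # assumed weights of the undefended classifier

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sign(v):
    return (v > 0) - (v < 0)

def quantise(x, step=0.1):
    """Hypothetical defence: snap each feature to a grid, flattening the loss."""
    return [round(xi / step) * step for xi in x]

def defended_prob(x):
    q = quantise(x)
    return sigmoid(sum(w * qi for w, qi in zip(W, q)) + B)

def predict(x):
    return int(defended_prob(x) > 0.5)

def loss(x, y):
    p = defended_prob(x)
    return -math.log(p if y == 1 else 1.0 - p)

def numeric_grad(x, y, h=1e-4):
    """Naive attacker: finite differences through the whole defended pipeline."""
    g = []
    for i in range(len(x)):
        hi = x[:i] + [x[i] + h] + x[i + 1:]
        lo = x[:i] + [x[i] - h] + x[i + 1:]
        g.append((loss(hi, y) - loss(lo, y)) / (2 * h))
    return g

def adaptive_grad(x, y):
    """Adaptive attacker: treat quantise() as the identity on the backward pass."""
    p = defended_prob(x)
    return [(p - y) * w for w in W]

def fgsm(x, y, eps, grad_fn):
    return [xi + eps * sign(g) for xi, g in zip(x, grad_fn(x, y))]

x, y = [0.32, 0.21], 1
naive = fgsm(x, y, 0.2, numeric_grad)      # gradient is zero, input unchanged
adaptive = fgsm(x, y, 0.2, adaptive_grad)  # same budget, crosses the boundary
print(predict(x), predict(naive), predict(adaptive))
```

When a weak or defence-unaware attack fails while an adaptive attack with the same ε succeeds, the apparent robustness came from masked gradients rather than from the model itself.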

Applications

Robust perception for autonomous vehicles and traffic-sign recognition
Malware and spam classifiers facing evasive adversaries
Biometric and content-moderation systems

References & further reading

Goodfellow et al., “Explaining and Harnessing Adversarial Examples,” ICLR 2015.
Madry et al., “Towards Deep Learning Models Resistant to Adversarial Attacks,” ICLR 2018 (PGD).
Cohen et al., “Certified Adversarial Robustness via Randomized Smoothing,” ICML 2019.