CSE · Seminar 09 · Hardening models against crafted attacks

Adversarial Machine Learning Defenses

Adversarial defences protect ML models from inputs perturbed to cause misclassification, using techniques such as adversarial training, certified robustness and input purification.

adversarial ML · FGSM · PGD · adversarial training · robustness

Deep networks are brittle: a perturbation imperceptible to humans can flip a classifier's decision with high confidence. These adversarial examples threaten autonomous driving, malware detection and biometric systems, making adversarial defence a core security topic.

How attacks work

Most attacks follow the gradient of the loss with respect to the input. The Fast Gradient Sign Method (FGSM) takes a single step of size ε in the direction of the gradient's sign. Projected Gradient Descent (PGD) repeats smaller steps of the same kind, projecting the result back into a small ε-ball around the original input after each step, and is widely treated as the standard strong first-order attack. Defences must withstand such adaptive, white-box adversaries, not just weak ones.
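The two attacks can be sketched in a few lines. The example below is a minimal illustration, not a reference implementation: it assumes a hypothetical logistic-regression model (weights `w`, bias `b`) so that the input gradient has a closed form, (p − y)·w, and it uses an L∞ ε-ball. With a deep network the gradient would come from an autodiff framework instead.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(w, b, x):
    """P(y = 1 | x) for a toy logistic-regression model (an assumption of this sketch)."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

def input_gradient(w, b, x, y):
    """Gradient of the cross-entropy loss with respect to the input x.
    For logistic regression this is (p - y) * w."""
    p = predict_proba(w, b, x)
    return [(p - y) * wi for wi in w]

def sign(v):
    return (v > 0) - (v < 0)

def fgsm(w, b, x, y, eps):
    """One step of size eps along the sign of the input gradient."""
    g = input_gradient(w, b, x, y)
    return [xi + eps * sign(gi) for xi, gi in zip(x, g)]

def pgd(w, b, x, y, eps, step, iters):
    """Repeated signed-gradient steps, each projected back onto the
    L-infinity ball of radius eps around the original input x."""
    x_adv = list(x)
    for _ in range(iters):
        g = input_gradient(w, b, x_adv, y)
        x_adv = [xa + step * sign(gi) for xa, gi in zip(x_adv, g)]
        x_adv = [min(max(xa, x0 - eps), x0 + eps) for xa, x0 in zip(x_adv, x)]
    return x_adv

# Illustrative values only.
w, b = [2.0, -3.0, 1.0], 0.1
x, y = [0.2, -0.1, 0.3], 1
print("clean p(y=1):", predict_proba(w, b, x))
print("FGSM  p(y=1):", predict_proba(w, b, fgsm(w, b, x, y, eps=0.3)))
print("PGD   p(y=1):", predict_proba(w, b, pgd(w, b, x, y, eps=0.3, step=0.1, iters=10)))
```

For a linear model the gradient sign never changes, so PGD ends at the same corner of the ε-ball as FGSM. The iterations matter for non-linear models, where the gradient direction shifts as the input moves.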

[Diagram: Clean input → Add ε·sign(∇ₓ loss) → Adversarial input → Model → Wrong prediction. FGSM attack: a single gradient-sign perturbation.]
Figure 1. An adversarial example is built by nudging the input along the loss gradient — small in pixel space, large in the model's decision space.
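Why a small per-pixel change can move the decision so far is easiest to see for a linear score. The derivation below is a sketch under that assumption; the symbols w (weight vector), η (perturbation), n (input dimension) and m (average weight magnitude) are introduced here for illustration and follow the linear argument in Goodfellow et al.

```latex
\begin{align*}
\eta &= \epsilon \,\operatorname{sign}(w), \qquad \|\eta\|_\infty = \epsilon \\
w^\top (x + \eta) &= w^\top x + \epsilon \sum_{i=1}^{n} |w_i|
                   = w^\top x + \epsilon \,\|w\|_1 \approx w^\top x + \epsilon\, m\, n
\end{align*}
```

The per-coordinate change stays at ε, but the shift in the score grows with the input dimension. With illustrative values ε = 0.01, m = 0.1 and n = 10,000, the score moves by about 10, which is enough to cross many decision boundaries.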

Defence strategies

  • Adversarial training — train on PGD-generated examples; the strongest empirical defence, formulated as a min-max (robust optimisation) problem
  • Certified defences — randomised smoothing gives a provable robustness radius, trading some clean accuracy for guarantees
  • Input transformation / purification — denoise or use a diffusion model to project inputs back to the data manifold before inference
  • Detection — flag inputs whose statistics look adversarial
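The min-max view of adversarial training can be made concrete: an inner loop maximises the loss over perturbations δ with ‖δ‖∞ ≤ ε, and an outer loop minimises the loss at those worst-case points. The sketch below assumes a hypothetical logistic-regression model and a tiny hand-made dataset (`DATA`), both chosen only so the example runs with the standard library. The data are separable with a wide margin, so the code shows the mechanics of the loop rather than a measured robustness gain.

```python
import math

def sigmoid(z):
    z = max(-60.0, min(60.0, z))  # clamp to avoid overflow in math.exp
    return 1.0 / (1.0 + math.exp(-z))

def logit(w, b, x):
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def sign(v):
    return (v > 0) - (v < 0)

def pgd(w, b, x, y, eps, step, iters):
    """Inner maximisation: L-infinity PGD against the current model."""
    x_adv = list(x)
    for _ in range(iters):
        p = sigmoid(logit(w, b, x_adv))
        grad_x = [(p - y) * wi for wi in w]  # closed form for logistic regression
        x_adv = [xa + step * sign(g) for xa, g in zip(x_adv, grad_x)]
        x_adv = [min(max(xa, x0 - eps), x0 + eps) for xa, x0 in zip(x_adv, x)]
    return x_adv

def adversarial_train(data, eps, epochs=200, lr=0.1):
    """Outer minimisation: SGD on the loss evaluated at PGD examples."""
    w, b = [0.0] * len(data[0][0]), 0.0
    for _ in range(epochs):
        for x, y in data:
            x_adv = pgd(w, b, x, y, eps, step=eps / 4, iters=8)
            p = sigmoid(logit(w, b, x_adv))
            w = [wi - lr * (p - y) * xa for wi, xa in zip(w, x_adv)]
            b -= lr * (p - y)
    return w, b

def robust_accuracy(w, b, data, eps):
    """Fraction of points still classified correctly after a PGD attack."""
    correct = 0
    for x, y in data:
        x_adv = pgd(w, b, x, y, eps, step=eps / 4, iters=8)
        correct += int((sigmoid(logit(w, b, x_adv)) >= 0.5) == bool(y))
    return correct / len(data)

# Hypothetical toy data: class 1 near (1, 1), class 0 near (-1, -1).
DATA = [([0.8, 1.0], 1), ([-0.8, -1.0], 0), ([1.0, 0.8], 1), ([-1.0, -0.8], 0),
        ([1.2, 1.1], 1), ([-1.2, -1.1], 0), ([0.9, 1.3], 1), ([-0.9, -1.3], 0)]

w, b = adversarial_train(DATA, eps=0.3)
print("robust accuracy at eps=0.3:", robust_accuracy(w, b, DATA, eps=0.3))
```

The cost noted for this defence is visible in the structure: every weight update requires a full multi-step attack first, so training time scales with the number of PGD iterations.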

Table 1. Defence approaches and trade-offs

| Defence                | Guarantee          | Cost                                     |
|------------------------|--------------------|------------------------------------------|
| Adversarial training   | Empirical          | Expensive training, lower clean accuracy |
| Randomised smoothing   | Certified radius   | Many forward passes at inference         |
| Diffusion purification | Empirical          | Heavy inference compute                  |
| Detection              | None (filter only) | Can be evaded by adaptive attacks        |
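The inference cost of randomised smoothing comes from its Monte Carlo vote: the base classifier is run on many Gaussian-noised copies of the input, and the certified L2 radius is σ·Φ⁻¹(p), where p is a lower confidence bound on the top class's vote share. The sketch below makes two loudly stated simplifications: the base classifier is a hypothetical linear rule, and the lower bound uses a one-sided Hoeffding inequality in place of the Clopper-Pearson interval used by Cohen et al., which keeps the code inside the standard library at the price of a more conservative radius.

```python
import math
import random
from statistics import NormalDist

def noisy_counts(classify, x, sigma, n, rng):
    """Run the base classifier on n Gaussian-noised copies of x and count labels."""
    counts = {}
    for _ in range(n):
        noisy = [xi + rng.gauss(0.0, sigma) for xi in x]
        label = classify(noisy)
        counts[label] = counts.get(label, 0) + 1
    return counts

def certify(classify, x, sigma, n0=100, n=1000, alpha=0.001, seed=0):
    """Return (label, certified L2 radius), or (None, 0.0) to abstain."""
    rng = random.Random(seed)
    # Stage 1: a small sample picks the candidate class.
    guess_counts = noisy_counts(classify, x, sigma, n0, rng)
    guess = max(guess_counts, key=guess_counts.get)
    # Stage 2: a fresh, larger sample estimates that class's vote share.
    counts = noisy_counts(classify, x, sigma, n, rng)
    p_hat = counts.get(guess, 0) / n
    # Hoeffding lower bound (a stand-in for Clopper-Pearson), valid with prob. >= 1 - alpha.
    p_lower = p_hat - math.sqrt(math.log(1.0 / alpha) / (2.0 * n))
    if p_lower <= 0.5:
        return None, 0.0  # not confident enough to certify
    return guess, sigma * NormalDist().inv_cdf(p_lower)

def base_classifier(x):
    """Hypothetical base model: a fixed linear decision rule."""
    return int(x[0] + x[1] > 0)

label, radius = certify(base_classifier, [1.0, 1.0], sigma=0.5)
print("label:", label, "certified L2 radius:", radius)
```

Each call makes `n0 + n` forward passes for a single input. A point close to the decision boundary, such as `[0.0, 0.0]` here, produces a vote share near one half and the procedure abstains rather than certify.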

Critical caveat: Many published defences were later broken because they caused gradient masking rather than true robustness. Always evaluate against adaptive attacks designed with full knowledge of the defence.
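A toy case shows how gradient masking can look like robustness. The sketch below assumes a hypothetical "defence" that quantises inputs before a logistic-regression model. The weights, the quantisation step `Q` and the use of finite differences as a stand-in for autodiff are all illustrative choices. A naive gradient attack through the defence sees a zero gradient and does nothing, while an adaptive attacker who treats the quantiser as the identity when computing gradients (in the spirit of backward-pass approximation attacks) flips the prediction with the same ε.

```python
import math

W, B = [2.0, -3.0, 1.0], 0.1  # hypothetical model weights
Q = 0.25                      # quantisation step of the toy defence

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def quantise(x):
    """Toy defence: snap every coordinate to the nearest multiple of Q."""
    return [math.floor(v / Q + 0.5) * Q for v in x]

def model_proba(x):
    return sigmoid(sum(wi * xi for wi, xi in zip(W, x)) + B)

def defended_proba(x):
    return model_proba(quantise(x))

def loss(proba_fn, x, y):
    p = proba_fn(x)
    return -math.log(p if y == 1 else 1.0 - p)

def numeric_gradient(proba_fn, x, y, h=1e-4):
    """Central finite differences, standing in for an autodiff gradient."""
    grad = []
    for i in range(len(x)):
        hi, lo = list(x), list(x)
        hi[i] += h
        lo[i] -= h
        grad.append((loss(proba_fn, hi, y) - loss(proba_fn, lo, y)) / (2 * h))
    return grad

def sign(v):
    return (v > 0) - (v < 0)

def fgsm(x, grad, eps):
    return [xi + eps * sign(g) for xi, g in zip(x, grad)]

x, y, eps = [0.2, -0.1, 0.3], 1, 0.3

# Naive attack: differentiate through the quantiser, which is flat almost everywhere.
naive = fgsm(x, numeric_gradient(defended_proba, x, y), eps)
# Adaptive attack: use the undefended model's gradient, then query the defended model.
adaptive = fgsm(x, numeric_gradient(model_proba, x, y), eps)

print("defended p(y=1), naive attack:   ", defended_proba(naive))
print("defended p(y=1), adaptive attack:", defended_proba(adaptive))
```

Reporting only the naive result would suggest the defended model is robust at this ε; the adaptive result shows that the gradient was hidden, not that the decision boundary moved.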

Applications

  • Robust perception for autonomous vehicles and traffic-sign recognition
  • Malware and spam classifiers facing evasive adversaries
  • Biometric and content-moderation systems

References & further reading

  1. Goodfellow et al., “Explaining and Harnessing Adversarial Examples,” ICLR 2015.
  2. Madry et al., “Towards Deep Learning Models Resistant to Adversarial Attacks,” ICLR 2018. (PGD)
  3. Cohen et al., “Certified Adversarial Robustness via Randomized Smoothing,” ICML 2019.