AGVBench stress-tests 30 augmentations for vein biometrics

TL;DR

  • AGVBench evaluates 30 augmentation methods across 5 vein datasets and 7 backbones on six axes, not just top-1 accuracy.
  • Multi-image mixing methods such as MixUp post the strongest clean numbers, but MixUp collapses to 4.87% accuracy under a PGD attack on TJU600, while LabelSmoothing holds 70.37%.
  • Natural-image staples like Flip and Rotate degrade vein recognition, and composed stacks outperform any single method.

Vein recognition is quietly deployed in ATMs, phones and access-control gates, and it lives or dies on how well the model generalises past its training set. A new benchmark called AGVBench, posted to Hugging Face, takes a long look at the data-augmentation half of that story and argues that the usual scoring lets some methods look far safer than they actually are.

The authors evaluate 30 augmentation strategies across 5 public palm- and finger-vein datasets and 7 backbone architectures, spanning classic CNNs, vision transformers and vein-specific models. Instead of stopping at top-1 accuracy, they score every combination along six dimensions: recognition, calibration, corruption robustness, adversarial robustness, occlusion robustness, and efficiency. The rankings diverge on the axes beyond recognition.
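To make the calibration axis concrete, here is a minimal sketch of Expected Calibration Error (ECE), a common calibration metric. Whether AGVBench uses ECE specifically is an assumption, since the write-up does not name the metric, and the function name and the 10-bin default are illustrative choices.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted mean of |accuracy - confidence| over equal-width confidence bins.

    confidences: top-class probabilities in [0, 1], one per prediction.
    correct:     1 if that prediction was right, else 0.
    NOTE: metric choice and bin count are assumptions, not taken from the paper.
    """
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        # A confidence of exactly 1.0 falls into the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, hit))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(h for _, h in members) / len(members)
        # Each bin contributes in proportion to how many predictions it holds.
        ece += (len(members) / n) * abs(accuracy - avg_conf)
    return ece
```

A model that reports 0.9 confidence on four predictions but gets only three right has a gap of 0.15 in that bin. A method can therefore score well on accuracy and still score poorly on this axis.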

Multi-image mixing methods (MixUp, PuzzleMix, StarMixup) post the strongest clean numbers, pushing ResNet18 on VERA220 from a vanilla 71.45% up to 95.55% with PuzzleMix. But those same methods are, in the paper's words, "poorly calibrated and vulnerable to adversarial perturbations." Under a PGD attack on TJU600, MixUp drops to 4.87% accuracy while LabelSmoothing holds 70.37%. Simple geometric augmentations borrowed from natural-image pipelines, like Flip and Rotate, actively degrade the models by disturbing the vein topology.
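The two methods contrasted above shape the training targets in different ways, which the following sketch illustrates in plain Python. It follows the standard textbook definitions of MixUp and label smoothing rather than the AGVBench code; the function names, the flattened-list image representation, and the defaults `alpha=1.0` and `smoothing=0.1` are assumptions for illustration.

```python
import random


def one_hot(label, num_classes):
    """Hard target: 1.0 for the true identity, 0.0 elsewhere."""
    return [1.0 if k == label else 0.0 for k in range(num_classes)]


def smooth_labels(label, num_classes, smoothing=0.1):
    """Label smoothing: keep (1 - s) on the one-hot target, spread s uniformly."""
    uniform = smoothing / num_classes
    return [(1.0 - smoothing) * t + uniform for t in one_hot(label, num_classes)]


def mixup(x_a, y_a, x_b, y_b, alpha=1.0, rng=random):
    """MixUp: convex combination of two samples and of their target vectors.

    x_a, x_b: flattened images as lists of floats (assumed representation).
    y_a, y_b: target vectors, e.g. from one_hot().
    """
    lam = rng.betavariate(alpha, alpha)  # mixing weight drawn from Beta(alpha, alpha)
    x = [lam * a + (1.0 - lam) * b for a, b in zip(x_a, x_b)]
    y = [lam * a + (1.0 - lam) * b for a, b in zip(y_a, y_b)]
    return x, y, lam
```

Label smoothing leaves the image untouched and only softens the target, whereas MixUp blends pixels from two identities and trains against a target split between them.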

The honest caveat is that any benchmark has to pick datasets and threat models, and this one relies on 5 public sets rather than production sensor feeds. The paper does not audit demographic skew, and its adversarial regime is limited to FGSM and PGD at ε = 0.2/255. Nor does the reporting say how these rankings translate to on-device inference on a phone or an ATM sensor.
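For readers unfamiliar with the threat model, the sketch below shows an L∞ PGD attack at the ε = 0.2/255 budget quoted above. Only ε comes from the write-up: the step size, the iteration count, and the toy logistic-regression model standing in for a vein backbone are assumptions, and this is not the benchmark's attack code.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sign(v):
    return (v > 0) - (v < 0)


def bce_loss(x, y, w, b):
    """Binary cross-entropy of a logistic-regression model on one sample."""
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


def pgd_linf(x, y, w, b, eps=0.2 / 255, step=0.05 / 255, iters=10):
    """Projected gradient ascent on the loss inside an L-infinity ball of radius eps.

    step and iters are assumed values; the toy model is hypothetical.
    """
    x_adv = list(x)
    for _ in range(iters):
        p = sigmoid(sum(wi * xi for wi, xi in zip(w, x_adv)) + b)
        # Analytic gradient of the BCE loss with respect to the input: (p - y) * w.
        grad = [(p - y) * wi for wi in w]
        for i, g in enumerate(grad):
            xi = x_adv[i] + step * sign(g)                # ascend the loss
            xi = min(max(xi, x[i] - eps), x[i] + eps)     # project back into the eps-ball
            x_adv[i] = min(max(xi, 0.0), 1.0)             # keep pixels in [0, 1]
    return x_adv
```

FGSM corresponds to the single-step case, `iters=1` with `step=eps`. At this budget no pixel moves by more than 0.2/255 of the intensity range, which puts the reported accuracy drops in context.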

The forward-looking piece, and the part that matters if you build these systems, is that composed stacks beat any single method. AutoAugment + PuzzleMix + LabelSmoothing lifts VERA220 accuracy from 80.82% with AutoAugment alone to 98.00%. When adversarial robustness is a hard requirement, the LabelSmoothing family is the one to reach for. The code is released, so teams working on fingerprint, iris or periocular systems can extend the protocol rather than rebuilding it.