AGVBench stress-tests 30 augmentations for vein biometrics
TL;DR
- AGVBench evaluates 30 augmentation methods across 5 vein datasets and 7 backbones on six axes, not just top-1 accuracy.
- Mixing methods such as MixUp post the strongest clean numbers, but under a PGD attack on TJU600 MixUp collapses to 4.87% accuracy while LabelSmoothing holds 70.37%.
- Natural-image staples like Flip and Rotate degrade vein recognition, and composed stacks outperform any single method.
Vein recognition is quietly deployed in ATMs, phones and access-control gates, and it lives or dies on how well the model generalises past its training set. A new benchmark called AGVBench, posted to Hugging Face, takes a long look at the data-augmentation half of that story and argues that the usual scoring lets some methods look far safer than they actually are.
The authors evaluate 30 augmentation strategies across 5 public palm- and finger-vein datasets and 7 backbone architectures, spanning classic CNNs, vision transformers and vein-specific models. Instead of stopping at top-1 accuracy, they score every combination along six dimensions: recognition, calibration, corruption robustness, adversarial robustness, occlusion, and efficiency. That is where the story gets interesting.
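To make the calibration axis concrete: this summary does not say which metric AGVBench uses, so the sketch below assumes expected calibration error (ECE), a common choice. It bins predictions by confidence and measures the gap between average confidence and actual accuracy in each bin. The function name and the bin count are illustrative, not taken from the paper.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average gap between confidence and accuracy per bin.

    confidences: top-class probability for each prediction, in [0, 1].
    correct:     True/False for each prediction.
    n_bins:      number of equal-width confidence bins (assumed value).
    """
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 goes into the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(avg_conf - accuracy)
    return ece


# Toy example: the model claims 90% confidence but is right 3 times out of 4.
toy_ece = expected_calibration_error(
    [0.9, 0.9, 0.9, 0.9], [True, True, True, False]
)
```

A model can raise top-1 accuracy and still score badly here if its confidence drifts away from how often it is actually right, which is the kind of gap an accuracy-only leaderboard hides.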
Multi-image mixing methods (MixUp, PuzzleMix, StarMixup) post the strongest clean numbers: PuzzleMix, for example, lifts ResNet18 on VERA220 from a vanilla 71.45% to 95.55%. But those same methods are, in the paper's words, "poorly calibrated and vulnerable to adversarial perturbations." Under a PGD attack on TJU600, MixUp drops to 4.87% accuracy while LabelSmoothing holds 70.37%. Simple geometric augmentations borrowed from natural-image pipelines, like Flip and Rotate, actively degrade the models by disturbing the vein topology.
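For readers who have not met the two methods being contrasted, here is a minimal sketch of how MixUp and label smoothing build training targets in their textbook form. It is not the AGVBench code: the helper names, the smoothing factor of 0.1 and the Beta(1, 1) mixing distribution are all assumptions chosen for illustration.

```python
import random


def one_hot(label, num_classes):
    """Hard target: all probability mass on the true class."""
    return [1.0 if i == label else 0.0 for i in range(num_classes)]


def smooth_labels(label, num_classes, eps=0.1):
    """Label smoothing: keep 1 - eps on the true class and spread
    eps uniformly over all classes (eps=0.1 is an assumed value)."""
    off = eps / num_classes
    target = [off] * num_classes
    target[label] += 1.0 - eps
    return target


def mixup(x_a, y_a, x_b, y_b, lam):
    """MixUp: blend two inputs and their target vectors with weight lam."""
    x = [lam * a + (1.0 - lam) * b for a, b in zip(x_a, x_b)]
    y = [lam * a + (1.0 - lam) * b for a, b in zip(y_a, y_b)]
    return x, y


def sample_lambda(alpha=1.0, rng=random):
    """Draw the mixing weight from Beta(alpha, alpha) (alpha is assumed)."""
    return rng.betavariate(alpha, alpha)


# Two toy "images" flattened to four pixels, belonging to classes 0 and 2.
img_a = [0.0, 0.2, 0.9, 0.4]
img_b = [1.0, 0.8, 0.1, 0.6]
mixed_x, mixed_y = mixup(img_a, one_hot(0, 3), img_b, one_hot(2, 3), lam=0.75)
smoothed = smooth_labels(1, 4, eps=0.1)
```

MixUp changes both the pixels and the target, so the network trains on blended vein patterns that no sensor would ever capture. Label smoothing leaves the image untouched and only softens the target.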
The honest caveat is that a benchmark still has to pick datasets and threat models, and this one leans on 5 public sets rather than production sensor feeds. The paper does not audit demographic skew, and its adversarial regime is limited to FGSM and PGD at ε = 0.2/255. The reporting also gives no read on how these rankings translate to on-device inference on a phone or an ATM sensor.
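To show what that threat model means in practice, the sketch below runs FGSM and an L-infinity PGD attack against a toy logistic-regression classifier rather than a vein backbone. Only ε = 0.2/255 comes from the reporting. The model, its weights, the step size and the iteration count are hypothetical, and the attack starts from the clean input with no random restart.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def bce_loss(x, y, w, b):
    """Binary cross-entropy of a logistic-regression model on one sample."""
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
    p = min(max(p, 1e-12), 1.0 - 1e-12)  # avoid log(0)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


def input_gradient(x, y, w, b):
    """Gradient of the loss with respect to the input pixels."""
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
    return [(p - y) * wi for wi in w]


def sign(v):
    return (v > 0) - (v < 0)


def pgd_attack(x, y, w, b, eps, step, n_steps):
    """Repeated signed-gradient ascent, projected back into the
    eps-ball around x and the valid pixel range [0, 1]."""
    x_adv = list(x)
    for _ in range(n_steps):
        grad = input_gradient(x_adv, y, w, b)
        x_adv = [xa + step * sign(g) for xa, g in zip(x_adv, grad)]
        x_adv = [min(max(xa, x0 - eps), x0 + eps) for xa, x0 in zip(x_adv, x)]
        x_adv = [min(max(xa, 0.0), 1.0) for xa in x_adv]
    return x_adv


def fgsm_attack(x, y, w, b, eps):
    """FGSM is the single-step case: one move of size eps."""
    return pgd_attack(x, y, w, b, eps, step=eps, n_steps=1)


eps = 0.2 / 255                 # perturbation budget quoted in the article
x_clean = [0.5, 0.5, 0.5]       # hypothetical three-pixel input
weights, bias, label = [2.0, -1.0, 0.5], 0.0, 1
x_pgd = pgd_attack(x_clean, label, weights, bias, eps, step=eps / 4, n_steps=10)
x_fgsm = fgsm_attack(x_clean, label, weights, bias, eps)
```

Each pixel can move by at most 0.2/255 of its range, a small budget, so the rankings say nothing about how the methods hold up against stronger or differently shaped attacks.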
The forward-looking piece, and the part that matters if you build these systems, is that composed stacks beat any single method. AutoAugment + PuzzleMix + LabelSmoothing lifts VERA220 from 80.82% with AutoAugment alone to 98.00%, and the LabelSmoothing family is the one to reach for when adversarial robustness is a hard requirement. The code is released, so teams working on fingerprint, iris or periocular systems can extend the protocol rather than rebuilding it.
Originally reported by huggingface.co
Original headline: HF Paper 'AGVBench': Reliability-Oriented Benchmark of Data Augmentation for Vein Recognition