Key challenges blocking global rollout of facial recognition payment

A new study explores the current state of face recognition payment (FRP) systems, shedding light on groundbreaking technological progress and persistent vulnerabilities that could hinder widespread adoption. Published in the journal Information under the title "Advancing Secure Face Recognition Payment Systems: A Systematic Literature Review", the study critically evaluates the effectiveness, strengths, and limitations of current algorithmic innovations in FRP, especially in real-world environments.
The study draws upon a systematic literature review of 219 articles, narrowing down to 10 highly relevant studies using the PRISMA methodology. These core studies reveal key advances in face anti-spoofing, the deployment of deep learning and transformer-based models, and the challenges posed by environmental variables and cross-dataset limitations.
What are the key technological advancements driving face recognition payment security?
Over the past five years, the field has witnessed transformative innovations across three primary fronts: deep learning, multimodal feature integration, and transformer-based architectures.
Deep learning has emerged as the dominant driver of anti-spoofing accuracy. Techniques like MobileFaceNet integrated with Coordinate Attention achieved remarkable accuracy of 1.39% Average Classification Error Rate (ACER), while models such as Heterogeneous Kernel Convolutional Neural Networks demonstrated nearly perfect recognition accuracy, exceeding 99% on benchmark datasets. These models not only increased detection precision but also optimized processing speed, crucial for real-time transaction environments.
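For readers unfamiliar with the headline metric, ACER is conventionally defined (per ISO/IEC 30107-3) as the mean of two error rates: the rate at which attacks are accepted as genuine (APCER) and the rate at which genuine faces are rejected as spoofs (BPCER). A minimal sketch, using illustrative rates rather than the study's actual component figures:

```python
def acer(apcer: float, bpcer: float) -> float:
    """Average Classification Error Rate: the mean of the Attack
    Presentation Classification Error Rate (attacks accepted as
    genuine) and the Bona fide Presentation Classification Error
    Rate (genuine faces rejected as spoofs)."""
    return (apcer + bpcer) / 2.0

# Illustrative: 2% of attacks accepted, 0.78% of genuine faces rejected
print(acer(0.02, 0.0078))  # -> 0.0139, i.e. 1.39% ACER
```

A low ACER therefore requires a model to be strong on both sides at once: a detector that blocks every spoof but frequently rejects legitimate customers still scores poorly.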
Multimodal systems further elevated the security threshold by combining RGB with infrared (IR) and depth information. This fusion allowed systems to differentiate real faces from presentation attacks more effectively. For instance, Vision Transformer (ViT) models paired with Adaptive Multimodal Adapters and Modality-Asymmetric Masked Autoencoders (M2A2E) yielded ACER values as low as 2.12% on the CASIA-SURF dataset, indicating high reliability under variable input conditions.
Meanwhile, transformer-based architectures introduced advanced training strategies such as the Dynamic Feature Queue and Progressive Training Strategy. These mechanisms allowed models to generalize more robustly to unseen attack types and environmental distortions. In particular, systems using such frameworks performed well on challenging datasets like SuHiFiMask, with Area Under Curve (AUC) scores of 98.38%, underscoring improved resilience to complex threats.
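The AUC figure cited above summarizes detection quality across all decision thresholds. One standard way to compute it is as the probability that a randomly chosen genuine sample outscores a randomly chosen attack (the Mann-Whitney formulation). A minimal sketch with made-up scores, not data from the study:

```python
def auc(genuine_scores, attack_scores):
    """Area Under the ROC Curve, computed as the probability that a
    randomly chosen genuine sample scores higher than a randomly
    chosen attack sample (ties count as half a win)."""
    wins = 0.0
    for g in genuine_scores:
        for a in attack_scores:
            if g > a:
                wins += 1.0
            elif g == a:
                wins += 0.5
    return wins / (len(genuine_scores) * len(attack_scores))

# 8 of 9 genuine/attack pairs are correctly ordered
print(auc([0.9, 0.8, 0.7], [0.3, 0.4, 0.75]))  # -> 0.888...
```

Because AUC is threshold-free, a score like 98.38% indicates good score separation overall, but it does not by itself fix the operating point a deployed terminal would actually use.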
How effective are current anti-spoofing mechanisms against emerging threats?
Despite notable technical progress, the study underscores an uneven landscape in anti-spoofing effectiveness, especially when confronting evolving attack vectors like deepfakes and complex real-world scenarios.
Legacy threats, such as print, replay, and 3D mask attacks, are now largely manageable under controlled conditions. Techniques employing Remote Photoplethysmography (rPPG) and contextual patch-based CNNs achieved 0% Equal Error Rate on benchmark datasets, signaling maturity in handling traditional spoofing attempts.
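The Equal Error Rate quoted here is the error level at the threshold where false accepts and false rejects balance; a 0% EER means some threshold separates genuine and spoof samples perfectly. A simple threshold-sweep sketch, again on invented scores:

```python
def eer(genuine_scores, attack_scores):
    """Equal Error Rate: sweep candidate thresholds and return the
    error rate at the point where the false-accept rate (attacks
    scoring at or above the threshold) is closest to the
    false-reject rate (genuine samples scoring below it)."""
    best = None
    for t in sorted(set(genuine_scores) | set(attack_scores)):
        far = sum(a >= t for a in attack_scores) / len(attack_scores)
        frr = sum(g < t for g in genuine_scores) / len(genuine_scores)
        if best is None or abs(far - frr) < best[0]:
            best = (abs(far - frr), (far + frr) / 2)
    return best[1]

# Perfectly separated scores -> 0% EER, as in the benchmark results above
print(eer([0.9, 0.8], [0.1, 0.2]))  # -> 0.0
```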
However, real-world performance tells a different story. Deepfake detection, for example, continues to experience a significant drop in accuracy when image quality is degraded or when models are tested on low-resolution or compressed data. One model dropped from 97.12% accuracy on high-quality input to 91.26% on lower-quality images, revealing vulnerability to compression artifacts and real-time streaming inconsistencies.
The cross-dataset generalization challenge presents a critical bottleneck. Models that perform well on their training dataset frequently falter when exposed to unfamiliar data. A Siamese network-based model, while achieving competitive results on its native dataset, showed a sharp performance decline when evaluated on the CASIA-FASD and Replay-Attack datasets, recording Half Total Error Rates of 23.9% and 38.0%, respectively. Such steep degradation demonstrates that current algorithms are not yet robust enough to function reliably across the demographic and environmental diversity typical of global deployment scenarios.
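The HTER figures make the generalization problem concrete: in cross-dataset protocols the decision threshold is tuned on the source dataset and then applied unchanged to the target dataset, so any shift in score distributions shows up directly as error. A minimal sketch with hypothetical scores:

```python
def hter(genuine_scores, attack_scores, threshold):
    """Half Total Error Rate at a fixed decision threshold: the mean
    of the false-accept rate (attacks at or above the threshold) and
    the false-reject rate (genuine samples below it). In cross-dataset
    evaluation the threshold comes from the source dataset, which is
    why HTER degrades sharply under domain shift."""
    far = sum(a >= threshold for a in attack_scores) / len(attack_scores)
    frr = sum(g < threshold for g in genuine_scores) / len(genuine_scores)
    return (far + frr) / 2.0

# A threshold of 0.5 tuned elsewhere, applied to shifted target scores
print(hter([0.7, 0.6, 0.4], [0.55, 0.3, 0.45], 0.5))  # -> 0.333...
```

HTERs of 23.9% and 38.0% mean roughly a quarter to over a third of decisions are wrong at the transferred threshold, far beyond what any payment system could tolerate.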
What are the major obstacles to deploying FRP systems in real-world applications?
The study identifies three core challenges threatening the real-world viability of FRP systems: environmental instability, computational inefficiency, and inadequate evaluation frameworks.
Environmental instability includes lighting variations, facial pose changes, image resolution inconsistencies, and occlusions, all of which dramatically reduce model performance. These factors are particularly problematic in retail or public environments where controlled lighting or user cooperation cannot be guaranteed.
Computational limitations further constrain scalability. While high-performance transformer-based models deliver strong results, they often require computational resources that exceed the capacity of embedded systems used in point-of-sale terminals. Some advanced systems demand hardware well beyond the power envelope of mobile or edge devices, making them impractical for large-scale, low-cost deployment. Although some lightweight models, like those based on MobileFaceNet, offer faster processing times (as low as 45 milliseconds), this often comes with a trade-off in detection accuracy or robustness.
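Latency figures like the 45-millisecond number above are typically obtained by timing repeated inference calls on the target hardware. A crude wall-clock harness, with `model_fn` standing in for a hypothetical anti-spoofing model (real benchmarks would also warm up the model and report percentiles, not just the mean):

```python
import time

def measure_latency_ms(model_fn, frame, runs=100):
    """Average per-call wall-clock latency in milliseconds for a
    hypothetical inference function applied to one input frame."""
    start = time.perf_counter()
    for _ in range(runs):
        model_fn(frame)
    return (time.perf_counter() - start) / runs * 1000.0
```

On an embedded point-of-sale terminal, this number must stay within the transaction budget under worst-case load, which is exactly where heavyweight transformer models fall short.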
A third and frequently overlooked issue is the lack of standardized evaluation metrics and datasets. Research currently employs a fragmented array of performance indicators (ACER, EER, AUC, HTER, and TPR), making inter-study comparison difficult. Furthermore, most models are trained and tested on outdated or homogeneous datasets like CASIA-FASD or Replay-Attack, which do not reflect modern attack strategies or global facial diversity. The absence of payment-specific benchmarks and unified performance measures hampers meaningful progress and obscures real-world readiness.
Closing the gap between laboratory success and commercial viability
The review makes it clear that while academic advances have significantly pushed the frontier of FRP capabilities, real-world deployment is hindered by unresolved technical issues. The gap between lab-grade performance and field-level reliability remains wide, with cross-dataset generalization and adaptive threat detection standing out as priority research areas.
The study calls for a more holistic approach to model development that simultaneously addresses performance, efficiency, and adaptability. This includes establishing standardized metrics and real-world payment datasets that reflect environmental, demographic, and device-level complexities. Moreover, efforts must pivot toward designing algorithms that can sustain accuracy with reduced computational demand, ensuring that performance is not sacrificed for portability.
FIRST PUBLISHED IN: Devdiscourse