AI system tracks leaked ID images online in real time


CO-EDP, VisionRI | Updated: 02-06-2025 08:57 IST | Created: 02-06-2025 08:57 IST

In a critical step toward mitigating online data breaches, researchers at National Taiwan University have introduced a novel artificial intelligence-based system designed to detect sensitive image leaks already circulating on the internet. The peer-reviewed study, titled “An Automatic Sensitive Image Search System with Generative Artificial Intelligence to Identify Data Leaks on Internet”, was published in the journal Electronics.

Unlike conventional Data Loss Prevention (DLP) solutions that focus on blocking internal leaks through file transfers or email surveillance, this new framework emphasizes post-leakage detection. Leveraging generative AI (GenAI), image-based search, and optical character recognition (OCR), the system proactively scans the internet for images of documents that expose personally identifiable information (PII), such as passports, ID cards, and financial documents. A local large language model (LLM) then delivers remediation guidance, so the platform not only detects leaks but also tells operators how to act on them.

What makes this system different from traditional data protection tools?

Most enterprise-grade DLP solutions are reactive and confined to internal networks, often failing to identify confidential images already disseminated on public platforms. The study highlights that modern risks stem from images posted online, whether by accident or malicious intent, through social media, misconfigured cloud storage, or unsecured collaboration tools.

To bridge this gap, the proposed system introduces a GenAI-powered pipeline composed of six integrated modules: input selection, synthetic image generation, search and filtering, sensitive data recognition, risk detection and marking, and report generation. In the first stage, users define the document type and country of origin, such as a Taiwanese driver’s license or a UK ID card. Using prompt-based synthesis, the system generates a photorealistic but artificial replica of the document using standardized layouts and placeholder personal data.
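The six-module flow described above can be pictured as a simple chain of stages. The following Python sketch is only an illustration of that ordering: the stage names come from the article, while the Context fields and the stub stage bodies are hypothetical placeholders, not the researchers' implementation.

```python
from dataclasses import dataclass, field
from typing import Callable

# Stage names follow the six modules listed in the article; everything else
# (the Context fields, the stub bodies) is a hypothetical placeholder.
STAGES = [
    "input_selection",
    "synthetic_image_generation",
    "search_and_filtering",
    "sensitive_data_recognition",
    "risk_detection_and_marking",
    "report_generation",
]


@dataclass
class Context:
    document_type: str
    country: str
    log: list[str] = field(default_factory=list)


def make_stub(name: str) -> Callable[[Context], Context]:
    """Build a placeholder stage that only records that it ran."""
    def stage(ctx: Context) -> Context:
        ctx.log.append(name)
        return ctx
    return stage


def run_pipeline(ctx: Context) -> Context:
    """Run every stage in order, handing the shared context along."""
    for name in STAGES:
        ctx = make_stub(name)(ctx)
    return ctx


result = run_pipeline(Context("driver's license", "Taiwan"))
print(" -> ".join(result.log))
```

In a real system each stub would be replaced by the corresponding module, with the context carrying the synthetic image, search hits, and OCR output from one stage to the next.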

This synthetic image is then deployed in large-scale image-based web searches using Selenium and OpenCV tools. To increase the precision of match filtering, the system employs BRISK feature descriptors combined with a dynamic masking algorithm that isolates structurally relevant document regions. This approach outperforms common feature matching strategies and significantly reduces false positives, as confirmed through real-world experiments and algorithmic benchmarks.
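BRISK produces binary descriptors, which are typically compared by Hamming distance. The sketch below does not run BRISK itself. It only illustrates, on made-up toy data, how matching binary descriptors inside a masked region might be scored. The 16-bit descriptors, the rectangular mask, and the `max_distance` threshold are all assumptions chosen for readability, and the study's dynamic masking algorithm is more elaborate than a fixed rectangle.

```python
# Hypothetical toy data: each keypoint is (x, y, descriptor), where the
# descriptor is a short bit pattern stored as an int. 16 bits keeps the
# example readable; real binary descriptors are much longer.
Keypoint = tuple[int, int, int]
Rect = tuple[int, int, int, int]  # (x0, y0, x1, y1)


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two binary descriptors."""
    return bin(a ^ b).count("1")


def in_mask(kp: Keypoint, mask: Rect) -> bool:
    """True if the keypoint lies inside the rectangle."""
    x, y, _ = kp
    x0, y0, x1, y1 = mask
    return x0 <= x < x1 and y0 <= y < y1


def match_score(template: list[Keypoint], candidate: list[Keypoint],
                mask: Rect, max_distance: int = 3) -> float:
    """Fraction of masked template keypoints with a close match in candidate."""
    kept = [kp for kp in template if in_mask(kp, mask)]
    if not kept or not candidate:
        return 0.0
    hits = sum(
        1 for kp in kept
        if min(hamming(kp[2], c[2]) for c in candidate) <= max_distance
    )
    return hits / len(kept)


# Two keypoints on the fixed layout, one on the holder-specific photo area.
template = [(10, 10, 0b1010101010101010),
            (50, 20, 0b1111000011110000),
            (200, 150, 0b0000111100001111)]
leak = [(12, 11, 0b1010101010101011),
        (48, 22, 0b1111000011110001),
        (190, 140, 0b0101010101010101)]
other = [(5, 5, 0b0101010101010101)]

LAYOUT_MASK = (0, 0, 100, 100)   # keep only the structural region
FULL_IMAGE = (0, 0, 300, 300)    # no masking

print(match_score(template, leak, LAYOUT_MASK))
print(match_score(template, leak, FULL_IMAGE))
print(match_score(template, other, LAYOUT_MASK))
```

With the mask, the leaked document scores 1.0 because only the fixed layout is compared. Without it, the holder-specific region drags the score down to 2/3, and the unrelated image scores 0.0 either way.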

How accurate is the detection and what risks have been uncovered?

System testing revealed robust detection capabilities. In one evaluation involving 610 publicly available web images, the system successfully identified 27 out of 29 actual leaked images with no false positives, yielding a precision rate of 100% and a recall rate of 93.1%. The two false negatives were traced back to OCR misinterpretations, such as confusing the letter “O” with the digit “0.” Once corrected through character-variant post-processing, accuracy improved further.
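The reported figures follow directly from the counts: 27 true positives, 2 missed leaks, and no false alarms. The Python sketch below reproduces that arithmetic and then shows one way character-variant post-processing might work. The confusable-character table and the ID format (one uppercase letter followed by nine digits) are assumptions for illustration, not details taken from the study.

```python
import re
from itertools import product


def precision(tp: int, fp: int) -> float:
    return tp / (tp + fp)


def recall(tp: int, fn: int) -> float:
    return tp / (tp + fn)


# Counts from the article: 27 of 29 leaked images found, no false positives.
print(f"precision = {precision(27, 0):.1%}")
print(f"recall    = {recall(27, 2):.1%}")

# Hypothetical table of characters that OCR engines commonly confuse.
CONFUSABLE = {"O": "0", "0": "O", "I": "1", "1": "I"}

# Assumed ID format for the example: one uppercase letter, then nine digits.
ID_PATTERN = re.compile(r"[A-Z][0-9]{9}")


def variants(text: str):
    """Yield every string obtained by swapping confusable characters."""
    options = [(ch, CONFUSABLE[ch]) if ch in CONFUSABLE else (ch,)
               for ch in text]
    return ("".join(combo) for combo in product(*options))


def recover_id(ocr_text: str):
    """Return the first variant that fits the ID format, or None."""
    for candidate in variants(ocr_text):
        if ID_PATTERN.fullmatch(candidate):
            return candidate
    return None


# OCR read the final digit 0 as the letter O.
print(recover_id("A12345678O"))
```

The number of variants doubles with each confusable character, so a production version would likely cap the search or apply substitutions only at positions where the format demands a digit or a letter.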

One of the most alarming case studies involved driver’s license data exposed through websites with broken access control mechanisms. In these instances, users could alter URL parameters, some of them encoded as outdated MD5 hashes, and retrieve other individuals’ sensitive information without authentication. The GenAI-generated report flagged these cases as high-risk, recommending immediate content takedown, access control audits, and formal notifications to authorities such as Taiwan’s Personal Data Protection Commission (PDPC).
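Hashing a URL parameter does not protect a record when the underlying value is predictable. The article does not say how the affected sites derived their parameters, so the following Python sketch assumes a hypothetical scheme in which the parameter is simply the MD5 of a sequential record ID. It shows why such a token should be treated as guessable during an access control audit.

```python
import hashlib


def record_token(record_id: int) -> str:
    """Hypothetical scheme: the URL parameter is the MD5 of a sequential ID."""
    return hashlib.md5(str(record_id).encode()).hexdigest()


def is_guessable(token: str, id_range: range) -> bool:
    """True if the token can be reproduced from an ID in the given range."""
    return any(record_token(i) == token for i in id_range)


# MD5 is deterministic and unkeyed, so anyone can recompute the same token.
print(record_token(1))

# A token for record 4242 is found by trying a small range of IDs.
print(is_guessable(record_token(4242), range(10_000)))
```

The durable fix is a server-side authorization check on every request, which is consistent with the access control audits the report recommends. An unpredictable identifier alone only makes enumeration harder.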

Another performance test using the IDNet dataset of 17,937 identity document images further validated the recognition engine. The system achieved a 100% precision rate and 99.7% recall across ID cards from Estonia, Finland, and Spain. This indicates strong robustness even under real-world imbalances where sensitive leaks are rare occurrences.

What are the implementation challenges and future directions?

While the tool shows impressive accuracy, the study acknowledges key challenges to real-world deployment. The GenAI-based image generation and LLM-powered report modules are the most computationally demanding components, particularly when run locally. Benchmark tests showed that on a mid-range Intel-NVIDIA setup, each image required approximately 17.3 seconds for GenAI generation and 9.2 seconds for report compilation. Despite optimizations, this still poses scalability concerns for enterprise-level use where millions of images may need real-time analysis.

To address these bottlenecks, the researchers suggest future improvements such as integrating retrieval-augmented generation (RAG) to accelerate report production and deploying latent diffusion models for lighter image synthesis. Additionally, performance could be enhanced through parallelization using GPU clusters and Kubernetes-based orchestration, especially in cloud-native enterprise environments.
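A back-of-the-envelope calculation shows why the per-image timings matter at scale. The Python sketch below uses the two benchmark figures reported in the study. The workload of one million images and the assumption that parallel workers scale perfectly linearly are illustrative simplifications, not results from the paper.

```python
# Per-image timings reported in the article for a mid-range setup.
GEN_SECONDS = 17.3     # GenAI image generation
REPORT_SECONDS = 9.2   # LLM report compilation


def total_hours(images: int, workers: int = 1) -> float:
    """Wall-clock hours, assuming ideal linear scaling across workers."""
    per_image = GEN_SECONDS + REPORT_SECONDS
    return images * per_image / workers / 3600


for workers in (1, 100):
    hours = total_hours(1_000_000, workers)
    print(f"{workers:>3} worker(s): {hours:,.1f} hours ({hours / 24:,.1f} days)")
```

Under these assumptions, a single machine would need roughly 307 days for one million images, while 100 ideal workers would bring that down to about three days. Real clusters rarely scale perfectly, so actual figures would be somewhat worse.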

Robustness testing remains an area for expansion. While the current system performs well on frontal, clearly printed documents, it needs further training on rotated, skewed, or low-resolution images to extend its generalizability. The study also notes that although BRISK performs well in common leak formats, it may falter under heavy visual transformations, warranting exploration of alternative or ensemble descriptors.

The final layer of defense, according to the authors, lies in human-centered security practices. Even with state-of-the-art AI, cybersecurity awareness, employee training, and procedural safeguards must coexist to create a holistic risk mitigation framework.

  • FIRST PUBLISHED IN: Devdiscourse