Next-gen traffic surveillance powered by AI: CNNs, transformers and MLLMs lead the charge

CO-EDP, VisionRI | Updated: 16-05-2025 18:20 IST | Created: 16-05-2025 18:20 IST

Cities worldwide are facing increasing pressure to manage traffic flows, enhance road safety, and adapt to real-time disruptions. Amid these pressures, artificial intelligence (AI) is emerging as a critical ally in the transformation of urban transportation systems. Intelligent surveillance and automated anomaly detection systems are being deployed across smart cities, tasked with identifying everything from minor infractions to life-threatening accidents, all in real time.

A newly published study titled “Innovative Approaches to Traffic Anomaly Detection and Classification Using AI” in the journal Applied Sciences, conducted by researchers from Universidad Carlos III de Madrid and the Technological Institute of Aragón, provides a sweeping review of AI-driven methods used in traffic anomaly detection. The paper compares cutting-edge techniques such as CNNs, GANs, Transformers, and multimodal large language models (MLLMs), highlighting their real-world applications, limitations, and future potential.

How is AI transforming traffic anomaly detection?

The study outlines the evolution from traditional surveillance to AI-enhanced systems capable of recognizing deviations in traffic behavior without human intervention. AI systems now detect anomalies like illegal parking, jaywalking, speeding, erratic driving, and accidents using complex visual and motion data inputs. These capabilities enable city traffic operators to respond faster to incidents, reduce human error, and maintain smoother traffic flows.

Machine learning, particularly unsupervised methods, has been foundational. Trajectory-based analysis using techniques like Dynamic Time Warping (DTW) and clustering allows detection of outlier vehicle movements. However, such methods depend heavily on feature selection and require substantial tuning.
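To make the idea concrete, here is a toy sketch (not the implementation reviewed in the paper) of trajectory-based outlier detection: pairwise DTW distances are computed between vehicle trajectories, and a trajectory whose average distance to the rest is unusually high is flagged. The trajectories, threshold, and z-score rule are illustrative assumptions.

```python
# Toy illustration of DTW-based trajectory outlier detection.
import numpy as np

def dtw_distance(a, b):
    """Classic dynamic-programming DTW distance between two 2-D trajectories."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return cost[n, m]

def flag_outliers(trajectories, z_thresh=1.0):
    """Flag trajectories whose mean DTW distance to the others is unusually high."""
    k = len(trajectories)
    mean_dist = np.array([
        np.mean([dtw_distance(trajectories[i], trajectories[j])
                 for j in range(k) if j != i])
        for i in range(k)
    ])
    z = (mean_dist - mean_dist.mean()) / mean_dist.std()
    return [i for i in range(k) if z[i] > z_thresh]

# Two vehicles follow a straight lane; a third swerves across it.
t = np.linspace(0, 1, 20)
normal1 = np.stack([t, np.zeros_like(t)], axis=1)
normal2 = np.stack([t, 0.05 * np.ones_like(t)], axis=1)
swerving = np.stack([t, np.sin(6 * t)], axis=1)
print(flag_outliers([normal1, normal2, swerving]))  # → [2]
```

As the study notes, the result here hinges on hand-picked features (x/y positions) and a tuned threshold, which is exactly the fragility that motivates deep-learning approaches.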

Convolutional Neural Networks (CNNs) have advanced detection accuracy by analyzing spatio-temporal features in video feeds. They are effective at identifying sudden changes in behavior, such as vehicles stopping in undesignated zones or unexpected congestion. Still, CNNs require extensive labeled datasets and face challenges in interpretability and adaptability.
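The kind of spatio-temporal cue a CNN layer learns can be sketched with a fixed, untrained filter: temporal frame differencing followed by spatial averaging yields a motion-energy signal that collapses when a vehicle stops unexpectedly. This is a hand-built stand-in for a learned convolution, not any architecture from the study.

```python
# Toy spatio-temporal feature: |frame difference| smoothed by a box filter.
import numpy as np

def motion_energy(frames, ksize=3):
    """Per-transition motion energy maps for a (T, H, W) frame stack."""
    diffs = np.abs(np.diff(frames.astype(float), axis=0))  # temporal gradient
    kernel = np.ones((ksize, ksize)) / ksize**2             # fixed spatial filter
    pad = ksize // 2
    padded = np.pad(diffs, ((0, 0), (pad, pad), (pad, pad)), mode="edge")
    h, w = diffs.shape[1:]
    out = np.zeros_like(diffs)
    for t_idx in range(diffs.shape[0]):
        for i in range(h):
            for j in range(w):
                out[t_idx, i, j] = (padded[t_idx, i:i + ksize, j:j + ksize] * kernel).sum()
    return out

# A bright "vehicle" blob advances, then stops dead after frame 5.
frames = np.zeros((10, 16, 16))
for t_idx in range(10):
    col = min(t_idx, 5) + 4                  # stops advancing after frame 5
    frames[t_idx, 7:9, col:col + 2] = 1.0
energy = motion_energy(frames).sum(axis=(1, 2))
print(energy)  # motion energy falls to 0 once the vehicle stops moving
```

A trained CNN replaces the fixed kernel with many learned filters, which is where the labeled-data requirement and the black-box interpretability problem come from.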

In contrast, GANs generate synthetic traffic scenarios to enhance anomaly detection. These models leverage a generator–discriminator architecture to identify unusual events based on frame prediction error. Although GANs excel at learning diverse behaviors, they are computationally expensive and difficult to stabilize during training.
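The scoring stage of such predictive models can be illustrated without training anything: given a predicted next frame and the actual one, a large prediction error (low PSNR) marks the transition as anomalous. Below, a trivial "last frame persists" predictor stands in for a trained generator; the frames and seed are made up for the demo.

```python
# Sketch of frame-prediction-error anomaly scoring (generator replaced by a
# naive "previous frame" predictor for illustration).
import numpy as np

def psnr(pred, actual, peak=1.0):
    """Peak signal-to-noise ratio in dB; lower PSNR = worse prediction."""
    mse = np.mean((pred - actual) ** 2)
    return np.inf if mse == 0 else 10 * np.log10(peak**2 / mse)

def anomaly_scores(frames):
    """Score each transition by how badly the naive predictor anticipated it."""
    return [psnr(frames[t], frames[t + 1]) for t in range(len(frames) - 1)]

rng = np.random.default_rng(0)
frames = [np.full((8, 8), 0.5) + rng.normal(0, 0.01, (8, 8)) for _ in range(6)]
frames[4] = rng.uniform(0, 1, (8, 8))                 # abrupt scene change
frames[5] = frames[4] + rng.normal(0, 0.01, (8, 8))   # change persists
scores = anomaly_scores(frames)
print(np.argmin(scores))  # → 3: the transition into the changed scene
```

In a real GAN-based detector the generator learns normal motion, so ordinary transitions predict well and only genuinely unusual events produce the low-PSNR spike shown here.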

What role do transformers and multimodal models play?

Transformer-based models, particularly those using attention mechanisms, have gained traction for their ability to understand long-term temporal dependencies in traffic video sequences. By processing entire scenes holistically, Transformers can identify patterns such as gradual congestion buildup or subtle lane violations. Methods such as TransAnomaly and CViT combine U-Net and Transformer structures for frame prediction and anomaly localization, showing notable improvements over conventional CNNs.
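The attention mechanism behind these models can be shown in a few lines. The sketch below (not TransAnomaly or CViT themselves) implements scaled dot-product self-attention over a sequence of per-frame feature vectors, letting every frame weigh its relation to every other frame in one step — the property that captures long-range temporal dependencies; the dimensions are arbitrary.

```python
# Minimal scaled dot-product self-attention over temporal frame features.
import numpy as np

def attention(q, k, v):
    """softmax(Q K^T / sqrt(d)) V over a (seq_len, d) feature sequence."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

rng = np.random.default_rng(1)
seq = rng.normal(size=(12, 8))       # 12 frames, 8-dim features per frame
out, w = attention(seq, seq, seq)    # each frame attends to all 12 frames
print(out.shape, w.shape)            # (12, 8) (12, 12)
```

Each row of the weight matrix sums to 1 and describes how much one frame draws on every other, which is why attention scales to gradual, scene-wide patterns that a local convolution misses.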

Three-stage frameworks that integrate feature extraction, segment-level detection, and video-level scoring are proving effective at reducing false positives while ensuring real-time responsiveness. Other approaches, such as RTFM (Robust Temporal Feature Magnitude learning), detect anomalies based on shifts in temporal feature magnitude.
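The core RTFM intuition — anomalous video segments tend to contain snippets with unusually large temporal feature magnitudes — can be sketched as a top-k magnitude score. The feature tensors and the factor of 3 below are fabricated for illustration, not taken from the paper.

```python
# Sketch of RTFM-style scoring: mean of the top-k snippet feature magnitudes.
import numpy as np

def rtfm_score(segment_features, k=3):
    """segment_features: (num_snippets, feat_dim) array of per-snippet features."""
    mags = np.linalg.norm(segment_features, axis=1)  # magnitude per snippet
    return np.sort(mags)[-k:].mean()                 # mean of the k largest

rng = np.random.default_rng(2)
normal_seg = rng.normal(0, 1.0, (16, 32))
anomalous_seg = normal_seg.copy()
anomalous_seg[5:8] *= 3.0   # a short burst of high-magnitude snippets
print(rtfm_score(anomalous_seg) > rtfm_score(normal_seg))  # → True
```

Scoring only the top-k snippets is what lets the method use weak video-level labels: a few strong snippets dominate the segment score even when most of the video is normal.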

Multimodal Large Language Models (MLLMs) mark a significant leap forward. These models integrate text, image, and video data to deliver richer contextual analysis. Tools like AccidentBlip2 use multi-view vehicle data to detect incidents, while AVACA combines audio and video streams to enhance recognition accuracy by over 5%. CityLLaVA and VisionGPT adapt LLMs for real-time scene comprehension, emphasizing applications in road hazard detection and accessibility.

MLLMs rely on sophisticated reasoning across data modalities and require prompt engineering and fine-tuning. While they offer superior detection in complex environments, their high computational costs and “hallucination” risks (generating false or ungrounded outputs) limit practical deployment in low-resource or real-time settings.

What are the current challenges and future directions?

Despite remarkable advancements, AI systems for traffic anomaly detection still face several limitations. The foremost challenges include:

  • Data dependency: Many models require vast amounts of labeled video data for effective training.
  • Computational burden: Especially for GANs and MLLMs, training and inference are resource-intensive.
  • Interpretability: Models often behave as black boxes, making it difficult for city officials to audit or explain decisions.
  • Generalization: Performance can degrade in low-light or occluded conditions, or when anomalies differ from training scenarios.

To address these, the study highlights new trends such as:

  • Model compression: Techniques to reduce computational load without sacrificing accuracy.
  • Semi-supervised and weakly supervised learning: Reduce labeling costs by using only video-level tags.
  • Explainable AI (XAI): Visualizations and score-based anomaly localization help build trust.
  • Multimodal integration: Combining text, image, video, and sensor data ensures robust interpretation of complex scenes.

The authors propose a roadmap prioritizing scalable, interpretable, and real-time systems adaptable to diverse environments. Integrating AI into broader smart city infrastructures, from traffic lights to emergency services, can create proactive urban mobility ecosystems capable of not just reacting to anomalies but preventing them.

  • FIRST PUBLISHED IN: Devdiscourse