Gender-based violence remains hidden crisis due to broken data systems
A major new study published in Women, titled “Enhancing Gender-Based Violence Research: Holistic Approaches to Data Collection and Analysis”, highlights the critical shortcomings and untapped opportunities in the way gender-based violence (GBV) data is collected, analyzed, and used globally. Authored by researchers from London Metropolitan University, the study explores a wide range of challenges in sourcing, handling, and ethically managing GBV datasets while offering robust solutions grounded in technological innovation and methodological rigor.
As GBV continues to be one of the most underreported and poorly documented human rights issues worldwide, the authors emphasize that addressing this data deficit is key to driving effective policy, victim support, and long-term social change.
What are the main obstacles in collecting gender-based violence data?
The study identifies pervasive under-reporting, social stigma, and data fragmentation as foundational barriers to GBV research. Victims, often fearing reprisal or social alienation, are reluctant to report abuse, particularly when the perpetrator is a partner or family member. In many countries, administrative data such as police or hospital records tend to be high-level and lack critical details about victims and perpetrators. These limitations restrict researchers’ ability to identify trends or design targeted interventions.
To assess the scope and quality of available data, the researchers analyzed multiple sources including police records, publicly available online datasets, and news reports. Many of these sources, while well-intentioned, were either incomplete, regionally narrow, or heavily abstracted. In some instances, datasets contained large volumes of null or missing entries. Law enforcement data, though routinely collected, was primarily intended for internal documentation rather than research purposes and lacked the necessary detail for robust analysis.
Attempts to augment limited datasets using synthetic data generation were also explored. By replicating data structures and inferring new variables like the nature of perpetrator relationships or incident severity, the team sought to simulate missing details. However, synthetic data proved insufficient in capturing the complexity and variability of real-world GBV cases. Concerns about accuracy and the potential for misleading patterns led the researchers to discard synthetic data as a primary research source.
Similarly, social media platforms and news websites were examined for potential data extraction through web scraping. Although they offered access to high-volume content and real-time narratives, the lack of structured metadata, inconsistencies in terminology, and tendency toward sensationalism undermined their reliability as standalone sources for scientific research.
Which dataset offers the most comprehensive basis for cross-country GBV analysis?
After exhaustive evaluation, the researchers selected the USAID-supported Demographic and Health Surveys (DHS) dataset as the most appropriate foundation for their analysis. The DHS offers structured, large-scale, and nationally representative data from over 90 countries, including detailed information on domestic violence, family structures, education, and employment.
Even with its advantages, the DHS dataset presented notable hurdles. It was initially supplied in formats incompatible with many common analytics tools, requiring conversion and cleanup before processing. The vast number of variables and inconsistent survey years across countries introduced additional complexity. The researchers addressed these challenges by limiting their analysis to the most recent survey phases (7 and 8), spanning the years 2015 to 2022. They implemented stratified sampling to mitigate overrepresentation by countries with disproportionately large datasets, ensuring a balanced comparative analysis.
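The article does not spell out the exact sampling scheme, so the following is only a minimal Python sketch of one common form of stratified sampling: treat each country as a stratum and cap the number of respondents drawn from it, so that very large national surveys cannot dominate the pooled data. The function name, the field names, the cap of 100, and the toy records are all hypothetical and are not taken from the study.

```python
import random
from collections import Counter, defaultdict


def stratified_sample(records, stratum_key, per_stratum, seed=0):
    """Draw up to `per_stratum` records at random from each stratum."""
    rng = random.Random(seed)  # fixed seed so the draw is reproducible

    # Group the records by stratum (here: by country).
    by_stratum = defaultdict(list)
    for record in records:
        by_stratum[record[stratum_key]].append(record)

    # Sample without replacement inside each stratum, capped at per_stratum.
    sample = []
    for stratum in sorted(by_stratum):
        group = by_stratum[stratum]
        sample.extend(rng.sample(group, min(per_stratum, len(group))))
    return sample


# Hypothetical toy data: country "A" has far more respondents than "B" or "C".
records = (
    [{"country": "A", "respondent_id": i} for i in range(1000)]
    + [{"country": "B", "respondent_id": i} for i in range(120)]
    + [{"country": "C", "respondent_id": i} for i in range(40)]
)

balanced = stratified_sample(records, "country", per_stratum=100)
counts = Counter(r["country"] for r in balanced)
print(dict(counts))  # {'A': 100, 'B': 100, 'C': 40}
```

A fixed cap is only one option. Proportional allocation or survey weights would be alternatives, and the article does not say which the researchers chose.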
The final dataset included data from 19 countries and over 96,000 respondents. Key variables were grouped into thematic modules and reduced to 64 core columns through a rigorous selection and coding process. Although the DHS covers a wide swath of low- and middle-income countries, data from European countries were excluded due to incompatibility in structure and outdated timeframes.
To manage the scale of the dataset and ensure compliance with privacy standards, researchers used tools such as IBM SPSS and applied statistical disclosure controls to protect anonymity. This involved aggregating data into broader demographic categories, ensuring individual identities could not be reverse-engineered from the final results.
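As an illustration of this kind of aggregation, the Python sketch below groups exact ages into broad bands and reports only counts per country and band. The age bands, field names, and toy data are hypothetical. The minimum cell size used to suppress small groups is a common disclosure-control practice added here as an assumption; the article does not say the researchers applied it.

```python
from collections import Counter

# Hypothetical age bands; the study's actual categories are not given.
AGE_BANDS = [(15, 24), (25, 34), (35, 49)]


def age_band(age):
    """Map an exact age to a broad band label."""
    for low, high in AGE_BANDS:
        if low <= age <= high:
            return f"{low}-{high}"
    return "other"


def aggregate_counts(records, min_cell_size=5):
    """Count respondents per (country, age band).

    Cells smaller than `min_cell_size` are replaced with None so that
    very small groups are not published (an assumed safeguard).
    """
    counts = Counter((r["country"], age_band(r["age"])) for r in records)
    return {
        cell: (n if n >= min_cell_size else None)
        for cell, n in counts.items()
    }


# Hypothetical toy data: six younger respondents, two older ones.
records = [{"country": "A", "age": a} for a in (16, 18, 20, 22, 23, 24)]
records += [{"country": "A", "age": a} for a in (40, 41)]

table = aggregate_counts(records)
print(table[("A", "15-24")])  # 6
print(table[("A", "35-49")])  # None (suppressed: only 2 respondents)
```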
How can ethical and technological strategies improve GBV research outcomes?
The study emphasizes that improving GBV research requires a combination of advanced technology, ethical sensitivity, and consistent methodological standards. The research team used Python, Power BI, and cloud platforms to clean, visualize, and analyze the data. These tools enabled the construction of interactive dashboards and data models to explore relationships between variables such as education level, marital status, and attitudes toward violence.
Among the key innovations was the use of network graphing to examine how different factors interact in cases of intimate partner violence (IPV). By mapping out connections between variables, such as history of abuse, attitudes toward female subordination, and prior exposure to violence, the study revealed how certain demographics are disproportionately affected. This level of analysis enables policymakers to better identify at-risk populations and develop targeted prevention strategies.
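The article does not describe how the network graphs were built. One simple way to approximate the idea, sketched below in Python, is a co-occurrence network: each factor is a node, and the weight of an edge counts how many respondents report both factors. The factor names and the toy respondents are hypothetical, and this is not presented as the authors' method.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_edges(respondents):
    """Return weighted edges: (factor_a, factor_b) -> number of respondents
    who report both factors."""
    edges = Counter()
    for factors in respondents:
        # Sort so each pair is always stored in the same order.
        for a, b in combinations(sorted(set(factors)), 2):
            edges[(a, b)] += 1
    return edges


# Hypothetical toy data: each set holds the factors one respondent reports.
respondents = [
    {"history_of_abuse", "prior_exposure_to_violence"},
    {"history_of_abuse", "prior_exposure_to_violence", "accepts_subordination"},
    {"accepts_subordination"},
    {"history_of_abuse", "accepts_subordination"},
]

edges = cooccurrence_edges(respondents)
for (a, b), weight in edges.most_common():
    print(f"{a} -- {b}: {weight}")
```

Heavier edges point to factors that tend to appear together. The resulting edge list could then be loaded into a graph library or a dashboard tool for visualization.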
The researchers also explored the feasibility of integrating generative artificial intelligence and large language models into GBV data collection. These tools could support more efficient and cost-effective surveying by automating elements of the data gathering process and enabling victims to share experiences anonymously and safely. However, the study cautions that such technologies must be employed with strict ethical oversight to avoid reinforcing biases or compromising the privacy of respondents.
Another pressing issue raised in the study is the lack of standardization across global GBV data sources. Differences in how incidents are defined, coded, or disaggregated by gender, age, or perpetrator type make cross-country comparisons extremely difficult. The authors argue for the development of global indicators and uniform data registration practices that can bridge these disparities and enhance research comparability.
The study further calls attention to underexplored populations, such as older women, women with disabilities, and Indigenous communities, who are often overlooked in mainstream GBV research. Remote data collection methods, while offering accessibility, pose new challenges around informed consent, referral pathways, and digital literacy that must be addressed through better-designed ethical protocols.
First published in: Devdiscourse