From Spreadsheets to Insights: Can it be done with AI?

22 hours ago · 8 min read

Updated: 9 hours ago

Performance Comparison of Large Language Models in Processing Sensitive Statistical Data

Introduction

In an era where artificial intelligence is becoming deeply woven into the fabric of our society, we face a critical question: How reliable are AI models when handling sensitive social data? Today, I'm excited to share insights from a groundbreaking research project that put two leading AI models - ChatGPT version 4o and Claude AI 3.5 Sonnet - through their paces in analysing domestic abuse statistics from England and Wales. The results not only illuminate the capabilities of these models but also raise important questions about the future of AI in social research.

Why Does This Matter?

Before exploring the technical aspects, it's crucial to understand why this research matters so profoundly. Statistical accuracy in domestic abuse reporting transcends mere numbers - it represents real lives, real stories, and real consequences. When AI models analyse these statistics, their accuracy can have far-reaching implications, influencing policy decisions, resource allocation, and society's understanding of this critical issue. Misinterpretations or inconsistencies in AI outputs could lead to flawed conclusions, misallocated resources, or poorly informed policies.

In 2023, as government agencies, researchers, and nonprofit organisations increasingly turn to AI for data analysis, the stakes have never been higher. This isn't simply a technical evaluation - it's about ensuring we can trust AI to handle some of society's most pressing challenges with the sensitivity and accuracy they deserve.

The Research Setup: Putting AI to the Test

We designed our research methodology to be both rigorous and transparent. Here's exactly how we conducted the comparison:

Data Preparation
1. We obtained official statistical datasets from the Office for National Statistics (ONS)
2. These datasets covered domestic abuse and criminal justice system statistics, prevalence rates and victim characteristics, victim services data, and detailed stalking crime survey findings
3. The data was converted into PDF format to ensure both models had identical access to the information

Model Interaction Process
1. We used the official web interfaces for both ChatGPT-4o and Claude AI 3.5 Sonnet
2. The PDF files were uploaded as attachments to each model's dialogue interface
3. Each model received identical prompts for each analysis task
4. All responses were documented and compared against official ONS statistics

For readers less familiar with computational processes, think of this as giving both AI models the exact same textbook and asking them the exact same questions. We then compared their answers not only with each other but also with the official statistics.
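To make the comparison step concrete, here is a minimal Python sketch of how responses could be scored against reference figures. It is an illustration only, not the tooling used in this study: the `Answer` class, the `score` function, the model labels and the question keys are all hypothetical, and the `model_b` answers are invented placeholders. The two reference values are figures quoted later in this article.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Answer:
    question: str            # hypothetical question key
    model: str               # hypothetical model label
    reported: Optional[int]  # None when the model gave no figure


def score(answers, reference):
    """Return {model: share of questions answered with the exact reference figure}."""
    totals, hits = {}, {}
    for a in answers:
        totals[a.model] = totals.get(a.model, 0) + 1
        if a.reported is not None and a.reported == reference[a.question]:
            hits[a.model] = hits.get(a.model, 0) + 1
    return {m: hits.get(m, 0) / totals[m] for m in totals}


# Reference figures as quoted in this article.
reference = {"crimes_recorded": 889_918, "suspects_referred": 69_314}

# Illustrative answers only: model_b's values are made up for the example.
answers = [
    Answer("crimes_recorded", "model_a", 889_918),
    Answer("suspects_referred", "model_a", 69_314),
    Answer("crimes_recorded", "model_b", 900_000),
    Answer("suspects_referred", "model_b", None),
]

print(score(answers, reference))
```

Exact matching is deliberately strict here. A rounded but correct answer would count as a miss, so a real evaluation would probably want a separate category for that case.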

The Results: A Tale of Two AIs

Basic Statistical Interpretation

Prompt: "How many people experienced domestic abuse in England and Wales for the year ending March 2023?"

Prompt: "What was the total number of domestic abuse victims, and how was this distributed by gender?"

The differences between the models became apparent even in basic statistical analysis. When asked about domestic abuse victims in England and Wales for the year ending March 2023:

Figure 1: ONS Statistics on Domestic Abuse Victims, 2023 - Source: Office for National Statistics

Figure 2: ONS Statistics on Estimated victims of domestic abuse by sex - Source: Office for National Statistics

Claude AI demonstrated remarkable precision, providing the exact figure of 2,124,000 victims, further breaking this down into 1,377,000 women and 751,000 men. This level of detail showed a sophisticated understanding of the importance of gender-specific data in domestic abuse statistics.

ChatGPT, while technically accurate, took a more generalised approach, reporting "approximately 2.1 million people." While this rounded figure was accessible, it omitted crucial demographic details vital for policymakers and researchers.
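Two quick arithmetic checks help when comparing answers like these: whether a rounded figure is consistent with the exact one, and whether a breakdown adds up to its total. The sketch below uses the figures quoted above. The function names are our own, chosen for illustration.

```python
def matches_when_rounded(exact, claimed_millions, decimals=1):
    """True if `exact`, expressed in millions and rounded, equals the claimed value."""
    return round(exact / 1_000_000, decimals) == round(claimed_millions, decimals)


def breakdown_gap(total, parts):
    """Difference between the sum of the parts and the stated total (0 = consistent)."""
    return sum(parts) - total


total = 2_124_000               # exact figure quoted above
women, men = 1_377_000, 751_000  # breakdown quoted above

print(matches_when_rounded(total, 2.1))    # is "approximately 2.1 million" consistent?
print(breakdown_gap(total, [women, men]))  # does the breakdown sum to the total?
```

With the figures quoted above, the rounded answer is consistent with the exact total, while the two sex-specific estimates sum to 2,128,000 rather than 2,124,000. That gap is worth checking against the ONS table before the numbers are reused.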

Crime Statistics Analysis

Prompt: "How many domestic abuse-related crimes were recorded by the police in the year ending March 2023?"

The divergence between the models became even more pronounced when handling complex crime statistics. A particularly revealing example emerged when analysing police-recorded domestic abuse-related crimes:

Figure 5: ONS statistics on police-recorded domestic abuse-related crimes. Source: Office for National Statistics

Claude AI consistently reported 889,918 crimes, always noting the exclusion of Devon and Cornwall data - a crucial detail that could significantly impact regional policy decisions. The model maintained this precision across multiple queries, demonstrating remarkable consistency in data handling.

Figure 6: Claude AI response on police-recorded domestic abuse-related crimes.

ChatGPT's responses showed more variability, sometimes including Devon and Cornwall's data and sometimes not, highlighting a potential challenge in maintaining consistent data parameters across multiple queries.

Figure 7: ChatGPT’s response on police-recorded domestic abuse-related crimes.
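Consistency across repeated queries can be summarised with a simple measure: the share of runs that return the most common figure. The sketch below is one possible way to do this, not the procedure used in this study. The `consistency` function is our own, and the second list of runs contains an invented alternative figure purely for illustration.

```python
from collections import Counter


def consistency(responses):
    """Return (most common figure, share of runs that returned it)."""
    counts = Counter(responses)
    value, freq = counts.most_common(1)[0]
    return value, freq / len(responses)


# Five runs that always return the figure quoted above.
stable_runs = [889_918] * 5

# Hypothetical runs: 912_000 is a made-up placeholder for an answer
# computed on a different data scope.
unstable_runs = [889_918, 912_000, 889_918, 912_000, 889_918]

print(consistency(stable_runs))
print(consistency(unstable_runs))
```

A share of 1.0 means every run agreed. Anything lower suggests the model is switching between data scopes, such as including or excluding a police force, and the answers need to be checked by hand.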

Incident Reporting and Prosecution Data

Prompt: "Provide the number of domestic abuse-related incidents and their regional distribution."

The models' handling of incident reporting revealed further distinctions:

Claude AI reported 563,949 domestic abuse-related incidents, carefully noting the exclusion of Devon and Cornwall due to data supply issues. The model provided valuable context about regional variations and reporting patterns, helping users understand broader systemic trends.

Figure 8: Claude AI response on the number of domestic abuse-related incidents and their regional distribution.

ChatGPT reported 690,264 incidents - a figure that appeared to combine different data points, potentially leading to misinterpretation of the actual situation. Its output therefore required cross-referencing against the source data.

Figure 9: ChatGPT response on the number of domestic abuse-related incidents and their regional distribution.

Prosecution and Conviction Analysis

Prompt: "What were the prosecution and conviction rates for domestic abuse cases?"

The difference in analytical capability became even more apparent in handling prosecution data:

Claude AI precisely identified 51,288 completed prosecutions, breaking this down into 39,198 convictions (76.4%) and 12,090 non-convictions (23.6%). It also placed the figure in historical context, showing a decline in prosecutions over the past decades.

Figure 10: Claude AI response on prosecution and conviction rate
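The percentages in Claude AI's breakdown can be reproduced directly from the counts, which is a useful habit whenever a model reports both. A short sketch using the figures quoted above (the `rate` helper is our own name):

```python
def rate(part, whole, decimals=1):
    """Percentage of `whole` represented by `part`, rounded to `decimals` places."""
    return round(100 * part / whole, decimals)


completed = 51_288        # completed prosecutions
convictions = 39_198
non_convictions = 12_090

# The two outcomes should account for every completed prosecution.
assert convictions + non_convictions == completed

print(rate(convictions, completed))      # conviction rate
print(rate(non_convictions, completed))  # non-conviction rate
```

Both results agree with the reported 76.4% and 23.6%.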

ChatGPT struggled to pinpoint exact figures, often providing estimated ranges and requiring additional guidance for dataset navigation.

Figure 11: ChatGPT’s response on prosecution and conviction rate

Domestic Abuse-Related Offenders Analysis

Prompt: "How many domestic abuse-related offenders were convicted in England and Wales in the year ending March 2023?"

A similar pattern emerged when the models were asked how many domestic abuse-related offenders were convicted in England and Wales in the year ending March 2023:

Figure 12: ONS statistics on domestic abuse-related crimes - Source: Office for National Statistics

Claude AI delivered a more accurate and reliable response, showcasing its capability to navigate complex datasets effectively.

Figure 13: Claude AI’s response on domestic abuse-related offenders convicted

ChatGPT could not locate the exact number of domestic abuse-related offenders convicted.

Figure 14: ChatGPT’s response on domestic abuse-related offenders

Suspect Referrals and Charging Decisions

Prompt: "How many suspects were referred for domestic abuse-related offences, and what were the charging outcomes?"

The differences became even more pronounced when examining suspect referrals and charging decisions:

Figure 15: ONS statistics on Domestic abuse crime suspects referred to the Crown Prosecution Service - Source: Office for National Statistics

In response to questions about suspect referrals, Claude AI demonstrated remarkable precision by reporting exactly 69,314 suspects referred to the Crown Prosecution Service (CPS) for charging decisions in 2022/23.

Figure 16: Claude’s response to domestic abuse crime suspects

ChatGPT, however, struggled to locate this specific figure and instead provided general commentary on dataset limitations.

Figure 17: ChatGPT’s response to domestic abuse crime suspects

When analysing charging decisions (suspects of domestic abuse-related crimes who were charged), Claude AI again demonstrated superior accuracy and clarity in extracting numerical data.

Figure 18: ONS statistics on suspects of domestic abuse-related crimes that were charged - Source: Office for National Statistics

Claude AI maintained its precision, reporting that 47,361 suspects were charged, representing 76.5% of legal decisions made. The model enhanced this data with valuable contextual insights about prosecution patterns.

Figure 19: Claude AI’s response to suspects of domestic abuse-related crimes that were charged
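Note that the 76.5% is a share of legal decisions made, not of the 69,314 suspects referred. The short calculation below, based only on the figures quoted above, shows why the distinction matters. Because 76.5% is itself rounded, the implied number of legal decisions is approximate.

```python
charged = 47_361       # suspects charged
referred = 69_314      # suspects referred to the CPS
charged_share = 0.765  # reported share of legal decisions (rounded)

# Number of legal decisions implied by the reported share (approximate).
implied_decisions = round(charged / charged_share)

# Share of all referrals that ended in a charge, for comparison.
share_of_referrals = round(100 * charged / referred, 1)

print(implied_decisions)
print(share_of_referrals)
```

The implied base is roughly 61,900 legal decisions, and charges amount to about 68.3% of referrals. Quoting 76.5% against the referral count would therefore overstate the charging rate.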

ChatGPT again faced challenges in pinpointing specific figures, reverting to general observations and speculative commentary.

Figure 20: ChatGPT’s response to suspects of domestic abuse-related crimes that were charged

Temporal Analysis: Understanding Year-Over-Year Changes

Prompt: "How did domestic abuse statistics change compared to the previous year?"

The models demonstrated markedly different approaches when analysing temporal trends, for instance when we asked about changes in police-recorded domestic abuse-related crimes compared to the previous year.

Claude AI provided a sophisticated multi-layered analysis of year-over-year changes. The model not only extracted precise figures from the dataset but also delivered a comprehensive regional breakdown, showcasing percentage changes across different geographic areas. Importantly, Claude AI's analysis emphasised that while regional variations existed, the overall change was not statistically significant, indicating stability in the recorded crime volume. This detailed interpretation demonstrated an advanced understanding of statistical significance in social data analysis.

Figure 21: Claude’s response to temporal analysis

ChatGPT took a notably different approach, offering a concise summary that acknowledged the lack of statistical significance in the decrease. However, its analysis lacked the detailed regional breakdown and comprehensive temporal comparisons that characterised Claude AI's response. While ChatGPT's summary was accurate, it missed the opportunity to provide valuable insights about regional variations and trends.

Figure 22: ChatGPT’s response to temporal analysis

Regional Analysis

Prompt: "Please rank police forces based on their percentage of cases charged or summonsed."

When asked to rank police forces by the percentage of cases charged or summonsed, the two models diverged significantly:

Figure 23: ONS statistics on regional analysis - Source: Office for National Statistics

Claude AI provided comprehensive rankings of police forces and maintained consistent geographic comparisons. The model's ability to handle complex regional data rivalled that of experienced statisticians.

Figure 24: Claude’s response to Regional analysis

ChatGPT excelled at providing general trends and accessible summaries but sometimes struggled with specific regional rankings. It produced inconsistent results, with discrepancies between the figures it reported and those in the dataset, and it ranked some forces inaccurately.

Figure 25: ChatGPT’s response to Regional analysis
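A ranking produced by a model can be checked mechanically by rebuilding the expected order from the source percentages and listing the positions where the two disagree. The sketch below illustrates the idea. The force names and percentages are invented placeholders, not ONS values, and the function names are our own.

```python
def rank_forces(rates):
    """Return force names ordered from highest to lowest percentage charged."""
    ordered = sorted(rates.items(), key=lambda item: item[1], reverse=True)
    return [name for name, _ in ordered]


def rank_mismatches(expected, produced):
    """Return (position, expected force, produced force) wherever the rankings differ."""
    return [
        (position, exp, got)
        for position, (exp, got) in enumerate(zip(expected, produced), start=1)
        if exp != got
    ]


# Hypothetical percentages charged or summonsed, for illustration only.
rates = {"Force A": 7.2, "Force B": 9.1, "Force C": 5.4, "Force D": 8.0}

expected = rank_forces(rates)

# Hypothetical model output with two forces swapped.
model_ranking = ["Force B", "Force A", "Force D", "Force C"]

print(expected)
print(rank_mismatches(expected, model_ranking))
```

An empty mismatch list means the model's order matches the data. Any entries point to the exact positions that need a manual look.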

Practical Implications and Best Practices

Strategic Model Selection

Our research has revealed that understanding and leveraging each model's unique strengths is crucial for effective data analysis. Claude AI has demonstrated particular excellence in precise numerical analysis, detailed statistical breakdowns, and complex comparative studies, making it an ideal choice for detailed quantitative research.

Meanwhile, ChatGPT shows remarkable strength in creating accessible summaries and communicating general trends, making it valuable for broader communication purposes. For optimal results, organisations should consider combining both approaches, using Claude AI for detailed analysis and ChatGPT for making that analysis more accessible to broader audiences.

Data Verification Protocol

To ensure the highest standards of accuracy and reliability, organisations should implement a comprehensive verification system that encompasses multiple layers of validation. This begins with cross-referencing AI analyses against traditional statistical methods to ensure consistency and accuracy. Results should be compared across different AI models to identify potential discrepancies or biases. Crucial statistics must be independently verified by human experts, particularly when dealing with sensitive social data. Regular accuracy checks should be conducted throughout the analysis process, while context verification protocols ensure that AI interpretations align with real-world situations. Finally, all AI conclusions should undergo critical evaluation by subject matter experts before being accepted.
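Part of this protocol can be automated as a first pass before human review. The sketch below is one possible approach under our own assumptions: the `review_flags` function and the model labels are hypothetical, and the example uses the two incident counts reported earlier in this article. It flags missing answers, disagreement between models, and departures from a reference figure where one is available.

```python
def review_flags(figures_by_model, reference=None, tolerance=0):
    """Return a list of reasons a statistic needs closer human review.

    figures_by_model maps a model label to the figure it reported,
    or to None if it reported nothing.
    """
    reasons = []

    missing = sorted(m for m, v in figures_by_model.items() if v is None)
    if missing:
        reasons.append("no figure from: " + ", ".join(missing))

    values = [v for v in figures_by_model.values() if v is not None]
    if values and max(values) - min(values) > tolerance:
        reasons.append("models disagree")

    if reference is not None and any(abs(v - reference) > tolerance for v in values):
        reasons.append("differs from reference")

    return reasons


# Incident counts reported by the two models earlier in this article.
incidents = {"model_a": 563_949, "model_b": 690_264}
print(review_flags(incidents))
```

An empty list only means that no automatic flag was raised. It does not replace the expert verification described above, which should still apply to every crucial statistic.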

Looking Towards the Future

As we continue to advance in this field, several crucial areas demand our attention and research focus. We need to develop standardised testing protocols specifically designed for evaluating how AI handles sensitive social data. This goes hand in hand with creating new accuracy measures that are specifically tailored to social statistics, considering their unique complexities. There's also a pressing need to deepen our understanding of how AI models interpret social context, ensuring their analysis remains relevant and appropriate. Additionally, we must explore and enhance AI capabilities in handling intersectional data, recognising that social issues often involve multiple overlapping factors and demographics.

The Path Forward

While both models demonstrated impressive capabilities, they also revealed clear limitations. Claude AI showed superior performance in handling statistical data: it extracted precise figures, stayed consistent across complex queries and multiple datasets, and managed detailed datasets with minimal errors. However, despite being highly accurate, its responses can sometimes be verbose.

ChatGPT, while sometimes struggling with precise numerical data, excelled at creating accessible summaries and communicating general trends.

The key takeaway isn't about choosing one model over another, but rather understanding how to leverage their respective strengths while maintaining appropriate human oversight.

As we continue to integrate AI into sensitive data analysis, the focus should be on using these tools thoughtfully and responsibly.

As we stand at this intersection of technology and social research, your experiences and perspectives matter. Have you worked with AI models in analysing sensitive data? What challenges and successes have you encountered? Share your thoughts and experiences in the comments below.

Note: This analysis was conducted using publicly available data from the Office for National Statistics (ONS). For the most current domestic abuse statistics and support resources, please visit official government websites or contact relevant support organisations.