Tamara Polajnar

Predicting Risk: An example with COMPAS data

Updated: May 13

Many forces in the UK are looking to standardise their risk assessment of high-harm individuals. Some are turning to predictive methods. Here we showcase a worked example of RUDI for risk prediction using COMPAS data.


NOTE: This article is an example, designed for educational purposes, of how case history data is being used.


In this document, we outline an example of a machine learning model and fill out parts of the RUDI framework as a guideline. The full framework guide is available online as a webpage or as a PDF, and this example is available in full here:



The code is available on GitHub.


Please contact us with any questions or for help with working through the RUDI framework.


Introduction


In England and Wales there have been several models that predict the future risk of suspects, for example HART, used by Durham Constabulary, or the Met algorithm that forecasts domestic abuse (DA) homicides. In essence, all of these models do the same thing: predict potential future harm based on past histories. There is a drive to automate risk prediction for serial and violent DA offenders, and while some forces use rankings based on manually developed measures of recency, frequency, gravity, and victim numbers, others are considering ML-based predictive methods.


This article has three purposes:

  1. To show that the bit most people think is magical, the machine learning bit, can be very quick to do.

  2. To demonstrate that the important parts are the ones around the ML part: process analysis, data unification and selection, model integration into policing processes, and continued maintenance of the models.

  3. To give an example of a filled-out RUDI document.


So in this article we model the prediction of high risk, where high risk is defined as a serious crime committed within one year of the current arrest. This is just one sort of proxy measure for likelihood of future escalation.


We use the COMPAS dataset because it is a publicly available dataset of criminal histories, obtained through a freedom of information request by a US organisation called ProPublica. COMPAS is an automated tool for predicting recidivism within two years for the purpose of determining sentence recommendations, and it is actively used by judges in the US. The dataset contains the criminal histories of people at the point of sentencing, along with the scores that were assigned by the COMPAS system.


ProPublica was concerned that the algorithm was unfairly biased against the African American population, but there was also a wider problem to consider: unfettered access to poorly understood automated risk scoring. It is important to note that just because it is possible to train a model to predict potential for future harm with some degree of accuracy, it may not be ethical to implement such a model, even with the best attempts to introduce fairness during training and to mitigate underlying biases in the data.


The most important aspect of fairness is how a model is deployed, monitored, and used by people who understand the context. All models have an incomplete understanding of the problem and are trained on incomplete data. They therefore embody the biases of all the previous data, whether used for training or pre-training, and of the procedures used to select that data. We must hold these limitations in mind when using a model to aid decision making that affects human lives.



Estimating future risk requires careful consideration

Conceptualisation


The key step in planning a solution is to identify the actual underlying problem:

  • Is the problem a low conviction rate?

  • Is the problem one of effective disruption of criminal activity?

  • Is the problem one of capacity?

  • Will identifying high-risk offenders automatically but with moderate accuracy help (meaning that some people will be missed or suspects unlikely to be high harm will be suggested), given that manual processing of the suggested individuals will still be required?


If you are trying to improve conviction rates or design better interventions then maybe identifying high-risk individuals is not the right solution and there might be other operational improvements that would yield better results.


Will ranking suffice or is prediction more accurate?

The difference between ranking and prediction is that ranking relies on current knowledge of a person's criminal history to order people, while prediction aims to assess who is most likely to commit dangerous crimes in the future. That is why some people feel more comfortable with ranking, while prediction invokes Minority Report.


With ranking we can order people according to their current history, see who is currently doing the most damage, and make sure they are being managed appropriately; however, the lines often blur when people start to theorise about what makes a criminal higher risk. And what does high risk actually mean? It usually means that we think someone is likely to escalate in the near future. So when designing ranking algorithms, people debate what indicates higher risk, unaware that they are manually veering into prediction without applying machine learning.


Similarly, predictive models try to learn the salient factors that lead to high-harm recidivism, as defined by some manual measure of high risk. Both ranking and prediction produce some sort of semi-opaque risk score, so one might be preferable to the other because of higher interpretability. Ultimately, whether one method is better than the other is empirically testable through simulation of performance in the operating environment.


Conceptualisation of our example

We use the COMPAS dataset to mimic the high-risk prediction task, which is close to the original purpose of the dataset, but the data is a sample taken at the point of court sentencing. This means it may not have the natural distribution of all arrest data in a police database, and it is skewed towards people who have been charged and successfully prosecuted. The underlying data will also have different statistical properties from data from England and Wales, with regard to label distributions and racial biases.


We would always recommend that any type of automated risk assessment is made available only to a small group of well-informed individuals who have the ability to look at each case as a whole and discuss the next steps, ideally a multi-agency task force. That is why, in this scenario, we say that the model is used to periodically assess the latest arrests and recommend individuals to an intervention clinic. This also takes the decision making away from a single individual, who could be prone to decision deferral, and gives it to a group who can discuss the best approaches to individual management.


The conceptualisation template

As recommended in RUDI, an interdisciplinary team should be formed from the outset to ensure that informed decisions guide the development of the model and that there is a significant commitment to the process the model is supporting. Domain knowledge about how the data is collected and how particular crimes are recorded is integral to the development of any model that uses crime history data. Below is a briefly populated example of the conceptualisation template that comes before modelling is started in earnest. It allows the design team to discuss what the underlying problems are and what the most effective solution would be, computational or otherwise.


What is the problem to be addressed?  

We want to reduce the number of felonies committed in Florida 

What is the proposed solution? 

A machine learning algorithm that can predict if someone who has been arrested will commit a felony over the next 12 months so that they can be offered an intervention 

What is the overall aim of the initiative and why is this important?  

The aim is to find suitable candidates (those who are predicted to commit a felony within 12 months of their most recent arrest) for our intervention. This is important because it will reduce the likelihood of new felonies being committed 

Who/what does the initiative target and why? 

It targets suspects who have been arrested and who the algorithm predicts will commit a felony in the 12 months following their arrest 

What are the key definitions being used and how are they operationalised? (e.g., high harm, recidivism risk) 

Arrestee – anyone who has been arrested for any crime, using their most recent arrest as a trigger to be included in the algorithm. They don't have to have been convicted of or charged with any crime, just arrested.

Suspects are either considered likely to commit a felony in a 12-month period (high risk) or not (low risk) 

 

Felony – uses the definition from Florida Felony Charges: Know the Different Levels of Offenses (criminaldefenseattorneytpa.com) 

What is the mechanism (e.g. professional judgement, structured professional judgement using a tool, static algorithm, machine-learning based algorithm) by which cases of interest will be identified and why? 

A machine learning based algorithm is being used due to the large quantity of arrest data 

Will identified cases go through further assessment? If so, what does this look like? 

Cases identified by the algorithm will be sent to the intervention team, who will assess whether the person is a suitable candidate based on their own criteria

What kind of action will be taken? How is this justified and is there resource available for this? 

Suspects who are deemed suitable candidates will be offered the intervention  

How does all of the above fit within your force's legal and ethical framework? 

The intervention is voluntary and will target predictors of antisocial behaviour 

What evidence base are you using to justify all of the above steps? (briefly state underlying theory & hypotheses) 

For this analysis example, we did not draw on any evidence base or theory 


 

Data processing


In a real police force, the data will be split across various databases, case files, prison histories, etc. We have a small simulation of that here and, indeed, most of the code we have, even in this simple case, has to do with transforming the various databases so that they are unified across individuals into a format that allows us to look at the data as a sequence of events over time. 
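
As an illustration of this unification step, here is a minimal pandas sketch that joins hypothetical person, arrest, and charge tables into a single chronological event table. The table and column names (people, casearrest, charge, person_id, case_number, arrest_date) are illustrative stand-ins rather than the exact COMPAS schema.

```python
import pandas as pd

# Illustrative table and column names; not the exact COMPAS schema.
people = pd.read_csv("people.csv")        # one row per person
arrests = pd.read_csv("casearrest.csv")   # one row per arrest
charges = pd.read_csv("charge.csv")       # one row per charge

# Attach each charge to its arrest and each arrest to a person, so every
# charge becomes one event in a single unified table.
events = (
    charges
    .merge(arrests, on=["person_id", "case_number"], how="left")
    .merge(people[["person_id", "sex", "race", "dob"]], on="person_id", how="left")
)

# Order each person's history chronologically so downstream code can treat
# it as a sequence of events over time.
events["arrest_date"] = pd.to_datetime(events["arrest_date"])
events = events.sort_values(["person_id", "arrest_date"]).reset_index(drop=True)
```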


The COMPAS dataset

The COMPAS dataset is often used for fairness modelling, so there are many instances of its use. More information on the dataset can be found in the following resources: 

 

  • Original article by ProPublica entitled Machine Bias 

  • ProPublica Methodology article and GitHub link 

  • Fairness in ML tutorial that covers a lot of important points related to working with COMPAS as well as a few notes on the dataset itself at about the 1hr mark 

  • Fair prediction with disparate impact: A study of bias in recidivism prediction instruments, an article describing why calibration and fairness in error rates are incompatible. Calibration ensures that when the algorithm returns a value of 0.8, about 8 out of 10 people with that value in the test data were correctly labelled as reoffending within the next two years (see the short sketch after this list). Obviously, we don't know the true accuracy of such a model after it has been implemented, as it would have influenced future outcomes through sentencing. Due to the population statistics of the dataset (available in the Development section), any errors by the calibrated model would lead to disparate outcomes for the different populations.

  • COMPAS data datasheet from the Datasheet Repository for Criminal Justice Datasets 
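
To make the calibration point above concrete, here is a rough sketch of how one might check it, assuming scores holds a model's predicted probabilities and reoffended the observed two-year outcomes for the test data (both hypothetical arrays); this is not code from the original analysis.

```python
import numpy as np
import pandas as pd

# Hypothetical arrays: predicted probabilities and observed outcomes.
df = pd.DataFrame({"score": scores, "reoffended": reoffended})
df["bin"] = pd.cut(df["score"], bins=np.arange(0.0, 1.01, 0.1))

# For a calibrated model, the observed reoffending rate in each bin should
# roughly match the bin's predicted probability (e.g. ~0.8 in the 0.7-0.8 bin).
print(df.groupby("bin", observed=True)["reoffended"].agg(["mean", "count"]))
```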


There are visible biases in the data with regard to race distribution. Below is a table captured from USA Facts:


We can see that the black population is about 15% of the overall population of Florida, where the COMPAS dataset originates. In the people table of the dataset we see the following distribution where almost 50% of the suspects are reported as African American.



We can also see that there are correlations between the Filing Agency field and race. By normalising row-wise (Figure 3) we see that certain locations, like the Broward Sheriff Office/Lauderdale Lakes or the Grand Jury, predominantly recommend charges for African American suspects; the County Court predominantly deals with Hispanic suspects; while Fire Marshals and US Marshals seem to deal with Caucasian suspects. We don't know enough about the workings of the American justice system to draw conclusions from these findings, so we find it best to remove the Filing Agency column from the training data for the purposes of this experiment.
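
A row-wise normalisation like the one described can be produced with a pandas crosstab; the column names filing_agency and race below are illustrative stand-ins for the dataset's fields.

```python
import pandas as pd

# Proportion of each filing agency's charges by race (rows sum to 1).
agency_by_race = pd.crosstab(events["filing_agency"], events["race"], normalize="index")
print(agency_by_race.round(2))

# Drop the column before training, as described in the text.
events = events.drop(columns=["filing_agency"])
```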



Further analysis is available in the full document.


Modelling


At the modelling stage, we define as high risk a person who is likely to commit a felony within a year of an arrest. For each person in the dataset, we create a training point for each arrest. We use the offences associated with that arrest and the prior history as training features, and we use any offences committed in the following year to calculate the label: True if there was a felony, False otherwise. We skip anyone who spent time in prison during the prediction year.
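
A simplified sketch of this labelling logic is shown below, assuming the unified events table from earlier with person_id, arrest_date, and is_felony columns (illustrative names); the check for prison time during the prediction year is noted but not implemented here.

```python
import pandas as pd

def label_arrest(events, person_id, arrest_date):
    """Build one training point for a single arrest (illustrative sketch)."""
    person = events[events["person_id"] == person_id]

    # Features: the current arrest plus everything that happened before it.
    history = person[person["arrest_date"] <= arrest_date]

    # Label: True if any offence in the following year was a felony.
    window_end = arrest_date + pd.DateOffset(years=1)
    future = person[(person["arrest_date"] > arrest_date) &
                    (person["arrest_date"] <= window_end)]
    label = bool(future["is_felony"].any())

    # NOTE: people who spent time in prison during the prediction year are
    # skipped in the text; that check is omitted from this sketch.
    return history, label
```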


We divide the data into train, validation, and test portions. The models train on the training portion, model selection is done on the validation portion, and model performance is finally judged on the test portion.
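
One plausible way to do this split is sketched below; grouping by person (our assumption, not stated in the text) keeps all of an individual's arrests in the same portion so the model is never evaluated on people it has already seen during training.

```python
from sklearn.model_selection import GroupShuffleSplit

# `data` is assumed to hold one row per (person, arrest) training point.
gss = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_val_idx, test_idx = next(gss.split(data, groups=data["person_id"]))
train_val, test = data.iloc[train_val_idx], data.iloc[test_idx]

# Carve a validation portion out of the remaining data, again by person.
gss_val = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
train_idx, val_idx = next(gss_val.split(train_val, groups=train_val["person_id"]))
train, validation = train_val.iloc[train_idx], train_val.iloc[val_idx]
```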


We use a Python-based auto-ML toolkit that allows us to train various models, compare them, and choose an ensemble of the best ones to produce a final model, all within a few lines of code. (Of course, for a final product one might explore more complex options.) The resulting model is a combination of a couple of different neural network solutions. We also adjust the model to compensate for the racial imbalance in the training data, using a toolkit extension for fairness modelling.


Results summary: if we look at historical data and pretend we have 100 cases each week, and we take the top 10 to suggest to the intervention clinic, the algorithm would suggest an average of 4.3 people who did go on to commit felonies. Because this is an artificial dataset containing mostly people who are at the point of sentencing, this number is higher than we could expect if similar models were trained on a force's arrest database.
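
The weekly top-10 figure can be reproduced roughly as follows, assuming scores and labels are NumPy arrays of predicted probabilities and true outcomes for the held-out test set (our notation, not from the original code).

```python
import numpy as np

# Shuffle the test set into simulated weekly batches of 100 cases.
rng = np.random.default_rng(0)
order = rng.permutation(len(scores))

hits = []
for start in range(0, len(order) - 99, 100):
    week = order[start:start + 100]
    top10 = week[np.argsort(scores[week])[-10:]]  # highest-scoring 10 of the week
    hits.append(labels[top10].sum())              # how many truly went on to a felony

print("Average true positives in the weekly top 10:", np.mean(hits))
```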

Some details

Skipping a few KEY steps, like transforming data from the COMPAS database into training data, we come to the modelling stage.


To generate a quick baseline model we use the auto-ML toolkit AutoGluon and the fairness extension package AutoGluon.Fair. Auto-ML refers to any set of machine learning tools that takes care of much of the data handling and trains a variety of ML models in order to find the best combination for the particular task as we have defined it.
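
A minimal training sketch with AutoGluon is shown below, assuming train, validation, and test are DataFrames of per-arrest training points with a boolean label column; the fairness adjustment via the AutoGluon.Fair extension is only indicated in a comment rather than shown.

```python
from autogluon.tabular import TabularPredictor

# Train a suite of models and ensemble the best ones.
predictor = TabularPredictor(label="label", eval_metric="roc_auc").fit(
    train_data=train,
    tuning_data=validation,   # used for model selection
)

# Compare the trained models and score the held-out test portion.
print(predictor.leaderboard(test))
test_scores = predictor.predict_proba(test)

# The fairness adjustment described in the text is applied afterwards with the
# AutoGluon.Fair extension; its API is not shown in this sketch.
```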


Our code is available on GitHub and, due to the simplicity of the auto-ML approach, the code for the actual training is only several lines long. The majority of the code is in the preprocessing of the data and in the two Jupyter notebooks, which allow us to peek into the data and model performance.


We trained two models: one including race (original race) and one that did not (original). The model without race data was a weighted ensemble of NeuralNetFastAI and NeuralNetTorch; the model including race also contained XGBoost. The models were further adjusted for fairness by minimising the demographic parity difference (updated).



We also examine which features contribute to the model and find that they match the recency, frequency, gravity framework (we have no knowledge of the number of victims in this dataset). Features to do with the total numbers of past incidents, in particular felonies and misdemeanours, are highly weighted, as are words related to drugs (possession, cocaine), violent offences (battery), and driving without a license (license). We looked only five cases into the past, so whether the fifth-oldest case was a felony, and the recency of that oldest case, are also highly weighted. Different ways of encoding the RFGV features might help models pick up even more useful statistics.
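
One way to produce this kind of feature analysis is AutoGluon's permutation-based feature importance, sketched below with the predictor and test objects from the earlier training sketch; this may differ from the exact method used in the original notebooks.

```python
# Permutation importance on the held-out test portion: each feature is shuffled
# in turn and the drop in the evaluation metric is recorded.
importance = predictor.feature_importance(test)
print(importance.head(15))  # e.g. past felony counts, recency, charge keywords
```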


Discussion


There are several important takeaways from this exercise:

  • The modelling is easy; the hard things are:

    • Data unification (across different policing databases), processing, and integrity validation

    • Integration into organisational processes

    • Ensuring fairness of the approach by fully understanding the data, underlying processes, and effects of using the model vs other approaches

  • Performance is low:

    • Using just case data without any text, forensics, or other insightful information doesn't lead to high performance

    • You need to consider how you want to tune your model depending on the costs of missing particular crimes vs the amount of time it will take to sift through low precision output

      • This will depend on what other methods for identifying high-risk individuals are also used, such as officer judgement, or third party referral

    • You need to compare the performance of the model to business as usual. Does it add value? Is there capacity to process the output?

  • Modelling is just a tool; you can model any process you have data for. Is risk prediction what is needed? Or can you optimise some other aspect of your workflow? 

Please contact us with any questions or for help with working through the RUDI framework.
