
Implementation

This section outlines key aspects of deploying a machine learning model that directly impacts decisions about individuals and communities who interact with police. More detailed information for Data Scientists is available here.


Responsible team members: The whole team

Fitting the model into current practice

Have a data literacy strategy that includes training

Processes: define who will access the model, how they will access and interpret the model results, the limitations of the model, what to do if the model agrees or disagrees with the end user's assessment, the relevant interventions, etc.

Good data entry: modelling is less effective with unreliable, missing or invalid data 

Confirmation bias: the tendency of people to look for information that confirms or strengthens their beliefs and overlook information that challenges or contradicts them. It is difficult to dislodge and can affect decision making if staff ignore or do not seek out information that may contradict their initial assessment. A “consider the opposite” strategy can help counter it.

Automation bias: the tendency to automatically defer to the algorithm’s result, despite contradictory information. This is most likely to occur when the model is seen as highly accurate and reliable and therefore may become more problematic as staff become more accustomed to using algorithmic outputs. Exposing staff to failures during training can guard against complacency, whereas just telling them about the limitations and warning them to always verify does not sufficiently reduce automation bias. Nevertheless, staff should be made aware of the limitations of the model, its failure rate and the importance of keeping a ‘human in the loop’ because of their ability to consider context and their legal requirement to hold ultimate responsibility for any decisions made. 

Capacity

Additional resources are necessary to support the implementation of algorithms, both to develop and maintain the models and to provide a full and appropriate response to algorithmic predictions.

Where and how the model fits into the investigative process

Usability testing: can reveal previously unknown user experience problems. 

Feedback: soliciting feedback from staff can help determine the best way to implement modelling in practice.  

A/B testing: each user is randomly given either design A or design B, which differ in one specific way (e.g., how the output is presented or when it is presented). The highest-performing variation can then be identified according to predefined metrics, supporting data-driven design decisions.
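As a rough illustration only, the sketch below (in Python) shows how users might be deterministically assigned to one of two designs and how a single predefined metric could be compared. The “minutes to decision” metric and the function names are hypothetical, and a formal significance test would be needed before acting on any observed difference.

    import hashlib
    from statistics import mean

    def assign_variant(user_id):
        # Deterministically assign a user to design "A" or "B" so that they
        # always see the same variant.
        digest = hashlib.sha256(user_id.encode()).hexdigest()
        return "A" if int(digest, 16) % 2 == 0 else "B"

    # The predefined metric recorded for each user who saw each variant
    # (here, hypothetical minutes taken to reach a decision).
    results = {"A": [], "B": []}

    def record_outcome(user_id, minutes_to_decision):
        results[assign_variant(user_id)].append(minutes_to_decision)

    def compare_variants():
        # Report the mean metric per variant; apply a formal significance
        # test before acting on any difference.
        for variant, values in results.items():
            print(variant, round(mean(values), 2) if values else "no data")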

Implementation Decisions

  • Who will access the model results? 

  • What training is provided to staff accessing model results? What provision is in place to ensure relevant staff get this training?  

  • At what point in the investigative process will staff access the model results?   

  • How will model results be presented?   

  • How will model limitations be presented? 

  • When should predictions be acted upon? What should staff do? 

  • Are there occasions where predictions should not be acted on? If so, when? 

  • How do staff document why they have agreed or disagreed with the model’s predictions? Can they document contradictory evidence?

  • What additional resources will be needed to act on the model’s predictions? 

Ensuring the model works in practice

Shadowing

The model is deployed alongside existing practice to test its performance and investigate how it compares with current procedures. This process should last as long as necessary to assess the accuracy of the model’s predictions against real-world data (i.e., if it predicts risk over 12 months, it needs to run for 12 months). Its accuracy can then be compared with current practice.
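A minimal sketch of how shadow-period results might be compared with current practice is given below (in Python). The record structure and field names are hypothetical, and accuracy is only one possible comparison; in practice, different types of error (e.g., missed high-risk cases) may need to be weighted differently.

    # Each record holds the model's shadow prediction, the assessment made
    # under current practice, and the outcome observed after the full
    # follow-up period. Field names and values are illustrative.
    records = [
        {"model_pred": 1, "practice_pred": 0, "outcome": 1},
        {"model_pred": 0, "practice_pred": 0, "outcome": 0},
        {"model_pred": 1, "practice_pred": 1, "outcome": 0},
    ]

    def accuracy(pred_key):
        correct = sum(1 for r in records if r[pred_key] == r["outcome"])
        return correct / len(records)

    print("Model accuracy:   ", round(accuracy("model_pred"), 3))
    print("Practice accuracy:", round(accuracy("practice_pred"), 3))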

Ongoing monitoring

Data drift: the distribution of the input data changes over time, so the model is applied to data that differ from what it was trained on, reducing accuracy. This can be gradual or sudden (e.g., when counting rules change).

Concept drift: the relationship between your input and what you want to predict changes 

Managing drift 

  • Maintain baseline models for performance comparisons 

  • Regularly retrain and update models  

  • Evaluate the importance of new data  

  • Develop a monitoring dashboard 

  • Decide on the metrics you will track to guard against data drift and concept drift, and set threshold alarms. The most appropriate metrics depend on what is most relevant to the model and the data (a drift-check sketch follows this list).

  • Metrics and thresholds can also change over time and therefore these also need to be monitored 
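As one illustration of a drift alarm, the sketch below (in Python) computes the Population Stability Index (PSI) for a single numeric input feature against its baseline (training-time) distribution. PSI is only one of many possible drift metrics, and the 0.25 alarm threshold is a commonly used rule of thumb rather than a fixed standard.

    import numpy as np

    def psi(baseline, current, bins=10):
        # Population Stability Index between the baseline (training) data and
        # the current data for one numeric feature.
        edges = np.histogram_bin_edges(baseline, bins=bins)
        base_counts, _ = np.histogram(baseline, bins=edges)
        curr_counts, _ = np.histogram(current, bins=edges)
        base_pct = np.clip(base_counts / base_counts.sum(), 1e-6, None)
        curr_pct = np.clip(curr_counts / curr_counts.sum(), 1e-6, None)
        return float(np.sum((curr_pct - base_pct) * np.log(curr_pct / base_pct)))

    def check_drift(baseline, current, threshold=0.25):
        # Raise an alarm when the PSI exceeds the chosen threshold.
        score = psi(np.asarray(baseline), np.asarray(current))
        if score > threshold:
            print(f"ALERT: PSI {score:.3f} exceeds threshold {threshold}")
            return True
        return False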

Plan for feedback loops caused by interventions (e.g., acting on a prediction changes the behaviour, and therefore the data, that future predictions are based on).

Plan what to do if the model no longer works satisfactorily 

Develop criteria to determine whether the modelling is making a difference, i.e., what outcome is expected as part of your rationale.

Model maintenance

  • What metrics will be used to track model performance and why have these been chosen?  

  • If applicable, what thresholds for each metric will trigger an alarm and why have these been chosen? (See the sketch after this list.)

  • How many thresholds need to be breached for the model’s performance to be reassessed? Why? 

  • How will the model’s performance be reassessed? 

  • What will happen if the model’s performance needs to be reassessed? (e.g., stop using the model entirely? warn users the output may be faulty?) 

  • What additional resources will be needed to maintain the model? 
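As a rough sketch of how threshold alarms might work in practice, the example below (in Python) checks recent performance metrics against minimum acceptable values and flags the model for reassessment when too many are breached. The metric names, threshold values and breach limit are illustrative choices, not recommendations.

    # Minimum acceptable values for each tracked metric (illustrative only).
    thresholds = {"precision": 0.70, "recall": 0.60, "auc": 0.75}
    breach_limit = 2  # how many breached thresholds trigger a reassessment

    def needs_reassessment(latest_metrics):
        # Compare metrics computed on recent cases (whose outcomes are now
        # known) against their minimum acceptable values.
        breached = [name for name, minimum in thresholds.items()
                    if latest_metrics.get(name, 0.0) < minimum]
        if len(breached) >= breach_limit:
            print("Reassess the model; thresholds breached:", ", ".join(breached))
            return True
        return False

    needs_reassessment({"precision": 0.65, "recall": 0.55, "auc": 0.80})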
