Development
for Data Scientists

This section provides a more detailed guide to developing machine learning (ML) models intended for use in policing. It is not a complete list of the analyses to conduct. Whether you are new to machine learning, have academic experience but little exposure to deploying AI in human-centric environments, or are an expert, it offers reminders of the steps involved in building a model-based product. Towards the end of the document, you'll find a list of useful resources, including links to online courses that cover much of this material and give general advice on building and deploying ML models.

Project Preparation

Many projects begin as prototypes. While a proof of concept is essential, the transition to a deployable product with real-world impact requires: a) adherence to sound coding and engineering practices, and b) the establishment of a reproducible and well-documented process, even if it demands additional time.

Data Preparation

Data profiling and visualisation tools help you understand the data thoroughly and surface insights that inform the decisions made while preparing it.
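As a rough illustration of what a profiling step reports, the sketch below summarises each column of a small table: missing entries, distinct values, the most common value, and the range for numeric columns. The function name `profile_columns`, the column names, and the records are all hypothetical, invented for this example; a dedicated profiling tool would report far more.

```python
from collections import Counter

def profile_columns(rows):
    """Summarise each column of a table given as a list of dicts."""
    columns = sorted({key for row in rows for key in row})
    report = {}
    for col in columns:
        values = [row.get(col) for row in rows]
        # Treat None and empty strings as missing entries.
        present = [v for v in values if v is not None and v != ""]
        numeric = [
            v for v in present
            if isinstance(v, (int, float)) and not isinstance(v, bool)
        ]
        summary = {
            "missing": len(values) - len(present),
            "distinct": len(set(present)),
            "most_common": Counter(present).most_common(1)[0] if present else None,
        }
        # Only report a range when every present value is numeric.
        if numeric and len(numeric) == len(present):
            summary["min"] = min(numeric)
            summary["max"] = max(numeric)
        report[col] = summary
    return report

# Hypothetical records, invented purely for illustration.
records = [
    {"incident_type": "burglary", "response_minutes": 12, "district": "A"},
    {"incident_type": "theft", "response_minutes": None, "district": "B"},
    {"incident_type": "burglary", "response_minutes": 7, "district": ""},
    {"incident_type": "theft", "response_minutes": 45, "district": "A"},
]

profile = profile_columns(records)
for column, summary in profile.items():
    print(column, summary)
```

Even a summary this small shows where entries are missing and which columns have unexpected ranges, which is the kind of information that feeds into preparation decisions.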

Splitting Data

How you split the data will depend on how much you have. The rule of thumb is to keep as much data as possible for training while leaving a representative sample of different cases for testing, for example 80% for training, 10% for validation, and 10% for testing.
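The 80/10/10 rule of thumb above could be sketched as follows. This is a minimal example using only the standard library; the function name `split_dataset`, its parameters, and the fixed seed are assumptions made for illustration, not a prescribed method.

```python
import random

def split_dataset(items, train_frac=0.8, val_frac=0.1, seed=42):
    """Shuffle items and split them into train, validation and test lists."""
    if not 0 < train_frac < 1 or not 0 <= val_frac < 1 or train_frac + val_frac >= 1:
        raise ValueError("fractions must leave a non-empty share for the test set")
    shuffled = list(items)
    # A fixed seed makes the split reproducible from run to run.
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train_frac)
    n_val = int(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # whatever remains goes to the test set
    return train, val, test

# Stand-in data: 1000 integer IDs in place of real records.
train, val, test = split_dataset(range(1000))
print(len(train), len(val), len(test))
```

A purely random split is itself an assumption. When some cases are rare, a stratified split may be needed to keep the test set representative.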

Modelling 1: Knowing the data

Modelling is as much an art as a science. It’s important to familiarise yourself with the training data set and its properties to get a feel for the algorithms likely to work best for your data.

Modelling 2: Knowing the data sources and biases

Knowing your data also means understanding how it was produced, and which procedures could lead to idiosyncratic entries and variations in distribution over time.
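One simple way to look for variation in distribution over time is to compare how often each category appears in two recording periods. The sketch below does this with hypothetical data; the function names, category labels, and counts are invented for illustration. A large shift does not say why the data changed, only that the recording process or the underlying events are worth investigating.

```python
from collections import Counter

def category_shares(values):
    """Return the fraction of entries that fall in each category."""
    counts = Counter(values)
    total = sum(counts.values())
    return {category: count / total for category, count in counts.items()}

def share_shifts(earlier, later):
    """Change in each category's share between two periods (later minus earlier)."""
    before = category_shares(earlier)
    after = category_shares(later)
    categories = sorted(set(before) | set(after))
    return {c: after.get(c, 0.0) - before.get(c, 0.0) for c in categories}

# Hypothetical entries for two recording periods, invented for illustration.
# A category that only appears in the later period may reflect a change in
# recording procedure rather than a change in the real world.
period_one = ["theft"] * 60 + ["burglary"] * 40
period_two = ["theft"] * 30 + ["burglary"] * 40 + ["online fraud"] * 30

shifts = share_shifts(period_one, period_two)
for category, shift in shifts.items():
    print(f"{category}: {shift:+.2f}")
```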

Modelling 3: Understanding the model

It is incredibly easy to produce models by simply putting data through a set of algorithms and picking those that perform best on selected criteria. However, pattern recognition algorithms can be very sneaky: they strive to find the easiest route to optimal performance, even when that route exploits a quirk of the data rather than the pattern you intended to capture.
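One crude check for such an "easy route" is to ask how well each feature predicts the label on its own. The sketch below is a heuristic under stated assumptions, not an established test: the function names, the 0.95 threshold, the feature names, and the data are all hypothetical. A feature that almost perfectly predicts the outcome by itself may be leaking the label, for example through a recording procedure that changed once the outcome was known.

```python
from collections import Counter, defaultdict

def single_feature_accuracy(rows, feature, label):
    """Accuracy of predicting the label from one feature alone,
    using the majority label seen for each feature value."""
    labels_by_value = defaultdict(Counter)
    for row in rows:
        labels_by_value[row[feature]][row[label]] += 1
    correct = sum(counts.most_common(1)[0][1] for counts in labels_by_value.values())
    return correct / len(rows)

def flag_shortcuts(rows, features, label, threshold=0.95):
    """Return features whose single-feature accuracy reaches the threshold."""
    scores = {f: single_feature_accuracy(rows, f, label) for f in features}
    return {f: s for f, s in scores.items() if s >= threshold}

# Hypothetical data: "form_version" tracks the outcome exactly, as if the
# form used depended on the outcome; "hour" carries no such signal.
rows = [
    {"form_version": "v2", "hour": 22, "outcome": 1},
    {"form_version": "v2", "hour": 3, "outcome": 1},
    {"form_version": "v1", "hour": 22, "outcome": 0},
    {"form_version": "v1", "hour": 14, "outcome": 0},
    {"form_version": "v1", "hour": 3, "outcome": 0},
    {"form_version": "v2", "hour": 14, "outcome": 1},
]

suspicious = flag_shortcuts(rows, ["form_version", "hour"], "outcome")
print(suspicious)
```

The score is computed on the same data it is fitted to, so features with many distinct values, such as identifiers, will score highly for trivial reasons. Treat a flag as a prompt to look at how the feature was produced, not as proof of a problem.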