
Data Preparation

Data profiling and visualisation tools can surface insights and build a thorough understanding of the data to support decision-making. Here is a non-exhaustive guide to data preparation:


Create a unified dataset by importing relevant data from various sources

 

Cleaning and feature manipulation

The presence of incorrect or inconsistent data can distort the model’s results. Data cleaning involves removing or correcting entries so that the data is valid and reliable. This includes removing cases that are inappropriate, as outlined in the rationale.

  • Correct data as necessary:

    • Tabular data often contains errors arising from manual entry and from changes to the input systems over time. Decide how to handle each column based on its type. Some examples, illustrated in the cleaning sketch after this list, include:

      • Implausible age values, e.g. less than 0 or greater than 100.

      • Implausible dates, e.g. a placeholder year such as 1900.

      • Values that can be propagated from other entries for the same person when they are available, such as gender.

      • Columns with a single value, or too few values, should be dropped. While imputation techniques exist, be careful about the implications for your data, as imputation can in some cases be equivalent to introducing false facts.

      • Only some ML models can handle null values. If you use data preprocessing pipelines, like those available in sklearn, you can delay these decisions and apply transformations on a case-by-case basis (see the pipeline sketch after this list).

      • Correctly encode and process categorical and ordinal features. Again, data processing pipelines can help you process each of these separately and customise the input for different models.

      • The efficient way to improve the accuracy of AI model: Andrew Ng’s Data-centric AI | by Dasol Hong | AI Network | Medium


  • Sequence/Geospatial

    • Standardise time variables and consider treating parts of the timestamp as separate features (see the timestamp sketch after this list).

    • Unlike tabular data, sequence data may require missing values to be imputed, e.g. to keep time steps regular.

    • Features may need to be encoded or transformed to make them more standard or useful

    • Some sequence processing examples

    • Leverage distance measures and shape processing for geospatial data.

    • Some geospatial examples

  • Image/Video

  • Text data processing is a rapidly changing field. Many techniques that apply to older models are unnecessary if you are using deep-learning-based methods. Treating text data as categorical, so it can be used in models suitable for tabular data such as SVMs or trees, disregards word meaning in context. However, you can combine tabular and text data by generating the text vectors separately, for example with a model like BERT.

    • If you are using single tokens or n-grams, consider a) removing very high and very low frequency words; b) using stemming; c) l2-normalising by row, not by column; d) using methods such as PPMI to reweight the features; e) applying SVD or LDA to smooth over similar words; etc. A combination of PPMI and l2-normalisation, followed by SVD, can be very useful (see the sketch after this list).

  • Some manipulations depend on the statistical distribution of the known data, such as statistics-based normalisation and data scaling. Take care to plan how these will be applied to incoming cases, as you might need to:

    • Estimate them only on the training portion of your data

    • Save any scaling factors so you can apply them to the test/production data (see the scaling sketch after this list).

  • The effects of missing data depend on the task, but it is important to test which strategy works best for your data. This may be tricky, as you can end up changing both your training and test sets, making comparison of the methods more difficult; keep track of issues and performance. The two main ways of dealing with missing data are deletion and imputation: you can delete bad features or bad training examples, but you might lose information, or you can estimate the missing data using imputation, collaborative filtering, or dimensionality reduction techniques such as singular value decomposition, provided you are careful not to introduce erroneous information. Likewise, you need to think about how you will deal with missing data when your system is running in production. 


    You might be able to find an algorithm that suits your data because it handles missing values in a particular way, or you might find that knowledge of how an algorithm works tells you the best way to represent them. For example, using a 0 might be mistaken for genuine 0 values, as might something like –1, so with tree-based methods you might be better off using a very unlikely sentinel value such as –999. On the other hand, if you are normalising, such sentinels can cause issues (see the note in the scaling sketch below). 
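
To make the tabular corrections above concrete, here is a minimal cleaning sketch in pandas. The file name, the columns (person_id, age, dob, gender) and the thresholds are hypothetical illustrations, not recommendations:

    import numpy as np
    import pandas as pd

    df = pd.read_csv("cases.csv")  # hypothetical input file

    # Flag implausible ages as missing rather than silently "correcting" them
    df.loc[(df["age"] < 0) | (df["age"] > 100), "age"] = np.nan

    # Parse dates, treating unparseable or placeholder values (e.g. 1900) as missing
    df["dob"] = pd.to_datetime(df["dob"], errors="coerce")
    df.loc[df["dob"].dt.year <= 1900, "dob"] = pd.NaT

    # Propagate values that should be constant for a person (e.g. gender)
    # from entries where they are recorded to entries where they are not
    df["gender"] = df.groupby("person_id")["gender"].transform(
        lambda s: s.ffill().bfill()
    )

    # Drop columns that carry a single value and therefore no information
    single_valued = [c for c in df.columns if df[c].nunique(dropna=True) <= 1]
    df = df.drop(columns=single_valued)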

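Where decisions are better delayed, an sklearn pipeline sketch like the following keeps preprocessing explicit, reproducible and reusable at prediction time. The column lists and ordinal ordering are hypothetical; the imputation and encoding choices are one plausible configuration, not the only one:

    from sklearn.compose import ColumnTransformer
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.impute import SimpleImputer
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler

    numeric_cols = ["age", "prior_incidents"]   # hypothetical
    categorical_cols = ["region"]               # hypothetical
    ordinal_cols = ["severity"]                 # hypothetical: low < medium < high

    preprocess = ColumnTransformer([
        # Numeric: impute nulls with the median, then scale
        ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                          ("scale", StandardScaler())]), numeric_cols),
        # Categorical: one-hot encode, tolerating unseen categories in production
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
        # Ordinal: encode with an explicit, meaningful order
        ("ord", OrdinalEncoder(categories=[["low", "medium", "high"]]), ordinal_cols),
    ])

    model = Pipeline([("prep", preprocess), ("clf", RandomForestClassifier())])
    # model.fit(X_train, y_train) fits the preprocessing on training data only;
    # the same fitted pipeline is then applied unchanged to test/production data.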

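For time variables, this timestamp sketch standardises a timestamp and exposes its parts as separate features, assuming a hypothetical event_time column:

    import pandas as pd

    events = pd.read_csv("events.csv")  # hypothetical input file

    # Standardise to a single timezone-aware representation
    events["event_time"] = pd.to_datetime(events["event_time"], utc=True)

    # Treat parts of the timestamp as separate features
    events["hour"] = events["event_time"].dt.hour
    events["day_of_week"] = events["event_time"].dt.dayofweek
    events["month"] = events["event_time"].dt.month
    events["is_weekend"] = events["day_of_week"] >= 5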
 
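The PPMI, l2-normalisation and SVD recipe for token/n-gram features can be sketched as follows over a document-term matrix. This is one plausible implementation: load_documents is a hypothetical loader returning raw strings, and the frequency cut-offs and component count are illustrative:

    import numpy as np
    from sklearn.decomposition import TruncatedSVD
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.preprocessing import normalize

    docs = load_documents()  # hypothetical loader returning a list of raw strings

    # Drop very low and very high frequency words at vectorisation time
    counts = CountVectorizer(min_df=5, max_df=0.5).fit_transform(docs)
    X = counts.toarray().astype(float)

    # Positive pointwise mutual information (PPMI) reweighting
    p_dw = X / X.sum()                      # joint document-word probabilities
    p_d = p_dw.sum(axis=1, keepdims=True)   # document marginals
    p_w = p_dw.sum(axis=0, keepdims=True)   # word marginals
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(p_dw / (p_d * p_w))
        ppmi = np.where(np.isfinite(pmi) & (pmi > 0), pmi, 0.0)

    # l2-normalise by row, then smooth over similar words with SVD
    ppmi = normalize(ppmi, norm="l2", axis=1)
    vectors = TruncatedSVD(n_components=100).fit_transform(ppmi)  # 100 is illustrative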

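This scaling sketch estimates scaling statistics on the training portion only and saves the fitted scaler for test/production use; it also notes the missing-value sentinel trade-off discussed above. X is assumed to be a feature matrix prepared earlier:

    import joblib
    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import StandardScaler

    X_train, X_test = train_test_split(X, test_size=0.2, random_state=0)

    # Fit on training data only, so no test statistics leak into the scaler
    scaler = StandardScaler().fit(X_train)
    X_train_scaled = scaler.transform(X_train)
    X_test_scaled = scaler.transform(X_test)

    # Persist the fitted scaler so production data receives the identical transform
    joblib.dump(scaler, "scaler.joblib")
    # ... later, in production:
    scaler = joblib.load("scaler.joblib")

    # For tree-based models, missing values can instead be encoded with an
    # out-of-range sentinel (e.g. -999); avoid this if you then normalise,
    # because the sentinel will distort the estimated statistics:
    X_filled = np.where(np.isnan(X), -999, X)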
Labelling 

There are many types of modelling tasks that can be helpful in the policing environment. Some may be clearly defined, like document labelling; others might be predictive, like risk assessment or future crime locations; and others may be subjective, like document summarisation. Labelling best practices differ by task, but in each case it is important to be very clear about the goal of the modelling and to define the task in a way that is best supported by the data.

  • Make sure the process that leads to the labels is clearly defined. For example, if you are replicating a human task, try to ensure borderline cases are well defined so that labelling is consistent. Consistent labelling can improve your modelling more than any algorithm choice. 

  • If you are deciding labels based on data, consider all factors that influence the labels and clearly define the meaning of the label. Try to avoid shifting goalposts based on population statistics, e.g. the ‘top 5%’ most violent criminals to date, because this threshold changes depending on the number of people you include in your sample group. Instead, look for the patterns in that group and try to align them with theory, policy and potentially the input data features to ensure clarity. 

    • If you are predicting risk of future harm: what constitutes harm, and what outcomes are you trying to prevent? Were there interventions that might have prevented the person from committing harm, such as prison terms or protection orders? How do you model this knowledge? For example, can you record these as features if they are part of future events? Or do you disregard cases where someone’s future risk was altered by interventions? Test out the best strategies.

    • If you are predicting event locations based on past events, are there interventions such as police presence or general unrest that have altered the labels in some way?

  • In cases where you are predicting future events, you may want to consider that your labels represent a noisy ground truth and use modelling techniques that account for noisy labels (see the sketch after this list).

  • The Ultimate Guide to Data Labeling: How to Label Data for ML | by SuperAnnotate | Geek Culture | Medium
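
One simple, hedged way to treat labels as a noisy ground truth, in the spirit of confident-learning methods, is to flag examples where a cross-validated model confidently disagrees with the given label and send them for review. X and y are assumed to be features and integer-encoded labels prepared earlier; the model and threshold are illustrative:

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_predict

    # Out-of-fold class probabilities, so no example is scored by a model
    # that saw its own (possibly noisy) label during training
    proba = cross_val_predict(LogisticRegression(max_iter=1000), X, y,
                              cv=5, method="predict_proba")

    # Probability the model assigns to each example's *given* label
    # (assumes y is integer-encoded 0..k-1, matching the column order)
    given_label_proba = proba[np.arange(len(y)), y]

    # Flag confident disagreements as candidates for manual review,
    # not for automatic relabelling
    suspect = np.where(given_label_proba < 0.05)[0]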


 

Assess and mitigate bias

Bias can be introduced at any point in the life cycle of data, e.g., how it was gathered, how it was entered, what is and is not available for modelling, word choice in text data, the locations that were sampled, societal impacts, etc. It is important to consider and record all possible sources of bias in the data cards.

