Data Preparation
Data profiling and visualisation tools can surface patterns, quality issues, and other insights that support decision-making. Here is a non-exhaustive guide to data preparation:
Create a unified dataset by importing relevant data from various sources
Check your data rigorously at each stage of pre-processing, especially if the unification code is developed by another party.
Carefully address potential errors and misalignments resulting from data manipulations.
It’s easy to inadvertently introduce features that mirror the labels (a form of label leakage), particularly if labels are auto-generated from the data.
Regularly assess the impact of data manipulations on model performance.
Data profiling tools can be free and easy to use, such as YData Profiling (a minimal sketch follows below), although there are also many that are not code-based: 16 Open Source Data Profiling Tools (Plus Benefits) | Indeed.com
Visualisation tools can also help
A comprehensive and practical guide to Image Processing and Computer Vision using Python: Part 1 (Introduction) | by Pranav Natekar | Towards Data Science
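A minimal profiling sketch, assuming the ydata-profiling package is installed; the file name and report title are placeholders rather than anything prescribed here:

```python
# Generate an HTML profiling report for a pandas DataFrame.
# "crime_records.csv" is a placeholder path for your own unified dataset.
import pandas as pd
from ydata_profiling import ProfileReport

df = pd.read_csv("crime_records.csv")
profile = ProfileReport(df, title="Data profile", minimal=True)
profile.to_file("data_profile.html")  # open in a browser to inspect columns, missingness, correlations
```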
Cleaning and feature manipulation
The presence of incorrect or inconsistent data can distort the model’s results. Data cleaning involves removing or correcting entries so that the data is valid and reliable. This includes removing cases that are inappropriate, as outlined in the rationale.
Correct data as necessary:
Tabular data can contain many errors because of manual entry and changes to the input systems over time. Decide what to do for each column based on its type. Some examples include (a short pandas sketch follows this list):
Implausible age values, e.g. less than 0 or greater than 100.
Implausible dates, e.g. 1900, which is often a default or placeholder value.
Missing values that can be propagated from other entries for the same person, such as gender.
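The pandas sketch below illustrates these per-column corrections; the column names (age, date_of_birth, person_id, gender), the thresholds, and the file path are hypothetical placeholders for your own data:

```python
# Sketch of per-column checks on a unified tabular dataset.
import pandas as pd
import numpy as np

df = pd.read_csv("unified_dataset.csv", parse_dates=["date_of_birth"])  # placeholder path

# Flag implausible ages as missing rather than silently dropping rows.
df.loc[(df["age"] < 0) | (df["age"] > 100), "age"] = np.nan

# Treat obvious placeholder dates (e.g. 1900) as missing.
df.loc[df["date_of_birth"].dt.year <= 1900, "date_of_birth"] = pd.NaT

# Propagate stable attributes such as gender across records for the same person.
df["gender"] = df.groupby("person_id")["gender"].transform(lambda s: s.ffill().bfill())
```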
Columns with a single value, or with too few values, should be dropped. While there are imputation techniques, you need to be careful about the implications for your data, as imputation in some cases might be equivalent to introducing false facts.
Only some ML models can handle null values. If you use data preprocessing pipelines like those available in sklearn, you can delay decisions and apply transformations on a case-by-case basis.
A nice example of pipelines for use in a stacking classifier with tree-based and linear methods
Correctly encode and process categorical and ordinal features. Again, data processing pipelines can help you process each of these separately and customise input for different models (see the sketch below).
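A sketch of such a pipeline using scikit-learn's ColumnTransformer, handling imputation, scaling, and separate categorical/ordinal encoding; the column names, category ordering, and final estimator are illustrative assumptions, not a prescribed setup:

```python
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler
from sklearn.linear_model import LogisticRegression

numeric_cols = ["age", "prior_incidents"]       # hypothetical column names
categorical_cols = ["offence_type"]
ordinal_cols = ["severity"]                     # e.g. low < medium < high

preprocess = ColumnTransformer([
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), numeric_cols),
    ("cat", Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                      ("onehot", OneHotEncoder(handle_unknown="ignore"))]), categorical_cols),
    ("ord", Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                      ("ordinal", OrdinalEncoder(categories=[["low", "medium", "high"]]))]), ordinal_cols),
])

model = Pipeline([("preprocess", preprocess),
                  ("classifier", LogisticRegression(max_iter=1000))])
# model.fit(X_train, y_train); model.predict(X_test)
```

Because the imputation and encoding steps live inside the pipeline, they are fitted only when the model is fitted on the training data, which helps avoid leaking test-set statistics.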
The efficient way to improve the accuracy of AI model: Andrew Ng’s Data-centric AI | by Dasol Hong | AI Network | Medium
Sequence/Geospatial
Standardise time variables and consider treating parts of the timestamp as separate features.
Unlike tabular data, sequence data may require imputation of missing values rather than deletion, since removing time steps breaks the sequence.
Features may need to be encoded or transformed to make them more standard or useful (a short sketch follows below).
Some sequence processing examples
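A short pandas sketch of the timestamp and imputation points above; the file name, column names, and interpolation choice are hypothetical:

```python
# Standardise a timestamp column, derive separate time features,
# and interpolate missing values along the time axis.
import pandas as pd

df = pd.read_csv("events.csv")
df["timestamp"] = pd.to_datetime(df["timestamp"], utc=True)   # standardise to UTC

# Treat parts of the timestamp as separate features.
df["hour"] = df["timestamp"].dt.hour
df["day_of_week"] = df["timestamp"].dt.dayofweek
df["month"] = df["timestamp"].dt.month

# Interpolate missing values in time order rather than deleting rows.
df = df.sort_values("timestamp").set_index("timestamp")
df["sensor_value"] = df["sensor_value"].interpolate(method="time")
```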
Leverage distance measures and shape processing for geospatial data (a distance sketch follows below).
Some geospatial examples
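As one common distance measure for latitude/longitude data, here is a haversine (great-circle) distance sketch; the coordinates in the example are purely illustrative:

```python
# Great-circle (haversine) distance between two latitude/longitude points,
# a common building block for geospatial distance features.
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Approximate distance in kilometres between two WGS84 coordinates."""
    r = 6371.0  # mean Earth radius in km
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

print(haversine_km(51.5074, -0.1278, 51.5007, -0.1246))  # two illustrative points
```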
Image/Video
For image and video data, see the image processing and computer vision guide linked under visualisation above.
Text
Text data processing is a rapidly changing field. Many techniques that apply to older models are unnecessary if you are using deep-learning-based methods. Treating text data as categorical so it can be used in models suited to tabular data, such as SVMs or trees, discards word meaning in context. However, you can combine tabular and text data by generating the text vectors separately, for example by using a model like BERT.
If you are using single tokens or n-grams, consider a) removing very high and very low frequency words; b) using stemming; c) l2-normalising by row, not by column; d) using methods such as PPMI to reweight the features; e) applying SVD or LDA to smooth over similar words; etc. A combination of PPMI and l2-normalisation, followed by SVD, can be very useful (sketched below).
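A rough sketch of the PPMI, row-wise l2-normalisation, and SVD combination on a scikit-learn bag-of-words matrix; the tiny document list, frequency thresholds, and number of SVD components are illustrative only:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import normalize
from sklearn.decomposition import TruncatedSVD

docs = ["officer attended the scene", "suspect left the scene", "report filed by officer"]

# a) drop very high and very low frequency words via min_df/max_df
counts = CountVectorizer(min_df=1, max_df=0.9).fit_transform(docs).toarray()

# d) PPMI reweighting of the document-term counts
total = counts.sum()
p_dw = counts / total
p_d = p_dw.sum(axis=1, keepdims=True)
p_w = p_dw.sum(axis=0, keepdims=True)
with np.errstate(divide="ignore", invalid="ignore"):
    pmi = np.log(p_dw / (p_d * p_w))
ppmi = np.where(np.isfinite(pmi) & (pmi > 0), pmi, 0.0)

# c) l2-normalise each row (document), not each column
ppmi = normalize(ppmi, norm="l2", axis=1)

# e) SVD to smooth over similar words (n_components must be < number of features)
embeddings = TruncatedSVD(n_components=2, random_state=0).fit_transform(ppmi)
print(embeddings.shape)
```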
Some manipulations depend on the statistical distribution of known data, such as statistics-based normalisations and data scaling. Take care to analyse how these will be applied to incoming cases, as you might need to:
Estimate them only on the training portion of your data.
Save any scaling factors so you can apply them to the test/production data (see the sketch below).
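A minimal scikit-learn sketch of fitting a scaler on the training split only and saving it for reuse; the synthetic data stands in for your own training set:

```python
# Estimate scaling parameters on the training split only, then save them
# so the same transformation can be applied to test/production data.
import joblib
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=200, n_features=5, random_state=0)  # stand-in data
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

scaler = StandardScaler().fit(X_train)          # statistics from training data only
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)        # reuse the same means/variances

joblib.dump(scaler, "scaler.joblib")            # reload in production with joblib.load
```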
The effects of missing data depend on the task, but it’s important to test which strategy works best for your data. This may be tricky, as you can end up changing both your training and test sets, making comparison of the methods more difficult; keep track of issues and performance. The two main ways of dealing with missing data are deletion and imputation: you can delete bad features or bad training examples, but you might lose information, or you can estimate the missing data using imputation, collaborative filtering, or dimensionality reduction techniques such as singular value decomposition, taking care not to introduce erroneous information. Likewise, you need to think about how you will deal with missing data when your system is running in production.
You might be able to find an algorithm that suits your data because it handles missing values in a particular way, or knowledge of how an algorithm works may tell you the best way to represent missing values. For example, a 0 might be mistaken for a genuine 0 value, as might something like -1, so you may be better off using a very unlikely sentinel value such as -999 if you are using tree-based methods. On the other hand, if you are normalising, such sentinel values can cause problems (a short sketch follows the links below).
How to Handle Missing Data with Python - MachineLearningMastery.com
7 Ways to Handle Missing Values in Machine Learning | by Satyam Kumar | Towards Data Science
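A short sketch contrasting the two strategies discussed above, a sentinel value versus imputation with a missingness indicator; the toy DataFrame and column names are placeholders:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({"age": [25, np.nan, 41], "prior_incidents": [0, 2, np.nan]})

# Option 1: an unlikely sentinel value, usually acceptable for tree-based models
# but harmful if you later normalise or use distance-based methods.
df_trees = df.fillna(-999)

# Option 2: impute with a summary statistic (here the median) and keep an
# indicator column so the model can still see that the value was missing.
imputer = SimpleImputer(strategy="median", add_indicator=True)
imputed = imputer.fit_transform(df)   # fit on training data, reuse on test data
print(imputed)
```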
Labelling
There are many types of modelling tasks that can be helpful in the policing environment. Some may be clearly defined, like document labelling; others might be predictive, like risk assessment or future crime locations; and others may be subjective, like document summarisation. There will be different labelling best practices for different tasks, but in each case, to support good predictions, it is important to be very clear about what the goal of the modelling is and to define the task in a way that is best supported by the data.
Make sure that the process that leads to the labels is clearly defined. For example, if you are replicating a human task, try to make sure borderline cases are well defined so there is consistency in labelling. Consistent labelling can improve your modelling more than any algorithm choice.
If you are deciding labels based on data, consider all the factors that influence the labels and clearly define the meaning of the label. Try to avoid shifting goalposts based on population statistics, e.g. the ‘top 5%’ most violent offenders to date, because this can change depending on the number of people you include in your sample group. Instead, look for the patterns in that group and try to align them with theory, policy, and potentially the input data features, to ensure clarity.
If you are working on predicting risk of future harm, what constitutes harm, and what are the outcomes that you are trying to prevent? Were there interventions that might have prevented the person from committing harm, such as prison terms or protection orders? How do you model this knowledge? For example, can you record these as features if they are part of future events? Or do you disregard cases where someone’s future risk was altered due to interventions? Test which strategies work best.
If you are predicting event locations based on past events, are there interventions such as police presence or general unrest that have altered the labels in some way?
In cases where you are predicting future events, you may want to consider that your labels represent a noisy ground truth and use modelling techniques that account for noisy labels.
Assess and mitigate bias
Bias can be introduced at any point in the life cycle of data, e.g., how it was gathered, how it was entered, what is and is not available for modelling, word choice in text data, the locations that were sampled, societal impacts, etc. It is important to consider and record all possible sources of bias in the data cards.