Modelling 2: Knowing the data sources and biases
Knowing your data also means understanding how the data was produced and the procedures that could lead to idiosyncratic entries and variations in distribution over time. For example:
The definition of a particular crime can change over time, producing features that are not consistent across the dataset. This needs to be understood and recorded correctly in the data and model cards.
Data entry systems have changed over the years. If you are using historical data, be mindful of these shifts and consider mitigation strategies and feature engineering that could help reconcile the differences.
Biases in the data can come from socio-economic, procedural, and other sources. Think about how your data was collected and what sources of bias that process could introduce. For example, whether records came from called-in crimes or from stop-and-search could make a big difference.
If you are using locations, or if your data comes predominantly from a particular community, you need to avoid a feedback loop in which the model keeps pointing back to the sampled area and never looks outside it.
If you are using language modelling tools, consider examining the potential biases both in the pre-trained model you are introducing and in the fine-tuned model you are producing.
Fairness: Types of Bias | Machine Learning | Google for Developers
Where it is not practical to remove bias, quantify and minimise it where possible. The team should agree an appropriate quantitative and qualitative definition of discrimination and use it to determine what counts as acceptable error, for example the allowable difference in error rate between different groups.
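As a concrete illustration, the sketch below computes the misclassification rate per group and flags when the largest gap exceeds an agreed limit. It is a minimal Python example: the column names, the data and the 0.05 threshold are hypothetical, not values prescribed by this guidance.

```python
import pandas as pd

# Illustrative threshold for the allowable difference in error rate between groups.
MAX_ERROR_RATE_GAP = 0.05

def error_rate_gap(results: pd.DataFrame) -> float:
    """Largest difference in misclassification rate between any two groups."""
    per_group = (
        results.assign(error=results["y_true"] != results["y_pred"])
               .groupby("group")["error"]
               .mean()
    )
    return per_group.max() - per_group.min()

# Hypothetical labels, predictions and group membership.
results = pd.DataFrame({
    "y_true": [1, 0, 1, 1, 0, 0, 1, 0],
    "y_pred": [1, 0, 0, 1, 0, 1, 1, 0],
    "group":  ["A", "A", "A", "B", "B", "B", "B", "A"],
})

gap = error_rate_gap(results)
print(f"Error rate gap between groups: {gap:.2f}")
if gap > MAX_ERROR_RATE_GAP:
    print("Gap exceeds the team's agreed threshold: investigate and document.")
```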
Then test, as far as possible, whether your model is amplifying these biases. There are several key steps that can be performed and documented:
Test correlations between data features and protected attributes or sources of bias (e.g. ethnicity, location). Data features should be associated with the outcome rather than with any protected attributes (see the first sketch after this list).
For each level of each protected attribute, compute the proportion it represents in the overall population, to identify smaller groups whose error estimates may be less reliable.
If the correlations or group proportions breach the thresholds set by the team's acceptable error standards, minimise bias through:
Before Modelling: relabelling, reweighting or resampling examples near the classification margin (a reweighting sketch follows this list), e.g. Mitigating Bias in Machine Learning: An introduction to MLFairnessPipeline | by Mark Bentivegna | Towards Data Science
Any training data adjustment strategies should be tested to confirm that model performance remains acceptable.
Test data proportions should not be adjusted in any way.
During Modelling: use ‘fairness regularisers’ that measure the difference in how the algorithm classifies protected vs. non-protected groups and penalise the model in proportion to that difference (a minimal sketch follows this list). This is an active research area, so it is worth looking for regularisers suited to your algorithm.
After Modelling: investigate bias with the held-out test dataset (see the evaluation sketch after this list).
Fairness: Evaluating for Bias | Machine Learning | Google for Developers
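The sketches below illustrate the steps above. They are minimal Python examples with made-up data, column names and thresholds, not prescribed implementations. The first covers the correlation and group-proportion checks:

```python
import pandas as pd

# Illustrative threshold above which a feature/protected-attribute correlation
# should be flagged for review.
CORRELATION_THRESHOLD = 0.3

# Hypothetical training data with one protected attribute.
df = pd.DataFrame({
    "feature_a": [3.1, 2.7, 5.6, 4.9, 1.2, 6.3],
    "feature_b": [0, 1, 1, 0, 0, 1],
    "ethnicity": ["X", "X", "Y", "Y", "X", "Y"],   # protected attribute
})

# 1. Correlate each feature with a one-hot encoding of the protected attribute.
protected_dummies = pd.get_dummies(df["ethnicity"], prefix="ethnicity", dtype=float)
features = df[["feature_a", "feature_b"]]
correlations = pd.concat([features, protected_dummies], axis=1).corr().loc[
    features.columns, protected_dummies.columns
]
print("Feature vs protected-attribute correlations:")
print(correlations.round(2))
print("Flagged for review:")
print(correlations.abs().gt(CORRELATION_THRESHOLD))

# 2. Proportion of each level of the protected attribute in the data,
#    to spot small groups whose error estimates will be noisy.
print("Group proportions:")
print(df["ethnicity"].value_counts(normalize=True))
```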
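The next sketch covers the ‘before modelling’ adjustment. It uses the simple reweighing idea of weighting each (group, label) combination towards independence; the margin-based relabelling and resampling described in the linked article are more involved. Column names and data are illustrative.

```python
import pandas as pd

def reweigh(df: pd.DataFrame, group_col: str, label_col: str) -> pd.Series:
    """Weight each row so that group and label contribute as if independent."""
    p_group = df[group_col].value_counts(normalize=True)
    p_label = df[label_col].value_counts(normalize=True)
    p_joint = df.groupby([group_col, label_col]).size() / len(df)
    expected = p_group.loc[df[group_col]].to_numpy() * p_label.loc[df[label_col]].to_numpy()
    observed = p_joint.loc[list(zip(df[group_col], df[label_col]))].to_numpy()
    return pd.Series(expected / observed, index=df.index, name="sample_weight")

# Hypothetical training set: group A with mostly positive labels, group B mostly negative.
train = pd.DataFrame({
    "group": ["A", "A", "A", "B", "B", "B", "B", "B"],
    "label": [1, 1, 0, 0, 0, 0, 0, 1],
})
weights = reweigh(train, "group", "label")
print(train.assign(sample_weight=weights))
# These weights can be passed to most scikit-learn estimators via the
# `sample_weight` argument of `fit`; the test set is left untouched.
```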
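For the ‘during modelling’ step, the sketch below adds one simple form of fairness regulariser, a demographic-parity style penalty on the gap in mean predicted score between two groups, to a NumPy logistic regression trained by gradient descent. The data and the penalty strength are made up, and other regularisers may suit your algorithm better.

```python
import numpy as np

# Synthetic data: the outcome is partly driven by group membership, so an
# unregularised model would score the two groups quite differently.
rng = np.random.default_rng(0)
n, d = 200, 3
X = rng.normal(size=(n, d))
group = rng.integers(0, 2, size=n)          # 0 = non-protected, 1 = protected
y = (X[:, 0] + 0.5 * group + rng.normal(scale=0.5, size=n) > 0).astype(float)

lam = 2.0        # strength of the fairness penalty (illustrative)
lr = 0.1
w = np.zeros(d)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for _ in range(500):
    p = sigmoid(X @ w)
    # Gradient of the standard log loss.
    grad_loss = X.T @ (p - y) / n
    # Penalty: squared gap between mean predicted scores of the two groups.
    gap = p[group == 1].mean() - p[group == 0].mean()
    dgap_dw = (
        (X[group == 1] * (p[group == 1] * (1 - p[group == 1]))[:, None]).mean(axis=0)
        - (X[group == 0] * (p[group == 0] * (1 - p[group == 0]))[:, None]).mean(axis=0)
    )
    w -= lr * (grad_loss + lam * 2 * gap * dgap_dw)

p = sigmoid(X @ w)
print("Score gap after training:",
      round(p[group == 1].mean() - p[group == 0].mean(), 3))
```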
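Finally, for the ‘after modelling’ check, the sketch below compares false positive and false negative rates per group on a held-out test set using scikit-learn. The arrays stand in for real test-set labels, predictions and group membership.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical test-set labels, predictions and protected-group membership.
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0, 0, 1])
y_pred = np.array([1, 0, 0, 1, 1, 0, 1, 0, 1, 1])
groups = np.array(["A", "A", "A", "B", "B", "B", "B", "A", "B", "A"])

for g in np.unique(groups):
    mask = groups == g
    tn, fp, fn, tp = confusion_matrix(y_true[mask], y_pred[mask], labels=[0, 1]).ravel()
    fpr = fp / (fp + tn) if (fp + tn) else float("nan")
    fnr = fn / (fn + tp) if (fn + tp) else float("nan")
    print(f"group {g}: FPR={fpr:.2f}, FNR={fnr:.2f}, n={mask.sum()}")
# Large differences in these rates between groups should be compared against the
# team's agreed thresholds and documented.
```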
Fairness | Machine Learning | Google for Developers
A Tutorial on Fairness in Machine Learning | by Ziyuan Zhong | Towards Data Science