
Splitting Data

How you split the data will depend on how much you have. The rule of thumb is to keep as much data as possible for training while retaining a representative sample of different cases for testing: something like 80% for training, 10% for validation, and 10% for test (a sketch of such a split follows the list below). Key aspects of data splitting include:

  • Making sure the different splits don’t have overlapping examples. For instance, if you are using a suspect's case history, make sure there is no overlap between people in the different splits, e.g. different crimes by the same suspect appearing in both training and test data (see the group-based split sketched below).

  • Preserving the distribution of the test data. The test data should mirror the distribution of future values as closely as possible. One way to do that is to sample the data randomly, preferably using stratified sampling to ensure there is enough of each class in the test data. Alternatively, if the data is temporal, you can reserve a certain time span for the test data (both approaches are sketched below).

    • While you may want to change the distribution of the training data, for example by under- or oversampling a minority class (also sketched below), you should not change the distribution of the test data. You can, however, examine subsets of the test data to look for trends in errors, e.g. bias with respect to a protected characteristic.

  • As mentioned in the labelling section, errors can occur whether you are using manual or automatic labelling. Manually verifying the labels in the test data may improve accuracy. For example, if a case is labelled low risk but the individual appears to be high risk, were there interventions in place that led to that label?

  • Validation data should be used to evaluate models during training and to choose the best model; the test data should be held out for final testing. Otherwise, the models will be overfitted to the test data, and the test results will not give an accurate representation of future model accuracy.

  • Cross-validation can be useful if there isn’t much training data, but you should still try to have a held-out test set if possible, or plan to collect test data during development so that it is ready when you have likely models to compare (see the cross-validation sketch below).
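
The sketch below shows one way to produce the 80/10/10 split described above, using scikit-learn; the data is randomly generated purely for illustration.

    import numpy as np
    from sklearn.model_selection import train_test_split

    # Illustrative data: 1,000 examples with 5 features and binary labels.
    X = np.random.rand(1000, 5)
    y = np.random.randint(0, 2, size=1000)

    # First hold out 20% of the data for validation and testing...
    X_train, X_tmp, y_train, y_tmp = train_test_split(
        X, y, test_size=0.2, random_state=42)

    # ...then split that 20% in half: 10% validation, 10% test.
    X_val, X_test, y_val, y_test = train_test_split(
        X_tmp, y_tmp, test_size=0.5, random_state=42)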
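
To keep all of one person's cases in the same split, a group-aware splitter such as scikit-learn's GroupShuffleSplit can be used; the suspect_id array here is a hypothetical person identifier.

    import numpy as np
    from sklearn.model_selection import GroupShuffleSplit

    X = np.random.rand(1000, 5)
    y = np.random.randint(0, 2, size=1000)
    suspect_id = np.random.randint(0, 200, size=1000)  # hypothetical identifier

    # All rows sharing a suspect_id land in the same split.
    splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
    train_idx, test_idx = next(splitter.split(X, y, groups=suspect_id))

    # Sanity check: no suspect appears in both splits.
    assert set(suspect_id[train_idx]).isdisjoint(set(suspect_id[test_idx]))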
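
Preserving the test distribution can be done with stratified sampling, or with a time-based cutoff for temporal data; both are sketched below with illustrative dates.

    import numpy as np
    from sklearn.model_selection import train_test_split

    X = np.random.rand(1000, 5)
    y = np.random.randint(0, 2, size=1000)

    # Stratified sampling keeps the class proportions of y in both splits.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.1, stratify=y, random_state=42)

    # Temporal alternative: reserve the most recent records as test data.
    months = np.arange(np.datetime64("2020-01"), np.datetime64("2024-01"))
    dates = np.sort(np.random.choice(months, size=1000))
    cutoff = np.datetime64("2023-01")  # illustrative cutoff
    X_train, X_test = X[dates < cutoff], X[dates >= cutoff]
    y_train, y_test = y[dates < cutoff], y[dates >= cutoff]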
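
Rebalancing applies to the training split only; one simple approach is to oversample the minority class with scikit-learn's resample, as in this sketch (binary labels assumed, with class 1 the minority).

    import numpy as np
    from sklearn.utils import resample

    # An imbalanced training split: class 1 is the minority (~10%).
    X_train = np.random.rand(800, 5)
    y_train = (np.random.rand(800) < 0.1).astype(int)

    # Oversample the minority class until it matches the majority class.
    n_majority = int((y_train == 0).sum())
    oversampled = resample(X_train[y_train == 1], replace=True,
                           n_samples=n_majority, random_state=42)

    X_balanced = np.vstack([X_train[y_train == 0], oversampled])
    y_balanced = np.concatenate([np.zeros(n_majority), np.ones(n_majority)])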
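
Finally, a minimal cross-validation sketch; the logistic regression model is just a placeholder for whatever model is being evaluated.

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    # A small dataset, where cross-validation makes better use of scarce data.
    X = np.random.rand(200, 5)
    y = np.random.randint(0, 2, size=200)

    # 5-fold cross-validation: each example is used for validation exactly once.
    scores = cross_val_score(LogisticRegression(), X, y, cv=5)
    print(f"accuracy: {scores.mean():.2f} +/- {scores.std():.2f}")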


