
Modelling 1: Knowing the data

Modelling is as much an art as it is a science. It's important to familiarise yourself with the training data set and its properties, to get a feeling for which algorithms are likely to work best for your data.  #13 Machine Learning Engineering for Production (MLOps) Specialization [Course 1, Week 2, Lesson 5]


  • For tabular data:

    • One way to do this is to use a dataset-profiling package such as pandas-dq or ydata-profiling; these are two of several options, which also include deep-learning-oriented tools like TensorFlow Data Validation.

    • Clustering can also help uncover the landscape of the data. You can experiment with different normalisation and preprocessing strategies to see whether the clusters become more coherent, and compare them against the ground-truth labels to see how well they align.

    • Visualising the labelled data also helps. For example, if the labels, when plotted in 2D or 3D space (using a dimensionality-reduction technique such as PCA), do not exhibit any clustering, it may indicate that you need an algorithm that can handle complicated, intermingled data. You might then consider deeper neural networks, or a combination of multiple algorithms, either all working on the same data or divided by modality (e.g. a different type of classifier for the text components and the tabular components).

  • Visualisation and exploration techniques also exist for time-series, geospatial, and text data. For example:

    • Contrastive analysis can be a useful tool for understanding how your dataset differs from general data, e.g. how does data specific to one crime differ from data on all crimes mixed together? This works well with text: comparing the text of police reports against a Wikipedia or newspaper corpus to see which phrases are outliers, and how the language use differs, might help you figure out which features are salient, or why an out-of-the-box language model isn't working for you.
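
The profiling step for tabular data can also be done by hand when you don't want a full report. Below is a minimal sketch using only pandas: `quick_profile` and the toy data frame are hypothetical names invented for illustration, summarising each column's dtype, missing fraction, and cardinality (the kind of overview pandas-dq or ydata-profiling produce in much richer form).

```python
import numpy as np
import pandas as pd

def quick_profile(df: pd.DataFrame) -> pd.DataFrame:
    """Per-column summary: dtype, fraction of missing values, cardinality."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "missing_frac": df.isna().mean(),
        "n_unique": df.nunique(dropna=True),
    })

# Hypothetical toy data with some missing values.
df = pd.DataFrame({
    "age": [34, 29, np.nan, 51],
    "city": ["Leeds", "York", "Leeds", None],
})
print(quick_profile(df))
```

Columns with a high missing fraction or suspiciously low cardinality tend to be the first things worth investigating before any modelling.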
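
The clustering idea above, trying different normalisation strategies and checking agreement with ground-truth labels, can be sketched with scikit-learn. The Iris dataset stands in for your data, and the choice of k-means with three clusters is an assumption for the example:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X, y = load_iris(return_X_y=True)  # stand-in for your own labelled data

# Try several preprocessing strategies and see which one yields
# clusters that agree best with the ground-truth labels.
strategies = {
    "raw": X,
    "standardised": StandardScaler().fit_transform(X),
    "min-max": MinMaxScaler().fit_transform(X),
}
for name, Xp in strategies.items():
    labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(Xp)
    ari = adjusted_rand_score(y, labels)  # 1.0 = perfect agreement
    print(f"{name:>12}: ARI vs ground truth = {ari:.2f}")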
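
The PCA visualisation described above can be sketched as follows; the Iris dataset is again a stand-in, and the "centroid span over within-class spread" ratio is a made-up quick numeric proxy for whether the labels form visible clusters:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, y = load_iris(return_X_y=True)  # stand-in for your labelled tabular data

# Project into 2D so the labels can be plotted and inspected by eye.
pca = PCA(n_components=2)
Z = pca.fit_transform(X)
print("explained variance ratio:", pca.explained_variance_ratio_)

# Rough numeric proxy for "do the labels cluster?": how far apart the
# class centroids sit relative to the within-class spread.
classes = np.unique(y)
centroids = np.stack([Z[y == c].mean(axis=0) for c in classes])
spread = np.mean([Z[y == c].std() for c in classes])
span = np.linalg.norm(centroids.max(axis=0) - centroids.min(axis=0))
print("centroid span / within-class spread:", span / spread)

# To actually look at it: plt.scatter(Z[:, 0], Z[:, 1], c=y)
```

If the 2D projection shows no separation at all, that's the signal (per the note above) to reach for more expressive models rather than linear ones.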
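
One common way to do the contrastive text comparison above is smoothed log-odds of token frequencies between the two corpora. This is a stdlib-only sketch; the `log_odds` function and the two toy corpora are invented for illustration:

```python
import math
from collections import Counter

def log_odds(target: list[str], background: list[str], k: float = 0.5) -> dict:
    """Smoothed log-odds of each token in `target` vs `background`.
    Positive scores mark words over-represented in the target corpus."""
    t, b = Counter(target), Counter(background)
    nt, nb = sum(t.values()), sum(b.values())
    vocab = set(t) | set(b)
    return {
        w: math.log((t[w] + k) / (nt + k * len(vocab)))
           - math.log((b[w] + k) / (nb + k * len(vocab)))
        for w in vocab
    }

# Hypothetical toy corpora: domain-specific reports vs general text.
reports = "suspect vehicle fled scene suspect detained".split()
general = "the vehicle drove to the scene of the concert".split()

scores = log_odds(reports, general)
top = sorted(scores, key=scores.get, reverse=True)[:3]
print(top)  # domain-salient terms float to the top
```

On real corpora you would tokenise properly and compare n-grams as well as single tokens, but the principle is the same: the highest-scoring terms are candidates for salient features.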


