Project Preparation
Many projects begin as prototypes. Proof of concept is essential, but the transition to a deployable product with real-world impact requires: a) adherence to sound coding and engineering practices, and b) a reproducible and well-documented process, even if it demands additional time.
Creating a well-organised work environment in which changes to code and datasets are traceable is pivotal to achieving these objectives. DVC's best practices offer useful guidelines, while more comprehensive solutions such as ClearML or MLflow provide local management capabilities and integrate with a range of cloud computing platforms. Alternatively, if you prefer a single provider, platforms like Azure Pipelines offer integrated solutions.
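To make the idea of traceable dataset modifications concrete, here is a minimal sketch using only the Python standard library. It is not the API of DVC, ClearML or MLflow, which handle this far more completely. The helper names (file_fingerprint, record_run), the file names and the parameter dictionary are all hypothetical and chosen purely for illustration.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def file_fingerprint(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def record_run(dataset_path, params, log_path):
    """Append one JSON line tying a dataset fingerprint to the run parameters."""
    entry = {
        "dataset": str(dataset_path),
        "sha256": file_fingerprint(dataset_path),
        "params": params,
    }
    with open(log_path, "a", encoding="utf-8") as log:
        log.write(json.dumps(entry, sort_keys=True) + "\n")
    return entry


# Hypothetical demonstration in a throwaway directory.
with tempfile.TemporaryDirectory() as workdir:
    dataset = Path(workdir) / "records.csv"
    log_file = Path(workdir) / "runs.jsonl"

    dataset.write_text("id,value\n1,10\n2,20\n", encoding="utf-8")
    first = record_run(dataset, {"normalisation": "column"}, log_file)

    # A single edited value changes the fingerprint, so the change is visible in the log.
    dataset.write_text("id,value\n1,10\n2,25\n", encoding="utf-8")
    second = record_run(dataset, {"normalisation": "column"}, log_file)

    logged_runs = len(log_file.read_text(encoding="utf-8").splitlines())

print(first["sha256"] != second["sha256"])
```

Storing the fingerprint next to the parameters means any later result can be matched to the exact dataset version that produced it, which is the property the dedicated tools provide at scale.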
Each machine learning project involves a collection of decisions, some of which are testable, and others less tangible. Even seemingly trivial choices, such as row normalisation versus column normalisation, or the selection of a model family, can significantly impact model effectiveness. The risk of coding errors and faulty assumptions is heightened when working in isolation. We strongly recommend collaboration with other data scientists. Regular discussions within interdisciplinary teams about assumptions and decisions that influence the final product can enhance the overall quality of the project.
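The difference between row and column normalisation can be illustrated with a small sketch. It assumes that "row normalisation" means scaling each sample to unit Euclidean length and that "column normalisation" means standardising each feature to zero mean and unit variance; other definitions exist. The function names and the toy data are hypothetical.

```python
import math


def normalise_rows(matrix):
    """Scale each row (sample) to unit Euclidean length."""
    result = []
    for row in matrix:
        norm = math.sqrt(sum(v * v for v in row))
        result.append([v / norm if norm else 0.0 for v in row])
    return result


def normalise_columns(matrix):
    """Standardise each column (feature) to zero mean and unit variance."""
    n = len(matrix)
    columns = list(zip(*matrix))
    means = [sum(c) / n for c in columns]
    stds = [
        math.sqrt(sum((v - m) ** 2 for v in c) / n)
        for c, m in zip(columns, means)
    ]
    return [
        [(v - m) / s if s else 0.0 for v, m, s in zip(row, means, stds)]
        for row in matrix
    ]


# Hypothetical toy data: two features on very different scales.
data = [[1.0, 100.0], [2.0, 300.0], [3.0, 200.0]]

rows = normalise_rows(data)
cols = normalise_columns(data)

for original, by_row, by_col in zip(data, rows, cols):
    print(original, by_row, by_col)
```

After row normalisation the second feature still dominates every sample, whereas column normalisation puts both features on a comparable scale. The two choices hand a model very different inputs, which is why such decisions are worth discussing and recording.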
A deep understanding of the dataset, including its contents, errors, and the underlying police procedures and processes that contribute to its creation, is essential. Recognising inherent biases is crucial for informed decision-making and selecting appropriate modelling techniques. Automatic data visualisation packages and automated ML training can be beneficial in gaining insights. Techniques such as data clustering can unveil patterns, assess linear separability, and guide automated labelling processes towards promising directions.
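As a rough illustration of how clustering can surface structure in a dataset, the following sketch implements plain k-means (Lloyd's algorithm) with the standard library. The points and starting centroids are hypothetical toy values; in practice a library such as scikit-learn would normally be used, and real data rarely separates this cleanly.

```python
import math


def kmeans(points, centroids, iterations=20):
    """Plain Lloyd's algorithm; returns (centroids, labels)."""
    centroids = [tuple(c) for c in centroids]
    labels = []
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        updated = []
        for k, old in enumerate(centroids):
            members = [p for p, label in zip(points, labels) if label == k]
            if members:
                updated.append(
                    tuple(sum(axis) / len(members) for axis in zip(*members))
                )
            else:
                updated.append(old)  # keep an empty cluster's centroid in place
        if updated == centroids:
            break  # converged
        centroids = updated
    return centroids, labels


# Hypothetical toy data: two visibly separate groups in two dimensions.
points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
centroids, labels = kmeans(points, [points[0], points[3]])

print(labels)
print(centroids)
```

Cluster labels obtained this way can serve as candidate groups for manual review or as a starting point for labelling, but they should be checked against domain knowledge before they are trusted.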