As artificial intelligence becomes more prevalent in our daily lives, machine learning (ML), its core enabler, is consuming a greater share of software development efforts across industries. Therefore, more machine learning tools, methods, and products are developed using the principles, processes, and tools of Agile methodologies like scrum, kanban, and lean.
However, ML modeling (the tasks involved with identifying and implementing an appropriate machine learning algorithm; selecting the data, metrics, training; tuning the features and algorithm; and then producing the target model) is often conducted by data scientists who are not familiar with software engineering or Agile approaches and who have difficulty harmonizing their research activity with Agile project management, time boxes, or engineering configuration management. This article proposes a set of better practices, designed by and for eBay ML scientists, for facilitating weaving ML modeling into the cyclical Agile process flow.
Is research ever “done”?
One important element of the Agile methodology is the Definition of Done (DoD) for “shippability” and keeping each modicum of incremental work shippable at all times. The DoD is a list of requirements, or acceptance criteria, to which software must always adhere in order to be called complete and accepted by an end user customer, team, or consuming system. However, standard acceptance criteria, such as unit test coverage, code reviewed, or functional tests passed, are inappropriate to ensure ML modeling quality, as they fail to address essential success criteria of the modeling task. Indeed, the absence of quantifiable requirements in modeling quite often leads to misunderstandings, sometimes to frustration (on all sides), when for example an engineering scrum master asks a data scientist “when will you complete your research?”
We argue that well-known ML best practices can be very helpful in enabling Agile modeling if specified as requirements from the very beginning of a project and repeated throughout the entire life cycle of an ML model, from problem definition all the way through deployment to production, maintenance, refactoring, and end-of-life. More precisely, we believe that specifying and agreeing upfront on requirements elicits a discussion around their achievability which, in turn, naturally leads to an iterative mindset, a core tenet of Agile. Figure 1 highlights six important phases of ML modeling and their acceptance criteria. The rest of this article describes these requirements in more detail.
Figure 1. The six phases of ML modeling and their acceptance criteria.
Before you start. At the beginning of an AI project, before any technical work has started, it is critical to get clarity on the business or technical problem to which the ML model will be applied and on how the accuracy of the model’s predictions relates to the overall objective. More precisely, we have observed that answering the following questions before a sprint starts helps communicate precise qualitative requirements for the performance of the model:
- What business problem are you trying to solve? For which business measurement are you optimizing? Increased net revenue? Increased transaction rate? Increased market share in a different category? Acquiring new profitable, high-spending buyers or frequent shoppers?
- What are your scientific evaluation criteria? How does your scientific optimization correlate with your business optimization?
- What is your baseline? What is the industry baseline? What is the minimum viable performance you must achieve to declare this iteration of your model a success at the end of a time box?
The answers to the last two questions are particularly critical. Understanding how the scientific metrics correlate with business metrics allows you to quantify the return on investment (ROI) of each measured increment of improvement of the scientific metrics. What would be the business impact of a model with an “accuracy” of 80%? 90%? 95%? Clarifying the current baseline helps define the minimum success bar. In a competitive market, it is important to understand how other companies perform compared to your current feature or service. If you are iterating on an existing model, you need to clarify how much better the new model must perform. If not, you must still quantify the minimum performance needed to reach a satisfactory level for success of your effort.
While it is obvious that data quality is paramount for ML modeling, two aspects of data preparation are often overlooked: how the data is sampled and how it is split into training, validation, and test sets.
Data sampling: don’t forget the body and the tail. Critically, the data used for training, tuning, and evaluating an ML model should be as close as possible to the production data and its distribution. In particular, attention must be paid to the head, body, and tail of the distribution of interest. Evaluating a model only on the head of the distribution is a common pitfall of ML modeling. Note, however, that some scenarios (such as unbalanced classes) require re-sampling the training data and purposefully training the model on a different distribution. Furthermore, evaluating a model on old test data should be avoided, as it runs the risk of rewarding old models for being outdated and punishing newer models for being current. And of course, seasonality and other time series patterns in the data should be accounted for when sampling.
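One way to evaluate on the head, body, and tail separately is sketched below. It assumes each example carries a key (for instance a query string) whose frequency defines the distribution of interest. The function names and the 50% and 90% traffic cut-offs are illustrative assumptions, not values prescribed by this article.

```python
from collections import Counter

def bucket_by_frequency(keys, head_share=0.5, body_share=0.9):
    """Label each distinct key 'head', 'body', or 'tail' by cumulative traffic share.

    head_share and body_share are hypothetical cut-offs chosen for illustration.
    """
    counts = Counter(keys)
    total = sum(counts.values())
    buckets, covered = {}, 0
    # Walk keys from most to least frequent, tracking the traffic already covered.
    for key, count in counts.most_common():
        if covered < head_share * total:
            buckets[key] = "head"
        elif covered < body_share * total:
            buckets[key] = "body"
        else:
            buckets[key] = "tail"
        covered += count
    return buckets

def accuracy_per_bucket(examples, buckets):
    """examples: iterable of (key, is_correct) pairs; returns accuracy per bucket."""
    hits, seen = Counter(), Counter()
    for key, is_correct in examples:
        bucket = buckets[key]
        seen[bucket] += 1
        hits[bucket] += int(is_correct)
    return {bucket: hits[bucket] / seen[bucket] for bucket in seen}
```

Reporting one number per bucket, rather than a single aggregate, makes it visible when a model does well on frequent items but poorly on rare ones.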
Data splitting: no déjà vu! Any ML scientist knows that training, validation, and test data should not overlap, in order to ensure a reliable estimate of the performance of the model on future unseen data. However, it is sometimes overlooked that real-life data may contain identical or near-duplicate samples. While this may be due to the nature of the underlying distribution governing the data, it deserves special attention to make sure that duplicate samples do not dominate the validation and test data and do not bias the estimate of the performance of the model.
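A minimal sketch of a split that keeps identical samples together is shown below. It assumes text samples and treats two samples as duplicates when they match after lowercasing and whitespace normalization. The helper names and the 80/10/10 proportions are assumptions made for illustration. True near-duplicate detection (for example with shingling or MinHash) is not covered here.

```python
import hashlib

def normalize(text):
    """Collapse case and whitespace so trivially different copies compare equal."""
    return " ".join(text.lower().split())

def assign_split(text, validation_pct=10, test_pct=10):
    """Pick a split from the hash of the normalized text.

    Every copy of the same normalized sample gets the same hash, so it always
    lands in the same split. The percentages are illustrative assumptions.
    """
    digest = hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100
    if bucket < test_pct:
        return "test"
    if bucket < test_pct + validation_pct:
        return "validation"
    return "train"

def split_dataset(samples):
    """Distribute samples into train, validation, and test lists."""
    splits = {"train": [], "validation": [], "test": []}
    for sample in samples:
        splits[assign_split(sample)].append(sample)
    return splits
```

Because the assignment depends only on the sample content, the split is also reproducible across runs. Duplicates that end up inside the validation or test set can additionally be collapsed to one copy if they would otherwise dominate the evaluation.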
To summarize, the following two questions must be addressed in the data preparation phase:
- Did you sample separately from the head, body, and tail of the distribution so that you can evaluate your model on each of these?
- Does your training data overlap with validation data or test data?
Use industry standards! While the target metrics should have been identified in the Problem Definition phase, it is important to crystalize them before starting the training and evaluation phase, for two reasons. First, to ensure that industry standards are used. Table 1 lists some of the most commonly used ML metrics. While it is sometimes justified to create a new metric, standard metrics can be effectively used in a wide range of settings. Second, to ensure that the metrics used to evaluate the model and the loss function used to train it are consistent.
In summary, the requirements for the metrics definition phase can be formulated as:
- Are you using industry standards?
- Are your metrics and your loss function consistent?
Classification: Accuracy, Precision and Recall, F1, ROC curves, Precision and Recall curves.
Regression: Root Mean Squared Error, Maximum Absolute Error.
Probability Distribution Estimation: Log loss scores such as Negative Log Likelihood, Cross Entropy, KL Divergence.
Ranking: nDCG, DCG, AUC, Kendall Tau, Precision @k for various low values of k.
Text generation (translation, speech recognition, summarization): BLEU, TER, WER, ROUGE.
Table 1. Some of the most common machine learning success measurements.
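As an illustration of two of the measurements listed in Table 1, the sketch below computes precision, recall, and F1 for a binary classifier, and nDCG@k for a ranked list. The function names are hypothetical, and the nDCG variant shown uses the graded relevance directly as the gain, which is one common convention among several.

```python
import math

def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision, recall, and F1 for the class labeled `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def ndcg_at_k(relevances, k):
    """nDCG@k for graded relevances given in ranked order (linear gain, log2 discount)."""
    def dcg(rels):
        # Position 0 is discounted by log2(2) = 1, position 1 by log2(3), and so on.
        return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(rels[:k]))
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal else 0.0
```

In practice, a well-tested library implementation is usually preferable to hand-written metrics. The point of the sketch is only to make the definitions concrete.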
The training phase of ML model preparation is mostly about hyperparameter tuning, the task of identifying the parameters of the learning algorithm that result in the best model. Hyperparameters should be tuned on the validation data only, not on the test data. Ideally, the test data should be used only once, to confirm that the model provides consistent performance on unseen data. There are two frequent reasons for inconsistent performance. The most common cause is overfitting the training and validation data, which can be prevented using well-known techniques, such as removing features, adding training data, early stopping, etc. The second most common cause is that the test data and the validation data are not drawn from the same distribution, and one of them is not representative of production data. In the latter case, the data preparation phase must be revisited.
Note that if error analysis is performed using the test data, the test data must be discarded (or added to the training or validation sets) and a new test set should be generated to avoid overfitting the test data. In all cases, having a good synthetic data generator or a frequent feed of redacted production data for testing is invaluable.
The requirements of the training phase can be summarized with one question:
- Did you tune your hyperparameters on the validation set only?
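The discipline described above can be sketched as follows: candidate hyperparameters are compared on the validation set, and the test set is touched exactly once with the selected value. The toy one-dimensional k-nearest-neighbor classifier and all function names are assumptions chosen to keep the example self-contained. Any learning algorithm could take its place.

```python
def knn_predict(train, x, k):
    """Majority vote among the k training points closest to x (1-D features, 0/1 labels)."""
    neighbors = sorted(train, key=lambda pair: abs(pair[0] - x))[:k]
    votes = sum(label for _, label in neighbors)
    return 1 if 2 * votes > len(neighbors) else 0

def accuracy(train, data, k):
    """Fraction of (x, y) pairs in `data` predicted correctly."""
    return sum(knn_predict(train, x, k) == y for x, y in data) / len(data)

def tune_then_test(train, validation, test, candidate_ks):
    """Select k on the validation set, then evaluate once on the test set."""
    # Model selection uses the validation set only.
    best_k = max(candidate_ks, key=lambda k: accuracy(train, validation, k))
    # The test set is used a single time, with the hyperparameter already frozen.
    return best_k, accuracy(train, test, best_k)
```

If the test accuracy returned here differs markedly from the validation accuracy, that is a signal to look for overfitting or for a distribution mismatch between the two sets, not to tune further against the test set.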
It’s all about the baseline. We highlighted in the Problem Definition section the importance of identifying the strongest possible baseline. Of course, beating the baseline and achieving minimum viable performance is a requirement of the Evaluation phase. However, aiming initially at modest improvements over the baseline and iteratively publishing shippable models through successive refinement is an invaluable and key benefit of the Agile methodology.
Statistical significance. If the improvement over the baseline is small, statistical significance testing should be performed to ensure that the apparently superior performance of the model is not due to chance or noise in the data and is likely to be observed in production on unseen data. Student’s t-test, Welch’s t-test, and the Mann-Whitney U test are examples of well-known tests for regression; McNemar’s test and the Stuart-Maxwell test for classification.
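As one concrete case, the sketch below implements the exact (binomial) form of McNemar’s test for comparing a candidate classifier against a baseline on the same test examples. The function name and argument names are hypothetical. The test considers only the discordant examples, those that exactly one of the two models classifies correctly.

```python
from math import comb

def mcnemar_exact(y_true, pred_baseline, pred_candidate):
    """Two-sided exact McNemar test on paired predictions.

    Returns (b, c, p_value), where b counts examples only the baseline gets
    right and c counts examples only the candidate gets right.
    """
    triples = list(zip(y_true, pred_baseline, pred_candidate))
    b = sum(1 for t, pb, pc in triples if pb == t and pc != t)
    c = sum(1 for t, pb, pc in triples if pb != t and pc == t)
    n = b + c
    if n == 0:
        return b, c, 1.0  # The models never disagree on correctness.
    # Under the null hypothesis, discordant pairs split like Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)
```

A small p-value suggests that the difference in error rates is unlikely to be due to chance alone. The significance threshold itself (0.05 is a common convention) should be agreed upon before the evaluation is run.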
Confidence Score: “a reasonable probability is the only certainty” (E.W. Howe). If your model is required to output a confidence score for its predictions, it is important to ensure that the score is a well-calibrated probability, meaning that the confidence score matches the true likelihood of correctness. In other words, when the model is p% confident about its prediction (say 80% confident), it should actually be correct p% of the time. Confidence scores can be calibrated using a validation set.
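One common way to check calibration, shown here as a sketch and not as the method used by the authors, is the expected calibration error (ECE): predictions are grouped into confidence bins, and the gap between average confidence and observed accuracy is averaged across bins, weighted by bin size. The function name and the choice of ten equal-width bins are assumptions.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average gap between confidence and accuracy over equal-width bins.

    confidences: predicted probabilities in [0, 1] for the chosen label.
    correct: 1 (or True) where the prediction was right, 0 (or False) otherwise.
    """
    bins = [[] for _ in range(n_bins)]
    for confidence, is_correct in zip(confidences, correct):
        # A confidence of exactly 1.0 falls into the last bin.
        index = min(int(confidence * n_bins), n_bins - 1)
        bins[index].append((confidence, is_correct))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_confidence = sum(c for c, _ in members) / len(members)
        bin_accuracy = sum(int(ok) for _, ok in members) / len(members)
        ece += len(members) / total * abs(avg_confidence - bin_accuracy)
    return ece
```

A value near zero on held-out data indicates that an 80% confidence really does correspond to being right about 80% of the time. A large value suggests that a calibration step on the validation set is needed.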
Don’t forget operating constraints. Finally, it is critical to ensure early on that the model also meets operating requirements. Examples of operating constraints include inference latency, throughput, and availability, as well as expected usage of CPU, GPU, TPU, memory, SSD, HDD, and network bandwidth.
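For the latency constraint, a check can be as simple as the sketch below, which times a prediction function and compares a tail percentile against a budget. The `predict` callable, the function names, and the use of the 95th percentile are assumptions for illustration. Real measurements should be taken on production-like hardware and traffic.

```python
import math
import time

def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list (pct between 0 and 100)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[min(rank, len(ordered)) - 1]

def measure_latencies_ms(predict, inputs):
    """Time one call to predict(x) per input and return latencies in milliseconds."""
    latencies = []
    for x in inputs:
        start = time.perf_counter()
        predict(x)
        latencies.append((time.perf_counter() - start) * 1000.0)
    return latencies

def meets_latency_budget(latencies_ms, budget_ms, pct=95):
    """True when the chosen percentile latency is within the budget."""
    return percentile(latencies_ms, pct) <= budget_ms
```

Checking a tail percentile, not the mean, reflects the fact that a few slow requests can violate a service-level objective even when the average looks healthy.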
To summarize, the following requirements must be met in the Evaluation phase:
- Do you exceed your baseline and reach minimum viable performance?
- Is your result statistically significant?
- Is your confidence score well calibrated?
- Do you meet the operating constraints?
“ML models need love, too” (J. Kobielus [1]). The task of model building does not end with a specific model being handed over for production deployment. It is important to establish and enforce a maintenance plan that ensures proactive refresh of the model (as opposed to waiting until some metrics go down). Besides, formalizing such a plan forces a dialog between the modeling Agile team and the engineering Agile team and facilitates weaving modeling into the Agile work process.
Before handing a model to production, the following questions must be answered:
- How will you monitor performance?
- How often will you retrain the model?
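When labels arrive too late to monitor accuracy directly, one proxy is to watch for drift in the model inputs or scores. The sketch below computes the population stability index (PSI) between a reference sample (for example, the training data) and a recent production sample of one numeric feature. PSI is a technique brought in here for illustration and is not prescribed by this article. The function name, the ten equal-width bins, and the smoothing constant are assumptions.

```python
import math

def population_stability_index(expected, actual, n_bins=10, eps=1e-6):
    """PSI between two samples of one numeric feature.

    Bins are equal-width over the range of `expected`; values of `actual`
    outside that range are clamped into the first or last bin.
    """
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / n_bins or 1.0  # Avoid a zero width for constant features.

    def shares(values):
        counts = [0] * n_bins
        for value in values:
            index = int((value - lo) / width)
            counts[min(max(index, 0), n_bins - 1)] += 1
        # eps keeps empty bins from producing log(0) or division by zero.
        return [max(count / len(values), eps) for count in counts]

    e, a = shares(expected), shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))
```

A PSI of zero means the binned distributions are identical. A frequently quoted rule of thumb treats values above roughly 0.2 as a sign of significant shift, but the alerting threshold and the resulting retraining cadence should be agreed upon between the modeling and engineering teams.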
Good times come and go but good documentation is forever! While the Agile manifesto favors “working software over comprehensive documentation,” ML modeling is not as self-explanatory or reproducible as standard code and needs to be documented appropriately. In particular, we believe that the following should be archived and documented:
- The code used to sample the data (training, validation, and test).
- How to reproduce and operate the model.
- The test data.
Hopefully, we have convinced you that specifying upfront clear and quantifiable requirements for each phase of the ML modeling process fosters model quality, quick iterations, better communication, and closer collaboration between the ML scientists and their partners, namely the business and the Agile engineering team responsible for building the inferencing engine and deploying the model in production.
We intentionally kept these requirements simple and actionable, in the spirit of the Agile manifesto to favor “individuals and interactions over processes and tools.” However, acceptance criteria are just one of the tools that the Agile methodologies advocate. And if you have experience with extending some of these tools to ML modeling or data science, we would love to hear from you!
The author would like to thank all the co-workers who have been involved in the design of these best practices: Alex Cozzi, John Drakopoulos, Giri Iyengar, Alan Lu, Selcuk Kopru, Sriganesh Madhvanath, Robinson Piramuthu, Ashok Ramani, and Mitch Wyle.
Special thanks to Robinson Piramuthu and Mitch Wyle for their careful review of the draft of this article.
[1] J. Kobielus. “Machine learning models need love, too.” https://www.infoworld.com/article/3029667/machine-learning-models-need-love-too.html