ML Test Score Rubric for Production Readiness

The ML Test Score:

A Rubric for ML Production Readiness and Technical Debt Reduction

Eric Breck, Shanqing Cai, Eric Nielsen, Michael Salib, D. Sculley

Google, Inc.

{ebreck, cais, nielsene, msalib, dsculley}@google.com

Abstract—Creating reliable, production-level machine learning systems brings on a host of concerns not found in small toy examples or even large offline research experiments. Testing and monitoring are key considerations for ensuring the production-readiness of an ML system, and for reducing technical debt of ML systems. But it can be difficult to formulate specific tests, given that the actual prediction behavior of any given model is difficult to specify a priori. In this paper, we present 28 specific tests and monitoring needs, drawn from experience with a wide range of production ML systems to help quantify these issues and present an easy to follow road-map to improve production readiness and pay down ML technical debt.

Keywords-Machine Learning, Testing, Monitoring, Reliability, Best Practices, Technical Debt

I. INTRODUCTION

As machine learning (ML) systems continue to take on ever more central roles in real-world production settings, the issue of ML reliability has become increasingly critical. ML reliability involves a host of issues not found in small toy examples or even large offline experiments, which can lead to surprisingly large amounts of technical debt [1]. Testing and monitoring are important strategies for improving reliability, reducing technical debt, and lowering long-term maintenance cost. However, as suggested by Figure 1, testing an ML system is a more complex challenge than testing a manually coded system, because ML system behavior depends strongly on data and models that cannot be strongly specified a priori. One way to see this is to consider ML training as analogous to compilation, where the source is both code and training data. By that analogy, training data needs testing like code, and a trained ML model needs production practices like a binary does, such as debuggability, rollbacks and monitoring.

So, what should be tested and how much is enough? In this paper, we try to answer this question with a test rubric, which is based on decades of experience engineering production-level ML systems at Google, in systems such as ad click prediction [2] and the Sibyl ML platform [3].

We present a rubric as a set of 28 actionable tests, and offer a scoring system to measure how ready for production a given machine learning system is. This rubric is intended to cover a range from a team just starting out with machine learning up through tests that even a well-established team may find difficult. Note that this rubric focuses on issues specific to ML systems, and so does not include generic software engineering best practices such as ensuring good unit test coverage and a well-defined binary release process. Such strategies remain necessary as well. We do call out a few specific areas for unit or integration tests that have unique ML-related behavior.

How to read the tests: Each test is written as an assertion; our recommendation is to test that the assertion is true, the more frequently the better, and to fix the system if the assertion is not true.

Doesn’t this all go without saying?: Before we enumerate our suggested tests, we should address one objection the reader may have: obviously one should write tests for an engineering project! While this is true in principle, in a survey of several dozen teams at Google, none of these tests was implemented by more than 80% of teams (though, even in an engineering culture valuing rigorous testing, many of these ML-centric tests are non-obvious). Conversely, most tests had a nonzero score for at least half of the teams surveyed; our tests do represent practices that teams find to be worth doing.

In this paper, we are largely concerned with supervised ML systems that are trained continuously online and perform rapid, low-latency inference on a server. Features are often derived from large amounts of data such as streaming logs of incoming data. However, most of our recommendations apply to other forms of ML systems, such as infrequently trained models pushed to client-side systems for inference.

A. Related work

Software testing is well studied, as is machine learning, but their intersection has been less well explored in the literature. [4] reviews testing for scientific software more generally, and cites a number of articles such as [5], who present an approach for testing ML algorithms. These ideas are a useful complement for the tests we present, which are focused on testing the use of ML in a production system rather than just the correctness of the ML algorithm per se.

Zinkevich provides extensive advice on building effective machine learning models in real-world systems [6]. Those rules are complementary to this rubric, which is more concerned with determining how reliable an ML system is rather than how to build one.

© 2017 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works. Published as [7].

Figure 1. ML Systems Require Extensive Testing and Monitoring. The key consideration is that unlike a manually coded system (left), ML-based system behavior is not easily specified in advance. This behavior depends on dynamic qualities of the data, and on various model configuration choices.

Issues of surprising sources of technical debt in ML systems have been studied before [1]. It has been noted that the prior work has identified problems but been largely silent on how to address them; this paper details actionable advice drawn from practice and verified with extensive interviews with the maintainers of 36 real-world systems.

II. TESTS FOR FEATURES AND DATA

Machine learning systems differ from traditional software-based systems in that the behavior of ML systems is not specified directly in code but is learned from data. Therefore, while traditional software can rely on unit tests and integration tests of the code, here we attempt to add a sufficient set of tests of the data.

Data 1: Feature expectations are captured in a schema: It is useful to encode intuitions about the data in a schema so they can be automatically checked. For example, an adult human is surely between one and ten feet in height. The most common word in English text is probably ‘the’, with other word frequencies following a power-law distribution. Such expectations can be used for tests on input data during training and serving (see test Monitor 2).

1 Feature expectations are captured in a schema.
2 All features are beneficial.
3 No feature’s cost is too much.
4 Features adhere to meta-level requirements.
5 The data pipeline has appropriate privacy controls.
6 New features can be added quickly.
7 All input feature code is tested.

Table I. Brief listing of the seven data tests.

How? To construct the schema, one approach is to start by calculating statistics from training data, and then adjust them as appropriate based on domain knowledge. It may also be useful to start by writing down expectations and then compare them to the data to avoid an anchoring bias. Visualization tools such as Facets can be very useful for analyzing the data to produce the schema. Invariants to capture in a schema can also be inferred automatically from your system’s behavior [8].
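As an illustration of Data 1, the height expectation mentioned above could be encoded roughly as follows. This is a minimal sketch rather than a description of any particular schema tool: the names `NumericExpectation`, `SCHEMA`, `validate_example`, and the feature name `height_feet` are hypothetical, and only numeric range checks are covered.

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass(frozen=True)
class NumericExpectation:
    """Closed range that a numeric feature is expected to fall in."""
    minimum: float
    maximum: float


# Hypothetical schema: feature name -> expectation.
# The bounds come from the example in the text (one to ten feet).
SCHEMA: Dict[str, NumericExpectation] = {
    "height_feet": NumericExpectation(minimum=1.0, maximum=10.0),
}


def validate_example(example: Dict[str, Any],
                     schema: Dict[str, NumericExpectation] = SCHEMA) -> List[str]:
    """Return human-readable schema violations (an empty list means valid)."""
    errors = []
    for name, expected in schema.items():
        if name not in example:
            errors.append(f"{name}: missing")
            continue
        value = example[name]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            errors.append(f"{name}: expected a number, got {type(value).__name__}")
        elif not expected.minimum <= value <= expected.maximum:
            # NaN also lands here, because every comparison with NaN is False.
            errors.append(
                f"{name}: {value} outside [{expected.minimum}, {expected.maximum}]")
    return errors
```

The same function can be applied to training examples and to serving requests, so that both paths are checked against one set of expectations.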

Data 2: All features are beneficial: A kitchen-sink approach to features can be tempting, but every feature added has a software engineering cost. Hence, it’s important to understand the value each feature provides in additional predictive power (independent of other features).

How? Some ways to run this test are by computing correlation coefficients, by training models with one or two features, or by training a set of models that each have one of k features individually removed.
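The last option, training k models that each have one feature removed, might look like the sketch below. It assumes NumPy is available and uses ordinary least squares on synthetic data as a stand-in for the real model and dataset; the function names and the feature names `strong`, `weak`, and `noise` are hypothetical.

```python
import numpy as np


def heldout_mse(X_train, y_train, X_test, y_test):
    """Fit least squares with an intercept and return held-out mean squared error."""
    A_train = np.column_stack([X_train, np.ones(len(X_train))])
    A_test = np.column_stack([X_test, np.ones(len(X_test))])
    weights, *_ = np.linalg.lstsq(A_train, y_train, rcond=None)
    return float(np.mean((A_test @ weights - y_test) ** 2))


def ablation_report(X_train, y_train, X_test, y_test, names):
    """Map each feature name to the increase in held-out MSE when it is removed."""
    baseline = heldout_mse(X_train, y_train, X_test, y_test)
    report = {}
    for i, name in enumerate(names):
        keep = [j for j in range(X_train.shape[1]) if j != i]
        mse = heldout_mse(X_train[:, keep], y_train, X_test[:, keep], y_test)
        report[name] = mse - baseline
    return report


# Synthetic data: the target depends strongly on column 0, weakly on
# column 1, and not at all on column 2.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 3))
y = 3.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=400)

report = ablation_report(X[:300], y[:300], X[300:], y[300:],
                         names=["strong", "weak", "noise"])
```

A feature whose removal barely changes the held-out error, like `noise` here, is a candidate for deletion; whether the change is "too small" is a judgment that also depends on the feature's cost (see Data 3).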

Data 3: No feature’s cost is too much: It is not only a waste of computing resources, but also an ongoing maintenance burden to include features that add only minimal predictive benefit [1].

How? To measure the costs of a feature, consider not only added inference latency and RAM usage, but also more upstream data dependencies, and additional expected instability incurred by relying on that feature. See Rule #22 [6] for further discussion.

Data 4: Features adhere to meta-level requirements: Your project may impose requirements on the data coming in to the system. It might prohibit features derived from user data, prohibit the use of specific features like age, or simply prohibit any feature that is deprecated. It might require all features be available from a single source. However, during model development and experimentation, it is typical to try out a wide variety of potential features to improve prediction quality.

How? Programmatically enforce these requirements, so that all models in production properly adhere to them.
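One way to enforce such requirements programmatically is a check that runs before a model is allowed into production. The sketch below is only an assumption about how feature metadata might be represented: `FeatureSpec`, `FeaturePolicy`, `policy_violations`, and the example feature names and sources are all hypothetical.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, List


@dataclass(frozen=True)
class FeatureSpec:
    """Metadata a team might record for each candidate feature (hypothetical)."""
    name: str
    source: str
    derived_from_user_data: bool = False
    deprecated: bool = False


@dataclass(frozen=True)
class FeaturePolicy:
    """Project-level requirements, mirroring the examples in the text."""
    prohibited_names: FrozenSet[str] = frozenset()
    allow_user_data: bool = False
    allow_deprecated: bool = False


def policy_violations(features: Iterable[FeatureSpec],
                      policy: FeaturePolicy) -> List[str]:
    """Return one message per violated requirement (empty list means compliant)."""
    violations = []
    for feature in features:
        if feature.name in policy.prohibited_names:
            violations.append(f"{feature.name}: explicitly prohibited")
        if feature.derived_from_user_data and not policy.allow_user_data:
            violations.append(f"{feature.name}: derived from user data")
        if feature.deprecated and not policy.allow_deprecated:
            violations.append(f"{feature.name}: deprecated")
    return violations


POLICY = FeaturePolicy(prohibited_names=frozenset({"age"}))
CANDIDATES = [
    FeatureSpec("query_length", source="search_logs"),
    FeatureSpec("age", source="profiles", derived_from_user_data=True),
    FeatureSpec("old_ctr", source="search_logs", deprecated=True),
]
```

A release check could then refuse to push any model for which `policy_violations(CANDIDATES, POLICY)` is non-empty, while experiments remain free to try prohibited features offline.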

Data 5: The data pipeline has appropriate privacy controls: Training data, validation data, and vocabulary files all have the potential to contain sensitive user data. While teams often are aware of the need to remove personally identifiable information (PII), during this kind of export and transformation, programming errors and system changes can lead to inadvertent PII leakages that may have serious consequences.

Facets: https://pair-code.github.io/facets/

How? Make sure to budget sufficient time during new feature development that depends on sensitive data to allow for proper handling. Test that access to pipeline data is controlled as tightly as the access to raw user data, especially for data sources that haven’t previously been used in ML. Finally, test that any user-requested data deletion propagates to the data in the ML training pipeline, and to any learned models.

Data 6: New features can be added quickly: The faster a team can go from a feature idea to the feature running in production, the faster it can both improve the system and respond to external changes. For highly efficient teams, this can be as little as one to two months even for global-scale, high-traffic ML systems. Note that this can be in tension with Data 5, but privacy should always take precedence.

Data 7: All input feature code is tested: Feature creation code may appear simple enough to not need unit tests, but this code is crucial for correct behavior and so its continued quality is vital. Bugs in features may be almost impossible to detect once they have entered the data generation process, especially if they are represented in both training and test data.
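To make Data 7 concrete, the sketch below pairs a small piece of feature code with tests for its edge cases. The `bucketize` function and the boundary values are hypothetical; the tests are written as plain functions in the style a runner such as pytest would collect.

```python
import bisect
import math


def bucketize(value, boundaries):
    """Return the index of the bucket that `value` falls into.

    `boundaries` must be sorted in ascending order. Bucket 0 holds values
    below the first boundary; a value equal to a boundary goes to the
    bucket above it.
    """
    if math.isnan(value):
        raise ValueError("cannot bucketize NaN")
    return bisect.bisect_right(boundaries, value)


# Hypothetical bucket boundaries for a "query length" feature.
QUERY_LENGTH_BOUNDARIES = [5, 20, 100]


def test_values_below_first_boundary_go_to_bucket_zero():
    assert bucketize(4, QUERY_LENGTH_BOUNDARIES) == 0


def test_boundary_value_belongs_to_upper_bucket():
    assert bucketize(5, QUERY_LENGTH_BOUNDARIES) == 1
    assert bucketize(100, QUERY_LENGTH_BOUNDARIES) == 3


def test_nan_is_rejected():
    try:
        bucketize(float("nan"), QUERY_LENGTH_BOUNDARIES)
    except ValueError:
        return
    raise AssertionError("expected ValueError for NaN input")
```

Boundary and NaN handling are exactly the kind of behavior that looks too simple to test, yet would silently shift both training and test data if it changed.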

III. TESTS FOR MODEL DEVELOPMENT

While the field of software engineering has developed a full range of best practices for developing reliable software systems, similar best-practices for ML model development are still emerging.

Model 1: Every model specification undergoes a code review and is checked in to a repository: It can be tempting to avoid code review out of expediency, and run experiments based on one’s own personal modifications. In addition, when responding to production incidents, it’s crucial to know the exact code that was run to produce a given learned model. For example, a responder might need to re-run training with corrected input data, or compare the result of a particular modification. Proper version control of the model specification can help make training auditable and improve reproducibility.

1 Model specs are reviewed and submitted.
2 Offline and online metrics correlate.
3 All hyperparameters have been tuned.
4 The impact of model staleness is known.
5 A simpler model is not better.
6 Model quality is sufficient on important data slices.
7 The model is tested for considerations of inclusion.

Table II. Brief listing of the seven model tests.

Model 2: Offline proxy metrics correlate with actual online impact metrics: A user-facing production system’s impact is judged by metrics of engagement, user happiness, revenue, and so forth. A machine learning system is trained to optimize loss metrics such as log-loss or squared error. A strong understanding of the relationship between these offline proxy metrics and the actual impact metrics is needed to ensure that a better scoring model will result in a better production system.

How? The offline/online metric relationship can be measured in one or more small scale A/B experiments using an intentionally degraded model.

Model 3: All hyperparameters have been tuned: An ML model can often have multiple hyperparameters, such as learning rates, number of layers, layer sizes and regularization coefficients. Choice of the hyperparameter values can have dramatic impact on prediction quality.

How? Methods such as a grid search [9] or a more sophisticated hyperparameter search strategy [10] [11] not only improve prediction quality, but also can uncover hidden reliability issues. Substantial performance improvements have been realized in many ML systems through use of an internal hyperparameter tuning service [12].
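For readers unfamiliar with grid search, the basic idea is sketched below using only the standard library. This is not the internal tuning service mentioned above; `grid_search` is a hypothetical helper, and `toy_validation_loss` is a stand-in for "train a model with these hyperparameters and return its validation loss".

```python
import itertools


def grid_search(evaluate, grid):
    """Evaluate every combination in `grid`; return (best_params, best_loss).

    `grid` maps each hyperparameter name to a list of candidate values, and
    `evaluate` maps a dict of hyperparameters to a loss (lower is better).
    """
    names = list(grid)
    best_params, best_loss = None, float("inf")
    for values in itertools.product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        loss = evaluate(params)
        if loss < best_loss:
            best_params, best_loss = params, loss
    return best_params, best_loss


def toy_validation_loss(params):
    # Stand-in objective with its minimum at learning_rate=0.1, l2=0.01.
    return (params["learning_rate"] - 0.1) ** 2 + (params["l2"] - 0.01) ** 2


GRID = {
    "learning_rate": [0.01, 0.1, 1.0],
    "l2": [0.0, 0.01, 0.1],
}

best_params, best_loss = grid_search(toy_validation_loss, GRID)
```

Because the number of combinations grows multiplicatively with each hyperparameter, exhaustive grids are practical only for a few parameters, which is one motivation for the more sophisticated strategies cited above.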

Model 4: The impact of model staleness is known: Many production ML systems encounter rapidly changing, non-stationary data. Examples include content recommendation systems and financial ML applications. For such systems, if the pipeline fails to train and deploy sufficiently up-to-date models, we say the model is stale. Understanding how model staleness affects the quality of predictions is necessary to determine how frequently to update the model. If predictions are based on a model trained yesterday versus last week versus last year, what is the impact on the live metrics of interest? Most models need to be updated eventually to account for changes in the external world; a careful assessment is important to decide how often to perform the updates (see Rule 8 in [6] for related discussion).

How? One way of testing the impact of staleness is with a small A/B experiment with older models. Testing a range of ages can provide an age-versus-quality curve to help understand what amount of staleness is tolerable.
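Once such an age-versus-quality curve has been measured, reading off the tolerable staleness can be automated. The sketch below assumes a quality metric where higher is better and a tolerance expressed as a relative drop from the freshest model; the function name and the numbers in `CURVE` are made up purely for illustration and are not measurements from any system.

```python
def max_tolerable_age(curve, max_relative_drop):
    """Return the largest tested model age that stays within tolerance.

    `curve` is a list of (age_in_days, quality) pairs from experiments with
    models of different ages, where higher quality is better. The freshest
    model is the reference. Scanning from youngest to oldest, the search
    stops at the first age whose relative quality drop exceeds
    `max_relative_drop`.
    """
    points = sorted(curve)
    fresh_quality = points[0][1]
    tolerable = None
    for age, quality in points:
        if (fresh_quality - quality) / fresh_quality > max_relative_drop:
            break
        tolerable = age
    return tolerable


# Hypothetical measurements: (model age in days, live quality metric).
CURVE = [(1, 0.800), (7, 0.792), (30, 0.760), (365, 0.610)]

tolerable_days = max_tolerable_age(CURVE, max_relative_drop=0.02)
```

With these illustrative numbers and a 2% tolerance, the week-old model is acceptable but the month-old one is not, which would suggest retraining at least weekly.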

Model 5: A simpler model is not better: Regularly testing against a very simple baseline model, such as a linear model with very few features, is an effective strategy both for confirming the functionality of the larger pipeline and for helping to assess the cost to benefit tradeoffs of more sophisticated techniques.

Model 6: Model quality is sufficient on all important data slices: Slicing a data set along certain dimensions of interest can improve fine-grained understanding of model quality. Slices should distinguish subsets of the data that might behave qualitatively differently, for example, users by

The service is closely related to HyperTune[13].
