The New Machine Learning Life Cycle: 5 Challenges for DevOps

by Diego Oppenheimer

SEATTLE – Machine learning is fundamentally different from traditional software development and requires its own process: the ML development life cycle.

More and more companies are deciding to build their own, internal ML platforms and are starting down the road of the ML development life cycle. Doing so, however, is difficult and requires much coordination and careful planning. In the end, though, companies are able to control their own ML futures and keep their data secure.

After years of helping companies achieve this goal, we have identified five challenges every organization should keep in mind when they build infrastructure to support ML development.



Depending on the ML use case, a data scientist might choose to build a model in Python, R, or Scala and use an entirely different language for a second model. What’s more, within a given language, there are numerous frameworks and toolkits available. TensorFlow, PyTorch, and scikit-learn all work with Python, but each is tuned for specific types of operations, and each outputs a slightly different type of model.

ML-enabled applications typically call on a pipeline of interconnected models, often written by different teams using different languages and frameworks.

An ML pipeline that extracts text from scanned documents and analyzes its sentiment using several languages and frameworks
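
A pipeline like the one in the caption above can be sketched as a chain of stages that share a simple calling convention. This is only an illustration in Python: `extract_text` and `score_sentiment` are hypothetical stand-ins for real models (which might live in different languages or frameworks), and the word lists are invented for the example.

```python
from typing import Any, Callable, List

# A stage is any callable. In practice each one might wrap a model built
# by a different team in a different language or framework.
Stage = Callable[[Any], Any]


def extract_text(scanned_document: dict) -> str:
    """Hypothetical stand-in for an OCR model."""
    return scanned_document["ocr_text"]


def score_sentiment(text: str) -> float:
    """Hypothetical stand-in for a sentiment model; returns a score in [-1, 1]."""
    positive = {"good", "great", "excellent"}
    negative = {"bad", "poor", "terrible"}
    words = [w.strip(".,!?").lower() for w in text.split()]
    pos = sum(w in positive for w in words)
    neg = sum(w in negative for w in words)
    return (pos - neg) / max(pos + neg, 1)


def run_pipeline(stages: List[Stage], payload: Any) -> Any:
    """Feed the output of each stage into the next one."""
    for stage in stages:
        payload = stage(payload)
    return payload


document = {"ocr_text": "Great service, but terrible delivery and poor packaging."}
score = run_pipeline([extract_text, score_sentiment], document)
print(score)  # one positive word, two negative words: about -0.33
```

The point of the sketch is the interface, not the models: as long as each stage accepts the previous stage's output, the operations team can swap implementations without changing the application that calls the pipeline.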


In machine learning, your code is only part of a larger ecosystem—the interaction of models with live, often unpredictable data. But interactions with new data can introduce model drift and affect accuracy, requiring constant model tuning and retraining.

As such, ML iterations are typically more frequent than iterations in traditional app development workflows, requiring a greater degree of agility from DevOps tools and staff for versioning and other operational processes. This can drastically increase the time needed to complete tasks.

The ML development life cycle with iterations
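
One simple way to decide when another iteration is due is to compare a model's accuracy on recent, labeled production data against the accuracy measured at deployment. The sketch below assumes that such labels become available and that a fixed tolerance is acceptable; the function names, the 0.05 tolerance, and the sample data are all hypothetical, and an accuracy drop is only one of several possible drift signals.

```python
from statistics import mean
from typing import Sequence


def accuracy(predictions: Sequence[int], labels: Sequence[int]) -> float:
    """Fraction of predictions that match the observed labels."""
    return mean([1.0 if p == y else 0.0 for p, y in zip(predictions, labels)])


def needs_retraining(
    baseline_accuracy: float,
    recent_predictions: Sequence[int],
    recent_labels: Sequence[int],
    tolerance: float = 0.05,  # assumed threshold, tune per use case
) -> bool:
    """Flag the model when recent accuracy falls below baseline minus tolerance."""
    recent = accuracy(recent_predictions, recent_labels)
    return recent < baseline_accuracy - tolerance


# Invented sample: 6 of 10 recent predictions match the labels.
recent_predictions = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
recent_labels = [1, 0, 0, 1, 1, 1, 0, 1, 1, 0]

print(accuracy(recent_predictions, recent_labels))                # 0.6
print(needs_retraining(0.90, recent_predictions, recent_labels))  # True
```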


Machine learning is all about selecting the right tool for a given job. But selecting infrastructure for ML is a complicated endeavor, made so by a rapidly evolving stack of data science tools, a variety of processors available for ML workloads, and the number of advances in cloud-specific scaling and management.

To make an informed choice, you should first identify project scope and parameters. The model-training process, for example, typically involves multiple iterations of the following:

  • an intensive compute cycle
  • a fixed, inelastic load
  • a single user
  • concurrent experiments on a single model

After deployment and scale, ML models from several teams enter a shared production environment characterized by:

  • short, unpredictable compute bursts
  • elastic scaling
  • many users calling many models simultaneously

Operations teams must be able to support both of these very different environments on an ongoing basis. Selecting an infrastructure that can handle both workloads would be a wise choice.


To address the unpredictability of ML workloads and the premium on low latency, organizations must build compute capacity to support substantial bursts.

There are three typical approaches, each with benefits and drawbacks. Architects can implement any of the three approaches in the cloud or in a physical datacenter, though elastic scaling and serverless approaches require far more effort to manage in-house.

1: Traditional capacity planning

In a traditional architecture, operations reserve compute resources capable of scaling to maximum anticipated demand.


Pros:

  • Reserved capacity always visible
  • Easy to administer

Cons:

  • Extremely wasteful and expensive
  • Often subject to hard limits – unanticipated demand can exceed the capacity

2: Elastic scaling

Standard elastic scaling designs for a local maximum, scaling machines up and down based on step functions.


Pros:

  • Huge cost improvements over traditional architecture

Cons:

  • Inefficient hardware utilization
  • Difficult to manage heterogeneous workloads on the same hardware
  • Slightly more management overhead than traditional architecture

3: Serverless

Serverless approaches spin up models as requests come in.


Pros:

  • Huge cost improvements over traditional architecture

Cons:

  • Inefficient hardware utilization
  • Difficult to manage heterogeneous workloads on the same hardware
  • Slightly more management overhead than traditional architecture
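
To make the trade-off between the three approaches concrete, the following sketch counts the machine-hours each one would provision for the same bursty demand curve. Everything here is an assumption made for illustration: the hourly demand figures, the reservation of 16 machines, and the scaling step of 4 are invented, and the serverless model is idealized (it ignores cold starts and any difference in price per machine-hour).

```python
import math
from typing import List


def traditional_capacity(demand: List[int], reserved: int) -> int:
    """Machine-hours provisioned when a fixed reservation runs every hour."""
    return reserved * len(demand)


def unmet_demand(demand: List[int], reserved: int) -> int:
    """Machine-hours of demand that exceed a fixed reservation (the hard limit)."""
    return sum(max(d - reserved, 0) for d in demand)


def elastic_capacity(demand: List[int], step: int) -> int:
    """Machine-hours when capacity moves in fixed steps, rounded up each hour."""
    return sum(math.ceil(d / step) * step for d in demand)


def serverless_capacity(demand: List[int]) -> int:
    """Machine-hours when capacity tracks requests exactly (idealized)."""
    return sum(demand)


# Invented demand: machines needed in each of eight consecutive hours.
demand = [2, 3, 2, 14, 3, 2, 9, 2]
reserved = 16  # assumed maximum anticipated demand
step = 4       # assumed size of each scaling step

print(traditional_capacity(demand, reserved))  # 128
print(elastic_capacity(demand, step))          # 52
print(serverless_capacity(demand))             # 37
print(unmet_demand(demand, 12))                # 2 (a smaller reservation misses the burst)
```

Under these assumptions the fixed reservation provisions more than three times the capacity actually used, while a smaller reservation fails during the largest burst. That is the tension the first approach cannot resolve.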

Auditability and Governance

Explainability—understanding why models make given predictions or classifications—is a hot topic and an essential part of ML infrastructure.

Equally important and often overlooked, however, are the related topics of model auditability and governance—understanding and managing access to models, data, and related assets.

An audit trail can help inform decisions

To make sense of complex pipelines, a multitude of users, and rapid model and data iteration, deployment systems should attempt to identify:

  • Who called which version of a particular model
  • The time a model was called
  • Which data the model used
  • What result was produced
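
The four items above map naturally onto a single audit record written on every model call. The following is a minimal sketch rather than a production design: `AuditRecord`, `audited_call`, and the in-memory list used as a log are hypothetical names, and hashing the input instead of storing it is an assumption made here to keep raw data out of the log.

```python
import hashlib
import json
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Any, Callable, List


@dataclass(frozen=True)
class AuditRecord:
    caller: str         # who called the model
    model_name: str
    model_version: str  # which version was called
    called_at: str      # when it was called (ISO 8601, UTC)
    input_digest: str   # which data was used (SHA-256 of the JSON payload)
    result: Any         # what result was produced


def audited_call(
    log: List[AuditRecord],
    caller: str,
    model_name: str,
    model_version: str,
    model: Callable[[dict], Any],
    payload: dict,
) -> Any:
    """Call the model and append one audit record describing the call."""
    result = model(payload)
    encoded = json.dumps(payload, sort_keys=True).encode("utf-8")
    log.append(
        AuditRecord(
            caller=caller,
            model_name=model_name,
            model_version=model_version,
            called_at=datetime.now(timezone.utc).isoformat(),
            input_digest=hashlib.sha256(encoded).hexdigest(),
            result=result,
        )
    )
    return result


# Hypothetical model and caller, for illustration only.
def sentiment_model(payload: dict) -> dict:
    return {"label": "positive" if "great" in payload["text"].lower() else "neutral"}


audit_log: List[AuditRecord] = []
output = audited_call(
    audit_log, "alice", "sentiment", "1.2.0", sentiment_model, {"text": "Great product"}
)
```

A real deployment system would write these records to durable, access-controlled storage, but the fields stay the same.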

As organizations add machine learning to their development plans, it’s imperative they consider how their specific use cases will determine the ways these challenges can be overcome.

Diego Oppenheimer is Founder and CEO of Algorithmia, a service that enables the creation of applications through the use of community-contributed machine learning models.