AI Business is part of the Informa Tech Division of Informa PLC


The New Machine Learning Life Cycle: 5 Challenges for DevOps


by Diego Oppenheimer

SEATTLE - Machine learning is fundamentally different from traditional software development and requires its own unique process: the ML development life cycle.

More and more companies are deciding to build their own internal ML platforms and are starting down the road of the ML development life cycle. Doing so is difficult and requires careful planning and coordination, but in the end, companies that do it control their own ML futures and keep their data secure.

After years of helping companies achieve this goal, we have identified five challenges every organization should keep in mind when they build infrastructure to support ML development.


1. Heterogeneous Languages and Frameworks
Depending on the ML use case, a data
scientist might choose to build a model in Python, R, or Scala and use an
entirely different language for a second model. What’s more, within a given
language, there are numerous frameworks and toolkits available. TensorFlow,
PyTorch, and Scikit-learn all work with Python, but each is tuned for specific
types of operations, and each outputs a slightly different type of model.

ML-enabled applications typically call on a pipeline of interconnected models, often written by different teams using different languages and frameworks.

An ML pipeline that extracts text from scanned documents and analyzes its sentiment using several languages and frameworks
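A pipeline like the one described above can be sketched as a chain of independent stages, each hiding its own language and framework behind a common interface. The stage functions below are hypothetical stand-ins for real OCR and sentiment models, not an actual API:

```python
# A minimal sketch of a two-stage ML pipeline: text extraction from a
# scanned document, followed by sentiment analysis. Each stage wraps a
# model that could be served in any language; here they are Python stubs.

def extract_text(scanned_doc: bytes) -> str:
    """Stand-in for an OCR model (could be a Scala or R service)."""
    return scanned_doc.decode("utf-8")  # pretend the bytes are the text

def analyze_sentiment(text: str) -> str:
    """Stand-in for a sentiment model (could be a PyTorch classifier)."""
    positive = {"great", "good", "excellent"}
    return "positive" if set(text.lower().split()) & positive else "negative"

def pipeline(doc: bytes) -> str:
    """Chain the stages; each one sees only its predecessor's output."""
    return analyze_sentiment(extract_text(doc))

print(pipeline(b"Service was great and the staff were excellent"))
```

Because each stage is called only through its inputs and outputs, teams can swap frameworks inside a stage without touching the rest of the pipeline.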

2. Frequent Iterations
In machine learning, your code is only part
of a larger ecosystem—the interaction of models with live, often unpredictable
data. But interactions with new data can introduce model drift and affect
accuracy, requiring constant model tuning and retraining.

As such, ML iterations are typically more frequent than in traditional app development workflows, demanding greater agility from DevOps tools and staff for versioning and other operational processes. This can drastically increase the operational work needed to keep models current.

The ML development life cycle with iterations
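The retraining loop this implies can be sketched as a periodic accuracy check against recently labeled production data; the threshold and the retraining trigger below are illustrative assumptions, not a prescribed policy:

```python
# Sketch of a drift check: if accuracy on a recent window of labeled
# data falls below a threshold, flag the model for retraining.

def accuracy(predictions, labels):
    correct = sum(p == l for p, l in zip(predictions, labels))
    return correct / len(labels)

def needs_retraining(predictions, labels, threshold=0.9):
    """Return True when live accuracy has drifted below the threshold."""
    return accuracy(predictions, labels) < threshold

# Recent predictions vs. ground-truth labels collected in production.
recent_preds  = ["spam", "ham", "spam", "ham", "spam"]
recent_labels = ["spam", "ham", "ham",  "ham", "ham"]

if needs_retraining(recent_preds, recent_labels):
    print("accuracy below threshold - schedule retraining")
```

In practice the window, the threshold, and who gets paged are all operational decisions, which is why drift handling lands on DevOps rather than on the data science team alone.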

3. Choosing the Right Infrastructure
Machine learning is all about selecting the right tool for a
given job. But selecting infrastructure for ML is a complicated endeavor, made
so by a rapidly evolving stack of data science tools, a variety of processors
available for ML workloads, and the number of advances in cloud-specific
scaling and management.

To make an informed choice, you should first identify project scope and parameters. The model-training process, for example, typically involves multiple iterations of the following:

  • an intensive compute cycle
  • a fixed, inelastic load
  • a single user
  • concurrent experiments on a single model

After deployment and scaling, ML models from several teams enter a shared production environment characterized by:

  • short, unpredictable compute bursts
  • elastic scaling
  • many users calling many models

Operations teams must be able to support both of these very different environments on an ongoing basis, so selecting infrastructure that can handle both workloads is the wise choice.

4. Provisioning Compute Capacity
To address the unpredictability of ML workloads and the premium on low latency, organizations must build compute capacity to support substantial bursts.

There are three typical approaches, each with benefits and drawbacks. Architects can implement any of the three in the cloud or in a physical datacenter, though elastic scaling and serverless approaches require far more effort to manage in-house.

1. Traditional capacity planning

In a traditional architecture, operations reserve compute resources capable of scaling to maximum anticipated demand.

Pros:
  • Reserved capacity always visible
  • Easy to administer

Cons:
  • Extremely wasteful and expensive
  • Often subject to hard limits: unanticipated demand can exceed the reserved capacity

2. Elastic scaling

Standard elastic scaling designs for a local maximum, scaling machines up and down based on step functions.

Pros:
  • Huge cost improvements over traditional architecture

Cons:
  • Inefficient hardware utilization
  • Difficult to manage heterogeneous workloads on the same hardware
  • Slightly more management overhead than traditional architecture
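The step functions that drive elastic scaling can be sketched as a mapping from observed load to instance count; the breakpoints and ceiling below are invented for illustration, not recommended values:

```python
# Sketch of step-function elastic scaling: choose an instance count
# from the current requests-per-second, stepping up at fixed breakpoints.

STEPS = [          # (max requests/sec for this step, instances to run)
    (100, 1),
    (500, 4),
    (2000, 16),
]
MAX_INSTANCES = 32  # hard ceiling for anything beyond the last step

def desired_instances(requests_per_sec: int) -> int:
    for limit, instances in STEPS:
        if requests_per_sec <= limit:
            return instances
    return MAX_INSTANCES

print(desired_instances(80))    # light load: first step
print(desired_instances(1200))  # mid-size burst: third step
```

The inefficiency noted above is visible in the sketch: a load of 101 requests/sec pays for four instances even though it barely exceeds the first step.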

3. Serverless
Serverless approaches spin up models as requests come in.

Pros:
  • Huge cost improvements over traditional architecture

Cons:
  • Inefficient hardware utilization
  • Difficult to manage heterogeneous workloads on the same hardware
  • Slightly more management overhead than traditional architecture

5. Auditability and Governance

Explainability—understanding why models make given predictions or classifications—is a hot topic and an essential part of ML infrastructure.

Equally important and often overlooked, however, are the related topics of model auditability and governance—understanding and managing access to models, data, and related assets.

An audit trail can help inform decisions

To make sense of complex pipelines, a
multitude of users, and rapid model and data iteration, deployment systems
should attempt to identify:

  • Who called which version of a particular model
  • The time a model was called
  • Which data the model used
  • What result was produced
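These four facts can be captured as a structured audit record written on every model call; the field names and wrapper below are illustrative, not a specific product's schema:

```python
# Sketch of an audit trail: record who called which model version,
# when, on what input, and with what result.

import json
from datetime import datetime, timezone

audit_log = []

def audited_call(user, model_name, model_version, model_fn, data):
    result = model_fn(data)
    audit_log.append({
        "user": user,
        "model": model_name,
        "version": model_version,
        "called_at": datetime.now(timezone.utc).isoformat(),
        "input": data,
        "result": result,
    })
    return result

sentiment = lambda text: "positive" if "great" in text else "negative"
audited_call("alice", "sentiment", "1.3.0", sentiment, "a great launch")
print(json.dumps(audit_log[-1], indent=2))
```

Writing the record in the serving layer, rather than inside each model, keeps the audit trail uniform across teams, languages, and frameworks.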

As organizations add machine learning to their development plans, it's imperative that they consider how their specific use cases shape the way each of these challenges is addressed.

Diego Oppenheimer is founder and CEO of Algorithmia, a service that enables the creation of applications using community-contributed machine learning models.
