The industry needs standards if machine learning is to be adopted at scale. But does it need Hadoop?
by Max Smolaks 23 December 2019
Earlier this month, data management software vendor Cloudera called on the industry to define “universal open standards” for machine learning operations and model governance, the emerging discipline that is quickly becoming known as MLOps.
The idea here is that without a common rulebook, it will be impossible to manage and update thousands of ML models that are expected to be deployed by businesses in the next few years. Common standards will also make it easier to move ML workloads, and their associated datasets, between the environments provided by different software vendors, and between on-premises infrastructure and public clouds.
Cloudera has suggested that the initial set of standards should revolve around Apache Atlas – an open source project that develops a data governance and metadata framework for Hadoop. Incidentally, the entire Cloudera business was built on Hadoop too.
“We see a lot of organizations that are successful with a handful of models, putting it into production, and having a team to manage. But what happens when that organization wants to scale that up to 100 models, or even 1,000 models, and then the paradigm changes entirely,” Santiago Giraldo, senior product manager for data engineering at Cloudera, told AI Business.
“Some of our clients are in finance, or they’re dealing with fraud, and they’re using machine learning to tackle a very dynamic, very challenging and ever-evolving problem. And for them, that means that they have to constantly be checking these models. They have to deploy models on a daily basis, on a weekly basis, just to keep up with what they’re doing, and it’s incredibly time-intensive and labor-intensive.
“A lot of these organizations had already begun to set out and said you know, we need to find a way of formalizing this and making it repeatable, making it manageable and being able to do this at scale. And to their testament, they started working on their own things.”
“We realized that we wanted to solve this problem in machine learning, we want to do it in an open source way, we want to do it in the open.” Cloudera has invited participation in its standardization efforts from partners and competitors alike.
Hadoop of the future
Cloudera is an open source company with solid enterprise credentials: counting its recent merger with Hortonworks, it has launched more than a dozen open source projects, and it survived a bittersweet IPO in 2017.
It was established in 2008 by a trio of former hyperscale engineers to commercialize the open source Apache Hadoop software – the tech that revolutionized data storage and processing at scale, and essentially kick-started the Big Data hype phenomenon.
The key to the popularity of Hadoop is the fact that it allows its users to deploy large quantities of cheap storage servers with a high risk of failure – if a server fails, the system recovers automatically.
Cloudera was one of the companies that brought the power of Hadoop to the masses, and is responsible for creating the data-rich environment that caused a resurgence in data science – making machine learning the next victim, or rather beneficiary, of the hype cycle. The company moved further into this space with the acquisition of AI research firm Fast Forward Labs in 2017, and today, it positions machine learning among the most promising applications of its software tools, as evidenced by its heavy presence at the AI Summit New York.
“Machine learning models are already part of almost every aspect of our lives from automating internal processes to optimizing the design, creation, and marketing behind virtually every product consumed,” commented Nick Patience, founder and research VP for software at 451 Research.
“As ML proliferates, the management of those models becomes challenging, as they have to deal with issues such as model drift and repeatability that affect productivity, security and governance. The solution is to create a set of universal, open standards so that machine learning metadata definitions, monitoring, and operations become normalized, the way metadata and data governance are standardized for data pipelines.”
Besides issues with model management and portability, MLOps is also aiming to solve questions of regulatory compliance and quality of service. In this regard, Cloudera has pinned its hopes on Atlas, a metadata management and governance project that features native integrations with Hadoop and enables “exchange of metadata with other tools and processes within and outside of the Hadoop stack.”
According to its documentation, Atlas can handle the collection, storage, and visualization of ML metadata, helping organizations build a catalog of their data assets, classify and govern those assets, and provide collaboration capabilities around them for data scientists, analysts and the data governance team.
“The Apache Atlas (Project) fits all the needs for defining ML metadata objects and governance standards. It is open-source, extensible, and has pre-built governance features,” Doug Cutting, chief architect at Cloudera who co-authored the original Hadoop framework in 2004, said in a press release.
Is this drive for standards somewhat self-serving, given that it promotes Hadoop as a common framework for big data management? Yes, it certainly appears so. But it’s also true that nearly 15 years after its birth, Hadoop has become a mainstay of the corporate data center, supported by some of the world’s largest IT players – like Oracle, Microsoft, SAS, SAP and Teradata. The industry needs standards. It remains to be seen whether it will choose trusty old Hadoop as its focal point.
“We don’t claim that we have all the answers, we’re building something that we know is working for our clients, and we’re getting adoption with some of our largest clients. And we’re getting the validation that this is the right direction,” Giraldo told AI Business.
“We’re not going to claim that, for example, every other business should be using Apache Atlas, that is unreasonable. But what we can share is that common language of what, for example, that metadata should be, and how our systems that may be using Apache Atlas can communicate with other systems that maybe aren’t.
“When we have that shared common language that allows us to inter-operate between systems, between different vendors, between different technologies, it makes it much easier for enterprise organizations to lower that barrier to entry.”