Meaning more data sources for machine learning models
by Max Smolaks 25 February 2020
American startup Databricks, established by the original authors of the Apache Spark software, has launched a set of enhancements for its Unified Data Analytics Platform.
The Databricks Ingest functionality enables users to move to a new "data management paradigm" dubbed the data lakehouse - intended to reconcile structured data coming from traditional business systems, stored in rows and columns in data warehouses, and unstructured stuff that's being dumped into corporate data lakes.
The former is widely used in business intelligence (BI), while the latter serves as fuel for machine learning systems, and improved integration promises more data for both categories of business projects.
Databricks' implementation of the lakehouse depends on another of the company's creations, Delta Lake, an open source storage layer that brings a set of reliability features to data lakes.
Databricks was co-founded in 2013 by a team of academics who met at Berkeley, including computer scientist Matei Zaharia, who developed Spark as a PhD thesis in 2009. A decade later, this open source cluster computing engine has become the de facto standard for handling really large datasets. Such datasets are increasingly used to train machine learning models, and in 2019, Databricks released MLflow, an open source machine learning management engine.
There are two parts to the company's latest announcement. The first is Databricks Ingest, a set of features that enable Data Platform users to load data from a range of commonly used sources: applications such as Salesforce, SAP and Google Analytics; databases such as Oracle, Cassandra and MySQL; and file storage services such as Amazon S3 and Azure Data Lake Storage.
Data imported from these sources can be loaded automatically, without setting up and maintaining job triggers or schedules.
The second part of the announcement is the Data Ingestion Network: a collection of partners that have built native integrations with Delta Lake, including Fivetran, Qlik, Infoworks, StreamSets and Syncsort. At least three more members -- Informatica, Segment and Talend -- will join in an upcoming release.
Thanks to these partnerships, organizations can load data from hundreds of data sources directly into a Delta Lake, without having to create custom integrations and configure APIs.
“Databricks powers our machine learning and business intelligence across multiple business functions, from car inventory management, to price prediction and technical operations, by using hundreds of terabytes of data,” said Greg Rokita, executive director for Technology at Edmunds, one of the world's most visited automotive resources.
“Our data vision is fully aligned with the lakehouse approach, and our cloud data journey starts with Delta Lake which powers our machine learning use cases and executive reporting. We’re excited about Databricks Ingest - it will definitely simplify loading data into our Delta Lake.”