Quite often, the terms big data and data lake are used in conjunction, even interchangeably. But they are not the same. Big data is a technology concept; the data lake is a business concept.
The misconceptions might be caused by technologies such as Hadoop or Spark. Both are used in the context of data lakes as well as in the context of big data. This can be confusing.
Doug Laney shaped our understanding of big data at the beginning of this millennium by introducing the three Vs: volume, velocity, and variety. Volume reflects the ever-increasing amount of data. System logs and behavioral data from web pages and apps led to the first explosion of data volume in organizations. IoT is at the core of the current, second data volume explosion: webcams, CCTV, and sensors create and deliver more data than ever before. The second V, velocity, refers to the increased speed at which data is received and processed. CCTV information should be transmitted and evaluated in near-real-time, not written to a tape that is checked once a week. Finally, variety refers to the different types of data and data formats such as structured, semi-structured, or unstructured files, images, videos, audio, etc.
Big data technologies such as Hadoop are a solution for the three Vs. They scale and process massive amounts of data based on the scale-out paradigm. If the computing power is not sufficient, you parallelize and add more and more cheap standard components until you have enough computing power. This contrasts with the scale-up approach of optimizing and tuning a single system, as known, e.g., from Oracle’s powerful Exadata solution.
Thanks to the scale-out paradigm and being open source, Hadoop offers cost savings for storing large amounts of data. Thus, Hadoop is used for offloading data warehouse data that is seldom used. This is one of the two important business cases for big data technology (Figure 1). It is about doing the same for less, i.e., a cheaper storage option for large data sets than traditional databases and data warehouses. It does not bring any direct benefit for AI and analytics projects. However, once the big data environment runs, it can be used for the second business case: implementing a data lake.
Figure 1: Big data, data lakes, AI and analytics, data warehouses - understanding the big picture
Data lakes capture potentially relevant data in an enterprise whether or not there is a direct, obvious usage for it. Text, audio, video, sensor data – the variety of data is high and so is the volume. Thus, data lakes are often implemented using big data technology such as Hadoop. What makes data lakes special becomes clear when looking at how data gets into a data lake versus how data gets into an (enterprise) data warehouse (Figure 2).
The steps are the same in both cases: data ingestion, data extraction, data cleansing & consistency, and data usage. Data ingestion is the delivery of data to the data lake or the data warehouse. This covers opening firewalls and ports and exporting and transporting files to the data lake or data warehouse. Data extraction means extracting relevant information from files to shrink the data volume. This is especially helpful for multimedia data and documents. Instead of storing complete web pages, just storing the text and no images or videos can be sufficient.
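The web page example above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class `TextExtractor`, the function `extract_text`, and the sample page are hypothetical, and the sketch relies solely on the standard library's `html.parser` module. It keeps the visible text and drops markup, images, videos, scripts, and styles.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text and ignores markup, scripts, and styles."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0   # > 0 while inside <script> or <style>
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text outside of skipped elements.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return " ".join(self._chunks)


def extract_text(html):
    """Return the text content of an HTML page (hypothetical helper)."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


# Hypothetical sample page with an image and a video that are not stored.
page = """<html><head><style>p {color: red}</style></head>
<body><h1>Quarterly report</h1><img src="chart.png">
<p>Sales grew in all regions.</p><video src="ceo.mp4"></video></body></html>"""

print(extract_text(page))
```

Only the extracted text would then be stored, which is typically a small fraction of the original page size.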
Data cleansing and consistency are the core of the data transformation and are best understood by looking at two examples. A banking system might store individual persons as well as couples as a customer object. When a couple gets divorced, the customer object has to be retired. In systems without a divorce date field, bank staff sometimes use the date-of-death field instead. They enter the divorce date to retire the object and add a note in the comments field. The bank staff do their best to capture reality in the system. When data scientists use this date-of-death field later, they have to be aware of this practice. Before loading such data into the data warehouse (or using it from a data lake), deceased customers must be distinguished from divorced couples. Another example is an attribute that represents the number of cars sold. Some teams might look at the signed contracts, others at the cars delivered to customers. Such nuances cause a mess when numbers are aggregated and compared. The data and numbers from various sources must be made consistent before they are used in the same report or training set.
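A cleansing rule for the banking example could look like the following sketch. Everything in it is an assumption made for illustration: the record layout (`CustomerRecord`), the field names, and the heuristic that a divorce is recognizable by the word "divorce" in the comment field. A real system would need rules agreed with the staff who maintain the data.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class CustomerRecord:
    """Hypothetical customer object as exported from the banking system."""
    customer_id: str
    is_couple: bool
    date_of_death: Optional[date]
    comment: str = ""


def retirement_reason(record: CustomerRecord) -> str:
    """Classify a record as 'active', 'divorced', or 'deceased'.

    Assumption: staff mark a divorce by filling the date-of-death field
    of a couple and mentioning 'divorce' in the comment.
    """
    if record.date_of_death is None:
        return "active"
    if record.is_couple and "divorce" in record.comment.lower():
        return "divorced"
    return "deceased"


records = [
    CustomerRecord("C1", False, None),
    CustomerRecord("C2", True, date(2019, 5, 1), "Divorce, object retired"),
    CustomerRecord("C3", False, date(2020, 1, 15)),
]

for r in records:
    print(r.customer_id, retirement_reason(r))
```

The point of the sketch is that the rule encodes knowledge about how the field is actually used, which cannot be derived from the schema alone.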
Finally, data usage refers to the actual act of using the data warehouse or data lake data for analysis. In both cases, data lakes and data warehouses, machine learning algorithms require structured input data. However, data lakes make it easier to deal with varying data formats thanks to the schema-on-read approach: the structure is imposed when the data is read for a concrete analysis, not when it is stored.
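The following sketch illustrates the schema-on-read idea in plain Python. The raw events, the field names (`temp_c`, `temperature`), and the function `read_with_schema` are hypothetical. The raw lines are stored exactly as delivered; only the reader decides which fields it needs and how to interpret them.

```python
import json

# Raw lines as they might sit in the data lake: formats differ between
# devices, and one line is not valid JSON at all (hypothetical data).
RAW_EVENTS = [
    '{"sensor": "cam-1", "temp_c": "21.5", "ts": "2021-03-01T10:00:00"}',
    '{"sensor": "cam-2", "temperature": 19, "ts": "2021-03-01T10:00:05", "fw": "1.2"}',
    'not json at all',
]


def read_with_schema(raw_lines):
    """Apply a (sensor, temp_c) schema at read time; skip unusable lines."""
    rows = []
    for line in raw_lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # the raw line stays in the lake, it is just not used here
        if not isinstance(record, dict):
            continue
        # Two feed variants name the temperature field differently.
        temp = record.get("temp_c", record.get("temperature"))
        if temp is None or "sensor" not in record:
            continue
        rows.append({"sensor": str(record["sensor"]), "temp_c": float(temp)})
    return rows


print(read_with_schema(RAW_EVENTS))
```

A different analysis could read the same raw lines with a different schema, for example one that also uses the firmware field, without any change to what is stored.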
One of these four steps is much more time-consuming and, thus, more expensive than all the others: the cleansing and consistency step. It causes project and maintenance costs. It is the step that distinguishes a data lake from a data warehouse. The step is performed when adding the data to the data warehouse, but not when adding data to the data lake. A data warehouse aims for “gold standard” data quality. Engineers and business users rely on the data to be accurate. Thus, it is expensive to add data to a data warehouse. In contrast, data lakes store uncleansed raw data. This does not incur high costs. Thus, engineers can add all data that might be of interest in the future. They do not need a real business case. The business case is needed only later, when an AI and analytics project wants to use the data lake data and has to do (and finance) the cleansing activities as part of the project.
Deferring the costs from the time of adding data to the time the data is used is a big advantage for AI and analytics teams and data-driven organizations. First, the data lake is the one-stop shop for all data. Data scientists do not have to ask around and hunt for applications with interesting data. They just look at the data lake. Plus, data scientists have to deal with only one interface, i.e., that of the data lake. They do not have to deal with various storage or database technologies or with firewall port challenges. Second, storage costs are low in data lakes, so data lakes can keep data that the original applications delete. Applications keep master and transactional data for a long time but delete large log files after some days or weeks. They free their (relatively expensive) storage from log data they do not need anymore.
When introducing or using a data lake, clarity regarding the service level is crucial. In my article Shaping AI and analytics services, I discussed the difference between strategic advisory and ongoing operational insights-style AI and analytics. The latter means that the insights influence operational processes. If you get to the point that data lake data directly impacts operational processes (e.g., trading decisions in a bank), the data lake must have an adequate service level.
Figure 2: Loading data into data warehouses (left) and data lakes (right). Blue marks investments made at the time of adding data to the system; green marks later costs.
Data lakes do not improve everything for AI and analytics teams. They also require these teams to take over new types of tasks. The AI and analytics teams (not the data lake teams) are typically responsible for data cleansing and ensuring consistency. They have to do an initial analysis and write transformation code. In the case of repetitive analytics tasks, e.g., ones that support operational processes, the AI and analytics teams must ensure the ongoing maintenance of the code. If feeds change, they have to update the cleansing and consistency code. The more diverse the feeds, the more effort. This has implications for the capabilities of the AI and analytics team (see my article AI and Analytics Services: Capabilities & Costs).
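One way to notice that a feed has changed before the cleansing code silently produces wrong results is a simple header check. The sketch below assumes a CSV feed; the column names and the function `check_feed_header` are hypothetical and only illustrate the kind of maintenance task that falls to the AI and analytics team.

```python
import csv
import io

# Columns the (hypothetical) cleansing code was written against.
EXPECTED_COLUMNS = {"customer_id", "contract_date", "cars_sold"}


def check_feed_header(csv_text, expected=EXPECTED_COLUMNS):
    """Compare a feed's header row with the expected columns."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, [])  # empty feed yields an empty header
    actual = {name.strip() for name in header}
    return {
        "missing": sorted(expected - actual),
        "unexpected": sorted(actual - expected),
    }


old_feed = "customer_id,contract_date,cars_sold\nC1,2021-03-01,2\n"
new_feed = "customer_id,contract_date,cars_delivered\nC1,2021-03-01,2\n"

print(check_feed_header(old_feed))
print(check_feed_header(new_feed))
```

In the second feed, a renamed column signals not just a technical change but possibly a change in meaning (contracts signed versus cars delivered), so the consistency rules need a review as well.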
To conclude: data lakes bring new challenges for AI and analytics teams, but above all they are a great opportunity. They provide easy access to more data than any data warehouse ever will. William Cowper’s proverb also applies to AI and analytics initiatives: “Variety is the very spice of life, that gives it all its flavour.” If you manage to integrate new data, you can provide fresh perspectives and additional insights to the business! And the key to finding data in the data lake is the data catalog – a topic for one of my next articles.
D. Laney: 3D Data Management: Controlling Data Volume, Velocity, and Variety, META Group, February 2001
Klaus Haller is a Senior IT Project Manager with in-depth business analysis, solution architecture, and consulting know-how. His experience covers Data Management, Analytics & AI, Information Security and Compliance, and Test Management. He enjoys applying his analytical skills and technical creativity to deliver solutions for complex projects with high levels of uncertainty. Typically, he manages projects consisting of 5-10 engineers.
Since 2005, Klaus has worked in IT consulting and for IT service providers, often (but not exclusively) in the financial industry in Switzerland.