Data Lakes: Cutting Through the Hype

ABE / December 17, 2019

I have three passions in life: people, data, and beverages. I'll save people and beverages for another time; this blog post is all about data.

We live in a data-driven era. Information is being created at a velocity never before seen in history, and that velocity is only increasing!

You may have heard of 'data lakes' as part of the technology ecosystem related to data science, analytics, and machine learning. Some marketers claim that data lakes are essential to data science and analysis. Other data wizards say that data lakes are dangerous and unnecessary.

What is a data lake?

Data Lake Definition

Let's start with the definition from Wikipedia:

"A data lake is a system or repository of data stored in its natural/raw format, usually object blobs or files. A data lake is usually a single store of all enterprise data including raw copies of source system data and transformed data used for tasks such as reporting, visualization, advanced analytics and machine learning."

Wikipedia provides a pretty good working definition.

Stated even more simply, a data lake is a digital repository where an organization centralizes raw data storage.

At Black Cape, we think about modern data architectures as having four independent but related layers:

Figure: modern data architecture

  1. a simple storage layer that maintains data in its raw state, such as text files, JSON, images, and audio files
  2. a structured storage layer that applies a schema or other organizing mechanism to make it easier to search for specific elements of data
  3. an analytics and AI/ML layer that applies algorithms to data to transform it and create new derived insights
  4. an access layer that provides Application Programming Interfaces (APIs) and related services for making data available to external systems and to end users

A well thought-out data lake can serve as a strong foundation for the simple storage layer, and perhaps a major part of the structured storage layer as well.
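To make the four layers a little more concrete, here is a minimal sketch in Python using only the standard library. It is an illustration, not a description of how Black Cape builds its systems: the function names, the `raw` directory, and the `sensor` and `reading` fields are all hypothetical, and a local folder stands in for whatever object store a real data lake would use.

```python
import json
from pathlib import Path
from statistics import mean
from typing import Optional


# Layer 1: simple storage. Keep each payload exactly as it arrived.
def store_raw(lake_dir: Path, name: str, payload: str) -> Path:
    path = lake_dir / "raw" / name  # "raw" is a hypothetical folder name
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(payload, encoding="utf-8")
    return path


# Layer 2: structured storage. Apply a schema so fields are easy to search.
def load_structured(lake_dir: Path) -> list:
    rows = []
    for path in sorted((lake_dir / "raw").glob("*.json")):
        record = json.loads(path.read_text(encoding="utf-8"))
        # Hypothetical schema: a string sensor id and a numeric reading.
        rows.append({"sensor": str(record["sensor"]),
                     "reading": float(record["reading"])})
    return rows


# Layer 3: analytics. Derive a new insight (a per-sensor average).
def average_reading(rows: list) -> dict:
    by_sensor = {}
    for row in rows:
        by_sensor.setdefault(row["sensor"], []).append(row["reading"])
    return {sensor: mean(values) for sensor, values in by_sensor.items()}


# Layer 4: access. A single entry point that external callers would use;
# in practice this would sit behind an API rather than a function call.
def get_average(lake_dir: Path, sensor: str) -> Optional[float]:
    return average_reading(load_structured(lake_dir)).get(sensor)
```

Note that the raw files are never modified: the structured, analytics, and access layers only read from them, which is what lets the same raw data serve several purposes later.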

Data Lake Use Cases

There are many use cases for data lakes. Four common ones are:

landing zone: a place where incoming data is aggregated and stored before additional processing.

analytic sandbox: an environment for storing data and running descriptive statistics, quantitative analysis, predictive models, and other forms of data analysis.

machine learning training repository: a staging area to hold training data that is used to construct machine learning models to classify, predict, categorize, or label information.

digital graveyard: a place where people save data and then never do anything with it. I frequently see Microsoft SharePoint used as a digital graveyard where information no one cares about goes to be buried. This is a bad practice! If you have a data lake, use it for one of the other use cases.
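As a rough illustration of the landing zone use case, the sketch below copies an incoming file, unchanged, into a folder partitioned by source and arrival date. The `landing` folder, the `land_file` function, and the path layout are assumptions made for this example; real data lakes often apply the same idea to object storage keys rather than local folders.

```python
import shutil
from datetime import date
from pathlib import Path


def land_file(source: Path, lake_root: Path, source_name: str, arrived: date) -> Path:
    """Copy an incoming file as-is into a date-partitioned landing path."""
    # Hypothetical layout: <lake_root>/landing/<source_name>/<YYYY-MM-DD>/<file>
    target_dir = lake_root / "landing" / source_name / arrived.isoformat()
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / source.name
    shutil.copy2(source, target)  # the original file is left untouched
    return target
```

Recording the source and arrival date in the path is one simple way to keep a landing zone from drifting into a digital graveyard: anyone browsing it can tell where a file came from and how old it is.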

Data Lakes @ Black Cape

We recently stood up a data lake at Black Cape.

The data lake contains 12 data sources that Black Cape engineers can munge, slice, chop, massage, and wrangle to their hearts' content. The Black Cape Data Lake demonstrates our engineers' passion for machine learning and turning data into meaningful insights, as well as our commitment to continual learning through doing.

Figure: data lake OSM POI

Interested in data science, machine learning, and software engineering?

Consider joining the Black Cape team. We're always on the lookout for exceptional people.