In this post, we are delighted to share a fun analogy comparing oil and data. To our surprise, they are more alike than anyone would like to admit.
The analogy "Data is the new oil" compares the value of raw data to unprocessed oil, emphasizing that both require strategic processing to unlock their potential. The article explores the data value chain, drawing parallels to the fossil fuel industry. It discusses challenges in data collection, transportation, storage, processing, and enrichment, likening them to stages in the oil industry. The concept of a Data Lakehouse, combining the benefits of Data Lakes and Data Warehouses, is introduced. The article concludes by urging caution regarding the downstream effects of data use, drawing parallels to environmental concerns in the oil industry, and prompts reflection on how individuals want to utilize their "data oil".
(Yes, we used ChatGPT to create this TL;DR…)
“Data is the new oil” - an expression coined in 2006 by Clive Humby, a British mathematician - is probably one you’ve heard many times at this point.
But what does it mean? Fundamentally, Humby’s statement helps draw an analogy between the value of oil and data - highlighting that, in its raw form, there isn’t much apparent value to be captured. It is with some clever processing that data, like oil, can create new value streams and even entire industries!
Let’s extend the analogy of data as oil and examine how we can apply learnings from the Industrial Revolution to the “Technology Revolution”.
There are fossil fuels all over the globe, but one region is especially interesting as we extend this oil analogy: the Arctic.
The Arctic, as you can imagine, is incredibly cold. This makes setting up the necessary infrastructure to extract and process oil especially difficult, both technically and logistically. To add to the hardships, the Arctic borders several countries, which raises the question - who “owns” the Arctic and, by extension, the oil within it? And let us not forget how fragile the Arctic ecosystem already is. The oil industry is notorious for wreaking havoc on the environment; fighting political battles and setting up the necessary infrastructure to extract Arctic oil almost certainly spells disaster for the local environment, before even considering the broader effects of burning fossil fuels or plastic waste.
It is clear that getting value from Arctic oil presents many challenges. What we have noticed is that getting value from data in large organizations comes with challenges of its own; the question is: are those challenges similar? What learnings can we take from the fossil fuel industry and apply to this new data industry?
Let’s examine the data value chain and compare and contrast the difficulties in capturing value across three broad themes: the technical and logistical infrastructure required, the question of ownership and ethics, and the downstream effects of the value streams we create.
The data value chain is strikingly similar to the value chain of the fossil fuel industry. Let’s dive into each stage of that chain.
Data can come from many different places - activity logs, sales transactions, even the heart rate captured by your smartwatch. Like oil, data needs to be produced by something. In the case of data, it is produced by creating records that describe a certain event; oil, by contrast, comes from plant and animal matter buried under layers of soil and subjected to heat and pressure over millions of years.
Data - like oil found in soil, rocks, or tar - can come in many forms:
Structured - organized into a fixed schema of rows and columns, like a database table
Semi-structured - self-describing records whose shape can vary, like JSON logs
Unstructured - data with no inherent schema, like free text, images, or audio
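To make those forms concrete, here is a tiny Python illustration; the records and field names are invented for the example:

```python
# The same "oil" in three forms (records and fields invented for illustration)

# Structured: a fixed schema of rows and columns, like a sales table
sale = {"order_id": 1042, "amount_usd": 19.99, "ts": "2024-05-01T12:00:00Z"}

# Semi-structured: self-describing, but the shape can vary from record to record
log_event = {"user": "abc123", "action": "login", "meta": {"device": "ios"}}

# Unstructured: no inherent schema - free text, images, audio, video
support_ticket = "My smartwatch stopped syncing heart-rate data after the update."
```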
Hold this thought, as we will get into it more in Data Processing and Enrichment.
Where do we find data? As with oil, we often find ourselves not even knowing that there is data in our ecosystem. Figuring out where data comes from is a task of its own. Once we have identified a source, we need to figure out how to collect it.
How do we collect data? Simply put, systems (typically software) need to expose data at some endpoint for collection, referred to as a “source”. For oil, collection comes down to traditional drilling and pumping methods - not so great for the environment.
Granting the ability to collect that data sits behind a set of infrastructure and security gates that restrict who can gather the data, how, and how much.
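To make the idea of a “source” behind a security gate concrete, here is a minimal sketch of what one might look like, assuming Flask; the endpoint name, token, and events are all invented for illustration:

```python
# A hypothetical "source": a service exposing events behind a simple gate.
from flask import Flask, jsonify, request

app = Flask(__name__)

EVENTS = [
    {"id": 1, "action": "page_view"},
    {"id": 2, "action": "purchase"},
]

@app.route("/events")
def events():
    # The gate: only callers holding a valid token may collect this data.
    if request.headers.get("Authorization") != "Bearer secret-token":
        return jsonify({"error": "not authorized to collect this data"}), 403
    return jsonify(EVENTS)

if __name__ == "__main__":
    app.run(port=5000)
```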
But who decides that I can collect that data?
You would think it is the data owner - the person, or “subject”, that the data is about. But oftentimes, that is not the case. Every day, we see data collected without the permission of its owner. It’s gotten to the point where Europe had to create an entirely new law governing who can collect personal data and how (we’re talking about GDPR).
So, like oil, we face a dilemma:
Where is the data?
Who owns the data?
Who has the right to collect it?
Is it ethical to collect it?
And do we have the systems to efficiently collect it?
Alright, cool. We have a way to collect data. How do we get it to somewhere that we can store it for refinement?
Data has a couple of transport modes, which are tightly coupled with raw storage and processing.
The first is referred to as batch. In the batch transport mode, data is pooled into a chunk that collects events over a fixed period of time. This type of transport enables high volumes of data to be processed across multiple processing nodes in a fixed window. Think of this as placing containers of oil on a ship or truck and transporting them to the processing facility - a fixed amount is carried in a given time frame.
The second method is referred to as stream. In the stream transport mode, data is gathered and sent over a channel in real time. This type of transport enables real-time processing, with data handled across multiple processing nodes in a much tighter time window. Think of this as pumping oil through a pipeline as it is collected - a variable volume flows at any given moment, but it flows continuously.
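Here is a minimal Python sketch contrasting the two transport modes; the event feed is simulated, and the chunk size and timings are arbitrary:

```python
import time
from datetime import datetime, timezone

def event_feed():
    # Simulated producer - stands in for logs, transactions, sensor readings
    for i in range(10):
        yield {"id": i, "ts": datetime.now(timezone.utc).isoformat()}
        time.sleep(0.1)

def batch_transport(feed, batch_size=5):
    # Pool events into fixed-size chunks - containers of oil on a truck
    batch = []
    for event in feed:
        batch.append(event)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # ship whatever remains at the end of the window

def stream_transport(feed):
    # Forward each event as it arrives - oil flowing through a pipeline
    for event in feed:
        yield event

for chunk in batch_transport(event_feed()):
    print(f"processing a batch of {len(chunk)} events")
```

Notice the trade-off: batch ships predictable chunks on a schedule, while stream keeps the pipeline flowing with whatever arrives.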
So the question you might have is: how do we accommodate these different transport modes? There are many interesting technologies and systems that support both batch and stream, most of them in the cloud, and all continue to evolve as we create more data. But data volumes keep growing - organizations that have not equipped themselves with scalable architectures encounter bottlenecks that stunt their capacity to grow.
Note: this is classified under Extract in the Data Engineering Framework.
Now the data (our “oil”) has reached the processing site, landing in an area where it waits to be processed. This area is referred to as a Data Lake.
A Data Lake is a storage medium which can store all types of data (recall structured, unstructured, and semi-structured) in its original format, which makes it very flexible. The length of time that the data stays in a Data Lake depends on the transport mode and the velocity at which it can be processed.
Data Lakes leverage a Data Catalog - a data governance control specifying metadata about the raw data. Without a Data Catalog, the raw data runs the risk of becoming a “Data Swamp”, making it next to impossible to know what data is actually in the Data Lake.
A Data Lake has a couple of characteristics that make it strikingly similar to a general oil storage container - it’s relatively cheap and can hold large volumes, but its contents are difficult to synthesize into value.
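As a rough sketch of how raw data might land in a lake alongside a catalog entry - the path layout and catalog format here are invented, and real systems use dedicated catalog services:

```python
import json
from datetime import date
from pathlib import Path

LAKE_ROOT = Path("datalake")

def land_raw(source, records):
    # Store the data in its original (raw) format, partitioned by arrival date
    path = LAKE_ROOT / "raw" / source / f"dt={date.today().isoformat()}"
    path.mkdir(parents=True, exist_ok=True)
    out = path / "part-0000.json"
    out.write_text(json.dumps(records))

    # Catalog entry: without this metadata, the lake drifts toward a "Data Swamp"
    with (LAKE_ROOT / "catalog.jsonl").open("a") as f:
        entry = {"source": source, "path": str(out), "records": len(records)}
        f.write(json.dumps(entry) + "\n")
    return out

land_raw("clickstream", [{"user": "abc123", "action": "page_view"}])
```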
Note: this is classified as Extract and/or Load in the Data Engineering Framework.
As in the oil industry, data processing can be a complex operation.
Typically carried out by Data Engineers (who manage many aspects of the ETL/ELT framework), this activity involves many tools and sophisticated operations to transform raw data into clean, structured, aggregated data that is efficiently and easily consumed by value streams.
We will avoid getting too technical - but the general idea is that data is pulled from raw storage (typically a Data Lake, though it can be a Data Lakehouse - more on that soon), either in batches or in near real-time (recall the data transport modes). From there, the data is inspected and cleaned - records with missing fields are handled, outliers are removed - and augmented with other data to enhance the overall picture. Often, the data is then aggregated into groups with summary statistics before it is finally ready to be consumed. Typically, the outputs of this stage take the form of structured data.
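For a flavour of what this looks like in practice, here is a sketch of a transform step, assuming pandas; the columns, the outlier rule, and the enrichment table are all invented for the example:

```python
import pandas as pd

raw = pd.DataFrame({
    "user": ["a", "b", "b", None, "c"],
    "amount": [10.0, 12.5, 9999.0, 5.0, 11.0],  # 9999.0 is an outlier
})
regions = pd.DataFrame({"user": ["a", "b", "c"], "region": ["EU", "US", "EU"]})

clean = raw.dropna(subset=["user"])  # drop records with missing fields
clean = clean[clean["amount"] < clean["amount"].quantile(0.95)]  # drop outliers
enriched = clean.merge(regions, on="user", how="left")  # augment with other data

# Aggregate into groups with statistics - the structured output of this stage
summary = enriched.groupby("region")["amount"].agg(["count", "mean"]).reset_index()
print(summary)
```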
In an oil refinery, the concepts are similar: oil is processed, separated, and/or enriched before it is then ready for value streams.
Note: this is classified as Transform in the Data Engineering Framework.
So where do we put data when it’s been “enriched”? Well, let me introduce the concept of a Data Warehouse.
A Data Warehouse is logical storage for “enriched”, structured data (much like a regular database). It is relatively small, as it only contains specific, cleaned, and aggregated data. Think of the Data Warehouse as a collection of containers, each storing one type of enriched output - a container for gasoline, a container for diesel, a container for kerosene, and so on. Because of the specific structure, or “contents”, of these containers, it is relatively more costly to maintain the “purity” of what is stored; but the data is easier to manipulate, since there is ACID (i.e., Atomicity, Consistency, Isolation, and Durability) support.
What makes a Data Warehouse appealing is that it facilitates value streams, much like how gasoline can be easily shipped to gas stations to fuel traditional vehicles.
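As a rough illustration of that ACID guarantee, here is a sketch that loads enriched rows into a warehouse-like store, using SQLite as a stand-in for a real Data Warehouse; the schema is invented:

```python
import sqlite3

conn = sqlite3.connect("warehouse.db")
conn.execute(
    "CREATE TABLE IF NOT EXISTS sales_by_region ("
    "region TEXT PRIMARY KEY, order_count INTEGER, avg_amount REAL)"
)

rows = [("EU", 2, 10.5), ("US", 1, 12.5)]

# `with conn` wraps the writes in a transaction: either every row lands,
# or (on any error) the whole write is rolled back - the container stays pure
with conn:
    conn.executemany("INSERT OR REPLACE INTO sales_by_region VALUES (?, ?, ?)", rows)

conn.close()
```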
Note: this is classified as Load in the Data Engineering Framework.
You may be wondering why we need to move data between storage mediums. That seems inefficient and might produce a lot of duplication. You’re not wrong…
What if we could use a single storage medium to host our raw and processed data, structured or not? We can! The Data Lakehouse is a data management architecture that combines the flexibility, cost-efficiency, and scale of Data Lakes with the data management and ACID transactions of Data Warehouses. What’s nice about a Data Lakehouse is that we don’t have to replicate and move data around as often, and we can reference all our data assets - raw and processed - from a single, updated source.
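For a taste of the pattern, here is a sketch assuming the open-source deltalake (delta-rs) Python package, one of several lakehouse table formats; the paths and data are invented, and the exact API may differ between versions:

```python
import pandas as pd
from deltalake import DeltaTable, write_deltalake

df = pd.DataFrame({"region": ["EU", "US"], "avg_amount": [10.5, 12.5]})

# ACID-transactional write straight into lake storage - no separate warehouse copy
write_deltalake("lakehouse/sales_by_region", df, mode="append")

# Read the same single source of truth back for analysis
print(DeltaTable("lakehouse/sales_by_region").to_pandas())
```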
The Data Lakehouse architecture is proving to be increasingly popular; we are seeing new solutions being offered by the Hyperscalers AND entire companies being built around the concept.
Are you ready to explore the technology?
We now have structured, enriched, aggregated data - what’s next? Let’s break down some of the value streams that result from our processing.
The first value stream is analogous to fuels - think diesel, gasoline, etc. These outputs are ready to consume as-is; that is, they are ready to generate value. With all the processing we did, our data is now ready for consumption in what we call “Data Analysis” - examining data to surface insights about the past and present. You can think of this as creating a report for executives on customer satisfaction, or on how many users access your website at any given time of day. As with fuels, these analysis activities have downstream value-creation effects: diesel may fuel the machinery required to build a manufacturing facility for electric vehicles while, similarly, data reporting may lead decision-makers to change a business strategy.
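As a toy example of this value stream, here is a sketch of the website-traffic report mentioned above, assuming pandas; the access-log fields are invented:

```python
import pandas as pd

# Invented access log: which user hit the site, and at what hour
hits = pd.DataFrame({
    "user": ["a", "b", "a", "c", "b", "a"],
    "hour": [9, 9, 12, 12, 12, 17],
})

# "How many users access your website at any given time of day?"
report = hits.groupby("hour")["user"].nunique().rename("unique_users")
print(report)
```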
The second value stream is more like plastics; these outputs require some extra shaping into a final form. In data, you can think of this as modelling and artificial intelligence - the data is used as a building block for sophisticated algorithms that can be used to predict future events. I won’t go into too much detail on artificial intelligence, but (as I’m sure you already know) the ecosystem is massive and evolving every day.
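And as a toy example of the “plastics” stream, here is a sketch that shapes processed data into a predictive model, assuming scikit-learn; the features and labels are invented:

```python
from sklearn.linear_model import LogisticRegression

# Invented features: [visits last week, minutes on site]; label: purchased?
X = [[1, 5], [3, 20], [7, 45], [2, 10], [8, 60], [6, 30]]
y = [0, 0, 1, 0, 1, 1]

model = LogisticRegression().fit(X, y)
print(model.predict([[5, 40]]))  # predict the "future event" for a new user
```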
The possibilities are truly endless, and the value that can be generated has already started to reshape entire industries. However, like the oil industry, do we understand the downstream effects these value streams have?
Through this analysis, we can see that Data and Oil are quite similar. We leveraged the analogy of Arctic oil to help promote a narrative that getting value from data is not as simple as it seems - there is a lot that goes into getting the data, transporting it, refining it, and ultimately exposing it for value streams.
But what about those value streams? Today we are learning about some of the negative effects of those value streams on our planet and ourselves - whether that be global warming from greenhouse gases, or microplastics in almost every corner of the globe and inside us. Just as in the oil industry, we must be careful of the downstream effects of our use of this new oil: data. We are already learning about the bias, discrimination, and other dangers of artificial intelligence.
With that, we leave you with a simple question:
What do you want to do with your “Oil”? If you need help answering this question, let’s chat!