I am A.I.

Fly Fishing in the Big Data Lake

Which is more important to you: the size of your data lake or knowing where the tasty data “fish” are lurking? Some people believe that the amount of data they have directly correlates with the number of valuable insights that they’ll find. In reality, the size of the data lake matters less than having a clear objective and a strategy for extracting usable insights from it.

The concept of big data is not new—the term was reportedly first used at the end of the 1980s—and refers to vast amounts of information that can be analyzed to reveal patterns and trends about human behavior and interactions. The last few years have also seen a rise in another phrase—data lakes—large-scale repositories of data held in their raw or source form.

It’s difficult to find a tech conference or client meeting where the terms big data and data lakes don’t come up. They have become popular because, as the technology for storing data becomes cheaper (gigabytes of data can be stored for less than a penny), companies are accumulating large amounts of data about their customers, suppliers, and themselves. Even the EU’s General Data Protection Regulation will do little to curb the vast amounts of data that companies hold.

So why does this matter? 
As it becomes easier to gather and store large amounts of data, companies are moving away from the disciplines learned in the era of data warehousing.
Back then, business leaders had to think carefully about the types of information they were going to collect and how they would store and use it. Today, the mindset tends to be, “Let’s not worry about what we’re storing or how, just store it all and we’ll sort it out later.” 

That thought occurred to me after a discussion with leaders of a large organization who were planning to implement a data lake strategy and use it to find insights about customers that the marketing team could then use for segmentation and campaign targeting. As we talked more, I became aware that there wasn’t really a strategy about what kinds of insights they needed, how they would be generated, what sort of data would be required, or even where it would come from in the first place. The plan was just to store everything and hope the insights would become self-evident or appear after some kind of critical mass had been achieved.

To return to the fishing analogy, it’s like baiting your hook with whatever you have at the time, casting into an area you hope holds fish, and waiting to see what happens. You don’t really know what’s lurking beneath the surface, whether your bait is attractive, or even whether there are any fish there in the first place. Nor do you know if what you catch will be particularly appetizing. 

The other problem is that most algorithms and analytics require the data to be in a structured form. But separating the data to apply structure to it after the fact can be incredibly difficult. 

Think like a fly fisher
Big data encompasses a wide variety of data types: structured (tabular data like numbers, dates, and names), unstructured (streams of text, media, audio, video, etc.), and semi-structured/self-declaring data.

The benefit of structured data is that it is predictable and easy to archive and retrieve. Unstructured data is generally unpredictable in either form or content. Semi-structured/self-declaring data can appear to have structure but is stored in a way that doesn’t dictate a predefined data model (e.g., all the data is stored as text, in any order, of any length). Archiving and retrieving unstructured data is also easy, but analyzing unstructured or semi-structured data is much harder if the system is not designed well.
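To make the contrast concrete, here is a minimal Python sketch. The sample records, the field names (customer_id, signup_date, region, note), and the key=value text layout are all invented for illustration; they are assumptions, not a description of any particular data lake. The point is that structured rows map directly onto known columns, while self-declaring text only becomes analyzable once you decide which fields you actually need.

```python
import csv
import io

# Structured: every row follows the same predefined columns (hypothetical sample).
STRUCTURED = "customer_id,signup_date,region\n101,2018-03-01,EU\n102,2018-03-04,US\n"

# Semi-structured: each record declares its own fields as text, in any order
# and of any length (hypothetical key=value layout, assumed for this sketch).
SEMI_STRUCTURED = [
    "region=EU;customer_id=101;note=prefers email",
    "customer_id=102;signup_date=2018-03-04",
]


def read_structured(text):
    """Rows map straight onto the known column names."""
    return list(csv.DictReader(io.StringIO(text)))


def read_semi_structured(lines, wanted_fields):
    """Parse key=value pairs, then keep only the fields we decided we need.

    Fields a record never declared come back as None; undeclared extras
    (such as 'note') are dropped.
    """
    rows = []
    for line in lines:
        pairs = dict(item.split("=", 1) for item in line.split(";") if "=" in item)
        rows.append({field: pairs.get(field) for field in wanted_fields})
    return rows


structured_rows = read_structured(STRUCTURED)
semi_rows = read_semi_structured(
    SEMI_STRUCTURED, ["customer_id", "signup_date", "region"]
)
```

Notice that read_semi_structured cannot run until someone supplies wanted_fields. That argument is the objective and strategy in miniature: without it, the text is stored but not usable.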

All of these data types potentially hold value, but not all data insights are equally valuable. Some insights may be interesting, but are ultimately useless or irrelevant. That’s why it’s critical to determine early on which types of insights you’re looking for. 

And just like fishing, it takes more than just casting a hook to land a whopper of a fish (or insight). Business leaders may need to think like fly fishers. Fly fishing entails preparing the fly before you leave for the water, identifying the fish you want to catch, carefully getting to a good casting location, casting to just the right spot at just the right moment, and being able to eat whatever you land. 

Do you have clear objectives in mind about what you hope to extract from your data lake? Are you adding the right kind of data (structured, unstructured, semi-structured, etc.) in a way that ensures you can use it again afterward? Do you have the right tackle and the skills to use it? Answering these questions before delving into a project is essential to delivering the results you’re looking for.  

A few things to keep in mind: Some of the tastiest fish come from small rivers and streams. Finding actionable insights from big data is about quality, not quantity. Focus on formulating your vision and strategy: What will you use the insights for? How will they allow you to make better decisions? And how will those decisions be put into practice?

It’s also worth noting that some people prefer not to fly fish in a lake. They believe the technique is best suited to swift-moving currents and fish. This idea can also be applied to data strategies. Preparing a careful strategy for parsing data is time-consuming and might not be necessary for flat or stagnant data. In some cases, it may make sense to just cast a line and see what comes up.

But when you’re dealing with fast-moving, real-time data, you’ll need a plan that acknowledges the opportunities (and challenges) of uncovering insights. Otherwise, you’re just storing more data.