Fly Fishing in the Big Data Lake

It’s thought that the term ‘Big Data’ was first used at the end of the 1980s, but the last few years have seen the rise of an additional phrase: ‘Data Lakes’ - large-scale repositories of data held in their raw or source form. I suspect that data lakes have become so popular because the cost of storage, and of the tools necessary to manipulate Big Data, has decreased dramatically. That doesn’t mean that they haven’t introduced a whole new set of challenges and pitfalls as well.

Data without a design strategy is just data

I don’t dispute the potential value of Big Data (after all, I have a data science background) but I do take exception to some of the sloppy design thinking it tends to engender – along the lines of ‘don’t worry about what we store or how, just store it all and we’ll sort it out later’ (at the point of extraction and use). Here’s the problem – many algorithms, and analytics especially, require the data to be in a structured form (think tables), and it can be extraordinarily difficult to apply structure after the fact. Worse, you may not even be capturing the right kind of data, but you won’t know until you try to use what you’ve got. Sadly, there is still no getting around the need to do some planning about your data needs.
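As a rough illustration of why applying structure after the fact is hard, here is a minimal Python sketch. Everything in it is hypothetical: the sample events, the field names (customer_id, spend, channel) and the to_rows helper are assumptions made up for this example, not taken from any real system. It simply tries to force raw, schema-less records into a fixed table and reports what it has to throw away.

```python
import json

# Hypothetical raw events, stored "as is" with no agreed schema.
RAW_EVENTS = [
    '{"customer_id": 1, "spend": "12.50", "channel": "web"}',
    '{"custId": 2, "spend": 8, "channel": "store"}',  # different key name
    '{"customer_id": 3, "channel": "web"}',           # spend never captured
    'not valid json',                                 # corrupt record
]

# The table structure the analysis needs (an assumption for this sketch).
COLUMNS = ("customer_id", "spend", "channel")


def to_rows(raw_events):
    """Try to fit raw events into a fixed table; return (rows, rejects)."""
    rows, rejects = [], []
    for line in raw_events:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            rejects.append((line, "unparseable"))
            continue
        if not isinstance(record, dict):
            rejects.append((line, "not a record"))
            continue
        missing = [c for c in COLUMNS if c not in record]
        if missing:
            rejects.append((line, "missing " + ", ".join(missing)))
            continue
        rows.append(
            (int(record["customer_id"]), float(record["spend"]), record["channel"])
        )
    return rows, rejects


rows, rejects = to_rows(RAW_EVENTS)
print(f"{len(rows)} usable rows, {len(rejects)} rejected")
for line, reason in rejects:
    print(f"  rejected ({reason}): {line}")
```

With this made-up sample, only one of the four records fits the table. The rest fail for reasons that could have been avoided by agreeing field names and required values before anything was stored.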

So, why the title of this blog? It came to me after a discussion with a large organisation that said it was going to implement a Data Lake strategy and use it to find insights about customers that Marketing could then use for segmentation and campaign targeting. As we talked more, I became aware that there wasn’t really a strategy about what kinds of insights they needed, how they would be generated, what sort of data would be required or even where it would come from in the first place. They were just going to store everything and hope the insights would appear after some kind of critical mass had been achieved.

Preparing for your catch

To stretch the lake metaphor further: that’s like turning up at the lake, baiting your hook with whatever you have at the time, casting into an area you hope holds fish and waiting to see what happens. You don’t really know what’s lurking beneath the surface, whether your bait is attractive or even whether there are any fish there in the first place. Nor do you know if what you catch will be particularly appetising. Fly fishing is different – it’s about preparing the fly before you leave for the lake, looking for an individual fish (one you want to catch), carefully getting to a good casting location, casting to just the right spot at just the right moment and being able to eat whatever you land. So, back to the data lake – do you have clear objectives in mind about what you hope to extract from your lake? Are you adding the right kind of data, and in a way that ensures you can use it again afterwards? Do you have the right tackle and the skills to use it? At this point it’s also worth pointing out that some of the tastiest fish come from small rivers and streams.

If you are unsure of how to fish in your data lake, speak to a ghillie (a data expert).

Finally, I will leave you with another metaphor to ponder – have you ever heard of a case where adding more hay to the haystack made finding the needle easier?