Disruptive technology explained – Data Lake

From the AiM Next Generation Technologies team.

What is a Data Lake?

In an earlier post, we defined ‘big data’ as either a database with many rows of data, each with many attributes (columns), or a repository of different data types, for example databases, spreadsheets, documents and photographs. As well as the volume of data, the other key factor is usefulness, i.e. the data can be used to deliver business-related objectives. In this edition, let’s take it a step further and look at how this data is stored, introducing a couple more concepts along the way.

Big data is by definition big; it may contain petabytes of information. To put this in context, the storage on a mobile phone is measured in gigabytes, and a 256-gigabyte phone can handle a lot of media (films, video, music and photos). A petabyte is one million times bigger than a gigabyte, so we’re talking about huge volumes of data. Companies like Google, which provide cloud-based storage, can accrue petabytes of new data each day.
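To make the scale concrete, here is a small illustrative calculation. It assumes decimal units (1 gigabyte = 10^9 bytes, 1 petabyte = 10^15 bytes) and uses the 256-gigabyte phone from the example above; the variable names are our own.

```python
# Decimal storage units, in bytes (assumption: SI units, not binary GiB/PiB).
GIGABYTE = 10**9
PETABYTE = 10**15

# How many gigabytes fit in one petabyte?
ratio = PETABYTE // GIGABYTE

# How many completely full 256-gigabyte phones would hold one petabyte?
phones_per_petabyte = PETABYTE / (256 * GIGABYTE)

print(ratio)                       # 1000000
print(round(phones_per_petabyte))  # 3906
```

In other words, a single petabyte is roughly the combined storage of just under four thousand such phones.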

This data can be stored in several ways. The traditional method, a data warehouse, contains cleansed and structured data and allows it to be interrogated. A second option, a data lake, holds raw data of any type in whatever form it arrives, allowing users to define their own metrics later. The data warehouse has a larger upfront cost, because the data must be organised and the outputs determined in advance, whereas the data lake has a higher end cost, because the processing needed to find and display the data has not been done ahead of time.

Data lakes are adaptable and can be used in ways not envisaged at the outset. Imagine a large lake, but instead of water it’s made up of data: you can dip a bucket in and pull out an assortment of information, at which point you’ll need to work out how to decode it and fit it together.

Data lakes are storage areas for any data an organisation creates, but ‘buyer beware’: if you throw everything in without any thought, it will become virtually impossible to use constructively. At that point it becomes a data swamp, which provides little or no value.
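The trade-off between the two approaches can be sketched in a few lines of Python. This is a simplified illustration, not how any real warehouse or lake product works: the sample items, the function names (`load_warehouse`, `query_lake`, `find_notes`) and the ‘invoice’ record layout are all hypothetical.

```python
import json

# Hypothetical raw items, stored exactly as they arrived (the "lake").
raw_lake = [
    '{"type": "invoice", "client": "Acme", "amount": "1200.50"}',
    'client,amount\nBeta Ltd,300',
    '{"type": "invoice", "client": "Acme", "amount": "99.50"}',
    'meeting notes: call client on Monday',
]

def load_warehouse(items):
    """Warehouse style: cleanse and structure up front, keep only valid rows."""
    rows = []
    for item in items:
        try:
            record = json.loads(item)
        except json.JSONDecodeError:
            continue  # anything that does not fit the agreed format is left out
        if isinstance(record, dict) and record.get("type") == "invoice":
            rows.append({"client": record["client"],
                         "amount": float(record["amount"])})
    return rows

def query_lake(items, interpret):
    """Lake style: keep everything raw, decide how to read it at query time."""
    results = []
    for item in items:
        value = interpret(item)
        if value is not None:
            results.append(value)
    return results

def find_notes(item):
    """A question nobody planned for when the data was stored."""
    return item if item.startswith("meeting notes") else None

warehouse = load_warehouse(raw_lake)
total = sum(row["amount"] for row in warehouse)
notes = query_lake(raw_lake, find_notes)

print(total)  # 1300.0
print(notes)  # ['meeting notes: call client on Monday']
```

The warehouse answers its planned question (total invoiced) cheaply, but the meeting notes never made it in. The lake can still answer the unplanned question, at the cost of interpreting every raw item when the question is asked.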

