Navigating the Azure Data Lake for Big, FAST data

In my recent presentation at the Microsoft AzureCon event, I spoke briefly about a new Azure service we’ve been working with called Azure Data Lake. I want to go into a little more detail about why this service is so important to Plexure and what it means for our customers.
 

Building BIG Data sets

One of the biggest challenges we’ve faced over the past few years is that, while cloud technology allows us to store massive amounts of data, the tools to query that data have been either very cumbersome or very expensive.

We know that a customer’s mobile phone on its own produces thousands of data points. Now multiply that by tens of millions of customers, add in data from other connected devices and external services like weather and traffic, and you’ll start to understand how big Big Data really is: we’re talking terabytes of new data stored every month. Historically this has gone into Blob storage.

However, data isn’t really useful if we can’t query it easily. While we have very smart ways of analyzing the data as it comes in, it can be very painful when we want to go back and answer new questions about that data later on. Internally we refer to Blob storage as “The Big Tape Drive in the Sky”: data is cheap and easy to store, but painful to do anything with.

Data Lake Store lets us capture and store very large data sets just as easily, but it also gives us the opportunity to analyze them later, and we don’t need to do any special manipulation of the data up front. The data can be structured if we want to store it that way, but it’s not required. Under the hood it is compatible with the industry-standard Hadoop Distributed File System (HDFS), but the team doesn’t need to worry about any setup or scaling.
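Storing data without shaping it up front is often called "schema on read". The following is a minimal Python sketch of that idea only, not of the Data Lake Store API: the event records, field names, and the `read_with_schema` helper are all hypothetical, and an in-memory buffer stands in for the store.

```python
import io
import json

# Hypothetical raw events, kept exactly as they arrived (no schema enforced on write).
RAW_EVENTS = io.StringIO(
    '{"device": "phone-1", "type": "location", "lat": -36.85, "lon": 174.76}\n'
    '{"device": "phone-2", "type": "purchase", "amount": 9.5}\n'
    '{"device": "phone-1", "type": "purchase", "amount": 4.0, "store": "AKL-01"}\n'
)


def read_with_schema(lines, fields, event_type):
    """Apply a schema at read time: keep one event type and project chosen fields."""
    for line in lines:
        record = json.loads(line)
        if record.get("type") == event_type:
            # Fields missing from a record come back as None instead of failing.
            yield {name: record.get(name) for name in fields}


# A question nobody planned for when the data was stored: total purchase value.
purchases = list(read_with_schema(RAW_EVENTS, ["device", "amount", "store"], "purchase"))
total_spend = sum(p["amount"] for p in purchases)
print(purchases)
print(total_spend)
```

The records have different shapes, yet nothing had to be decided about that when they were written; the structure is chosen by whoever asks the question later.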
 

Using big data sets

Now that we have the data stored, we need to be able to query it. We don’t want to have to worry about scaling the underlying infrastructure, and we want to use tools that are familiar to us. Data Lake Analytics comes with a plugin for Visual Studio and uses a language called U-SQL. It looks very similar to T-SQL, but it also supports the use of .NET libraries within the query. This is a complete game changer: you can reuse existing code, or easily write new code in familiar languages, to manipulate the data. That makes it easy for existing development teams to pick up these powerful new tools very quickly.
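The idea of calling ordinary application code from inside a SQL-style query can be shown by analogy. This is not U-SQL and not the Data Lake Analytics API; it is a small sketch using Python's built-in `sqlite3` module, where the `visits` table and the `distance_band` rule are made up for illustration.

```python
import sqlite3


def distance_band(metres):
    """Hypothetical business rule written as ordinary application code."""
    if metres < 500:
        return "near"
    if metres < 5000:
        return "local"
    return "far"


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE visits (customer TEXT, metres_from_store REAL)")
conn.executemany(
    "INSERT INTO visits VALUES (?, ?)",
    [("a", 120.0), ("b", 2300.0), ("c", 410.0), ("d", 18000.0)],
)

# Register the Python function so the SQL engine can call it by name.
conn.create_function("distance_band", 1, distance_band)

rows = conn.execute(
    "SELECT distance_band(metres_from_store) AS band, COUNT(*) "
    "FROM visits GROUP BY band ORDER BY band"
).fetchall()
print(rows)
```

The query stays declarative while the rule lives in code the team already knows how to write and test, which is the same appeal described above for .NET libraries in U-SQL.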
 

Using big data sets FAST

When it comes to historical data sets, we’re talking terabytes or even petabytes of information, so speed becomes a critical concern. Data science teams want to explore and experiment on data quickly; they don’t want to leave queries running for days while they wait for results. They also need to publish results in time to be useful for decision making: there’s no point running a query to tell us which burger deal will perform best with female millennials in New York this afternoon if that query takes 72 hours to run!

This is where Data Lake Analytics offers some real magic: it’s really fast. Out of the box it’s designed to run up to 1,000 processing threads concurrently for a single query (with options to increase this if needed), and Microsoft has worked to get processing and storage as close together as possible to improve speed even further. Choosing the level of parallelism, that is, the number of concurrent threads, is now as easy as moving the ‘magic slider’. You can also review a finished job to see where parallelism improved things and where there may be processing bottlenecks.
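To make the parallelism setting concrete, here is a minimal Python sketch of the general split, process, and merge pattern. It is an illustration under stated assumptions rather than how Data Lake Analytics works internally: the `parallel_count` helper, its `parallelism` parameter, and the sample events are hypothetical, and Python threads show the shape of the work rather than real speed-ups.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def count_partition(partition):
    """Work done independently on one slice of the data."""
    return Counter(event["offer"] for event in partition)


def parallel_count(events, parallelism):
    """Split events into `parallelism` slices, count each concurrently, then merge."""
    slices = [events[i::parallelism] for i in range(parallelism)]
    with ThreadPoolExecutor(max_workers=parallelism) as pool:
        partials = list(pool.map(count_partition, slices))
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


# Made-up events: every third one is for "fries", the rest are for "burger".
events = [{"offer": "burger" if i % 3 else "fries"} for i in range(9000)]
result = parallel_count(events, parallelism=8)
print(result)
```

Changing `parallelism` changes how many slices run at once but not the answer, which is what makes a single slider a safe control to expose.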

Future-proof data collection

One of the most common problems businesses face when working with Big Data is balancing the availability of data with the ability to use that data and the cost of storing it. With Azure Data Lake behind the Plexure platform, you get the power and capacity to collect and store a wide range of data; you can store data that is not currently being used; and you’re also free to call data up and work with it in the future – whenever and however it's needed.

Watch the full AzureCon demo to see this all in action!