Plexure data architecture: making magic happen with Azure
One of the big 'aha' moments our customers experience when we start talking to them about the Plexure platform comes when they first see our Power BI dashboards correlating consumer activity with POS data. For many marketers it's a magical moment: a complete closed-loop representation of marketing campaign activity (offers, ads, push messages), combined with the consumer response to that activity (offer impressions, clicks, redemptions), put alongside real POS data, so they can see very clearly the correlation between campaigns, consumer activity and sales figures. It's a view no traditional form of marketing can really offer, and it's only possible today with the holy trinity of IoT, cloud computing and big data. To provide a peek under the hood of the Plexure platform – how we take data from mobile devices, process it and expose it in Power BI dashboards – I thought it worth sketching out the flow of data through the platform, highlighting the various Microsoft Azure products we use to make the magic happen.
Dealing with data in real time: Azure Stream Analytics
Starting with the easy stuff first – the offer and ad content is taken straight out of the core part of our content database, along with cross reference data we get during consumer registration on the mobile device.
It starts to get a bit more interesting when you consider how we ingest activity data – activities in Plexure are the numerous interactions consumers have with the app and Plexure content, such as viewing an offer, clicking an advertisement or redeeming an offer. The reason this is more interesting is the sheer volume of data that comes in through this API. We've seen upwards of 7,000 requests per second against our Activity API, where each request is typically a batch of a few – sometimes many more – records of consumer interaction with the device. That's a lot of data – at peak times we can accumulate more than 40GB of raw activity data per hour through this API.
Getting meaningful information out of this kind of firehose is a well-known big data problem, and one which Microsoft has solved with Azure Stream Analytics. We're working at such large volumes that it's not really practicable to accumulate a bunch of data and then go through it to pick out the bits you want – by the time you've finished querying it you've already accumulated more data than you can handle. Instead, Stream Analytics allows us to write SQL-like queries that operate on windows of data – specific time intervals over which you want to do some aggregation. You can find more information about Stream Analytics here – this video is a little old now but it provides a great high-level overview.
For the reporting challenge, we use this to output files to blob storage containing only the specific activities we're interested in, and this happens in real time in Stream Analytics as the activities arrive through our APIs.
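To make that concrete, here's a minimal sketch of the kind of windowed Stream Analytics query this involves. The input/output aliases, field names and activity type values below are invented for illustration – they aren't our real schema:

```sql
-- Count selected activity types per offer in fixed five-minute windows,
-- writing the aggregates straight out to blob storage as they complete.
SELECT
    OfferId,
    ActivityType,
    COUNT(*) AS ActivityCount,
    System.Timestamp AS WindowEnd
INTO
    [activity-blob-output]          -- blob storage output
FROM
    [activity-api-input]            -- stream fed by the Activity API
    TIMESTAMP BY EventTime          -- use the event's own time, not arrival time
WHERE
    ActivityType IN ('OfferImpression', 'OfferClick', 'OfferRedemption')
GROUP BY
    OfferId,
    ActivityType,
    TumblingWindow(minute, 5)       -- non-overlapping five-minute windows
```

The key idea is the tumbling window: rather than querying an ever-growing accumulation of raw events, each five-minute slice is aggregated and emitted as the stream flows past.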
Bringing it all together: Azure Data Factory
The next step is to bring all the data together and prepare it for some processing. Apart from the core data from the platform, and the activity data from Stream Analytics, we also bring in customer-specific POS data (I say customer-specific as its format is usually specific to the POS technology used by our customer). Currently this is brought in through an SFTP service running on an Azure Virtual Machine; however, we're reviewing this to see how we can come up with an industrial-strength 'data clearing house' common to our product, to make the customer-specific integration easier.
To orchestrate all the data transfer, and to provide a good Ops story for monitoring, we make the most of pipelines in Azure Data Factory. The pipelines are basically operations on data, fed by a source and outputting to a destination – in this case blobs, tables and Azure SQL Database as sources, and Azure Data Lake as the destination. One great feature of Azure Data Factory worth mentioning is its concept of a data slice: it splits long-running data operations into slices, which means that should the process fail for some reason, individual slices can be re-run (once the root cause has been found and fixed) rather than re-running the entire process. This is a classic problem with any long-running ETL process, so having a tidy way of dealing with it is great. For a good overview of Azure Data Factory, check out this video.
Storing BIG data: Azure Data Lake
The next stop on the magic data bus is Azure Data Lake. The concept of a 'data lake' has emerged over the last couple of years as a new kind of component for processing big data, and exists to work on large sets of data from different kinds of storage, as opposed to a data warehouse which is predominantly about large sets of data in a carefully structured relational store. Software design luminary Martin Fowler describes the data lake concept here. Microsoft has created Azure Data Lake to realize this concept as a PaaS offering in the cloud, and it is very powerful.
Coming in two parts – Azure Data Lake Store and Azure Data Lake Analytics – it's based on a solid Hadoop foundation and allows developers to use a new unified query language (U-SQL) to bring together data sources of different kinds at very large scales (we're talking petabytes here), undertake aggregation and composition tasks, and run it all on a robust processing platform. The developer can pick how many nodes to run Data Lake jobs on simply by moving a slider, so you can run a long-running task on a few nodes, or scale to thousands and finish the task in just a few minutes – all without any concern for the underlying infrastructure. As far as big data goes, it doesn't get much better than that for a true cloud-hosted 'big data as a service' product. You can find an introduction to Azure Data Lake here.
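For a flavour of what U-SQL looks like, here's a minimal sketch of a job that rolls raw activity files up into a daily summary. The paths, columns and names are invented for this illustration, not our real schema:

```sql
// Read the raw activity files that Stream Analytics dropped into storage.
@activities =
    EXTRACT CustomerId   string,
            OfferId      string,
            ActivityType string,
            EventTime    DateTime
    FROM "/activities/{*}.csv"
    USING Extractors.Csv(skipFirstNRows: 1);

// Project out the calendar date using a C# expression (U-SQL embeds C#).
@daily =
    SELECT OfferId,
           ActivityType,
           EventTime.Date AS ActivityDate
    FROM @activities;

// Aggregate to one row per offer, activity type and day.
@summary =
    SELECT OfferId,
           ActivityType,
           ActivityDate,
           COUNT(*) AS ActivityCount
    FROM @daily
    GROUP BY OfferId, ActivityType, ActivityDate;

OUTPUT @summary
TO "/output/activity-summary.csv"
USING Outputters.Csv(outputHeader: true);
```

The same script runs unchanged whether you assign it a couple of nodes or hundreds – the parallelism is just a dial on the job.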
Check out Plexure CTO David Inggs' presentation at Azure Con for more on how VMob makes use of Azure and big data analytics: https://vimeo.com/141478074
We use Data Lake to bring together the different source data pumped in by Azure Data Factory and prepare it for loading into Azure SQL Data Warehouse.
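One common way to load Data Lake output into SQL Data Warehouse is PolyBase: define an external table over the files, then use CREATE TABLE AS SELECT to land the data in the warehouse. The sketch below assumes an external data source (`AzureDataLakeStore`) and file format (`CsvFormat`) have already been created, and all names are illustrative:

```sql
-- External table over the summary files Data Lake produced (names illustrative).
CREATE EXTERNAL TABLE ext.ActivitySummary
(
    OfferId       VARCHAR(50),
    ActivityType  VARCHAR(50),
    ActivityDate  DATE,
    ActivityCount BIGINT
)
WITH
(
    LOCATION    = '/output/activity-summary.csv',
    DATA_SOURCE = AzureDataLakeStore,   -- assumed pre-existing external data source
    FILE_FORMAT = CsvFormat             -- assumed pre-existing CSV file format
);

-- CTAS lands the data in the warehouse proper, hash-distributed so that
-- joins on OfferId stay local to a distribution.
CREATE TABLE dbo.ActivitySummary
WITH (DISTRIBUTION = HASH(OfferId), CLUSTERED COLUMNSTORE INDEX)
AS
SELECT * FROM ext.ActivitySummary;
```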
Keeping customer data safe: Azure SQL Data Warehouse & Azure Resource Manager
One of the reasons we believe Azure SQL Data Warehouse fits so well into our product roadmap is its billing model – you pay only for compute and storage, not for having an instance of a SQL Data Warehouse provisioned – so we believe having a data warehouse per customer is doable. Azure Resource Manager support for SQL Data Warehouse provisioning means we can easily integrate setup and configuration management into our DevOps pipeline. Having a warehouse per customer also supports our philosophy that the data belongs to our customers (we don't do any kind of aggregation across customer datasets) – each customer's data is physically isolated from the others', which helps meet our customers' compliance requirements.
Having piped in the data from Data Lake, we're now able to optimize schemas and queries in the warehouse, and point reporting tools like Power BI at it, exposing the richness of the data in a familiar reporting interface. This provides our customers' data experts with an excellent surface to draw their own insights from all the many data points we've collected from consumer interaction, engagement and sales data.
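As a sketch of the closed loop itself, a Power BI report might sit on a query of roughly this shape – the table and column names here are invented for illustration:

```sql
-- Campaign performance side by side with sales: impressions and redemptions
-- from the activity summary, sales from the POS data, one row per campaign.
SELECT
    c.CampaignName,
    act.Impressions,
    act.Redemptions,
    pos.Sales
FROM dbo.Campaigns c
JOIN (
    -- Roll activity counts up to campaign level.
    SELECT o.CampaignId,
           SUM(CASE WHEN a.ActivityType = 'OfferImpression' THEN a.ActivityCount ELSE 0 END) AS Impressions,
           SUM(CASE WHEN a.ActivityType = 'OfferRedemption' THEN a.ActivityCount ELSE 0 END) AS Redemptions
    FROM dbo.ActivitySummary a
    JOIN dbo.Offers o ON o.OfferId = a.OfferId
    GROUP BY o.CampaignId
) act ON act.CampaignId = c.CampaignId
JOIN (
    -- Roll POS sales up to campaign level separately, to avoid double-counting.
    SELECT o.CampaignId, SUM(p.SaleAmount) AS Sales
    FROM dbo.PosSales p
    JOIN dbo.Offers o ON o.OfferId = p.OfferId
    GROUP BY o.CampaignId
) pos ON pos.CampaignId = c.CampaignId;
```

Aggregating activities and sales in separate derived tables before joining keeps each measure correct; joining the raw fact tables directly would multiply rows and inflate the sums.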
And that's the story of connecting millions of data points from mobile devices, combining them with campaign activity and all the POS data our customers can provide, through to visualization that exposes deep insights to our customers' marketers and BI professionals. Now that we have the rails in place, we can start layering more data into the data warehouse, and adding more supported queries, to make those Power BI reports even more magical.