Data Analytics Architecture on Azure using Delta Lake
In this blog post, we will look at how to leverage Delta Lake to build a generalized data analytics architecture on Azure.
Delta Lake in brief
Delta Lake is an open-source storage layer that brings reliability to data lakes. Delta Lake provides ACID transactions, scalable metadata handling, and unifies streaming and batch data processing. Delta Lake runs on top of your existing data lake and is fully compatible with Apache Spark APIs. Delta Lake on Azure Databricks allows you to configure Delta Lake based on your workload patterns.
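To make that concrete, here is a minimal sketch of writing and reading a Delta table with PySpark on Azure Databricks, where the Delta Lake libraries come preinstalled; the storage account, container, and column names are placeholders, not a prescribed layout.

```python
# A minimal sketch: writing and reading a Delta table with PySpark.
# The ADLS path below is a hypothetical placeholder.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Create a small example DataFrame.
df = spark.createDataFrame(
    [(1, "sensor-a", 21.5), (2, "sensor-b", 19.8)],
    ["id", "device", "temperature"],
)

# Writing in Delta format gives the table ACID guarantees and a transaction log.
path = "abfss://lake@mystorageaccount.dfs.core.windows.net/bronze/readings"
df.write.format("delta").mode("overwrite").save(path)

# Reading it back uses the same Spark APIs as any other source.
readings = spark.read.format("delta").load(path)
readings.show()
```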
Leverage Delta Lake with Azure
As with any ETL pipeline, the journey starts with a wide variety of data, both structured and unstructured.
The very first step is to bring that data into Azure Data Lake Storage (ADLS).
- For real-time ingestion, we can use Kafka, Azure Event Hubs, or Azure IoT Hub, as sketched below.
- For batch ingestion, Azure Data Factory can be used to copy data in bulk into ADLS.
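As one illustration of the streaming path, here is a minimal sketch of reading from Azure Event Hubs through its Kafka-compatible endpoint with Spark Structured Streaming; the namespace, event hub name, and connection string are all hypothetical placeholders.

```python
# A sketch of streaming ingestion via Event Hubs' Kafka-compatible endpoint.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

connection_string = "<event-hubs-connection-string>"  # placeholder secret

raw_stream = (
    spark.readStream.format("kafka")
    # Event Hubs exposes a Kafka endpoint on port 9093.
    .option("kafka.bootstrap.servers", "my-namespace.servicebus.windows.net:9093")
    .option("subscribe", "device-telemetry")  # the event hub name acts as the topic
    .option("kafka.security.protocol", "SASL_SSL")
    .option("kafka.sasl.mechanism", "PLAIN")
    .option(
        "kafka.sasl.jaas.config",
        "org.apache.kafka.common.security.plain.PlainLoginModule required "
        f'username="$ConnectionString" password="{connection_string}";',
    )
    .load()
)
```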
In the second step, both batch and streaming data are combined and saved in Delta format using Azure Databricks.
- This table is an exact replica of the raw data, with no business logic applied.
- The data is then cleaned, transformed, and enriched using Azure Databricks, as sketched after this list.
- If required, we can also have a Gold table in our architecture to store curated data.
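Here is a minimal sketch of that raw-to-cleaned-to-curated flow, assuming a landing zone of JSON files; the paths, column names, and transformation rules are illustrative assumptions, not a prescribed schema.

```python
# A sketch of the raw -> cleaned -> curated flow described above.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
lake = "abfss://lake@mystorageaccount.dfs.core.windows.net"  # placeholder

# Step 2: land the combined raw data as-is in Delta format (no business logic).
raw = spark.read.json(f"{lake}/landing/readings/")
raw.write.format("delta").mode("append").save(f"{lake}/bronze/readings")

# Step 3: clean, transform, and enrich.
silver = (
    spark.read.format("delta").load(f"{lake}/bronze/readings")
    .dropDuplicates(["id"])
    .filter(F.col("temperature").isNotNull())
    .withColumn("reading_date", F.to_date("event_time"))
)
silver.write.format("delta").mode("overwrite").save(f"{lake}/silver/readings")

# Step 4 (optional): a Gold table with curated, business-level aggregates.
gold = (
    silver.groupBy("device", "reading_date")
    .agg(F.avg("temperature").alias("avg_temperature"))
)
gold.write.format("delta").mode("overwrite").save(f"{lake}/gold/daily_temperature")
```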
Azure Machine Learning, together with MLflow, can then be used to develop, train, and score ML models on this data.
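As a taste of that step, here is a minimal sketch of training and tracking a model with MLflow; the tiny in-memory dataset and the scikit-learn model are stand-ins for features that would really come from the curated Delta tables.

```python
# A minimal sketch of tracking a model run with MLflow.
# The data and model choice are illustrative stand-ins.
import mlflow
import mlflow.sklearn
import pandas as pd
from sklearn.linear_model import LinearRegression

data = pd.DataFrame(
    {"avg_temperature": [20.1, 21.3, 19.8, 22.0], "energy_use": [1.0, 1.4, 0.9, 1.6]}
)
X, y = data[["avg_temperature"]], data["energy_use"]

with mlflow.start_run():
    model = LinearRegression().fit(X, y)
    mlflow.log_param("model_type", "LinearRegression")
    mlflow.log_metric("r2", model.score(X, y))
    # Persist the model as a run artifact so it can be scored later.
    mlflow.sklearn.log_model(model, "model")
```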
The final step, after data preparation, is to send that data downstream to:
- Cosmos DB for real-time apps.
- Azure Synapse for BI and reporting.
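To close the loop, here is a minimal sketch of pushing a Gold table to Azure Synapse using the Synapse connector available on Azure Databricks; the JDBC URL, staging directory, and table name are placeholders, and the exact option names can vary across runtime versions.

```python
# A sketch of writing curated data to Azure Synapse for BI and reporting.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
lake = "abfss://lake@mystorageaccount.dfs.core.windows.net"  # placeholder

gold = spark.read.format("delta").load(f"{lake}/gold/daily_temperature")

(
    gold.write.format("com.databricks.spark.sqldw")
    .option("url", "jdbc:sqlserver://my-synapse.sql.azuresynapse.net:1433;database=dw")
    .option("tempDir", f"{lake}/tmp/synapse-staging")       # staging area for the load
    .option("forwardSparkAzureStorageCredentials", "true")  # reuse Spark's ADLS credentials
    .option("dbTable", "dbo.daily_temperature")
    .mode("overwrite")
    .save()
)
```

The Cosmos DB path is analogous: the same DataFrame can be written with the Azure Cosmos DB Spark connector to serve low-latency application queries.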