AWS Data Analytics Architecture – Serverless & DataLake centric

Author Shrikant Gourh
Published July 30, 2021

Below is an animated infographic that explains a data analytics architecture in basic language. It is based on the serverless AWS model, which is data lake centric. Follow the comments in the infographic for more details.

Summary:

It starts with data ingestion from third-party data sources via AWS serverless services, for example AWS SFTP (part of AWS Transfer Family) for SFTP file ingestion. The raw data can be in three formats. Unstructured: docx, pdf, images. Semi-structured: XML, JSON, YAML, etc. Structured: CSV, Parquet, etc.

The ingestion services deliver this data to the data lake, which in the AWS serverless model is S3, divided into three or more buckets. The first is the S3 bucket for raw data, which holds the data exactly as it arrives from ingestion. Next come one or more S3 buckets for staging data, and finally the S3 bucket for clean data.

For data processing and transformations we use AWS Glue, which can be orchestrated using AWS Step Functions. An AWS Glue job takes the raw data, converts it to staging data, and stores it in the S3 staging locations for further analysis or for loading into the data warehouse.

After cleaning, data in the S3 data lake can be read via AWS Athena using SQL queries. The final clean data is ready for consumption by the warehouse (AWS Redshift), AWS QuickSight (for business intelligence), and AWS SageMaker (for creating ML models), because both visualization and ML models require clean, structured data. For more info, please follow the comments in the infographic.
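The flow in the summary can be sketched in a few lines of Python. This is only an illustration under assumptions, not the setup behind the infographic: the bucket names, the key layout (`source/format/filename`), and the Glue job names are all hypothetical. The sketch classifies an incoming file into one of the three formats, builds its S3 location in a given zone, and produces a Step Functions definition that runs two Glue jobs in sequence (raw to staging, then staging to clean) using the Step Functions Glue integration `arn:aws:states:::glue:startJobRun.sync`, which waits for each job to finish. It makes no AWS calls.

```python
import json
from pathlib import PurePosixPath

# Hypothetical bucket names, one per data lake zone (raw, staging, clean).
ZONE_BUCKETS = {
    "raw": "example-datalake-raw",
    "staging": "example-datalake-staging",
    "clean": "example-datalake-clean",
}

# File extensions grouped by the three raw-data formats from the summary.
FORMAT_BY_EXTENSION = {
    ".docx": "unstructured", ".pdf": "unstructured",
    ".png": "unstructured", ".jpg": "unstructured",
    ".xml": "semi-structured", ".json": "semi-structured",
    ".yaml": "semi-structured",
    ".csv": "structured", ".parquet": "structured",
}


def classify_format(filename: str) -> str:
    """Return the format category for a file, or 'unknown' if unmapped."""
    suffix = PurePosixPath(filename).suffix.lower()
    return FORMAT_BY_EXTENSION.get(suffix, "unknown")


def s3_uri(zone: str, source: str, filename: str) -> str:
    """Build the S3 URI for a file in the given zone (assumed key layout)."""
    bucket = ZONE_BUCKETS[zone]
    return f"s3://{bucket}/{source}/{classify_format(filename)}/{filename}"


def glue_task(job_name: str, next_state: str = "") -> dict:
    """One Step Functions task that starts a Glue job and waits for it."""
    task = {
        "Type": "Task",
        "Resource": "arn:aws:states:::glue:startJobRun.sync",
        "Parameters": {"JobName": job_name},
    }
    if next_state:
        task["Next"] = next_state
    else:
        task["End"] = True
    return task


def build_state_machine(raw_to_staging_job: str, staging_to_clean_job: str) -> dict:
    """Amazon States Language definition chaining the two Glue jobs."""
    return {
        "Comment": "Raw -> staging -> clean, orchestrated by Step Functions",
        "StartAt": "RawToStaging",
        "States": {
            "RawToStaging": glue_task(raw_to_staging_job, "StagingToClean"),
            "StagingToClean": glue_task(staging_to_clean_job),
        },
    }


if __name__ == "__main__":
    # Where a CSV from a hypothetical vendor would land in the raw zone.
    print(s3_uri("raw", "vendor_a", "orders.csv"))
    # The JSON that could be supplied as a state machine definition.
    definition = build_state_machine("raw-to-staging-job", "staging-to-clean-job")
    print(json.dumps(definition, indent=2))
```

Keeping the zone in the bucket name and the format in the key prefix is one possible convention. It lets each Glue job read from one zone and write to the next without touching the raw copy, which matches the raw, staging, and clean separation described above.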