Big Data End-to-End Pipeline Design on Major Cloud Platforms
A simple process for designing a pipeline looks like the flow below. Let's understand it with examples:
Ingest -> Store -> Process -> Serve.
Ingest => Get the data from multiple sources using an ingestion framework.
Example => AWS Glue, Azure Data Factory, Sqoop and NiFi
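As an illustration, here is a minimal sketch of an AWS Glue (PySpark) ingestion job that reads a source table from the Glue Data Catalog and lands it in S3 as Parquet. The database, table, bucket and prefix names are hypothetical placeholders, not part of the original pipeline.

import sys
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

# Standard Glue job boilerplate
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read a source table registered in the Glue Data Catalog (names are placeholders)
orders = glue_context.create_dynamic_frame.from_catalog(
    database="sales_db",
    table_name="raw_orders",
)

# Land the raw data in the S3 data lake as Parquet (bucket/prefix are placeholders)
glue_context.write_dynamic_frame.from_options(
    frame=orders,
    connection_type="s3",
    connection_options={"path": "s3://my-data-lake/raw/orders/"},
    format="parquet",
)

job.commit()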
Store => Since we are going to store a huge amount of data, we need a distributed/object store.
Example => HDFS, Amazon S3, Azure Blob Storage and Google Cloud Storage (GCS)
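Whichever store you pick, distributed engines address it through a filesystem URI, so the same write changes only in the path. A minimal PySpark sketch, assuming the relevant cloud connectors and credentials are configured (all paths are placeholders):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("store-demo").getOrCreate()
df = spark.createDataFrame([(1, "alpha"), (2, "beta")], ["id", "value"])

# The same Parquet write, pointed at different stores via the URI scheme
df.write.mode("overwrite").parquet("hdfs:///data/demo/")                                        # HDFS
# df.write.mode("overwrite").parquet("s3a://my-bucket/data/demo/")                              # Amazon S3
# df.write.mode("overwrite").parquet("abfss://container@account.dfs.core.windows.net/demo/")    # ADLS Gen2
# df.write.mode("overwrite").parquet("gs://my-bucket/data/demo/")                               # GCS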
Process => Since our data is sitting on multiple machines, traditional single-machine programming won't work; we need a distributed processing framework.
Example => Apache Spark, MapReduce
*Apache Spark can run on-premises, on Databricks, etc.
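For example, a small PySpark job that aggregates the raw data landed by the ingest step into daily revenue per product. The input/output paths and column names are assumptions for illustration:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("process-orders").getOrCreate()

# Read the raw data written by the ingest step (path and columns are placeholders)
orders = spark.read.parquet("s3a://my-data-lake/raw/orders/")

# Distributed transformation: daily revenue per product
daily_revenue = (
    orders
    .withColumn("order_date", F.to_date("order_ts"))
    .groupBy("order_date", "product_id")
    .agg(F.sum("amount").alias("revenue"))
)

# Write the curated result back to the lake for the serving layer
daily_revenue.write.mode("overwrite").parquet("s3a://my-data-lake/curated/daily_revenue/")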
Serve => The processed data now needs to be exposed to UIs and visualization tools for reporting.
Example => NoSQL DBs like HBase, MongoDB, DynamoDB, Cosmos DB, Cassandra, or even RDBMS databases.
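For instance, a short script that pushes the curated results into a NoSQL serving table. This sketch uses boto3 and DynamoDB; the table name, key schema and file path are assumptions, and it assumes pandas can read the Parquet output (pyarrow + s3fs installed):

import boto3
import pandas as pd

# Load the curated output produced by the processing step (path is a placeholder)
daily_revenue = pd.read_parquet("s3://my-data-lake/curated/daily_revenue/")

# Write each row into a DynamoDB table keyed by product_id/order_date (table is a placeholder)
table = boto3.resource("dynamodb").Table("daily_revenue")
with table.batch_writer() as batch:
    for row in daily_revenue.itertuples(index=False):
        batch.put_item(Item={
            "product_id": str(row.product_id),
            "order_date": str(row.order_date),
            "revenue": str(row.revenue),  # stored as a string here to avoid float-to-Decimal conversion
        })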
Let's see this pipeline across two major public cloud providers, AWS and Azure, plus an on-premises setup:
AWS :
Glue -> Amazon S3 -> Athena/Redshift -> DynamoDB
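In the AWS flavour, the query layer can be as simple as pointing Athena at the data in S3. A minimal sketch with boto3; the database, table and results bucket are hypothetical:

import boto3

athena = boto3.client("athena")

# Run a SQL query directly over the Parquet files in S3 (names are placeholders)
response = athena.start_query_execution(
    QueryString="SELECT order_date, SUM(revenue) AS revenue "
                "FROM curated.daily_revenue GROUP BY order_date",
    QueryExecutionContext={"Database": "curated"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)
print("Query execution id:", response["QueryExecutionId"])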
Azure :
Data Factory -> ADLS Gen2 -> Databricks -> Delta/Synapse
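On Azure, the Databricks step typically reads what Data Factory landed in ADLS Gen2 and writes Delta tables that Synapse or Power BI can query. A minimal PySpark sketch, assuming it runs inside Databricks where spark is already provided; the storage account, containers and paths are assumptions:

# Runs inside a Databricks notebook/job where `spark` is already available
raw = spark.read.parquet("abfss://raw@mydatalake.dfs.core.windows.net/orders/")

curated = raw.groupBy("order_date", "product_id").sum("amount")

# Persist as a Delta table for downstream querying (Synapse, Power BI, etc.)
curated.write.format("delta").mode("overwrite") \
       .save("abfss://curated@mydatalake.dfs.core.windows.net/daily_revenue/")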
On-Premises :
Sqoop/NiFi -> HDFS -> Apache Spark -> HBase/Hive
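In the on-premises flavour, Sqoop/NiFi land files on HDFS and Spark publishes the result as a Hive table. A minimal sketch; the paths, database and table names are assumptions:

from pyspark.sql import SparkSession

# Hive support lets Spark register the output as a queryable Hive table
spark = (SparkSession.builder
         .appName("onprem-pipeline")
         .enableHiveSupport()
         .getOrCreate())

# Read the files that Sqoop/NiFi landed on HDFS (path is a placeholder)
orders = spark.read.parquet("hdfs:///data/raw/orders/")

daily_revenue = orders.groupBy("order_date", "product_id").sum("amount")

# Publish as a Hive table so HiveQL users and BI tools can query it (names are placeholders)
daily_revenue.write.mode("overwrite").saveAsTable("curated.daily_revenue")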
Note => There are many other ways to design these pipelines. Some of them are listed below:
1. Glue --> S3 --> Apache Spark/Athena --> S3/DynamoDB
2. Data source --> ingest --> S3 --> Snowflake (through Snowpipe) --> Transformations in one layer of DB --> Power BI
Hope you enjoyed reading this pipeline design walkthrough. If you liked it, please Like, Comment and Share.
Thank You!