Big Data End-to-End Pipeline Design on Major Cloud Platforms

A simple process for designing a pipeline can look like the one below. Let's understand it with examples:

Ingest -> Store -> Process -> Serve.

Ingest => Get the data from multiple sources using an ingestion framework.

Example => AWS Glue, Azure Data Factory, Sqoop and NiFi
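
To make the ingest step concrete, here is a minimal sketch that pulls a table out of a relational source with a plain Spark JDBC read. The connection URL, database, table and credentials (source-db, sales, orders, etl_user) are placeholders for illustration, not tied to any specific tool above:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ingest-example").getOrCreate()

# Read one table from a hypothetical relational source over JDBC.
# Requires the matching JDBC driver jar on the Spark classpath.
orders_raw = (
    spark.read.format("jdbc")
    .option("url", "jdbc:mysql://source-db:3306/sales")  # placeholder host/database
    .option("dbtable", "orders")                         # placeholder table
    .option("user", "etl_user")                          # placeholder credentials
    .option("password", "etl_password")
    .load()
)
```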

Store => Since we are going to store a huge amount of data, we need a distributed/object store.

Example => HDFS, Amazon S3, Azure blob storage and Google Cloud Storage (GCS)
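
Continuing the same sketch, landing the ingested data in the store is just a write to the store's URI. The bucket/path below is a placeholder and the order_date partition column is assumed for illustration; the same call works against HDFS or ADLS Gen2 by changing the scheme:

```python
# Land the raw data as Parquet in object storage.
# s3a:// needs the hadoop-aws connector; hdfs:// or abfss:// work the same way.
(
    orders_raw.write
    .mode("append")
    .partitionBy("order_date")                  # assumed column, for illustration
    .parquet("s3a://my-data-lake/raw/orders")   # placeholder bucket/path
)
```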

Process => Since our data is sitting on multiple machines, traditional programming styles won't work; we need a distributed processing framework.

Example => Apache Spark, MapReduce

*Apache Spark can run on-premises, on Databricks, etc.
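
As a minimal example of the processing step, the sketch below computes a daily revenue aggregate with Spark. The column names (order_date, country, amount) and the input path are assumptions carried over from the earlier sketches:

```python
from pyspark.sql import functions as F

# Aggregate the raw data across the cluster: total revenue per day and country.
daily_revenue = (
    spark.read.parquet("s3a://my-data-lake/raw/orders")   # placeholder path
    .groupBy("order_date", "country")
    .agg(F.sum("amount").alias("total_revenue"))
)
```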

Serve => The processed data now needs to be served to a UI or visualization layer for reporting.

Example => NoSQL databases like HBase, MongoDB, DynamoDB, Cosmos DB and Cassandra, or even relational (RDBMS) databases.
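
A hedged sketch of the serve step: publish the aggregate to a relational reporting database over JDBC so a BI or UI tool can query it. The URL, table and credentials are placeholders; a NoSQL sink would use its own Spark connector instead:

```python
# Write the processed result to a serving store (placeholder reporting database).
# Requires the PostgreSQL JDBC driver on the Spark classpath.
(
    daily_revenue.write
    .format("jdbc")
    .option("url", "jdbc:postgresql://reporting-db:5432/analytics")
    .option("dbtable", "daily_revenue")
    .option("user", "report_user")
    .option("password", "report_password")
    .mode("overwrite")
    .save()
)
```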

Let's see this pipeline across two major public cloud providers, AWS and Azure, as well as on-premises (a minimal on-premises Spark sketch follows the list):

AWS :
Glue -> Amazon S3 -> Athena/Redshift -> DynamoDB

Azure : 
Data Factory -> ADLS Gen2 -> Databricks -> Delta/Synapse

On Premise :
Sqoop/NiFi -> HDFS -> Apache Spark -> HBase/Hive
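
For the on-premises row, here is a small end-to-end sketch under the assumption that Sqoop/NiFi has already landed Parquet files in HDFS: Spark processes them and serves the result as a Hive table. All paths, columns and table names are placeholders:

```python
from pyspark.sql import SparkSession, functions as F

# On-premises flavour of Ingest -> Store -> Process -> Serve:
# data already landed in HDFS, processed with Spark, served from Hive.
spark = (
    SparkSession.builder
    .appName("onprem-pipeline-example")
    .enableHiveSupport()
    .getOrCreate()
)

orders = spark.read.parquet("hdfs:///data/raw/orders")   # placeholder HDFS path
summary = (
    orders.groupBy("order_date")                         # assumed column
    .agg(F.count("*").alias("order_count"))
)
summary.write.mode("overwrite").saveAsTable("reporting.daily_order_counts")
```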

Note => There can be many other ways to design these pipelines. Some of them are listed below:
1.  Glue --> S3 --> Apache Spark/Athena --> S3/DynamoDB
2.  Data source --> ingest --> S3 --> Snowflake (through Snowpipe) --> transformations in one layer of the DB --> Power BI (see the Snowpipe sketch below)
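
For option 2, loading from S3 into Snowflake is typically done by pointing a Snowpipe at an external stage. The sketch below only shows the rough shape of that setup from Python; the account, stage, table and credential names are placeholders, and the S3 event-notification wiring that actually triggers auto-ingest is not shown:

```python
import snowflake.connector

# Connect to a hypothetical Snowflake account (all names are placeholders).
conn = snowflake.connector.connect(
    account="my_account",
    user="etl_user",
    password="etl_password",
    warehouse="ETL_WH",
    database="ANALYTICS",
    schema="RAW",
)

# Create a pipe that copies new files from an external S3 stage into a table.
conn.cursor().execute(
    """
    CREATE PIPE IF NOT EXISTS RAW.ORDERS_PIPE
      AUTO_INGEST = TRUE
      AS COPY INTO RAW.ORDERS
         FROM @RAW.ORDERS_STAGE
         FILE_FORMAT = (TYPE = 'PARQUET')
    """
)
```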

Hope you enjoyed reading this overview of pipeline design. If you liked it, please like, comment and share.


Thank You!
