Posts

Showing posts from December, 2023

CICD for Data Engineers with easy understanding!

Let's say we are working on a Customer Analysis project and have a Jira ticket, CA-111, assigned to us. As developers, we would create a feature branch named feature/CA-111 and work on it. As soon as we run git push and GitHub receives the new code, a pipeline should run with the following steps:

1. Build: Create a virtual environment and install all the dependencies (in the case of Python).
2. Test: Run unit test cases and quality checks.
3. Package: Create a package, which can be a zip of the code.
4. Deploy: Send the code bundle to the edge node using SCP (Secure Copy).

Doing all of this manually is time-consuming and error-prone, so the steps above should run one after another as an automated pipeline. We can automate them with an automation server such as Jenkins. Then, any time a new branch is created or a git push happens, the whole build -> test -> package -> deploy sequence runs without any manual involvement. Now let us understa
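The build -> test -> package -> deploy sequence described in this post can be sketched in a few lines of Python. This is only an illustration of the ordering and the stop-on-first-failure behaviour, not how Jenkins itself is configured. The bundle name, the edge node address, the src directory, the requirements.txt file, and the use of pytest as the test runner are all assumptions made for the example.

```python
import subprocess
from typing import Callable, List, Sequence, Tuple

# Hypothetical names, used only for illustration.
BUNDLE = "ca-111.zip"
EDGE_NODE = "user@edge-node:/opt/apps/"

Stage = Tuple[str, List[List[str]]]

# Each stage is a name plus the shell commands it runs, in order.
STAGES: List[Stage] = [
    ("build", [
        ["python", "-m", "venv", ".venv"],
        [".venv/bin/pip", "install", "-r", "requirements.txt"],
    ]),
    ("test", [[".venv/bin/python", "-m", "pytest"]]),  # pytest is an assumption
    ("package", [["zip", "-r", BUNDLE, "src"]]),
    ("deploy", [["scp", BUNDLE, EDGE_NODE]]),
]


def run_pipeline(
    stages: Sequence[Stage] = STAGES,
    runner: Callable[[List[str]], object] = subprocess.check_call,
) -> List[str]:
    """Run the stages in order and stop at the first failing command.

    Returns the names of the stages that completed successfully.
    """
    completed: List[str] = []
    for name, commands in stages:
        for cmd in commands:
            try:
                runner(cmd)
            except (subprocess.CalledProcessError, OSError) as exc:
                print(f"[{name}] failed: {exc}")
                return completed
        completed.append(name)
    return completed
```

Calling run_pipeline() with no arguments would execute the real commands, so a dry run can pass runner=print to list the commands instead. A failing test stage stops the run before anything is packaged or deployed, which is the main point of chaining the steps.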

Incremental Load Technique with CDC (Change Data Capture).

Change Data Capture (CDC): Incremental load with CDC is a strategy in data warehousing and ETL (Extract, Transform, Load) processes in which only changed or newly added data is loaded from the source systems into the target system. CDC is particularly useful when processing the entire dataset is impractical because of its size, or when real-time or near-real-time updates are essential.

Key Concepts for CDC: CDC is the process of identifying and capturing the changes made to source data since the last extraction, which means tracking inserts, updates, and deletes. Instead of reloading the entire dataset, only those changes are applied to the target system, which reduces processing time and resource usage.

Example with PySpark Code:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

1. Create a SparkSession:

spark = SparkSession.builder.appName("IncrementalLoadWithCDC").getOrCreat
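The core idea, applying only the captured inserts, updates, and deletes instead of reloading everything, can be shown without Spark. The sketch below uses plain Python dictionaries rather than PySpark DataFrames, and the record layout (operation, key, row) and the sample customer rows are assumptions made for illustration only.

```python
from typing import Dict, Iterable, Tuple

Row = Dict[str, object]
Change = Tuple[str, int, Row]  # (operation, primary key, new row values)


def apply_changes(target: Dict[int, Row], changes: Iterable[Change]) -> Dict[int, Row]:
    """Apply CDC change records, in order, to a copy of the target table."""
    result = dict(target)
    for op, key, row in changes:
        if op in ("insert", "update"):
            result[key] = row          # upsert the latest version of the row
        elif op == "delete":
            result.pop(key, None)      # ignore deletes for keys already gone
        else:
            raise ValueError(f"unknown operation: {op}")
    return result


# Hypothetical target table keyed by customer id.
target = {
    1: {"name": "Asha", "city": "Pune"},
    2: {"name": "Ravi", "city": "Delhi"},
}

# Changes captured since the last extraction.
changes = [
    ("update", 1, {"name": "Asha", "city": "Mumbai"}),
    ("delete", 2, {}),
    ("insert", 3, {"name": "Meera", "city": "Chennai"}),
]

updated = apply_changes(target, changes)
print(updated)
```

Only three change records are processed here, however large the target is, which is where the savings in time and resources come from.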