CI/CD for Data Engineers, Made Easy to Understand!

Let's say we are working on a Customer Analysis project and have been assigned the Jira ticket CA-111.

As developers, we would create a feature branch named after the ticket, such as
feature/CA-111, and work on it.
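
As a small illustration of this naming convention, here is a minimal Python sketch that derives the branch name from a ticket key. The helper name `feature_branch_name` and the validation rule are assumptions made for this example, not part of any Jira or Git tooling.

```python
import re


def feature_branch_name(ticket_id: str) -> str:
    """Return a branch name such as 'feature/CA-111' for a Jira-style ticket key."""
    key = ticket_id.strip().upper()
    # Assumed key format: project prefix, a hyphen, then a number (e.g. CA-111).
    if not re.fullmatch(r"[A-Z][A-Z0-9]*-\d+", key):
        raise ValueError(f"not a Jira-style ticket key: {ticket_id!r}")
    return f"feature/{key}"
```

For the ticket in this post, `feature_branch_name("CA-111")` gives `feature/CA-111`.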

As soon as we run git push and GitHub receives the new code, a pipeline should run with the following steps:

1. Build: Create a virtual environment and install all the dependencies (in the case of Python).
2. Test: Run unit tests and quality checks.
3. Package: Create a package, which can be as simple as a zip of the code.
4. Deploy: Send the code bundle to the edge node using SCP (Secure Copy).
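
To make the four steps concrete, here is a rough Python sketch of what such a pipeline might do under the hood. It is only an illustration: the file names (`requirements.txt`, `tests`, `src`, `bundle.zip`), the user and host `deploy_user@edge-node`, and the target directory are all hypothetical, the `.venv/bin/...` paths assume Linux or macOS, and pytest is assumed to be listed in the requirements. A real setup would normally express these steps in the automation server's own configuration.

```python
import subprocess
import sys

# Each stage is a name plus the commands it runs.
# All paths and the edge node address below are hypothetical placeholders.
STAGES = [
    ("build", [
        [sys.executable, "-m", "venv", ".venv"],
        [".venv/bin/pip", "install", "-r", "requirements.txt"],
    ]),
    ("test", [[".venv/bin/python", "-m", "pytest", "tests"]]),
    ("package", [["zip", "-r", "bundle.zip", "src"]]),
    ("deploy", [["scp", "bundle.zip", "deploy_user@edge-node:/opt/customer_analysis/"]]),
]


def run_pipeline(stages, run=subprocess.run):
    """Run the stages in order and stop at the first failing command.

    Returns the names of the stages that completed successfully.
    """
    completed = []
    for name, commands in stages:
        for cmd in commands:
            result = run(cmd)
            if result.returncode != 0:
                print(f"stage {name!r} failed: {' '.join(cmd)}")
                return completed
        completed.append(name)
    return completed
```

Calling `run_pipeline(STAGES)` runs build, test, package and deploy in that order. If the tests fail, nothing is packaged or deployed, which is the main point of running the steps as one pipeline.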

Doing all of this manually would be time-consuming and error-prone, so the steps above should run one after another as an automated pipeline. We can automate them using an automation server such as Jenkins.
So any time a new branch is created or a git push happens, the whole build -> test -> package -> deploy sequence should run without any manual involvement.

Now let us understand the branching structure, based on the different environments, that most data engineering projects follow.

Feature Branches -> Dev -> Test -> UAT -> main/Master (Prod)

Feature Branch -> This is where developers code their new changes. After testing the code properly, we merge it into the Development branch and then into the other branches, as below:
Dev -> Development branch
Test -> For the QA team to test
UAT -> For user acceptance testing
main/Master -> The production branch
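
The mapping above between long-lived branches and environments can be written down as a simple lookup. This is a minimal sketch: the lowercase branch names, the environment labels, and the function name `environment_for` are assumptions for illustration. Feature branches are left out because the post does not say which environment they deploy to.

```python
# Assumed branch names (lowercase) mapped to illustrative environment labels.
ENVIRONMENT_BY_BRANCH = {
    "dev": "development",
    "test": "qa",
    "uat": "user-acceptance",
    "main": "production",
    "master": "production",  # older repositories often use master instead of main
}


def environment_for(branch: str) -> str:
    """Return the environment a long-lived branch deploys to."""
    try:
        return ENVIRONMENT_BY_BRANCH[branch.lower()]
    except KeyError:
        raise ValueError(f"no environment configured for branch {branch!r}") from None
```

A pipeline could use a lookup like this to decide which edge node or configuration to deploy with, depending on the branch that triggered it.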

Feature branches are short-lived, which means that once a feature branch is merged into a higher branch, we can delete it. All the other branches are permanent. So, when we push code to a feature branch in GitHub, all 4 steps of the CI/CD pipeline will run.

For the feature branch, if all 4 steps run fine, we can raise a pull request to merge our code into the Dev branch. Once the reviewers approve and merge it, the automated pipeline should run again on the Dev branch.

In the same way, the pipeline should run whenever code is merged into Test, UAT and main/Master.
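
The promotion path described above (feature branch, then Dev, Test, UAT and finally main/Master) can be sketched as a small helper that answers "where does this branch merge next?". The lowercase branch names and the function name `next_target` are assumptions for this example.

```python
# Long-lived branches in promotion order (names are assumed, lowercase).
PROMOTION_ORDER = ["dev", "test", "uat", "main"]


def next_target(branch: str):
    """Return the branch that `branch` is merged into next, or None for production."""
    if branch.startswith("feature/"):
        # Feature branches always merge into the first long-lived branch.
        return PROMOTION_ORDER[0]
    position = PROMOTION_ORDER.index(branch)  # raises ValueError for unknown branches
    if position + 1 < len(PROMOTION_ORDER):
        return PROMOTION_ORDER[position + 1]
    return None
```

For example, `next_target("feature/CA-111")` returns `"dev"` and `next_target("uat")` returns `"main"`. Each merge along this path triggers the pipeline again on the target branch.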

Most companies follow a similar structure for CI/CD in data engineering projects.

As Data Engineers, we do not need to go too deep into it, but we should have a fair idea. Ideally, the entire process is already set up and we just have to follow it.


I tried my best to explain the CI/CD process in simple words, and I hope you liked the post. If you did, please Like, Comment and Share.

Thank You!
