CLOUD DATA ENGINEERING

DATA ENGINEERING

This repository includes a roadmap for Data Engineering. Since Data Engineering is a broad field, we'll try to cover the following tools.

NOTE: This is a first draft of the course; we'll keep updating it.

Understanding Data Engineering

Before digging deep into the field of Data Engineering, one should know what Data Engineering is, what its scope is in 2024, and which tools a data engineer needs to know.

Python

Today’s data-driven organizations rely on efficient data engineering. As the demand for data increases, teams must be able to collect, process, and store extremely large volumes of data, and Python has emerged as a vital asset for this work. Teams use Python for data engineering tasks because of its flexibility, ease of use, and rich ecosystem of libraries and tools. That is why Python should be the first step toward Data Engineering. The following is the repo in which we covered Python programming.

NumPy, Pandas, Matplotlib

Data Engineering means working with data constantly, so basic Exploratory Data Analysis (EDA) skills are required. For that we'll cover NumPy, Pandas, and Matplotlib, with a particular focus on Pandas in this section.

SQL (PostgreSQL & T-SQL)

Once you know Python and basic data analysis, the next step in Data Engineering is to learn about databases and how to interact with them. SQL has been the most common language for databases and data warehousing for decades. So we'll cover SQL as the query language for databases, in two flavours: T-SQL and PostgreSQL.
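As a small illustration of querying a database with SQL, here is a minimal sketch. It uses Python's built-in `sqlite3` module instead of PostgreSQL or T-SQL, an assumption made only so the example runs without a database server. The `orders` table and its rows are hypothetical.

```python
import sqlite3

# In-memory SQLite database; a stand-in for a real PostgreSQL or SQL Server instance.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO orders (customer, amount) VALUES (?, ?)",
    [("alice", 120.0), ("bob", 75.5), ("alice", 30.0)],
)

# Aggregate total spend per customer, largest first.
query = """
    SELECT customer, SUM(amount) AS total
    FROM orders
    GROUP BY customer
    ORDER BY total DESC
"""
totals = conn.execute(query).fetchall()
conn.close()

for customer, total in totals:
    print(customer, total)
```

The `SELECT ... GROUP BY ... ORDER BY` statement itself is standard SQL, so the same query should carry over to PostgreSQL and T-SQL with little or no change.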

Snowflake

To summarize, we're targeting cloud data engineering, and Snowflake is a cloud data warehousing solution that the data industry has adopted extensively. Snowflake is worked with primarily through SQL, which makes it a natural next step after we have covered SQL (PostgreSQL & T-SQL).

  1. Batch 1: http://learn.snowflake.com/en/courses/uni-essdww101/
  2. Batch 2: http://learn.snowflake.com/en/courses/uni-ess-cmcw/
  3. Batch 3: http://learn.snowflake.com/en/courses/uni-ess-dabw/
  4. Batch 4: http://learn.snowflake.com/en/courses/uni-ess-dlkw/
  5. Batch 5: http://learn.snowflake.com/en/courses/uni-ess-dngw/

Bash/Shell Scripting & Linux Commands

Bash/Shell scripting and Linux commands are vital in a Cloud Data Engineering roadmap because they automate tasks such as data processing and infrastructure management. Proficiency also brings flexibility, troubleshooting skills, and compatibility with cloud platforms. Efficient resource usage helps optimize cost, and scripts can streamline version control and deployment processes.

  • Introduction to Shell

  • Introduction to Bash Scripting

  • Data processing in Shell

  • Project 4: Security Log Analysis. You're responsible for the security of a server, which involves monitoring a log file named security.log. This file records security-related events, including successful and failed login attempts, file access violations, and network intrusion attempts. Your goal is to analyze this log file to extract crucial security insights. Create a sample log file named security.log with the following format:

2024-03-29 08:12:34 SUCCESS: User admin login
2024-03-29 08:15:21 FAILED: User guest login attempt
2024-03-29 08:18:45 ALERT: Unauthorized file access detected
2024-03-29 08:21:12 SUCCESS: User admin changed password
2024-03-29 08:24:56 FAILED: User root login attempt
2024-03-29 08:27:34 ALERT: Possible network intrusion detected.
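
To make the expected analysis concrete, here is a minimal sketch of the kind of insight the project asks for. It is written in Python rather than with the shell tools this section teaches, and it is one possible approach, not the project's reference solution. The helper names `parse_line` and `summarize` are hypothetical, and the sketch assumes every line follows the `DATE TIME LEVEL: message` layout of the sample above.

```python
from collections import Counter

# Sample content copied from the project description.
SAMPLE_LOG = """\
2024-03-29 08:12:34 SUCCESS: User admin login
2024-03-29 08:15:21 FAILED: User guest login attempt
2024-03-29 08:18:45 ALERT: Unauthorized file access detected
2024-03-29 08:21:12 SUCCESS: User admin changed password
2024-03-29 08:24:56 FAILED: User root login attempt
2024-03-29 08:27:34 ALERT: Possible network intrusion detected.
"""


def parse_line(line):
    """Split one log line into (timestamp, level, message)."""
    date, time, rest = line.split(" ", 2)
    level, message = rest.split(": ", 1)
    return f"{date} {time}", level, message


def summarize(log_text):
    """Count events per level and collect the users with failed logins."""
    counts = Counter()
    failed_users = []
    for line in log_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        _, level, message = parse_line(line)
        counts[level] += 1
        if level == "FAILED":
            # Message is assumed to look like "User <name> login attempt".
            failed_users.append(message.split()[1])
    return counts, failed_users


counts, failed_users = summarize(SAMPLE_LOG)
print(dict(counts))
print(failed_users)
```

In the project itself the same counts can be obtained with shell tools such as `grep` and `wc`; the Python version only shows what the result should look like.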

Docker for Data Engineering

Docker is integral to a Cloud Data Engineering roadmap for its ability to package data engineering environments into portable containers. This ensures consistency across development, testing, and production stages, and makes it easier to deploy and scale data pipelines. Containers are lightweight, so they make efficient use of cloud infrastructure. Docker also promotes collaboration by simplifying the sharing of reproducible environments among team members, which improves productivity in data engineering workflows.

Airflow

When we have a data pipeline that we want to trigger on a daily basis, we need an orchestration tool that can schedule and run it automatically. Airflow is a widely adopted choice for this purpose, which is why we have Airflow in our roadmap.
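
The core idea an orchestrator automates, running pipeline steps in dependency order, can be sketched with the Python standard library alone. This is not Airflow code and does not use the Airflow API; it is a simplified illustration, and the task names `extract`, `transform`, and `load` are hypothetical. It assumes Python 3.9 or later for `graphlib`.

```python
from graphlib import TopologicalSorter

run_log = []  # records the order in which tasks actually ran


def extract():
    run_log.append("extract")


def transform():
    run_log.append("transform")


def load():
    run_log.append("load")


tasks = {"extract": extract, "transform": transform, "load": load}

# Map each task to the set of tasks it depends on.
dependencies = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
}

# static_order() yields task names so that every dependency comes first.
for name in TopologicalSorter(dependencies).static_order():
    tasks[name]()

print(run_log)
```

A real orchestrator such as Airflow adds what this sketch leaves out: scheduling (for example, a daily trigger), retries, and monitoring.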

Kafka

When data arrives in real time and the end destination is not ready to consume it, or a disaster occurs, we lose that data. This introduces the need for a decoupling tool that separates the producer end of the data from the consumer end and acts as a mediator between them. Kafka fills that role.
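
The decoupling idea can be illustrated with a toy in-memory log. This is not Kafka and does not use any Kafka client library; `TopicLog` and `Consumer` are hypothetical classes that only mimic the idea of an append-only log read by a consumer that tracks its own offset.

```python
class TopicLog:
    """Append-only in-memory log; a toy stand-in for a topic."""

    def __init__(self):
        self._records = []

    def append(self, record):
        self._records.append(record)

    def read_from(self, offset):
        """Return all records at or after the given offset."""
        return self._records[offset:]


class Consumer:
    """Tracks its own offset so it can resume after downtime."""

    def __init__(self, log):
        self.log = log
        self.offset = 0
        self.received = []

    def poll(self):
        batch = self.log.read_from(self.offset)
        self.received.extend(batch)
        self.offset += len(batch)
        return batch


topic = TopicLog()
consumer = Consumer(topic)

topic.append("event-1")
consumer.poll()  # consumer is up and reads event-1

# Consumer is "down": the producer keeps writing and nothing is lost.
topic.append("event-2")
topic.append("event-3")

recovered = consumer.poll()  # consumer comes back and catches up
print(recovered)
```

Because the producer writes to the log rather than directly to the consumer, the two ends never need to be available at the same time.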

AWS

AWS is crucial in a Cloud Data Engineering roadmap due to its comprehensive suite of services tailored for data processing, storage, and analytics. Leveraging AWS allows data engineers to build scalable and cost-effective data pipelines using services like S3, Glue, and EMR. Integration with other AWS services enables advanced analytics, machine learning, and real-time processing capabilities, empowering data engineers to derive valuable insights from data. Furthermore, AWS certifications validate expertise in cloud data engineering, enhancing career prospects and credibility in the industry.