As part of my training, I was assigned the role of a Data Engineer working on a data pipeline/ETL project. My main task was to extract data from a website, process it, and store it in a PostgreSQL database.
For this project, I built a web scraping tool to gather product data from Lazada, specifically focusing on running shoes, which are currently trending due to the growing interest in running and fitness.
This project helped me understand the real-world workflow of a Data Engineer, from data extraction and cleaning to storage and analysis.
- Scrape product data related to running shoes from Lazada.
- Clean and process the collected data.
- Store the structured data in a PostgreSQL database using pgAdmin4.
- Perform basic analysis to understand product distribution and popularity.
- Python: Main programming language
- Pandas: Data manipulation and analysis
- BeautifulSoup: HTML parsing for scraping static content
- Selenium: Automating browser actions and scraping dynamic content
- PostgreSQL: Database for storing the cleaned data
- pgAdmin4: GUI for PostgreSQL database management
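To illustrate how Selenium and BeautifulSoup work together here: Selenium renders the dynamic Lazada page and exposes its HTML via `driver.page_source`, and BeautifulSoup then parses that HTML into structured rows. The sketch below uses a saved HTML snippet in place of a live page, and the CSS class names are hypothetical placeholders, not Lazada's real markup.

```python
from bs4 import BeautifulSoup

# Stand-in for driver.page_source after Selenium has rendered the page.
# The class names (product-card, product-name, ...) are assumptions for
# illustration only.
page_source = """
<div class="product-card">
  <span class="product-name">Nike Pegasus 40</span>
  <span class="price">₱5,495</span>
  <span class="location">Metro Manila</span>
</div>
<div class="product-card">
  <span class="product-name">Adidas Adizero SL</span>
  <span class="price">₱4,200</span>
  <span class="location">Cebu</span>
</div>
"""

soup = BeautifulSoup(page_source, "html.parser")
products = []
for card in soup.select("div.product-card"):
    products.append({
        "Product_Name": card.select_one(".product-name").get_text(strip=True),
        "Price": card.select_one(".price").get_text(strip=True),
        "Seller Location": card.select_one(".location").get_text(strip=True),
    })

print(len(products))  # 2
```

The same loop applies unchanged to a real `driver.page_source`; only the selectors need to match the live site.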
The scraped data covers 10 pages of search results, resulting in 400 rows and 6 columns:
- Product_Name
- Price
- Seller Location
- Sold
- Rating
- Review
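Columns such as Price and Sold arrive as display strings and need cleaning before they can be stored or analyzed. A minimal Pandas sketch of that step is below; the raw string formats ("₱5,495", "1.2K sold") are assumptions about how Lazada displays these values, not guaranteed formats.

```python
import pandas as pd

# Toy rows mimicking the scraped schema; string formats are assumed.
raw = pd.DataFrame({
    "Product_Name": ["Nike Pegasus 40", "Adidas Adizero SL"],
    "Price": ["₱5,495", "₱4,200"],
    "Sold": ["1.2K sold", "350 sold"],
})

def parse_sold(text: str) -> int:
    """Turn '1.2K sold' / '350 sold' into an integer count."""
    value = text.replace("sold", "").strip()
    if value.endswith("K"):
        return int(float(value[:-1]) * 1000)
    return int(value)

# Strip the currency symbol and thousands separator, then cast to float.
raw["Price"] = (raw["Price"]
                .str.replace("₱", "", regex=False)
                .str.replace(",", "", regex=False)
                .astype(float))
raw["Sold"] = raw["Sold"].map(parse_sold)

print(raw["Price"].iloc[0])  # 5495.0
```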
By the end of this project, I was able to simulate a real-world ETL (Extract, Transform, Load) process and gain hands-on experience in:
- Building web scrapers with Selenium & BeautifulSoup
- Structuring and cleaning data with Pandas
- Using PostgreSQL for data storage
- Understanding the workflow of a data engineering project
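The load step of the ETL flow above can be sketched with `DataFrame.to_sql`. The project targets PostgreSQL managed through pgAdmin4; SQLite is used here only so the example runs standalone. For PostgreSQL you would instead pass a SQLAlchemy engine, e.g. `create_engine("postgresql://user:pass@localhost:5432/dbname")` with your own credentials.

```python
import sqlite3
import pandas as pd

# Cleaned sample row; column names follow the dataset schema above.
df = pd.DataFrame({
    "Product_Name": ["Nike Pegasus 40"],
    "Price": [5495.0],
    "Rating": [4.8],
})

# In-memory SQLite stands in for the PostgreSQL connection.
conn = sqlite3.connect(":memory:")
df.to_sql("running_shoes", conn, if_exists="replace", index=False)

rows = conn.execute("SELECT COUNT(*) FROM running_shoes").fetchone()[0]
print(rows)  # 1
```

`if_exists="replace"` recreates the table on each run, which suits repeated notebook experiments; use `"append"` when accumulating scraped batches.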
Check the notebooks folder for the Jupyter Notebook.
View the data folder for the raw and cleaned datasets.
This project is for educational purposes only. It complies with Lazada's terms of use and was not used for commercial purposes.