Description
Course
data-engineering-zoomcamp
Question
How do I ensure that the ingestion pipeline (ingest_data.py) runs successfully? In what order do we need to build containers?
Answer
Step 1: Create a common network
Make sure you have created a common network (pg-network). Containers attached to the same network can reach each other by container name.
pg-network is the shared network on which you will run:
- the Postgres container
- the Dockerized script container (the container that runs your ingestion script)
- the pgAdmin container
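Running docker network create a second time fails because the network already exists, which is easy to hit when you repeat these steps. As a minimal sketch (the function name ensure_network is hypothetical and not part of the course material), the network can be created only when it is missing:

```shell
# Create the Docker network only if it does not exist yet.
# ensure_network is an illustrative helper name, not a Docker command.
ensure_network() {
  local name="${1:-pg-network}"
  # "docker network inspect" exits non-zero when the network is unknown.
  if docker network inspect "$name" >/dev/null 2>&1; then
    echo "network $name already exists"
  else
    docker network create "$name" >/dev/null && echo "network $name created"
  fi
}
```

Call it as ensure_network pg-network before starting any container.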
Command:
docker network create pg-network
Step 2: Run the Postgres container
Once you’ve created the network, start the containers one by one. First, run the Postgres container:
docker run -it \
-e POSTGRES_USER="root" \
-e POSTGRES_PASSWORD="root" \
-e POSTGRES_DB="ny_taxi" \
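-v ny_taxi_postgres_data:/var/lib/postgresql/data \
-p 5432:5432 \
--network=pg-network \
--name pgdatabase \
postgres:16
Remember, if postgres:18 causes issues, use postgres:16 as shown above. The volume path depends on the version: postgres:16 keeps its data in /var/lib/postgresql/data, while postgres:18 expects the volume to be mounted at /var/lib/postgresql.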
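The ingestion script needs a database that is already accepting connections, so it helps to confirm Postgres is ready before moving on. Below is a rough sketch of a readiness check. Run it from a second terminal, because the docker run -it command above keeps the first one attached. The function name wait_for_postgres and the WAIT_INTERVAL variable are made up for illustration; pg_isready is the client utility that ships with PostgreSQL and is assumed to be available inside the postgres image:

```shell
# Poll the Postgres container until it accepts connections.
# Usage: wait_for_postgres [container] [user] [db] [max_attempts]
# WAIT_INTERVAL (seconds between attempts, default 1) is a hypothetical knob.
wait_for_postgres() {
  local container="${1:-pgdatabase}"
  local user="${2:-root}"
  local db="${3:-ny_taxi}"
  local max_attempts="${4:-30}"
  local attempt=1
  while [ "$attempt" -le "$max_attempts" ]; do
    # pg_isready exits 0 once the server accepts connections.
    if docker exec "$container" pg_isready -U "$user" -d "$db" >/dev/null 2>&1; then
      echo "Postgres is ready (attempt $attempt)"
      return 0
    fi
    attempt=$((attempt + 1))
    sleep "${WAIT_INTERVAL:-1}"
  done
  echo "Postgres did not become ready after $max_attempts attempts" >&2
  return 1
}
```

With the defaults, wait_for_postgres checks the pgdatabase container from Step 2 and gives up after 30 attempts.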
Step 3: Build the Docker image for the pipeline
Ensure your current working directory is /pipeline (the directory that contains the Dockerfile). Then run this command to build the image:
docker build -t taxi_ingest:v001 .
Step 4: Run the ingestion container
Next, run this command to start a container from the image you just built:
docker run -it \
--network=pg-network \
taxi_ingest:v001 \
--pg_user=root \
--pg_pass=root \
--pg_host=pgdatabase \
--pg_port=5432 \
--pg_db=ny_taxi \
--year=2021 \
--month=1 \
--target_table=yellow_taxi_trips
Make sure the parameters in the command exactly match the ones defined in your script. For example, if your script defines the click option --pg_user, use --pg_user; if it defines something like --user instead, change the command above to use --user instead of --pg_user.
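One way to see which option names your script expects is to print its help text and extract the option names. This assumes the image's entrypoint passes its arguments straight to the click script (click adds a --help option by default). The helper names below are hypothetical:

```shell
# Print every long option found in the script's --help output, one per line.
# Assumes the image entrypoint forwards arguments to the click script.
list_ingest_options() {
  local image="${1:-taxi_ingest:v001}"
  docker run --rm "$image" --help | grep -oE -- '--[a-zA-Z0-9_-]+' | sort -u
}

# Succeed only if the given option (for example --pg_user) is listed.
# Usage: has_ingest_option OPTION [IMAGE]
has_ingest_option() {
  list_ingest_options "${2:-taxi_ingest:v001}" | grep -qxF -- "$1"
}
```

For example, has_ingest_option --pg_user && echo "use --pg_user" tells you whether the command above matches your script as written.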
Step 5 (Optional): Validate that your records reached the table
To check that your records really reached the Postgres table, run pgcli with the following command:
uv run pgcli -h localhost -p 5432 -u root -d ny_taxi
Once inside pgcli, run the following to list the tables:
\dt
Then, to check how many rows have been ingested, run the following:
SELECT COUNT(*) FROM yellow_taxi_trips;
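The same count can be fetched without an interactive session by running psql inside the Postgres container. This is a sketch that assumes the container name pgdatabase and the credentials from Step 2; count_rows is a hypothetical helper name:

```shell
# Print the number of rows in a table using psql inside the Postgres container.
# Assumes the container name, user, and database from Step 2.
count_rows() {
  local table="${1:-yellow_taxi_trips}"
  # -t prints tuples only and -A turns off alignment, so only the number is printed.
  docker exec pgdatabase psql -U root -d ny_taxi -t -A \
    -c "SELECT COUNT(*) FROM ${table};"
}
```

Running count_rows yellow_taxi_trips after Step 4 should print a single number; a value of 0 or an error about a missing table suggests the ingestion did not complete.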
Checklist
- I have searched existing FAQs and this question is not already answered
- The answer provides accurate, helpful information
- I have included any relevant code examples or links