Kabardian Poetry ETL Pipeline

A project to scrape, transform and load data into a local PostgreSQL database and migrate it to the live Postgres DB hosted on AWS.

The main goal of this project is to populate the Kabardian Poetry Collection with poems in the Kabardian language - a potentially vulnerable* language spoken in the Northwest Caucasus which even ChatGPT struggles to recognise (see: ChatGPT response).

The poems collection site presents each poem together with its vocabulary: the words that the custom Python script was able to find in the kbd.wiktionary.org/wiki/ dictionary, along with their extracted translations.
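
As a rough illustration of that lookup (not the project's actual code), a dictionary check against the wiki might look like the sketch below, using requests and BeautifulSoup. The selector is a placeholder; the real parsing rules depend on the wiki's page markup.

    import requests
    from bs4 import BeautifulSoup

    WIKI_URL = "https://kbd.wiktionary.org/wiki/{word}"

    def look_up(word):
        """Return the entry text for `word`, or None if the wiki has no page for it."""
        response = requests.get(WIKI_URL.format(word=word), timeout=10)
        if response.status_code != 200:
            return None  # no dictionary entry found for this word
        soup = BeautifulSoup(response.text, "html.parser")
        # Placeholder selector: MediaWiki wraps article text in div.mw-parser-output;
        # the real script would target whatever element holds the translation.
        paragraph = soup.select_one("div.mw-parser-output p")
        return paragraph.get_text(strip=True) if paragraph else None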

The insert_data.py script performs the following steps to extract, transform and load data into the local database (a minimal sketch of the database checks follows the list):

  1. it extracts each poem's content from up to ten txt files saved in the poems_contents folder;
  2. it checks whether the poem is already in the database and skips it if so;
  3. it then splits each poem into separate words and checks whether each word can be found in the online dictionary;
  4. if a word is found, it checks whether the word is already in the database and, if not, adds it to the words pool;
  5. finally, it allocates the found words and their translations to the corresponding poem, so that they appear in the Vocabulary section of that poem's page.
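
A minimal sketch of the look-before-insert check in step 2, using psycopg2 and hypothetical table and column names (the project's real schema is shown in the ERD below):

    import psycopg2

    # Hypothetical local connection string and table names, for illustration only.
    conn = psycopg2.connect("dbname=your_local_db")

    def add_poem_if_missing(title, content):
        """Insert a poem unless one with the same title already exists (step 2)."""
        with conn.cursor() as cur:
            cur.execute("SELECT id FROM poems WHERE title = %s", (title,))
            row = cur.fetchone()
            if row:
                return row[0]  # already in the database; skip the insert
            cur.execute(
                "INSERT INTO poems (title, content) VALUES (%s, %s) RETURNING id",
                (title, content),
            )
            new_id = cur.fetchone()[0]
            conn.commit()
            return new_id

The same pattern would cover the words pool in step 4.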

Once the data is extracted and loaded into the local database, the contents of its tables are copied to the corresponding tables in the live PostgreSQL database by copy_to_aws.py.
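
A minimal sketch of that hand-off, assuming hypothetical environment variables for the two connection strings and hypothetical table names poems and words (the real copy_to_aws.py may differ):

    import os
    import subprocess

    # Hypothetical environment variables holding libpq connection strings.
    LOCAL_DB = os.environ["LOCAL_DATABASE_URL"]
    LIVE_DB = os.environ["LIVE_DATABASE_URL"]

    # Dump only the rows of the relevant tables from the local database ...
    dump = subprocess.run(
        ["pg_dump", "--data-only", "--table=poems", "--table=words", LOCAL_DB],
        check=True, capture_output=True, text=True,
    )

    # ... and replay the dump against the live database via psql.
    subprocess.run(["psql", LIVE_DB], input=dump.stdout, check=True, text=True)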

* classified by the UNESCO Atlas of the World's Languages in Danger.

Entity Relationship Diagram for Kabardian Poetry Collection

(ERD image: Kabardian Poetry Collection Postgres DB)

How It Works

  1. Poems are added to the txt files. (As Kabardian is a low-resource language, it is not easy to find many poems in it online, and processing ten poems at a time is usually enough.)
  2. Run copy_to_aws.py. It first runs insert_data.py to identify the vocabulary of each selected poem, scrape the words' translations from the online dictionary and add them, together with the poems, to the local Postgres database. It then uses pg_dump to copy the contents of the tables from the local database to the corresponding tables on the live database.
  3. Run the dbt models in the postgres_transformation project to rebuild the models.

Dependency Management With Python Poetry

Activating the Virtual Environment

Sources: Real Python, Poetry Docs

1. Activate Poetry Shell

To activate the virtual environment, create a nested shell with poetry shell. This will activate the environment and start a new shell session within it.

After running this command, you'll be inside the virtual environment, and any Python commands or scripts you run will use the dependencies installed in this environment.

   poetry shell

To deactivate the virtual environment and exit this new shell, type exit.

   exit

To deactivate the virtual environment without leaving the shell, use deactivate.

   deactivate

2. Directly Use the Virtual Environment (Without Activating the Shell)

If you don’t want to activate the entire shell but just want to run commands within the Poetry-managed virtual environment, you can prefix your commands with poetry run.

poetry run python insert_data.py

Or to run a dbt command:

poetry run dbt run

3. Keep Dependencies Updated

To update all dependencies to the latest versions allowed by pyproject.toml, run:

poetry update

DBT Installation With Python Poetry

1. Add dbt-core and dbt-postgres to your dependencies

poetry add dbt-core dbt-postgres

Verify installation:

poetry shell
dbt --version

2. Set Up a DBT Project and Navigate Into It

dbt init postgres_transformation
cd postgres_transformation

3. Configure the profiles.yml file to connect to PostgreSQL

dbt requires a profiles.yml file to connect to your PostgreSQL database. The file is usually located in ~/.dbt/. Configure it as follows; note that the schema value is read from the DBT_SCHEMA environment variable, with a fallback default:

kabardian_poems:
  outputs:
    dev:
      dbname: your_db_name
      host: your_host
      pass: your_password
      port: 5432
      schema: "{{ env_var('DBT_SCHEMA', 'your_dbt_schema') }}"
      threads: 4
      type: postgres
      user: your_user
  target: dev

4. Replace view with table in the dbt_project.yml file

+materialized: table  # was: view
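
For context, this setting normally sits under the models block of dbt_project.yml; a minimal sketch, assuming the project name created by dbt init above:

    models:
      postgres_transformation:
        +materialized: table  # the dbt default is view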

5. Run the dbt project with Python Poetry

poetry run dbt run

Retrieve recent web server logs for your Heroku app via CLI

  1. Log in to Heroku via terminal: heroku login
  • This command will open a browser window prompting you to enter your Heroku credentials (email and password).
  • Once logged in, you'll be redirected back to the CLI, where you should see a message indicating you're logged in.
  2. Retrieve recent logs with the --num flag
  • You can adjust 1500 to any number up to Heroku's current maximum.
    heroku logs --app your-app-name --num 1500
  3. Save logs to a file
    heroku logs --app your-app-name --num 1500 > recent-logs.txt
