Kabardian Poetry ETL Pipeline

A project to scrape, transform and load data into a local PostgreSQL database and migrate it to the live Postgres DB hosted on AWS.

The main goal of this project is to populate the Kabardian Poetry Collection with poems in the Kabardian language - a potentially vulnerable* language spoken in the Northwest Caucasus which even ChatGPT struggles to recognise (see: ChatGPT response).

The poems collection site presents each poem together with its vocabulary: the words that the custom Python script was able to find in the kbd.wiktionary.org/wiki/ dictionary, along with their extracted translations.
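
As a rough illustration of that lookup (not the project's actual code), a dictionary check against the wiki might look like the sketch below, using requests and BeautifulSoup. The selector is a placeholder; the real parsing rules depend on the wiki's page markup.

    import requests
    from bs4 import BeautifulSoup

    WIKI_URL = "https://kbd.wiktionary.org/wiki/{word}"

    def look_up(word):
        """Return the entry text for `word`, or None if the wiki has no page for it."""
        response = requests.get(WIKI_URL.format(word=word), timeout=10)
        if response.status_code != 200:
            return None  # no dictionary entry found for this word
        soup = BeautifulSoup(response.text, "html.parser")
        # Placeholder selector: MediaWiki wraps article text in div.mw-parser-output;
        # the real script would target whatever element holds the translation.
        paragraph = soup.select_one("div.mw-parser-output p")
        return paragraph.get_text(strip=True) if paragraph else None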

The insert_data.py script performs the following steps to extract, transform and load data into the local database (a minimal sketch of the database checks follows the list):

  1. it extracts each poem's content from up to ten txt files saved in the poems_contents folder;
  2. it checks whether the poem is already in the database and skips it if so;
  3. it then splits each poem into separate words and checks whether each word can be found in the online dictionary;
  4. if a word is found, it checks whether the word is already in the database and, if not, adds it to the words pool;
  5. finally, it allocates the found words and their translations to the corresponding poem, so that they appear in the Vocabulary section of that poem's page.
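
A minimal sketch of the look-before-insert check in step 2, using psycopg2 and hypothetical table and column names (the project's real schema is shown in the ERD below):

    import psycopg2

    # Hypothetical local connection string and table names, for illustration only.
    conn = psycopg2.connect("dbname=your_local_db")

    def add_poem_if_missing(title, content):
        """Insert a poem unless one with the same title already exists (step 2)."""
        with conn.cursor() as cur:
            cur.execute("SELECT id FROM poems WHERE title = %s", (title,))
            row = cur.fetchone()
            if row:
                return row[0]  # already in the database; skip the insert
            cur.execute(
                "INSERT INTO poems (title, content) VALUES (%s, %s) RETURNING id",
                (title, content),
            )
            new_id = cur.fetchone()[0]
            conn.commit()
            return new_id

The same pattern would cover the words pool in step 4.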

Once the data is extracted and loaded into the local database, the contents of its tables are copied to the corresponding tables in the live PostgreSQL database by copy_to_aws.py.
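
A minimal sketch of that hand-off, assuming hypothetical environment variables for the two connection strings and hypothetical table names poems and words (the real copy_to_aws.py may differ):

    import os
    import subprocess

    # Hypothetical environment variables holding libpq connection strings.
    LOCAL_DB = os.environ["LOCAL_DATABASE_URL"]
    LIVE_DB = os.environ["LIVE_DATABASE_URL"]

    # Dump only the rows of the relevant tables from the local database ...
    dump = subprocess.run(
        ["pg_dump", "--data-only", "--table=poems", "--table=words", LOCAL_DB],
        check=True, capture_output=True, text=True,
    )

    # ... and replay the dump against the live database via psql.
    subprocess.run(["psql", LIVE_DB], input=dump.stdout, check=True, text=True)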

* classified by the UNESCO Atlas of the World's Languages in Danger.

Entity Relationship Diagram for Kabardian Poetry Collection

(ERD image: Kabardian Poetry Collection Postgres DB)

How It Works

  1. Poems are added to the txt files. (As Kabardian is a low-resource language, it is not easy to find many poems in it online, and processing ten poems at a time is usually enough.)
  2. Run copy_to_aws.py. It first runs insert_data.py to identify the vocabulary of each selected poem, scrape the words' translations from the online dictionary and add them, together with the poems, to the local Postgres database. It then uses pg_dump to copy the contents of the tables from the local database to the corresponding tables on the live database.
  3. Run the dbt models in the postgres_transformation project to rebuild the models.

Dependency Management With Python Poetry

Activating the Virtual Environment

Sources: Real Python, Poetry Docs

1. Activate Poetry Shell

To activate the virtual environment, create a nested shell with poetry shell. This will activate the environment and start a new shell session within it.

After running this command, you'll be inside the virtual environment, and any Python commands or scripts you run will use the dependencies installed in this environment.

   poetry shell

To deactivate the virtual environment and exit this new shell, type exit.

   exit

To deactivate the virtual environment without leaving the shell, use deactivate.

   deactivate

2. Directly Use the Virtual Environment (Without Activating the Shell)

If you don’t want to activate the entire shell but just want to run commands within the Poetry-managed virtual environment, you can prefix your commands with poetry run.

poetry run python insert_data.py

Or to run a dbt command:

poetry run dbt run

3. Keep Dependencies Updated

To update all dependencies to the latest versions allowed by pyproject.toml, run:

poetry update

DBT Installation With Python Poetry

1. Add dbt-core and dbt-postgres to your dependencies

poetry add dbt-core dbt-postgres

Verify installation:

poetry shell
dbt --version

2. Set Up a DBT Project and Navigate Into It

dbt init postgres_transformation
cd postgres_transformation

3. Configure the profiles.yml file to connect to PostgreSQL

dbt requires a profiles.yml file to connect to your PostgreSQL database. The file is usually located in ~/.dbt/. Configure it as follows; note that the schema value is read from the DBT_SCHEMA environment variable, with a fallback default:

kabardian_poems:
  outputs:
    dev:
      dbname: your_db_name
      host: your_host
      pass: your_password
      port: 5432
      schema: "{{ env_var('DBT_SCHEMA', 'your_dbt_schema') }}"
      threads: 4
      type: postgres
      user: your_user
  target: dev

4. Replace view with table in the dbt_project.yml file

+materialized: table  # was: view
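
For context, this setting normally sits under the models block of dbt_project.yml; a minimal sketch, assuming the project name created by dbt init above:

    models:
      postgres_transformation:
        +materialized: table  # the dbt default is view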

5. Run the dbt project with Python Poetry

poetry run dbt run

Retrieve recent web server logs for your Heroku app via CLI

  1. Log in to Heroku via terminal: heroku login
  • This command will open a browser window prompting you to enter your Heroku credentials (email and password).
  • Once logged in, you'll be redirected back to the CLI, where you should see a message indicating you're logged in.
  2. Retrieve recent logs with the --num flag
  • You can adjust 1500 to any number up to Heroku's current maximum.
    heroku logs --app your-app-name --num 1500
  3. Save logs to a file
    heroku logs --app your-app-name --num 1500 > recent-logs.txt
