Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[#42] Add Trino use case in the Jupyter notebook #43

Merged
merged 3 commits into from
May 28, 2024
Merged
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
10 changes: 10 additions & 0 deletions README.md
Original file line number Diff line number Diff line change
@@ -62,6 +62,8 @@ cd gravitino-playground

## Experiencing Gravitino with Trino SQL

### Using Trino CLI in Docker Container

1. Log in to the Gravitino playground Trino Docker container using the following command:

```shell
@@ -74,6 +76,14 @@ docker exec -it playground-trino bash
trino@container_id:/$ trino
```

### Using Jupiter Notebook

1. Open the Jupyter Notebook in the browser at [http://localhost:8888](http://localhost:8888).

2. Open the `gravitino-trino-example.ipynb` notebook.

3. Start the notebook and run the cells.

## Example

### Simple queries
312 changes: 312 additions & 0 deletions init/jupyter/gravitino-trino-example.ipynb
Original file line number Diff line number Diff line change
@@ -0,0 +1,312 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "98c657d7-ef2b-49cf-99de-9a307b4f5d93",
"metadata": {},
"source": [
"## Gravitino Trino Example\n",
"\n",
"In this example, we will use `Jupyter` and the `Trino Python Client` to experience `Gravitino`."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a0dbfc33-8811-4b8e-a43b-24c0a4c934af",
"metadata": {},
"outputs": [],
"source": [
"# install trino python client and pandas\n",
"%pip install trino pandas"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "fb9e0bda-c0bf-4aaf-9094-bb14649e688c",
"metadata": {},
"outputs": [],
"source": [
"from trino.dbapi import connect\n",
"\n",
"# Create a Trino connector client\n",
"conn = connect(\n",
" host=\"trino\",\n",
" port=8080,\n",
" user=\"admin\",\n",
" catalog=\"catalog_hive\",\n",
" schema=\"http\",\n",
")\n",
"\n",
"trino_client = conn.cursor()"
]
},
{
"cell_type": "markdown",
"id": "dd240dfe-f4a9-4fa0-8366-82a8df34273d",
"metadata": {},
"source": [
"## Prepare\n",
"\n",
"Creates a schema named `catalog_hive.company` in Hive, with its location set to`hdfs://hive:9000/user/hive/warehouse/company.db` on HDFS."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "85eea564-cb4f-476d-a373-fc71b5086f24",
"metadata": {},
"outputs": [],
"source": [
"trino_client.execute(\"\"\"\n",
"CREATE SCHEMA catalog_hive.company\n",
" WITH (location = 'hdfs://hive:9000/user/hive/warehouse/company.db')\n",
"\"\"\").fetchall()"
]
},
{
"cell_type": "markdown",
"id": "d4015b50-470c-47c0-b74f-6d8268cd6d1c",
"metadata": {},
"source": [
"Displays the SQL command that was used to create the schema `catalog_hive.company`."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c9424567-a5a9-4b0a-9a43-82a088812a79",
"metadata": {},
"outputs": [],
"source": [
"trino_client.execute(\"\"\"\n",
"SHOW CREATE SCHEMA catalog_hive.company\n",
"\"\"\").fetchall()"
]
},
{
"cell_type": "markdown",
"id": "96339930-dbab-46d8-955e-25917c2ea5e4",
"metadata": {},
"source": [
"Create `employees` table"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "46eda853-bf85-4c06-96b7-5b09d08141db",
"metadata": {},
"outputs": [],
"source": [
"# Create Table\n",
"trino_client.execute(\n",
"\"\"\"\n",
"CREATE TABLE catalog_hive.company.employees\n",
"(\n",
" name varchar,\n",
" salary decimal(10,2)\n",
")\n",
"WITH (\n",
" format = 'TEXTFILE'\n",
")\n",
"\"\"\"\n",
").fetchall()"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "52921a17-d50f-42b1-870c-523efba622f7",
"metadata": {},
"outputs": [],
"source": [
"# Insert data\n",
"print(trino_client.execute(\"INSERT INTO catalog_hive.company.employees (name, salary) VALUES ('Sam Evans', 55000)\").fetchall())"
]
},
{
"cell_type": "markdown",
"id": "c8b5c0f8-2650-4e0f-85d3-1dfd396cee21",
"metadata": {},
"source": [
"## Simple queries\n",
"\n",
"Some simple query testing."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "65b02105-c530-431a-8302-3ce8ff1a6463",
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"# Show employees table contents\n",
"df = pd.DataFrame(trino_client.execute(\"SELECT * FROM catalog_hive.company.employees\").fetchall(), columns=['Name', 'Salary'])\n",
"\n",
"# Display the DataFrame\n",
"df"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ebb7f9fc-8ae8-4298-b49d-7f7b06b6b8c3",
"metadata": {},
"outputs": [],
"source": [
"# Execute the queries and convert the results directly to DataFrames\n",
"df_g = pd.DataFrame(trino_client.execute(\"SHOW SCHEMAS from catalog_hive\").fetchall(), columns=['Schema'])\n",
"df_g"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "de3556c5-35b8-4ad2-8c67-4f1e985513c3",
"metadata": {},
"outputs": [],
"source": [
"h = trino_client.execute(\"DESCRIBE catalog_hive.company.employees\").fetchall()\n",
"h"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "cae3b62f-5ad8-4959-8023-afd758e4103e",
"metadata": {},
"outputs": [],
"source": [
"df_i = pd.DataFrame(trino_client.execute(\"SHOW TABLES from catalog_hive.company\").fetchall(), columns=['Tables'])\n",
"df_i"
]
},
{
"cell_type": "markdown",
"id": "3e00c989-607e-4652-8288-54e2629ac62d",
"metadata": {},
"source": [
"## Cross-catalog queries\n",
"\n",
"In a company, there may be different departments using different data stacks. In this example, the HR department uses Apache Hive to store its data and the sales department uses PostgreSQL. You can run some interesting queries by joining the two departments' data together with Gravitino.\n",
"\n",
"To know which employee has the largest sales amount:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "83af02a6-526a-44f8-b99d-54721b3cd05d",
"metadata": {},
"outputs": [],
"source": [
"# Cross-catalog queries\n",
"cross_catalog = trino_client.execute(\"\"\"\n",
"SELECT given_name, family_name, job_title, sum(total_amount) AS total_sales\n",
"FROM catalog_hive.sales.sales as s,\n",
" catalog_postgres.hr.employees AS e\n",
"where s.employee_id = e.employee_id\n",
"GROUP BY given_name, family_name, job_title\n",
"ORDER BY total_sales DESC\n",
"LIMIT 1\n",
"\"\"\").fetchall()\n",
"\n",
"# Convert the result to a DataFrame\n",
"df_j = pd.DataFrame(cross_catalog, columns=['Given Name', 'Family Name', 'Job Title', 'Total Sales'])\n",
"\n",
"df_j"
]
},
{
"cell_type": "markdown",
"id": "b92fcfb4-c6c9-4fff-9193-6562467996b7",
"metadata": {},
"source": [
"To know the top customers who bought the most by state:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ff956ab9-60ef-485b-8e75-f907ed1707c2",
"metadata": {},
"outputs": [],
"source": [
"# Execute the query\n",
"k = trino_client.execute(\"\"\"\n",
"SELECT customer_name, location, SUM(total_amount) AS total_spent\n",
"FROM catalog_hive.sales.sales AS s,\n",
" catalog_hive.sales.stores AS l,\n",
" catalog_hive.sales.customers AS c\n",
"WHERE s.store_id = l.store_id AND s.customer_id = c.customer_id\n",
"GROUP BY location, customer_name\n",
"ORDER BY location, SUM(total_amount) DESC\n",
"\"\"\").fetchall()\n",
"\n",
"# Convert the result to a DataFrame\n",
"df_k = pd.DataFrame(k, columns=['Customer Name', 'Location', 'Total Spent'])\n",
"\n",
"# Display the DataFrame\n",
"df_k"
]
},
{
"cell_type": "markdown",
"id": "c2369ac9-d92c-4d35-8753-11c5c2cb0f47",
"metadata": {},
"source": [
"To know the employee's average performance rating and total sales:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d9d6e4a4-d913-4d4d-9153-e25fd4df3f68",
"metadata": {},
"outputs": [],
"source": [
"# Execute the query\n",
"l = trino_client.execute(\"\"\"\n",
"SELECT e.employee_id, given_name, family_name, AVG(rating) AS average_rating, SUM(total_amount) AS total_sales\n",
"FROM catalog_postgres.hr.employees AS e,\n",
" catalog_postgres.hr.employee_performance AS p,\n",
" catalog_hive.sales.sales AS s\n",
"WHERE e.employee_id = p.employee_id AND p.employee_id = s.employee_id\n",
"GROUP BY e.employee_id, given_name, family_name\n",
"\"\"\").fetchall()\n",
"\n",
"# Convert the result to a DataFrame\n",
"df_l = pd.DataFrame(l, columns=['Employee ID', 'Given Name', 'Family Name', 'Average Rating', 'Total Sales'])\n",
"\n",
"# Display the DataFrame\n",
"df_l"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.7"
}
},
"nbformat": 4,
"nbformat_minor": 5
}