Added link to binder repository (forked from @raybellwaves) (#72)
* Added link to binder repository (forked from @raybellwaves) and two notebooks

* Move to the new dask-sql API
nils-braun authored Nov 5, 2020
1 parent 8e472ce commit 05b08ae
Showing 3 changed files with 335 additions and 0 deletions.
1 change: 1 addition & 0 deletions README.md
@@ -6,6 +6,7 @@
[![Read the Docs](https://img.shields.io/readthedocs/dask-sql)](https://dask-sql.readthedocs.io/en/latest/)
[![Codecov](https://img.shields.io/codecov/c/github/nils-braun/dask-sql?logo=codecov)](https://codecov.io/gh/nils-braun/dask-sql)
[![GitHub](https://img.shields.io/github/license/nils-braun/dask-sql)](https://github.com/nils-braun/dask-sql/blob/main/LICENSE.txt)
[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/nils-braun/dask-sql-binder/main?urlpath=lab)

`dask-sql` adds a SQL query layer on top of `dask`.
This allows you to query and transform your dask dataframes using
159 changes: 159 additions & 0 deletions notebooks/Custom Functions.ipynb
@@ -0,0 +1,159 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Custom Functions"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Apart from the SQL functions that are already implemented in `dask-sql`, it is possible to add custom functions and aggregations.\n",
"Have a look into [the documentation](https://dask-sql.readthedocs.io/en/latest/pages/custom.html) for more information."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import dask.dataframe as dd\n",
"import dask.datasets\n",
"from dask_sql.context import Context"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We use some generated test data for the notebook:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"c = Context()\n",
"\n",
"df = dask.datasets.timeseries().reset_index().persist()\n",
"c.create_table(\"timeseries\", df)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a first step, we will create a scalar function to calculate the absolute value of a column.\n",
"(Please note that this can also be done via the `ABS` function in SQL):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# The input to the function will be a dask series\n",
"def my_abs(x):\n",
" return x.abs()\n",
"\n",
"# As SQL is a typed language, we need to specify all types \n",
"c.register_function(my_abs, \"MY_ABS\", parameters=[(\"x\", np.float64)], return_type=np.float64)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We are now able to use our new function in all queries"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"c.sql(\"\"\"\n",
" SELECT\n",
" x, y, MY_ABS(x) AS \"abs_x\", MY_ABS(y) AS \"abs_y\"\n",
" FROM\n",
" \"timeseries\"\n",
" WHERE\n",
" MY_ABS(x * y) > 0.5\n",
"\"\"\").compute()"
]
},
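{
"cell_type": "markdown",
"metadata": {},
"source": [
"Functions with more than one parameter are registered the same way. Here is a minimal sketch (the function `MY_HYPOT` below is made up for illustration):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# A hypothetical function with two parameters - the inputs are dask series\n",
"def my_hypot(x, y):\n",
"    return (x**2 + y**2) ** 0.5\n",
"\n",
"# Every parameter needs its own (name, type) entry\n",
"c.register_function(my_hypot, \"MY_HYPOT\", parameters=[(\"x\", np.float64), (\"y\", np.float64)], return_type=np.float64)"
]
},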
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Next, we will register an aggregation, which gets a column as input and returns a single value.\n",
"An aggregation needs to be an instance of `dask.Aggregation` (see the [dask docu](https://docs.dask.org/en/latest/dataframe-groupby.html#aggregate))."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"my_sum = dd.Aggregation(\"MY_SUM\", lambda x: x.sum(), lambda x: x.sum())\n",
"\n",
"c.register_aggregation(my_sum, \"MY_SUM\", [(\"x\", np.float64)], np.float64)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"c.sql(\"\"\"\n",
" SELECT\n",
" name, MY_SUM(x) AS \"my_sum\"\n",
" FROM\n",
" \"timeseries\"\n",
" GROUP BY\n",
" name\n",
" LIMIT 10\n",
"\"\"\").compute()"
]
},
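{
"cell_type": "markdown",
"metadata": {},
"source": [
"`dask.dataframe.Aggregation` also accepts an optional third `finalize` step. As a sketch (the `MY_MEAN` aggregation below is made up for illustration), a mean can be built from per-partition counts and sums, following the pattern from the dask documentation:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"my_mean = dd.Aggregation(\n",
"    \"MY_MEAN\",\n",
"    # chunk: computed on each partition separately\n",
"    lambda s: (s.count(), s.sum()),\n",
"    # agg: combine the per-partition results\n",
"    lambda count, total: (count.sum(), total.sum()),\n",
"    # finalize: turn the combined results into the final value\n",
"    lambda count, total: total / count,\n",
")\n",
"\n",
"c.register_aggregation(my_mean, \"MY_MEAN\", [(\"x\", np.float64)], np.float64)"
]
},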
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.5"
}
},
"nbformat": 4,
"nbformat_minor": 4
}
175 changes: 175 additions & 0 deletions notebooks/Introduction.ipynb
@@ -0,0 +1,175 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# dask-sql Introduction"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"`dask-sql` lets you query your (dask) data using usual SQL language.\n",
"You can find more information on the usage in the [documentation](https://dask-sql.readthedocs.io/)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from dask_sql import Context\n",
"from dask.datasets import timeseries\n",
"from dask.distributed import Client"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a first step, we will create a dask client to connect to a local dask cluster (which is started implicitly).\n",
"You can open the dashboard by clicking on the shown link (in binder, this is already open on the left)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"client = Client()\n",
"client"
]
},
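{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you already have a dask cluster running, you can connect to it instead by passing the scheduler address to `Client` (the address below is hypothetical):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Hypothetical scheduler address - replace with your own:\n",
"# client = Client(\"tcp://my-scheduler:8786\")"
]
},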
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Next, we create a context to hold the registered tables.\n",
"You typically only do this once in your application."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"c = Context()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Load the data and register it in the context. This will give the table a name.\n",
"In this example, we generate random data.\n",
"It is also possible to load data from file, S3, hdfs etc.\n",
"Have a look into [Data Loading](https://dask-sql.readthedocs.io/en/latest/pages/data_input.html) for more information."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df = timeseries()\n",
"c.create_table(\"timeseries\", df)"
]
},
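{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a sketch of loading from a file instead (the path below is hypothetical), you can read the data with dask first and register the resulting dataframe:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# import dask.dataframe as dd\n",
"#\n",
"# df_from_file = dd.read_parquet(\"/path/to/data.parquet\")  # hypothetical path\n",
"# c.create_table(\"my_data\", df_from_file)"
]
},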
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now execute an SQL query. \n",
"The result is a dask dataframe.\n",
"\n",
"The query looks for the id with the highest x for each name (this is just random test data, but you could think of looking for outliers in the sensor data)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"result = c.sql(\"\"\"\n",
" SELECT\n",
" lhs.name,\n",
" lhs.id,\n",
" lhs.x\n",
" FROM\n",
" timeseries AS lhs\n",
" JOIN\n",
" (\n",
" SELECT\n",
" name AS max_name,\n",
" MAX(x) AS max_x\n",
" FROM timeseries\n",
" GROUP BY name\n",
" ) AS rhs\n",
" ON\n",
" lhs.name = rhs.max_name AND\n",
" lhs.x = rhs.max_x\n",
"\"\"\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now we can show the result:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"result.compute()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"... or use it for any other dask calculation"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"result.x.mean().compute()"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.5"
}
},
"nbformat": 4,
"nbformat_minor": 4
}
