The environment used for development and CI consists of:

- a system installation of `rustup` with:
  - the latest stable toolchain
  - the latest nightly `rustfmt`
- a conda environment containing all required Python packages
Once `rustup` is installed, ensure that the latest stable toolchain and the nightly `rustfmt` are available by running:

```
rustup toolchain install nightly -c rustfmt --profile minimal
rustup update
```
To initialize and activate the conda environment for a given Python version:

```
conda env create -f dask-sql/continuous_integration/environment-{$PYTHON_VER}.yaml
conda activate dask-sql
```
Dask-SQL utilizes Apache Arrow DataFusion for parsing, planning, and optimizing SQL queries. DataFusion is written in Rust and therefore requires some Rust experience to be productive. Luckily, there are tons of great Rust learning resources on the internet; we have listed some of our favorites here.
The Dask-SQL Rust codebase makes heavy use of Apache Arrow DataFusion. Contributors should familiarize themselves with its codebase and documentation.
DataFusion provides Dask-SQL with key functionality:

- Parsing SQL query strings into a `LogicalPlan` datastructure (sketched below)
- Future integration points with substrait.io
- An optimization framework used as the baseline for creating custom, highly efficient `LogicalPlan`s specific to Dask
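To get a feel for that parsing and planning functionality, here is a minimal sketch using the umbrella `datafusion` crate (plus `tokio` for the async runtime). The `people` table and `people.csv` path are placeholders, and the exact plan accessors shift slightly between DataFusion releases:

```rust
use datafusion::error::Result;
use datafusion::prelude::{CsvReadOptions, SessionContext};

#[tokio::main]
async fn main() -> Result<()> {
    // Register a table so the planner can resolve `people`
    // ("people.csv" is a placeholder path).
    let ctx = SessionContext::new();
    ctx.register_csv("people", "people.csv", CsvReadOptions::new())
        .await?;

    // DataFusion parses the SQL string and plans it into a LogicalPlan.
    let df = ctx.sql("SELECT avg(age) FROM people").await?;
    println!("{}", df.logical_plan().display_indent());
    Ok(())
}
```

Dask-SQL drives the equivalent machinery through its own `DaskSQLContext` (described below) rather than a `SessionContext`, but the parse-then-plan flow is the same.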
Building the Dask-SQL Rust codebase is a straightforward process. If you create and activate the Dask-SQL conda environment, the Rust compiler and all necessary components will be installed for you during that process, so no further manual setup is required.
`maturin` is used by Dask-SQL for building and bundling the resulting Rust binaries. This helps make building and installing the Rust binaries feel much more like a native Python workflow. More details about the build setup can be found in `pyproject.toml` and `Cargo.toml`.
Note that while `maturin` is used by CI and should be used during your development cycle, if the need arises to do something more specific that is not yet supported by `maturin`, you can opt to use `cargo` directly from the command line.
Building Dask-SQL with Python is straightforward: to build, run `pip install .`. This will build both the Rust and Python codebases and install them into your locally activated conda environment; note that if your Rust dependencies have been updated, this command must be rerun to rebuild the Rust codebase.
DataFusion is broken down into a few modules, which we consume in our Cargo.toml. The modules we currently use are:

- `datafusion-common` - Datastructures and core logic
- `datafusion-expr` - Expression-based logic and operators
- `datafusion-sql` - SQL components such as parsing and planning
- `datafusion-optimizer` - Optimization logic and datastructures for modifying current plans into more efficient ones
During development you might find yourself needing some upstream DataFusion changes that are not present in the project's current version. Luckily, this can easily be achieved by updating `Cargo.toml` and changing the `rev` to the SHA of the version you need. Note that the same SHA should be used for all DataFusion modules; see the sketch below.
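As a sketch, the pinned DataFusion dependencies in `Cargo.toml` might look something like this (the SHA shown is a placeholder, not a real revision):

```toml
[dependencies]
# Every DataFusion crate must be pinned to the same upstream commit.
datafusion-common    = { git = "https://github.com/apache/arrow-datafusion", rev = "abc123d" }
datafusion-expr      = { git = "https://github.com/apache/arrow-datafusion", rev = "abc123d" }
datafusion-sql       = { git = "https://github.com/apache/arrow-datafusion", rev = "abc123d" }
datafusion-optimizer = { git = "https://github.com/apache/arrow-datafusion", rev = "abc123d" }
```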
Sometimes when building against the latest GitHub commits for DataFusion, you may find that the features you are consuming do not yet have public documentation. In this case it can be helpful to build the DataFusion documentation locally so that it can be referenced to assist with development. Here is a rough outline for building that documentation locally:
- clone https://github.com/apache/arrow-datafusion
- change into the `arrow-datafusion` directory
- run `cargo doc`
- navigate to `target/doc/datafusion/all.html` and open it in your desired browser
While working in the Rust codebase there are a few datastructures that you should make yourself familiar with. This section does not aim to exhaustively list every datastructure within the project, but rather just the key ones that you are likely to encounter while working on almost any feature/issue. The aim is to give you a better overview of the codebase without having to manually dig through all the source code.
- `PyLogicalPlan` -> DataFusion `LogicalPlan`
  - Often encountered in Python code with variable name `rel`
  - Python-serializable umbrella representation of the entire `LogicalPlan` that was generated by DataFusion
  - Provides access to `DaskTable` instances and type information for each table
  - Access to individual nodes in the logical plan tree. Ex: `TableScan`
- `DaskSQLContext`
  - Analogous to the Python `Context`
  - Contains metadata about the tables, schemas, functions, operators, and configurations that are present within the current execution context
  - When adding custom functions/UDFs, this is the location where you would register them
  - Entry point for parsing SQL strings into SQL node trees. This is the location where Python begins its interactions with Rust
- `PyExpr` -> DataFusion `Expr`
  - Arguably where most of your time will be spent
  - Represents a single node in the SQL tree. Ex: `avg(age)` from `SELECT avg(age) FROM people`
  - Is associated with a single `RexType`
  - Can contain literal values or represent function calls, `avg()` for example
  - The expression's "index" in the tree can be retrieved by calling `PyExpr.index()` on an instance. This is useful when mapping frontend column names in Dask code to backend DataFrame columns
  - Certain `PyExpr`s contain operands. Ex: `2 + 2` would contain 3 operands: 1) a literal `PyExpr` instance with value 2, 2) another literal `PyExpr` instance with value 2, and 3) a `+` `PyExpr` representing the addition of the two literals (see the sketch after this list)
- `DaskSqlOptimizer`
  - Registration point for all Dask-SQL-specific logical plan optimizations
  - Optimizations, whether written custom or taken from another source such as DataFusion, are registered here in the order in which they should be executed (see the sketch after this list)
  - Represents functions that modify/convert an original `PyLogicalPlan` into another `PyLogicalPlan` that would be more efficient when running in the underlying Dask framework
- `RelDataType`
  - Not a fan of this name; it was chosen to match the existing Calcite logic
  - Represents a "row" in a table
  - Contains a list of "columns" that are present in that row
- `RelDataTypeField`
  - Represents an individual column in a table
  - Contains:
    - `qualifier` - schema the field belongs to
    - `name` - name of the column/field
    - `data_type` - `DaskTypeMap` instance containing information about the SQL type and underlying Arrow DataType
    - `index` - location of the field in the LogicalPlan
- `DaskTypeMap`
  - Maps a conventional SQL type to an underlying Arrow DataType
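To make the `PyExpr` operand example concrete, here is a small sketch of how the underlying DataFusion `Expr` tree for `2 + 2` can be built and inspected directly in Rust; a `PyExpr` is essentially a Python-facing wrapper around such a tree:

```rust
use datafusion_expr::{lit, Expr};

fn main() {
    // `2 + 2` as a DataFusion expression: a binary `+` node whose two
    // operands are literal expressions -- the three operands a PyExpr
    // for this expression would expose.
    let expr: Expr = lit(2) + lit(2);
    println!("{:?}", expr); // debug-prints the nested BinaryExpr/Literal tree
}
```

Similarly, each optimization registered with `DaskSqlOptimizer` is ultimately a DataFusion `OptimizerRule`. The sketch below shows the general shape of such a rule under one historical version of the trait; the exact signature has changed across DataFusion releases (newer revisions use a `try_optimize` method returning `Option<LogicalPlan>`), so check it against the revision pinned in `Cargo.toml`:

```rust
use datafusion_common::Result;
use datafusion_expr::LogicalPlan;
use datafusion_optimizer::optimizer::{OptimizerConfig, OptimizerRule};

/// A hypothetical do-nothing rule; a real rule would pattern-match on
/// `plan` and return a rewritten, more Dask-friendly plan.
struct NoopRule;

impl OptimizerRule for NoopRule {
    fn optimize(&self, plan: &LogicalPlan, _config: &mut OptimizerConfig) -> Result<LogicalPlan> {
        // Returning a clone leaves the plan untouched.
        Ok(plan.clone())
    }

    fn name(&self) -> &str {
        "noop_rule"
    }
}
```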
TODO:

- SQL Parsing overview diagram
- Architecture diagram
- Setup dev environment
- Version of Rust and specs
- Updating version of datafusion
- Building
- Rust learning resources
- Rust Datastructures local to Dask-SQL
- Build DataFusion documentation locally
- Python & Rust with PyO3
- Types mapping, Arrow datatypes
- RexTypes explanation: show a simple query and show it broken down into its parts in a diagram
- Registering tables with DaskSQLContext, also functions
- Creating your own optimizer
- Simple diagram of PyExpr, showing something like 2+2 but broken down into a tree looking diagram