The environment used for development and CI consists of:

- a system installation of `rustup` with:
  - the latest stable toolchain
  - the latest nightly `rustfmt`
- a conda environment containing all required Python packages
Once `rustup` is installed, ensure that the latest stable toolchain and the nightly `rustfmt` are available by running:

```
rustup toolchain install nightly -c rustfmt --profile minimal
rustup update
```
To initialize and activate the conda environment for a given Python version:

```
conda env create -f dask-sql/continuous_integration/environment-{$PYTHON_VER}.yaml
conda activate dask-sql
```
Dask-SQL utilizes Apache Arrow DataFusion for parsing, planning, and optimizing SQL queries. DataFusion is written in Rust and therefore requires some Rust experience to be productive. Luckily, there are tons of great Rust learning resources on the internet; we have listed some of our favorites here.
The Dask-SQL Rust codebase makes heavy use of Apache Arrow DataFusion. Contributors should familiarize themselves with its codebase and documentation.
DataFusion provides Dask-SQL with key functionality:

- Parsing SQL query strings into a `LogicalPlan` datastructure (sketched below)
- Future integration points with substrait.io
- An optimization framework used as the baseline for creating custom, highly efficient `LogicalPlan`s specific to Dask
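To get a feel for that parsing and planning functionality, here is a minimal sketch using the umbrella `datafusion` crate (plus `tokio` for the async runtime). The `people` table and `people.csv` path are placeholders, and the exact plan accessors shift slightly between DataFusion releases:

```rust
use datafusion::error::Result;
use datafusion::prelude::{CsvReadOptions, SessionContext};

#[tokio::main]
async fn main() -> Result<()> {
    // Register a table so the planner can resolve `people`
    // ("people.csv" is a placeholder path).
    let ctx = SessionContext::new();
    ctx.register_csv("people", "people.csv", CsvReadOptions::new())
        .await?;

    // DataFusion parses the SQL string and plans it into a LogicalPlan.
    let df = ctx.sql("SELECT avg(age) FROM people").await?;
    println!("{}", df.logical_plan().display_indent());
    Ok(())
}
```

Dask-SQL drives the equivalent machinery through its own `DaskSQLContext` (described below) rather than a `SessionContext`, but the parse-then-plan flow is the same.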
Building the Dask-SQL Rust codebase is a straightforward process. If you create and activate the Dask-SQL conda environment, the Rust compiler and all necessary components will be installed for you during that process, so no further manual setup is required.
`maturin` is used by Dask-SQL for building and bundling the resulting Rust binaries. This helps make building and installing the Rust binaries feel much more like a native Python workflow. More details about the build setup can be found in `pyproject.toml` and `Cargo.toml`.
Note that while `maturin` is used by CI and should be used during your development cycle, if the need arises to do something more specific that is not yet supported by `maturin`, you can opt to use `cargo` directly from the command line.
Building Dask-SQL with Python is straightforward: to build, run `pip install .`. This will build both the Rust and Python codebases and install them into your locally activated conda environment; note that if your Rust dependencies have been updated, this command must be rerun to rebuild the Rust codebase.
DataFusion is broken down into a few modules, which we consume in our Cargo.toml. The modules we currently use are:

- `datafusion-common` - Datastructures and core logic
- `datafusion-expr` - Expression-based logic and operators
- `datafusion-sql` - SQL components such as parsing and planning
- `datafusion-optimizer` - Optimization logic and datastructures for modifying current plans into more efficient ones
During development you might find yourself needing some upstream DataFusion changes that are not present in the project's current version. Luckily, this can easily be achieved by updating `Cargo.toml` and changing the `rev` to the SHA of the version you need. Note that the same SHA should be used for all DataFusion modules; see the sketch below.
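As a sketch, the pinned DataFusion dependencies in `Cargo.toml` might look something like this (the SHA shown is a placeholder, not a real revision):

```toml
[dependencies]
# Every DataFusion crate must be pinned to the same upstream commit.
datafusion-common    = { git = "https://github.com/apache/arrow-datafusion", rev = "abc123d" }
datafusion-expr      = { git = "https://github.com/apache/arrow-datafusion", rev = "abc123d" }
datafusion-sql       = { git = "https://github.com/apache/arrow-datafusion", rev = "abc123d" }
datafusion-optimizer = { git = "https://github.com/apache/arrow-datafusion", rev = "abc123d" }
```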
Sometimes when building against the latest GitHub commits for DataFusion, you may find that the features you are consuming do not yet have public documentation. In this case it can be helpful to build the DataFusion documentation locally so that it can be referenced to assist with development. Here is a rough outline for building that documentation locally:
- clone https://github.com/apache/arrow-datafusion
- change into the `arrow-datafusion` directory
- run `cargo doc`
- navigate to `target/doc/datafusion/all.html` and open it in your desired browser
While working in the Rust codebase there are a few datastructures that you should make yourself familiar with. This section does not aim to exhaustively list every datastructure within the project, but rather just the key ones that you are likely to encounter while working on almost any feature/issue. The aim is to give you a better overview of the codebase without having to manually dig through all the source code.
- `PyLogicalPlan` -> DataFusion `LogicalPlan`
  - Often encountered in Python code with variable name `rel`
  - Python-serializable umbrella representation of the entire `LogicalPlan` that was generated by DataFusion
  - Provides access to `DaskTable` instances and type information for each table
  - Access to individual nodes in the logical plan tree. Ex: `TableScan`
- `DaskSQLContext`
  - Analogous to the Python `Context`
  - Contains metadata about the tables, schemas, functions, operators, and configurations that are present within the current execution context
  - When adding custom functions/UDFs, this is the location where you would register them
  - Entry point for parsing SQL strings into SQL node trees. This is the location where Python begins its interactions with Rust
- `PyExpr` -> DataFusion `Expr`
  - Arguably where most of your time will be spent
  - Represents a single node in the SQL tree. Ex: `avg(age)` from `SELECT avg(age) FROM people`
  - Is associated with a single `RexType`
  - Can contain literal values or represent function calls, `avg()` for example
  - The expression's "index" in the tree can be retrieved by calling `PyExpr.index()` on an instance. This is useful when mapping frontend column names in Dask code to backend DataFrame columns
  - Certain `PyExpr`s contain operands. Ex: `2 + 2` would contain 3 operands: 1) a literal `PyExpr` instance with value 2, 2) another literal `PyExpr` instance with value 2, and 3) a `+` `PyExpr` representing the addition of the two literals (see the sketch after this list)
- `DaskSqlOptimizer`
  - Registration point for all Dask-SQL-specific logical plan optimizations
  - Optimizations, whether written custom or taken from another source such as DataFusion, are registered here in the order in which they should be executed (see the sketch after this list)
  - Represents functions that modify/convert an original `PyLogicalPlan` into another `PyLogicalPlan` that would be more efficient when running in the underlying Dask framework
- `RelDataType`
  - Not a fan of this name; it was chosen to match the existing Calcite logic
  - Represents a "row" in a table
  - Contains a list of "columns" that are present in that row
- `RelDataTypeField`
  - Represents an individual column in a table
  - Contains:
    - `qualifier` - schema the field belongs to
    - `name` - name of the column/field
    - `data_type` - `DaskTypeMap` instance containing information about the SQL type and underlying Arrow DataType
    - `index` - location of the field in the LogicalPlan
- `DaskTypeMap`
  - Maps a conventional SQL type to an underlying Arrow DataType
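To make the `PyExpr` operand example concrete, here is a small sketch of how the underlying DataFusion `Expr` tree for `2 + 2` can be built and inspected directly in Rust; a `PyExpr` is essentially a Python-facing wrapper around such a tree:

```rust
use datafusion_expr::{lit, Expr};

fn main() {
    // `2 + 2` as a DataFusion expression: a binary `+` node whose two
    // operands are literal expressions -- the three operands a PyExpr
    // for this expression would expose.
    let expr: Expr = lit(2) + lit(2);
    println!("{:?}", expr); // debug-prints the nested BinaryExpr/Literal tree
}
```

Similarly, each optimization registered with `DaskSqlOptimizer` is ultimately a DataFusion `OptimizerRule`. The sketch below shows the general shape of such a rule under one historical version of the trait; the exact signature has changed across DataFusion releases (newer revisions use a `try_optimize` method returning `Option<LogicalPlan>`), so check it against the revision pinned in `Cargo.toml`:

```rust
use datafusion_common::Result;
use datafusion_expr::LogicalPlan;
use datafusion_optimizer::optimizer::{OptimizerConfig, OptimizerRule};

/// A hypothetical do-nothing rule; a real rule would pattern-match on
/// `plan` and return a rewritten, more Dask-friendly plan.
struct NoopRule;

impl OptimizerRule for NoopRule {
    fn optimize(&self, plan: &LogicalPlan, _config: &mut OptimizerConfig) -> Result<LogicalPlan> {
        // Returning a clone leaves the plan untouched.
        Ok(plan.clone())
    }

    fn name(&self) -> &str {
        "noop_rule"
    }
}
```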
TODO:

- SQL Parsing overview diagram
- Architecture diagram
- Setup dev environment
- Version of Rust and specs
- Updating version of datafusion
- Building
- Rust learning resources
- Rust Datastructures local to Dask-SQL
- Build DataFusion documentation locally
- Python & Rust with PyO3
- Types mapping, Arrow datatypes
- RexTypes explanation: show a simple query and show it broken down into its parts in a diagram
- Registering tables with DaskSQLContext, also functions
- Creating your own optimizer
- Simple diagram of PyExpr, showing something like 2+2 but broken down into a tree looking diagram