Test on Velox backend with TPC-H workload

Test datasets

Both Parquet and DWRF (a fork of the ORC file format) format files are supported. Here are the steps to generate the test datasets:

Generate the Parquet dataset

Please refer to the scripts in the ./gen_data/parquet_dataset/ directory to generate the Parquet dataset. Note that the script relies on the spark-sql-perf and tpch-dbgen packages from Databricks. The tpch-dbgen kit needs a slight modification to allow Spark to convert the CSV-based content to Parquet, so please make sure to use this commit: 0469309147b42abac8857fa61b4cf69a6d3128a8

In tpch_datagen_parquet.sh, several parameters should be configured according to your system.

spark_sql_perf_jar=/PATH/TO/spark-sql-perf_2.12-0.5.1-SNAPSHOT.jar
...
  --num-executors 14 
  --executor-cores 8 
  --conf spark.sql.shuffle.partitions=224 
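The shuffle partition count in the sample configuration above is not arbitrary. A common sizing heuristic (our assumption, not stated in the script) is executors × cores-per-executor × 2, which matches the sample values exactly:

```shell
# Heuristic (assumption): shuffle partitions = executors * cores per executor * 2.
NUM_EXECUTORS=14
EXECUTOR_CORES=8
SHUFFLE_PARTITIONS=$((NUM_EXECUTORS * EXECUTOR_CORES * 2))
echo "$SHUFFLE_PARTITIONS"   # 224, matching spark.sql.shuffle.partitions above
```

If you change the executor count or cores for your cluster, scale spark.sql.shuffle.partitions accordingly.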
...

In tpch_datagen_parquet.scala, the parameters and directories should be configured as well.

val scaleFactor = "100" // scaleFactor defines the size of the dataset to generate (in GB).
val numPartitions = 200  // how many dbgen partitions to run - number of input tasks.
...
val rootDir = "/PATH/TO/TPCH_PARQUET_PATH" // root directory of location to create data in.
val dbgenDir = "/PATH/TO/TPCH_DBGEN" // location of dbgen
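To relate the two parameters: scaleFactor is the dataset size in GB, so each dbgen task produces roughly scaleFactor / numPartitions GB of raw data. A quick check of the sample values (the "aim for a few hundred MB per task" rule of thumb is our assumption):

```shell
# With scaleFactor=100 (GB) and numPartitions=200, each dbgen task
# generates roughly 100 GB / 200 = 0.5 GB of raw data.
SCALE_FACTOR_GB=100
NUM_PARTITIONS=200
MB_PER_PARTITION=$((SCALE_FACTOR_GB * 1024 / NUM_PARTITIONS))
echo "${MB_PER_PARTITION} MB per partition"   # 512 MB per partition
```

Larger scale factors generally warrant proportionally more partitions to keep per-task data volume manageable.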

Currently, Gluten with Velox supports both the Parquet and DWRF file formats and three compression codecs: snappy, gzip, and zstd. The step below, which converts Parquet to DWRF, is optional if you are using the Parquet format for testing.

Convert the Parquet dataset to a DWRF dataset (optional)

Then refer to the scripts in the ./gen_data/dwrf_dataset/ directory to convert the Parquet dataset to a DWRF dataset.

In tpch_convert_parquet_dwrf.sh, the Spark configuration should be set according to your system.

export GLUTEN_HOME=/PATH/TO/gluten
...
--executor-cores 8                                      \
--num-executors 14                                       \

In tpch_convert_parquet_dwrf.scala, the table path should be configured.

val parquet_file_path = "/PATH/TO/TPCH_PARQUET_PATH"
val dwrf_file_path = "/PATH/TO/TPCH_DWRF_PATH"
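Putting the pieces together, the conversion itself can be sketched as a spark-shell session that loads the script. This is an assumed invocation for illustration; the jar path placeholder is hypothetical, and the actual launch options live in tpch_convert_parquet_dwrf.sh:

```shell
# Sketch (assumed invocation): run the conversion script with the Gluten jar
# on the classpath. <GLUTEN_JAR> is a placeholder for your built Gluten jar.
export GLUTEN_HOME=/PATH/TO/gluten
spark-shell \
  --jars "$GLUTEN_HOME"/<GLUTEN_JAR> \
  -i tpch_convert_parquet_dwrf.scala
```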

Test Queries

We provide the test queries in the ./tpch.queries.updated directory; all DATE fields were changed to STRING, since the DATE type is not well supported in Gluten with the Velox backend. We also provide Scala code in the ./run_tpch/ directory that shows how to run the TPC-H queries. Note that if you are testing with DWRF, remember to set the file format to dwrf in the code.
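Since forgetting to switch the file format is an easy mistake, a small wrapper check before launching the run can help. This is a hypothetical helper, not part of the repository; the FILE_FORMAT variable name is our invention:

```shell
# Hypothetical guard: validate the chosen file format before launching the job.
FILE_FORMAT=${FILE_FORMAT:-parquet}   # set to "dwrf" when testing the DWRF dataset
case "$FILE_FORMAT" in
  parquet|dwrf) echo "using $FILE_FORMAT" ;;
  *) echo "unsupported format: $FILE_FORMAT" >&2; exit 1 ;;
esac
```

With the default, this prints "using parquet"; export FILE_FORMAT=dwrf before running to select the DWRF dataset.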