Note: This fork is adapted for compilation using recent versions of Clang and GCC, which we ensure with a GitHub action.
# @(#)README 2.4.0-JCCH
========================== JCC-H specific info =========================
JCC-H is a version of TPC-H rife with Join Crossing Correlations (JCC).
You get into situations where 5 keys join with 20% of the fact tables,
which is skew that MPP systems tend not to like. This skew only appears
when certain selections are made on a dimension table (often indirectly:
the selection may be on another dimension table a few joins away). This
means current optimizers will typically not be able to tell that this
will happen.
This was published in TPCTC 2017, the benchmarking workshop at VLDB.
Paper here: https://ir.cwi.nl/pub/27429/27429.pdf
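One rough way to see this kind of skew in generated data is to count how often each join key occurs and check what share of the rows the most frequent keys account for. The sketch below is illustrative only: the function name top_key_share, the sample rows, and the choice of key column are assumptions made for the example, not part of dbgen.

```python
from collections import Counter


def top_key_share(lines, key_column, top_n=5):
    """Return the fraction of rows whose key is among the top_n most frequent keys.

    `lines` is any iterable of pipe-delimited rows, e.g. an open .tbl file.
    `key_column` is a zero-based column index chosen by the caller
    (which column holds the join key is an assumption of this sketch).
    """
    counts = Counter()
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        counts[line.split("|")[key_column]] += 1
    total = sum(counts.values())
    if total == 0:
        return 0.0
    top = sum(n for _, n in counts.most_common(top_n))
    return top / total


# Tiny made-up sample: key "7" appears in 3 of 5 rows.
sample = ["7|a|", "7|b|", "7|c|", "8|d|", "9|e|"]
print(top_key_share(sample, key_column=0, top_n=1))
```

Running the same function over a table generated with and without -k would give a first impression of how much more concentrated the keys become.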
This software is a 100% compatible TPC-H benchmark. It generates very
intricate correlations and skew in the datagen when you add the -k flag.
usage: dbgen -k [any other options, see below]
generates the correlated+skewed TPC-H dataset
For the query generation (qgen), it supports two modes of operation:
get "normal" parameters where JCC-H behaves without skew, very similar
to normal TPC-H. And a "skewed" parameter set: identical 22 TPC-H queries,
just different constants, but now the join and scan skew is everywhere.
usage: qgen -k [any other options, see below]
generates the 'skewed' TPC-H parameter bindings
omitting -k generates the 'normal' TPC-H parameter bindings
(normal != default: "default" is what the stock TPC-H qgen generates,
and the normal bindings are only meant to be similar to it. Use the
stock TPC-H qgen to get the default parameters.)
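To compare the two modes side by side, it can help to prepare one qgen invocation per query, once with -k and once without. The following is a minimal sketch that only builds the argument lists and does not run anything; the helper name qgen_commands and the one-query-per-invocation layout are assumptions for illustration.

```python
def qgen_commands(skewed, queries=range(1, 23), extra_args=()):
    """Build one qgen argument list per query number.

    skewed=True adds -k (the 'skewed' bindings); skewed=False omits it
    (the 'normal' bindings). Nothing is executed here.
    """
    commands = []
    for query in queries:
        argv = ["qgen"]
        if skewed:
            argv.append("-k")
        argv.extend(extra_args)
        argv.append(str(query))
        commands.append(argv)
    return commands


normal_cmds = qgen_commands(skewed=False)
skewed_cmds = qgen_commands(skewed=True)
print(len(normal_cmds), normal_cmds[0], skewed_cmds[0])
```

Each list could then be handed to a process runner of your choice, with DSS_QUERY pointing at the query templates.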
========================= TPC-H normal readme =========================
Table of Contents
===================
0. What is this document?
1. What is DBGEN?
2. What will DBGEN create?
3. How is DBGEN built?
4. Command Line Options for DBGEN
5. DBGEN limitations and compliant usage
6. Sample DBGEN executions
7. What is QGEN?
8. What will QGEN create?
9. How is QGEN built?
10. Command Line Options for QGEN
11. Query Template Syntax
12. Sample QGEN executions and Query Templates
13. Environment Variables
14. Version Numbering in DBGEN and QGEN
15. The current revisions
16. Validated Platforms
0. What is this document?
This is the general README file for DBGEN and QGEN, the database
population and executable query text generation programs used in the
TPC-H benchmark. It covers the proper use of DBGEN and QGEN. For
information on porting the utility to your particular platform, see
Porting.Notes.
1. What is DBGEN?
DBGEN is a database population program for use with the TPC-H benchmark.
It is written in ANSI 'C' for portability, and has
been successfully ported to over a dozen different systems. While the
TPC-H specification allows an implementor to use any utility
to populate the benchmark database, the resultant population must exactly
match the output of DBGEN. The source code has been provided to make the
process of building a compliant database population as simple as possible.
2. What will DBGEN create?
Without any command line options, DBGEN will generate 8 separate ASCII
files. Each file will contain pipe-delimited load data for one of the
tables defined in the TPC-H database schema. The default tables
will contain the load data required for a scale factor 1 database. By
default each file will be created in the current directory and be
named <table>.tbl. As an example, customer.tbl will contain the
load data for the customer table.
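Reading a pipe-delimited row back is a simple split on the '|' character. The sketch below assumes that a row may end with a trailing '|' before the newline and drops it; both that assumption and the sample row are illustrative, not taken from real dbgen output.

```python
def parse_tbl_line(line):
    """Split one pipe-delimited row into a list of field strings.

    Assumption: a row may end with a trailing '|' before the newline;
    if present, it is dropped so it does not produce an empty last field.
    """
    line = line.rstrip("\r\n")
    if line.endswith("|"):
        line = line[:-1]
    return line.split("|")


# Made-up row for illustration, not real dbgen output.
print(parse_tbl_line("1|Customer#000000001|some address|15|\n"))
```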
When invoked with the '-U' flag, DBGEN will create the data sets to be
used in the update functions and the SQL syntax required to delete the
data sets. The update files will be created in the same directory as
the load data files and will be named "u_<table>.tbl.<set>", where <set>
is the number of the update set. The delete syntax will be written to
"delete.<set>". For instance, the data set to be used in the third
query set to update the lineitem table will be named "u_lineitem.tbl.3",
and the SQL to remove those rows will be found in "delete.3". The size
of the update files can be controlled with the '-r' flag.
3. How is DBGEN built?
Create an appropriate makefile, using makefile.suite as a basis,
and type make. Refer to Porting.Notes for more details and for
suggested compile time options.
4. Command Line Options for DBGEN
DBGEN's output is controlled by a combination of command line options
and environment variables. Command line options are assumed to be single
letter flags preceded by a minus sign. They may be followed by an
optional argument.
option  argument      default  action
------  --------      -------  ------
-k                             Generate correlated and skewed data
-h                             Display a usage summary
-f      none                   Force. Existing data files will be
                               overwritten.
-F      none          yes      Flat file output.
-D      none                   Direct database load. ld_XXXX() routines
                               must be defined in load_stub.c
-s      <scale>       1        Scale of the database population. Scale
                               1.0 represents ~1 GB of data
-T      <table>                Generate the data for a particular table
                               ONLY. Arguments: p -- part/partsupp,
                               c -- customer, s -- supplier,
                               o -- orders/lineitem, n -- nation, r -- region,
                               l -- code (same as n and r),
                               O -- orders, L -- lineitem, P -- part,
                               S -- partsupp
-O      d                      Generate SQL for delete function
                               instead of key ranges
-O      f                      Allow over-ride of default output file
                               names
-O      h                      Generate headers in flat ascii files.
                               hd_XXX routines must be defined in
                               load_stub.c
-O      m                      Flat files generate fixed length records
-O      r                      Generate key ranges for the UF2 update
                               function
-O      v                      Verify data set without generating it.
-r      <percentage>  10       Scale each update file to the given
                               percentage (expressed in basis points)
                               of the data set
-v      none                   Verbose. Progress messages are
                               displayed as data is generated.
-n      <name>                 Use database <name> for in-line load
-C      <children>             Use <children> separate processes to
                               generate data
-S      <n>                    Generate the <n>th part of a multi-part load
                               or update set
-U      <updates>              Create a specified number of data sets
                               in flat files for the update/delete
                               functions
-i      <n>                    Split the inserted rows in a refresh pair
                               between <n> files
-d      <n>                    Split the deleted rows in a refresh pair
                               between <n> files
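When dbgen is driven from a script, assembling the flags in one place keeps invocations consistent. The sketch below covers only a handful of the options listed above (-k, -s, -f, -T, -v) and merely builds the argument list; the helper name dbgen_args and its keyword parameters are assumptions made for this example.

```python
# Table codes accepted by -T, as listed in the option table above.
VALID_TABLE_CODES = set("pcsonrlOLPS")


def dbgen_args(scale=1, table=None, force=False, verbose=False, skewed=False):
    """Build a dbgen argument list from a few of the documented flags."""
    argv = ["dbgen"]
    if skewed:
        argv.append("-k")  # correlated and skewed (JCC-H) data
    argv += ["-s", str(scale)]
    if force:
        argv.append("-f")  # overwrite existing data files
    if table is not None:
        if table not in VALID_TABLE_CODES:
            raise ValueError(f"unknown table code: {table!r}")
        argv += ["-T", table]
    if verbose:
        argv.append("-v")
    return argv


print(dbgen_args(scale=10, table="L", force=True))
```

The resulting list can be passed to subprocess.run if the dbgen binary is on the PATH.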
5. DBGEN limitations and compliant usage
DBGEN is meant to be a robust population generator for use with the
TPC-H benchmark. It is hoped that DBGEN will make it easier
to experiment with and become proficient in the execution of TPC decision
support benchmarks. As a result, it includes a number of command line
options which are not, strictly speaking, necessary to generate a compliant
data set for a TPC-H run. In addition, some command line options will accept
arguments which result in the generation of NON-COMPLIANT data sets. Options
which should be used with care include:
-s -- scale factor. TPC-H runs are only compliant when run against SF's
of 1, 10, 100, 300, 1000, 3000, 10000, 30000, 100000
-r -- refresh percentage. TPC-H runs are only compliant when run with
-r 10, the default.
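The two rules above are easy to check before starting a long generation run. The following is a small sketch; the names COMPLIANT_SCALE_FACTORS and is_compliant are hypothetical, while the values come from the list above.

```python
# Scale factors and refresh percentage taken from the rules above.
COMPLIANT_SCALE_FACTORS = {1, 10, 100, 300, 1000, 3000, 10000, 30000, 100000}
COMPLIANT_REFRESH = 10


def is_compliant(scale, refresh=COMPLIANT_REFRESH):
    """Return True if the -s and -r values satisfy both rules."""
    return scale in COMPLIANT_SCALE_FACTORS and refresh == COMPLIANT_REFRESH


print(is_compliant(100), is_compliant(50), is_compliant(100, refresh=20))
```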
6. Sample DBGEN executions
DBGEN has been built to allow as much flexibility as possible, but is
fundamentally intended to generate two things: a database population
against which the queries in TPC-H can be run, and the updates
that are used during the update functions in TPC-H. Here are
some sample uses of DBGEN.
1. To generate the database population for the qualification database
dbgen -s 1
2. To generate the lineitem table only, for a scale factor 10 database,
and over-write any existing flat files:
dbgen -s 10 -f -T L
3. To generate a 100GB data set in 1GB pieces, generate only the part and
partsupp tables, and include some progress reports along the way:
dbgen -s 100 -S 1 -C 100 -T p -v (to generate the first 1GB file)
dbgen -s 100 -S 2 -C 100 -T p -v (to generate the second 1GB file)
(and so on, incrementing the argument to -S each time)
4. To generate the update files needed for a 4 stream run of the throughput
test at 100 GB, using an existing set of seed files from an 8 process
load:
dbgen -s 100 -U 4 -C 8
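The multi-part example, where -S is incremented for each piece, lends itself to a loop. The sketch below builds the full list of invocations without running them; the helper name chunked_dbgen_commands is an assumption made for illustration.

```python
def chunked_dbgen_commands(scale, chunks, table=None, verbose=False):
    """Build one dbgen argument list per chunk, with -S running from 1 to chunks."""
    commands = []
    for step in range(1, chunks + 1):
        argv = ["dbgen", "-s", str(scale), "-S", str(step), "-C", str(chunks)]
        if table is not None:
            argv += ["-T", table]
        if verbose:
            argv.append("-v")
        commands.append(argv)
    return commands


commands = chunked_dbgen_commands(scale=100, chunks=100, table="p", verbose=True)
print(" ".join(commands[0]))
print(" ".join(commands[-1]))
```

Because each piece is a separate invocation, the commands could be distributed over several machines or run a few at a time.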
7. What is QGEN?
QGEN is a query generation program for use with the TPC-H benchmark.
It is written in ANSI 'C' for portability, and has been successfully
ported to over a dozen different systems. While the benchmark specifications
allow an implementor to use any utility to create the benchmark query
sets, QGEN has been provided to make the process of building
a benchmark implementation as simple as possible.
8. What will QGEN create?
QGEN is a filter, triggered by :'s. It does line-at-a-time reads of its
input (more on that later), scanning for :foo, where foo determines the
substitution that occurs. The substitutions include:
:<int> replace with the appropriate value for parameter <int>
:b replace with START_TRAN (from tpcd.h)
:c replace with SET_DBASE (from tpcd.h)
:n<int> replace with SET_ROWCOUNT(<int>) (from tpcd.h)
:o replace with SET_OUTPUT (from tpcd.h)
:q replace with query number
:s replace with stream number
:x replace with GEN_QUERY_PLAN (from tpcd.h)
QGEN takes an assortment of command line options, controlling which of these
substitutions should be active during the translation from template to
executable query text (EQT), and a list of query "names". It then translates
the template found in $DSS_QUERY/<name>.sql and writes the result to stdout.
Here is a sample query template:
{ Sccsid: @(#)1.sql 9.1.1.1 1/25/95 10:51:56 }
:n 0
:o
select
l_returnflag,
l_linestatus,
sum(l_quantity) as sum_qty,
sum(l_extendedprice) as sum_base_price,
sum(l_extendedprice * (1 - l_discount)) as sum_disc_price,
sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) as sum_charge,
avg(l_quantity) as avg_qty,
avg(l_extendedprice) as avg_price,
avg(l_discount) as avg_disc,
count(*) as count_order
from lineitem
where l_shipdate <= date '1998-12-01' - interval :1 day
group by l_returnflag, l_linestatus
order by l_returnflag, l_linestatus;
And here is what is generated:
$ qgen -d 1
{return 0 rows}
select
l_returnflag,
l_linestatus,
sum(l_quantity) as sum_qty,
sum(l_extendedprice) as sum_base_price,
sum(l_extendedprice * (1 - l_discount)) as sum_disc_price,
sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) as sum_charge,
avg(l_quantity) as avg_qty,
avg(l_extendedprice) as avg_price,
avg(l_discount) as avg_disc,
count(*) as count_order
from lineitem
where l_shipdate <= date('1998-12-01') - interval (90) day to day
group by l_returnflag, l_linestatus
order by l_returnflag, l_linestatus;
See "Query Template Syntax" below for more detail on converting your preferred
query phrasing for use with QGEN.
9. How is QGEN built?
QGEN is built by the same makefile that creates DBGEN. If the makefile
is successfully creating DBGEN, no further compilation modifications
should be necessary. You may need to modify some of the options which
allow QGEN to integrate with your preferred query tool. Refer to
Porting.Notes for more detail.
10. Command Line Options for QGEN
Like DBGEN, QGEN is controlled by a combination of command line options
and environment variables (see "Environment Variables" below for more
detail). Command line options are assumed to be single
letter flags preceded by a minus sign. They may be followed by an
optional argument.
option  argument  default  action
------  --------  -------  ------
-k                         Generate 'skewed' parameters that bring about all the
                           problems with correlations and skew. Omitting this will
                           generate (non-default) parameters that mostly avoid
                           these problems and behave rather similar to default TPC-H
-c      none               Retain comments in translation of template to
                           EQT
-d      none               Default. Use the parameter substitutions
                           required for query validation
-h                         Display a usage summary
-i      <file>             Use contents of <file> to init a query stream
-l      <file>             Save query parameters to <file>
-n      <name>             Use database <name> for queries
-N                         Always use default rowcount, and ignore :n directives
-o      <path>             Save query n's output in <path>/n.<stream>
                           Uses -p option, and uses :o tag
-p      <stream>           Use the query permutation defined for
                           stream <stream>. If this option is
                           omitted, EQT will be generated for the
                           queries named on the command line.
-r      <n>                Seed the random number generator with <n>
-s      <n>                Set scale to <n> for parameter
                           substitutions.
-t      <file>             Use contents of <file> to complete a query
                           stream
-T      none               Use time table format for date substitution
-v      none               Verbose. Progress messages are
                           displayed as data is generated.
-x      none               Generate a query plan as part of query
                           execution.
11. Query Template Syntax
QGEN is a simple ASCII text filter, meant to translate generalized
query syntax ("query template") into the executable query text (EQT)
required by the benchmark. It provides a number of shorthands and syntactic
extensions that allow the automatic generation of query parameters and some
control over the operation of the benchmark implementation.
QGEN first strips all comments from the query template, recognizing both
{comment} and --comment styles. Next it traverses the query template
one line at a time, locating required substitution points, called
parameter tags. The values substituted for a given tag are summarized
below. QGEN does not support nested substitutions. That is, if
the text substituted for a tag itself contains a valid tag, the second tag
will not be expanded.
Tag          Converted To                       Based on
===          ============                       ========
:c           database <dbname>;(1)              -n from the command line
:x           set explain on;(1)                 -x from the command line
:<number>    parameter <number>
:s           stream number
:o           output to outpath/qnum.stream;(1)  -o from command line, -s from
                                                command line
:b           BEGIN WORK;(1)                     -a from command line
:e           COMMIT WORK(1)                     -a from command line
:q           query number
:n <number>  sets rowcount to be returned
             to <number>, unless -N appears on the command line
Notes:
(1) This is Informix-specific syntax. Refer to Porting.Notes for
tailoring the generated text to your database environment.
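The comment stripping and single-pass substitution described above can be imitated in a few lines. This is a toy sketch, not QGEN's actual implementation: it handles only the :<number>, :q and :s tags, leaves every other tag untouched, and the function name expand_template and its parameters are assumptions. Because the substitution is done in one regular-expression pass, text inserted for one tag is never rescanned, which mirrors the "no nested substitutions" rule.

```python
import re


def expand_template(template, params, query_number, stream_number=0):
    """Strip comments, then replace :<number>, :q and :s in a single pass."""
    # Remove {comment} and --comment styles first.
    text = re.sub(r"\{[^}]*\}", "", template)
    text = re.sub(r"--[^\n]*", "", text)

    def substitute(match):
        tag = match.group(1)
        if tag == "q":
            return str(query_number)
        if tag == "s":
            return str(stream_number)
        number = int(tag)
        if number in params:
            return str(params[number])
        return match.group(0)  # unknown parameter: leave the tag as written

    # One pass only: substituted text is not scanned again for tags.
    return re.sub(r":(\d+|[qs])\b", substitute, text)


template = "{ header comment }\nselect :q as q from t -- note\nwhere d < ':1';\n"
print(expand_template(template, {1: "1995-07-18"}, query_number=1))
```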
12. Sample QGEN executions and Query Templates
QGEN translates generic query templates into valid SQL. In addition, it
allows conditional inclusion of the commands necessary to connect to a
database, produce diagnostic output, etc. Here are some samples of QGEN
usage, and the way that command line parameters and the query templates
interact to produce valid SQL.
Template, in $DSS_QUERY/1.sql:
:c
:o
select count(*) from foo;
:x
select count(*) from lineitem
where l_orderdate < ':1';
1. "qgen 1", would produce:
select count(*) from foo;
select count(*) from lineitem
where l_orderdate < '1997-01-01';
Assuming that 1 January 1997 was a valid substitution for parameter 1.
2. "qgen -d -n dss1 1", would produce:
database dss1;
select count(*) from foo;
select count(*) from lineitem
where l_orderdate < '1995-07-18';
Assuming that 18 July 1995 was the default substitution for parameter 1,
and using Informix syntax.
3. "qgen -d -n dss1 -x -o somepath 1", would produce:
database dss1;
output to "somepath/1.0"
select count(*) from foo;
set explain on;
select count(*) from lineitem
where l_orderdate < '1995-07-18';
Assuming that 18 July 1995 was the default substitution for parameter 1,
and using Informix syntax.
13. Environment Variables
Environment variables are used to control features of DBGEN and QGEN
which are unlikely to change from one execution to another.
Variable    Default    Action
--------    -------    ------
DSS_PATH    .          Directory in which to build flat files
DSS_CONFIG  .          Directory in which to find configuration files
DSS_DIST    dists.dss  Name of distribution definition file
DSS_QUERY   .          Directory in which to find query templates
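A wrapper script can mirror these defaults when it needs to locate the same files that DBGEN and QGEN use. The sketch below is a minimal illustration; the names DEFAULTS, resolve_settings and template_path are hypothetical, while the variables, their defaults, and the $DSS_QUERY/<name>.sql layout come from this document.

```python
import os

# Variable names and defaults as listed in the table above.
DEFAULTS = {
    "DSS_PATH": ".",
    "DSS_CONFIG": ".",
    "DSS_DIST": "dists.dss",
    "DSS_QUERY": ".",
}


def resolve_settings(environ=None):
    """Return the effective value of each variable, falling back to its default."""
    if environ is None:
        environ = os.environ
    return {name: environ.get(name, default) for name, default in DEFAULTS.items()}


def template_path(query_name, environ=None):
    """Return the path of the query template $DSS_QUERY/<name>.sql."""
    query_dir = resolve_settings(environ)["DSS_QUERY"]
    return os.path.join(query_dir, f"{query_name}.sql")


print(resolve_settings({"DSS_PATH": "/data/tpch"}))
print(template_path(1, {"DSS_QUERY": "queries"}))
```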
14. Version Numbering in DBGEN and QGEN
DBGEN and QGEN use a common version numbering algorithm. Each executable
is stamped with a version number which is displayed in the usage messages
available with the '-h' option. A version number is of the form:
    V.R.P.M
    | | | |
    | | | |
    | | | |
    | | | -- modification: alphabetic, incremented for any trivial changes
    | | |                  to the source (e.g., porting ifdef's)
    | | ---- patch level: numeric, incremented for any minor bug fix
    | |                   (e.g., qgen parameter range)
    | ------- release: numeric, incremented for each minor revision of the
    |                  specification
    |-------- version: numeric, incremented for each major revision of the
                       specification
An implementation of TPC-H is valid only if it conforms to the
following version usage rules:
-- The Version of DBGEN and QGEN must match the integer portion of the
current specification revision
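The version rule can be checked mechanically. The sketch below assumes that a version string is written with dots, with the alphabetic modification as an optional fourth component (so both "2.4.0" and "2.4.0.a" are accepted); that layout and the function names are assumptions made for illustration.

```python
def parse_version(text):
    """Split 'V.R.P' or 'V.R.P.M' into (version, release, patch, modification)."""
    parts = text.split(".")
    if len(parts) not in (3, 4):
        raise ValueError(f"expected V.R.P or V.R.P.M, got {text!r}")
    version, release, patch = (int(part) for part in parts[:3])
    modification = parts[3] if len(parts) == 4 else ""
    return version, release, patch, modification


def version_matches_spec(tool_version, spec_revision):
    """Check that the tool's Version equals the integer portion of the spec revision."""
    spec_major = int(spec_revision.split(".")[0])
    return parse_version(tool_version)[0] == spec_major


print(parse_version("2.4.0"), version_matches_spec("2.4.0", "2.4.0"))
```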
15. The current revisions are:
DBGEN: 2.4.0
QGEN: 2.4.0
16. Validated Platforms
The following platforms have been validated to produce the reference
data set for TPC-H 2.4.0
Processor  Operating System (version)  Compiler (version)      Compiler Flags
----------------------------------------------------------------------------
POWER5     AIX 64-bit (5.3)            C for AIX Compiler, v7  -q64 (no -g)
IA-64      HPUX 64-bit ()              icc
           Linux 32-bit ()             gcc