-
Notifications
You must be signed in to change notification settings - Fork 3k
Variable precision datetime types
- Support variable precision for the following datetime types
TIMESTAMP(p) WITH TIME ZONE
TIMESTAMP(p) WITHOUT TIME ZONE
TIME(p) WITH TIME ZONE
TIME(p) WITHOUT TIME ZONE
- Maintain backward compatibility for the existing encoding to minimize disruption and changes to existing connectors
- Maximize performance for common values of
p
(e.g., the default ofp = 6
in the SQL spec, common values ofp
in connectors) - Eventually, bring Presto default in line with SQL standard (
p = 6
for TIMESTAMP types andp = 0
for TIME types)
Current default precision in Presto is 3
The SQL specification defines
- For TIMESTAMP(p) (both variants), default precision is 6
- For TIME(p) (both variants), default precision is 0
The range of values for TIMESTAMP defined by the SQL spec is 0001-01-01 to 9999-12-31 => ~10000 * 365 days (+/- leap days)
TODO: survey of common values of p
in other DBs and connectors
- Usages of
p
other than 0, 3, 6, 9 are unlikely in practice.
- What should be the maximum precision for each type?
- Should the maximum be the same for all types?
p | bits |
---|---|
0 | 39 |
1 | 42 |
2 | 45 |
3 | 49 |
4 | 52 |
5 | 55 |
6 | 59 |
7 | 62 |
8 | 65 |
9 | 69 |
10 | 72 |
11 | 75 |
12 | 79 |
Covering the whole range up to p = 12 requires 79 bits. For efficient handling of commonly used values of p (e.g., the default p = 3 in Presto or the default p = 6 in the SQL spec), a dual representation is beneficial, similar to the short vs long decimal treatment.
- Short encoding:
p <= 3
- Single 64-bit long.
- Stores units of 10^-3 seconds. Compatible with the current representation when p = 3.
- Medium encoding
3 < p <= 6
- Single 64-bit long
- Stores units of 10^-6 seconds. Theoretically, it can store units of 10^-7 seconds, but for cognitive simplicity, pick a common number (p = 6 = microseconds). Usages
of
p = 7
are unlikely, so no need to optimize for that case.
- Long encoding
p > 6
- 64-bit long + 16/32-bits (depending on the maximum precision we want to support)
- First long stores units of 10^-6 seconds. Second value stores the fractional part beyond that.
- Alternative: 10 or 12 byte binary value.
- Option 1: maintain the split of units of 10^-6 seconds vs fractional part. Easier to extract each part efficiently for manipulation
- Option 2: encode units of 10^-(max p) seconds. Slightly more space efficient, but requires doing decimal division to extract parts for processing.
Size of TIMESTAMP(p) WITHOUT TIME ZONE + 12 bits for TZ id.
p | bits |
---|---|
0 | 51 |
1 | 54 |
2 | 57 |
3 | 61 |
4 | 64 |
5 | 67 |
6 | 71 |
7 | 74 |
8 | 77 |
9 | 81 |
10 | 84 |
11 | 87 |
12 | 91 |
TODO
TODO
- Convert built-in datetime functions
- CURRENT_TIMESTAMP(p)
- LOCALTIMESTAMP(p)
- CURRENT_TIME(p)
- LOCALTIME(p)
- Convert types to parametric types
- TimestampType
- TimestampWithTimeZoneType
- TimeType
- TimeWithTimeZoneType
- How to introduce parametric types in a backward-compatible way?
- Add casts from T(p1) to T(p2)
- Truncation/rounding
- Store assignment rules (see 9.2.3).xvi).2).b) in SQL:2016)
- Update casts from T1(p1) to T2(p2)
- Support for
p != 3
in connectors - Usages of
x instanceof TimestampType
- Usages of
x.equals(TimestampType.TIMESTAMP)
- Usages of
x == TimestampType.TIMESTAMP
: https://github.com/prestosql/presto/pull/3759 - ConnectorSession.getStartTime() currently returns a long with number of milliseconds since epoch: https://github.com/prestosql/presto/pull/3761
- SqlTimestamp is based on milliseconds since epoch.
-
Currently, the inferred type of a literal like
TIMESTAMP '2020-05-20 12:34:56
istimestamp(3)
. Per the SQL spec, it should betimestamp(0)
. This will cause problems for connectors that don’t support variable-precision timestamps (i.e., anything other thantimestamp(3)
) for queries such asCREATE TABLE t(x) AS (VALUES TIMESTAMP '2020-05-20 12:34:56')
-
The JDBC driver expects all timestamps to be formatted as
yyyy-mm-dd hh:mm:ss.SSS
(3 mandatory digits for the fractional second part). This will cause older versions of the driver to fail whenever a query produces a column of type other thantimestamp(3)
. In particular, similar to the case above, a query such asVALUES TIMESTAMP '2020-05-20 12:34:56'
will fail for older drivers.
- Scale-aware representation: store the number of 10^-p seconds. This form makes the representation compatible with the current representation when p = 3, but:
- requires rescaling on cast between different
p
- requires knowing
p
to interpret values - harder to debug
- requires rescaling on cast between different