19 changes: 19 additions & 0 deletions PROTOCOL.md
@@ -35,6 +35,8 @@
- [Table Features for New and Existing Tables](#table-features-for-new-and-existing-tables)
- [Supported Features](#supported-features)
- [Active Features](#active-features)
- [Table Properties](#table-properties)
- [Statistics Collection Properties](#statistics-collection-properties)
- [Column Mapping](#column-mapping)
- [Writer Requirements for Column Mapping](#writer-requirements-for-column-mapping)
- [Reader Requirements for Column Mapping](#reader-requirements-for-column-mapping)
@@ -886,6 +888,22 @@ A feature being supported does not imply that it is active. For example, a table
## Active Features
A feature is active on a table when it is supported *and* its metadata requirements are satisfied. Each feature defines its own metadata requirements, as stated in the corresponding sections of this document. For example, the Append-only feature is active when the `appendOnly` feature name is present in a `protocol`'s `writerFeatures` *and* the table property `delta.appendOnly` is set to `true`.
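
The Append-only rule can be sketched as a small check. This is a minimal illustration and not part of the specification: the function name `is_append_only_active` and the plain-dict representation of the `protocol` and metadata actions are assumptions made for the example.

```python
def is_append_only_active(protocol: dict, metadata: dict) -> bool:
    """Return True when appendOnly is supported AND its metadata requirement is met."""
    # Supported: the feature name is listed in the protocol's writerFeatures.
    supported = "appendOnly" in (protocol.get("writerFeatures") or [])
    # Metadata requirement: the table property delta.appendOnly is set to "true".
    # Table properties live in the metadata action's `configuration` map,
    # whose values are strings.
    configuration = metadata.get("configuration") or {}
    configured = configuration.get("delta.appendOnly") == "true"
    return supported and configured
```

A table that lists `appendOnly` in `writerFeatures` but does not set `delta.appendOnly` supports the feature without it being active, so the check returns `False` in that case.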

# Table Properties

Delta tables support configuration through table properties stored in the `configuration` field of the [metadata](#change-metadata) action. The sections below describe the table properties recognized by the Delta protocol.

## Statistics Collection Properties

The following table properties control which columns have per-column statistics collected in [Per-file Statistics](#per-file-statistics):

Property | Description
-|-
`delta.dataSkippingStatsColumns` | A comma-separated list of column names for which to collect per-column statistics. A name may refer to a nested struct field using dot notation (e.g., `a.b.c`); if the named field is itself a struct, statistics are collected for all leaf fields within that struct. When this property is set, it takes precedence over `delta.dataSkippingNumIndexedCols`. Partition columns cannot be specified.
`delta.dataSkippingNumIndexedCols` | The number of leading leaf columns in the table schema for which to collect per-column statistics. Defaults to 32. This property is ignored if `delta.dataSkippingStatsColumns` is set. A negative value indicates that statistics should be collected for all columns.

When neither property is set, statistics are collected for the first 32 leaf columns in the table schema (excluding partition columns).
When [Column Mapping](#column-mapping) is enabled, per-column statistics are keyed by physical column names.
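
The precedence between the two properties can be sketched as follows. This is an illustrative sketch under stated assumptions, not a reference implementation: the function name `stats_columns` and the representation of the schema as an ordered list of dotted leaf paths are hypothetical, partition columns are assumed to be dropped before the leading columns are counted, and escaping of column names that themselves contain commas or dots is not handled.

```python
DEFAULT_NUM_INDEXED_COLS = 32  # default stated for delta.dataSkippingNumIndexedCols


def stats_columns(leaf_columns, partition_columns, configuration):
    """Pick the leaf columns that get per-column statistics.

    leaf_columns: dotted leaf column paths in schema order (hypothetical input shape).
    configuration: the table's string-to-string property map.
    """
    partitions = set(partition_columns)
    # Assumption: partition columns are excluded before counting leading columns.
    candidates = [c for c in leaf_columns if c not in partitions]

    explicit = configuration.get("delta.dataSkippingStatsColumns")
    if explicit is not None:
        # The explicit list takes precedence; the numeric property is ignored.
        names = [n.strip() for n in explicit.split(",") if n.strip()]
        # A name selects a leaf directly, or every leaf nested under a struct.
        return [
            c for c in candidates
            if any(c == n or c.startswith(n + ".") for n in names)
        ]

    limit = int(configuration.get("delta.dataSkippingNumIndexedCols",
                                  DEFAULT_NUM_INDEXED_COLS))
    if limit < 0:
        return candidates  # negative value: collect statistics for all columns
    return candidates[:limit]
```

For example, with leaf columns `id`, `a.b.c`, `a.b.d`, and `a.e`, setting `delta.dataSkippingStatsColumns` to `a.b` selects `a.b.c` and `a.b.d` regardless of the value of `delta.dataSkippingNumIndexedCols`.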

# Column Mapping
Delta can use column mapping to avoid any column naming restrictions, and to support the renaming and dropping of columns without having to rewrite all the data. There are two modes of column mapping, by `name` and by `id`. In both modes, every column - nested or leaf - is assigned a unique _physical_ name, and a unique 32-bit integer as an id. The physical name is stored as part of the column metadata with the key `delta.columnMapping.physicalName`. The column id is stored within the metadata with the key `delta.columnMapping.id`.
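
As a small illustration of the two metadata keys, the sketch below reads them from a schema field represented as a plain dict. The field contents (`user_id`, the physical name `col-a1b2c3`, the id `1`) and the helper names are made up for the example.

```python
# Hypothetical schema field; the values are invented for illustration.
field = {
    "name": "user_id",  # logical name, which may be renamed later
    "type": "long",
    "nullable": True,
    "metadata": {
        "delta.columnMapping.id": 1,
        "delta.columnMapping.physicalName": "col-a1b2c3",
    },
}


def physical_name(field: dict) -> str:
    """Return the physical name stored in the column metadata."""
    return field["metadata"]["delta.columnMapping.physicalName"]


def column_id(field: dict) -> int:
    """Return the column id stored in the column metadata."""
    return field["metadata"]["delta.columnMapping.id"]
```

Renaming the column would change only `name`; the physical name and id stay the same, which is why the underlying data does not need to be rewritten.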

@@ -2084,6 +2102,7 @@ Bytes | Name | Description
## Per-file Statistics
`add` and `remove` actions can optionally contain statistics about the data in the file being added or removed from the table.
These statistics can be used for eliminating files based on query predicates or as inputs to query optimization.
See [Statistics Collection Properties](#statistics-collection-properties) for table properties that control which columns have statistics collected.
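
To illustrate how statistics can eliminate files for a query predicate, the sketch below tests an equality predicate against per-column minimum and maximum values. It is a simplified example under stated assumptions: the helper name `may_contain` is hypothetical, the `minValues` and `maxValues` field names come from the per-column statistics defined elsewhere in the protocol rather than from this excerpt, and only top-level columns with comparable values are handled.

```python
def may_contain(stats: dict, column: str, value) -> bool:
    """Return False only when the statistics prove no row has column == value."""
    min_values = stats.get("minValues") or {}
    max_values = stats.get("maxValues") or {}
    if column not in min_values or column not in max_values:
        # No statistics were collected for this column, so the file cannot be skipped.
        return True
    return min_values[column] <= value <= max_values[column]
```

A file for which the check returns `False` can be skipped for that predicate. A `True` result only means the file has to be read, not that a matching row exists.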

Global statistics record information about the entire file.
The following global statistic is currently supported: