diff --git a/versioned_docs/version-v1.15.0/README.md b/versioned_docs/version-v1.15.0/README.md new file mode 100644 index 00000000..765b6c15 --- /dev/null +++ b/versioned_docs/version-v1.15.0/README.md @@ -0,0 +1,108 @@ +--- +sidebar_position: 1 +sidebar_label: Introduction +--- + +# The Zed Project + +Zed offers a new approach to data that makes it easier to manipulate and manage +your data. + +With Zed's new [super-structured data model](formats/README.md#2-zed-a-super-structured-pattern), +messy JSON data can easily be given the fully-typed precision of relational tables +without giving up JSON's uncanny ability to represent eclectic data. + +## Getting Started + +Trying out Zed is easy: just [install](install.md) the command-line tool +[`zq`](commands/zq.md) and run through the [zq tutorial](tutorials/zq.md). + +`zq` is a lot like [`jq`](https://stedolan.github.io/jq/) +but is built from the ground up as a search and analytics engine based +on the [Zed data model](formats/zed.md). Since Zed data is a +proper superset of JSON, `zq` also works natively with JSON. + +While `zq` and the Zed data formats are production quality, the Zed project's +[Zed data lake](commands/zed.md) is a bit [earlier in development](commands/zed.md#status). + +For a non-technical user, Zed is as easy to use as web search +while for a technical user, Zed exposes its technical underpinnings +in a gradual slope, providing as much detail as desired, +packaged up in the easy-to-understand +[ZSON data format](formats/zson.md) and +[Zed language](language/README.md). + +## Terminology + +"Zed" is an umbrella term that describes +a number of different elements of the system: +* The [Zed data model](formats/zed.md) is the abstract definition of the data types and semantics +that underlie the Zed formats. +* The [Zed formats](formats/README.md) are a family of +[sequential (ZNG)](formats/zng.md), [columnar (VNG)](formats/vng.md), +and [human-readable (ZSON)](formats/zson.md) formats that all adhere to the +same abstract Zed data model. +* A [Zed lake](commands/zed.md) is a collection of Zed data stored +across one or more [data pools](commands/zed.md#data-pools) with ACID commit semantics and +accessed via a [Git](https://git-scm.com/)-like API. +* The [Zed language](language/README.md) is the system's dataflow language for performing +queries, searches, analytics, transformations, or any of the above combined together. +* A [Zed query](language/overview.md) is a Zed script that performs +search and/or analytics. +* A [Zed shaper](language/shaping.md) is a Zed script that performs +data transformation to _shape_ +the input data into the desired set of organizing Zed data types called "shapes", +which are traditionally called _schemas_ in relational systems but are +much more flexible in the Zed system. + +## Digging Deeper + +The [Zed language documentation](language/README.md) +is the best way to learn about `zq` in depth. +All of its examples use `zq` commands run on the command line. +Run `zq -h` for a list of command options and online help. + +The [Zed lake documentation](commands/zed.md) +is the best way to learn about `zed`. +All of its examples use `zed` commands run on the command line. +Run `zed -h` or `-h` with any subcommand for a list of command options +and online help. The same language query that works for `zq` operating +on local files or streams also works for `zed query` operating on a lake. 
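+
+For example, the following sketch runs the same aggregation first with `zq`
+over a local file and then with `zed query` over a lake pool (the file and
+pool names here are illustrative):
+```
+zq 'count() by field' sample.zng
+zed query 'from logs | count() by field'
+```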
+
+## Design Philosophy
+
+The design philosophy for Zed is based on composable building blocks
+built from self-describing data structures. Everything in a Zed lake
+is built from Zed data and each system component can be run and tested in isolation.
+
+Since Zed data is self-describing, this approach makes stream composition
+very easy. Data from a Zed query can trivially be piped to a local
+instance of `zq` by feeding the resulting Zed stream to stdin of `zq`, for example,
+```
+zed query "from pool | ...remote query..." | zq "...local query..." -
+```
+There is no need to configure the Zed entities with schema information
+like [protobuf configs](https://developers.google.com/protocol-buffers/docs/proto3)
+or connections to
+[schema registries](https://docs.confluent.io/platform/current/schema-registry/index.html).
+
+A Zed lake is completely self-contained, requiring no auxiliary databases
+(like the [Hive metastore](https://cwiki.apache.org/confluence/display/hive/design))
+or other third-party services to interpret the lake data.
+Once copied, a new service can be instantiated by pointing `zed serve`
+at the copy of the lake.
+
+Functionality like [data compaction](commands/zed.md#manage) and retention is API-driven.
+
+Bite-sized components are unified by Zed data, usually in the ZNG format:
+* All lake meta-data is available via meta-queries.
+* All lake operations available through the service API are also available
+directly via the `zed` command.
+* Lake management is agent-driven through the API. For example, instead of complex policies
+like data compaction being implemented in the core with some fixed set of
+algorithms and policies, an agent can simply hit the API to obtain the meta-data
+of the objects in the lake, analyze the objects (e.g., looking for too much
+key space overlap), and issue API commands to merge overlapping objects
+and delete the old fragmented objects, all with the transactional consistency
+of the commit log.
+* Components are easily tested and debugged in isolation.
diff --git a/versioned_docs/version-v1.15.0/commands/README.md b/versioned_docs/version-v1.15.0/commands/README.md
new file mode 100644
index 00000000..4c6b608e
--- /dev/null
+++ b/versioned_docs/version-v1.15.0/commands/README.md
@@ -0,0 +1,18 @@
+# Command Tooling
+
+The Zed system is managed and queried with the [`zed` command](zed.md),
+which is organized into numerous subcommands like the familiar command patterns
+of `docker` or `kubectl`.
+Built-in help for the `zed` command and all of its subcommands is always
+accessible with the `-h` flag.
+
+The [`zq` command](zq.md) offers a convenient slice of `zed` for running
+stand-alone, command-line queries on inputs from files, HTTP URLs, or [S3](../integrations/amazon-s3.md).
+`zq` is like [`jq`](https://stedolan.github.io/jq/) but is easier and faster, utilizes the richer
+Zed data model, and interoperates with a number of other formats beyond JSON.
+If you don't need a Zed lake, you can install just the
+slimmer `zq` command, which omits lake support and dev tools.
+
+`zq` is always installed alongside `zed`. You might find yourself mixing and
+matching `zed` lake queries with `zq` local queries and stitching them
+all together with Unix pipelines.
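+
+For example, here is a minimal sketch of such a pipeline, assuming a pool
+named `logs` already exists and holds Zeek-style records with an `id.orig_h`
+field:
+```
+zed query -f zng 'from logs' | zq -z 'count() by id.orig_h' -
+```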
diff --git a/versioned_docs/version-v1.15.0/commands/_category_.yaml b/versioned_docs/version-v1.15.0/commands/_category_.yaml new file mode 100644 index 00000000..e5c770ba --- /dev/null +++ b/versioned_docs/version-v1.15.0/commands/_category_.yaml @@ -0,0 +1,2 @@ +position: 3 +label: Commands diff --git a/versioned_docs/version-v1.15.0/commands/zed.md b/versioned_docs/version-v1.15.0/commands/zed.md new file mode 100644 index 00000000..0a991c22 --- /dev/null +++ b/versioned_docs/version-v1.15.0/commands/zed.md @@ -0,0 +1,799 @@ +--- +sidebar_position: 2 +sidebar_label: zed +--- + +# zed + +> **TL;DR** `zed` is a command-line tool to manage and query Zed data lakes. +> You can import data from a variety of formats and `zed` will automatically +> commit the data in the Zed data model's [super-structured](../formats/README.md) +> format, providing full fidelity of the original format and the ability +> to reconstruct the original data without loss of information. +> +> Zed lakes provide an easy-to-use substrate for data discovery, preparation, +> and transformation as well as serving as a queryable and searchable store +> for super-structured data both for online and archive use cases. + +

+ +:::tip Status +While [`zq`](zq.md) and the [Zed formats](../formats/README.md) +are production quality, the Zed lake is still fairly early in development +and alpha quality. +That said, Zed lakes can be utilized quite effectively at small scale, +or at larger scales when scripted automation +is deployed to manage the lake's data layout via the +[lake API](../lake/api.md). + +Enhanced scalability with self-tuning configuration is under development. +::: + +## The Lake Model + +A Zed lake is a cloud-native arrangement of data, optimized for search, +analytics, ETL, data discovery, and data preparation +at scale based on data represented in accordance +with the [Zed data model](../formats/zed.md). + +A lake is organized into a collection of data pools forming a single +administrative domain. The current implementation supports +ACID append and delete semantics at the commit level while +we have plans to support CRUD updates at the primary-key level +in the near future. + +The semantics of a Zed lake loosely follows the nomenclature and +design patterns of [`git`](https://git-scm.com/). In this approach, +* a _lake_ is like a GitHub organization, +* a _pool_ is like a `git` repository, +* a _branch_ of a _pool_ is like a `git` branch, +* the _use_ command is like a `git checkout`, and +* the _load_ command is like a `git add/commit/push`. + +A core theme of the Zed lake design is _ergonomics_. Given the Git metaphor, +our goal here is that the Zed lake tooling be as easy and familiar as Git is +to a technical user. + +Since Zed lakes are built around the Zed data model, +getting different kinds of data into and out of a lake is easy. +There is no need to define schemas or tables and then fit +semi-structured data into schemas before loading data into a lake. +And because Zed supports a large family of formats and the load endpoint +automatically detects most formats, it's easy to just load data into a lake +without thinking about how to convert it into the right format. + +### CLI-First Approach + +The Zed project has taken a _CLI-first approach_ to designing and implementing +the system. Any time a new piece of functionality is added to the lake, +it is first implemented as a `zed` command. This is particularly convenient +for testing and continuous integration as well as providing intuitive, +bite-sized chunks for learning how the system works and how the different +components come together. + +While the CLI-first approach provides these benefits, +all of the functionality is also exposed through [an API](../lake/api.md) to +a Zed service. Many use cases involve an application like +[Zui](https://zui.brimdata.io/) or a +programming environment like Python/Pandas interacting +with the service API in place of direct use with the `zed` command. + +### Storage Layer + +The Zed lake storage model is designed to leverage modern cloud object stores +and separates compute from storage. + +A lake is entirely defined by a collection of cloud objects stored +at a configured object-key prefix. This prefix is called the _storage path_. +All of the meta-data describing the data pools, branches, commit history, +and so forth is stored as cloud objects inside of the lake. There is no need +to set up and manage an auxiliary metadata store. + +Data is arranged in a lake as a set of pools, which are comprised of one +or more branches, which consist of a sequence of data commit objects +that point to cloud data objects. 
+
+Cloud objects and commits are immutable and named with globally unique IDs,
+based on [KSUIDs](https://github.com/segmentio/ksuid), and many
+commands may reference various lake entities by their ID, e.g.,
+* _Pool ID_ - the KSUID of a pool
+* _Commit object ID_ - the KSUID of a commit object
+* _Data object ID_ - the KSUID of a committed data object
+
+Data is added to and deleted from the lake only with new commits that
+are implemented in a transactionally consistent fashion. Thus, each
+commit object (identified by its globally-unique ID) provides a completely
+consistent view of an arbitrarily large amount of committed data
+at a specific point in time.
+
+While this commit model may sound heavyweight, excellent live ingest performance
+can be achieved by micro-batching commits.
+
+Because the Zed lake represents all state transitions with immutable objects,
+the caching of any cloud object (or byte ranges of cloud objects)
+is easy and effective since a cached object is never invalid.
+This design makes backup/restore, data migration, archive, and
+replication easy to support and deploy.
+
+The cloud objects that comprise a lake, e.g., data objects,
+commit history, transaction journals, partial aggregations, etc.,
+are stored as Zed data, i.e., either as [row-based ZNG](../formats/zng.md)
+or [columnar VNG](../formats/vng.md).
+This makes introspection of the lake structure straightforward as many key
+lake data structures can be queried with metadata queries and presented
+to a client as Zed data for further processing by downstream tooling.
+
+Zed's implementation also includes a storage abstraction that maps the cloud object
+model onto a file system so that Zed lakes can also be deployed on standard file systems.
+
+### Zed Command Personalities
+
+The `zed` command provides a single command-line interface to Zed lakes, but
+different personalities are taken on by `zed` depending on the particular
+sub-command executed and the [lake location](#locating-the-lake).
+
+To this end, `zed` can take on one of three personalities:
+
+* _Direct Access_ - When the lake is a storage path (`file` or `s3` URI),
+then the `zed` commands (except for `serve`) all operate directly on the
+lake located at that path.
+* _Client Personality_ - When the lake is an HTTP or HTTPS URL, then the
+lake is presumed to be a Zed lake service endpoint and the client
+commands are directed to the service managing the lake.
+* _Server Personality_ - When the [`zed serve`](#serve) command is executed, then
+the personality is always the server personality and the lake must be
+a storage path. This command initiates a continuous server process
+that serves client requests for the lake at the configured storage path.
+
+Note that a storage path on the file system may be specified either as
+a fully qualified file URI of the form `file://` or as a standard
+file system path, relative or absolute, e.g., `/lakes/test`.
+
+Concurrent access to any Zed lake storage, of course, preserves
+data consistency. You can run multiple `zed serve` processes while also
+running any `zed` lake command all pointing at the same storage endpoint
+and the lake's data footprint will always remain consistent as the endpoints
+all adhere to the consistency semantics of the Zed lake.
+
+> One caveat here: data consistency is not fully implemented yet for
+> the S3 endpoint so only single-node access to S3 is available right now,
+> though support for multi-node access is forthcoming.
+> For a shared file system, the close-to-open cache consistency +> semantics of NFS should provide the necessary consistency guarantees needed by +> a Zed lake though this has not been tested. Multi-process, single-node +> access to a local file system has been thoroughly tested and should be +> deemed reliable, i.e., you can run a direct-access instance of `zed` alongside +> a server instance of `zed` on the same file system and data consistency will +> be maintained. + +### Locating the Lake + +At times you may want the Zed CLI tools to access the same lake storage +used by other tools such as [Zui](https://zui.brimdata.io/). To help +enable this by default while allowing for separate lake storage when desired, +`zed` checks each of the following in order to attempt to locate an existing +lake. + +1. The contents of the `-lake` option (if specified) +2. The contents of the `ZED_LAKE` environment variable (if defined) +3. A Zed lake service running locally at `http://localhost:9867` (if a socket + is listening at that port) +4. A `zed` subdirectory below a path in the + [`XDG_DATA_HOME`](https://specifications.freedesktop.org/basedir-spec/basedir-spec-latest.html) + environment variable (if defined) +5. A default file system location based on detected OS platform: + - `%LOCALAPPDATA%\zed` on Windows + - `$HOME/.local/share/zed` on Linux and macOS + +### Data Pools + +A lake is made up of _data pools_, which are like "collections" in NoSQL +document stores. Pools may have one or more branches and every pool always +has a branch called `main`. + +A pool is created with the [create command](#create) +and a branch of a pool is created with the [branch command](#branch). + +A pool name can be any valid UTF-8 string and is allocated a unique ID +when created. The pool can be referred to by its name or by its ID. +A pool may be renamed but the unique ID is always fixed. + +#### Commit Objects + +Data is added into a pool in atomic units called _commit objects_. + +Each commit object is assigned a global ID. +Similar to Git, Zed commit objects are arranged into a tree and +represent the entire commit history of the lake. + +> Technically speaking, Git can merge from multiple parents and thus +Git commits form a directed acyclic graph instead of a tree; +Zed does not currently support multiple parents in the commit object history. + +A branch is simply a named pointer to a commit object in the Zed lake +and like a pool, a branch name can be any valid UTF-8 string. +Consistent updates to a branch are made by writing a new commit object that +points to the previous tip of the branch and updating the branch to point at +the new commit object. This update may be made with a transaction constraint +(e.g., requiring that the previous branch tip is the same as the +commit object's parent); if the constraint is violated, then the transaction +is aborted. + +The _working branch_ of a pool may be selected on any command with the `-use` option +or may be persisted across commands with the [use command](#use) so that +`-use` does not have to be specified on each command-line. For interactive +workflows, the `use` command is convenient but for automated workflows +in scripts, it is good practice to explicitly specify the branch in each +command invocation with the `-use` option. + +#### Commitish + +Many `zed` commands operate with respect to a commit object. +While commit objects are always referenceable by their commit ID, it is also convenient +to refer to the commit object at the tip of a branch. 
+
+The entity that represents either a commit ID or a branch is called a _commitish_.
+A commitish is always relative to the pool and has the form:
+* `<pool>@<id>` or
+* `<pool>@<branch>`
+
+where `<pool>` is a pool name or pool ID, `<id>` is a commit object ID,
+and `<branch>` is a branch name.
+
+In particular, the working branch set by the [use command](#use) is a commitish.
+
+A commitish may be abbreviated in several ways where the missing detail is
+obtained from the working-branch commitish, e.g.,
+* `<pool>` - When just a pool name is given, then the commitish is assumed to be
+`<pool>@main`.
+* `@<id>` or `<id>` - When an ID is given (optionally with the `@` prefix), then the commitish is assumed to be `<pool>@<id>` where `<pool>` is obtained from the working-branch commitish.
+* `@<branch>` - When a branch name is given with the `@` prefix, then the commitish is assumed to be `<pool>@<branch>` where `<pool>` is obtained from the working-branch commitish.
+
+An argument to a command that takes a commit object is called a _commitish_
+since it can be expressed as a branch or as a commit ID.
+
+#### Pool Key
+
+Each data pool is organized according to its configured _pool key_,
+which is the sort key for all data stored in the lake. Different data pools
+can have different pool keys but all of the data in a pool must have the same
+pool key.
+
+As pool data is often comprised of Zed records (analogous to JSON objects),
+the pool key is typically a field of the stored records.
+When pool data is not structured as records/objects (e.g., scalars, arrays, or other
+non-record types), then the pool key would typically be configured
+as the [special value `this`](../language/dataflow-model.md#the-special-value-this).
+
+Data can be efficiently scanned if a query has a filter operating on the pool
+key. For example, on a pool with pool key `ts`, the query `ts == 100`
+will be optimized to scan only the data objects where the value `100` could be
+present.
+
+> The pool key will also serve as the primary key for the forthcoming
+> CRUD semantics.
+
+A pool also has a configured sort order, either ascending or descending,
+and data is organized in the pool in accordance with this order.
+Data scans may be either ascending or descending, and scans that
+follow the configured order are generally more efficient than
+scans that run in the opposing order.
+
+Scans may also be range-limited but unordered.
+
+Any data loaded into a pool that lacks the pool key is presumed
+to have a null value with regard to range scans. If large amounts
+of such "keyless data" are loaded into a pool, the ability to
+optimize scans over such data is impaired.
+
+### Time Travel
+
+Because commits are transactional and immutable, a query
+sees its entire data scan as a fixed "snapshot" with respect to the
+commit history. In fact, Zed's [from operator](../language/operators/from.md)
+allows a commit object to be specified with the `@` suffix to a
+pool reference, e.g.,
+```
+zed query 'from logs@1tRxi7zjT7oKxCBwwZ0rbaiLRxb | ...'
+```
+In this way, a query can time-travel through the commit history. As long as the
+underlying data has not been deleted, arbitrarily old snapshots of the Zed
+lake can be easily queried.
+
+If a writer commits data after or while a reader is scanning, then the reader
+does not see the new data since it's scanning the snapshot that existed
+before these new writes occurred.
+
+Also, arbitrary metadata can be committed to the log as described below,
+e.g., to associate derived analytics with a specific
+journal commit point potentially across different data pools in
+a transactionally consistent fashion.
+
+While time travel through commit history provides one means to explore
+past snapshots of the commit history, another means is to use a timestamp.
+Because the entire history of branch updates is stored in a transaction journal
+and each entry contains a timestamp, branch references can be easily
+navigated by time. For example, a list of branches of a pool's past
+can be created by scanning the internal "pools log" and stopping at the largest
+timestamp less than or equal to the desired timestamp. Then using that
+historical snapshot of the pools, a branch can be located within the pool
+using that pool's "branches log" in a similar fashion, then its corresponding
+commit object can be used to construct the data of that branch at that
+past point in time.
+
+> Note that time travel using timestamps is a forthcoming feature.
+
+## Zed Commands
+
+The `zed` command is structured as a primary command
+consisting of a large number of interrelated sub-commands, similar to the
+[`docker`](https://docs.docker.com/engine/reference/commandline/cli/)
+or [`kubectl`](https://kubernetes.io/docs/reference/generated/kubectl/kubectl-commands)
+commands.
+
+The following sections describe each of the available commands and highlight
+some key options. Built-in help shows the commands and their options:
+
+* `zed -h` with no args displays a list of `zed` commands.
+* `zed command -h`, where `command` is a sub-command, displays help
+for that sub-command.
+* `zed command sub-command -h` displays help for a sub-command of a
+sub-command and so forth.
+
+### Auth
+```
+zed auth login|logout|method|verify
+```
+Access to a Zed lake can be secured with [Auth0 authentication](https://auth0.com/).
+Please reach out to us on our [community Slack](https://www.brimdata.io/join-slack/)
+if you'd like help setting this up and trying it out.
+
+### Branch
+```
+zed branch [options] [name]
+```
+The `branch` command creates a branch with the name `name` that points
+to the tip of the working branch or, if the `name` argument is not provided,
+lists the existing branches of the selected pool.
+
+For example, this branch command
+```
+zed branch -use logs@main staging
+```
+creates a new branch called "staging" in pool "logs", which points to
+the same commit object as the "main" branch. Once created, commits
+to the "staging" branch will be added to the commit history without
+affecting the "main" branch and each branch can be queried independently
+at any time.
+
+Supposing the `main` branch of `logs` was already the working branch,
+then you could create the new branch called "staging" by simply saying
+```
+zed branch staging
+```
+Likewise, you can delete a branch with `-d`:
+```
+zed branch -d staging
+```
+and list the branches as follows:
+```
+zed branch
+```
+
+### Create
+```
+zed create [-orderby key[,key...][:asc|:desc]] <name>
+```
+The `create` command creates a new data pool with the given name,
+which may be any valid UTF-8 string.
+
+The `-orderby` option indicates the pool key that is used to sort
+the data in the lake, which may be in ascending or descending order.
+
+If a pool key is not specified, then it defaults to
+the [special value `this`](../language/dataflow-model.md#the-special-value-this).
+
+A newly created pool is initialized with a branch called `main`.
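+
+For example, this sketch creates a pool named `logs` (an illustrative name)
+whose pool key is a `ts` field sorted in descending order:
+```
+zed create -orderby ts:desc logs
+```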
+
+> Zed lakes can be used without thinking about branches. When referencing a pool without
+> a branch, the tooling presumes the "main" branch as the default, and everything
+> can be done on main without having to think about branching.
+
+### Delete
+```
+zed delete [options] <id> [<id> ...]
+zed delete [options] -where <filter>
+```
+The `delete` command removes one or more data objects indicated by their ID from a pool.
+This command
+simply removes the data from the branch without actually deleting the
+underlying data objects, thereby allowing time travel to work in the face
+of deletes. Permanent deletion of underlying data objects is handled by the
+separate [`vacuum`](#vacuum) command.
+
+If the `-where` flag is specified, delete will remove all values for which the
+provided filter expression is true. The value provided to `-where` must be a
+single filter expression, e.g.:
+
+```
+zed delete -where 'ts > 2022-10-05T17:20:00Z and ts < 2022-10-05T17:21:00Z'
+```
+
+### Drop
+```
+zed drop [options] <name>|<id>
+```
+The `drop` command deletes a pool and all of its constituent data.
+As this is a DANGER ZONE command, you must confirm that you want to delete
+the pool to proceed. The `-f` option can be used to force the deletion
+without confirmation.
+
+### Init
+```
+zed init [path]
+```
+A new lake is initialized with the `init` command. The `path` argument
+is a [storage path](#storage-layer) and is optional. If not present, the path
+is [determined automatically](#locating-the-lake).
+
+If the lake already exists, `init` reports an error and does nothing.
+
+Otherwise, the `init` command writes the initial cloud objects to the
+storage path to create a new, empty lake at the specified path.
+
+### Load
+```
+zed load [options] input [input ...]
+```
+The `load` command commits new data to a branch of a pool.
+
+Run `zed load -h` for a list of command-line options.
+
+Note that there is no need to define a schema or insert data into
+a "table" as all Zed data is _self describing_ and can be queried in a
+schema-agnostic fashion. Data of any _shape_ can be stored in any pool
+and arbitrary data _shapes_ can coexist side by side.
+
+As with `zq`,
+the [input arguments](zq.md#usage) can be in
+any [supported format](zq.md#input-formats) and
+the input format is auto-detected if `-i` is not provided. Likewise,
+the inputs may be URLs, in which case the `load` command streams
+the data from a Web server or [S3](../integrations/amazon-s3.md) into the lake.
+
+When data is loaded, it is broken up into objects of a target size determined
+by the pool's `threshold` parameter (which defaults to 500MiB but can be configured
+when the pool is created). Each object is sorted by the pool key but
+a sequence of objects is not guaranteed to be globally sorted. When lots
+of small or unsorted commits occur, data can be fragmented. The performance
+impact of fragmentation can be eliminated by regularly [compacting](#manage)
+pools.
+
+For example, this command
+```
+zed load sample1.json sample2.zng sample3.zson
+```
+loads files of varying formats in a single commit to the working branch.
+
+An alternative branch may be specified with a branch reference with the
+`-use` option, i.e., `<pool>@<branch>`.
Supposing a branch
+called `live` existed, data can be committed into this branch as follows:
+```
+zed load -use logs@live sample.zng
+```
+Or, as mentioned above, you can set the default branch for the load command
+via `use`:
+```
+zed use logs@live
+zed load sample.zng
+```
+During a `load` operation, a commit is broken out into units called _data objects_
+where a target object size is configured into the pool,
+typically 100MB-1GB. The records within each object are sorted by the pool key.
+A data object is presumed by the implementation
+to fit into the memory of an intake worker node
+so that such a sort can be trivially accomplished.
+
+Data added to a pool can arrive in any order with respect to the pool key.
+While each object is sorted before it is written,
+the collection of objects is generally not sorted.
+
+Each load operation creates a single commit object, which includes:
+* an author and message string,
+* a timestamp computed by the server, and
+* an optional metadata field of any Zed type expressed as a ZSON value.
+This data has the Zed type signature:
+```
+{
+    author: string,
+    date: time,
+    message: string,
+    meta: <any>
+}
+```
+where `<any>` is the type of any optionally attached metadata.
+For example, this command sets the `author` and `message` fields:
+```
+zed load -user user@example.com -message "new version of prod dataset" ...
+```
+If these fields are not specified, then the Zed system will fill them in
+with the user obtained from the session and a message that is descriptive
+of the action.
+
+The `date` field here is used by the Zed lake system to do time travel
+through the branch and pool history, allowing you to see the state of
+branches at any time in their commit history.
+
+Arbitrary metadata expressed as any [ZSON value](../formats/zson.md)
+may be attached to a commit via the `-meta` flag. This allows an application
+or user to transactionally commit metadata alongside committed data for any
+purpose. This approach allows external applications to implement arbitrary
+data provenance and audit capabilities by embedding custom metadata in the
+commit history.
+
+Since commit objects are stored as Zed, the metadata can easily be
+queried by running `zed log -f zng` to retrieve the log in ZNG format,
+for example, and using [`zq`](zq.md) to pull the metadata out
+as in:
+```
+zed log -f zng | zq 'has(meta) | yield {id,meta}' -
+```
+
+### Log
+```
+zed log [options] [commitish]
+```
+The `log` command, like `git log`, displays a history of the commit objects
+starting from any commit, expressed as a [commitish](#commitish). If no argument is
+given, the tip of the working branch is used.
+
+Run `zed log -h` for a list of command-line options.
+
+To understand the log contents, the `load` operation is actually
+decomposed into two steps under the covers:
+an "add" step stores one or more
+new immutable data objects in the lake and a "commit" step
+materializes the objects into a branch with an ACID transaction.
+This updates the branch pointer to point at a new commit object
+referencing the data objects where the new commit object's parent
+points at the branch's previous commit object, thus forming a path
+through the object tree.
+
+The `log` command prints the commit ID of each commit object in that path
+from the current pointer back through history to the first commit object.
+
+A commit object includes
+an optional author and message, along with a required timestamp,
+which are stored in the commit journal for reference.
These values may
+be specified as options to the `load` command, and are also available in the
+API for automation.
+
+> Note that the branchlog meta-query source is not yet implemented.
+
+### Manage
+```
+zed manage [options]
+```
+The `manage` command performs maintenance tasks on a lake.
+
+Currently the only supported task is _compaction_, which reduces fragmentation
+by reading data objects in a pool and writing their contents back to large,
+non-overlapping objects.
+
+If the `-monitor` option is specified and the lake is [located](#locating-the-lake)
+via network connection, `zed manage` will run continuously and perform updates
+as needed. By default, a check is performed once per minute to determine if
+updates are necessary. The `-interval` option may be used to specify an
+alternate check frequency in [duration format](../formats/zson.md#23-primitive-values).
+
+If `-monitor` is not specified, a single maintenance pass is performed on the
+lake.
+
+The output from `manage` provides a per-pool summary of the maintenance
+performed, including a count of `objects_compacted`.
+
+### Merge
+
+Data is merged from one branch into another with the `merge` command, e.g.,
+```
+zed merge -use logs@updates main
+```
+where the `updates` branch is being merged into the `main` branch
+within the `logs` pool.
+
+A merge operation finds a common ancestor in the commit history, then
+computes the set of changes needed for the target branch to reflect the
+data additions and deletions in the source branch.
+While the merge operation is performed, data can still be written concurrently
+to both branches and queries can be performed, and everything remains transactionally
+consistent. Newly written data remains in the
+branch while all of the data present at merge initiation is merged into the
+parent.
+
+This Git-like behavior for a data lake provides a clean solution to
+the live ingest problem.
+For example, data can be continuously ingested into a branch of main called `live`
+and orchestration logic can periodically merge updates from branch `live` to
+branch `main`, possibly [compacting](#manage) data after the merge
+according to configured policies and logic.
+
+### Query
+```
+zed query [options] <query>
+```
+The `query` command runs a Zed program with data from a lake as input.
+A query typically begins with a [from operator](../language/operators/from.md)
+indicating the pool and branch to use as input. If `from` is not present, then the
+query reads from the working branch.
+
+The pool/branch names are specified with `from` at the beginning of the Zed
+query.
+
+As with `zq`, the default output format is ZSON for
+terminals and ZNG otherwise, though this can be overridden with
+`-f` to specify one of the various supported output formats.
+
+If a pool name is provided to `from` without a branch name, then branch
+"main" is assumed.
+
+This example reads every record from the full key range of the `logs` pool
+and sends the results to stdout.
+
+```
+zed query 'from logs'
+```
+
+We can narrow the span of the query by specifying a filter on the pool key:
+```
+zed query 'from logs | ts >= 2018-03-24T17:36:30.090758Z and ts <= 2018-03-24T17:36:30.090766Z'
+```
+Filters on pool keys are efficiently implemented as the data is laid out
+according to the pool key and seek indexes keyed by the pool key
+are computed for each data object.
+
+Lake queries can also refer to HEAD (i.e., the branch context set in the most
+recent `use` command) either implicitly by omitting the `from` operator:
+```
+zed query '*'
+```
+or by referencing `HEAD`:
+```
+zed query 'from HEAD'
+```
+
+When querying data with the ZNG output format,
+output from a pool can be easily piped to other commands like `zq`, e.g.,
+```
+zed query -f zng 'from logs' | zq -f table 'count() by field' -
+```
+Of course, it's even more efficient to run the query inside of the pool traversal
+like this:
+```
+zed query -f table 'from logs | count() by field'
+```
+By default, the `query` command scans pool data in pool-key order though
+the Zed optimizer may, in general, reorder the scan to optimize searches,
+aggregations, and joins.
+An order hint can be supplied to the `query` command to indicate to
+the optimizer the desired processing order, but in general, `sort` operators
+should be used to guarantee any particular sort order.
+
+Arbitrarily complex Zed queries can be executed over the lake in this fashion
+and the planner can utilize cloud resources to parallelize and scale the
+query over many parallel workers that simultaneously access the Zed lake data in
+shared cloud storage (while also accessing locally- or cluster-cached copies of data).
+
+#### Meta-queries
+
+Commit history, metadata about data objects, lake and pool configuration,
+etc. can all be queried and
+returned as Zed data, which, in turn, can be fed into Zed analytics.
+This allows a very powerful approach to introspecting the structure of a
+lake, making it easy to measure, tune, and adjust lake parameters to
+optimize layout for performance.
+
+These structures are introspected using meta-queries that simply
+specify a metadata source using an extended syntax in the `from` operator.
+There are three types of meta-queries:
+* `from :<meta>` - lake level
+* `from pool:<meta>` - pool level
+* `from pool[@<branch>]:<meta>` - branch level
+
+`<meta>` is the name of the metadata being queried. The available metadata
+sources vary based on level.
+
+For example, a list of pools with configuration data can be obtained
+in the ZSON format as follows:
+```
+zed query -Z "from :pools"
+```
+This meta-query produces a list of branches in a pool called `logs`:
+```
+zed query -Z "from logs:branches"
+```
+Since this is all just Zed, you can filter the results just like any query,
+e.g., to look for a particular branch:
+```
+zed query -Z "from logs:branches | branch.name=='main'"
+```
+
+This meta-query produces a list of the data objects in the `live` branch
+of pool `logs`:
+```
+zed query -Z "from logs@live:objects"
+```
+
+You can also pretty-print in human-readable form most of the metadata Zed records
+using the "lake" format, e.g.,
+```
+zed query -f lake "from logs@live:objects"
+```
+
+The `main` branch is queried by default if an explicit branch is not specified,
+e.g.,
+
+```
+zed query -f lake "from logs:objects"
+```
+
+### Rename
+```
+zed rename <existing> <new-name>
+```
+The `rename` command assigns a new name `<new-name>` to an existing
+pool `<existing>`, which may be referenced by its ID or its previous name.
+
+### Serve
+```
+zed serve [options]
+```
+The `serve` command implements Zed's server personality to service requests
+from instances of Zed's client [personality](#zed-command-personalities).
+It listens for Zed lake API requests on the interface and port
+specified by the `-l` option, executes the requests, and returns results.
+
+The `-log.level` option controls log verbosity.
Available levels, ordered
+from most to least verbose, are `debug`, `info` (the default), `warn`,
+`error`, `dpanic`, `panic`, and `fatal`. If the volume of logging output at
+the default `info` level seems excessive for production use, the `warn` level
+is recommended.
+
+### Use
+```
+zed use [<commitish>]
+```
+The `use` command sets the working branch to the indicated commitish.
+When run with no argument, it displays the working branch and [lake](#locating-the-lake).
+
+For example,
+```
+zed use logs
+```
+provides a "pool-only" commitish that sets the working branch to `logs@main`.
+
+If a `@branch` or commit ID is given without a pool prefix, then the pool of
+the commitish previously in use is presumed. For example, if you are on
+`logs@main` and run this command:
+```
+zed use @test
+```
+then the working branch is set to `logs@test`.
+
+To specify a branch in another pool, simply prepend
+the pool name to the desired branch:
+```
+zed use otherpool@otherbranch
+```
+This command stores the working branch in `$HOME/.zed_head`.
+
+### Vacuum
+```
+zed vacuum [options]
+```
+
+The `vacuum` command permanently removes underlying data objects that have
+previously been subject to a [`delete`](#delete) operation. As this is a
+DANGER ZONE command, you must confirm that you want to remove
+the objects to proceed. The `-f` option can be used to force removal
+without confirmation. The `-dryrun` option may also be used to see a summary
+of how many objects would be removed by a `vacuum` but without removing them.
diff --git a/versioned_docs/version-v1.15.0/commands/zq.md b/versioned_docs/version-v1.15.0/commands/zq.md
new file mode 100644
index 00000000..2d4ed914
--- /dev/null
+++ b/versioned_docs/version-v1.15.0/commands/zq.md
@@ -0,0 +1,695 @@
+---
+sidebar_position: 1
+sidebar_label: zq
+description: A command-line tool that uses the Zed Language for pipeline-style search and analytics.
+---
+
+# zq
+
+> **TL;DR** `zq` is a command-line tool that uses the [Zed language](../language/README.md)
+for pipeline-style search and analytics. `zq` can query a variety
+of data formats in files, over HTTP, or in [S3](../integrations/amazon-s3.md) storage.
+It is particularly fast when operating on data in the Zed-native [ZNG](../formats/zng.md) format.
+>
+> The `zq` design philosophy blends the query/search-tool approach
+of `jq`, `awk`, and `grep` with the command-line, embedded database approach
+of `sqlite` and `duckdb`.
+
+## Usage
+
+```
+zq [ options ] [ query ] input [ input ... ]
+zq [ options ] query
+```
+`zq` is a command-line tool for processing data in diverse input
+formats, providing search, analytics, and extensive transformations
+using the [Zed language](../language/README.md). A query typically applies Boolean logic
+or keyword search to filter the input, then transforms or analyzes
+the filtered stream. Output is written to one or more files or to
+standard output.
+
+Each `input` argument must be a file path, an HTTP or HTTPS URL,
+an S3 URL, or standard input specified with `-`.
+
+For built-in command help and a listing of all available options,
+simply run `zq` with no arguments.
+
+`zq` supports a [number of formats](#input-formats) but [ZNG](../formats/zng.md)
+tends to be the most space-efficient and most performant.
ZNG has efficiency similar to
+[Avro](https://avro.apache.org/docs/current/spec.html)
+and [Protocol Buffers](https://developers.google.com/protocol-buffers)
+but its comprehensive [Zed type system](../formats/zed.md) obviates
+the need for schema specification or registries.
+Also, the ZSON format is human-readable and entirely one-to-one with ZNG
+so there is no need to represent non-readable formats like Avro or Protocol Buffers
+in a clunky JSON-encapsulated form.
+
+`zq` typically operates on ZNG-encoded data and when you want to inspect
+human-readable bits of output, you merely format it as ZSON, which is the
+default format when output is directed to the terminal. ZNG is the default
+when redirecting to a non-terminal output like a file or pipe.
+
+When run with input arguments, each input's format is automatically inferred
+([as described below](#auto-detection)) and each input is scanned
+in the order it appears on the command line, forming the input stream.
+
+A query expressed in the [Zed language](../language/README.md)
+may be optionally specified and applied to the input stream.
+
+If no query is specified, the inputs are scanned without modification
+and output in the desired format as described below. This latter approach
+provides a convenient means to convert files from one format to another.
+
+To determine whether the first argument is a query or an input,
+`zq` checks the local file system for the existence of a file by that name
+or whether the name is a URL.
+If no such file or URL exists, it attempts to parse the text as a Zed program.
+If both checks fail, then an error is reported and `zq` exits.
+This heuristic is convenient but can result in a rare surprise when a simple
+Zed query (like a keyword search) happens to correspond with a file of the
+same name in the local directory.
+
+When `zq` is run with a query and no input arguments, then the query must
+begin with
+* a [from, file, or get operator](../language/operators/from.md), or
+* an explicit or implied [yield operator](../language/operators/yield.md).
+
+In the case of a `yield` with no inputs, the query is run with
+a single input value of `null`. This provides a convenient means to run in a
+"calculator mode" where input is produced by the yield and can be operated upon
+by the Zed query, e.g.,
+```mdtest-command
+zq -z '1+1'
+```
+emits
+```mdtest-output
+2
+```
+Note here that the query `1+1` [implies](../language/dataflow-model.md#implied-operators)
+`yield 1+1`.
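+
+Returning to the conversion use case mentioned above, here is a minimal
+sketch (assuming a local file `sample.json` exists) that converts JSON to
+ZNG with no query at all, relying on the ZNG default for non-terminal output:
+```
+zq sample.json > sample.zng
+```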
+
+## Input Formats
+
+`zq` currently supports the following input formats:
+
+| Option    | Auto | Specification                            |
+|-----------|------|------------------------------------------|
+| `arrows`  | yes  | [Arrow IPC Stream Format](https://arrow.apache.org/docs/format/Columnar.html#ipc-streaming-format) |
+| `json`    | yes  | [JSON RFC 8259](https://www.rfc-editor.org/rfc/rfc8259.html) |
+| `csv`     | yes  | [CSV RFC 4180](https://www.rfc-editor.org/rfc/rfc4180.html) |
+| `line`    | no   | One string value per input line |
+| `parquet` | yes  | [Apache Parquet](https://github.com/apache/parquet-format) |
+| `tsv`     | yes  | [TSV - Tab-Separated Values](https://en.wikipedia.org/wiki/Tab-separated_values) |
+| `vng`     | yes  | [VNG - Binary Columnar Format](../formats/vng.md) |
+| `zson`    | yes  | [ZSON - Human-readable Format](../formats/zson.md) |
+| `zng`     | yes  | [ZNG - Binary Row Format](../formats/zng.md) |
+| `zjson`   | yes  | [ZJSON - Zed over JSON](../formats/zjson.md) |
+| `zeek`    | yes  | [Zeek Logs](https://docs.zeek.org/en/master/logs/index.html) |
+
+The input format is typically detected automatically and the formats for which
+`Auto` is `yes` in the table above support _auto-detection_.
+Formats without auto-detection require the `-i` option.
+
+### Hard-wired Input Format
+
+The input format is specified with the `-i` flag.
+
+When `-i` is specified, all of the inputs on the command-line must be
+in the indicated format.
+
+### Auto-detection
+
+When using _auto-detection_, each input's format is independently determined
+so it is possible to easily blend different input formats into a unified
+output format.
+
+For example, suppose this content is in a file `sample.csv`:
+```mdtest-input sample.csv
+a,b
+1,foo
+2,bar
+```
+and this content is in `sample.json`
+```mdtest-input sample.json
+{"a":3,"b":"baz"}
+```
+then the command
+```mdtest-command
+zq -z sample.csv sample.json
+```
+would produce this output in the default ZSON format
+```mdtest-output
+{a:1.,b:"foo"}
+{a:2.,b:"bar"}
+{a:3,b:"baz"}
+```
+
+### ZSON-JSON Auto-detection
+
+Since ZSON is a superset of JSON, `zq` must be careful about whether it
+interprets input as ZSON or JSON. While you can always clarify your intent
+with `-i zson` or `-i json`, `zq` attempts to "just do the right thing"
+when you run it with JSON vs. ZSON.
+
+While `zq` can parse any JSON using its built-in ZSON parser, this is typically
+not desirable because (1) the ZSON parser is not particularly performant and
+(2) all JSON numbers are floating point but the ZSON parser will parse any
+number that appears without a decimal point as an integer type.
+
+> The reason `zq` is not particularly performant for ZSON is that the ZNG or
+> VNG formats are semantically equivalent to ZSON but much more efficient and
+> the design intent is that these efficient binary formats should be used in
+> use cases where performance matters. ZSON is typically used only when
+> data needs to be human-readable in interactive settings or in automated tests.
+
+To this end, `zq` uses a heuristic to select between ZSON and JSON when the
+`-i` option is not specified. Specifically, JSON is selected when the first values
+of the input are parsable as valid JSON and include a JSON object either
+as an outer object or as a value nested somewhere within a JSON array.
+
+This heuristic almost always works in practice because ZSON records
+typically omit quotes around field names.
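+
+When the heuristic doesn't do what you want, you can pin the parser with `-i`.
+As a sketch of the difference (the input here is illustrative), these two
+commands parse the same text as JSON and as ZSON, respectively:
+```
+echo '{"x":1}' | zq -z -i json -
+echo '{"x":1}' | zq -z -i zson -
+```
+The first would print `{x:1.}` since JSON numbers are floating point, while
+the second would print `{x:1}` since a number without a decimal point parses
+as an integer in ZSON.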
+
+## Output Formats
+
+The output format defaults to either ZSON or ZNG and may be specified
+with the `-f` option. The supported output formats include all of
+the input formats along with text and table formats, which are useful
+for displaying data. (They do not capture all the information required
+to reconstruct the original data so they are not supported input formats.)
+
+Since ZSON is a common format choice, the `-z` flag is a shortcut for
+`-f zson`. Also, `-Z` is a shortcut for `-f zson` with `-pretty 4` as
+described below.
+
+And since JSON is another common format choice, the `-j` flag is a shortcut for
+`-f json`.
+
+### Output Format Selection
+
+When the format is not specified with `-f`, it defaults to ZSON if the output
+is a terminal and to ZNG otherwise.
+
+While this can cause an occasional surprise (e.g., forgetting `-f` or `-z`
+in a scripted test that works fine on the command line but fails in CI),
+we felt that the design of having a uniform default had worse consequences:
+* If the default format were ZSON, it would be very easy to create pipelines
+and deploy to production systems that were accidentally using ZSON instead of
+the much more efficient ZNG format because the `-f zng` had been mistakenly
+omitted from some command. The beauty of Zed is that all of this "just works"
+but it would otherwise perform poorly.
+* If the default format were ZNG, then users would be endlessly annoyed by
+binary output to their terminal when forgetting to type `-f zson`.
+
+In practice, we have found that the output defaults
+"just do the right thing" almost all of the time.
+
+### ZSON Pretty Printing
+
+ZSON text may be "pretty printed" with the `-pretty` option, which takes
+the number of spaces to use for indentation. As this is a common option,
+the `-Z` option is a shortcut for `-f zson -pretty 4`.
+
+For example,
+```mdtest-command
+echo '{a:{b:1,c:[1,2]},d:"foo"}' | zq -Z -
+```
+produces
+```mdtest-output
+{
+    a: {
+        b: 1,
+        c: [
+            1,
+            2
+        ]
+    },
+    d: "foo"
+}
+```
+and
+```mdtest-command
+echo '{a:{b:1,c:[1,2]},d:"foo"}' | zq -f zson -pretty 2 -
+```
+produces
+```mdtest-output
+{
+  a: {
+    b: 1,
+    c: [
+      1,
+      2
+    ]
+  },
+  d: "foo"
+}
+```
+
+When pretty printing, colorization is enabled by default when writing to a terminal,
+and can be disabled with `-color false`.
+
+### Pipeline-friendly ZNG
+
+Though it's a compressed binary format, ZNG data is self-describing and stream-oriented
+and thus is pipeline friendly.
+
+Since data is self-describing, you can simply take ZNG output
+of one command and pipe it to the input of another. It doesn't matter if the value
+sequence is scalars, complex types, or records. There is no need to declare
+or register schemas or "protos" with the downstream entities.
+
+In particular, ZNG data can simply be concatenated together, e.g.,
+```mdtest-command
+zq -f zng 'yield 1,[1,2,3]' > a.zng
+zq -f zng 'yield {s:"hello"},{s:"world"}' > b.zng
+cat a.zng b.zng | zq -z -
+```
+produces
+```mdtest-output
+1
+[1,2,3]
+{s:"hello"}
+{s:"world"}
+```
+And while this ZSON output is human readable, the ZNG files are binary, e.g.,
+```mdtest-command
+zq -f zng 'yield 1,[1,2,3]' > a.zng
+hexdump -C a.zng
+```
+produces
+```mdtest-output
+00000000  02 00 01 09 1b 00 09 02  02 1e 07 02 02 02 04 02  |................|
+00000010  06 ff                                             |..|
+00000012
+```
+
+### Schema-rigid Outputs
+
+Certain data formats like Arrow and Parquet are "schema rigid" in the sense that
+they require a schema to be defined before values can be written into the file
+and all the values in the file must conform to this schema.
+
+Zed, however, has a fine-grained type system instead of schemas and a sequence
+of data values is completely self-describing and may be heterogeneous in nature.
+This creates a challenge converting the type-flexible Zed formats to a schema-rigid
+format like Arrow or Parquet.
+
+For example, this seemingly simple conversion:
+```mdtest-command fails
+echo '{x:1}{s:"hello"}' | zq -o out.parquet -f parquet -
+```
+causes this error
+```mdtest-output
+parquetio: encountered multiple types (consider 'fuse'): {x:int64} and {s:string}
+```
+
+#### Fusing Schemas
+
+As suggested by the error above, the Zed `fuse` operator can merge different record
+types into a blended type, e.g., here we create the file and read it back:
+```mdtest-command
+echo '{x:1}{s:"hello"}' | zq -o out.parquet -f parquet fuse -
+zq -z out.parquet
+```
+but the data was necessarily changed (by inserting nulls):
+```mdtest-output
+{x:1,s:null(string)}
+{x:null(int64),s:"hello"}
+```
+
+#### Splitting Schemas
+
+Another common approach to dealing with the schema-rigid limitation of Arrow and
+Parquet is to create a separate file for each schema.
+
+`zq` can do this too with the `-split` option, which specifies a path
+to a directory for the output files. If the path is `.`, then files
+are written to the current directory.
+
+The files are named using the `-o` option as a prefix and the suffix is
+`-<n>.<ext>` where `<ext>` is determined from the output format and
+where `<n>` is a unique integer for each distinct output file.
+
+For example, the example above would produce two output files,
+which can then be read separately to reproduce the original data, e.g.,
+```mdtest-command
+echo '{x:1}{s:"hello"}' | zq -o out -split . -f parquet -
+zq -z out-*.parquet
+```
+produces the original data
+```mdtest-output
+{x:1}
+{s:"hello"}
+```
+
+While the `-split` option is most useful for schema-rigid formats, it can
+be used with any output format.
+
+## Query Debugging
+
+If you are ever stumped about how the `zq` compiler is parsing your query,
+you can always run `zq -C` to compile and display your query in canonical form
+without running it.
+This can be especially handy when you are learning the language and
+[its shortcuts](../language/dataflow-model.md#implied-operators).
+
+For example, this query
+```mdtest-command
+zq -C 'has(foo)'
+```
+is an implied [where operator](../language/operators/where.md), which matches values
+that have a field `foo`, i.e.,
+```mdtest-output
+where has(foo)
+```
+while this query
+```mdtest-command
+zq -C 'a:=x+1'
+```
+is an implied [put operator](../language/operators/put.md), which creates a new field `a`
+with the value `x+1`, i.e.,
+```mdtest-output
+put a:=x+1
+```
+
+## Error Handling
+
+Fatal errors like "file not found" or "file system full" are reported
+as soon as they happen and cause the `zq` process to exit.
+
+On the other hand,
+runtime errors resulting from the Zed query itself
+do not halt execution. Instead, these error conditions produce
+[first-class Zed errors](../language/data-types.md#first-class-errors)
+in the data output stream interleaved with any valid results.
+Such errors are easily queried with the
+[is_error function](../language/functions/is_error.md).
+
+This approach provides a robust technique for debugging complex query pipelines,
+where errors can be wrapped in one another providing stack-trace-like debugging
+output alongside the output data. This approach has emerged as a more powerful
+alternative to the traditional technique of looking through logs for errors
+or trying to debug a halted program with a vague error message.
+
+For example, this query
+```
+echo '1 2 0 3' | zq '10.0/this' -
+```
+produces
+```
+10.
+5.
+error("divide by zero")
+3.3333333333333335
+```
+and
+```
+echo '1 2 0 3' | zq '10.0/this' - | zq 'is_error(this)' -
+```
+produces just
+```
+error("divide by zero")
+```
+
+## Examples
+
+As you may have noticed, many examples of the [Zed language](../language/README.md)
+are illustrated using this pattern
+```
+echo <values> | zq <query> -
+```
+which is used throughout the [language documentation](../language/README.md)
+and [operator reference](../language/operators/README.md).
+
+The language documentation and [tutorials directory](../tutorials/README.md)
+have many examples, but here are a few more simple `zq` use cases.
+
+_Hello, world_
+```
+echo '"hello, world"' | zq -z 'yield this' -
+```
+produces this ZSON output
+```
+"hello, world"
+```
+
+_Some values of available data types_
+```
+echo '1 1.5 [1,"foo"] |["apple","banana"]|' | zq -z 'yield this' -
+```
+produces
+```
+1
+1.5
+[1,"foo"]
+|["apple","banana"]|
+```
+_The types of various data_
+```
+echo '1 1.5 [1,"foo"] |["apple","banana"]|' | zq -z 'yield typeof(this)' -
+```
+produces
+```
+<int64>
+<float64>
+<[(int64,string)]>
+<|[string]|>
+```
+_A simple aggregation_
+```
+echo '{key:"foo",val:1}{key:"bar",val:2}{key:"foo",val:3}' | zq -z 'sum(val) by key | sort key' -
+```
+produces
+```
+{key:"bar",sum:2}
+{key:"foo",sum:4}
+```
+_Convert CSV to Zed and cast `a` to an integer from the default float_
+```
+printf "a,b\n1,foo\n2,bar\n" | zq 'a:=int64(a)' -
+```
+produces
+```
+{a:1,b:"foo"}
+{a:2,b:"bar"}
+```
+_Convert JSON to Zed and cast `a` to an integer from the default float_
+```
+echo '{"a":1,"b":"foo"}{"a":2,"b":"bar"}' | zq 'a:=int64(a)' -
+```
+produces
+```
+{a:1,b:"foo"}
+{a:2,b:"bar"}
+```
+_Make a schema-rigid Parquet file using fuse and turn it back into Zed_
+```
+echo '{a:1}{a:2}{b:3}' | zq -f parquet -o tmp.parquet fuse -
+zq -z tmp.parquet
+```
+produces
+```
+{a:1,b:null(int64)}
+{a:2,b:null(int64)}
+{a:null(int64),b:3}
+```
+
+## Performance
+
+Your mileage may vary, but many new users of `zq` are surprised by its speed
+compared to tools like `jq`, `grep`, `awk`, or `sqlite`, especially when running
+`zq` over files in the ZNG format.
+
+### Fast Pattern Matching
+
+One important technique that helps `zq` run fast is to take advantage of queries
+that involve fine-grained searches.
+
+When a query begins with a logical expression containing either a search
+or a predicate match with a constant value, and presuming the input data format
+is ZNG, then the runtime optimizes the query by performing an efficient,
+byte-oriented "pre-search" of the values required in the predicate. This pre-search
+scans the bytes that comprise a large buffer of values looking for the required
+values and, if they are not present, discards the entire buffer, since no
+individual value in that buffer could match when the required serialized
+values are absent.
+
+For example, if the Zed query is
+```
+"http error" and ipsrc==10.0.0.1 | count()
+```
+then the pre-search would look for the string "http error" and the Zed encoding
+of the IP address 10.0.0.1 and, unless both of those values are present, the
+buffer is discarded.
+
+Moreover, ZNG data is compressed and arranged into frames that can be decompressed
+and processed in parallel. This allows the decompression and pre-search to
+run in parallel very efficiently across a large number of threads. When searching
+for sparse results, many frames are discarded without their uncompressed bytes
+having to be processed any further.
+
+### Efficient JSON Processing
+
+While processing data in the ZNG format is far more efficient than JSON,
+there is substantial JSON data in the world and it is important for JSON
+input to perform well.
+
+This proved a challenge as `zq` is written in Go and Go's JSON package
+is not particularly performant. To this end, `zq` has its own lean and simple
+[JSON tokenizer](https://pkg.go.dev/github.com/brimdata/zed/pkg/jsonlexer),
+which performs quite well,
+and is
+[integrated tightly](https://github.com/brimdata/zed/blob/main/zio/jsonio/reader.go)
+with Zed's internal data representation.
+Moreover, like `jq`,
+`zq`'s JSON parser does not require objects to be newline delimited and can
+incrementally parse the input to minimize memory overhead and improve
+processor cache performance.
+
+The net effect is a JSON parser that is typically a bit faster than the
+native C implementation in `jq`.
+
+### Performance Comparisons
+
+To provide a rough sense of the performance tradeoffs between `zq` and
+other tooling, this section provides results of a few simple speed tests.
+
+#### Test Data
+
+These tests are easy to reproduce. The input data comes from the
+[Zed sample data repository](https://github.com/brimdata/zed-sample-data),
+where we used a semi-structured Zeek "conn" log from the `zeek-default` directory.
+
+It is easy to convert the Zeek logs to a local ZNG file using
+`zq`'s built-in `get` operator:
+```
+zq -o conn.zng 'get https://raw.githubusercontent.com/brimdata/zed-sample-data/main/zeek-default/conn.log.gz'
+```
+This creates a new file `conn.zng` from the Zeek log file fetched from GitHub.
+
+Note that this data is a gzip'd file in the Zeek format and `zq`'s auto-detector
+figures out both that it is gzip'd and that the uncompressed format is Zeek.
+There's no need to specify flags for this.
+
+Next, a JSON file can be converted from ZNG using:
+```
+zq -f json conn.zng > conn.json
+```
+Note that we lose information in this conversion because the rich data types
+of Zed (that were [translated from the Zeek format](../integrations/zeek/data-type-compatibility.md))
+are not representable in JSON.
+
+We'll also make a SQLite database in the file `conn.db` as the table named `conn`.
+One easy way to do this is to install
+[sqlite-utils](https://sqlite-utils.datasette.io/en/stable/)
+and run
+```
+sqlite-utils insert conn.db conn conn.json --nl
+```
+(If you need a cup of coffee, a good time to get it would be when
+loading the JSON into SQLite.)
+
+#### File Sizes
+
+Note the resulting file sizes:
+```
+% du -h conn.json conn.db conn.zng
+416M conn.json
+192M conn.db
+ 38M conn.zng
+```
+Much of the performance of ZNG derives from an efficient, parallelizable
+structure where frames of data are compressed
+(currently with [LZ4](http://lz4.github.io/lz4/) though the
+specification supports multiple algorithms) and the sequence of values
+can be processed with only partial deserialization.
+
+That said, there are quite a few more opportunities to further improve
+the performance of `zq` and the Zed system and we have a number of projects
+forthcoming on this front.
+
+#### Tests
+
+We ran three styles of tests on a Mac quad-core 2.3GHz i7:
+* `count` - compute the number of values present
+* `search` - find a value in a field
+* `agg` - sum a field grouped by another field
+
+Each test was run for `jq`, `zq` on JSON, `sqlite3`, and `zq` on ZNG.
+
+We used the Bash `time` command to measure elapsed time.
+
+The command lines for the `count` test were:
+```
+jq -s length conn.json
+sqlite3 conn.db 'select count(*) from conn'
+zq 'count()' conn.zng
+zq 'count()' conn.json
+```
+The command lines for the `search` test were:
+```
+jq 'select(.id.orig_h=="10.47.23.5")' conn.json
+sqlite3 conn.db 'select * from conn where json_extract(id, "$.orig_h")=="10.47.23.5"'
+zq 'id.orig_h==10.47.23.5' conn.zng
+zq 'id.orig_h==10.47.23.5' conn.json
+```
+Here, we look for an IP address (10.47.23.5) in a specific
+field `id.orig_h` in the semi-structured data. Note that when using ZNG,
+the IP is a native type whereas for `jq` and SQLite it is a string.
+Note that `sqlite` must use its `json_extract` function since nested JSON objects
+are stored as minified JSON text.
+
+The command lines for the `agg` test were:
+```
+jq -n -f agg.jq conn.json
+sqlite3 conn.db 'select sum(orig_bytes),json_extract(id, "$.orig_h") as orig_h from conn group by orig_h'
+zq "sum(orig_bytes) by id.orig_h" conn.zng
+zq "sum(orig_bytes) by id.orig_h" conn.json
+```
+where the `agg.jq` script is:
+```
+def adder(stream):
+  reduce stream as $s ({}; .[$s.key] += $s.val);
+adder(inputs | {key:.id.orig_h,val:.orig_bytes})
+| to_entries[]
+| {orig_h: (.key), sum: .value}
+```
+
+#### Results
+
+The following table summarizes the results of each test as a column and
+each tool as a row with the speed-up factor (relative to `jq`)
+shown in parentheses:
+
+|  | `count` | `search` | `agg` |
+|------|---------------|---------------|---------------|
+| `jq` | 11,540ms (1X) | 10,730ms (1X) | 20,175ms (1X) |
+| `zq-json` | 7,150ms (1.6X) | 7,230ms (1.5X) | 7,390ms (2.7X) |
+| `sqlite` | 100ms (115X) | 620ms (17X) | 1,475ms (14X) |
+| `zq-zng` | 110ms (105X) | 135ms (80X) | 475ms (42X) |
+
+To summarize, `zq` with ZNG is consistently fastest, though `sqlite`
+was a bit faster counting rows.
+
+In particular, `zq` is substantially faster (40-100X) than `jq` with the efficient
+ZNG format but more modestly faster (50-170%) when processing the bulky JSON input.
+This is expected because parsing JSON becomes the bottleneck.
+
+While SQLite is much faster than `jq`, it is not as fast as `zq`. The primary
+reason for this is that SQLite stores its semi-structured columns as minified JSON text,
+so it must scan and parse the JSON when executing the _where_ clause above
+as well as the aggregated fields.
+
+Also, note that the inferior performance of `sqlite` here is in areas where databases
+perform extraordinarily well if you do the work to
+(1) transform semi-structured columns into relational columns by flattening
+nested JSON objects (which are not indexable by `sqlite`) and
+(2) configure database indexes.
+
+In fact, if you implement these changes, `sqlite` performs better than `zq` on these tests.
+
+However, the benefit of Zed is that no flattening is required. And unlike `sqlite`,
+`zq` is not intended to be a database. That said, there is no reason why database
+performance techniques cannot be applied to the Zed model and this is precisely what the
+open-source Zed project intends to do.
+
+Stay tuned!
diff --git a/versioned_docs/version-v1.15.0/formats/README.md b/versioned_docs/version-v1.15.0/formats/README.md
new file mode 100644
index 00000000..fd33a5f2
--- /dev/null
+++ b/versioned_docs/version-v1.15.0/formats/README.md
@@ -0,0 +1,284 @@
+# Zed Formats
+
+> **TL;DR** The Zed data model defines a new and easy way to manage, store,
+> and process data utilizing an emerging concept called
+> [super-structured data](#2-zed-a-super-structured-pattern).
+> The [data model specification](zed.md) defines the high-level model that is realized
+> in a [family of interoperable serialization formats](#3-the-data-model-and-formats),
+> providing a unified approach to row, columnar, and human-readable formats.
+> Zed is a superset of both the dataframe/table model of relational systems and the
+> semi-structured model that is used ubiquitously in development as JSON and by NoSQL
+> data stores. The ZSON spec has [a few examples](zson.md#3-examples).
+
+## 1. Background
+
+Zed offers a new and improved way to think about and manage data.
+
+Modern data models are typically described in terms of their _structured-ness_:
+* _tabular-structured_, often simply called _"structured"_,
+where a specific schema is defined to describe a table and values are enumerated that conform to that schema;
+* _semi-structured_, where arbitrarily complex, hierarchical data structures
+define the data and values do not fit neatly into tables, e.g., JSON and XML; and
+* _unstructured_, where arbitrary text is formatted in accordance with
+external, often vague, rules for its interpretation.
+
+### 1.1 The Tabular-structured Pattern
+
+CSV is arguably the simplest but most frustrating format that follows the tabular-structured
+pattern. It provides a bare-bones schema consisting of the names of the columns as the
+first line of a file followed by a list of comma-separated, textual values
+whose types must be inferred from the text. The lack of a universally adopted
+specification for CSV is an all-too-common source of confusion and frustration.
+
+The traditional relational database, on the other hand,
+offers the classic, comprehensive example of the tabular-structured pattern.
+The table columns have precise names and types.
+Yet, like CSV, there is no universal standard format for relational tables.
+The [_SQLite file format_](https://sqlite.org/fileformat.html)
+is arguably the _de facto_ standard for relational data,
+but this format describes a whole, specific database --- indexes and all ---
+rather than a stand-alone table.
+
+Instead, file formats like Avro, ORC, and Parquet arose to represent tabular data
+with an explicit schema followed by a sequence of values that conform to the schema.
+While Avro and Parquet schemas can also represent semi-structured data, all of the
+values in a given Avro or Parquet file must conform to the same schema.
+The [Iceberg specification](https://iceberg.apache.org/spec/)
+defines data types and metadata schemas for how large relational tables can be
+managed as a collection of Avro, ORC, and/or Parquet files.
+
+### 1.2 The Semi-structured Pattern
+
+JSON, on the other hand, is the ubiquitous example of the semi-structured pattern.
+Each JSON value is self-describing in terms of its
+structure and types, though the JSON type system is limited.
+
+When a sequence of JSON objects is organized into a stream
+(perhaps [separated by newlines](https://en.wikipedia.org/wiki/JSON_streaming#NDJSON)),
+each value can take on any form.
+When all the values have the same form, the JSON sequence
+begins to look like a relational table, but the lack of a comprehensive type system,
+a union type, and precise semantics for columnar layout limits this interpretation.
+
+[BSON](https://bsonspec.org/)
+and [Ion](https://amzn.github.io/ion-docs/)
+were created to provide a type-rich elaboration of the
+semi-structured model of JSON along with performant binary representations,
+though there is no mechanism for precisely representing the type of
+a complex value like an object or an array other than calling it
+type "object" or type "array", e.g., as compared to "object with field `s`
+of type string" or "array of number".
+
+[JSON Schema](https://json-schema.org/)
+addresses JSON's lack of schemas with an approach to augment
+one or more JSON values with a schema definition itself expressed in JSON.
+This creates a parallel type system for JSON, which is useful and powerful in many
+contexts, but introduces schema-management complexity when simply trying to represent
+data in its natural form.
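+
+To make the parallel-type-system point concrete, here is a minimal,
+illustrative sketch (the value and schema are invented for this example):
+the JSON value
+```
+{"s":"hello","x":1}
+```
+might be described by the separate JSON Schema document
+```
+{
+  "type": "object",
+  "properties": {
+    "s": {"type": "string"},
+    "x": {"type": "integer"}
+  }
+}
+```
+where the type description lives apart from the value itself and must be
+managed alongside it, rather than being embedded in the data as in Zed.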
+
+### 1.3 The Hybrid Pattern
+
+As the utility and ease of the semi-structured design pattern emerged,
+relational system design, originally constrained by the tabular-structured
+design pattern, has embraced the semi-structured design pattern
+by adding support for semi-structured table columns.
+"Just put JSON in a column."
+
+[SQL++](https://asterixdb.apache.org/docs/0.9.7.1/sqlpp/manual.html)
+pioneered the extension of SQL to semi-structured data by
+adding support for referencing and unwinding complex, semi-structured values,
+and most modern SQL query engines have adopted variations of this model
+and have extended the relational model with a semi-structured column type.
+
+But once you have put a number of columns of JSON data into a relational
+table, is it still appropriately called "structured"?
+Instead, we call this approach the hybrid tabular-/semi-structured pattern,
+or more simply, _"the hybrid pattern"_.
+
+## 2. Zed: A Super-structured Pattern
+
+The insight in Zed is to remove the tabular and schema concepts from
+the underlying data model altogether and replace them with a granular and
+modern type system inspired by general-purpose programming languages.
+Instead of defining a single, composite schema to
+which all values must conform, the Zed type system allows each value to freely
+express its type in accordance with the type system.
+
+In this approach,
+Zed is neither tabular nor semi-structured. Zed is "super-structured".
+
+In particular, the Zed record type looks like a schema but when
+serializing Zed data, the model is very different. A Zed sequence does not
+comprise a record-type declaration followed by a sequence of
+homogeneously-typed record values, but instead,
+is a sequence of arbitrarily typed Zed values, which may or may not all
+be records.
+
+Yet when a sequence of Zed values _in fact conforms to a uniform record type_,
+then such a collection of Zed records looks precisely like a relational table.
+Here, the record type
+of such a collection corresponds to a well-defined schema consisting
+of field names (i.e., column names) where each field has a specific Zed type.
+Zed also has named types, so by simply naming a particular record type
+(i.e., a schema), a relational table can be projected from a pool of Zed data
+with a simple type query for that named type.
+
+But unlike traditional relational tables, these Zed-constructed tables can have arbitrary
+structure in each column as Zed allows the fields of a record
+to have an arbitrary type. This is very different from the hybrid pattern:
+all Zed data at all levels conforms to the same data model. Here, both the
+tabular-structured and semi-structured patterns are representable in a single model.
+Unlike the hybrid pattern, systems based on Zed have
+no need to simultaneously support two very different data models.
+
+In other words, Zed unifies the relational data model of SQL tables
+with the document model of JSON into a _super-structured_
+design pattern enabled by the Zed type system.
+An explicit, uniquely-defined type of each value precisely
+defines its entire structure, i.e., its super-structure. There is
+no need to traverse each hierarchical value --- as with JSON, BSON, or Ion ---
+to discover each value's structure.
+
+And because Zed derives its design from the vast landscape
+of existing formats and data models, it was deliberately designed to be
+a superset of --- and thus interoperable with --- a broad range of formats
+including JSON, BSON, Ion, Avro, ORC, Parquet, CSV, JSON Schema, and XML.
+
+As an example, most systems that are based on semi-structured data would
+say the JSON value
+```
+{"a":[1,"foo"]}
+```
+is of type object and the value of key `a` is type array.
+In Zed, however, this value's type is type `record` with field `a`
+of type `array` of type `union` of `int64` and `string`,
+expressed succinctly in ZSON as
+```
+{a:[(int64,string)]}
+```
+This is super-structuredness in a nutshell.
+
+### 2.1 Zed and Schemas
+
+While the Zed data model removes the schema constraint,
+the implication here is not that schemas are unimportant;
+to the contrary, schemas are foundational. Schemas not only define agreement
+and semantics between communicating entities, but also serve as the cornerstone
+for organizing and modeling data for data engineering and business intelligence.
+
+That said, schemas often create complexity in system designs
+where components might simply want to store and communicate data in some
+meaningful way. For example, an ETL pipeline should not break when upstream
+structural changes prevent data from fitting in downstream relational tables.
+Instead, the pipeline should continue to operate and the data should continue
+to land on the target system without having to fit into a predefined table,
+while also preserving its super-structure.
+
+This is precisely what Zed enables. A system layer above and outside
+the scope of the Zed data layer can decide how to adapt to the structural
+changes with or without administrative intervention.
+
+To this end, whether all the values must conform to a schema and
+how schemas are managed, revised, and enforced is all outside the scope of Zed;
+rather, the Zed data model provides a flexible and rich foundation
+for schema interpretation and management.
+
+### 2.2 Type Combinatorics
+
+A common objection to using a type system to represent schemas is that
+diverse applications generating arbitrarily structured data can produce
+a combinatorial explosion of types for each shape of data.
+
+In practice, this condition rarely arises. Applications generating
+"arbitrary" JSON data generally conform to a well-defined set of
+JSON object structures.
+
+A few rare applications carry unique data values as JSON object keys,
+though this is considered bad practice.
+
+Even so, this is all manageable in the Zed data model as types are localized
+in scope. The number of types that must be defined in a stream of values
+is linear in the input size. Since data is self-describing and there is
+no need for a global schema registry in Zed, this hypothetical problem is moot.
+
+### 2.3 Analytics Performance
+
+One might think that removing schemas from the Zed data model would conflict
+with an efficient columnar format for Zed, which is critical for
+high-performance analytics.
+After all, database
+tables and formats like Parquet and ORC all require schemas to organize values
+and then rely upon the natural mapping of schemas to columns.
+
+Super-structure, on the other hand, provides an alternative approach to columnar structure.
+Instead of defining a schema and then fitting a sequence of values into their appropriate
+columns based on the schema, Zed values self-organize into columns based on their
+super-structure.
Here columns are created dynamically as data is analyzed
+and each top-level type induces a specific set of columns. When all of the
+values have the same top-level type (i.e., like a schema), then the Zed columnar
+object is just as performant as a traditional schema-based columnar format like Parquet.
+
+### 2.4 First-class Types
+
+With first-class types, any type can also be a value, which means that in
+a properly designed query and analytics system based on Zed, a type can appear
+anywhere that a value can appear. In particular, types can be aggregation keys.
+
+This is very powerful for data discovery and introspection. For example,
+to count the different shapes of data, you might have a SQL-like query,
+operating on each input value as `this`, that has the form:
+```
+  SELECT count(), typeof(this) as shape GROUP by shape, count
+```
+Likewise, you could select a sample value of each shape like this:
+```
+  SELECT shape FROM (
+    SELECT any(this) as sample, typeof(this) as shape GROUP by shape,sample
+  )
+```
+The Zed language is exploring syntax so that such operations are tighter
+and more natural given the super-structure of Zed. For example, the above
+two SQL-like queries could be written as:
+```
+  count() by shape:=typeof(this)
+  any(this) by typeof(this) | cut any
+```
+
+### 2.5 First-class Errors
+
+In SQL-based systems, errors typically
+result in cryptic messages or null values, offering little insight into the
+actual cause of the error.
+
+Zed, however, includes first-class errors. When combined with the super-structured
+data model, error values may appear anywhere in the output and operators
+can propagate or easily wrap errors so complicated analytics pipelines
+can be debugged by observing the location of errors in the output results.
+
+## 3. The Data Model and Formats
+
+The concept of super-structured data and first-class types and errors
+is solidified in the [Zed data model specification](zed.md),
+which defines the model but not the serialization formats.
+
+A set of companion documents define a family of tightly integrated
+serialization formats that all adhere to the same Zed data model,
+providing a unified approach to row, columnar, and human-readable formats:
+
+* [ZSON](zson.md) is a JSON-like, human-readable format for Zed data. All JSON
+documents are Zed values as the ZSON format is a strict superset of the JSON syntax.
+* [ZNG](zng.md) is a row-based, binary representation of Zed data somewhat like
+Avro but with Zed's more general model to represent a sequence of arbitrarily-typed
+values.
+* [VNG](vng.md) is a columnar version of ZNG like Parquet or ORC but also
+embodies Zed's more general model for heterogeneous and self-describing schemas.
+* [Zed over JSON](zjson.md) defines a JSON format for encapsulating Zed data
+in JSON for easy decoding by JSON-based clients, e.g.,
+the [zed-js JavaScript library](https://github.com/brimdata/zui/tree/main/packages/zed-js)
+and the [Zed Python library](../libraries/python.md).
+
+Because all of the formats conform to the same Zed data model, conversions between
+a human-readable form, a row-based binary form, and a columnar form can
+be trivially carried out with no loss of information. This is the best of both worlds:
+the same data can be easily expressed in a human-friendly, easy-to-program text form
+and converted to and from efficient row and columnar formats.
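+
+As a hedged sketch of such a conversion round trip using the `zq` CLI
+(the file names here are hypothetical and auto-detection of the VNG input
+is assumed):
+```
+echo '{a:1,b:"foo"}' | zq -f zng - > example.zng
+zq -f vng example.zng > example.vng
+zq -z example.vng
+```
+which should print the original value
+```
+{a:1,b:"foo"}
+```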
diff --git a/versioned_docs/version-v1.15.0/formats/_category_.yaml b/versioned_docs/version-v1.15.0/formats/_category_.yaml
new file mode 100644
index 00000000..6464e66a
--- /dev/null
+++ b/versioned_docs/version-v1.15.0/formats/_category_.yaml
@@ -0,0 +1,2 @@
+position: 6
+label: Formats
diff --git a/versioned_docs/version-v1.15.0/formats/compression.md b/versioned_docs/version-v1.15.0/formats/compression.md
new file mode 100644
index 00000000..a478e5c1
--- /dev/null
+++ b/versioned_docs/version-v1.15.0/formats/compression.md
@@ -0,0 +1,17 @@
+---
+sidebar_position: 6
+sidebar_label: Compression
+---
+
+# ZNG Compression Types
+
+This document specifies values for the `<format>` byte of a
+[ZNG compressed value message block](zng.md#2-the-zng-format)
+and the corresponding algorithms for the `<compressed payload>` byte sequence.
+
+As new compression algorithms are specified, they will be documented
+here without any need to change the ZNG specification.
+
+Of the 256 possible values for the `<format>` byte, only type `0` is currently
+defined and specifies that `<compressed payload>` contains an
+[LZ4 block](https://github.com/lz4/lz4/blob/master/doc/lz4_Block_format.md).
diff --git a/versioned_docs/version-v1.15.0/formats/vng.md b/versioned_docs/version-v1.15.0/formats/vng.md
new file mode 100644
index 00000000..f358bbdc
--- /dev/null
+++ b/versioned_docs/version-v1.15.0/formats/vng.md
@@ -0,0 +1,399 @@
+---
+sidebar_position: 4
+sidebar_label: VNG
+---
+
+# VNG Specification
+
+VNG, pronounced "ving", is a file format for columnar data based on
+[the Zed data model](zed.md).
+VNG is the "stacked" version of Zed, where the fields from a stream of
+Zed records are stacked into vectors that form columns.
+Its purpose is to provide for efficient analytics and search over
+bounded-length sequences of [ZNG](zng.md) data that is stored in columnar form.
+
+Like [Parquet](https://github.com/apache/parquet-format),
+VNG provides an efficient columnar representation for semi-structured data,
+but unlike Parquet, VNG is not based on schemas and does not require
+a schema to be declared when writing data to a file. Instead,
+VNG exploits the super-structured nature of Zed data: columns of data
+self-organize around their type structure.
+
+## VNG Files
+
+A VNG file encodes a bounded, ordered sequence of Zed values.
+To provide for efficient access to subsets of VNG-encoded data (e.g., columns),
+the VNG file is presumed to be accessible via random access
+(e.g., range requests to a cloud object store or seeks in a Unix file system)
+and VNG is therefore not intended as a streaming or communication format.
+
+A VNG file can be stored entirely as one storage object
+or split across separate objects that are treated
+together as a single VNG entity. While the VNG format provides much flexibility
+for how data is laid out, it is left to an implementation to lay out data
+in intelligent ways for efficient sequential read accesses of related data.
+
+## Column Streams
+
+The VNG data abstraction is built around a collection of _column streams_.
+
+There is one column stream for each top-level type encountered in the input where
+each column stream is encoded according to its type. For top-level complex types,
+the embedded elements are encoded recursively in additional column streams
+as described below. For example,
+a record column encodes a "presence" vector encoding any null value for
+each field, then encodes each non-null field recursively, whereas
+an array column encodes a "lengths" vector and encodes each
+element recursively.
+
+Values are reconstructed one by one from the column streams by picking values
+from each appropriate column stream based on the type structure of the value and
+its relationship to the various column streams. For hierarchical records
+(i.e., records inside of records, or records inside of arrays inside of records, etc.),
+the reconstruction process is recursive (as described below).
+
+## The Physical Layout
+
+The overall layout of a VNG file is comprised of the following sections,
+in this order:
+* the data section,
+* the reassembly section, and
+* the trailer.
+
+This layout allows an implementation to buffer metadata in
+memory while writing column data in a natural order to the
+data section (based on the volume statistics of each column),
+then write the metadata into the reassembly section along with the trailer
+at the end. This allows a ZNG stream to be converted to a VNG file
+in a single pass.
+
+> That said, the layout is
+> flexible enough that an implementation may optimize the data layout with
+> additional passes or by writing the output to multiple files then
+> merging them together (or even leaving the VNG entity as separate files).
+
+### The Data Section
+
+The data section contains raw data values organized into _segments_,
+where a segment is a seek offset and byte length relative to the
+data section. Each segment contains a sequence of
+[primitive-type Zed values](zed.md#1-primitive-types),
+encoded as counted-length byte sequences where the counted-length is
+variable-length encoded as in the [ZNG specification](zng.md).
+Segments may be compressed.
+
+There is no information in the data section for how segments relate
+to one another or how they are reconstructed into columns. They are just
+blobs of ZNG data.
+
+> Unlike Parquet, there is no explicit arrangement of the column chunks into
+> row groups but rather they are allowed to grow at different rates so a
+> high-volume column might comprise many segments while a low-volume
+> column might be just one or several. This allows scans of low-volume record types
+> (the "mice") to perform well amongst high-volume record types (the "elephants"),
+> i.e., there are not a bunch of seeks with tiny reads of mice data interspersed
+> throughout the elephants.
+>
+> TBD: The mice/elephants model creates an interesting and challenging layout
+> problem. If you let the row indexes get too far apart (call this "skew"), then
+> you have to buffer very large amounts of data to keep the column data aligned.
+> This is the point of row groups in Parquet, but the model here is to leave it
+> up to the implementation to do layout as it sees fit. You can also fall back
+> to doing lots of seeks and that might work perfectly fine when using SSDs but
+> this also creates interesting optimization problems when sequential reads work
+> a lot better. There could be a scan optimizer that lays out how the data is
+> read that lives under the column stream reader. Also, you can make tradeoffs:
+> if you use lots of buffering on ingest, you can write the mice in front of the
+> elephants so the read path requires less buffering to align columns. Or you can
+> do two passes where you store segments in separate files then merge them at close
+> according to an optimization plan.
+
+### The Reassembly Section
+
+The reassembly section provides the information needed to reconstruct
+column streams from segments, and in turn, to reconstruct the original Zed values
+from column streams, i.e., to map columns back to composite values.
+
+> Of course, the reassembly section also provides the ability to extract just subsets of columns
+> to be read and searched efficiently without ever needing to reconstruct
+> the original rows. How well this performs is up to any particular
+> VNG implementation.
+>
+> Also, the reassembly section is in general vastly smaller than the data section
+> so the goal here isn't to express information in cute and obscure compact forms
+> but rather to represent data in an easy-to-digest, programmer-friendly form that
+> leverages ZNG.
+
+The reassembly section is a ZNG stream. Unlike Parquet,
+which uses an externally described schema
+(via [Thrift](https://thrift.apache.org/)) to describe
+analogous data structures, we simply reuse ZNG here.
+
+#### The Super Types
+
+This reassembly stream encodes 2*N+1 Zed values, where N is equal to the number
+of top-level Zed types that are present in the encoded input.
+To simplify terminology, we call a top-level Zed type a "super type",
+e.g., there are N unique super types encoded in the VNG file.
+
+These N super types are defined by the first N values of the reassembly stream
+and are encoded as a null value of the indicated super type.
+A super type's integer position in this sequence defines its identifier
+encoded in the super column (defined below). This identifier is called
+the super ID.
+
+> Change the first N values to type values instead of nulls?
+
+The next N+1 records contain reassembly information for each of the N super types
+where each record defines the column streams needed to reconstruct the original
+Zed values.
+
+#### Segment Maps
+
+The foundation of column reconstruction is based on _segment maps_.
+A segment map is a list of the segments from the data area that are
+concatenated to form the data for a column stream.
+
+Each segment map that appears within the reassembly records is represented
+with a Zed array of records that represent seek ranges conforming to this
+type signature:
+```
+[{offset:uint64,length:uint32,mem_length:uint32,compression_format:uint8}]
+```
+
+In the rest of this document, we will refer to this type as `<segmap>` for
+shorthand and refer to the concept as a "segmap".
+
+> We use the type name "segmap" to emphasize that this information represents
+> a set of byte ranges where data is stored and must be read from *rather than*
+> the data itself.
+
+#### The Super Column
+
+The first of the N+1 reassembly records defines the "super column", where this column
+represents the sequence of super types of each original Zed value, i.e., indicating
+which super type's column stream to select from to pull column values to form
+the reconstructed value.
+The sequence of super types is defined by each type's super ID (as defined above),
+0 to N-1, within the set of N super types.
+
+The super column stream is encoded as a sequence of ZNG-encoded `int32` primitive values.
+While there are a large number of entries in the super column (one for each original row),
+the cardinality of super IDs is small in practice so this column
+will compress very significantly, e.g., in the special case that all the
+values in the VNG file have the same super ID,
+the super column will compress trivially.
+
+The reassembly map appears as the next value in the reassembly section
+and is of type `<segmap>`.
+
+#### The Reassembly Records
+
+Following the root reassembly map are N reassembly maps, one for each unique super type.
+
+Each reassembly record is a record of type `<any_column>`, as defined below,
+where each reassembly record appears in the same sequence as the original N schemas.
+Note that there is no "any" type in Zed, but rather this terminology is used
+here to refer to any of the concrete type structures that would appear
+in a given VNG file.
+
+In other words, the reassembly record of the super column
+combined with the N reassembly records collectively define the original sequence
+of Zed data values in the original order.
+Taken in pieces, the reassembly records allow efficient access to sub-ranges of the
+rows, to subsets of columns of the rows, to sub-ranges of columns of the rows, and so forth.
+
+This simple top-down arrangement, along with the definition of the other
+column structures below, is all that is needed to reconstruct all of the
+original data.
+
+> Note that each row reassembly record has its own layout of columnar
+> values and there is no attempt made to store like-typed columns from different
+> schemas in the same physical column.
+
+The notation `<any_column>` refers to any instance of the five column types:
+* `<record_column>`,
+* `<array_column>`,
+* `<union_column>`,
+* `<map_column>`, or
+* `<primitive_column>`.
+
+Note that when decoding a column, all type information is known
+from the super type in question so there is no need
+to encode the type information again in the reassembly record.
+
+#### Record Column
+
+A `<record_column>` is defined recursively in terms of the column types of
+its fields, i.e., other types that represent arrays, unions, or primitive types
+and has the form:
+```
+{
+  <fld1>:{column:<any_column>,presence:<segmap>},
+  <fld2>:{column:<any_column>,presence:<segmap>},
+  ...
+  <fldn>:{column:<any_column>,presence:<segmap>}
+}
+```
+where
+* `<fld1>` through `<fldn>` are the names of the top-level fields of the
+original row record,
+* the `column` fields are column stream definitions for each field, and
+* the `presence` columns are `int32` ZNG column streams comprised of a
+run-length encoding of the locations of column values in their respective rows,
+when there are null values (as described below).
+
+If there are no null values, then the `presence` field contains an empty `<segmap>`.
+If all of the values are null, then the `column` field is null (and the `presence`
+contains an empty `<segmap>`). For an empty `<segmap>`, there is no
+corresponding data stored in the data section. Since a `<segmap>` is a Zed
+array, an empty `<segmap>` is simply the empty array value `[]`.
+
+#### Array Column
+
+An `<array_column>` has the form:
+```
+{values:<any_column>,lengths:<segmap>}
+```
+where
+* `values` represents a continuous sequence of values of the array elements
+that are sliced into array values based on the length information, and
+* `lengths` encodes a Zed `int32` sequence of values that represent the length
+of each array value.
+
+The `<array_column>` structure is used for both Zed arrays and sets.
+
+#### Map Column
+
+A `<map_column>` has the form:
+```
+{key:<any_column>,value:<any_column>}
+```
+where
+* `key` encodes the column of map keys and
+* `value` encodes the column of map values.
+
+#### Union Column
+
+A `<union_column>` has the form:
+```
+{columns:[<any_column>],tags:<segmap>}
+```
+where
+* `columns` is an array containing the reassembly information for each tagged union value
+in the same column order implied by the union type, and
+* `tags` is a column of `int32` values where each subsequent value encodes
+the tag of the union type indicating which column the value falls within.
+
+> TBD: change code to conform to columns array instead of record{c0,c1,...}
+
+The number of times each value of `tags` appears must equal the number of values
+in each respective column.
+
+#### Primitive Column
+
+A `<primitive_column>` is a `<segmap>` that defines a column stream of
+primitive values.
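+
+As an illustrative sketch (the offset, lengths, and sizes below are invented),
+a `<primitive_column>` might be the following `<segmap>` value, describing a
+single segment at the start of the data section that decompresses from 316
+bytes to 1024 bytes using compression format `0` (LZ4):
+```
+[{offset:0(uint64),length:316(uint32),mem_length:1024(uint32),compression_format:0(uint8)}]
+```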
+
+#### Presence Columns
+
+The presence column is logically a sequence of booleans, one for each position
+in the original column, indicating whether a value is null or present.
+The number of values in the encoded column is equal to the number of values
+present so that null values are not encoded.
+
+Instead, the presence column is encoded as a sequence of alternating runs.
+First, the number of values present is encoded, then the number of values not present,
+then the number of values present, and so forth. These runs are then stored
+as Zed `int32` values in the presence column (which may be subject to further
+compression based on segment compression).
+
+### The Trailer
+
+After the reassembly section is a ZNG stream with a single record defining
+the "trailer" of the VNG file. The trailer provides a magic field
+indicating the "vng" format, a version number,
+the size of the segment threshold for decomposing segments into frames,
+the size of the skew threshold for flushing all segments to storage when
+the memory footprint roughly exceeds this threshold,
+and an array of sizes in bytes of the sections of the VNG file.
+
+The type of this record has the format
+```
+{magic:string,type:string,version:int64,sections:[int64],meta:{skew_thresh:int64,segment_thresh:int64}}
+```
+The trailer can be efficiently found by scanning backward from the end of the
+VNG file to find a valid ZNG stream containing a single record value
+conforming to the above type.
+
+## Decoding
+
+To decode an entire VNG file into rows, the trailer is read to find the sizes
+of the sections, then the ZNG stream of the reassembly section is read,
+typically in its entirety.
+
+Since this data structure is relatively small compared to all of the columnar
+data in the VNG file,
+it will typically fit comfortably in memory and it can be very fast to scan the
+entire reassembly structure for any purpose.
+
+> For example, for a given query, a "scan planner" could traverse all the
+> reassembly records to figure out which segments will be needed, then construct
+> an intelligent plan for reading the needed segments and attempt to read them
+> in mostly sequential order, which could serve as
+> an optimizing intermediary between any underlying storage API and the
+> VNG decoding logic.
+
+To decode the "next" row, its schema index is read from the root reassembly
+column stream.
+
+This schema index then determines which reassembly record to fetch
+column values from.
+
+The top-level reassembly fetches column values as a `<record_column>`.
+
+For any `<record_column>`, a value from each field is read from each field's column,
+accounting for the presence column indicating null,
+and the results are encoded into the corresponding ZNG record value using
+ZNG type information from the corresponding schema.
+
+For a `<primitive_column>`, a value is determined by reading the next
+value from its segmap.
+
+For an `<array_column>`, a length is read from its `lengths` segmap as an `int32`
+and that many values are read from its `values` sub-column,
+encoding the result as a ZNG array value.
+
+For a `<union_column>`, a value is read from its `tags` segmap
+and that value is used to select the corresponding column stream
+`c0`, `c1`, etc. The value read is then encoded as a ZNG union value
+using the same tag within the union value.
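+
+As a small worked example of presence decoding (the data is hypothetical):
+suppose a super type `{a:int64}` encodes the three rows `{a:1}`,
+`{a:null(int64)}`, and `{a:2}`. Since nulls are not encoded, the column
+streams for field `a` would logically contain:
+```
+=== values for a
+1
+2
+=== presence runs for a
+1    (one value present)
+1    (one value null)
+1    (one value present)
+===
+```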
+ +## Examples + +### Hello, world + +Start with this ZNG data (shown as human-readable [ZSON](zson.md)): +``` +{a:"hello",b:"world"} +{a:"goodnight",b:"gracie"} +``` + +To convert to VNG format: +``` +zq -f vng hello.zson > hello.vng +``` + +Segments in the VNG format would be laid out like this: +``` +=== column for a +hello +goodnight +=== column for b +world +gracie +=== column for schema IDs +0 +0 +=== +``` diff --git a/versioned_docs/version-v1.15.0/formats/zed.md b/versioned_docs/version-v1.15.0/formats/zed.md new file mode 100644 index 00000000..60187042 --- /dev/null +++ b/versioned_docs/version-v1.15.0/formats/zed.md @@ -0,0 +1,240 @@ +--- +sidebar_position: 1 +sidebar_label: Data Model +--- + +# Zed Data Model + +Zed data is defined as an ordered sequence of one or more typed data values. +Each value's type is either a "primitive type", a "complex type", the "type type", +a "named type", or the "null type". + +## 1. Primitive Types + +Primitive types include signed and unsigned integers, IEEE binary and decimal +floating point, string, byte sequence, Boolean, IP address, IP network, +null, and a first-class type _type_. + +There are 30 types of primitive values with syntax defined as follows: + +| Name | Definition | +|------------|-------------------------------------------------| +| `uint8` | unsigned 8-bit integer | +| `uint16` | unsigned 16-bit integer | +| `uint32` | unsigned 32-bit integer | +| `uint64` | unsigned 64-bit integer | +| `uint128` | unsigned 128-bit integer | +| `uint256` | unsigned 256-bit integer | +| `int8` | signed 8-bit integer | +| `int16` | signed 16-bit integer | +| `int32` | signed 32-bit integer | +| `int64` | signed 64-bit integer | +| `int128` | signed 128-bit integer | +| `int256` | signed 256-bit integer | +| `duration` | signed 64-bit integer as nanoseconds | +| `time` | signed 64-bit integer as nanoseconds from epoch | +| `float16` | IEEE-754 binary16 | +| `float32` | IEEE-754 binary32 | +| `float64` | IEEE-754 binary64 | +| `float128` | IEEE-754 binary128 | +| `float256` | IEEE-754 binary256 | +| `decimal32` | IEEE-754 decimal32 | +| `decimal64` | IEEE-754 decimal64 | +| `decimal128` | IEEE-754 decimal128 | +| `decimal256` | IEEE-754 decimal256 | +| `bool` | the Boolean value `true` or `false` | +| `bytes` | a bounded sequence of 8-bit bytes | +| `string` | a UTF-8 string | +| `ip` | an IPv4 or IPv6 address | +| `net` | an IPv4 or IPv6 address and net mask | +| `type` | a Zed type value | +| `null` | the null type | + +The _type_ type provides for first-class types and even though a type value can +represent a complex type, the value itself is a singleton. + +Two type values are equivalent if their underlying types are equal. Since +every type in the Zed type system is uniquely defined, type values are equal +if and only if their corresponding types are uniquely equal. + +The _null_ type is a primitive type representing only a `null` value. +A `null` value can have any type. + +> Note that `time` values correspond to 64-bit epoch nanoseconds and thus +> not every valid RFC 3339 date and time string represents a valid Zed time. +> In addition, nanosecond epoch times overflow on April 11, 2262. +> For the world of 2262, a new epoch can be created well in advance +> and the old time epoch and new time epoch can live side by side with +> the old using a named type for the new epoch time referring to the old `time`. 
+
+> An app that wants more than 64 bits of timestamp precision can always use
+> a named type of a `bytes` type and do its own conversions to and from the
+> corresponding bytes values. A time with a local time zone can be represented
+> as a Zed record of a time field and a zone field.
+
+## 2. Complex Types
+
+Complex types are composed of primitive types and/or other complex types.
+The categories of complex types include:
+* _record_ - an ordered collection of zero or more named values called fields,
+* _array_ - an ordered sequence of zero or more values called elements,
+* _set_ - a set of zero or more unique values called elements,
+* _map_ - a collection of zero or more key/value pairs where the keys are of a
+uniform type called the key type and the values are of a uniform type called
+the value type,
+* _union_ - a type representing values whose type is any of a specified collection of two or more unique types,
+* _enum_ - a type representing a finite set of symbols typically representing categories, and
+* _error_ - any value wrapped as an "error".
+
+The type system comprises a total order:
+* The order of primitive types corresponds to the order in the table above.
+* All primitive types are ordered before any complex types.
+* The order of complex type categories corresponds to the order above.
+* For complex types of the same category, the order is defined below.
+
+### 2.1 Record
+
+A record comprises an ordered set of zero or more named values
+called "fields". The field names must be unique in a given record
+and the order of the fields is significant, e.g., type `{a:string,b:string}`
+is distinct from type `{b:string,a:string}`.
+
+A field name is any UTF-8 string.
+
+A field value is any Zed value.
+
+In contrast to many schema-oriented data formats, Zed has no way to specify
+a field as "optional" since any field value can be a null value.
+
+If an instance of a record value omits a value
+by dropping the field altogether rather than using a null, then that record
+value corresponds to a different record type that elides the field in question.
+
+A record type is uniquely defined by its ordered list of field-type pairs.
+
+The type order of two records is as follows:
+* A record with fewer fields than another is ordered before the other.
+* Records with the same number of fields are ordered according to:
+  * the lexicographic order of the field names from left to right,
+  * or if all the field names are the same, the type order of the field types from left to right.
+
+### 2.2 Array
+
+An array is an ordered sequence of zero or more Zed values called "elements"
+all conforming to the same Zed type.
+
+An array value may be empty. An empty array may have element type `null`.
+
+An array type is uniquely defined by its single element type.
+
+The type order of two arrays is defined as the type order of the
+two array element types.
+
+> Note that mixed-type JSON arrays are representable as a Zed array with
+> elements of type union.
+
+### 2.3 Set
+
+A set is an unordered sequence of zero or more Zed values called "elements"
+all conforming to the same Zed type.
+
+A set may be empty. An empty set may have element type `null`.
+
+A set of mixed-type values is representable as a Zed set with
+elements of type union.
+
+A set type is uniquely defined by its single element type.
+
+The type order of two sets is defined as the type order of the
+two set element types.
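+
+A few examples may help; each follows directly from the record and array
+ordering rules above:
+```
+{a:int64} < {a:int64,b:string}    (fewer fields is ordered first)
+{a:string} < {b:string}           (lexicographic order of field names)
+[int64] < [string]                (element types ordered per the primitive type table)
+```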
+
+### 2.4 Map
+
+A map represents a list of zero or more key-value pairs, where the keys
+have a common Zed type and the values have a common Zed type.
+
+Each key across an instance of a map value must be a unique value.
+
+A map value may be empty.
+
+A map type is uniquely defined by its key type and value type.
+
+The type order of two map types is as follows:
+* the type order of their key types,
+* or if they are the same, then the order of their value types.
+
+### 2.5 Union
+
+A union represents a value that may be any one of a specific enumeration
+of two or more unique Zed types that comprise its "union type".
+
+A union type is uniquely defined by an ordered set of unique types (which may be
+other union types) where the order corresponds to the Zed type system's total order.
+
+Union values are tagged in that
+any instance of a union value explicitly conforms to exactly one of the union's types.
+The union tag is an integer indicating the position of its type in the union
+type's ordered list of types.
+
+The type order of two union types is as follows:
+* The union type with fewer types than the other is ordered before the other.
+* Two union types with the same number of types are ordered according to
+the type order of the constituent types in left-to-right order.
+
+### 2.6 Enum
+
+An enum represents a symbol from a finite set of one or more unique symbols
+referenced by name. An enum name may be any UTF-8 string.
+
+An enum type is uniquely defined by its ordered set of unique symbols,
+where the order is significant, e.g., two enum types
+with the same set of symbols but in different order are distinct.
+
+The type order of two enum types is as follows:
+* The enum type with fewer symbols than the other is ordered before the other.
+* Two enum types with the same number of symbols are ordered according to
+the lexicographic order of the constituent symbols in left-to-right order.
+
+### 2.7 Error
+
+An error represents any value designated as an error.
+
+The type order of an error is the type order of the type of its contained value.
+
+## 3. Named Type
+
+A _named type_ is a name for a specific Zed type.
+Any value can have a named type and the named type is a distinct type
+from the underlying type. A named type can refer to another named type.
+
+The binding between a named type and its underlying type is local in scope
+and need not be unique across a sequence of values.
+
+A type name may be any UTF-8 string exclusive of primitive type names.
+
+For example, if "port" is a named type for `uint16`, then two values of
+type "port" have the same type but a value of type "port" and a value of type `uint16`
+do not have the same type.
+
+The type order of a named type is the type order of its underlying type with two
+exceptions:
+* A named type is ordered after its underlying type.
+* Named types sharing an underlying type are ordered lexicographically by name.
+
+> While the Zed data model does not include explicit support for schema versioning,
+> named types provide a flexible mechanism to implement versioning
+> on top of the Zed serialization formats. For example, a Zed-based system
+> could define a naming convention of the form `<name>.<version>`
+> where `<name>` is the type name of a record representing the schema
+> and `<version>` is a decimal string indicating the version of that schema.
+> Since types need only be parsed once per stream
+> in the Zed binary serialization formats, a Zed type implementation could
+> efficiently support schema versioning using such a convention.
+
+## 4. 
Null Values + +All Zed types have a null representation. It is up to an +implementation to decide how external data structures map into and +out of values with nulls. Typically, a null value is either the +zero value or, in the case of record fields, an optional field whose +value is not present, though these semantics are not explicitly +defined by the Zed data model. diff --git a/versioned_docs/version-v1.15.0/formats/zjson.md b/versioned_docs/version-v1.15.0/formats/zjson.md new file mode 100644 index 00000000..af8ff2ad --- /dev/null +++ b/versioned_docs/version-v1.15.0/formats/zjson.md @@ -0,0 +1,476 @@ +--- +sidebar_position: 5 +sidebar_label: ZJSON +--- + +# ZJSON Specification + +## 1. Introduction + +The [Zed data model](zed.md) +is based on richly typed records with a deterministic field order, +as is implemented by the [ZSON](zson.md), [ZNG](zng.md), and [VNG](vng.md) formats. +Given the ubiquity of JSON, it is desirable to also be able to serialize +Zed data into the JSON format. However, encoding Zed data values +directly as JSON values would not work without loss of information. + +For example, consider this Zed data as [ZSON](zson.md): +``` +{ + ts: 2018-03-24T17:15:21.926018012Z, + a: "hello, world", + b: { + x: 4611686018427387904, + y: 127.0.0.1 + } +} +``` +A straightforward translation to JSON might look like this: +``` +{ + "ts": 1521911721.926018012, + "a": "hello, world", + "b": { + "x": 4611686018427387904, + "y": "127.0.0.1" + } +} +``` +But, when this JSON is transmitted to a JavaScript client and parsed, +the result looks something like this: +``` +{ + "ts": 1521911721.926018, + "a": "hello, world", + "b": { + "x": 4611686018427388000, + "y": "127.0.0.1" + } +} +``` +The good news is the `a` field came through just fine, but there are +a few problems with the remaining fields: +* the timestamp lost precision (due to 53 bits of mantissa in a JavaScript +IEEE 754 floating point number) and was converted from a time type to a number, +* the int64 lost precision for the same reason, and +* the IP address has been converted to a string. + +As a comparison, Python's `json` module handles the 64-bit integer to full +precision, but loses precision on the floating point timestamp. +Also, it is at the whim of a JSON implementation whether +or not the order of object keys is preserved. + +While JSON is well suited for data exchange of generic information, it is not +so appropriate for a [super-structured data model](./README.md#2-zed-a-super-structured-pattern) +like Zed. That said, JSON can be used as an encoding format for Zed by mapping Zed data +onto a JSON-based protocol. This allows clients like web apps or +Electron apps to receive and understand Zed and, with the help of client +libraries like [zed-js](https://github.com/brimdata/zui/tree/main/packages/zed-js), +to manipulate the rich, structured Zed types that are implemented on top of +the basic JavaScript types. + +In other words, +because JSON objects do not have a deterministic field order nor does JSON +in general have typing beyond the basics (i.e., strings, floating point numbers, +objects, arrays, and booleans), we decided to encode Zed data with +its embedded type model all in a layer above regular JSON. + +## 2. The Format + +The format for representing Zed in JSON is called ZJSON. +Converting ZSON, ZNG, or VNG to ZJSON and back results in a complete and +accurate restoration of the original Zed data. 
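+
+For example, a hedged round-trip sketch with the `zq` CLI (this assumes
+`zq`'s `-i` flag to force the ZJSON input format on the second invocation):
+```
+echo '{s:"hello",x:1}' | zq -f zjson - | zq -i zjson -z -
+```
+should print back the original value
+```
+{s:"hello",x:1}
+```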
+
+A ZJSON stream is defined as a sequence of JSON objects where each object
+represents a Zed value and has the form:
+```
+{
+  "type": <type>,
+  "value": <value>
+}
+```
+The type and value fields are encoded as defined below.
+
+### 2.1 Type Encoding
+
+The type encoding for a primitive type is simply its [Zed type name](zed.md#1-primitive-types),
+e.g., "int32" or "string".
+
+Complex types are encoded with small-integer identifiers.
+The first instance of a unique type defines the binding between the
+integer identifier and its definition, where the definition may recursively
+refer to earlier complex types by their identifiers.
+
+For example, the Zed type `{s:string,x:int32}` has this ZJSON format:
+```
+{
+  "id": 123,
+  "kind": "record",
+  "fields": [
+    {
+      "name": "s",
+      "type": {
+        "kind": "primitive",
+        "name": "string"
+      }
+    },
+    {
+      "name": "x",
+      "type": {
+        "kind": "primitive",
+        "name": "int32"
+      }
+    }
+  ]
+}
+```
+
+A previously defined complex type may be referred to using a reference of the form:
+```
+{
+  "kind": "ref",
+  "id": 123
+}
+```
+
+#### 2.1.1 Record Type
+
+A record type is a JSON object of the form
+```
+{
+  "id": <number>,
+  "kind": "record",
+  "fields": [ <field>, <field>, ... ]
+}
+```
+where each of the fields has the form
+```
+{
+  "name": <name>,
+  "type": <type>,
+}
+```
+and `<name>` is a string defining the field name and `<type>` is a
+recursively encoded type.
+
+#### 2.1.2 Array Type
+
+An array type is defined by a JSON object having the form
+```
+{
+  "id": <number>,
+  "kind": "array",
+  "type": <type>
+}
+```
+where `<type>` is a recursively encoded type.
+
+#### 2.1.3 Set Type
+
+A set type is defined by a JSON object having the form
+```
+{
+  "id": <number>,
+  "kind": "set",
+  "type": <type>
+}
+```
+where `<type>` is a recursively encoded type.
+
+#### 2.1.4 Map Type
+
+A map type is defined by a JSON object of the form
+```
+{
+  "id": <number>,
+  "kind": "map",
+  "key_type": <type>,
+  "val_type": <type>
+}
+```
+where each `<type>` is a recursively encoded type.
+
+#### 2.1.5 Union Type
+
+A union type is defined by a JSON object having the form
+```
+{
+  "id": <number>,
+  "kind": "union",
+  "types": [ <type>, <type>, ... ]
+}
+```
+where the list of types comprises the types of the union and
+each `<type>` is a recursively encoded type.
+
+#### 2.1.6 Enum Type
+
+An enum type is a JSON object of the form
+```
+{
+  "id": <number>,
+  "kind": "enum",
+  "symbols": [ <string>, <string>, ... ]
+}
+```
+where the unique `<string>` values define a finite set of symbols.
+
+#### 2.1.7 Error Type
+
+An error type is a JSON object of the form
+```
+{
+  "id": <number>,
+  "kind": "error",
+  "type": <type>
+}
+```
+where `<type>` is a recursively encoded type.
+
+#### 2.1.8 Named Type
+
+A named type is encoded as a binding between a name and a Zed type
+and represents a new type so named. A named type definition has the form
+```
+{
+  "id": <number>,
+  "kind": "named",
+  "name": <name>,
+  "type": <type>,
+}
+```
+where `<name>` is a JSON string representing the newly defined type name
+and `<type>` is a recursively encoded type.
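+
+For example, a named type `port` defined as `uint16` would be encoded as
+follows (the `id` value 123 is arbitrary, following the convention of the
+examples above):
+```
+{
+  "id": 123,
+  "kind": "named",
+  "name": "port",
+  "type": {
+    "kind": "primitive",
+    "name": "uint16"
+  }
+}
+```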
+
+### 2.2 Value Encoding
+
+The primitive values comprising an arbitrarily complex Zed data value are encoded
+as a JSON array of strings mixed with nested JSON arrays whose structure
+conforms to the nested structure of the value's schema as follows:
+* each record, array, and set is encoded as a JSON array of its composite values,
+* a union is encoded as a string of the form `<tag>:<value>` where `<tag>`
+is an integer string representing the positional index in the union's list of
+types that specifies the type of `<value>`, which is a JSON string or array
+as described recursively herein,
+* a map is encoded as a JSON array of two-element arrays of the form
+`[ <key>, <value> ]` where `<key>` and `<value>` are recursively encoded,
+* a type value is encoded [as above](#21-type-encoding),
+* each primitive that is not a type value
+is encoded as a string conforming to its ZSON representation, as described in the
+[corresponding section of the ZSON specification](zson.md#23-primitive-values).
+
+For example, a record with three fields --- a string, an array of integers,
+and an array of a union of string and float64 --- might have a value that looks like this:
+```
+[ "hello, world", ["1","2","3","4"], ["1:foo", "0:10" ] ]
+```
+
+## 3. Object Framing
+
+A ZJSON file is composed of ZJSON objects formatted as
+[newline delimited JSON (NDJSON)](https://en.wikipedia.org/wiki/JSON_streaming#NDJSON),
+e.g., the [zq](../commands/zq.md) CLI command
+writes its ZJSON output as lines of NDJSON.
+
+## 4. Example
+
+Here is an example that illustrates values of a repeated type,
+nesting, records, arrays, and unions. Consider the file `input.zson`:
+
+```mdtest-input input.zson
+{s:"hello",r:{a:1,b:2}}
+{s:"world",r:{a:3,b:4}}
+{s:"hello",r:{a:[1,2,3]}}
+{s:"goodnight",r:{x:{u:"foo"((string,int64))}}}
+{s:"gracie",r:{x:{u:12((string,int64))}}}
+```
+
+This data is represented in ZJSON as follows:
+
+```mdtest-command
+zq -f zjson input.zson | jq .
+``` + +```mdtest-output +{ + "type": { + "kind": "record", + "id": 31, + "fields": [ + { + "name": "s", + "type": { + "kind": "primitive", + "name": "string" + } + }, + { + "name": "r", + "type": { + "kind": "record", + "id": 30, + "fields": [ + { + "name": "a", + "type": { + "kind": "primitive", + "name": "int64" + } + }, + { + "name": "b", + "type": { + "kind": "primitive", + "name": "int64" + } + } + ] + } + } + ] + }, + "value": [ + "hello", + [ + "1", + "2" + ] + ] +} +{ + "type": { + "kind": "ref", + "id": 31 + }, + "value": [ + "world", + [ + "3", + "4" + ] + ] +} +{ + "type": { + "kind": "record", + "id": 34, + "fields": [ + { + "name": "s", + "type": { + "kind": "primitive", + "name": "string" + } + }, + { + "name": "r", + "type": { + "kind": "record", + "id": 33, + "fields": [ + { + "name": "a", + "type": { + "kind": "array", + "id": 32, + "type": { + "kind": "primitive", + "name": "int64" + } + } + } + ] + } + } + ] + }, + "value": [ + "hello", + [ + [ + "1", + "2", + "3" + ] + ] + ] +} +{ + "type": { + "kind": "record", + "id": 38, + "fields": [ + { + "name": "s", + "type": { + "kind": "primitive", + "name": "string" + } + }, + { + "name": "r", + "type": { + "kind": "record", + "id": 37, + "fields": [ + { + "name": "x", + "type": { + "kind": "record", + "id": 36, + "fields": [ + { + "name": "u", + "type": { + "kind": "union", + "id": 35, + "types": [ + { + "kind": "primitive", + "name": "int64" + }, + { + "kind": "primitive", + "name": "string" + } + ] + } + } + ] + } + } + ] + } + } + ] + }, + "value": [ + "goodnight", + [ + [ + [ + "1", + "foo" + ] + ] + ] + ] +} +{ + "type": { + "kind": "ref", + "id": 38 + }, + "value": [ + "gracie", + [ + [ + [ + "0", + "12" + ] + ] + ] + ] +} +``` diff --git a/versioned_docs/version-v1.15.0/formats/zng.md b/versioned_docs/version-v1.15.0/formats/zng.md new file mode 100644 index 00000000..1c2a3b46 --- /dev/null +++ b/versioned_docs/version-v1.15.0/formats/zng.md @@ -0,0 +1,646 @@ +--- +sidebar_position: 2 +sidebar_label: ZNG +--- + +# ZNG Specification + +## 1. Introduction + +ZNG (pronounced "zing") is an efficient, sequence-oriented serialization format for any data +conforming to the [Zed data model](zed.md). + +ZNG is "row oriented" and +analogous to [Apache Avro](https://avro.apache.org) but does not +require schema definitions as it instead utilizes the fine-grained type system +of the Zed data model. +This binary format is based on machine-readable data types with an +encoding methodology inspired by Avro, +[Parquet](https://en.wikipedia.org/wiki/Apache_Parquet), and +[Protocol Buffers](https://developers.google.com/protocol-buffers). + +To this end, ZNG embeds all type information +in the stream itself while having a binary serialization format that +allows "lazy parsing" of fields such that +only the fields of interest in a stream need to be deserialized and interpreted. +Unlike Avro, ZNG embeds its "schemas" in the data stream as Zed types and thereby admits +an efficient multiplexing of heterogeneous data types by prepending to each +data value a simple integer identifier to reference its type. + +Since no external schema definitions exist in ZNG, a "type context" is constructed +on the fly by composing dynamic type definitions embedded in the ZNG format. +ZNG can be readily adapted to systems like +[Apache Kafka](https://kafka.apache.org/) which utilize schema registries, +by having a connector translate the schemas implied in the +ZNG stream into registered schemas and vice versa. 
+Better still, Kafka could
+be used natively with ZNG, obviating the need for the schema registry.
+
+Multiple ZNG streams with different type contexts are easily merged because the
+serialization of values does not depend on the details of
+the type context. One or more streams can be merged by simply merging the
+input contexts into an output context and adjusting the type reference of
+each value in the output ZNG sequence. The values need not be traversed
+or otherwise rewritten to be merged in this fashion.
+
+## 2. The ZNG Format
+
+A ZNG stream comprises a sequence of frames where
+each frame contains one of three types of data:
+_types_, _values_, or externally-defined _control_.
+
+A stream is punctuated by the end-of-stream value `0xff`.
+
+Each frame header includes a length field
+allowing an implementation to easily skip from frame to frame.
+
+Each frame begins with a single-byte "frame code":
+```
+    7 6 5 4 3 2 1 0
+   +-+-+-+-+-+-+-+-+
+   |V|C| T | L     |
+   +-+-+-+-+-+-+-+-+
+
+   V: 1 bit
+
+     Version number. Must be zero.
+
+   C: 1 bit
+
+     Indicates compressed frame data.
+
+   T: 2 bits
+
+     Type of frame data.
+
+       00: Types
+       01: Values
+       10: Control
+       11: End of stream
+
+   L: 4 bits
+
+     Low-order bits of frame length.
+```
+
+Bit 7 of the frame code must be zero as it defines version 0
+of the ZNG stream format. If a future version of ZNG
+arises, bit 7 of future ZNG frames will be 1.
+ZNG version 0 readers must ignore and skip over such frames using the
+`len` field, which must survive future versions.
+Any future versions of ZNG must be able to integrate version 0 frames
+for backward compatibility.
+
+Following the frame code is its encoded length followed by a "frame payload"
+of bytes of said length:
+```
+<uvarint><frame payload>
+```
+The length encoding utilizes a variable-length unsigned integer called herein a `uvarint`:
+
+> Inspired by Protocol Buffers,
+> a `uvarint` is an unsigned, variable-length integer encoded as a sequence of
+> bytes consisting of N-1 bytes with bit 7 clear and the Nth byte with bit 7 set,
+> whose value is the base-128 number composed of the digits defined by the lower
+> 7 bits of each byte from least-significant digit (byte 0) to
+> most-significant digit (byte N-1).
+
+The frame payload's length is equal to the value of the `uvarint` following the
+frame code times 16 plus the low 4-bit integer value `L` field in the frame code.
+
+If the `C` bit is set in the frame code, then the frame payload following the
+frame length is compressed and has the form:
+```
+<format><size[uvarint]><compressed payload>
+```
+where
+* `<format>` is a single byte indicating the compression format of the compressed payload,
+* `<size>` is a `uvarint` encoding the size of the uncompressed payload, and
+* `<compressed payload>` is a bytes sequence whose length equals
+the outer frame length less 1 byte for the compression format and the encoded length
+of the `uvarint` size field.
+
+The `<compressed payload>` is compressed according to the compression algorithm
+specified by the `<format>` byte. Each frame is compressed independently
+such that the compression algorithm's state is not carried from frame to frame
+(thereby enabling parallel decoding).
+
+The `<size>` value is redundant with the compressed payload
+but is useful to an implementation to deterministically
+size decompression buffers in advance of decoding.
+
+Values for the `<format>` byte are defined in the
+[ZNG compression format specification](./compression.md).
+
+> This arrangement of frames separating types and values allows
+> for efficient scanning and parallelization.
+> In general, values depend
+> on type definitions but as long as all of the types are known when
+> values are used, decoding can be done in parallel. Likewise, since
+> each block is independently compressed, the blocks can be decompressed
+> in parallel. Moreover, efficient filtering can be carried out over
+> uncompressed data before it is deserialized into native data structures,
+> e.g., allowing entire frames to be discarded based on
+> heuristics, such as knowing that a filtering predicate can't be true based on a
+> quick scan of the data, perhaps using the Boyer-Moore algorithm to determine
+> that a comparison with a string constant would not work for any
+> value in the buffer.
+
+Whether the payload was originally uncompressed or was decompressed, it is
+then interpreted according to the `T` bits of the frame code as a
+* [types frame](#21-types-frame),
+* [values frame](#22-values-frame), or
+* [control frame](#23-control-frame).
+
+### 2.1 Types Frame
+
+A _types frame_ encodes a sequence of type definitions for complex Zed types
+and establishes a "type ID" for each such definition.
+Type IDs for the "primitive types"
+are predefined with the IDs listed in the [Primitive Types](#3-primitive-types) table.
+
+Each definition, or "typedef",
+consists of a typedef code followed by its type-specific encoding as described below.
+Each type must be decoded in sequence to find the start of the next type definition
+as there is no framing to separate the typedefs.
+
+The typedefs are numbered in the order encountered starting at 30
+(as the largest primitive type ID is 29). Types refer to other types
+by their type ID. Note that the type ID of a typedef is implied by its
+position in the sequence and is not explicitly encoded.
+
+The typedef codes are defined as follows:
+
+| Code | Complex Type             |
+|------|--------------------------|
+| 0    | record type definition   |
+| 1    | array type definition    |
+| 2    | set type definition      |
+| 3    | map type definition      |
+| 4    | union type definition    |
+| 5    | enum type definition     |
+| 6    | error type definition    |
+| 7    | named type definition    |
+
+Any references to a type ID in the body of a typedef are encoded as a `uvarint`.
+
+#### 2.1.1 Record Typedef
+
+A record typedef creates a new type ID equal to the next stream type ID
+with the following structure:
+```
+---------------------------------------------------
+|0x00|<nfields>|<name1><typeid1><name2><typeid2>...|
+---------------------------------------------------
+```
+Record types consist of an ordered set of fields where each field consists of
+a name and its type. Unlike JSON, the ordering of the fields is significant
+and must be preserved through any APIs that consume, process, and emit ZNG records.
+
+A record type is encoded as a count of fields, i.e., `<nfields>` from above,
+followed by the field definitions,
+where a field definition is a field name followed by a type ID, i.e.,
+`<name1>` followed by `<typeid1>` etc. as indicated above.
+
+The field names in a record must be unique.
+
+The `<nfields>` value is encoded as a `uvarint`.
+
+The field name is encoded as a UTF-8 string defining a "ZNG identifier".
+The UTF-8 string
+is further encoded as a "counted string", which is the `uvarint` encoding
+of the length of the string followed by that many bytes of UTF-8 encoded
+string data.
+
+N.B.: As defined by [ZSON](zson.md), a field name can be any valid UTF-8 string much like JSON
+objects can be indexed with arbitrary string keys (via index operator)
+even if the field names available to the dot operator are restricted
+by language syntax for identifiers.
+
+The type ID follows the field name and is encoded as a `uvarint`.
+
+#### 2.1.2 Array Typedef
+
+An array type is encoded as simply the type code of the elements of
+the array encoded as a `uvarint`:
+```
+----------------
+|0x01|<type-id>|
+----------------
+```
+
+#### 2.1.3 Set Typedef
+
+A set type is encoded as the type ID of the
+elements of the set, encoded as a `uvarint`:
+```
+----------------
+|0x02|<type-id>|
+----------------
+```
+
+#### 2.1.4 Map Typedef
+
+A map type is encoded as the type code of the key
+followed by the type code of the value.
+```
+------------------------------------
+|0x03|<key-type-id>|<value-type-id>|
+------------------------------------
+```
+Each `<type-id>` is encoded as a `uvarint`.
+
+#### 2.1.5 Union Typedef
+
+A union typedef creates a new type ID equal to the next stream type ID
+with the following structure:
+```
+-----------------------------------------
+|0x04|<ntypes>|<typeid1><typeid2>...|
+-----------------------------------------
+```
+A union type consists of an ordered set of types
+encoded as a count of the number of types, i.e., `<ntypes>` from above,
+followed by the type IDs comprising the types of the union.
+The type IDs of a union must be unique.
+
+The `<ntypes>` and the type IDs are all encoded as `uvarint`.
+
+`<ntypes>` cannot be 0.
+
+#### 2.1.6 Enum Typedef
+
+An enum type is encoded as a `uvarint` representing the number of symbols
+in the enumeration followed by the names of each symbol.
+```
+--------------------------------
+|0x05|<nelem>|<name1><name2>...|
+--------------------------------
+```
+`<nelem>` is encoded as a `uvarint`.
+The names have the same UTF-8 format as record field names and are encoded
+as counted strings following the same convention as record field names.
+
+#### 2.1.7 Error Typedef
+
+An error type is encoded as follows:
+```
+----------------
+|0x06|<type-id>|
+----------------
+```
+which defines a new error type for error values that have the underlying type
+indicated by `<type-id>`.
+
+#### 2.1.8 Named Type Typedef
+
+A named type defines a new type ID that binds a name to a previously existing type ID.
+
+A named type is encoded as follows:
+```
+----------------------
+|0x07|<name><type-id>|
+----------------------
+```
+where `<name>` is an identifier representing the new type name with a new type ID
+allocated as the next available type ID in the stream that refers to the
+existing type ID `<type-id>`. `<type-id>` is encoded as a `uvarint` and `<name>`
+is encoded as a `uvarint` representing the length of the name in bytes,
+followed by that many bytes of UTF-8 string.
+
+As indicated in the [data model](zed.md),
+it is an error to define a type name that has the same name as a primitive type,
+and it is permissible to redefine a previously defined type name with a
+type that differs from the previous definition.
+
+### 2.2 Values Frame
+
+A _values frame_ is a sequence of Zed values each encoded as the value's type ID,
+encoded as a `uvarint`, followed by its tag-encoded serialization as described below.
+
+Since a single type ID encodes the entire value's structure, no additional
+type information is needed. Also, the value encoding follows the structure
+of the type explicitly so the type is not needed to parse the structure of the
+value, but rather only its semantics.
+
+It is an error for a value to reference a type ID that has not been
+previously defined by a typedef scoped to the stream in which the value
+appears.
+
+The value is encoded using a "tag-encoding" scheme
+that captures the structure of both primitive types and the recursive
+nature of complex types.
This structure is encoded +explicitly in every value and the boundaries of each value and its +recursive nesting can be parsed without knowledge of the type or types of +the underlying values. This admits an efficient implementation +for traversing the values, inclusive of recursive traversal of complex values, +whereby the inner loop need not consult and interpret the type ID of each element. + +#### 2.2.1 Tag-Encoding of Values + +Each value is prefixed with a "tag" that defines: +* whether it is the null value, and +* its encoded length in bytes. + +The tag is 0 for the null value and `length+1` for non-null values where +`length` is the encoded length of the value. Note that this encoding +differentiates between a null value and a zero-length value. Many data types +have a meaningful interpretation of a zero-length value, for example, an +empty array, the empty record, etc. + +The tag itself is encoded as a `uvarint`. + +#### 2.2.2 Tag-Encoded Body of Primitive Values + +Following the tag encoding is the value encoded in N bytes as described above. +A typed value with a `value` of length `N` is interpreted as described in the +[Primitive Types](#3-primitive-types) table. The type information needed to +interpret all of the value elements of a complex type are all implied by the +top-level type ID of the values frame. For example, the type ID could indicate +a particular record type, which recursively provides the type information +for all of the elements within that record, including other complex types +embedded within the top-level record. + +Note that because the tag indicates the length of the value, there is no need +to use varint encoding of integer values. Instead, an integer value is encoded +using the full 8 bits of each byte in little-endian order. Signed values, +before encoding, are shifted left one bit, and the sign bit stored as bit 0. +For negative numbers, the remaining bits are negated so that the upper bytes +tend to be zero-filled for small integers. + +#### 2.2.3 Tag-Encoded Body of Complex Values + +The body of a length-N container comprises zero or more tag-encoded values, +where the values are encoded as follows: + +| Type | Value | +|----------|-----------------------------------------| +| `array` | concatenation of elements | +| `set` | normalized concatenation of elements | +| `record` | concatenation of elements | +| `map` | concatenation of key and value elements | +| `union` | concatenation of tag and value | +| `enum` | position of enum element | +| `error` | wrapped element | + +Since N, the byte length of any of these container values, is known, +there is no need to encode a count of the +elements present. Also, since the type ID is implied by the typedef +of any complex type, each value is encoded without its type ID. + +For sets, the concatenation of elements must be normalized so that the +sequence of bytes encoding each element's tag-counted value is +lexicographically greater than that of the preceding element. + +A union value is encoded as a container with two elements. The first +element, called the tag, is the `uvarint` encoding of the +positional index determining the type of the value in reference to the +union's list of defined types, and the second element is the value +encoded according to that type. + +An enumeration value is represented as the `uvarint` encoding of the +positional index of that value's symbol in reference to the enum's +list of defined symbols. 
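+
+To make the preceding rules concrete, here is a minimal Go sketch (not
+taken from the Zed implementation) of the tag rule from Section 2.2.1 and
+the signed-integer body encoding from Section 2.2.2, assuming the
+`uvarint` convention defined in Section 2:
+```
+package main
+
+import "fmt"
+
+// appendUvarint appends u as a base-128 uvarint: least-significant digit
+// first, with bit 7 set only on the final byte, per Section 2.
+func appendUvarint(b []byte, u uint64) []byte {
+	for u >= 0x80 {
+		b = append(b, byte(u&0x7f))
+		u >>= 7
+	}
+	return append(b, byte(u)|0x80)
+}
+
+// appendTag appends a value's tag: 0 for null, length+1 otherwise.
+func appendTag(b, body []byte, null bool) []byte {
+	if null {
+		return appendUvarint(b, 0)
+	}
+	return appendUvarint(b, uint64(len(body))+1)
+}
+
+// intBody returns the little-endian body of a signed integer: the value
+// shifted left one bit with the sign stored in bit 0 and, for negatives,
+// the remaining bits negated. This sketch trims high zero bytes, so zero
+// encodes as a zero-length (but non-null) body.
+func intBody(n int64) []byte {
+	u := uint64(n<<1) ^ uint64(n>>63)
+	var b []byte
+	for u != 0 {
+		b = append(b, byte(u))
+		u >>= 8
+	}
+	return b
+}
+
+func main() {
+	body := intBody(-5)                // 0x09
+	val := appendTag(nil, body, false) // tag is length+1 = 2
+	val = append(val, body...)
+	fmt.Printf("% x\n", val) // prints "82 09"
+}
+```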
+
+A map value is encoded as a container whose elements are alternating
+tag-encoded keys and values, with keys and values encoded according to
+the map's key type and value type, respectively.
+
+The concatenation of elements must be normalized so that the
+sequence of bytes encoding each tag-counted key (of the key/value pair) is
+lexicographically greater than that of the preceding key (of the preceding
+key/value pair).
+
+### 2.3 Control Frame
+
+A _control frame_ contains an application-defined control message.
+
+Control frames are available to higher-layer protocols and are carried
+in ZNG as a convenient signaling mechanism. A ZNG implementation
+may skip over all control frames and is guaranteed by
+this specification to decode all of the data as described herein even if such
+frames provide additional semantics on top of the base ZNG format.
+
+The body of a control frame is a control message and may be JSON,
+ZSON, ZNG, binary, or UTF-8 text. The serialization of the control
+frame body is independent of the ZNG stream containing the control
+frame.
+
+Any control message not known by a ZNG data receiver shall be ignored.
+
+The delivery order of control messages with respect to the delivery
+order of values of the ZNG stream should be preserved by an API implementing
+ZNG serialization and deserialization.
+In this way, system endpoints that communicate using ZNG can embed
+protocol directives directly into the ZNG stream as control payloads
+with order-preserving semantics rather than defining additional
+layers of encapsulation and synchronization between such layers.
+
+A control frame has the following form:
+```
+-------------------------
+|<encoding>|<len>|<body>|
+-------------------------
+```
+where
+* `<encoding>` is a single byte indicating whether the body is encoded
+as ZNG (0), JSON (1), ZSON (2), an arbitrary UTF-8 string (3), or arbitrary binary data (4),
+* `<len>` is a `uvarint` encoding the length in bytes of the body
+(exclusive of the length 1 encoding byte), and
+* `<body>` is a control message whose semantics are outside the scope of
+the base ZNG specification.
+
+If the encoding type is ZNG, the embedded ZNG data
+starts and ends a single ZNG stream independent of the outer ZNG stream.
+
+### 2.4 End of Stream
+
+A ZNG stream must be terminated by an end-of-stream marker.
+A new ZNG stream may begin immediately after an end-of-stream marker.
+Each such stream has its own, independent type context.
+
+In this way, the concatenation of ZNG streams (or ZNG files containing
+ZNG streams) results in a valid ZNG data sequence.
+
+For example, a large ZNG file can be arranged into multiple, smaller streams
+to facilitate random access at stream boundaries.
+This benefit comes at the cost of some additional overhead --
+the space consumed by stream boundary markers and repeated type definitions.
+Choosing an appropriate stream size that balances this overhead with the
+benefit of enabling random access is left up to implementations.
+
+End-of-stream markers are also useful in the context of sending ZNG over Kafka,
+as a receiver can easily resynchronize with a live Kafka topic by
+discarding incomplete frames until a frame is found that is terminated
+by an end-of-stream marker (presuming the sender implementation aligns
+the ZNG frames on Kafka message boundaries).
+
+An end-of-stream marker is encoded as follows:
+```
+------
+|0xff|
+------
+```
+
+After this marker, all previously read
+typedefs are invalidated and the "next available type ID" is reset to
+the initial value of 30.
+To represent subsequent values that use a
+previously defined type, the appropriate typedef control code must
+be re-emitted
+(and note that the typedef may now be assigned a different ID).
+
+## 3. Primitive Types
+
+For each ZNG primitive type, the following table describes:
+* its type ID, and
+* the interpretation of a length `N` value in a [values frame](#22-values-frame).
+
+All fixed-size multi-byte sequences representing machine words
+are serialized in little-endian format.
+
+| Type         | ID |    N     | ZNG Value Interpretation                        |
+|--------------|---:|:--------:|------------------------------------------------|
+| `uint8`      | 0  | variable | unsigned int of length N                        |
+| `uint16`     | 1  | variable | unsigned int of length N                        |
+| `uint32`     | 2  | variable | unsigned int of length N                        |
+| `uint64`     | 3  | variable | unsigned int of length N                        |
+| `uint128`    | 4  | variable | unsigned int of length N                        |
+| `uint256`    | 5  | variable | unsigned int of length N                        |
+| `int8`       | 6  | variable | signed int of length N                          |
+| `int16`      | 7  | variable | signed int of length N                          |
+| `int32`      | 8  | variable | signed int of length N                          |
+| `int64`      | 9  | variable | signed int of length N                          |
+| `int128`     | 10 | variable | signed int of length N                          |
+| `int256`     | 11 | variable | signed int of length N                          |
+| `duration`   | 12 | variable | signed int of length N as ns                    |
+| `time`       | 13 | variable | signed int of length N as ns since epoch        |
+| `float16`    | 14 | 2        | 2 bytes of IEEE 16-bit format                   |
+| `float32`    | 15 | 4        | 4 bytes of IEEE 32-bit format                   |
+| `float64`    | 16 | 8        | 8 bytes of IEEE 64-bit format                   |
+| `float128`   | 17 | 16       | 16 bytes of IEEE 128-bit format                 |
+| `float256`   | 18 | 32       | 32 bytes of IEEE 256-bit format                 |
+| `decimal32`  | 19 | 4        | 4 bytes of IEEE decimal format                  |
+| `decimal64`  | 20 | 8        | 8 bytes of IEEE decimal format                  |
+| `decimal128` | 21 | 16       | 16 bytes of IEEE decimal format                 |
+| `decimal256` | 22 | 32       | 32 bytes of IEEE decimal format                 |
+| `bool`       | 23 | 1        | one byte 0 (false) or 1 (true)                  |
+| `bytes`      | 24 | variable | N bytes of value                                |
+| `string`     | 25 | variable | UTF-8 byte sequence                             |
+| `ip`         | 26 | 4 or 16  | 4 or 16 bytes of IP address                     |
+| `net`        | 27 | 8 or 32  | 8 or 32 bytes of IP prefix and subnet mask      |
+| `type`       | 28 | variable | type value byte sequence [as defined below](#4-type-values) |
+| `null`       | 29 | 0        | No value, always represents an undefined value  |
+
+## 4. Type Values
+
+As the Zed data model supports first-class types and because the ZNG design goals
+require that value serializations cannot change across type contexts, type values
+must be encoded in a fashion that is independent of the type context.
+Thus, a serialized type value encodes the entire type in a canonical form
+according to the recursive definition in this section.
+
+The type value of a primitive type (including type `type`) is its primitive ID,
+serialized as a single byte.
+
+The type value of a complex type is serialized recursively according to the
+complex type it represents as described below.
+
+#### 4.1 Record Type Value
+
+A record type value has the form:
+```
+--------------------------------------------------------
+|30|<nfields>|<name1><type-value><name2><type-value>...|
+--------------------------------------------------------
+```
+where `<nfields>` is the number of fields in the record encoded as a `uvarint`,
+`<name1>` etc. are the field names encoded as in the
+record typedef, and each `<type-value>` is a recursive encoding of a type value.
+
+#### 4.2 Array Type Value
+
+An array type value has the form:
+```
+-----------------
+|31|<type-value>|
+-----------------
+```
+where `<type-value>` is a recursive encoding of a type value.
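+
+As a worked example, the type value for the Zed type `{a:int64}` can be
+built from Section 4.1 plus the primitive IDs above. The following Go
+sketch (not taken from the Zed implementation, and assuming the `uvarint`
+convention defined in Section 2) constructs it:
+```
+package main
+
+import "fmt"
+
+// appendUvarint appends u as a base-128 uvarint per Section 2:
+// least-significant digit first, bit 7 set only on the final byte.
+func appendUvarint(b []byte, u uint64) []byte {
+	for u >= 0x80 {
+		b = append(b, byte(u&0x7f))
+		u >>= 7
+	}
+	return append(b, byte(u)|0x80)
+}
+
+// appendCountedString appends a uvarint length followed by the UTF-8
+// bytes of s, as in a record typedef field name.
+func appendCountedString(b []byte, s string) []byte {
+	return append(appendUvarint(b, uint64(len(s))), s...)
+}
+
+func main() {
+	// |30|<nfields>|<name1><type-value>| where the type value of the
+	// primitive type int64 is its ID, 9, serialized as a single byte.
+	tv := []byte{30}
+	tv = appendUvarint(tv, 1)         // one field
+	tv = appendCountedString(tv, "a") // field name "a"
+	tv = append(tv, 9)                // primitive type value: int64
+	fmt.Printf("% x\n", tv) // prints "1e 81 81 61 09"
+}
+```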
+
+#### 4.3 Set Type Value
+
+A set type value has the form:
+```
+-----------------
+|32|<type-value>|
+-----------------
+```
+where `<type-value>` is a recursive encoding of a type value.
+
+#### 4.4 Map Type Value
+
+A map type value has the form:
+```
+---------------------------------------
+|33|<key-type-value>|<value-type-value>|
+---------------------------------------
+```
+where `<key-type-value>` and `<value-type-value>` are recursive encodings of type values.
+
+#### 4.5 Union Type Value
+
+A union type value has the form:
+```
+-----------------------------------------
+|34|<ntypes>|<type-value><type-value>...|
+-----------------------------------------
+```
+where `<ntypes>` is the number of types in the union encoded as a `uvarint`
+and each `<type-value>` is a recursive definition of a type value.
+
+#### 4.6 Enum Type Value
+
+An enum type value has the form:
+```
+------------------------------
+|35|<nelem>|<name1><name2>...|
+------------------------------
+```
+where `<nelem>` and each symbol name is encoded as in an enum typedef.
+
+#### 4.7 Error Type Value
+
+An error type value has the form:
+```
+-----------------
+|36|<type-value>|
+-----------------
+```
+where `<type-value>` is the type value of the error.
+
+#### 4.8 Named Type Type Value
+
+A named type type value may appear either as a definition or a reference.
+When a named type is referenced, it must have been previously
+defined in the type value in accordance with a left-to-right depth-first-search (DFS)
+traversal of the type.
+
+A named type definition has the form:
+```
+------------------------
+|37|<name><type-value>|
+------------------------
+```
+where `<name>` is encoded as in a named type typedef
+and `<type-value>` is a recursive encoding of a type value. This creates
+a binding between the given name and the indicated type value only within the
+scope of the encoded value and does not affect the type context.
+This binding may be changed by another named type definition
+of the same name in the same type value according to the DFS order.
+
+A named type reference has the form:
+```
+-----------
+|38|<name>|
+-----------
+```
+It is an error for a named type reference to appear in a type value with a name
+that has not been previously defined according to the DFS order.
diff --git a/versioned_docs/version-v1.15.0/formats/zson.md b/versioned_docs/version-v1.15.0/formats/zson.md
new file mode 100644
index 00000000..42452771
--- /dev/null
+++ b/versioned_docs/version-v1.15.0/formats/zson.md
@@ -0,0 +1,582 @@
+---
+sidebar_position: 3
+sidebar_label: ZSON
+---
+
+# ZSON Specification
+
+## 1. Introduction
+
+ZSON is the human-readable, text-based serialization format of
+the super-structured [Zed data model](zed.md).
+
+ZSON builds upon the elegant simplicity of JSON with "type decorators".
+Where the type of a value is not implied by its syntax, a parenthesized
+type decorator is appended to the value thus establishing a well-defined
+type for every value expressed in ZSON text.
+
+ZSON is also a superset of JSON in that all JSON documents are valid ZSON values.
+
+## 2. The ZSON Format
+
+A ZSON text is a sequence of UTF-8 characters organized either as a bounded input
+or an unbounded stream.
+
+The input text is organized as a sequence of one or more Zed values optionally
+separated by and interspersed with whitespace.
+Single-line (`//`) and multi-line (`/* ... */`) comments are
+treated as whitespace and ignored.
+
+All subsequent references to characters and strings in this section refer to
+the Unicode code points that result when the stream is decoded.
+If a ZSON input includes data that is not valid UTF-8, the input is invalid.
+
+### 2.1 Names
+
+ZSON _names_ encode record fields, enum symbols, and named types.
+
+A name is either an _identifier_ or a [quoted string](#231-strings).
+Names are referred to as `<name>` below.
+
+An _identifier_ is case-sensitive and can contain Unicode letters, `$`, `_`,
+and digits (0-9), but may not start with a digit. An identifier cannot be
+`true`, `false`, or `null`.
+
+### 2.2 Type Decorators
+
+A value may be explicitly typed by tagging it with a type decorator.
+The syntax for a decorator is a parenthesized type:
+```
+<value> ( <type> )
+```
+For union values, multiple decorators might be
+required to distinguish the union-member type from the possible set of
+union types when there is ambiguity, as in
+```
+123. (float32) ((int64,float32,float64))
+```
+In contrast, this union value is unambiguous:
+```
+123. ((int64,float64))
+```
+
+The syntax of a union value decorator is
+```
+<value> ( <type> ) [ ( <type> ) ...]
+```
+where the rightmost type must be a union type if more than one decorator
+is present.
+
+A decorator may also define a [named type](#258-named-type):
+```
+<value> ( <name> = <type> )
+```
+which declares a new type with the indicated type name using the
+implied type of the value. Type names may not be numeric, where a
+numeric is a sequence of one or more characters in the set `[0-9]`.
+
+A decorator may also define a temporary numeric reference of the form:
+```
+<value> ( <numeric> = <type> )
+```
+Once defined, this numeric reference may then be used anywhere a named type
+is used but a named type is not created.
+
+It is an error for the decorator to be type incompatible with its referenced value.
+
+Note that the `=` sigil here disambiguates between the case that a new
+type is defined, which may override a previous definition of a different type with the
+same name, from the case that an existing named type is merely decorating the value.
+
+### 2.3 Primitive Values
+
+The type names and formats for
+[Zed primitive](zed.md#1-primitive-types) values are as follows:
+
+| Type       | Value Format                                                  |
+|------------|---------------------------------------------------------------|
+| `uint8` | decimal string representation of any unsigned, 8-bit integer |
+| `uint16` | decimal string representation of any unsigned, 16-bit integer |
+| `uint32` | decimal string representation of any unsigned, 32-bit integer |
+| `uint64` | decimal string representation of any unsigned, 64-bit integer |
+| `uint128` | decimal string representation of any unsigned, 128-bit integer |
+| `uint256` | decimal string representation of any unsigned, 256-bit integer |
+| `int8` | decimal string representation of any signed, 8-bit integer |
+| `int16` | decimal string representation of any signed, 16-bit integer |
+| `int32` | decimal string representation of any signed, 32-bit integer |
+| `int64` | decimal string representation of any signed, 64-bit integer |
+| `int128` | decimal string representation of any signed, 128-bit integer |
+| `int256` | decimal string representation of any signed, 256-bit integer |
+| `duration` | a _duration string_ representing signed 64-bit nanoseconds |
+| `time` | an RFC 3339 UTC date/time string representing signed 64-bit nanoseconds from epoch |
+| `float16` | a _non-integer string_ representing an IEEE-754 binary16 value |
+| `float32` | a _non-integer string_ representing an IEEE-754 binary32 value |
+| `float64` | a _non-integer string_ representing an IEEE-754 binary64 value |
+| `float128` | a _non-integer string_ representing an IEEE-754 binary128 value |
+| `float256` | a _non-integer string_ representing an IEEE-754 binary256 value |
+| `decimal32` | a _non-integer string_ representing an IEEE-754 decimal32 value |
+| `decimal64` | a _non-integer string_ representing an IEEE-754 decimal64 value |
+| `decimal128` | a _non-integer string_ representing an IEEE-754 decimal128 value |
+| `decimal256` | a _non-integer string_ representing an IEEE-754 decimal256 value |
+| `bool` | the string `true` or `false` |
+| `bytes` | a sequence of bytes encoded as a hexadecimal string prefixed with `0x` |
+| `string` | a double-quoted or backtick-quoted UTF-8 string |
+| `ip` | a string representing an IP address in [IPv4 or IPv6 format](https://tools.ietf.org/html/draft-main-ipaddr-text-rep-02#section-3) |
+| `net` | a string in CIDR notation representing an IP address and prefix length as defined in RFC 4632 and RFC 4291. |
+| `type` | a string in canonical form as described in [Section 2.5](#25-types) |
+| `null` | the string `null` |
+
+The format of a _duration string_
+is an optionally-signed concatenation of decimal numbers,
+each with optional fraction and a unit suffix,
+such as "300ms", "-1.5h" or "2h45m", representing a 64-bit nanosecond value.
+Valid time units are
+"ns" (nanosecond),
+"us" (microsecond),
+"ms" (millisecond),
+"s" (second),
+"m" (minute),
+"h" (hour),
+"d" (day),
+"w" (7 days), and
+"y" (365 days).
+Note that each of these time units accurately represents its calendar value,
+except for the "y" unit, which does not reflect leap years and so forth.
+Instead, "y" is defined as the number of nanoseconds in 365 days.
+
+The format of floating point values is a _non-integer string_
+conforming to any floating point representation that cannot be
+interpreted as an integer, e.g., `1.` or `1.0` instead of
+`1`, or `1e3` instead of `1000`. Unlike JSON, a floating point number can
+also be one of:
+`Inf`, `+Inf`, `-Inf`, or `NaN`.
+
+A floating point value may be expressed with an integer string provided
+a type decorator is applied, e.g., `123 (float64)`.
+
+Decimal values require type decorators.
+
+A string may be backtick-quoted with the backtick character `` ` ``.
+None of the text between backticks is escaped, but by default, any newlines
+followed by whitespace are converted to a single newline and the first
+newline of the string is deleted. To avoid this automatic deletion and
+preserve indentation, the backtick-quoted string can be preceded with `=>`.
+
+Of the 30 primitive types, eleven represent _implied-type_ values:
+`int64`, `time`, `duration`, `float64`, `bool`, `bytes`, `string`, `ip`, `net`, `type`, and `null`.
+Values for these types are determined by the format of the value and
+thus do not need decorators to clarify the underlying type, e.g.,
+```
+123 (int64)
+```
+is the same as `123`.
+
+Values that do not have implied types must include a type decorator to clarify
+their type or appear in a context in which their type is defined (i.e., as a field
+value in a record, as an element in an array, etc.).
+
+While a `type` value may represent a complex type, the value itself is a singleton
+and thus always a primitive type. A `type` value is encoded as:
+* a left angle bracket `<`, followed by
+* a type as [encoded below](#25-types), followed by
+* a right angle bracket `>`.
+
+A `time` value corresponds to 64-bit Unix epoch nanoseconds and thus
+not all possible RFC 3339 date/time strings are valid. In addition,
+nanosecond epoch times overflow on April 11, 2262.
+For the world of 2262,
+a new epoch can be created well in advance
+and the old time epoch and new time epoch can live side by side,
+using a named type for the new epoch time defined as the old `time` type.
+An app that requires more than 64 bits of timestamp precision can always use
+a typedef of a `bytes` type and do its own conversions to and from the
+corresponding `bytes` values.
+
+#### 2.3.1 Strings
+
+Double-quoted `string` syntax is the same as that of JSON as described
+in [RFC 8259](https://tools.ietf.org/html/rfc8259#section-7). Notably,
+the following escape sequences are recognized:
+
+| Sequence | Unicode Character      |
+|----------|------------------------|
+| `\"` | quotation mark U+0022 |
+| `\\` | reverse solidus U+005C |
+| `\/` | solidus U+002F |
+| `\b` | backspace U+0008 |
+| `\f` | form feed U+000C |
+| `\n` | line feed U+000A |
+| `\r` | carriage return U+000D |
+| `\t` | tab U+0009 |
+| `\uXXXX` | U+XXXX |
+
+In `\uXXXX` sequences, each `X` is a hexadecimal digit, and letter
+digits may be uppercase or lowercase.
+
+The behavior of an implementation that encounters an unrecognized escape
+sequence in a `string` type is undefined.
+
+`\u` followed by anything that does not conform to the above syntax
+is not a valid escape sequence. The behavior of an implementation
+that encounters such invalid sequences in a `string` type is undefined.
+
+These escaping rules apply also to quoted field names in record values and
+record types as well as enum symbols.
+
+### 2.4 Complex Values
+
+Complex values are built from primitive values and/or other complex values
+and conform to the Zed data model's complex types:
+[record](zed.md#21-record),
+[array](zed.md#22-array),
+[set](zed.md#23-set),
+[map](zed.md#24-map),
+[union](zed.md#25-union),
+[enum](zed.md#26-enum), and
+[error](zed.md#27-error).
+
+Complex values have an implied type when their constituent values all have
+implied types.
+
+#### 2.4.1 Record Value
+
+A record value has the form:
+```
+{ <name> : <value>, <name> : <value>, ... }
+```
+where `<name>` is a [ZSON name](#21-names) and `<value>` is
+any optionally-decorated ZSON value inclusive of other records.
+Each name/value pair is called a _field_.
+There may be zero or more fields.
+
+#### 2.4.2 Array Value
+
+An array value has the form:
+```
+[ <value>, <value>, ... ]
+```
+If the elements of the array are not of uniform type, then the implied type of
+the array elements is a union of the types present.
+
+An array value may be empty. An empty array value without a type decorator is
+presumed to be an empty array of type `null`.
+
+#### 2.4.3 Set Value
+
+A set value has the form:
+```
+|[ <value>, <value>, ... ]|
+```
+where the indicated values must be distinct.
+
+If the elements of the set are not of uniform type, then the implied type of
+the set elements is a union of the types present.
+
+A set value may be empty. An empty set value without a type decorator is
+presumed to be an empty set of type `null`.
+
+#### 2.4.4 Map Value
+
+A map value has the form:
+```
+|{ <key> : <value>, <key> : <value>, ... }|
+```
+where zero or more comma-separated, key/value pairs are present.
+
+Whitespace around keys and values is generally optional, but to
+avoid ambiguity, whitespace must separate an IPv6 key from the colon
+that follows it.
+
+An empty map value without a type decorator is
+presumed to be an empty map of type `|{null: null}|`.
+
+#### 2.4.5 Union Value
+
+A union value is a value that conforms to one of the types within a union type.
+
+If the value appears in a context in which the type is unknown or ambiguous,
+then the value must be decorated as [described above](#22-type-decorators).
+
+#### 2.4.6 Enum Value
+
+An enum type represents a symbol from a finite set of symbols
+referenced by name.
+
+An enum value is indicated with the sigil `%` and has the form
+```
+%<name>
+```
+where `<name>` is a [ZSON name](#21-names).
+
+An enum value must appear in a context where the enum type is known, i.e.,
+with an explicit enum type decorator or within a complex type where the
+contained enum type is defined by the complex type's decorator.
+
+A sequence of enum values might look like this:
+```
+%HEADS (flip=(enum(HEADS,TAILS)))
+%TAILS (flip)
+%HEADS (flip)
+```
+
+#### 2.4.7 Error Value
+
+An error value has the form:
+```
+error(<value>)
+```
+where `<value>` is any ZSON value.
+
+### 2.5 Types
+
+A primitive type is simply the name of the primitive type, i.e., `string`,
+`uint16`, etc. Complex types are defined as follows.
+
+#### 2.5.1 Record Type
+
+A _record type_ has the form:
+```
+{ <name> : <type>, <name> : <type>, ... }
+```
+where `<name>` is a [ZSON name](#21-names) and
+`<type>` is any type.
+
+The order of the record fields is significant,
+e.g., type `{a:int32,b:int32}` is distinct from type `{b:int32,a:int32}`.
+
+#### 2.5.2 Array Type
+
+An _array type_ has the form:
+```
+[ <type> ]
+```
+
+#### 2.5.3 Set Type
+
+A _set type_ has the form:
+```
+|[ <type> ]|
+```
+
+#### 2.5.4 Map Type
+
+A _map type_ has the form:
+```
+|{ <key-type> : <value-type> }|
+```
+where `<key-type>` is the type of the keys and `<value-type>` is the
+type of the values.
+
+#### 2.5.5 Union Type
+
+A _union type_ has the form:
+```
+( <type>, <type>, ... )
+```
+where there are at least two types in the list.
+
+#### 2.5.6 Enum Type
+
+An _enum type_ has the form:
+```
+enum( <name>, <name>, ... )
+```
+where `<name>` is a [ZSON name](#21-names).
+Each enum name must be unique and the order is not significant, e.g.,
+enum type `enum(HEADS,TAILS)` is equal to type `enum(TAILS,HEADS)`.
+
+#### 2.5.7 Error Type
+
+An _error type_ has the form:
+```
+error( <type> )
+```
+where `<type>` is the type of the underlying ZSON values wrapped as an error.
+
+#### 2.5.8 Named Type
+
+A named type has the form:
+```
+<name> = <type>
+```
+where a new type is defined with the given name and type.
+
+When a named type appears in a complex value, the new type name may be
+referenced by any subsequent value in left-to-right depth-first order.
+
+For example,
+```
+{p1:80 (port=uint16), p2: 8080 (port)}
+```
+is valid but
+```
+{p1:80 (port), p2: 8080 (port=uint16)}
+```
+is invalid.
+
+Named types may be redefined, in which case subsequent references
+resolve to the most recent definition according to
+* sequence order across values, or
+* left-to-right depth-first order within a complex value.
+
+### 2.6 Null Value
+
+The null value is represented by the string `null`.
+
+A value of any type can be null. It is up to an
+implementation to decide how external data structures map into and
+out of null values of different types. Typically, a null value means either the
+zero value or, in the case of record fields, an optional field whose
+value is not present, though these semantics are not explicitly
+defined by ZSON.
+
+## 3. Examples
+
+The simplest ZSON value is a single value, perhaps a string like this:
+```
+"hello, world"
+```
+There's no need for a type declaration here. It's explicitly a string.
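+
+Since `zq` reads and writes ZSON, decorators are easy to experiment with
+on the command line. For example, assuming `zq` is installed, piping a
+decorated integer through it should echo the value back with its type:
+```
+echo '80 (uint16)' | zq -z -
+```
+which should produce
+```
+80(uint16)
+```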
+ +A relational table might look like this: +``` +{ city: "Berkeley", state: "CA", population: 121643 (uint32) } (=city_schema) +{ city: "Broad Cove", state: "ME", population: 806 (uint32) } (=city_schema) +{ city: "Baton Rouge", state: "LA", population: 221599 (uint32) } (=city_schema) +``` +This ZSON text here depicts three record values. It defines a type called `city_schema` +and the inferred type of the `city_schema` has the signature: +``` +{ city:string, state:string, population:uint32 } +``` +When all the values in a sequence have the same record type, the sequence +can be interpreted as a _table_, where the ZSON record values form the _rows_ +and the fields of the records form the _columns_. In this way, these +three records form a relational table conforming to the schema `city_schema`. + +In contrast, a ZSON text representing a semi-structured sequence of log lines +might look like this: +``` +{ + info: "Connection Example", + src: { addr: 10.1.1.2, port: 80 (uint16) } (=socket), + dst: { addr: 10.0.1.2, port: 20130 (uint16) } (=socket) +} (=conn) +{ + info: "Connection Example 2", + src: { addr: 10.1.1.8, port: 80 (uint16) } (=socket), + dst: { addr: 10.1.2.88, port: 19801 (uint16) } (=socket) +} (=conn) +{ + info: "Access List Example", + nets: [ 10.1.1.0/24, 10.1.2.0/24 ] +} (=access_list) +{ metric: "A", ts: 2020-11-24T08:44:09.586441-08:00, value: 120 } +{ metric: "B", ts: 2020-11-24T08:44:20.726057-08:00, value: 0.86 } +{ metric: "A", ts: 2020-11-24T08:44:32.201458-08:00, value: 126 } +{ metric: "C", ts: 2020-11-24T08:44:43.547506-08:00, value: { x:10, y:101 } } +``` +In this case, the first record defines not just a record type +with named type `conn`, but also a second embedded record type called `socket`. +The parenthesized decorators are used where a type is not inferred from +the value itself: +* `socket` is a record with typed fields `addr` and `port` where `port` is an unsigned 16-bit integer, and +* `conn` is a record with typed fields `info`, `src`, and `dst`. + +The subsequent value defines a type called `access_list`. In this case, +the `nets` field is an array of networks and illustrates the helpful range of +primitive types in ZSON. Note that the syntax here implies +the type of the array, as it is inferred from the type of the elements. + +Finally, there are four more values that show ZSON's efficacy for +representing metrics. Here, there are no type decorators as all of the field +types are implied by their syntax, and hence, the top-level record type is implied. +For instance, the `ts` field is an RFC 3339 date and time string, +unambiguously the primitive type `time`. Further, +note that the `value` field takes on different types and even a complex record +type on the last line. In this case, there is a different top-level +record type implied by each of the three variations of type of the `value` field. + +## 4. Grammar + +Here is a left-recursive pseudo-grammar of ZSON. Note that not all +acceptable inputs are semantically valid as type mismatches may arise. +For example, union and enum values must both appear in a context +that defines their type. + +``` + = | | + + = . 
+
+ = | | 
+
+ = "(" "=" ")"
+
+ = "(" ")" | "(" ")"
+
+ = | | | | | | 
+
+ = primitive value as defined above
+
+ = "{" "}" | "{" "}"
+
+ = "," | 
+
+ = ":" 
+
+ = | 
+
+ = quoted string as defined above
+
+ = as defined above
+
+ = "[" "]" | "[" "]"
+
+ = "," | 
+
+ = "|[" "]|" | "|[" "]|"
+
+ = "%" ( | )
+
+ = "|{" "}|" | "|{" "}|"
+
+ = | "," 
+
+ = ":" 
+
+ = "<" ">"
+
+ = "error(" ")"
+
+ = | | | | 
+ | | | 
+ | | | 
+
+ = uint8 | uint16 | etc. as defined above
+
+ = "{" "}" | "{" "}"
+
+ = "," | 
+
+ = ":" 
+
+ = "[" "]" | "[" "]"
+
+ = "|[" "]|" | "|[" "]|"
+
+ = "(" "," ")"
+
+ = "," | 
+
+ = "enum(" ")"
+
+ = "," | 
+
+ = "{" "," "}"
+
+ = = 
+
+ = as defined above
+
+ = [0-9]+
+
+ = "error(" ")"
+```
diff --git a/versioned_docs/version-v1.15.0/install.md b/versioned_docs/version-v1.15.0/install.md
new file mode 100644
index 00000000..bb698667
--- /dev/null
+++ b/versioned_docs/version-v1.15.0/install.md
@@ -0,0 +1,105 @@
+---
+sidebar_position: 2
+sidebar_label: Installation
+---
+
+# Installation
+
+Several options for installing `zq` and/or `zed` are available:
+* [Homebrew](#homebrew) for macOS or Linux,
+* [Binary Download](#binary-download), or
+* [Build from Source](#building-from-source).
+
+To install the Zed Python client, see the
+[Python library documentation](libraries/python.md).
+
+## Homebrew
+
+On macOS and Linux, you can use [Homebrew](https://brew.sh/) to install `zq`:
+
+```bash
+brew install brimdata/tap/zq
+```
+
+Similarly, to install `zed` for working with Zed lakes:
+```bash
+brew install brimdata/tap/zed
+```
+
+Once installed, run a [quick test](#quick-tests).
+
+## Binary Download
+
+We offer pre-built binaries for macOS, Windows and Linux for both x86 and arm
+architectures on the Zed [GitHub Releases page](https://github.com/brimdata/zed/releases).
+
+Each archive includes the builds for `zq` and `zed`.
+
+Once installed, run a [quick test](#quick-tests).
+
+## Building from source
+
+If you have Go installed, you can easily build `zed` from source:
+
+```bash
+go install github.com/brimdata/zed/cmd/{zed,zq}@latest
+```
+
+This installs the `zed` and `zq` binaries in your `$GOPATH/bin`.
+
+> If you don't have Go installed, download and install it from the
+> [Go install page](https://golang.org/doc/install). Go 1.21 or later is
+> required.
+
+Once installed, run a [quick test](#quick-tests).
+
+## Quick Tests
+
+`zq` and `zed` are easy to test as they are completely self-contained
+command-line tools and require no external dependencies to run.
+
+### Test zq
+
+To test `zq`, simply run this command in your shell:
+```mdtest-command
+echo '"hello, world"' | zq -z -
+```
+which should produce
+```mdtest-output
+"hello, world"
+```
+
+### Test zed
+
+To test `zed`, we'll make a lake in `./scratch`, load data, and query it
+as follows:
+```
+export ZED_LAKE=./scratch
+zed init
+zed create Demo
+echo '{s:"hello, world"}' | zed load -use Demo -
+zed query "from Demo"
+```
+which should display
+```
+{s:"hello, world"}
+```
+Alternatively, you can run a Zed lake service, load it with data using `zed load`,
+and hit the API.
+
+In one shell, run the server:
+```
+zed init -lake scratch
+zed serve -lake scratch
+```
+And in another shell, run the client:
+```
+zed create Demo
+zed use Demo
+echo '{s:"hello, world"}' | zed load -
+zed query "from Demo"
+```
+which should also display
+```
+{s:"hello, world"}
+```
diff --git a/versioned_docs/version-v1.15.0/integrations/_category_.yaml b/versioned_docs/version-v1.15.0/integrations/_category_.yaml
new file mode 100644
index 00000000..034b9479
--- /dev/null
+++ b/versioned_docs/version-v1.15.0/integrations/_category_.yaml
@@ -0,0 +1,2 @@
+position: 9
+label: Integrations
diff --git a/versioned_docs/version-v1.15.0/integrations/allowed-callback-urls.png b/versioned_docs/version-v1.15.0/integrations/allowed-callback-urls.png
new file mode 100644
index 00000000..7b65a193
Binary files /dev/null and b/versioned_docs/version-v1.15.0/integrations/allowed-callback-urls.png differ
diff --git a/versioned_docs/version-v1.15.0/integrations/amazon-s3.md b/versioned_docs/version-v1.15.0/integrations/amazon-s3.md
new file mode 100644
index 00000000..04376f02
--- /dev/null
+++ b/versioned_docs/version-v1.15.0/integrations/amazon-s3.md
@@ -0,0 +1,48 @@
+---
+sidebar_position: 1
+sidebar_label: Amazon S3
+---
+
+# Amazon S3
+
+Zed tools can access [Amazon S3](https://aws.amazon.com/s3/) and
+S3-compatible storage via `s3://` URIs. Details are described below.
+
+## Region
+
+You must specify an AWS region via one of the following:
+* The `AWS_REGION` environment variable
+* The `~/.aws/config` file
+* The file specified by the `AWS_CONFIG_FILE` environment variable
+
+You can create `~/.aws/config` by installing the
+[AWS CLI](https://aws.amazon.com/cli/) and running `aws configure`.
+
+:::tip Note
+If using S3-compatible storage that does not recognize the concept of regions,
+a region must still be specified, e.g., by providing a dummy value for
+`AWS_REGION`.
+:::
+
+## Credentials
+
+You must specify AWS credentials via one of the following:
+* The `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` environment variables
+* The `~/.aws/credentials` file
+* The file specified by the `AWS_SHARED_CREDENTIALS_FILE` environment variable
+
+You can create `~/.aws/credentials` by installing the
+[AWS CLI](https://aws.amazon.com/cli/) and running `aws configure`.
+
+## Endpoint
+
+To use S3-compatible storage not provided by AWS, set the `AWS_S3_ENDPOINT`
+environment variable to the hostname or URI of the provider.
+
+## Wildcard Support
+
+[Like the AWS CLI tools themselves](https://repost.aws/knowledge-center/s3-event-notification-filter-wildcard),
+Zed does not currently expand UNIX-style `*` wildcards in S3 URIs. If you
+find this limitation is impacting your workflow, please add your use case
+details as a comment in issue [zed/1994](https://github.com/brimdata/zed/issues/1994)
+to help us track the priority of possible enhancements in this area.
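+
+While wildcards aren't expanded, a fully-specified object URI works
+directly. For example, with region and credentials configured as above,
+`zq` can query an object in S3 (the bucket and key here are hypothetical
+placeholders):
+```
+export AWS_REGION=us-east-2
+zq -z "count()" s3://your-bucket/path/sample.zng
+```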
diff --git a/versioned_docs/version-v1.15.0/integrations/api-allow-offline-access.png b/versioned_docs/version-v1.15.0/integrations/api-allow-offline-access.png new file mode 100644 index 00000000..3f14ba42 Binary files /dev/null and b/versioned_docs/version-v1.15.0/integrations/api-allow-offline-access.png differ diff --git a/versioned_docs/version-v1.15.0/integrations/api-name-identifier.png b/versioned_docs/version-v1.15.0/integrations/api-name-identifier.png new file mode 100644 index 00000000..0b52288a Binary files /dev/null and b/versioned_docs/version-v1.15.0/integrations/api-name-identifier.png differ diff --git a/versioned_docs/version-v1.15.0/integrations/application-name.png b/versioned_docs/version-v1.15.0/integrations/application-name.png new file mode 100644 index 00000000..a133c006 Binary files /dev/null and b/versioned_docs/version-v1.15.0/integrations/application-name.png differ diff --git a/versioned_docs/version-v1.15.0/integrations/click-api-settings.png b/versioned_docs/version-v1.15.0/integrations/click-api-settings.png new file mode 100644 index 00000000..e469a6c4 Binary files /dev/null and b/versioned_docs/version-v1.15.0/integrations/click-api-settings.png differ diff --git a/versioned_docs/version-v1.15.0/integrations/click-application-settings.png b/versioned_docs/version-v1.15.0/integrations/click-application-settings.png new file mode 100644 index 00000000..ee7678b3 Binary files /dev/null and b/versioned_docs/version-v1.15.0/integrations/click-application-settings.png differ diff --git a/versioned_docs/version-v1.15.0/integrations/click-grant-types.png b/versioned_docs/version-v1.15.0/integrations/click-grant-types.png new file mode 100644 index 00000000..c75b11ab Binary files /dev/null and b/versioned_docs/version-v1.15.0/integrations/click-grant-types.png differ diff --git a/versioned_docs/version-v1.15.0/integrations/create-api.png b/versioned_docs/version-v1.15.0/integrations/create-api.png new file mode 100644 index 00000000..91c36a21 Binary files /dev/null and b/versioned_docs/version-v1.15.0/integrations/create-api.png differ diff --git a/versioned_docs/version-v1.15.0/integrations/create-application.png b/versioned_docs/version-v1.15.0/integrations/create-application.png new file mode 100644 index 00000000..8dc21e02 Binary files /dev/null and b/versioned_docs/version-v1.15.0/integrations/create-application.png differ diff --git a/versioned_docs/version-v1.15.0/integrations/device-code-grant-type.png b/versioned_docs/version-v1.15.0/integrations/device-code-grant-type.png new file mode 100644 index 00000000..0fd87714 Binary files /dev/null and b/versioned_docs/version-v1.15.0/integrations/device-code-grant-type.png differ diff --git a/versioned_docs/version-v1.15.0/integrations/domain-client-id.png b/versioned_docs/version-v1.15.0/integrations/domain-client-id.png new file mode 100644 index 00000000..95832d03 Binary files /dev/null and b/versioned_docs/version-v1.15.0/integrations/domain-client-id.png differ diff --git a/versioned_docs/version-v1.15.0/integrations/expand-advanced-settings.png b/versioned_docs/version-v1.15.0/integrations/expand-advanced-settings.png new file mode 100644 index 00000000..3d4e3c88 Binary files /dev/null and b/versioned_docs/version-v1.15.0/integrations/expand-advanced-settings.png differ diff --git a/versioned_docs/version-v1.15.0/integrations/login-flow-thumbnail.png b/versioned_docs/version-v1.15.0/integrations/login-flow-thumbnail.png new file mode 100644 index 00000000..c9058b5c Binary files /dev/null and 
b/versioned_docs/version-v1.15.0/integrations/login-flow-thumbnail.png differ diff --git a/versioned_docs/version-v1.15.0/integrations/save-application-settings.png b/versioned_docs/version-v1.15.0/integrations/save-application-settings.png new file mode 100644 index 00000000..12a7cced Binary files /dev/null and b/versioned_docs/version-v1.15.0/integrations/save-application-settings.png differ diff --git a/versioned_docs/version-v1.15.0/integrations/zed-lake-auth.md b/versioned_docs/version-v1.15.0/integrations/zed-lake-auth.md new file mode 100644 index 00000000..07ce6bd5 --- /dev/null +++ b/versioned_docs/version-v1.15.0/integrations/zed-lake-auth.md @@ -0,0 +1,176 @@ +--- +sidebar_position: 2 +sidebar_label: Authentication Configuration +--- + +# Configuring Authentication for a Zed Lake Service + +A [Zed lake service](../commands/zed.md#serve) may be configured to require +user authentication to be accessed from clients such as the +[Zui](https://zui.brimdata.io/) application, the +[`zed`](../commands/zed.md) CLI tools, or the +[Zed Python client](../libraries/python.md). This document describes a simple +[Auth0](https://auth0.com) configuration with accompanying `zed serve` flags +that can be used as a starting point for creating similar configurations in +your own environment. + +## Environment + +Our example environment consists of the following: + +1. An Auth0 trial account containing a verified user `testuser@brimdata.io` +2. A Linux VM with IP address `192.168.5.75` on which we'll run our Zed lake service +3. A macOS desktop on which we'll run our Zui app and `zed` CLI tooling + +## Auth0 API Configuration + +1. Begin creating a new API by clicking **APIs** in the left navigation menu +and then clicking the **Create API** button. + + ![create-api](create-api.png) + +2. Enter any **Name** and URL **Identifier** for the API, then click the +**Create** button. +:::tip +Note the value you enter for the **Identifier** as you'll +need it later for the Zed lake service configuration. +::: + + ![api-name-identifier](api-name-identifier.png) + +3. Click the **Settings** tab for your newly created API. + + ![click-api-settings](click-api-settings.png) + +4. Scroll down in the **Settings** tab and click the toggle to enable +**Allow Offline Access**, then click the **Save** button. + + ![api-allow-offline-access](api-allow-offline-access.png) + +## Auth0 Application Configuration + +1. Begin creating a new application by clicking **Applications** in the left +navigation menu and then clicking the **Create Application** button. +:::tip Note +Neither the "Zed lake (Test Application)" that was created for us +automatically when we created our API nor the Default App that came with the +trial are used in this configuration. +::: + + ![create-application](create-application.png) + +2. Enter any **Name** for the application. Keeping the default selection for +creating a **Native** app, click the **Create** button. + + ![application-name](application-name.png) + +3. Click the **Settings** tab for your newly created application. + + ![click-application-settings](click-application-settings.png) + +4. On the **Settings** tab, note the **Domain** and **Client ID** values, as +you'll need them later for the Zed lake service configuration. + + ![domain-client-id](domain-client-id.png) + +5. If you wish to allow authentication from the Zui desktop application, +scroll down in the **Settings** tab to the **Allowed Callback URLs** and +enter the value `zui://auth/auth0/callback`. 
+ + ![allowed-callback-urls](allowed-callback-urls.png) + +6. If you wish to allow authentication from the Zed CLI tooling and/or the +Python client, scroll down in the **Settings** tab and expand the +**Advanced Settings**, then click the **Grant Types** tab, then click the +checkbox to enable the **Device Code** grant type. + + ![expand-advanced-settings](expand-advanced-settings.png) + + ![click-grant-types](click-grant-types.png) + + ![device-code-grant-type](device-code-grant-type.png) + +7. Scroll down to the bottom of the **Settings** tab and click the +**Save Changes** button. + + ![save-application-settings](save-application-settings.png) + +## Zed Lake Service Configuration + +1. Login to our Linux VM and [install](../install.md#building-from-source) +the most recent Zed tools from source. + + ``` + $ go install github.com/brimdata/zed/cmd/zed@latest + go: downloading github.com/brimdata/zed v1.6.0 + + $ zed -version + Version: v1.6.0 + ``` + +2. Set shell variables with the values you noted previously from the +Auth0 configuration for the API **Identifier** and the **Domain** and +**Client ID** values from the application configuration. Make sure the +`https://` prefix is included in the **Identifier** value. + + ``` + $ auth0_api_identifier=https://zedlakeapi.example.com + $ auth0_clientid=9ooDnoufvD56HNqSM0KqsPLoB3XS55i3 + $ auth0_domain=https://dev-6actj4z0ihvkuzeh.us.auth0.com + ``` + +3. Download the JSON Web Key Set for the domain. + + ``` + $ curl -O $auth0_domain/.well-known/jwks.json + ``` + +4. Start the Zed service, specifying the required flags for the +authentication configuration along with a directory name for lake storage. + + ``` + $ zed serve \ + -auth.enabled \ + -auth.clientid=$auth0_clientid \ + -auth.domain=$auth0_domain \ + -auth.jwkspath=jwks.json \ + -auth.audience=$auth0_api_identifier \ + -lake=lake + + {"level":"info","ts":1678909988.9797907,"logger":"core","msg":"Started"} + {"level":"info","ts":1678909988.9804773,"logger":"httpd","msg":"Listening","addr":"[::]:9867"} + ... + ``` + +## Authenticated Login Flow + +Now that we've configured both Auth0 and the Zed lake service, we can test the +authenticated login flow with our Zed clients. The video below shows this +using [Zui v1.0.0](https://github.com/brimdata/zui/releases/tag/v1.0.0) +and the Zed CLI tooling and Zed Python client +[v1.6.0](https://github.com/brimdata/zed/releases/tag/v1.6.0). + +### Summary: + +1. After clicking **Add Lake** in Zui and entering a **Name** and our +**Lake URL**, the Zed lake service redirects the user to Auth0 to complete +authentication. Once authentication succeeds, Auth0 redirects back to Zui and +the user can begin operating with the lake. The user can log out by clicking +the lake URL in the left navigation menu, then clicking **Get info**, then +clicking **Logout**. + +2. Before starting the authentication flow with the Zed CLI tooling, ensure the +`ZED_LAKE` environment variable is pointing at the appropriate lake URL. Once +`zed auth login` is executed, the Zed lake service redirects the user to +Auth0 where a device code is displayed that can be compared against the one +shown by the Zed CLI. Once the user clicks the **Confirm** button, the +user has authenticated and can begin operating with the lake. The credentials +are cached in `$HOME/.zed`. The user can log out via `zed auth logout`. + +3. 
 As the Python client depends on the same authentication used by the Zed CLI
+tooling, you can begin authenticated operations in the Python client after
+completing the previous step.
+
+### Video:
+
+[![login-flow-thumbnail](login-flow-thumbnail.png)](https://www.youtube.com/watch?v=iXK_9gd6obQ)
diff --git a/versioned_docs/version-v1.15.0/integrations/zeek/README.md b/versioned_docs/version-v1.15.0/integrations/zeek/README.md
new file mode 100644
index 00000000..7d87eceb
--- /dev/null
+++ b/versioned_docs/version-v1.15.0/integrations/zeek/README.md
@@ -0,0 +1,10 @@
+# Zed Interoperability with Zeek Logs
+
+Zed includes functionality and reference configurations specific to working
+with logs from the [Zeek](https://zeek.org/) open source network security
+monitoring tool. Depending on how you use Zeek, one or more of the following
+docs may be of interest to you.
+
+* [Reading Zeek Log Formats](reading-zeek-log-formats.md)
+* [Zed/Zeek Data Type Compatibility](data-type-compatibility.md)
+* [Shaping Zeek NDJSON](shaping-zeek-ndjson.md)
diff --git a/versioned_docs/version-v1.15.0/integrations/zeek/_category_.yaml b/versioned_docs/version-v1.15.0/integrations/zeek/_category_.yaml
new file mode 100644
index 00000000..a0a8034a
--- /dev/null
+++ b/versioned_docs/version-v1.15.0/integrations/zeek/_category_.yaml
@@ -0,0 +1,2 @@
+position: 3
+label: Zeek
diff --git a/versioned_docs/version-v1.15.0/integrations/zeek/data-type-compatibility.md b/versioned_docs/version-v1.15.0/integrations/zeek/data-type-compatibility.md
new file mode 100644
index 00000000..24913125
--- /dev/null
+++ b/versioned_docs/version-v1.15.0/integrations/zeek/data-type-compatibility.md
@@ -0,0 +1,270 @@
+---
+sidebar_position: 2
+sidebar_label: Zed/Zeek Data Type Compatibility
+---
+
+# Zed/Zeek Data Type Compatibility
+
+As the Zed data model was in many ways inspired by the
+[Zeek TSV log format](https://docs.zeek.org/en/master/log-formats.html#zeek-tsv-format-logs),
+the rich Zed storage formats ([ZSON](../../formats/zson.md),
+[ZNG](../../formats/zng.md), etc.) maintain comprehensive interoperability
+with Zeek. When Zeek is configured to output its logs in
+NDJSON format, much of the rich type information is lost in translation, but
+this can be restored by following the guidance for [shaping Zeek NDJSON](shaping-zeek-ndjson.md).
+On the other hand, Zeek TSV can be converted to Zed storage formats and back to
+Zeek TSV without any loss of information.
+
+This document describes how the Zed type system is able to represent each of
+the types that may appear in Zeek logs.
+
+Tools like [`zq`](../../commands/zq.md) and
+[Zui](https://zui.brimdata.io/) maintain an internal Zed-typed
+representation of any Zeek data that is read or imported. Therefore, knowing
+the equivalent types will prove useful when performing operations in the
+[Zed language](../../language/README.md) such as
+[type casting](../../language/data-types.md) or looking at the data
+when output as ZSON.
+
+## Equivalent Types
+
+The following table summarizes which Zed data type corresponds to each
+[Zeek data type](https://docs.zeek.org/en/current/script-reference/types.html)
+that may appear in a Zeek TSV log. While most types have a simple 1-to-1
+mapping from Zeek to Zed and back to Zeek again, the sections linked from the
+**Additional Detail** column describe cosmetic differences and other subtleties
+applicable to handling certain types.
+ +| Zeek Type | Zed Type | Additional Detail | +|------------|------------|-------------------| +| [`bool`](https://docs.zeek.org/en/current/script-reference/types.html#type-bool) | [`bool`](../../formats/zson.md#23-primitive-values) | | +| [`count`](https://docs.zeek.org/en/current/script-reference/types.html#type-count) | [`uint64`](../../formats/zson.md#23-primitive-values) | | +| [`int`](https://docs.zeek.org/en/current/script-reference/types.html#type-int) | [`int64`](../../formats/zson.md#23-primitive-values) | | +| [`double`](https://docs.zeek.org/en/current/script-reference/types.html#type-double) | [`float64`](../../formats/zson.md#23-primitive-values) | See [`double` details](#double) | +| [`time`](https://docs.zeek.org/en/current/script-reference/types.html#type-time) | [`time`](../../formats/zson.md#23-primitive-values) | | +| [`interval`](https://docs.zeek.org/en/current/script-reference/types.html#type-interval) | [`duration`](../../formats/zson.md#23-primitive-values) | | +| [`string`](https://docs.zeek.org/en/current/script-reference/types.html#type-string) | [`string`](../../formats/zson.md#23-primitive-values) | See [`string` details about escaping](#string) | +| [`port`](https://docs.zeek.org/en/current/script-reference/types.html#type-port) | [`uint16`](../../formats/zson.md#23-primitive-values) | See [`port` details](#port) | +| [`addr`](https://docs.zeek.org/en/current/script-reference/types.html#type-addr) | [`ip`](../../formats/zson.md#23-primitive-values) | | +| [`subnet`](https://docs.zeek.org/en/current/script-reference/types.html#type-subnet) | [`net`](../../formats/zson.md#23-primitive-values) | | +| [`enum`](https://docs.zeek.org/en/current/script-reference/types.html#type-enum) | [`string`](../../formats/zson.md#23-primitive-values) | See [`enum` details](#enum) | +| [`set`](https://docs.zeek.org/en/current/script-reference/types.html#type-set) | [`set`](../../formats/zson.md#243-set-value) | See [`set` details](#set) | +| [`vector`](https://docs.zeek.org/en/current/script-reference/types.html#type-vector) | [`array`](../../formats/zson.md#242-array-value) | | +| [`record`](https://docs.zeek.org/en/current/script-reference/types.html#type-record) | [`record`](../../formats/zson.md#241-record-value) | See [`record` details](#record) | + +> **Note:** The [Zeek data type](https://docs.zeek.org/en/current/script-reference/types.html) +> page describes the types in the context of the +> [Zeek scripting language](https://docs.zeek.org/en/master/scripting/index.html). +> The Zeek types available in scripting are a superset of the data types that +> may appear in Zeek log files. The encodings of the types also differ in some +> ways between the two contexts. However, we link to this reference because +> there is no authoritative specification of the Zeek TSV log format. + +## Example + +The following example shows a TSV log that includes each Zeek data type, how +it's output as ZSON by `zq`, and then how it's written back out again as a Zeek +log. You may find it helpful to refer to this example when reading the +[Type-Specific Details](#type-specific-details) sections. 
+ +#### Viewing the TSV log: + +``` +cat zeek_types.log +``` + +#### Output: + +```mdtest-input zeek_types.log +#separator \x09 +#set_separator , +#empty_field (empty) +#unset_field - +#fields my_bool my_count my_int my_double my_time my_interval my_printable_string my_bytes_string my_port my_addr my_subnet my_enum my_set my_vector my_record.name my_record.age +#types bool count int double time interval string string port addr subnet enum set[string] vector[string] string count +T 123 456 123.4560 1592502151.123456 123.456 smile😁smile \x09\x07\x04 80 127.0.0.1 10.0.0.0/8 tcp things,in,a,set order,is,important Jeanne 122 +``` + +#### Reading the TSV log, outputting as ZSON, and saving a copy: + +```mdtest-command +zq -Z zeek_types.log | tee zeek_types.zson +``` + +#### Output: + +```mdtest-output +{ + my_bool: true, + my_count: 123 (uint64), + my_int: 456, + my_double: 123.456, + my_time: 2020-06-18T17:42:31.123456Z, + my_interval: 2m3.456s, + my_printable_string: "smile😁smile", + my_bytes_string: "\t\u0007\u0004", + my_port: 80 (port=uint16), + my_addr: 127.0.0.1, + my_subnet: 10.0.0.0/8, + my_enum: "tcp" (=zenum), + my_set: |[ + "a", + "in", + "set", + "things" + ]|, + my_vector: [ + "order", + "is", + "important" + ], + my_record: { + name: "Jeanne", + age: 122 (uint64) + } +} +``` + +#### Reading the saved ZSON output and outputting as Zeek TSV: + +```mdtest-command +zq -f zeek zeek_types.zson +``` + +#### Output: +```mdtest-output +#separator \x09 +#set_separator , +#empty_field (empty) +#unset_field - +#fields my_bool my_count my_int my_double my_time my_interval my_printable_string my_bytes_string my_port my_addr my_subnet my_enum my_set my_vector my_record.name my_record.age +#types bool count int double time interval string string port addr subnet enum set[string] vector[string] string count +T 123 456 123.456 1592502151.123456 123.456000 smile😁smile \x09\x07\x04 80 127.0.0.1 10.0.0.0/8 tcp a,in,set,things order,is,important Jeanne 122 +``` + +## Type-Specific Details + +As `zq` acts as a reference implementation for Zed storage formats such as +ZSON and ZNG, it's helpful to understand how it reads the following Zeek data +types into readable text equivalents in the ZSON format, then writes them back +out again in the Zeek TSV log format. Other implementations of the Zed storage +formats (should they exist) may handle these differently. + +Multiple Zeek types discussed below are represented via a +[type definition](../../formats/zson.md#22-type-decorators) to one of Zed's +[primitive types](../../formats/zson.md#23-primitive-values). The Zed type +definitions maintain the history of the field's original Zeek type name +such that `zq` may restore it if the field is later output in +Zeek format. Knowledge of its original Zeek type may also enable special +operations in Zed that are unique to values known to have originated as a +specific Zeek type, though no such operations are currently implemented in +`zq`. + +### `double` + +As they do not affect accuracy, "trailing zero" decimal digits on Zeek `double` +values will _not_ be preserved when they are formatted into a string, such as +via the ZSON/Zeek/table output options in `zq` (e.g., `123.4560` becomes +`123.456`). + +### `enum` + +As they're encountered in common programming languages, enum variables +typically hold one of a set of predefined values. 
While this is
+how Zeek's `enum` type behaves inside the Zeek scripting language,
+when the `enum` type is output in a Zeek log, the log does not communicate
+any such set of "allowed" values as they were originally defined. Therefore,
+these values are represented with a ZSON type name bound to the Zed `string`
+type. See the text above regarding [type definitions](#type-specific-details)
+for more details.
+
+### `port`
+
+The numeric values that appear in Zeek logs under this type are represented
+in ZSON with a type name of `port` bound to the `uint16` type. See the text
+above regarding [type names](#type-specific-details) for more details.
+
+### `set`
+
+Because order within sets is not significant, no attempt is made to maintain
+the order of `set` elements as they originally appeared in a Zeek log.
+
+### `string`
+
+Zeek's `string` data type is complicated by its ability to hold printable ASCII
+and UTF-8 as well as arbitrary unprintable bytes represented as `\x` escapes.
+Because such binary data may need to legitimately be captured (e.g. to record
+the symptoms of DNS exfiltration), it's helpful that Zeek has a mechanism to
+log it. Unfortunately, Zeek's use of the single `string` type for these
+multiple uses leaves out important details about the intended interpretation
+and presentation of the bytes that make up the value. For instance, one Zeek
+`string` field may hold arbitrary network data that _coincidentally_ sometimes
+forms byte sequences that could be interpreted as printable UTF-8, but they are
+_not_ intended to be read or presented as such. Meanwhile, another Zeek
+`string` field may be populated such that it will _only_ ever contain printable
+UTF-8. These details are currently only captured within the Zeek source code
+itself that defines how these values are generated.
+
+Zed includes a [primitive type](../../formats/zson.md#23-primitive-values)
+called `bytes` that's suited to storing the former "always binary" case and a
+`string` type for the latter "always printable" case. However, Zeek logs do
+not currently communicate details that would allow an implementation to know
+which Zeek `string` fields to store as which of these two Zed data types.
+Instead, the Zed system does what the Zeek system does when writing strings
+to JSON: any `\x` escapes used in Zeek TSV strings are translated into valid
+Zed UTF-8 strings by escaping the backslash before the `x`. In this way,
+you can still see binary-corrupted strings that are generated by Zeek in
+the Zed data formats.
+
+Unfortunately, there is no way to distinguish whether a `\x` escape occurred
+or whether that string pattern happened to occur in the original data. A nice
+solution would be to convert Zeek strings that are valid UTF-8 strings into
+Zed strings and convert invalid strings into a Zed `bytes` type, or we could
+convert both of them into a Zed union of `string` and `bytes`. If you have
+interest in a capability like this, please let us know and we can elevate
+the priority.
+
+If Zeek were to provide an option to output logs directly in one or more of
+Zed's richer storage formats, this would create an opportunity to
+assign the appropriate Zed `bytes` or `string` type at the point of origin,
+depending on what's known about how the field's value is intended to be
+populated and used.
+
+### `record`
+
+Zeek's `record` type is unique in that every Zeek log line effectively _is_ a
+record, with its schema defined via the `#fields` and `#types` directives in
+the headers of each log file. 
The word "record" never appears explicitly in +the schema definition in Zeek logs. + +Embedded records also subtly appear within Zeek log lines in the form of +dot-separated field names. A common example in Zeek is the +[`id`](https://docs.zeek.org/en/current/scripts/base/init-bare.zeek.html#type-conn_id) +record, which captures the source and destination IP addresses and ports for a +network connection as fields `id.orig_h`, `id.orig_p`, `id.resp_h`, and +`id.resp_p`. When reading such fields into their Zed equivalent, `zq` restores +the hierarchical nature of the record as it originally existed inside of Zeek +itself before it was output by its logging system. This enables operations in +Zed that refer to the record at a higher level but affect all values lower +down in the record hierarchy. + +Revisiting the data from our example, we can output all fields within +`my_record` via a Zed [`cut`](../../language/operators/cut.md) operation. + +#### Command: + +```mdtest-command +zq -f zeek 'cut my_record' zeek_types.zson +``` + +#### Output: + +```mdtest-output +#separator \x09 +#set_separator , +#empty_field (empty) +#unset_field - +#fields my_record.name my_record.age +#types string count +Jeanne 122 +``` diff --git a/versioned_docs/version-v1.15.0/integrations/zeek/reading-zeek-log-formats.md b/versioned_docs/version-v1.15.0/integrations/zeek/reading-zeek-log-formats.md new file mode 100644 index 00000000..67b0df48 --- /dev/null +++ b/versioned_docs/version-v1.15.0/integrations/zeek/reading-zeek-log-formats.md @@ -0,0 +1,173 @@ +--- +sidebar_position: 1 +sidebar_label: Reading Zeek Log Formats +--- + +# Reading Zeek Log Formats + +Zed is capable of reading both common Zeek log formats. This document +provides guidance for what to expect when reading logs of these formats using +the Zed tools such as `zq`. + +## Zeek TSV + +[Zeek TSV](https://docs.zeek.org/en/master/log-formats.html#zeek-tsv-format-logs) +is Zeek's default output format for logs. This format can be read automatically +(i.e., no `-i` command line flag is necessary to indicate the input format) +with the Zed tools such as `zq`. + +The following example shows a TSV `conn.log` being read via `zq` and +output as [ZSON](../../formats/zson.md). 
+ +#### conn.log + +```mdtest-input conn.log +#separator \x09 +#set_separator , +#empty_field (empty) +#unset_field - +#path conn +#open 2019-11-08-11-44-16 +#fields ts uid id.orig_h id.orig_p id.resp_h id.resp_p proto service duration orig_bytes resp_bytes conn_state local_orig local_resp missed_bytes history orig_pkts orig_ip_bytes resp_pkts resp_ip_bytes tunnel_parents +#types time string addr port addr port enum string interval count count string bool bool count string count count count count set[string] +1521911721.255387 C8Tful1TvM3Zf5x8fl 10.164.94.120 39681 10.47.3.155 3389 tcp - 0.004266 97 19 RSTR - - 0 ShADTdtr 10 730 6 342 - +``` + +#### Example + +```mdtest-command +zq -Z 'head 1' conn.log +``` + +#### Output +```mdtest-output +{ + _path: "conn", + ts: 2018-03-24T17:15:21.255387Z, + uid: "C8Tful1TvM3Zf5x8fl", + id: { + orig_h: 10.164.94.120, + orig_p: 39681 (port=uint16), + resp_h: 10.47.3.155, + resp_p: 3389 (port) + }, + proto: "tcp" (=zenum), + service: null (string), + duration: 4.266ms, + orig_bytes: 97 (uint64), + resp_bytes: 19 (uint64), + conn_state: "RSTR", + local_orig: null (bool), + local_resp: null (bool), + missed_bytes: 0 (uint64), + history: "ShADTdtr", + orig_pkts: 10 (uint64), + orig_ip_bytes: 730 (uint64), + resp_pkts: 6 (uint64), + resp_ip_bytes: 342 (uint64), + tunnel_parents: null (|[string]|) +} +``` + +Other than Zed, Zeek provides one of the richest data typing systems available +and therefore such records typically need no adjustment to their data types +once they've been read in as is. The +[Zed/Zeek Data Type Compatibility](data-type-compatibility.md) document +provides further detail on how the rich data types in Zeek TSV map to the +equivalent [rich types in Zed](../../formats/zson.md#23-primitive-values). + +## Zeek NDJSON + +As an alternative to the default TSV format, there are two common ways that +Zeek may instead generate logs in [NDJSON](https://en.wikipedia.org/wiki/JSON_streaming#NDJSON) format. + +1. Using the [JSON Streaming Logs](https://github.com/corelight/json-streaming-logs) + package (recommended for use with Zed) +2. Using the built-in [ASCII logger](https://docs.zeek.org/en/current/scripts/base/frameworks/logging/writers/ascii.zeek.html) + configured with `redef LogAscii::use_json = T;` + +In both cases, Zed tools such as `zq` can read these NDJSON logs automatically +as is, but with caveats. + +Let's revisit the same `conn` record we just examined from the Zeek TSV +log, but now as NDJSON generated using the JSON Streaming Logs package. 
+
+#### conn.ndjson
+
+```mdtest-input conn.ndjson
+{"_path":"conn","_write_ts":"2018-03-24T17:15:21.400275Z","ts":"2018-03-24T17:15:21.255387Z","uid":"C8Tful1TvM3Zf5x8fl","id.orig_h":"10.164.94.120","id.orig_p":39681,"id.resp_h":"10.47.3.155","id.resp_p":3389,"proto":"tcp","duration":0.004266023635864258,"orig_bytes":97,"resp_bytes":19,"conn_state":"RSTR","missed_bytes":0,"history":"ShADTdtr","orig_pkts":10,"orig_ip_bytes":730,"resp_pkts":6,"resp_ip_bytes":342}
+```
+
+#### Example
+
+```mdtest-command
+zq -Z 'head 1' conn.ndjson
+```
+
+#### Output
+```mdtest-output
+{
+    _path: "conn",
+    _write_ts: "2018-03-24T17:15:21.400275Z",
+    ts: "2018-03-24T17:15:21.255387Z",
+    uid: "C8Tful1TvM3Zf5x8fl",
+    "id.orig_h": "10.164.94.120",
+    "id.orig_p": 39681,
+    "id.resp_h": "10.47.3.155",
+    "id.resp_p": 3389,
+    proto: "tcp",
+    duration: 0.004266023635864258,
+    orig_bytes: 97,
+    resp_bytes: 19,
+    conn_state: "RSTR",
+    missed_bytes: 0,
+    history: "ShADTdtr",
+    orig_pkts: 10,
+    orig_ip_bytes: 730,
+    resp_pkts: 6,
+    resp_ip_bytes: 342
+}
+```
+
+When we compare this to the TSV example, we notice a few things right away that
+all follow from the records having been previously output as JSON.
+
+1. The timestamps like `_write_ts` and `ts` are printed as strings rather than
+   the ZSON `time` type.
+2. The IP addresses such as `id.orig_h` and `id.resp_h` are printed as strings
+   rather than the ZSON `ip` type.
+3. The connection `duration` is printed as a floating point number rather than
+   the ZSON `duration` type.
+4. The keys for the null-valued fields in the record read from
+   TSV are not present in the record read from NDJSON.
+
+If you're familiar with the limitations of the JSON data types, it makes sense
+that Zeek chose to output these values in NDJSON as it did. Furthermore, if
+you were just seeking to do quick searches on the string values or simple math
+on the numbers, these limitations may be acceptable. However, if you intended
+to perform operations like
+[aggregations with time-based grouping](../../language/functions/bucket.md)
+or [CIDR matches](../../language/functions/network_of.md)
+on IP addresses, you would likely want to restore the rich Zed data types as
+the records are being read. The document on [Shaping Zeek NDJSON](shaping-zeek-ndjson.md)
+provides details on how this can be done.
+
+## The Role of `_path`
+
+Zeek's `_path` field plays an important role in differentiating between its
+different [log types](https://docs.zeek.org/en/master/script-reference/log-files.html)
+(`conn`, `dns`, etc.). For instance,
+[shaping Zeek NDJSON](shaping-zeek-ndjson.md) relies on the value of
+the `_path` field to know which Zed type to apply to an input NDJSON
+record.
+
+If reading Zeek TSV logs or logs generated by the JSON Streaming Logs
+package, this `_path` value is provided within the Zeek logs. However, if the
+log was generated by Zeek's built-in ASCII logger when using the
+`redef LogAscii::use_json = T;` configuration, the value that would be used for
+`_path` is present in the log _file name_ but is not in the NDJSON log
+records. In this case you could adjust your Zeek configuration by following the
+[Log Extension Fields example](https://docs.zeek.org/en/master/frameworks/logging.html#log-extension-fields)
+from the Zeek docs. If you enter `path` in the locations where the example
+shows `stream`, you will see the field named `_path` populated just as was
+shown for the JSON Streaming Logs output.
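+
+If you only need a quick look at the data with a few rich types restored and
+don't want to write a full shaper, ad hoc casting is also an option. The
+one-liner below is just a sketch (it is not part of any reference
+configuration and restores only three of the fields) that uses the
+`nest_dotted()` function to regroup the dotted field names in the
+`conn.ndjson` sample shown earlier, then casts with the `time()` and `ip()`
+functions:
+
+```
+zq -Z 'yield nest_dotted(this) | ts := time(ts), id.orig_h := ip(id.orig_h), id.resp_h := ip(id.resp_h)' conn.ndjson
+```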
diff --git a/versioned_docs/version-v1.15.0/integrations/zeek/shaping-zeek-ndjson.md b/versioned_docs/version-v1.15.0/integrations/zeek/shaping-zeek-ndjson.md new file mode 100644 index 00000000..2b62cecd --- /dev/null +++ b/versioned_docs/version-v1.15.0/integrations/zeek/shaping-zeek-ndjson.md @@ -0,0 +1,412 @@ +--- +sidebar_position: 3 +sidebar_label: Shaping Zeek NDJSON +--- + +# Shaping Zeek NDJSON + +As described in [Reading Zeek Log Formats](reading-zeek-log-formats.md), +logs output by Zeek in NDJSON format lose much of their rich data typing that +was originally present inside Zeek. This detail can be restored using a Zed +shaper, such as the reference `shaper.zed` described below. + +A full description of all that's possible with shapers is beyond the scope of +this doc. However, this example for shaping Zeek NDJSON is quite simple and +is described below. + +## Zeek Version/Configuration + +The fields and data types in the reference `shaper.zed` reflect the default +NDJSON-format logs output by Zeek releases up to the version number referenced +in the comments at the top of that file. They have been revisited periodically +as new Zeek versions have been released. + +Most changes we've observed in Zeek logs between versions have involved only the +addition of new fields. Because of this, we expect the shaper should be usable +as is for Zeek releases older than the one most recently tested, since fields +in the shaper not present in your environment would just be filled in with +`null` values. + +[Zeek v4.1.0](https://github.com/zeek/zeek/releases/tag/v4.1.0) is the first +release we've seen since starting to maintain this reference shaper where +field names for the same log type have _changed_ between releases. Because +of this, as shown below, the shaper includes `switch` logic that applies +different type definitions based on the observed field names that are known +to be specific to newer Zeek releases. + +All attempts will be made to update this reference shaper in a timely manner +as new Zeek versions are released. However, if you have modified your Zeek +installation with [packages](https://packages.zeek.org/) +or other customizations, or if you are using a [Corelight Sensor](https://corelight.com/products/appliance-sensors/) +that produces Zeek logs with many fields and logs beyond those found in open +source Zeek, the reference shaper will not cover all the fields in your logs. +[As described below](#zed-pipeline), the reference shaper will assign +inferred types to such additional fields. By exploring your data, you can then +iteratively enhance your shaper to match your environment. If you need +assistance, please speak up on our [public Slack](https://www.brimdata.io/join-slack/). + +## Reference Shaper Contents + +The following reference `shaper.zed` may seem large, but ultimately it follows a +fairly simple pattern that repeats across the many [Zeek log types](https://docs.zeek.org/en/master/script-reference/log-files.html). + +```mdtest-input shaper.zed +// This reference Zed shaper for Zeek NDJSON logs was most recently tested with +// Zeek v4.1.0. The fields and data types reflect the default NDJSON +// logs output by that Zeek version when using the JSON Streaming Logs package. 
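+
+// The named types defined below are referenced throughout the rest of this
+// file, both inside other type definitions and as type values (e.g., <conn>)
+// in the "schemas" map near the bottom.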
+ +type port=uint16 +type zenum=string +type conn_id={orig_h:ip,orig_p:port,resp_h:ip,resp_p:port} + +// This first block of type definitions covers the fields we've observed in +// an out-of-the-box Zeek v4.0.3 and earlier, as well as all out-of-the-box +// Zeek v4.1.0 log types except for "ssl" and "x509". + +type broker={_path:string,ts:time,ty:zenum,ev:string,peer:{address:string,bound_port:port},message:string,_write_ts:time} +type capture_loss={_path:string,ts:time,ts_delta:duration,peer:string,gaps:uint64,acks:uint64,percent_lost:float64,_write_ts:time} +type cluster={_path:string,ts:time,node:string,message:string,_write_ts:time} +type config={_path:string,ts:time,id:string,old_value:string,new_value:string,location:string,_write_ts:time} +type conn={_path:string,ts:time,uid:string,id:conn_id,proto:zenum,service:string,duration:duration,orig_bytes:uint64,resp_bytes:uint64,conn_state:string,local_orig:bool,local_resp:bool,missed_bytes:uint64,history:string,orig_pkts:uint64,orig_ip_bytes:uint64,resp_pkts:uint64,resp_ip_bytes:uint64,tunnel_parents:|[string]|,_write_ts:time} +type dce_rpc={_path:string,ts:time,uid:string,id:conn_id,rtt:duration,named_pipe:string,endpoint:string,operation:string,_write_ts:time} +type dhcp={_path:string,ts:time,uids:|[string]|,client_addr:ip,server_addr:ip,mac:string,host_name:string,client_fqdn:string,domain:string,requested_addr:ip,assigned_addr:ip,lease_time:duration,client_message:string,server_message:string,msg_types:[string],duration:duration,_write_ts:time} +type dnp3={_path:string,ts:time,uid:string,id:conn_id,fc_request:string,fc_reply:string,iin:uint64,_write_ts:time} +type dns={_path:string,ts:time,uid:string,id:conn_id,proto:zenum,trans_id:uint64,rtt:duration,query:string,qclass:uint64,qclass_name:string,qtype:uint64,qtype_name:string,rcode:uint64,rcode_name:string,AA:bool,TC:bool,RD:bool,RA:bool,Z:uint64,answers:[string],TTLs:[duration],rejected:bool,_write_ts:time} +type dpd={_path:string,ts:time,uid:string,id:conn_id,proto:zenum,analyzer:string,failure_reason:string,_write_ts:time} +type files={_path:string,ts:time,fuid:string,tx_hosts:|[ip]|,rx_hosts:|[ip]|,conn_uids:|[string]|,source:string,depth:uint64,analyzers:|[string]|,mime_type:string,filename:string,duration:duration,local_orig:bool,is_orig:bool,seen_bytes:uint64,total_bytes:uint64,missing_bytes:uint64,overflow_bytes:uint64,timedout:bool,parent_fuid:string,md5:string,sha1:string,sha256:string,extracted:string,extracted_cutoff:bool,extracted_size:uint64,_write_ts:time} +type ftp={_path:string,ts:time,uid:string,id:conn_id,user:string,password:string,command:string,arg:string,mime_type:string,file_size:uint64,reply_code:uint64,reply_msg:string,data_channel:{passive:bool,orig_h:ip,resp_h:ip,resp_p:port},fuid:string,_write_ts:time} +type http={_path:string,ts:time,uid:string,id:conn_id,trans_depth:uint64,method:string,host:string,uri:string,referrer:string,version:string,user_agent:string,origin:string,request_body_len:uint64,response_body_len:uint64,status_code:uint64,status_msg:string,info_code:uint64,info_msg:string,tags:|[zenum]|,username:string,password:string,proxied:|[string]|,orig_fuids:[string],orig_filenames:[string],orig_mime_types:[string],resp_fuids:[string],resp_filenames:[string],resp_mime_types:[string],_write_ts:time} +type intel={_path:string,ts:time,uid:string,id:conn_id,seen:{indicator:string,indicator_type:zenum,where:zenum,node:string},matched:|[zenum]|,sources:|[string]|,fuid:string,file_mime_type:string,file_desc:string,_write_ts:time} +type 
irc={_path:string,ts:time,uid:string,id:conn_id,nick:string,user:string,command:string,value:string,addl:string,dcc_file_name:string,dcc_file_size:uint64,dcc_mime_type:string,fuid:string,_write_ts:time} +type kerberos={_path:string,ts:time,uid:string,id:conn_id,request_type:string,client:string,service:string,success:bool,error_msg:string,from:time,till:time,cipher:string,forwardable:bool,renewable:bool,client_cert_subject:string,client_cert_fuid:string,server_cert_subject:string,server_cert_fuid:string,_write_ts:time} +type known_certs={_path:string,ts:time,host:ip,port_num:port,subject:string,issuer_subject:string,serial:string,_write_ts:time} +type known_hosts={_path:string,ts:time,host:ip,_write_ts:time} +type known_services={_path:string,ts:time,host:ip,port_num:port,port_proto:zenum,service:|[string]|,_write_ts:time} +type loaded_scripts={_path:string,name:string,_write_ts:time} +type modbus={_path:string,ts:time,uid:string,id:conn_id,func:string,exception:string,_write_ts:time} +type mysql={_path:string,ts:time,uid:string,id:conn_id,cmd:string,arg:string,success:bool,rows:uint64,response:string,_write_ts:time} +type netcontrol={_path:string,ts:time,rule_id:string,category:zenum,cmd:string,state:zenum,action:string,target:zenum,entity_type:string,entity:string,mod:string,msg:string,priority:int64,expire:duration,location:string,plugin:string,_write_ts:time} +type netcontrol_drop={_path:string,ts:time,rule_id:string,orig_h:ip,orig_p:port,resp_h:ip,resp_p:port,expire:duration,location:string,_write_ts:time} +type netcontrol_shunt={_path:string,ts:time,rule_id:string,f:{src_h:ip,src_p:port,dst_h:ip,dst_p:port},expire:duration,location:string,_write_ts:time} +type notice={_path:string,ts:time,uid:string,id:conn_id,fuid:string,file_mime_type:string,file_desc:string,proto:zenum,note:zenum,msg:string,sub:string,src:ip,dst:ip,p:port,n:uint64,peer_descr:string,actions:|[zenum]|,email_dest:|[string]|,suppress_for:duration,remote_location:{country_code:string,region:string,city:string,latitude:float64,longitude:float64},_write_ts:time} +type notice_alarm={_path:string,ts:time,uid:string,id:conn_id,fuid:string,file_mime_type:string,file_desc:string,proto:zenum,note:zenum,msg:string,sub:string,src:ip,dst:ip,p:port,n:uint64,peer_descr:string,actions:|[zenum]|,email_dest:|[string]|,suppress_for:duration,remote_location:{country_code:string,region:string,city:string,latitude:float64,longitude:float64},_write_ts:time} +type ntlm={_path:string,ts:time,uid:string,id:conn_id,username:string,hostname:string,domainname:string,server_nb_computer_name:string,server_dns_computer_name:string,server_tree_name:string,success:bool,_write_ts:time} +type ntp={_path:string,ts:time,uid:string,id:conn_id,version:uint64,mode:uint64,stratum:uint64,poll:duration,precision:duration,root_delay:duration,root_disp:duration,ref_id:string,ref_time:time,org_time:time,rec_time:time,xmt_time:time,num_exts:uint64,_write_ts:time} +type ocsp={_path:string,ts:time,id:string,hashAlgorithm:string,issuerNameHash:string,issuerKeyHash:string,serialNumber:string,certStatus:string,revoketime:time,revokereason:string,thisUpdate:time,nextUpdate:time,_write_ts:time} +type 
openflow={_path:string,ts:time,dpid:uint64,match:{in_port:uint64,dl_src:string,dl_dst:string,dl_vlan:uint64,dl_vlan_pcp:uint64,dl_type:uint64,nw_tos:uint64,nw_proto:uint64,nw_src:net,nw_dst:net,tp_src:uint64,tp_dst:uint64},flow_mod:{cookie:uint64,table_id:uint64,command:zenum=string,idle_timeout:uint64,hard_timeout:uint64,priority:uint64,out_port:uint64,out_group:uint64,flags:uint64,actions:{out_ports:[uint64],vlan_vid:uint64,vlan_pcp:uint64,vlan_strip:bool,dl_src:string,dl_dst:string,nw_tos:uint64,nw_src:ip,nw_dst:ip,tp_src:uint64,tp_dst:uint64}},_write_ts:time} +type packet_filter={_path:string,ts:time,node:string,filter:string,init:bool,success:bool,_write_ts:time} +type pe={_path:string,ts:time,id:string,machine:string,compile_ts:time,os:string,subsystem:string,is_exe:bool,is_64bit:bool,uses_aslr:bool,uses_dep:bool,uses_code_integrity:bool,uses_seh:bool,has_import_table:bool,has_export_table:bool,has_cert_table:bool,has_debug_data:bool,section_names:[string],_write_ts:time} +type radius={_path:string,ts:time,uid:string,id:conn_id,username:string,mac:string,framed_addr:ip,tunnel_client:string,connect_info:string,reply_msg:string,result:string,ttl:duration,_write_ts:time} +type rdp={_path:string,ts:time,uid:string,id:conn_id,cookie:string,result:string,security_protocol:string,client_channels:[string],keyboard_layout:string,client_build:string,client_name:string,client_dig_product_id:string,desktop_width:uint64,desktop_height:uint64,requested_color_depth:string,cert_type:string,cert_count:uint64,cert_permanent:bool,encryption_level:string,encryption_method:string,_write_ts:time} +type reporter={_path:string,ts:time,level:zenum,message:string,location:string,_write_ts:time} +type rfb={_path:string,ts:time,uid:string,id:conn_id,client_major_version:string,client_minor_version:string,server_major_version:string,server_minor_version:string,authentication_method:string,auth:bool,share_flag:bool,desktop_name:string,width:uint64,height:uint64,_write_ts:time} +type signatures={_path:string,ts:time,uid:string,src_addr:ip,src_port:port,dst_addr:ip,dst_port:port,note:zenum,sig_id:string,event_msg:string,sub_msg:string,sig_count:uint64,host_count:uint64,_write_ts:time} +type sip={_path:string,ts:time,uid:string,id:conn_id,trans_depth:uint64,method:string,uri:string,date:string,request_from:string,request_to:string,response_from:string,response_to:string,reply_to:string,call_id:string,seq:string,subject:string,request_path:[string],response_path:[string],user_agent:string,status_code:uint64,status_msg:string,warning:string,request_body_len:uint64,response_body_len:uint64,content_type:string,_write_ts:time} +type smb_files={_path:string,ts:time,uid:string,id:conn_id,fuid:string,action:zenum,path:string,name:string,size:uint64,prev_name:string,times:{modified:time,accessed:time,created:time,changed:time},_write_ts:time} +type smb_mapping={_path:string,ts:time,uid:string,id:conn_id,path:string,service:string,native_file_system:string,share_type:string,_write_ts:time} +type smtp={_path:string,ts:time,uid:string,id:conn_id,trans_depth:uint64,helo:string,mailfrom:string,rcptto:|[string]|,date:string,from:string,to:|[string]|,cc:|[string]|,reply_to:string,msg_id:string,in_reply_to:string,subject:string,x_originating_ip:ip,first_received:string,second_received:string,last_reply:string,path:[ip],user_agent:string,tls:bool,fuids:[string],is_webmail:bool,_write_ts:time} +type 
snmp={_path:string,ts:time,uid:string,id:conn_id,duration:duration,version:string,community:string,get_requests:uint64,get_bulk_requests:uint64,get_responses:uint64,set_requests:uint64,display_string:string,up_since:time,_write_ts:time} +type socks={_path:string,ts:time,uid:string,id:conn_id,version:uint64,user:string,password:string,status:string,request:{host:ip,name:string},request_p:port,bound:{host:ip,name:string},bound_p:port,_write_ts:time} +type software={_path:string,ts:time,host:ip,host_p:port,software_type:zenum,name:string,version:{major:uint64,minor:uint64,minor2:uint64,minor3:uint64,addl:string},unparsed_version:string,_write_ts:time} +type ssh={_path:string,ts:time,uid:string,id:conn_id,version:uint64,auth_success:bool,auth_attempts:uint64,direction:zenum,client:string,server:string,cipher_alg:string,mac_alg:string,compression_alg:string,kex_alg:string,host_key_alg:string,host_key:string,remote_location:{country_code:string,region:string,city:string,latitude:float64,longitude:float64},_write_ts:time} +type ssl={_path:string,ts:time,uid:string,id:conn_id,version:string,cipher:string,curve:string,server_name:string,resumed:bool,last_alert:string,next_protocol:string,established:bool,cert_chain_fuids:[string],client_cert_chain_fuids:[string],subject:string,issuer:string,client_subject:string,client_issuer:string,validation_status:string,_write_ts:time} +type stats={_path:string,ts:time,peer:string,mem:uint64,pkts_proc:uint64,bytes_recv:uint64,pkts_dropped:uint64,pkts_link:uint64,pkt_lag:duration,events_proc:uint64,events_queued:uint64,active_tcp_conns:uint64,active_udp_conns:uint64,active_icmp_conns:uint64,tcp_conns:uint64,udp_conns:uint64,icmp_conns:uint64,timers:uint64,active_timers:uint64,files:uint64,active_files:uint64,dns_requests:uint64,active_dns_requests:uint64,reassem_tcp_size:uint64,reassem_file_size:uint64,reassem_frag_size:uint64,reassem_unknown_size:uint64,_write_ts:time} +type syslog={_path:string,ts:time,uid:string,id:conn_id,proto:zenum,facility:string,severity:string,message:string,_write_ts:time} +type tunnel={_path:string,ts:time,uid:string,id:conn_id,tunnel_type:zenum,action:zenum,_write_ts:time} +type weird={_path:string,ts:time,uid:string,id:conn_id,name:string,addl:string,notice:bool,peer:string,source:string,_write_ts:time} +type x509={_path:string,ts:time,id:string,certificate:{version:uint64,serial:string,subject:string,issuer:string,not_valid_before:time,not_valid_after:time,key_alg:string,sig_alg:string,key_type:string,key_length:uint64,exponent:string,curve:string},san:{dns:[string],uri:[string],email:[string],ip:[ip]},basic_constraints:{ca:bool,path_len:uint64},_write_ts:time} + +// This second block of type definitions represent changes needed to cover +// an out-of-the-box Zeek v4.1.0. In other Zeek revisions, we were accustomed +// to only seeing new fields added, but this represented the first time fields +// have changed, e.g., in SSL logs, "cert_chain_fuids" became "cert_chain_fps". +// Therefore we have wholly separate type definitions for this revision so we +// can cover 100% of the expected fields. 
+
+type ssl_4_1_0={_path:string,ts:time,uid:string,id:conn_id,version:string,cipher:string,curve:string,server_name:string,resumed:bool,last_alert:string,next_protocol:string,established:bool,ssl_history:string,cert_chain_fps:[string],client_cert_chain_fps:[string],subject:string,issuer:string,client_subject:string,client_issuer:string,sni_matches_cert:bool,validation_status:string,_write_ts:time}
+type x509_4_1_0={_path:string,ts:time,fingerprint:string,certificate:{version:uint64,serial:string,subject:string,issuer:string,not_valid_before:time,not_valid_after:time,key_alg:string,sig_alg:string,key_type:string,key_length:uint64,exponent:string,curve:string},san:{dns:[string],uri:[string],email:[string],ip:[ip]},basic_constraints:{ca:bool,path_len:uint64},host_cert:bool,client_cert:bool,_write_ts:time}
+
+const schemas = |{
+  "broker": <broker>,
+  "capture_loss": <capture_loss>,
+  "cluster": <cluster>,
+  "config": <config>,
+  "conn": <conn>,
+  "dce_rpc": <dce_rpc>,
+  "dhcp": <dhcp>,
+  "dnp3": <dnp3>,
+  "dns": <dns>,
+  "dpd": <dpd>,
+  "files": <files>,
+  "ftp": <ftp>,
+  "http": <http>,
+  "intel": <intel>,
+  "irc": <irc>,
+  "kerberos": <kerberos>,
+  "known_certs": <known_certs>,
+  "known_hosts": <known_hosts>,
+  "known_services": <known_services>,
+  "loaded_scripts": <loaded_scripts>,
+  "modbus": <modbus>,
+  "mysql": <mysql>,
+  "netcontrol": <netcontrol>,
+  "netcontrol_drop": <netcontrol_drop>,
+  "netcontrol_shunt": <netcontrol_shunt>,
+  "notice": <notice>,
+  "notice_alarm": <notice_alarm>,
+  "ntlm": <ntlm>,
+  "ntp": <ntp>,
+  "ocsp": <ocsp>,
+  "openflow": <openflow>,
+  "packet_filter": <packet_filter>,
+  "pe": <pe>,
+  "radius": <radius>,
+  "rdp": <rdp>,
+  "reporter": <reporter>,
+  "rfb": <rfb>,
+  "signatures": <signatures>,
+  "sip": <sip>,
+  "smb_files": <smb_files>,
+  "smb_mapping": <smb_mapping>,
+  "smtp": <smtp>,
+  "snmp": <snmp>,
+  "socks": <socks>,
+  "software": <software>,
+  "ssh": <ssh>,
+  "ssl": <ssl>,
+  "stats": <stats>,
+  "syslog": <syslog>,
+  "tunnel": <tunnel>,
+  "weird": <weird>,
+  "x509": <x509>
+}|
+
+// We'll check for the presence of fields we know are unique to records that
+// changed in Zeek v4.1.0 and shape those with special v4.1.0-specific config.
+// For everything else we'll apply the default type definitions.
+
+yield nest_dotted(this) | switch (
+  case _path=="ssl" and has(ssl_history) => yield shape(<ssl_4_1_0>)
+  case _path=="x509" and has(fingerprint) => yield shape(<x509_4_1_0>)
+  default => yield shape(schemas[_path])
+)
+```
+
+### Leading Type Definitions
+
+The top three lines define types that are referenced further below in the main
+portion of the Zed shaper.
+
+```
+type port=uint16
+type zenum=string
+type conn_id={orig_h:ip,orig_p:port,resp_h:ip,resp_p:port}
+```
+The `port` and `zenum` types are described further in the [Zed/Zeek Data Type Compatibility](data-type-compatibility.md)
+doc. The `conn_id` type will just save us from having to repeat these fields
+individually in the many Zeek record types that contain an embedded `id`
+record.
+
+### Default Type Definitions Per Zeek Log `_path`
+
+The bulk of this Zed shaper consists of detailed per-field data type
+definitions for each record in the default set of NDJSON logs output by Zeek.
+These type definitions reference the types we defined above, such as `port`
+and `conn_id`. The syntax for defining primitive and complex types follows the
+relevant sections of the [ZSON Format](../../formats/zson.md#2-the-zson-format)
+specification.
+
+```
+...
+type conn={_path:string,ts:time,uid:string,id:conn_id,proto:zenum,service:string,duration:duration,orig_bytes:uint64,resp_bytes:uint64,conn_state:string,local_orig:bool,local_resp:bool,missed_bytes:uint64,history:string,orig_pkts:uint64,orig_ip_bytes:uint64,resp_pkts:uint64,resp_ip_bytes:uint64,tunnel_parents:|[string]|,_write_ts:time}
+type dce_rpc={_path:string,ts:time,uid:string,id:conn_id,rtt:duration,named_pipe:string,endpoint:string,operation:string,_write_ts:time}
+...
+```
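+
+To see how one of these type definitions drives `shape()` in isolation, here
+is a small standalone sketch (the `partial_conn` type and the echoed record
+are hypothetical and not part of the reference shaper) that can be run
+with `zq`:
+
+```
+echo '{"ts":"2018-03-24T17:15:21.255387Z","uid":"C8Tful1TvM3Zf5x8fl"}' |
+  zq -Z 'type partial_conn={ts:time,uid:string} yield shape(<partial_conn>)' -
+```
+
+Here the string value of `ts` is cast to the Zed `time` type and the result
+takes on the named type `partial_conn`, mirroring how the full shaper applies
+`schemas[_path]` to each input record.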
+
+> **Note:** See [the role of `_path`](reading-zeek-log-formats.md#the-role-of-_path)
+> for important details if you're using Zeek's built-in [ASCII logger](https://docs.zeek.org/en/current/scripts/base/frameworks/logging/writers/ascii.zeek.html)
+> to generate NDJSON rather than the [JSON Streaming Logs](https://github.com/corelight/json-streaming-logs) package.
+
+### Version-Specific Type Definitions
+
+The next block of type definitions covers exceptions for Zeek v4.1.0 where the
+names of fields for certain log types have changed from prior releases.
+
+```
+type ssl_4_1_0={_path:string,ts:time,uid:string,id:conn_id,version:string,cipher:string,curve:string,server_name:string,resumed:bool,last_alert:string,next_protocol:string,established:bool,ssl_history:string,cert_chain_fps:[string],client_cert_chain_fps:[string],subject:string,issuer:string,client_subject:string,client_issuer:string,sni_matches_cert:bool,validation_status:string,_write_ts:time}
+type x509_4_1_0={_path:string,ts:time,fingerprint:string,certificate:{version:uint64,serial:string,subject:string,issuer:string,not_valid_before:time,not_valid_after:time,key_alg:string,sig_alg:string,key_type:string,key_length:uint64,exponent:string,curve:string},san:{dns:[string],uri:[string],email:[string],ip:[ip]},basic_constraints:{ca:bool,path_len:uint64},host_cert:bool,client_cert:bool,_write_ts:time}
+```
+
+### Mapping From `_path` Values to Types
+
+The next section is just a simple mapping from the string values typically
+found in the Zeek `_path` field to one of the types we defined above.
+
+```
+const schemas = |{
+  "broker": <broker>,
+  "capture_loss": <capture_loss>,
+  "cluster": <cluster>,
+  "config": <config>,
+  "conn": <conn>,
+  "dce_rpc": <dce_rpc>,
+...
+```
+
+### Zed Pipeline
+
+The Zed shaper ends with a pipeline that stitches together everything we've defined
+so far.
+
+```
+yield nest_dotted(this) | switch (
+  case _path=="ssl" and has(ssl_history) => yield shape(<ssl_4_1_0>)
+  case _path=="x509" and has(fingerprint) => yield shape(<x509_4_1_0>)
+  default => yield shape(schemas[_path])
+)
+```
+
+Picking this apart, it transforms each record as it's being read, in three
+steps:
+
+1. `nest_dotted()` reverses the Zeek NDJSON logger's "flattening" of nested
+   records, e.g., how it populates a field named `id.orig_h` rather than
+   creating a field `id` with sub-field `orig_h` inside it. Restoring the
+   original nesting now gives us the option to reference the record named `id`
+   in the Zed language and access the entire 4-tuple of values, but still
+   access the individual values using the same dotted syntax like `id.orig_h`
+   when needed.
+
+2. The `switch` detects whether fields specific to Zeek v4.1.0 are present for
+   the two log types for which the [version-specific type definitions](#version-specific-type-definitions)
+   should be applied. For all log lines and types other than these exceptions,
+   the [default type definitions](#default-type-definitions-per-zeek-log-_path)
+   are applied.
+
+3. Each `shape()` call applies an appropriate type definition based on the
+   nature of the incoming record. The logic of `shape()` includes:
+
+   * For any fields referenced in the type definition that aren't present in
+     the input record, the field is added with a `null` value. (Note: This
+     could be performed separately via the `fill()` function.)
+
+   * The data type of each field in the type definition is applied to the
+     field of that name in the input record.
 (Note: This could be performed
+     separately via the `cast()` function.)
+
+   * The fields in the input record are ordered to match the order in which
+     they appear in the type definition. (Note: This could be performed
+     separately via the `order()` function.)
+
+   Any fields that appear in the input record that are not present in the
+   type definition are kept and assigned an inferred data type. If you would
+   prefer to have such additional fields dropped (i.e., to maintain strict
+   adherence to the shape), append a call to the `crop()` function to the
+   Zed pipeline, e.g.:
+
+   ```
+   ... | yield shape(schemas[_path]) | yield crop(schemas[_path])
+   ```
+
+   Open issues [zed/2585](https://github.com/brimdata/zed/issues/2585) and
+   [zed/2776](https://github.com/brimdata/zed/issues/2776) both track planned
+   future improvements to this part of Zed shapers.
+
+## Invoking the Shaper From `zq`
+
+A shaper is typically invoked via the `-I` option of `zq`.
+
+For example, if we assume this input file `weird.ndjson`
+
+```mdtest-input weird.ndjson
+{
+  "_path": "weird",
+  "_write_ts": "2018-03-24T17:15:20.600843Z",
+  "ts": "2018-03-24T17:15:20.600843Z",
+  "uid": "C1zOivgBT6dBmknqk",
+  "id.orig_h": "10.47.1.152",
+  "id.orig_p": 49562,
+  "id.resp_h": "23.217.103.245",
+  "id.resp_p": 80,
+  "name": "TCP_ack_underflow_or_misorder",
+  "notice": false,
+  "peer": "zeek"
+}
+```
+
+applying the reference shaper via
+
+```mdtest-command
+zq -Z -I shaper.zed weird.ndjson
+```
+
+produces
+
+```mdtest-output
+{
+    _path: "weird",
+    ts: 2018-03-24T17:15:20.600843Z,
+    uid: "C1zOivgBT6dBmknqk",
+    id: {
+        orig_h: 10.47.1.152,
+        orig_p: 49562 (port=uint16),
+        resp_h: 23.217.103.245,
+        resp_p: 80 (port)
+    } (=conn_id),
+    name: "TCP_ack_underflow_or_misorder",
+    addl: null (string),
+    notice: false,
+    peer: "zeek",
+    source: null (string),
+    _write_ts: 2018-03-24T17:15:20.600843Z
+} (=weird)
+```
+
+If working in a directory containing many NDJSON logs, the
+reference shaper can be applied to all the records they contain and
+output them all in a single binary [ZNG](../../formats/zng.md) file as
+follows:
+
+```
+zq -I shaper.zed *.log > /tmp/all.zng
+```
+
+If you wish to apply the shaper and then perform additional
+operations on the richly-typed records, the Zed query on the command line
+should begin with a `|`, as this appends it to the pipeline at the bottom of
+the shaper from the included file.
+
+For example, to count Zeek `conn` records into CIDR-based buckets based on
+originating IP address:
+
+```
+zq -I shaper.zed -f table '| count() by network_of(id.orig_h) | sort -r' conn.log
+```
+
+[zed/2584](https://github.com/brimdata/zed/issues/2584) tracks a planned
+improvement for this use of `zq -I`.
+
+If you intend to frequently shape the same NDJSON data, you may want to create
+an alias in your shell to always invoke `zq` with the necessary `-I` flag
+pointing to the path of your finalized shaper.
+[zed/1059](https://github.com/brimdata/zed/issues/1059)
+tracks a planned enhancement to persist such settings within Zed itself rather
+than relying on external mechanisms such as shell aliases.
+
+## Importing Shaped Data Into Zui
+
+If you wish to browse your shaped data with [Zui](https://zui.brimdata.io/),
+the best way to accomplish this at the moment would be to use `zq` to convert
+it to ZNG [as shown above](#invoking-the-shaper-from-zq), then drag the ZNG
+into Zui as you would any other log. 
An enhancement [zed/2695](https://github.com/brimdata/zed/issues/2695) +is planned that will soon make it possible to attach your shaper to a +Pool. This will allow you to drag the original NDJSON logs directly into the +Pool in Zui and have the shaping applied as the records are being committed to +the Pool. + +## Contact us! + +If you're having difficulty, interested in shaping other data sources, or +just have feedback, please join our [public Slack](https://www.brimdata.io/join-slack/) +and speak up or [open an issue](https://github.com/brimdata/zed/issues/new/choose). +Thanks! diff --git a/versioned_docs/version-v1.15.0/lake/_category_.yaml b/versioned_docs/version-v1.15.0/lake/_category_.yaml new file mode 100644 index 00000000..5beb193f --- /dev/null +++ b/versioned_docs/version-v1.15.0/lake/_category_.yaml @@ -0,0 +1,2 @@ +position: 7 +label: Lake diff --git a/versioned_docs/version-v1.15.0/lake/api.md b/versioned_docs/version-v1.15.0/lake/api.md new file mode 100644 index 00000000..23027111 --- /dev/null +++ b/versioned_docs/version-v1.15.0/lake/api.md @@ -0,0 +1,560 @@ +--- +sidebar_position: 1 +sidebar_label: API +--- + +# Zed lake API + +## _Status_ + +> This is a brief sketch of the functionality exposed in the +> Zed API. More detailed documentation of the API will be forthcoming. + +## Endpoints + +### Pools + +#### Create pool + +Create a new lake pool. + +``` +POST /pool +``` + +**Params** + +| Name | Type | In | Description | +| ---- | ---- | -- | ----------- | +| name | string | body | **Required.** Name of the pool. Must be unique to lake. | +| layout.order | string | body | Order of storage by primary key(s) in pool. Possible values: desc, asc. Default: asc. | +| layout.keys | [[string]] | body | Primary key(s) of pool. The element of each inner string array should reflect the hierarchical ordering of named fields within indexed records. Default: [[ts]]. | +| thresh | int | body | The size in bytes of each seek index. | +| Content-Type | string | header | [MIME type](#mime-types) of the request payload. | +| Accept | string | header | Preferred [MIME type](#mime-types) of the response. | + +**Example Request** + +``` +curl -X POST \ + -H 'Accept: application/json' \ + -H 'Content-Type: application/json' \ + -d '{"name": "inventory", "layout": {"keys": [["product","serial_number"]]}}' \ + http://localhost:9867/pool +``` + +**Example Response** + +``` +{ + "pool": { + "ts": "2022-07-13T21:23:05.323016Z", + "name": "inventory", + "id": "0x0f5ce9b9b6202f3883c9db8ff58d8721a075d1e4", + "layout": { + "order": "asc", + "keys": [ + [ + "product", + "serial_number" + ] + ] + }, + "seek_stride": 65536, + "threshold": 524288000 + }, + "branch": { + "ts": "2022-07-13T21:23:05.367365Z", + "name": "main", + "commit": "0x0000000000000000000000000000000000000000" + } +} +``` + +--- + +#### Rename pool + +Change a pool's name. + +``` +PUT /pool/{pool} +``` + +**Params** + +| Name | Type | In | Description | +| ---- | ---- | -- | ----------- | +| pool | string | path | **Required.** ID or name of the requested pool. | +| name | string | body | **Required.** The desired new name of the pool. Must be unique to lake. | +| Content-Type | string | header | [MIME type](#mime-types) of the request payload. | + +**Example Request** + +``` +curl -X PUT \ + -H 'Content-Type: application/json' \ + -d '{"name": "catalog"}' \ + http://localhost:9867/pool/inventory +``` + +On success, HTTP 204 is returned with no response payload. + +--- + +#### Delete pool + +Permanently delete a pool. 
+ +``` +DELETE /pool/{pool} +``` + +**Params** + +| Name | Type | In | Description | +| ---- | ---- | -- | ----------- | +| pool | string | path | **Required.** ID or name of the requested pool. | + +**Example Request** + +``` +curl -X DELETE \ + http://localhost:9867/pool/inventory +``` + +On success, HTTP 204 is returned with no response payload. + +--- + +#### Vacuum pool + +Free storage space by permanently removing underlying data objects that have +previously been subject to a [delete](#delete-data) operation. + +``` +POST /pool/{pool}/revision/{revision}/vacuum +``` + +**Params** + +| Name | Type | In | Description | +| ---- | ---- | -- | ----------- | +| pool | string | path | **Required.** ID or name of the requested pool. | +| revision | string | path | **Required.** The starting point for locating objects that can be vacuumed. Can be the name of a branch (whose tip would be used) or a commit ID. | +| dryrun | string | query | Set to "T" to return the list of objects that could be vacuumed, but don't actually remove them. Defaults to "F". | + +**Example Request** + +``` +curl -X POST \ + -H 'Accept: application/json' \ + http://localhost:9867/pool/inventory/revision/main/vacuum +``` + +**Example Response** + +``` +{"object_ids":["0x10f5a24253887eaf179ee385532ee411c2ed8050","0x10f5a2410ccd08f72e5d98f6d054477173b4f13f"]} +``` + +--- + +### Branches + +#### Load Data + +Add data to a pool and return a reference commit ID. + +``` +POST /pool/{pool}/branch/{branch} +``` + +**Params** + +| Name | Type | In | Description | +| ---- | ---- | -- | ----------- | +| pool | string | path | **Required.** ID or name of the pool. | +| branch | string | path | **Required.** Name of branch to which data will be loaded. | +| | various | body | **Required.** Contents of the posted data. | +| csv.delim | string | query | Exactly one character specifying the field delimiter for CSV data. Defaults to ",". | +| Content-Type | string | header | [MIME type](#mime-types) of the posted content. If undefined, the service will attempt to introspect the data and determine type automatically. | +| Accept | string | header | Preferred [MIME type](#mime-types) of the response. | + +**Example Request** + +``` +curl -X POST \ + -H 'Accept: application/json' \ + -H 'Content-Type: application/json' \ + -d '{"product": {"serial_number": 12345, "name": "widget"}, "warehouse": "chicago"} + {"product": {"serial_number": 12345, "name": "widget"}, "warehouse": "miami"} + {"product": {"serial_number": 12346, "name": "gadget"}, "warehouse": "chicago"}' \ + http://localhost:9867/pool/inventory/branch/main +``` + +**Example Response** + +``` +{"commit":"0x0ed4f42da5763a9500ee71bc3fa5c69f306872de","warnings":[]} +``` + +--- + +#### Get Branch + +Get information about a branch. + +``` +GET /pool/{pool}/branch/{branch} +``` + +**Params** + +| Name | Type | In | Description | +| ---- | ---- | -- | ----------- | +| pool | string | path | **Required.** ID or name of the pool. | +| branch | string | path | **Required.** Name of branch. | +| Accept | string | header | Preferred [MIME type](#mime-types) of the response. | + +**Example Request** + +``` +curl -X GET \ + -H 'Accept: application/json' \ + http://localhost:9867/pool/inventory/branch/main +``` + +**Example Response** + +``` +{"commit":"0x0ed4fa21616ecd8fec9d6fd395ad876db98a5dae","warnings":null} +``` + +--- + +#### Delete Branch + +Delete a branch. 
+ +``` +DELETE /pool/{pool}/branch/{branch} +``` + +**Params** + +| Name | Type | In | Description | +| ---- | ---- | -- | ----------- | +| pool | string | path | **Required.** ID or name of the pool. | +| branch | string | path | **Required.** Name of branch. | + +**Example Request** + +``` +curl -X DELETE \ + http://localhost:9867/pool/inventory/branch/staging +``` + +On success, HTTP 204 is returned with no response payload. + +--- + +#### Delete Data + +Create a commit that reflects the deletion of some data in the branch. The data +to delete can be specified via a list of object IDs or +as a filter expression (see [limitations](../commands/zed.md#delete)). + +This simply removes the data from the branch without actually removing the +underlying data objects thereby allowing [time travel](../commands/zed.md#time-travel) to work in the face +of deletes. Permanent removal of underlying data objects is handled by a +separate [vacuum](#vacuum-pool) operation. + +``` +POST /pool/{pool}/branch/{branch}/delete +``` + +**Params** + +| Name | Type | In | Description | +| ---- | ---- | -- | ----------- | +| pool | string | path | **Required.** ID of the pool. | +| branch | string | path | **Required.** Name of branch. | +| object_ids | [string] | body | Object IDs to be deleted. | +| where | string | body | Filter expression (see [limitations](../commands/zed.md#delete)). | +| Content-Type | string | header | [MIME type](#mime-types) of the request payload. | +| Accept | string | header | Preferred [MIME type](#mime-types) of the response. | + +**Example Request** + +``` +curl -X POST \ + -H 'Accept: application/json' \ + -H 'Content-Type: application/json' \ + -d '{"object_ids": ["274Eb1Kn8MTM6qxPyBpVTvYhLLa", "274EavbXt546VNelRLNXrzWShNh"]}' \ + http://localhost:9867/pool/inventory/branch/main/delete +``` + +**Example Response** + +``` +{"commit":"0x0ed4fee861e8fb61568783205a46a218182eba6c","warnings":null} +``` + +**Example Request** + +``` +curl -X POST \ + -H 'Accept: application/json' \ + -H 'Content-Type: application/json' \ + -d '{"where": "product.serial_number > 12345"}' \ + http://localhost:9867/pool/inventory/branch/main/delete +``` + +**Example Response** + +``` +{"commit":"0x0f5ceaeaaec7b4c33cfdece9f2e8577ad89d21e2","warnings":null} +``` + +--- + +#### Merge Branches + +Create a commit with the difference of the child branch added to the selected +branch. + +``` +POST /pool/{pool}/branch/{branch}/merge/{child} +``` + +**Params** + +| Name | Type | In | Description | +| ---- | ---- | -- | ----------- | +| pool | string | path | **Required.** ID of the pool. | +| branch | string | path | **Required.** Name of branch selected as merge destination. | +| child | string | path | **Required.** Name of child branch selected as source of merge. | +| Accept | string | header | Preferred [MIME type](#mime-types) of the response. | + +**Example Request** + +``` +curl -X POST \ + -H 'Accept: application/json' \ + http://localhost:9867/pool/inventory/branch/main/merge/staging +``` + +**Example Response** + +``` +{"commit":"0x0ed4ffc2566b423ee444c1c8e6bf964515290f4c","warnings":null} +``` + +--- + +#### Revert + +Create a revert commit of the specified commit. + +``` +POST /pool/{pool}/branch/{branch}/revert/{commit} +``` + +**Params** + +| Name | Type | In | Description | +| ---- | ---- | -- | ----------- | +| pool | string | path | **Required.** ID of the pool. | +| branch | string | path | **Required.** Name of branch on which to revert commit. 
+| commit | string | path | **Required.** ID of commit to be reverted. |
+| Accept | string | header | Preferred [MIME type](#mime-types) of the response. |
+
+**Example Request**
+
+```
+curl -X POST \
+     -H 'Accept: application/json' \
+     http://localhost:9867/pool/inventory/branch/main/revert/27D22ifDw3Ms2NMzo8jXpDfpgjc
+```
+
+**Example Response**
+
+```
+{"commit":"0x0ed500ab6f80e5ac8a1b871bddd88c57fe963ab1","warnings":null}
+```
+
+---
+
+### Query
+
+Execute a Zed query against data in a data lake.
+
+```
+POST /query
+```
+
+**Params**
+
+| Name | Type | In | Description |
+| ---- | ---- | -- | ----------- |
+| query | string | body | Zed query to execute. All data is returned if not specified. |
+| head.pool | string | body | Pool to query against. Not required if pool is specified in query. |
+| head.branch | string | body | Branch to query against. Defaults to "main". |
+| ctrl | string | query | Set to "T" to include control messages in ZNG or ZJSON responses. Defaults to "F". |
+| Content-Type | string | header | [MIME type](#mime-types) of the request payload. |
+| Accept | string | header | Preferred [MIME type](#mime-types) of the response. |
+
+**Example Request**
+
+```
+curl -X POST \
+     -H 'Accept: application/x-zson' \
+     -H 'Content-Type: application/json' \
+     http://localhost:9867/query -d '{"query":"from inventory@main | count() by warehouse"}'
+```
+
+**Example Response**
+
+```
+{warehouse:"chicago",count:2(uint64)}
+{warehouse:"miami",count:1(uint64)}
+```
+
+**Example Request**
+
+```
+curl -X POST \
+     -H 'Accept: application/x-zjson' \
+     -H 'Content-Type: application/json' \
+     http://localhost:9867/query?ctrl=T -d '{"query":"from inventory@main | count() by warehouse"}'
+```
+
+**Example Response**
+
+```
+{"type":"QueryChannelSet","value":{"channel_id":0}}
+{"type":{"kind":"record","id":30,"fields":[{"name":"warehouse","type":{"kind":"primitive","name":"string"}},{"name":"count","type":{"kind":"primitive","name":"uint64"}}]},"value":["miami","1"]}
+{"type":{"kind":"ref","id":30},"value":["chicago","2"]}
+{"type":"QueryChannelEnd","value":{"channel_id":0}}
+{"type":"QueryStats","value":{"start_time":{"sec":1658193276,"ns":964207000},"update_time":{"sec":1658193276,"ns":964592000},"bytes_read":55,"bytes_matched":55,"records_read":3,"records_matched":3}}
+```
+
+#### Query Status
+
+Retrieve any runtime errors from a specific query. This endpoint only responds
+after the query has exited and is only available for a limited time afterwards.
+
+```
+GET /query/status/{request_id}
+```
+
+**Params**
+
+| Name | Type | In | Description |
+| ---- | ---- | -- | ----------- |
+| request_id | string | path | **Required.** The value of the response header `X-Request-Id` of the target query. |
+
+**Example Request**
+
+```
+curl -X GET \
+     -H 'Accept: application/json' \
+     http://localhost:9867/query/status/2U1oso7btnCXfDenqFOSExOBEIv
+```
+
+**Example Response**
+
+```
+{"error":"parquetio: unsupported type: empty record"}
+```
+
+---
+
+### Events
+
+Subscribe to an events feed, which returns an event stream in the format of
+[server-sent events](https://html.spec.whatwg.org/multipage/server-sent-events.html).
+The [MIME type](#mime-types) specified in the request's Accept HTTP header determines the format
+of `data` field values in the event stream.
+ +``` +GET /events +``` + +**Params** + +None + +**Example Request** + +``` +curl -X GET \ + -H 'Accept: application/json' \ + http://localhost:9867/events +``` + +**Example Response** + +``` +event: pool-new +data: {"pool_id": "1sMDXpVwqxm36Rc2vfrmgizc3jz"} + +event: pool-update +data: {"pool_id": "1sMDXpVwqxm36Rc2vfrmgizc3jz"} + +event: pool-commit +data: {"pool_id": "1sMDXpVwqxm36Rc2vfrmgizc3jz", "commit_id": "1tisISpHoWI7MAZdFBiMERXeA2X"} + +event: pool-delete +data: {"pool_id": "1sMDXpVwqxm36Rc2vfrmgizc3jz"} +``` + +--- + +## Media Types + +For both request and response payloads, the service supports a variety of +formats. + +### Request Payloads + +When sending request payloads, include the MIME type of the format in the +request's Content-Type header. If the Content-Type header is not specified, the +service will expect ZSON as the payload format. + +An exception to this is when [loading data](#load-data) and Content-Type is not +specified. In this case the service will attempt to introspect the data and may +determine the type automatically. The +[input formats](../commands/zq.md#input-formats) table describes which +formats may be successfully auto-detected. + +### Response Payloads + +To receive successful (2xx) responses in a preferred format, include the MIME +type of the format in the request's Accept HTTP header. If the Accept header is +not specified, the service will return ZSON as the default response format. A +different default response format can be specified by invoking the +`-defaultfmt` option when running [`zed serve`](../commands/zed.md#serve). + +For non-2xx responses, the content type of the response will be +`application/json` or `text/plain`. + +### MIME Types + +The following table shows the supported MIME types and where they can be used. + +| Format | Request | Response | MIME Type | +| ---------------- | --------- | -------- | ------------------------------------- | +| Arrow IPC Stream | yes | yes | `application/vnd.apache.arrow.stream` | +| CSV | yes | yes | `text/csv` | +| JSON | yes | yes | `application/json` | +| Line | yes | no | `application/x-line` | +| NDJSON | no | yes | `application/x-ndjson` | +| Parquet | yes | yes | `application/x-parquet` | +| TSV | yes | yes | `text/tab-separated-values` | +| VNG | yes | yes | `application/x-vng` | +| Zeek | yes | yes | `application/x-zeek` | +| ZJSON | yes | yes | `application/x-zjson` | +| ZSON | yes | yes | `application/x-zson` | +| ZNG | yes | yes | `application/x-zng` | diff --git a/versioned_docs/version-v1.15.0/lake/format.md b/versioned_docs/version-v1.15.0/lake/format.md new file mode 100644 index 00000000..d19ea7d3 --- /dev/null +++ b/versioned_docs/version-v1.15.0/lake/format.md @@ -0,0 +1,276 @@ +--- +sidebar_position: 2 +sidebar_label: Format +--- + +# Zed Lake Format + +## _Status_ + +>This document is a rough draft and work in progress. We plan to +soon bring it up to date with the current implementation and maintain it +as we add new capabilities to the system. + +## Introduction + +To support the client-facing [Zed lake semantics](../commands/zed.md#the-lake-model) +implemented by the [`zed` command](../commands/zed.md), we are developing +an open specification for the Zed lake storage format described in this document. +As we make progress on the Zed lake model, we will update this document +as we go. 
+
+The Zed lake storage format is somewhat analogous to the emerging
+cloud table formats like [Iceberg](https://iceberg.apache.org/spec/),
+but differs in a fundamental way: there are no tables in a Zed lake.
+
+On the contrary, we believe a better approach for organizing modern, eclectic
+data is based on a type system rather than a collection of tables
+and relational schemas. Since relations, tables, schemas, data frames,
+Parquet files, Avro files, JSON, CSV, XML, and so forth are all subsets of
+Zed's super-structured type system, a data lake based on Zed holds the promise
+to provide a universal data representation for all of these different approaches to data.
+
+Also, while we are not currently focused on building a SQL engine for the Zed lake,
+it is most certainly possible to do so, as a Zed record type
+[is analogous to](../formats/README.md#2-zed-a-super-structured-pattern)
+a SQL table definition. SQL tables can essentially be dynamically projected
+via a table virtualization layer built on top of the Zed lake model.
+
+All data and metadata in a Zed lake conforms to the Zed data model, which materially
+simplifies development, test, introspection, and so forth.
+
+## Cloud Object Model
+
+Every data element in a Zed lake is one of two fundamental object types:
+* a single-writer _immutable object_, or
+* a multi-writer _transaction journal_.
+
+### Immutable Objects
+
+All imported data in a data pool is composed of immutable objects, which are organized
+around a primary data object. Each data object is composed of one or more immutable
+objects, all of which share a common, globally unique identifier,
+which is referred to generically as `<id>` below.
+
+These identifiers are [KSUIDs](https://github.com/segmentio/ksuid).
+The KSUID allocation scheme
+provides a decentralized solution for creating globally unique IDs.
+KSUIDs have embedded timestamps so the creation time of
+any object named in this way can be derived. Also, a simple lexicographic
+sort of the KSUIDs results in a creation-time ordering (though this ordering
+is not relied on for causal relationships since clock skew can violate
+such an assumption).
+
+> While a Zed lake is defined in terms of a cloud object store, it may also
+> be realized on top of a file system, which provides a convenient means for
+> local, small-scale deployments for test/debug workflows. Thus, for simple use cases,
+> the complexity of running an object-store service may be avoided.
+
+#### Data Objects
+
+A data object is created by a single writer using a globally unique name
+with an embedded KSUID.
+
+New objects are written in their entirety. No updates, appends, or modifications
+may be made once an object exists. Given these semantics, any such object may be
+trivially cached as neither its name nor content ever change.
+
+Since the object's name is globally unique and the
+resulting object is immutable, there is no possible write concurrency to manage
+with respect to a given object.
+
+A data object is composed of the primary data object stored as one or two objects
+(for sequence and/or vector layout) and an optional seek index.
+
+Data objects may be either in sequence form (i.e., ZNG) or vector form (i.e., VNG),
+or both forms may be present as a query optimizer may choose to use whatever
+representation is more efficient.
+When both sequence and vector data objects are present, they must contain the same
+underlying Zed data.
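+
+For example, both forms of the same values can be materialized with `zq` and
+read back identically (a sketch, assuming `zq` is installed; the file names
+here are arbitrary local files, not lake object names):
+```
+zq -f zng example.zson > example.zng   # sequence form
+zq -f vng example.zson > example.vng   # vector form
+zq -z example.zng                      # prints the same ZSON values...
+zq -z example.vng                      # ...as this does
+```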
+
+Immutable objects are named as follows:
+
+|object type|name|
+|-----------|----|
+|vector data|`<pool-id>/data/<id>.vng`|
+|sequence data|`<pool-id>/data/<id>.zng`|
+|sequence seek index|`<pool-id>/data/<id>-seek.zng`|
+
+`<id>` is the KSUID of the data object.
+
+The seek index maps pool key values to seek offsets in the ZNG file, thereby
+allowing a scan to do a byte-range retrieval of the ZNG object when
+processing only a subset of data.
+
+> Note the VNG format allows individual vector segments to be read in isolation
+> and the in-memory VNG representation supports random access so there is
+> no need to have a seek index for the vector object.
+
+#### Commit History
+
+A branch's commit history is the definitive record of the evolution of data in
+that pool in a transactionally consistent fashion.
+
+Each commit object entry is identified with its `commit ID`.
+Objects are immutable and uniquely named so there is never a concurrent write
+condition.
+
+The "add" and "commit" operations are transactionally stored
+in a chain of commit objects. Any number of adds (and deletes) may appear
+in a commit object. All of the operations that belong to a commit are
+identified with a commit identifier (ID).
+
+As each commit object points to its parent (except for the initial commit
+in main), the collection of commit objects in a pool forms a tree.
+
+Each commit object contains a sequence of _actions_:
+
+* `Add` to add a data object reference to a pool,
+* `Delete` to delete a data object reference from a pool,
+* `Commit` for providing metadata about each commit.
+
+The actions are not grouped directly by their commit ID but instead each
+action serialization includes its commit ID.
+
+The chain of commit objects starting at any commit and following
+the parent pointers to the original commit is called the "commit log".
+This log represents the definitive record of a branch's present
+and historical content, and accessing its complete detail can provide
+insights about data layout, provenance, history, and so forth.
+
+### Transaction Journal
+
+State that is mutable is built upon a transaction journal of immutable
+collections of entries. In this way, there are no objects in the
+storage footprint that are ever modified. Instead, the journal captures
+changes and journal snapshots are used to provide synchronization points
+for efficient access to the journal (so the entire journal need not be
+read to create the current state) and old journal entries may be removed
+based on retention policy.
+
+The journal may be updated concurrently by multiple writers so concurrency
+controls are included (see [Journal Concurrency Control](#journal-concurrency-control)
+below) to provide atomic updates.
+
+A journal entry simply contains actions that modify the visible "state" of
+the pool by changing branch name to commit object mappings. Note that
+adding a commit object to a pool changes nothing until a branch pointer
+is mutated to point at that object.
+
+Each atomic journal commit object is a ZNG file numbered 1 to the end of journal (HEAD),
+e.g., `1.zng`, `2.zng`, etc., each number corresponding to a journal ID.
+The 0 value is reserved as the null journal ID.
+The journal's TAIL begins at 1 and is increased as journal entries are purged.
+Entries are added at the HEAD and removed from the TAIL.
+Once created, a journal entry is never modified but it may be deleted and
+never again allocated.
+There may be one or more entries in each commit object.
+
+Each journal entry implies a snapshot of the data in a pool.
+A snapshot is computed by applying the transactions in sequence from entry
+TAIL to the journal entry in question, up to HEAD. This gives the set of
+commit IDs that comprise a snapshot.
+
+The set of branch pointers in a pool is assembled at any point in the journal's history
+by scanning a journal that includes ADD, UPDATE, and DELETE actions for the
+mapping of a branch name to a commit object. A timestamp is recorded in
+each action to provide for time travel.
+
+For efficiency, a journal entry's snapshot may be stored as a "cached snapshot"
+alongside the journal entry. This way, the snapshot at HEAD may be
+efficiently computed by locating the most recent cached snapshot and scanning
+forward to HEAD.
+
+#### Journal Concurrency Control
+
+To provide for atomic commits, a writer must be able to atomically update
+the HEAD of the log. There are three strategies for doing so.
+
+First, if the cloud service offers "put-if-missing" semantics, then a writer
+can simply read the HEAD file and use put-if-missing to write to the
+journal at position HEAD+1. If this fails because of a race, then the writer
+can simply write at position HEAD+2 and so forth until it succeeds (and
+then update the HEAD object). Note that there can be a race in updating
+HEAD, but HEAD is always less than or equal to the real end of journal,
+and this condition can be self-corrected by probing for HEAD+1 whenever
+the HEAD of the journal is accessed.
+
+> Note that put-if-missing can be emulated on a local file system by opening
+> a file for exclusive access and checking that it has zero length after
+> a successful open.
+
+Second, strong read/write ordering semantics (as exists in [Amazon S3](../integrations/amazon-s3.md))
+can be used to implement transactional journal updates as follows:
+* _TBD: this is worked out but needs to be written up_
+
+Finally, since the above algorithm requires many round trips to the storage
+system and such round trips can be tens of milliseconds, another approach
+is to simply run a lock service as part of a cloud deployment that manages
+a mutex lock for each pool's journal.
+
+#### Configuration State
+
+Configuration state describing a lake or pool is also stored in mutable objects.
+Zed lakes simply use a commit journal to store configuration like the
+list of pools and pool attributes.
+Here, a generic interface to a commit journal manages any configuration
+state simply as a key-value store of snapshots providing time travel over
+the configuration history.
+
+### Merge on Read
+
+To support _sorted scans_,
+data objects are stored in a sorted order defined by the pool's sort key.
+The sort key may be a composite key comprised of primary, secondary, and so on
+component keys.
+
+When the key ranges of objects overlap, they may be read in parallel
+and merged in sorted order.
+This is called the _merge scan_.
+
+If many overlapping data objects arise, performing a merge scan
+on every read can be inefficient.
+This can arise when
+many random data `load` operations involve perhaps "late" data
+(e.g., the pool key is a timestamp and records with old timestamp values regularly
+show up and need to be inserted into the past). The data layout can become
+fragmented and less efficient to scan, requiring a scan to merge data
+from a potentially large number of different objects.
+
+To solve this problem, the Zed lake format follows the
+[LSM](https://en.wikipedia.org/wiki/Log-structured_merge-tree) design pattern.
+
+Since records in each data object are stored in sorted order, a total order over
+a collection of objects (e.g., the collection coming from a specific set of commits)
+can be produced by executing a sorted scan and rewriting the results back to the pool
+in a new commit. In addition, the objects comprising the total order
+do not overlap. This is just the basic LSM algorithm at work.
+
+### Object Naming
+
+```
+<lake-path>/
+  lake.zng
+  pools/
+    HEAD
+    TAIL
+    1.zng
+    2.zng
+    ...
+  <pool-id-1>/
+    branches/
+      HEAD
+      TAIL
+      1.zng
+      2.zng
+      ...
+    commits/
+      <id1>.zng
+      <id2>.zng
+      ...
+    data/
+      <id1>.{zng,vng}
+      <id2>.{zng,vng}
+      ...
+  <pool-id-2>/
+  ...
+```
diff --git a/versioned_docs/version-v1.15.0/language/README.md b/versioned_docs/version-v1.15.0/language/README.md
new file mode 100644
index 00000000..39ff6268
--- /dev/null
+++ b/versioned_docs/version-v1.15.0/language/README.md
@@ -0,0 +1,12 @@
+# The Zed Language
+
+The language documents:
+* provide an [overview](overview.md) of the Zed language,
+* describe Zed's [dataflow model](dataflow-model.md),
+* explain Zed's [data types](data-types.md),
+* show the syntax of [statements](statements.md) that define constants, functions, operators, and named types,
+* describe the syntax of [expressions](expressions.md) and [search expressions](search-expressions.md),
+* explain [lateral subqueries](lateral-subqueries.md),
+* describe [shaping and type fusion](shaping.md), and
+* enumerate the [operators](operators/README.md), [functions](functions/README.md),
+and [aggregate functions](aggregates/README.md) in reference format.
diff --git a/versioned_docs/version-v1.15.0/language/_category_.yaml b/versioned_docs/version-v1.15.0/language/_category_.yaml
new file mode 100644
index 00000000..8ef8329b
--- /dev/null
+++ b/versioned_docs/version-v1.15.0/language/_category_.yaml
@@ -0,0 +1,2 @@
+position: 5
+label: Language
diff --git a/versioned_docs/version-v1.15.0/language/aggregates/README.md b/versioned_docs/version-v1.15.0/language/aggregates/README.md
new file mode 100644
index 00000000..ec47ddd0
--- /dev/null
+++ b/versioned_docs/version-v1.15.0/language/aggregates/README.md
@@ -0,0 +1,21 @@
+# Aggregate Functions
+
+---
+
+Aggregate functions appear in either [summarization](../operators/summarize.md)
+or [expression](../expressions.md) context and produce an aggregate
+value for a sequence of input values.
+ +- [and](and.md) - logical AND of input values +- [any](any.md) - select an arbitrary value from its input +- [avg](avg.md) - average value +- [collect](collect.md) - aggregate values into array +- [collect_map](collect_map.md) - aggregate map values into a single map +- [count](count.md) - count input values +- [dcount](dcount.md) - count distinct input values +- [fuse](fuse.md) - compute a fused type of input values +- [max](max.md) - maximum value of input values +- [min](min.md) - minimum value of input values +- [or](or.md) - logical OR of input values +- [sum](sum.md) - sum of input values +- [union](union.md) - set union of input values diff --git a/versioned_docs/version-v1.15.0/language/aggregates/_category_.yaml b/versioned_docs/version-v1.15.0/language/aggregates/_category_.yaml new file mode 100644 index 00000000..b1b9aa6e --- /dev/null +++ b/versioned_docs/version-v1.15.0/language/aggregates/_category_.yaml @@ -0,0 +1,2 @@ +position: 12 +label: Aggregate Functions diff --git a/versioned_docs/version-v1.15.0/language/aggregates/and.md b/versioned_docs/version-v1.15.0/language/aggregates/and.md new file mode 100644 index 00000000..eaab5a91 --- /dev/null +++ b/versioned_docs/version-v1.15.0/language/aggregates/and.md @@ -0,0 +1,46 @@ +### Aggregate Function + +  **and** — logical AND of input values + +### Synopsis +``` +and(bool) -> bool +``` + +### Description + +The _and_ aggregate function computes the logical AND over all of its input. + +### Examples + +Anded value of simple sequence: +```mdtest-command +echo 'true false true' | zq -z 'and(this)' - +``` +=> +```mdtest-output +false +``` + +Continuous AND of simple sequence: +```mdtest-command +echo 'true false true' | zq -z 'yield and(this)' - +``` +=> +```mdtest-output +true +false +false +``` +Unrecognized types are ignored and not coerced for truthiness: +```mdtest-command +echo 'true "foo" 0 false true' | zq -z 'yield and(this)' - +``` +=> +```mdtest-output +true +true +true +false +false +``` diff --git a/versioned_docs/version-v1.15.0/language/aggregates/any.md b/versioned_docs/version-v1.15.0/language/aggregates/any.md new file mode 100644 index 00000000..895f269c --- /dev/null +++ b/versioned_docs/version-v1.15.0/language/aggregates/any.md @@ -0,0 +1,44 @@ +### Aggregate Function + +  **any** — select an arbitrary input value + +### Synopsis +``` +any(any) -> any +``` + +### Description + +The _any_ aggregate function returns an arbitrary element from its input. +The semantics of how the item is selected is not defined. + +### Examples + +Any picks the first one in this scenario but this behavior is undefined: +```mdtest-command +echo '1 2 3 4' | zq -z 'any(this)' - +``` +=> +```mdtest-output +1 +``` + +Continuous any over a simple sequence: +```mdtest-command +echo '1 2 3 4' | zq -z 'yield any(this)' - +``` +=> +```mdtest-output +1 +1 +1 +1 +``` +Any is not sensitive to mixed types as it just picks one: +```mdtest-command +echo '"foo" 1 2 3 ' | zq -z 'any(this)' - +``` +=> +```mdtest-output +"foo" +``` diff --git a/versioned_docs/version-v1.15.0/language/aggregates/avg.md b/versioned_docs/version-v1.15.0/language/aggregates/avg.md new file mode 100644 index 00000000..4200f490 --- /dev/null +++ b/versioned_docs/version-v1.15.0/language/aggregates/avg.md @@ -0,0 +1,43 @@ +### Aggregate Function + +  **avg** — average value + +### Synopsis +``` +avg(number) -> number +``` + +### Description + +The _avg_ aggregate function computes the mathematical average value of its input. 
+ +### Examples + +Average value of simple sequence: +```mdtest-command +echo '1 2 3 4' | zq -z 'avg(this)' - +``` +=> +```mdtest-output +2.5 +``` + +Continuous average of simple sequence: +```mdtest-command +echo '1 2 3 4' | zq -z 'yield avg(this)' - +``` +=> +```mdtest-output +1. +1.5 +2. +2.5 +``` +Unrecognized types are ignored: +```mdtest-command +echo '1 2 3 4 "foo"' | zq -z 'avg(this)' - +``` +=> +```mdtest-output +2.5 +``` diff --git a/versioned_docs/version-v1.15.0/language/aggregates/collect.md b/versioned_docs/version-v1.15.0/language/aggregates/collect.md new file mode 100644 index 00000000..07ff564e --- /dev/null +++ b/versioned_docs/version-v1.15.0/language/aggregates/collect.md @@ -0,0 +1,45 @@ +### Aggregate Function + +  **collect** — aggregate values into array + +### Synopsis +``` +collect(any) -> [any] +``` + +### Description + +The _collect_ aggregate function organizes its input into an array. +If the input values vary in type, the return type will be an array +of union of the types encountered. + +### Examples + +Simple sequence collected into an array: +```mdtest-command +echo '1 2 3 4' | zq -z 'collect(this)' - +``` +=> +```mdtest-output +[1,2,3,4] +``` + +Continuous collection over a simple sequence: +```mdtest-command +echo '1 2 3 4' | zq -z 'yield collect(this)' - +``` +=> +```mdtest-output +[1] +[1,2] +[1,2,3] +[1,2,3,4] +``` +Mixed types create a union type for the array elements: +```mdtest-command +echo '1 2 3 4 "foo"' | zq -z 'collect(this)' - +``` +=> +```mdtest-output +[1,2,3,4,"foo"] +``` diff --git a/versioned_docs/version-v1.15.0/language/aggregates/collect_map.md b/versioned_docs/version-v1.15.0/language/aggregates/collect_map.md new file mode 100644 index 00000000..9bae505f --- /dev/null +++ b/versioned_docs/version-v1.15.0/language/aggregates/collect_map.md @@ -0,0 +1,37 @@ +### Aggregate Function + +  **collect_map** — aggregate map values into a single map + +### Synopsis +``` +collect_map(|{any:any}|) -> |{any:any}| +``` + +### Description + +The _collect_map_ aggregate function combines map inputs into a single map output. +If _collect_map_ receives multiple values for the same key, the last value received is +retained. If the input keys or values vary in type, the return type will be a map +of union of those types. + +### Examples + +Combine a sequence of records into a map: +```mdtest-command +echo '{stock:"APPL",price:145.03} {stock:"GOOG",price:87.07}' | zq -z 'collect_map(|{stock:price}|)' - +``` +=> +```mdtest-output +|{"APPL":145.03,"GOOG":87.07}| +``` + +Continuous collection over a simple sequence: +```mdtest-command +echo '|{"APPL":145.03}| |{"GOOG":87.07}| |{"APPL":150.13}|' | zq -z 'yield collect_map(this)' - +``` +=> +```mdtest-output +|{"APPL":145.03}| +|{"APPL":145.03,"GOOG":87.07}| +|{"APPL":150.13,"GOOG":87.07}| +``` diff --git a/versioned_docs/version-v1.15.0/language/aggregates/count.md b/versioned_docs/version-v1.15.0/language/aggregates/count.md new file mode 100644 index 00000000..0d79e50e --- /dev/null +++ b/versioned_docs/version-v1.15.0/language/aggregates/count.md @@ -0,0 +1,53 @@ +### Aggregate Function + +  **count** — count input values + +### Synopsis +``` +count() -> uint64 +``` + +### Description + +The _count_ aggregate function computes the number of values in its input. 
+
+### Examples
+
+Count of simple sequence:
+```mdtest-command
+echo '1 2 3' | zq -z 'count()' -
+```
+=>
+```mdtest-output
+3(uint64)
+```
+
+Continuous count of simple sequence:
+```mdtest-command
+echo '1 2 3' | zq -z 'yield count()' -
+```
+=>
+```mdtest-output
+1(uint64)
+2(uint64)
+3(uint64)
+```
+Mixed types are handled:
+```mdtest-command
+echo '1 "foo" 10.0.0.1' | zq -z 'yield count()' -
+```
+=>
+```mdtest-output
+1(uint64)
+2(uint64)
+3(uint64)
+```
+
+Note that the number of input values is counted, unlike the [`len` function](../functions/len.md) which counts the number of elements in a given value:
+```mdtest-command
+echo '[1,2,3]' | zq -z 'count()' -
+```
+=>
+```mdtest-output
+1(uint64)
+```
diff --git a/versioned_docs/version-v1.15.0/language/aggregates/dcount.md b/versioned_docs/version-v1.15.0/language/aggregates/dcount.md
new file mode 100644
index 00000000..6c56e533
--- /dev/null
+++ b/versioned_docs/version-v1.15.0/language/aggregates/dcount.md
@@ -0,0 +1,46 @@
+### Aggregate Function
+
+  **dcount** — count distinct input values
+
+### Synopsis
+```
+dcount(any) -> uint64
+```
+
+### Description
+
+The _dcount_ aggregation function uses HyperLogLog to estimate the number of
+distinct values of the input in a memory-efficient manner.
+
+### Examples
+
+Count of distinct values in a simple sequence:
+```mdtest-command
+echo '1 2 2 3' | zq -z 'dcount(this)' -
+```
+=>
+```mdtest-output
+3(uint64)
+```
+
+Continuous count of simple sequence:
+```mdtest-command
+echo '1 2 2 3' | zq -z 'yield dcount(this)' -
+```
+=>
+```mdtest-output
+1(uint64)
+2(uint64)
+2(uint64)
+3(uint64)
+```
+Mixed types are handled:
+```mdtest-command
+echo '1 "foo" 10.0.0.1' | zq -z 'yield dcount(this)' -
+```
+=>
+```mdtest-output
+1(uint64)
+2(uint64)
+3(uint64)
+```
diff --git a/versioned_docs/version-v1.15.0/language/aggregates/fuse.md b/versioned_docs/version-v1.15.0/language/aggregates/fuse.md
new file mode 100644
index 00000000..c3a28a6e
--- /dev/null
+++ b/versioned_docs/version-v1.15.0/language/aggregates/fuse.md
@@ -0,0 +1,37 @@
+### Aggregate Function
+
+  **fuse** — compute a fused type of input values
+
+### Synopsis
+```
+fuse(any) -> type
+```
+
+### Description
+
+The _fuse_ aggregate function applies [type fusion](../shaping.md#type-fusion)
+to its input and returns the fused type.
+
+This aggregation is useful with group-by for data exploration and discovery
+when searching for shaping rules to cluster a large number of varied input
+types to a smaller number of fused types each from a set of interrelated types.
+
+### Examples
+
+Fuse two records:
+```mdtest-command
+echo '{a:1,b:2}{a:2,b:"foo"}' | zq -z 'fuse(this)' -
+```
+=>
+```mdtest-output
+<{a:int64,b:(int64,string)}>
+```
+Fuse records with a group-by key:
+```mdtest-command
+echo '{a:1,b:"bar"}{a:2.1,b:"foo"}{a:3,b:"bar"}' | zq -z 'fuse(this) by b | sort' -
+```
+=>
+```mdtest-output
+{b:"bar",fuse:<{a:int64,b:string}>}
+{b:"foo",fuse:<{a:float64,b:string}>}
+```
diff --git a/versioned_docs/version-v1.15.0/language/aggregates/max.md b/versioned_docs/version-v1.15.0/language/aggregates/max.md
new file mode 100644
index 00000000..aabc6352
--- /dev/null
+++ b/versioned_docs/version-v1.15.0/language/aggregates/max.md
@@ -0,0 +1,43 @@
+### Aggregate Function
+
+  **max** — maximum value of input values
+
+### Synopsis
+```
+max(number) -> number
+```
+
+### Description
+
+The _max_ aggregate function computes the maximum value of its input.
+
+### Examples
+
+Maximum value of simple sequence:
+```mdtest-command
+echo '1 2 3 4' | zq -z 'max(this)' -
+```
+=>
+```mdtest-output
+4
+```
+
+Continuous maximum of simple sequence:
+```mdtest-command
+echo '1 2 3 4' | zq -z 'yield max(this)' -
+```
+=>
+```mdtest-output
+1
+2
+3
+4
+```
+Unrecognized types are ignored:
+```mdtest-command
+echo '1 2 3 4 "foo"' | zq -z 'max(this)' -
+```
+=>
+```mdtest-output
+4
+```
diff --git a/versioned_docs/version-v1.15.0/language/aggregates/min.md b/versioned_docs/version-v1.15.0/language/aggregates/min.md
new file mode 100644
index 00000000..ede710f3
--- /dev/null
+++ b/versioned_docs/version-v1.15.0/language/aggregates/min.md
@@ -0,0 +1,43 @@
+### Aggregate Function
+
+  **min** — minimum value of input values
+
+### Synopsis
+```
+min(number) -> number
+```
+
+### Description
+
+The _min_ aggregate function computes the minimum value of its input.
+
+### Examples
+
+Minimum value of simple sequence:
+```mdtest-command
+echo '1 2 3 4' | zq -z 'min(this)' -
+```
+=>
+```mdtest-output
+1
+```
+
+Continuous minimum of simple sequence:
+```mdtest-command
+echo '1 2 3 4' | zq -z 'yield min(this)' -
+```
+=>
+```mdtest-output
+1
+1
+1
+1
+```
+Unrecognized types are ignored:
+```mdtest-command
+echo '1 2 3 4 "foo"' | zq -z 'min(this)' -
+```
+=>
+```mdtest-output
+1
+```
diff --git a/versioned_docs/version-v1.15.0/language/aggregates/or.md b/versioned_docs/version-v1.15.0/language/aggregates/or.md
new file mode 100644
index 00000000..ec2c838e
--- /dev/null
+++ b/versioned_docs/version-v1.15.0/language/aggregates/or.md
@@ -0,0 +1,46 @@
+### Aggregate Function
+
+  **or** — logical OR of input values
+
+### Synopsis
+```
+or(bool) -> bool
+```
+
+### Description
+
+The _or_ aggregate function computes the logical OR over all of its input.
+
+### Examples
+
+Ored value of simple sequence:
+```mdtest-command
+echo 'false true false' | zq -z 'or(this)' -
+```
+=>
+```mdtest-output
+true
+```
+
+Continuous OR of simple sequence:
+```mdtest-command
+echo 'false true false' | zq -z 'yield or(this)' -
+```
+=>
+```mdtest-output
+false
+true
+true
+```
+Unrecognized types are ignored and not coerced for truthiness:
+```mdtest-command
+echo 'false "foo" 1 true false' | zq -z 'yield or(this)' -
+```
+=>
+```mdtest-output
+false
+false
+false
+true
+true
+```
diff --git a/versioned_docs/version-v1.15.0/language/aggregates/sum.md b/versioned_docs/version-v1.15.0/language/aggregates/sum.md
new file mode 100644
index 00000000..6f4e6dc6
--- /dev/null
+++ b/versioned_docs/version-v1.15.0/language/aggregates/sum.md
@@ -0,0 +1,43 @@
+### Aggregate Function
+
+  **sum** — sum of input values
+
+### Synopsis
+```
+sum(number) -> number
+```
+
+### Description
+
+The _sum_ aggregate function computes the mathematical sum of its input.
+
+### Examples
+
+Sum of simple sequence:
+```mdtest-command
+echo '1 2 3 4' | zq -z 'sum(this)' -
+```
+=>
+```mdtest-output
+10
+```
+
+Continuous sum of simple sequence:
+```mdtest-command
+echo '1 2 3 4' | zq -z 'yield sum(this)' -
+```
+=>
+```mdtest-output
+1
+3
+6
+10
+```
+Unrecognized types are ignored:
+```mdtest-command
+echo '1 2 3 4 "foo"' | zq -z 'sum(this)' -
+```
+=>
+```mdtest-output
+10
+```
diff --git a/versioned_docs/version-v1.15.0/language/aggregates/union.md b/versioned_docs/version-v1.15.0/language/aggregates/union.md
new file mode 100644
index 00000000..4f19fc61
--- /dev/null
+++ b/versioned_docs/version-v1.15.0/language/aggregates/union.md
@@ -0,0 +1,47 @@
+### Aggregate Function
+
+  **union** — set union of input values
+
+### Synopsis
+```
+union(any) -> |[any]|
+```
+
+### Description
+
+The _union_ aggregate function computes a set union of its input values.
+If the values are of uniform type, then the output is a set of that type.
+If the values are of mixed types, then the output is a set of union of the
+types encountered.
+
+### Examples
+
+Set union of simple sequence:
+```mdtest-command
+echo '1 2 3 3' | zq -z 'union(this)' -
+```
+=>
+```mdtest-output
+|[1,2,3]|
+```
+
+Continuous set union of simple sequence:
+```mdtest-command
+echo '1 2 3 3' | zq -z 'yield union(this)' -
+```
+=>
+```mdtest-output
+|[1]|
+|[1,2]|
+|[1,2,3]|
+|[1,2,3]|
+```
+Mixed types create a union type for the set elements:
+```mdtest-command-issue-3610
+echo '1 2 3 "foo"' | zq -z 'set:=union(this) | yield this,typeof(set)' -
+```
+=>
+```mdtest-output-issue-3610
+{set:|[1,2,3,"foo"]|}
+<|[(int64,string)]|>
+```
diff --git a/versioned_docs/version-v1.15.0/language/conventions.md b/versioned_docs/version-v1.15.0/language/conventions.md
new file mode 100644
index 00000000..43260136
--- /dev/null
+++ b/versioned_docs/version-v1.15.0/language/conventions.md
@@ -0,0 +1,19 @@
+---
+sidebar_position: 9
+sidebar_label: Conventions
+---
+
+# Type Conventions
+
+Function arguments and operator input values are all dynamically typed,
+yet certain functions expect certain specific [data types](data-types.md)
+or classes of data types. To this end, the function and operator prototypes
+in the Zed documentation include several type classes as follows:
+* _any_ - any Zed data type
+* _float_ - any floating point Zed type
+* _int_ - any signed or unsigned Zed integer type
+* _number_ - either float or int
+
+Note that there is no "any" type in Zed as all super-structured data is
+comprehensively typed; "any" here simply refers to a value that is allowed
+to take on any Zed type.
diff --git a/versioned_docs/version-v1.15.0/language/data-types.md b/versioned_docs/version-v1.15.0/language/data-types.md
new file mode 100644
index 00000000..fc7b0111
--- /dev/null
+++ b/versioned_docs/version-v1.15.0/language/data-types.md
@@ -0,0 +1,348 @@
+---
+sidebar_position: 3
+sidebar_label: Data Types
+---
+
+# Data Types
+
+The Zed language includes most data types of a typical programming language
+as defined in the [Zed data model](../formats/zed.md).
+
+The syntax of individual literal values generally follows
+the [ZSON syntax](../formats/zson.md) with the exception that
+[type decorators](../formats/zson.md#22-type-decorators)
+are not included in the language. Instead, a
+[type cast](expressions.md#casts) may be used in any expression for explicit
+type conversion.
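+
+For example, a cast can be written by applying a type name to a value like a
+function call (a minimal sketch of the cast syntax described in the
+[expressions documentation](expressions.md#casts)):
+```
+echo '1 2 3' | zq -z 'yield int8(this)' -
+```
+which produces
+```
+1(int8)
+2(int8)
+3(int8)
+```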
+
+In particular, the syntax of primitive types follows the
+[primitive-value definitions](../formats/zson.md#23-primitive-values) in ZSON
+as well as the various [complex value definitions](../formats/zson.md#24-complex-values)
+like records, arrays, sets, and so forth. However, complex values are not limited to
+constant values like ZSON and can be composed from [literal expressions](expressions.md#literals).
+
+## First-class Types
+
+Like the Zed data model, the Zed language has first-class types:
+any Zed type may be used as a value.
+
+The primitive types are listed in the
+[data model specification](../formats/zed.md#1-primitive-types)
+and have the same syntax in the Zed language. Complex types also follow
+the ZSON syntax. Note that the type of a type value is simply `type`.
+
+As in ZSON, _when types are used as values_, e.g., in a Zed expression,
+they must be referenced within angle brackets. That is, the integer type
+`int64` is expressed as a type value using the syntax `<int64>`.
+
+Complex types in the Zed language follow the ZSON syntax as well. Here are
+a few examples:
+* a simple record type - `{x:int64,y:int64}`
+* an array of integers - `[int64]`
+* a set of strings - `|[string]|`
+* a map of string keys to integer values - `|{string:int64}|`
+* a union of string and integer - `(string,int64)`
+
+Complex types may be composed, as in `[({s:string},{x:int64})]` which is
+an array of type `union` of two types of records.
+
+The [`typeof` function](functions/typeof.md) returns a value's type as
+a value, e.g., `typeof(1)` is `<int64>` and `typeof(<int64>)` is `<type>`.
+
+First-class types are quite powerful because types can
+serve as group-by keys or be used in ["data shaping"](shaping.md) logic.
+A common workflow for data introspection is to first perform a search of
+exploratory data and then count the shapes of each type of data as follows:
+```
+search ... | count() by typeof(this)
+```
+For example,
+```mdtest-command
+echo '1 2 "foo" 10.0.0.1 <string>' | zq -z 'count() by typeof(this) | sort this' -
+```
+produces
+```mdtest-output
+{typeof:<int64>,count:2(uint64)}
+{typeof:<string>,count:1(uint64)}
+{typeof:<ip>,count:1(uint64)}
+{typeof:<type>,count:1(uint64)}
+```
+When running such a query over complex, semi-structured data, the results can
+be quite illuminating and can inform the design of "data shaping" Zed queries
+to transform raw, messy data into clean data for downstream tooling.
+
+Note the somewhat subtle difference between a record value with a field `t` of
+type `type` whose value is type `string`
+```
+{t:<string>}
+```
+and a record type used as a value
+```
+<{t:string}>
+```
+
+## Named Types
+
+As in any modern programming language, types can be named and the type names
+persist into the data model and thus into the serialized input and output.
+
+Named types may be defined in three ways:
+* with a [`type` statement](statements.md#type-statements),
+* with a definition inside of another type, or
+* by the input data itself.
+
+Type names that are embedded in another type have the form
+```
+name=type
+```
+and create a binding between the indicated string `name` and the specified type.
+For example,
+```
+type socket = {addr:ip,port:port=uint16}
+```
+defines a named type `socket` that is a record with field `addr` of type `ip`
+and field `port` of type "port", where type "port" is a named type for type `uint16`.
+
+Named types may also be defined by the input data itself, as Zed data is
+comprehensively self-describing.
+
+When named types are defined in the input data, there is no need to declare their
+type in a query.
+In this case, a Zed expression may refer to the type by the name that simply
+appears to the runtime as a side effect of operating upon the data. If the type
+name referred to in this way does not exist, then the type value reference
+results in `error("missing")`. For example,
+```mdtest-command
+echo '1(=foo) 2(=bar) 3(=foo)' | zq -z 'typeof(this)==<foo>' -
+```
+results in
+```mdtest-output
+1(=foo)
+3(=foo)
+```
+and
+```mdtest-command
+echo '1(=foo)' | zq -z 'yield <foo>' -
+```
+results in
+```mdtest-output
+<foo=int64>
+```
+but
+```mdtest-command
+zq -z 'yield <foo>'
+```
+gives
+```mdtest-output
+error("missing")
+```
+Each instance of a named type definition overrides any earlier definition.
+In this way, types are local in scope.
+
+Each value that references a named type retains its local definition of the
+named type, retaining the proper type binding while accommodating changes in a
+particular named type. For example,
+```mdtest-command
+echo '1(=foo) 2(=bar) "hello"(=foo) 3(=foo)' | zq -z 'count() by typeof(this) | sort this' -
+```
+results in
+```mdtest-output
+{typeof:<bar=int64>,count:1(uint64)}
+{typeof:<foo=int64>,count:2(uint64)}
+{typeof:<foo=string>,count:1(uint64)}
+```
+Here, the two versions of type "foo" are retained in the group-by results.
+
+In general, it is bad practice to define multiple versions of a single named type,
+though the Zed system and Zed data model accommodate such dynamic bindings.
+Managing and enforcing the relationship between type names and their type definitions
+on a global basis (e.g., across many different data pools in a Zed lake) is outside
+the scope of the Zed data model and language. That said, Zed provides flexible
+building blocks so systems can define their own schema versioning and schema
+management policies on top of these Zed primitives.
+
+Zed's [super-structured data model](../formats/README.md#2-zed-a-super-structured-pattern)
+is a superset of relational tables and
+the Zed language's type system can easily make this connection.
+As an example, consider this type definition for "employee":
+```
+type employee = {id:int64,first:string,last:string,job:string,salary:float64}
+```
+In SQL, you might find the top five salaries by last name with
+```
+SELECT last,salary
+FROM employee
+ORDER BY salary DESC
+LIMIT 5
+```
+In Zed, you would say
+```
+from anywhere | typeof(this)==<employee> | cut last,salary | sort -r salary | head 5
+```
+and since type comparisons are so useful and common, the [`is` function](functions/is.md)
+can be used to perform the type match:
+```
+from anywhere | is(<employee>) | cut last,salary | sort -r salary | head 5
+```
+The power of Zed is that you can interpret data on the fly as belonging to
+a certain schema, in this case "employee", and those records can be intermixed
+with other relevant data. There is no need to create a table called "employee"
+and put the data into the table before that data can be queried as an "employee".
+And if the schema or type name for "employee" changes, queries still continue
+to work.
+
+## First-class Errors
+
+As with types, errors in Zed are first-class: any value can be transformed
+into an error by wrapping it in the Zed [`error` type](../formats/zed.md#27-error).
+
+In general, expressions and functions that result in errors simply return
+a value of type `error` as a result.
+This encourages a powerful flow-style
+of error handling where errors simply propagate from one operation to the
+next and land in the output alongside non-error values to provide a very helpful
+context and rich information for tracking down the source of errors. There is
+no need to check for error conditions everywhere or look through auxiliary
+logs to find out what happened.
+
+For example,
+input values can be transformed to errors as follows:
+```mdtest-command
+echo '0 "foo" 10.0.0.1' | zq -z 'error(this)' -
+```
+produces
+```mdtest-output
+error(0)
+error("foo")
+error(10.0.0.1)
+```
+More practically, errors from the runtime show up as error values.
+For example,
+```mdtest-command
+echo 0 | zq -z '1/this' -
+```
+produces
+```mdtest-output
+error("divide by zero")
+```
+And since errors are first-class and just values, they have a type.
+In particular, they are a complex type where the error value's type is the
+complex type `error` containing the type of the value. For example,
+```mdtest-command
+echo 0 | zq -z 'typeof(1/this)' -
+```
+produces
+```mdtest-output
+<error(string)>
+```
+First-class errors are particularly useful for creating structured errors.
+When a Zed query encounters a problematic condition,
+instead of silently dropping the problematic value
+and logging an error obscurely into some hard-to-find system log as so many
+ETL pipelines do, the Zed logic can
+preferably wrap the offending value as an error and propagate it to its output.
+
+For example, suppose a bad value shows up:
+```
+{kind:"bad", stuff:{foo:1,bar:2}}
+```
+A Zed [shaper](shaping.md) could catch the bad value (e.g., as a default
+case in a [`switch`](operators/switch.md) topology) and propagate it as
+an error using the Zed expression:
+```
+yield error({message:"unrecognized input",input:this})
+```
+then such errors could be detected and searched for downstream with the
+[`is_error` function](functions/is_error.md).
+For example,
+```
+is_error(this)
+```
+on the wrapped error from above produces
+```
+error({message:"unrecognized input",input:{kind:"bad", stuff:{foo:1,bar:2}}})
+```
+There is no need to create special tables in a complex warehouse-style ETL
+to land such errors as they can simply land next to the output values themselves.
+
+And when transformations cascade one into the next as different stages of
+an ETL pipeline, errors can be wrapped one by one forming a "stack trace"
+or lineage of where the error started and what stages it traversed before
+landing at the final output stage.
+
+Errors will unfortunately and inevitably occur even in production,
+but having a first-class data type to manage them all while allowing them to
+peacefully coexist with valid production data is a novel and
+useful approach that Zed enables.
+
+### Missing and Quiet
+
+Zed's heterogeneous data model allows for queries
+that operate over different types of data whose structure and type
+may not be known ahead of time, e.g., different
+types of records with different field names and varying structure.
+Thus, a reference to a field, e.g., `this.x` may be valid for some values
+that include a field called `x` but not valid for those that do not.
+
+What is the value of `x` when the field `x` does not exist?
+
+A similar question faced SQL when it was adapted in various different forms
+to operate on semi-structured data like JSON or XML. SQL already had the `NULL` value
+so perhaps a reference to a missing value could simply be `NULL`.
+ +But JSON also has `null`, so a reference to `x` in the JSON value +``` +{"x":null} +``` +and a reference to `x` in the JSON value +``` +{} +``` +would have the same value of `NULL`. Furthermore, an expression like `x==NULL` +could not differentiate between these two cases. + +To solve this problem, the `MISSING` value was proposed to represent the value that +results from accessing a field that is not present. Thus, `x==NULL` and +`x==MISSING` could disambiguate the two cases above. + +Zed, instead, recognizes that the SQL value `MISSING` is a paradox: +I'm here but I'm not. + +In reality, a `MISSING` value is not a value. It's an error condition +that resulted from trying to reference something that didn't exist. + +So why should we pretend that this is a bona fide value? SQL adopted this +approach because it lacks first-class errors. + +But Zed has first-class errors so +a reference to something that does not exist is an error of type +`error(string)` whose value is `error("missing")`. For example, +```mdtest-command +echo "{x:1} {y:2}" | zq -z 'yield x' - +``` +produces +```mdtest-output +1 +error("missing") +``` +Sometimes you want missing errors to show up and sometimes you don't. +The [`quiet` function](functions/quiet.md) transforms missing errors into +"quiet errors". A quiet error is the value `error("quiet")` and is ignored +by most operators, in particular `yield`. For example, +```mdtest-command +echo "{x:1} {y:2}" | zq -z "yield quiet(x)" - +``` +produces +```mdtest-output +1 +``` + +And what if you want a default value instead of a missing error? The +[`coalesce` function](functions/coalesce.md) returns the first value that is not +null, `error("missing")`, or `error("quiet")`. For example, +```mdtest-command +echo "{x:1} {y:2}" | zq -z "yield coalesce(x, 0)" - +``` +produces +```mdtest-output +1 +0 +``` diff --git a/versioned_docs/version-v1.15.0/language/dataflow-model.md b/versioned_docs/version-v1.15.0/language/dataflow-model.md new file mode 100644 index 00000000..92144a0a --- /dev/null +++ b/versioned_docs/version-v1.15.0/language/dataflow-model.md @@ -0,0 +1,293 @@ +--- +sidebar_position: 2 +sidebar_label: Dataflow Model +--- + +# The Dataflow Model + +In Zed, each operator takes its input from the output of its upstream operator beginning +either with a data source or with an implied source. + +All available operators are listed on the [reference page](operators/README.md). + +## Dataflow Sources + +In addition to the data sources specified as files on the `zq` command line, +a source may also be specified with the [`from` operator](operators/from.md). + +When running on the command-line, `from` may refer to a file, an HTTP +endpoint, or an [S3](../integrations/amazon-s3.md) URI. When running in a [Zed lake](../commands/zed.md), `from` typically +refers to a collection of data called a "data pool" and is referenced using +the pool's name much as SQL references database tables by their name. + +For more detail, see the reference page of the [`from` operator](operators/from.md), +but as an example, you might use the `get` form of `from` to fetch data from an +HTTP endpoint and process it with Zed, in this case, to extract the description +and license of a GitHub repository: +``` +zq -f text "get https://api.github.com/repos/brimdata/zed | yield description,license.name" +``` +When a Zed query is run on the command-line with `zq`, the `from` source is +typically omitted and implied instead by the command-line file arguments. 
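+
+For example, these two commands run the same query, one with an explicit
+`from` source and one with the source implied by a file argument (a sketch,
+assuming a local file `data.zson`):
+```
+zq 'from data.zson | count()'
+zq 'count()' data.zson
+```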
+
+The input may be stdin via `-` as in
+```
+echo '"hello, world"' | zq -
+```
+The examples throughout the language documentation use this "echo pattern"
+of piping input to `zq -` to illustrate language semantics.
+Note that in these examples, the input values are expressed as Zed values serialized
+in the [ZSON text format](../formats/zson.md)
+and the query text, given as the first argument of the `zq` command,
+is expressed in the syntax of the Zed language described here.
+
+## Dataflow Operators
+
+Each operator is identified by name and performs a specific operation
+on a stream of records.
+
+Some operators, like
+[`summarize`](operators/summarize.md) or [`sort`](operators/sort.md),
+read all of their input before producing output, though
+`summarize` can produce incremental results when the group-by key is
+aligned with the order of the input.
+
+For large queries that process all of their input, time may pass before
+seeing any output.
+
+On the other hand, most operators produce incremental output by operating
+on values as they are produced. For example, a long running query that
+produces incremental output will stream results as they are produced, i.e.,
+running `zq` to standard output will display results incrementally.
+
+The [`search`](operators/search.md) and [`where`](operators/where.md)
+operators "find" values in their input and drop
+the ones that do not match what is being looked for.
+
+The [`yield` operator](operators/yield.md) emits one or more output values
+for each input value based on arbitrary [expressions](expressions.md),
+providing a convenient means to derive arbitrary output values as a function
+of each input value, much like the map concept in the MapReduce framework.
+
+The [`fork` operator](operators/fork.md) copies its input to parallel
+legs of a query. The output of these parallel paths can be combined
+in a number of ways:
+* merged in sorted order using the [`merge` operator](operators/merge.md),
+* joined using the [`join` operator](operators/join.md), or
+* combined in an undefined order using the implied [`combine` operator](operators/combine.md).
+
+A path can also be split into multiple query legs using the
+[`switch` operator](operators/switch.md), in which data is routed to only one
+corresponding leg (or dropped) based on the switch clauses.
+
+Switch operators typically
+involve multiline Zed programs, which are easiest to edit in a file. For example,
+suppose this text is in a file called `switch.zed`:
+```mdtest-input switch.zed
+switch this (
+  case 1 => yield {val:this,message:"one"}
+  case 2 => yield {val:this,message:"two"}
+  default => yield {val:this,message:"many"}
+) | merge val
+```
+Then, running `zq` with `-I switch.zed` like so:
+```mdtest-command
+echo '1 2 3 4' | zq -z -I switch.zed -
+```
+produces
+```mdtest-output
+{val:1,message:"one"}
+{val:2,message:"two"}
+{val:3,message:"many"}
+{val:4,message:"many"}
+```
+Note that the output order of the switch legs is undefined (indeed they run
+in parallel on multiple threads). To establish a consistent sequence order,
+a [`merge` operator](operators/merge.md)
+may be applied at the output of the switch specifying a sort key upon which
+to order the upstream data.
+Often such order does not matter (e.g., when the output
+of the switch hits an [aggregator](aggregates/README.md)), in which case it is typically more performant
+to omit the merge (though the Zed system will often delete such unnecessary
+operations automatically as part of optimizing queries when they are compiled).
+
+If no `merge` or `join` is indicated downstream of a `fork` or `switch`,
+then the implied `combine` operator is presumed. In this case, values are
+forwarded from the switch to the downstream operator in an undefined order.
+
+## The Special Value `this`
+
+In Zed, there are no looping constructs and variables are limited to binding
+values between [lateral scopes](lateral-subqueries.md#lateral-scope).
+Instead, the input sequence
+to an operator is produced continuously and any output values are derived
+from input values.
+
+In contrast to SQL, where a query may refer to input tables by name,
+there are no explicit tables and a Zed operator instead refers
+to its input values using the special identifier `this`.
+
+For example, sorting the following input
+```mdtest-command
+echo '"foo" "bar" "BAZ"' | zq -z sort -
+```
+produces this case-sensitive output:
+```mdtest-output
+"BAZ"
+"bar"
+"foo"
+```
+But we can make the sort case-insensitive by applying a [function](functions/README.md) to the
+input values with the expression `lower(this)`, which converts
+each value to lower-case for use in the sort without actually modifying
+the input value, e.g.,
+```
+echo '"foo" "bar" "BAZ"' | zq -z 'sort lower(this)' -
+```
+produces
+```
+"bar"
+"BAZ"
+"foo"
+```
+
+## Implied Field References
+
+A common use case for Zed is to process sequences of record-oriented data
+(e.g., arising from formats like JSON or Avro) in the form of events
+or structured logs. In this case, the input values to the operators
+are Zed [records](../formats/zed.md#21-record) and the fields of a record are referenced with the dot operator.
+
+For example, if the input above were a sequence of records instead of strings
+and perhaps contained a second field, e.g.,
+```
+{s:"foo",x:1}
+{s:"bar",x:2}
+{s:"BAZ",x:3}
+```
+Then we could refer to the field `s` using `this.s` and sort the records
+as above with `sort this.s`, which would give
+```
+{s:"BAZ",x:3}
+{s:"bar",x:2}
+{s:"foo",x:1}
+```
+This pattern is so common that field references to `this` may be shortened
+by simply referring to the field by name wherever a Zed expression is expected,
+e.g.,
+```
+sort s
+```
+is shorthand for `sort this.s`.
+
+## Field Assignments
+
+A typical operation in records involves
+adding or changing the fields of a record using the [`put` operator](operators/put.md)
+or extracting a subset of fields using the [`cut` operator](operators/cut.md).
+Also, when aggregating data using group-by keys, the group-by assignments
+create new named record fields.
+
+In all of these cases, the Zed language uses the token `:=` to denote
+field assignment. For example,
+```
+put x:=y+1
+```
+or
+```
+summarize salary:=sum(income) by address:=lower(address)
+```
+This style of "assignment" to a record value is distinguished from the `=`
+token which binds a locally scoped name to a value that can be referenced
+in later expressions.
+
+## Implied Operators
+
+When Zed is run in an application like [Zui](https://zui.brimdata.io),
+queries are often composed interactively in a "search bar" experience.
+
+The language design here attempts to support both this "lean forward" pattern of usage
+and a "coding style" of query writing where the queries might be large
+and complex, e.g., to perform transformations in a data pipeline, where
+the Zed queries are stored under source-code control perhaps in GitHub or
+in Zui's query library.
+
+To facilitate both a programming-like model and an ad hoc search
+experience, Zed has a canonical, long form that can be abbreviated
+using syntax that supports an agile, interactive query workflow.
+To this end, Zed allows certain operator names to be optionally omitted when
+they can be inferred from context. For example, the expression following
+the [`summarize` operator](operators/summarize.md)
+```
+summarize count() by id
+```
+is unambiguously an aggregation and can be shortened to
+```
+count() by id
+```
+Likewise, a very common lean-forward use pattern is "searching," so by default,
+expressions are interpreted as keyword searches, e.g.,
+```
+search foo bar or x > 100
+```
+is abbreviated
+```
+foo bar or x > 100
+```
+Furthermore, if an operator-free expression is not valid syntax for
+a search expression but is a valid [Zed expression](expressions.md),
+then the abbreviation is treated as having an implied `yield` operator, e.g.,
+```
+{s:lower(s)}
+```
+is shorthand for
+```
+yield {s:lower(s)}
+```
+When operator names are omitted, `search` has precedence over `yield`, so
+```
+foo
+```
+is interpreted as a search for the string "foo" rather than a yield of
+the implied record field named `foo`.
+
+Another common query pattern involves adding or mutating fields of records
+where the input is presumed to be a sequence of records.
+The [`put` operator](operators/put.md) provides this mechanism and the `put`
+keyword is implied by the [field assignment](dataflow-model.md#field-assignments) syntax `:=`.
+
+For example, the operation
+```
+put y:=2*x+1
+```
+can be expressed simply as
+```
+y:=2*x+1
+```
+When composing long-form queries that are shared via Zui or managed in GitHub,
+it is best practice to include all operator names in the Zed source text.
+
+In summary, if no operator name is given, the implied operator is determined
+from the operator-less source text, in the order given, as follows:
+* If the text can be interpreted as a search expression, then the operator is `search`.
+* If the text can be interpreted as a boolean expression, then the operator is `where`.
+* If the text can be interpreted as one or more field assignments, then the operator is `put`.
+* If the text can be interpreted as an aggregation, then the operator is `summarize`.
+* If the text can be interpreted as an expression, then the operator is `yield`.
+* Otherwise, the text causes a compile-time error.
+
+When in doubt, you can always check what the compiler is doing under the hood
+by running `zq` with the `-C` flag to print the parsed query in "canonical form", e.g.,
+```mdtest-command
+zq -C foo
+zq -C 'is(<foo>)'
+zq -C 'count()'
+zq -C '{a:x+1,b:y-1}'
+zq -C 'a:=x+1,b:=y-1'
+```
+produces
+```mdtest-output
+search foo
+where is(<foo>)
+summarize
+    count()
+yield {a:x+1,b:y-1}
+put a:=x+1,b:=y-1
+```
diff --git a/versioned_docs/version-v1.15.0/language/expressions.md b/versioned_docs/version-v1.15.0/language/expressions.md
new file mode 100644
index 00000000..6e4c563f
--- /dev/null
+++ b/versioned_docs/version-v1.15.0/language/expressions.md
@@ -0,0 +1,560 @@
+---
+sidebar_position: 5
+sidebar_label: Expressions
+---
+
+# Expressions
+
+Zed expressions follow the typical patterns in programming languages.
+Expressions are typically used within data flow operators
+to perform computations on input values and are typically evaluated once per
+input value [`this`](dataflow-model.md#the-special-value-this).
+
+For example, `yield`, `where`, `cut`, `put`, `sort` and so forth all take
+various expressions as part of their operation.
+
+## Arithmetic
+
+Arithmetic operations (`*`, `/`, `%`, `+`, `-`) follow customary syntax
+and semantics and are left-associative with multiplication and division having
+precedence over addition and subtraction. `%` is the modulo operator.
+
+For example,
+```mdtest-command
+zq -z 'yield 2*3+1, 11%5, 1/0, "foo"+"bar"'
+```
+produces
+```mdtest-output
+7
+1
+error("divide by zero")
+"foobar"
+```
+
+## Comparisons
+
+Comparison operations (`<`, `<=`, `==`, `!=`, `>`, `>=`) follow customary syntax
+and semantics and result in a truth value of type `bool` or an error.
+A comparison expression is any valid Zed expression compared to any other
+valid Zed expression using a comparison operator.
+
+When the operands are coercible to like types, the result is the truth value
+of the comparison. Otherwise, the result is `false`.
+
+If either operand to a comparison
+is `error("missing")`, then the result is `error("missing")`.
+
+For example,
+```mdtest-command
+zq -z 'yield 1 > 2, 1 < 2, "b" > "a", 1 > "a", 1 > x'
+```
+produces
+```mdtest-output
+false
+true
+true
+false
+error("missing")
+```
+
+## Containment
+
+The `in` operator has the form
+```
+<item-expr> in <container-expr>
+```
+and is true if the `<item-expr>` expression results in a value that
+appears somewhere in the `<container-expr>` as an exact match of the item.
+The right-hand side value can be any Zed value and complex values are
+recursively traversed to determine if the item is present anywhere within them.
+
+For example,
+```mdtest-command
+echo '{a:[1,2]}{b:{c:3}}{d:{e:1}}' | zq -z '1 in this' -
+```
+produces
+```mdtest-output
+{a:[1,2]}
+{d:{e:1}}
+```
+You can also use this operator with a static array:
+```mdtest-command
+echo '{accounts:[{id:1},{id:2},{id:3}]}' | zq -z 'over accounts | where id in [1,2]' -
+```
+produces
+```mdtest-output
+{id:1}
+{id:2}
+```
+
+## Logic
+
+The keywords `and`, `or`, `not`, and `!` perform logic on operands of type `bool`.
+The binary operators `and` and `or` operate on Boolean values and result in
+an error value if either operand is not a Boolean. Likewise, `not` (and its
+equivalent `!`) operates on its unary operand and results in an error if its
+operand is not type `bool`. Unlike many other languages, non-Boolean values are
+not automatically converted to Boolean type using "truthiness" heuristics.
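+
+As a quick illustration (a hedged sketch, not one of this page's tested
+examples), applying a Boolean operator to a non-Boolean operand yields an
+error value rather than a coerced result:
+```
+echo 'true 1' | zq -z 'yield this and true' -
+```
+Here the first input value would produce `true` while the second would produce
+an error, since the `int64` value `1` is not implicitly treated as a Boolean.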
+
+## Field Dereference
+
+Record fields are dereferenced with the dot operator `.` as is customary
+in other languages and have the form
+```
+<value> . <id>
+```
+where `<id>` is an identifier representing the field name referenced.
+If a field name is not representable as an identifier, then [indexing](#indexing)
+may be used with a quoted string to represent any valid field name.
+Such field names can be accessed using
+[`this`](dataflow-model.md#the-special-value-this) and an array-style reference, e.g.,
+`this["field with spaces"]`.
+
+If the dot operator is applied to a value that is not a record
+or if the record does not have the given field, then the result is
+`error("missing")`.
+
+## Indexing
+
+The index operation can be applied to various data types and has the form:
+```
+<value> [ <index> ]
+```
+If the `<value>` expression is a record, then the `<index>` operand
+must be coercible to a string and the result is the record's field
+of that name.
+
+If the `<value>` expression is an array, then the `<index>` operand
+must be coercible to an integer and the result is the
+value in the array of that index.
+
+If the `<value>` expression is a set, then the `<index>` operand
+must be coercible to an integer and the result is the
+value in the set of that index ordered by total order of Zed values.
+
+If the `<value>` expression is a map, then the `<index>` operand
+is presumed to be a key and the corresponding value for that key is
+the result of the operation. If no such key exists in the map, then
+the result is `error("missing")`.
+
+If the `<value>` expression is a string, then the `<index>` operand
+must be coercible to an integer and the result is an integer representing
+the unicode code point at that offset in the string.
+
+If the `<value>` expression is type `bytes`, then the `<index>` operand
+must be coercible to an integer and the result is an unsigned 8-bit integer
+representing the byte value at that offset in the bytes sequence.
+
+## Slices
+
+The slice operation can be applied to various data types and has the form:
+```
+<value> [ <from> : <to> ]
+```
+The `<from>` and `<to>` terms must be expressions that are coercible
+to integers and represent a range of index values to form a subset of elements
+from the `<value>` term provided. The range begins at the `<from>` position
+and ends one before the `<to>` position. A negative
+value of `<from>` or `<to>` represents a position relative to the
+end of the value being sliced.
+
+If the `<value>` expression is an array, then the result is an array of
+elements comprising the indicated range.
+
+If the `<value>` expression is a set, then the result is a set of
+elements comprising the indicated range ordered by total order of Zed values.
+
+If the `<value>` expression is a string, then the result is a substring
+consisting of unicode code points comprising the given range.
+
+If the `<value>` expression is type `bytes`, then the result is a bytes sequence
+consisting of bytes comprising the given range.
+
+## Conditional
+
+A conditional expression has the form
+```
+<boolean> ? <expr> : <expr>
+```
+The `<boolean>` expression is evaluated and must have a result of type `bool`.
+If not, an error results.
+
+If the result is true, then the first `<expr>` expression is evaluated and becomes
+the result. Otherwise, the second `<expr>` expression is evaluated and
+becomes the result.
+
+For example,
+```mdtest-command
+echo '{s:"foo",v:1}{s:"bar",v:2}' | zq -z 'yield (s=="foo") ? v : -v' -
+```
+produces
+```mdtest-output
+1
+-2
+```
+
+Note that if the expression has side effects,
+as with [aggregate function calls](expressions.md#aggregate-function-calls), only the selected expression
+will be evaluated.
+
+For example,
+```mdtest-command
+echo '"foo" "bar" "foo"' | zq -z 'yield this=="foo" ? {foocount:count()} : {barcount:count()}' -
+```
+produces
+```mdtest-output
+{foocount:1(uint64)}
+{barcount:1(uint64)}
+{foocount:2(uint64)}
+```
+
+## Function Calls
+
+Functions perform stateless transformations of their input value to their return
+value and utilize call-by-value semantics with positional and unnamed arguments.
+
+For example,
+```mdtest-command
+zq -z 'yield pow(2,3), lower("ABC")+upper("def"), typeof(1)'
+```
+produces
+```mdtest-output
+8.
+"abcDEF"
+<int64>
+```
+
+Zed includes many [built-in functions](functions/README.md), some of which take
+a variable number of arguments.
+
+Zed also allows you to create [user-defined functions](statements.md#func-statements).
+
+## Aggregate Function Calls
+
+[Aggregate functions](aggregates/README.md) may be called within an expression.
+Unlike the aggregation context provided by a [summarizing group-by](operators/summarize.md), such calls
+in expression context yield an output value for each input value.
+
+Note that because aggregate functions carry state which is typically
+dependent on the order of input values, their use can prevent the runtime
+optimizer from parallelizing a query.
+
+That said, aggregate function calls can be quite useful in a number of contexts.
+For example, a unique ID can be assigned to the input quite easily:
+```mdtest-command
+echo '"foo" "bar" "baz"' | zq -z 'yield {id:count(),value:this}' -
+```
+produces
+```mdtest-output
+{id:1(uint64),value:"foo"}
+{id:2(uint64),value:"bar"}
+{id:3(uint64),value:"baz"}
+```
+In contrast, calling aggregate functions within the [`summarize` operator](operators/summarize.md)
+```mdtest-command
+echo '"foo" "bar" "baz"' | zq -z 'summarize count(),union(this)' -
+```
+produces just one output value
+```mdtest-output
+{count:3(uint64),union:|["bar","baz","foo"]|}
+```
+
+## Literals
+
+Any of the [data types](data-types.md) may be used in expressions
+as long as it is compatible with the semantics of the expression.
+
+String literals are enclosed in either single quotes or double quotes and
+must conform to UTF-8 encoding and follow the JavaScript escaping
+conventions and unicode escape syntax. Also, if the sequence `${` appears
+in a string, the `$` character must be escaped, i.e., `\$`.
+
+### String Interpolation
+
+Strings may include interpolation expressions, which have the form
+```
+${ <expr> }
+```
+In this case, the characters starting with `$` and ending at `}` are substituted
+with the result of evaluating the expression `<expr>`. If this result is not
+a string, it is implicitly cast to a string.
+
+For example,
+```mdtest-command
+echo '{numerator:22.0, denominator:7.0}' | zq -z 'yield "pi is approximately ${numerator / denominator}"' -
+```
+produces
+```mdtest-output
+"pi is approximately 3.142857142857143"
+```
+
+If any template expression results in an error, then the value of the template
+literal is the first error encountered in left-to-right order.
+
+> TBD: we could improve an error result here by creating a structured error
+> containing the string template text along with a list of values/errors of
+> the expressions.
+
+String interpolation may be nested, where `<expr>` contains additional strings
+with interpolated expressions.
+
+For example,
+```mdtest-command
+echo '{foo:"hello", bar:"world", HELLOWORLD:"hi!"}' | zq -z 'yield "oh ${this[upper("${foo + bar}")]}"' -
+```
+produces
+```mdtest-output
+"oh hi!"
+```
+
+### Record Expressions
+
+Record literals have the form
+```
+{ <spec>, <spec>, ... }
+```
+where a `<spec>` has one of three forms:
+```
+<field> : <expr>
+<field>
+...<expr>
+```
+The first form is a customary colon-separated field and value similar to JavaScript,
+where `<field>` may be an identifier or quoted string.
+The second form is an [implied field reference](dataflow-model.md#implied-field-references)
+`<field>`, which is shorthand for `<field>:<field>`. The third form is the `...`
+spread operator which expects a record value as the result of `<expr>` and
+inserts all of the fields from the resulting record.
+If a spread expression results in a non-record type (e.g., errors), then that
+part of the record is simply elided.
+
+The fields of a record expression are evaluated left to right and when
+field names collide the rightmost instance of the name determines that
+field's value.
+
+For example,
+```mdtest-command
+echo '{x:1,y:2,r:{a:1,b:2}}' | zq -z 'yield {a:0},{x}, {...r}, {a:0,...r,b:3}' -
+```
+produces
+```mdtest-output
+{a:0}
+{x:1}
+{a:1,b:2}
+{a:1,b:3}
+```
+
+### Array Expressions
+
+Array literals have the form
+```
+[ <spec>, <spec>, ... ]
+```
+where a `<spec>` has one of two forms:
+```
+<expr>
+...<expr>
+```
+
+The first form is simply an element in the array, the result of `<expr>`. The
+second form is the `...` spread operator which expects an array or set value as
+the result of `<expr>` and inserts all of the values from the result. If a spread
+expression results in neither an array nor set, then the value is elided.
+
+When the expressions result in values of non-uniform type, then the implied
+type of the array is an array of type `union` of the types that appear.
+
+For example,
+```mdtest-command
+zq -z 'yield [1,2,3],["hello","world"]'
+```
+produces
+```mdtest-output
+[1,2,3]
+["hello","world"]
+```
+
+Arrays can be concatenated using the spread operator,
+```mdtest-command
+echo '{a:[1,2],b:[3,4]}' | zq -z 'yield [...a,...b,5]' -
+```
+produces
+```mdtest-output
+[1,2,3,4,5]
+```
+
+### Set Expressions
+
+Set literals have the form
+```
+|[ <spec>, <spec>, ... ]|
+```
+where a `<spec>` has one of two forms:
+```
+<expr>
+...<expr>
+```
+
+The first form is simply an element in the set, the result of `<expr>`. The
+second form is the `...` spread operator which expects an array or set value as
+the result of `<expr>` and inserts all of the values from the result. If a spread
+expression results in neither an array nor set, then the value is elided.
+
+When the expressions result in values of non-uniform type, then the implied
+type of the set is a set of type `union` of the types that appear.
+
+Set values are always organized in their "natural order" independent of the order
+they appear in the set literal.
+
+For example,
+```mdtest-command
+zq -z 'yield |[3,1,2]|,|["hello","world","hello"]|'
+```
+produces
+```mdtest-output
+|[1,2,3]|
+|["hello","world"]|
+```
+
+Arrays and sets can be concatenated using the spread operator,
+```mdtest-command
+echo '{a:[1,2],b:|[2,3]|}' | zq -z 'yield |[...a,...b,4]|' -
+```
+produces
+```mdtest-output
+|[1,2,3,4]|
+```
+
+### Map Expressions
+
+Map literals have the form
+```
+|{ <expr>:<expr>, <expr>:<expr>, ... }|
+```
+where the first expression of each colon-separated entry is the key value
+and the second expression is the value.
+When the key and/or value expressions result in values of non-uniform type,
+then the implied type of the map has a key type and/or value type that is
+a union of the types that appear in each respective category.
+
+For example,
+```mdtest-command
+zq -z 'yield |{"foo":1,"bar"+"baz":2+3}|'
+```
+produces
+```mdtest-output
+|{"foo":1,"barbaz":5}|
+```
+
+### Union Values
+
+A union value can be created with a [cast](expressions.md#casts). For example, a union of types `int64`
+and `string` is expressed as `(int64,string)` and any value that has a type
+that appears in the union type may be cast to that union type.
+Since 1 is an `int64` and "foo" is a `string`, they both can be
+values of type `(int64,string)`, e.g.,
+```mdtest-command
+echo '1 "foo"' | zq -z 'yield cast(this,<(int64,string)>)' -
+```
+produces
+```mdtest-output
+1((int64,string))
+"foo"((int64,string))
+```
+The value underlying a union-tagged value is accessed with the
+[`under` function](functions/under.md):
+```mdtest-command
+echo '1((int64,string))' | zq -z 'yield under(this)' -
+```
+produces
+```mdtest-output
+1
+```
+Union values are powerful because they provide a mechanism to precisely
+describe the type of any nested, semi-structured value composed of elements
+of different types. For example, the type of the value `[1,"foo"]` in JavaScript
+is simply a generic JavaScript "object". But in Zed, the type of this
+value is an array of union of string and integer, e.g.,
+```mdtest-command
+echo '[1,"foo"]' | zq -z 'typeof(this)' -
+```
+produces
+```mdtest-output
+<[(int64,string)]>
+```
+
+## Casts
+
+Type conversion is performed with casts and the built-in [`cast` function](functions/cast.md).
+
+Casts for primitive types have a function-style syntax of the form
+```
+<type> ( <expr> )
+```
+where `<type>` is a [Zed type](data-types.md#first-class-types) and `<expr>` is any Zed expression.
+In the case of primitive types, the type-value angle brackets
+may be omitted, e.g., `<string>(1)` is equivalent to `string(1)`.
+If the result of `<expr>` cannot be converted
+to the indicated type, then the cast's result is an error value.
+
+For example,
+```mdtest-command
+echo '1 200 "123" "200"' | zq -z 'yield int8(this)' -
+```
+produces
+```mdtest-output
+1(int8)
+error({message:"cannot cast to int8",on:200})
+123(int8)
+error({message:"cannot cast to int8",on:"200"})
+```
+
+Casting attempts to be fairly liberal in conversions. For example, values
+of type `time` can be created from a diverse set of date/time input strings
+based on the [Go Date Parser library](https://github.com/araddon/dateparse).
+
+```mdtest-command
+echo '"May 8, 2009 5:57:51 PM" "oct 7, 1970"' | zq -z 'yield time(this)' -
+```
+produces
+```mdtest-output
+2009-05-08T17:57:51Z
+1970-10-07T00:00:00Z
+```
+
+Casts of complex or [named types](data-types.md#named-types) may be performed using type values
+either in functional form or with `cast`:
+```
+<type> ( <expr> )
+cast(<expr>, <type>)
+```
+For example
+```mdtest-command
+echo '80 8080' | zq -z 'type port = uint16 yield <port>(this)' -
+```
+produces
+```mdtest-output
+80(port=uint16)
+8080(port=uint16)
+```
+
+Casts may be used with complex types as well. As long as the target type can
+accommodate the value, the cast will be recursively applied to the components
+of a nested value. For example,
+```mdtest-command
+echo '["10.0.0.1","10.0.0.2"]' | zq -z 'cast(this,<[ip]>)' -
+```
+produces
+```mdtest-output
+[10.0.0.1,10.0.0.2]
+```
+and
+```mdtest-command
+echo '{ts:"1/1/2022",r:{x:"1",y:"2"}} {ts:"1/2/2022",r:{x:3,y:4}}' | zq -z 'cast(this,<{ts:time,r:{x:float64,y:float64}}>)' -
+```
+produces
+```mdtest-output
+{ts:2022-01-01T00:00:00Z,r:{x:1.,y:2.}}
+{ts:2022-01-02T00:00:00Z,r:{x:3.,y:4.}}
+```
diff --git a/versioned_docs/version-v1.15.0/language/functions/README.md b/versioned_docs/version-v1.15.0/language/functions/README.md
new file mode 100644
index 00000000..aa6752a1
--- /dev/null
+++ b/versioned_docs/version-v1.15.0/language/functions/README.md
@@ -0,0 +1,69 @@
+# Functions
+
+---
+
+Functions appear in [expression](../expressions.md) context and
+take Zed values as arguments and produce a value as a result. In addition to
+the built-in functions listed below, Zed also allows for the creation of
+[user-defined functions](../statements.md#func-statements).
+
+A function-style syntax is also available for converting values to each of
+Zed's [primitive types](../../formats/zed.md#1-primitive-types), e.g.,
+`uint8()`, `time()`, etc. For details and examples, read about the
+[`cast` function](cast.md) and how it is [used in expressions](../expressions.md#casts).
+
+* [abs](abs.md) - absolute value of a number
+* [base64](base64.md) - encode/decode base64 strings
+* [bucket](bucket.md) - quantize a time or duration value into buckets of equal widths
+* [cast](cast.md) - coerce a value to a different type
+* [ceil](ceil.md) - ceiling of a number
+* [cidr_match](cidr_match.md) - test if IP is in a network
+* [compare](compare.md) - return an int comparing two values
+* [coalesce](coalesce.md) - return first value that is not null, a "missing" error, or a "quiet" error
+* [crop](crop.md) - remove fields from a value that are missing in a specified type
+* [error](error.md) - wrap a value as an error
+* [every](every.md) - bucket `ts` using a duration
+* [fields](fields.md) - return the flattened path names of a record
+* [fill](fill.md) - add null values for missing record fields
+* [flatten](flatten.md) - transform a record into a flattened map
+* [floor](floor.md) - floor of a number
+* [grep](grep.md) - search strings inside of values
+* [grok](grok.md) - parse a string into a structured record
+* [has](has.md) - test existence of values
+* [hex](hex.md) - encode/decode hexadecimal strings
+* [has_error](has_error.md) - test if a value has an error
+* [is](is.md) - test a value's type
+* [is_error](is_error.md) - test if a value is an error
+* [join](join.md) - concatenate array of strings with a separator
+* [kind](kind.md) - return a value's type category
+* [ksuid](ksuid.md) - encode/decode KSUID-style unique identifiers
+* [len](len.md) - the type-dependent length of a value
+* [levenshtein](levenshtein.md) - Levenshtein distance
+* [log](log.md) - natural logarithm
+* [lower](lower.md) - convert a string to lower case
+* [map](map.md) - apply a function to each element of an array or set
+* [missing](missing.md) - test for the "missing" error
+* [nameof](nameof.md) - the name of a named type
+* [nest_dotted](nest_dotted.md) - transform fields in a record with dotted names to nested records
+* [network_of](network_of.md) - the network of an IP
+* [now](now.md) - the current time
+* [order](order.md) - reorder record fields
+* [parse_uri](parse_uri.md) - parse a string URI into a structured record
+* [parse_zson](parse_zson.md) - parse ZSON text into a Zed value
+* [pow](pow.md) - exponential function of any base
+* [quiet](quiet.md) - quiet "missing" errors
+* [regexp](regexp.md) - perform a regular expression search on a string
+* [regexp_replace](regexp_replace.md) - replace regular expression matches in a string
+* [replace](replace.md) - replace one string for another
+* [round](round.md) - round a number
+* [rune_len](rune_len.md) - length of a string in Unicode code points
+* [shape](shape.md) - apply cast, fill, and order
+* [split](split.md) - slice a string into an array of strings
+* [sqrt](sqrt.md) - square root of a number
+* [trim](trim.md) - strip leading and trailing whitespace
+* [typename](typename.md) - look up and return a named type
+* [typeof](typeof.md) - the type of a value
+* [typeunder](typeunder.md) - the underlying type of a value
+* [under](under.md) - the underlying value
+* [unflatten](unflatten.md) - transform a record with dotted names to a nested record
+* [upper](upper.md) - convert a string to upper case
diff --git a/versioned_docs/version-v1.15.0/language/functions/_category_.yaml b/versioned_docs/version-v1.15.0/language/functions/_category_.yaml
new file mode 100644
index 00000000..db920ebc
--- /dev/null
+++ b/versioned_docs/version-v1.15.0/language/functions/_category_.yaml
@@ -0,0 +1,2 @@
+position: 11
+label: Functions
diff --git a/versioned_docs/version-v1.15.0/language/functions/abs.md b/versioned_docs/version-v1.15.0/language/functions/abs.md
new file mode 100644
index 00000000..beaf55dc
--- /dev/null
+++ b/versioned_docs/version-v1.15.0/language/functions/abs.md
@@ -0,0 +1,31 @@
+### Function
+
+  **abs** — absolute value of a number
+
+### Synopsis
+
+```
+abs(n: number) -> number
+```
+
+### Description
+
+The _abs_ function returns the absolute value of its argument `n`, which
+must be a numeric type.
+
+### Examples
+
+Absolute values of various numbers:
+```mdtest-command
+echo '1 -1 0 -1.0 -1(int8) 1(uint8) "foo"' | zq -z 'yield abs(this)' -
+```
+=>
+```mdtest-output
+1
+1
+0
+1.
+1(int8)
+1(uint8)
+error({message:"abs: not a number",on:"foo"})
+```
diff --git a/versioned_docs/version-v1.15.0/language/functions/base64.md b/versioned_docs/version-v1.15.0/language/functions/base64.md
new file mode 100644
index 00000000..22067bc8
--- /dev/null
+++ b/versioned_docs/version-v1.15.0/language/functions/base64.md
@@ -0,0 +1,51 @@
+### Function
+
+  **base64** — encode/decode Base64 strings
+
+### Synopsis
+
+```
+base64(b: bytes) -> string
+base64(s: string) -> bytes
+```
+
+### Description
+
+The _base64_ function encodes a Zed bytes value `b` as
+a [Base64](https://en.wikipedia.org/wiki/Base64) string,
+or decodes a Base64 string `s` into a Zed bytes value.
+
+### Examples
+
+Encode byte sequence `0x010203` into its Base64 string:
+```mdtest-command
+echo '0x010203' | zq -z 'yield base64(this)' -
+```
+=>
+```mdtest-output
+"AQID"
+```
+Decode "AQID" into byte sequence `0x010203`:
+```mdtest-command
+echo '"AQID"' | zq -z 'yield base64(this)' -
+```
+=>
+```mdtest-output
+0x010203
+```
+Encode ASCII string into Base64-encoded string:
+```mdtest-command
+echo '"hello, world"' | zq -z 'yield base64(bytes(this))' -
+```
+=>
+```mdtest-output
+"aGVsbG8sIHdvcmxk"
+```
+Decode a Base64 string and cast the decoded bytes to a string:
+```mdtest-command
+echo '"aGVsbG8gd29ybGQ="' | zq -z 'yield string(base64(this))' -
+```
+=>
+```mdtest-output
+"hello world"
+```
diff --git a/versioned_docs/version-v1.15.0/language/functions/bucket.md b/versioned_docs/version-v1.15.0/language/functions/bucket.md
new file mode 100644
index 00000000..295bf4c3
--- /dev/null
+++ b/versioned_docs/version-v1.15.0/language/functions/bucket.md
@@ -0,0 +1,29 @@
+### Function
+
+  **bucket** — quantize a time or duration value into buckets of equal time spans
+
+### Synopsis
+
+```
+bucket(val: time, span: duration|number) -> time
+bucket(val: duration, span: duration|number) -> duration
+```
+
+### Description
+
+The _bucket_ function quantizes a time or duration `val`
+(or value that can be coerced to time) into buckets that
+are equally spaced as specified by `span` where the bucket boundary
+aligns with 0.
+
+### Examples
+
+Bucket a couple of time values into hour intervals:
+```mdtest-command
+echo '2020-05-26T15:27:47Z "5/26/2020 3:27pm"' | zq -z 'yield bucket(time(this), 1h)' -
+```
+=>
+```mdtest-output
+2020-05-26T15:00:00Z
+2020-05-26T15:00:00Z
+```
diff --git a/versioned_docs/version-v1.15.0/language/functions/cast.md b/versioned_docs/version-v1.15.0/language/functions/cast.md
new file mode 100644
index 00000000..1a650521
--- /dev/null
+++ b/versioned_docs/version-v1.15.0/language/functions/cast.md
@@ -0,0 +1,87 @@
+### Function
+
+  **cast** — coerce a value to a different type
+
+### Synopsis
+
+```
+cast(val: any, t: type) -> any
+cast(val: any, name: string) -> any
+```
+
+### Description
+
+The _cast_ function performs type casts and handles both primitive types and
+complex types. If the input type `t` is a primitive type, then the result
+is equivalent to
+```
+t(val)
+```
+e.g., the result of `cast(1, <string>)` is the same as `string(1)` which is `"1"`.
+In the second form, where the `name` argument is a string, cast creates
+a new named type where the name for the type is given by `name` and its
+type is given by `typeof(val)`. This provides a convenient mechanism
+to create new named types from the input data itself without having to
+hard code the type in the Zed source text.
+
+For complex types, the cast function visits each leaf value in `val` and
+casts that value to the corresponding type in `t`.
+When a complex value has multiple levels of nesting,
+casting is applied recursively down the tree. For example, cast is recursively
+applied to each element in an array of records and recursively applied to each record.
+
+If `val` is a record (or if any of its nested values is a record):
+* absent fields are ignored and omitted from the result,
+* extra input fields are passed through unmodified to the result, and
+* fields are matched by name and are order independent and the _input_ order is retained.
+
+In other words, `cast` does not rearrange the order of fields in the input
+to match the output type's order but rather just modifies the leaf values.
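+
+For instance (a sketch in the spirit of the tested examples below), casting a
+record whose fields arrive in a different order than in the target type leaves
+the input order intact while still casting the leaves:
+```
+echo '{b:2,a:1}' | zq -z 'cast(this, <{a:string,b:string}>)' -
+```
+This would produce `{b:"2",a:"1"}`, i.e., the leaf values are converted to
+strings but the original field order is preserved.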
+
+If a cast fails, an error is returned when casting to primitive types
+and the input value is returned when casting to complex types.
+
+### Examples
+
+_Cast primitives to type `ip`_
+```mdtest-command
+echo '"10.0.0.1" 1 "foo"' | zq -z 'cast(this, <ip>)' -
+```
+produces
+```mdtest-output
+10.0.0.1
+error({message:"cannot cast to ip",on:1})
+error({message:"cannot cast to ip",on:"foo"})
+```
+
+_Cast a record to a different record type_
+```mdtest-command
+echo '{a:1,b:2}{a:3}{b:4}' | zq -z 'cast(this, <{b:string}>)' -
+```
+produces
+```mdtest-output
+{a:1,b:"2"}
+{a:3}
+{b:"4"}
+```
+
+_Create a named type and cast a value to the new type_
+```mdtest-command
+echo '{a:1,b:2}{a:3,b:4}' | zq -z 'cast(this, "foo")' -
+```
+produces
+```mdtest-output
+{a:1,b:2}(=foo)
+{a:3,b:4}(=foo)
+```
+
+_Name data based on its properties_
+```mdtest-command
+echo '{x:1,y:2}{r:3}{x:4,y:5}' | zq -z 'switch ( case has(x) => cast(this, "point") default => cast(this, "radius") ) | sort this' -
+```
+produces
+```mdtest-output
+{r:3}(=radius)
+{x:1,y:2}(=point)
+{x:4,y:5}(=point)
+```
diff --git a/versioned_docs/version-v1.15.0/language/functions/ceil.md b/versioned_docs/version-v1.15.0/language/functions/ceil.md
new file mode 100644
index 00000000..068953cc
--- /dev/null
+++ b/versioned_docs/version-v1.15.0/language/functions/ceil.md
@@ -0,0 +1,28 @@
+### Function
+
+  **ceil** — ceiling of a number
+
+### Synopsis
+
+```
+ceil(n: number) -> number
+```
+
+### Description
+
+The _ceil_ function returns the smallest integer greater than or equal to its argument `n`,
+which must be a numeric type. The return type retains the type of the argument.
+
+### Examples
+
+The ceiling of various numbers:
+```mdtest-command
+echo '1.5 -1.5 1(uint8) 1.5(float32)' | zq -z 'yield ceil(this)' -
+```
+=>
+```mdtest-output
+2.
+-1.
+1(uint8)
+2.(float32)
+```
diff --git a/versioned_docs/version-v1.15.0/language/functions/cidr_match.md b/versioned_docs/version-v1.15.0/language/functions/cidr_match.md
new file mode 100644
index 00000000..dcbcb71b
--- /dev/null
+++ b/versioned_docs/version-v1.15.0/language/functions/cidr_match.md
@@ -0,0 +1,50 @@
+### Function
+
+  **cidr_match** — test if IP is in a network
+
+### Synopsis
+
+```
+cidr_match(network: net, val: any) -> bool
+```
+
+### Description
+
+The _cidr_match_ function returns true if `val` contains an IP address that
+falls within the network given by `network`. When `val` is a complex type, the
+function traverses its nested structure to find any `ip` values.
+If `network` is not type `net`, then an error is returned.
+
+### Examples
+
+Test whether values are IP addresses in a network:
+```mdtest-command
+echo '10.1.2.129 11.1.2.129 10 "foo"' | zq -z 'yield cidr_match(10.0.0.0/8, this)' -
+```
+=>
+```mdtest-output
+true
+false
+false
+false
+```
+It also works for IPs in complex values:
+
+```mdtest-command
+echo '[10.1.2.129,11.1.2.129] {a:10.0.0.1} {a:11.0.0.1}' | zq -z 'yield cidr_match(10.0.0.0/8, this)' -
+```
+=>
+```mdtest-output
+true
+true
+false
+```
+
+The first argument must be a network:
+```mdtest-command
+echo '10.0.0.1' | zq -z 'yield cidr_match([1,2,3], this)' -
+```
+=>
+```mdtest-output
+error({message:"cidr_match: not a net",on:[1,2,3]})
+```
diff --git a/versioned_docs/version-v1.15.0/language/functions/coalesce.md b/versioned_docs/version-v1.15.0/language/functions/coalesce.md
new file mode 100644
index 00000000..6b80289b
--- /dev/null
+++ b/versioned_docs/version-v1.15.0/language/functions/coalesce.md
@@ -0,0 +1,33 @@
+### Function
+
+  **coalesce** — return first value that is not null, a "missing" error, or a "quiet" error
+
+### Synopsis
+
+```
+coalesce(val: any [, ... val: any]) -> any
+```
+
+### Description
+
+The _coalesce_ function returns the first of its arguments that is not null,
+`error("missing")`, or `error("quiet")`. It returns null if all its arguments
+are null, `error("missing")`, or `error("quiet")`.
+
+### Examples
+
+```mdtest-command
+zq -z 'yield coalesce(null, error("missing"), error("quiet"), 1)'
+```
+=>
+```mdtest-output
+1
+```
+
+```mdtest-command
+zq -z 'yield coalesce(null, error("missing"), error("quiet"))'
+```
+=>
+```mdtest-output
+null
+```
diff --git a/versioned_docs/version-v1.15.0/language/functions/compare.md b/versioned_docs/version-v1.15.0/language/functions/compare.md
new file mode 100644
index 00000000..9a920249
--- /dev/null
+++ b/versioned_docs/version-v1.15.0/language/functions/compare.md
@@ -0,0 +1,29 @@
+### Function
+
+  **compare** — return an integer comparing two values
+
+### Synopsis
+
+```
+compare(a: any, b: any [, nullsMax: bool]) -> int64
+```
+
+### Description
+
+The _compare_ function returns an integer comparing two values. The result will
+be 0 if a is equal to b,
+1 if a is greater than b, and -1 if a is less than b.
+_compare_ differs from `<`, `>`, `<=`, `>=`, `==`, and `!=` in that it will
+work for any type (e.g., `compare(1, "1")`).
+
+`nullsMax` is an optional value (true by default) that determines whether `null`
+is treated as the minimum or maximum value.
+
+### Examples
+
+```mdtest-command
+echo '{a: 2, b: "1"}' | zq -z 'yield compare(a, b)' -
+```
+=>
+```mdtest-output
+-1
+```
diff --git a/versioned_docs/version-v1.15.0/language/functions/crop.md b/versioned_docs/version-v1.15.0/language/functions/crop.md
new file mode 100644
index 00000000..d13f3a34
--- /dev/null
+++ b/versioned_docs/version-v1.15.0/language/functions/crop.md
@@ -0,0 +1,54 @@
+### Function
+
+  **crop** — remove fields from input value that are missing in a specified type
+
+### Synopsis
+
+```
+crop(val: any, t: type) -> any
+```
+
+### Description
+
+The _crop_ function operates on record values (or records within a nested value)
+and returns a result such that any fields that are present in `val` but not in
+record type `t` are removed.
+Cropping is useful when you want records to "fit" a schema tightly.
+
+If `val` is a record (or if any of its nested values is a record):
+* absent fields are ignored and omitted from the result,
+* fields are matched by name and are order independent and the _input_ order is retained, and
+* leaf types are ignored, i.e., no casting occurs.
+
+If `val` is not a record, it is returned unmodified.
+
+### Examples
+
+_Crop a record_
+```mdtest-command
+echo '{a:1,b:2}' | zq -z 'crop(this, <{a:int64}>)' -
+```
+produces
+```mdtest-output
+{a:1}
+```
+
+_Crop an array of records_
+```mdtest-command
+echo '[{a:1,b:2},{a:3,b:4}]' | zq -z 'crop(this, <[{a:int64}]>)' -
+```
+produces
+```mdtest-output
+[{a:1},{a:3}]
+```
+
+_Cropped primitives are returned unmodified_
+```mdtest-command
+echo '10.0.0.1 1 "foo"' | zq -z 'crop(this, <{a:int64}>)' -
+```
+produces
+```mdtest-output
+10.0.0.1
+1
+"foo"
+```
diff --git a/versioned_docs/version-v1.15.0/language/functions/error.md b/versioned_docs/version-v1.15.0/language/functions/error.md
new file mode 100644
index 00000000..988eb7f4
--- /dev/null
+++ b/versioned_docs/version-v1.15.0/language/functions/error.md
@@ -0,0 +1,66 @@
+### Function
+
+  **error** — wrap a Zed value as an error
+
+### Synopsis
+
+```
+error(val: any) -> error
+```
+
+### Description
+
+The _error_ function returns an error version of a Zed value.
+It wraps any Zed value `val` to turn it into an error type providing
+a means to create structured and stacked errors.
+
+### Examples
+
+Wrap a record as a structured error:
+```mdtest-command
+echo '{foo:"foo"}' | zq -z 'yield error({message:"bad value", value:this})' -
+```
+=>
+```mdtest-output
+error({message:"bad value",value:{foo:"foo"}})
+```
+
+Wrap any value as an error:
+```mdtest-command
+echo '1 "foo" [1,2,3]' | zq -z 'yield error(this)' -
+```
+=>
+```mdtest-output
+error(1)
+error("foo")
+error([1,2,3])
+```
+
+Test if a value is an error and show its type "kind":
+```mdtest-command
+echo 'error("exception") "exception"' | zq -Z 'yield {this,err:is_error(this),kind:kind(this)}' -
+```
+=>
+```mdtest-output
+{
+    this: error("exception"),
+    err: true,
+    kind: "error"
+}
+{
+    this: "exception",
+    err: false,
+    kind: "primitive"
+}
+```
+
+Comparing a missing error to another missing error results in a missing error,
+so that field comparisons of two missing fields do not succeed:
+```mdtest-command
+echo '{}' | zq -z 'badfield:=x | yield badfield==error("missing")' -
+```
+=>
+```mdtest-output
+error("missing")
+```
diff --git a/versioned_docs/version-v1.15.0/language/functions/every.md b/versioned_docs/version-v1.15.0/language/functions/every.md
new file mode 100644
index 00000000..37f6af05
--- /dev/null
+++ b/versioned_docs/version-v1.15.0/language/functions/every.md
@@ -0,0 +1,37 @@
+### Function
+
+  **every** — bucket `ts` using a duration
+
+### Synopsis
+
+```
+every(d: duration) -> time
+```
+
+### Description
+
+The _every_ function is a shortcut for `bucket(ts, d)`.
+This provides a convenient binning function for aggregations
+when analyzing time-series data like logs that have a `ts` field.
+
+### Examples
+
+Operate on a sequence of times:
+```mdtest-command
+echo '{ts:2021-02-01T12:00:01Z}' | zq -z 'yield {ts,val:0},{ts:ts+1s},{ts:ts+2h2s} | yield every(1h) | sort' -
+```
+=>
+```mdtest-output
+2021-02-01T12:00:00Z
+2021-02-01T12:00:00Z
+2021-02-01T14:00:00Z
+```
+Use as a group-by key:
+```mdtest-command
+echo '{ts:2021-02-01T12:00:01Z}' | zq -z 'yield {ts,val:1},{ts:ts+1s,val:2},{ts:ts+2h2s,val:5} | sum(val) by every(1h) | sort' -
+```
+=>
+```mdtest-output
+{ts:2021-02-01T12:00:00Z,sum:3}
+{ts:2021-02-01T14:00:00Z,sum:5}
+```
diff --git a/versioned_docs/version-v1.15.0/language/functions/fields.md b/versioned_docs/version-v1.15.0/language/functions/fields.md
new file mode 100644
index 00000000..7811887a
--- /dev/null
+++ b/versioned_docs/version-v1.15.0/language/functions/fields.md
@@ -0,0 +1,48 @@
+### Function
+
+  **fields** — return the flattened path names of a record
+
+### Synopsis
+
+```
+fields(r: record) -> [[string]]
+```
+
+### Description
+
+The _fields_ function returns an array of string arrays of all the field names in record `r`.
+A field's path name is represented by an array of strings since the dot
+separator is an unreliable indicator of field boundaries as `.` itself
+can appear in a field name.
+
+`error("missing")` is returned if `r` is not a record.
+
+### Examples
+
+Extract the fields of a nested record:
+```mdtest-command
+echo '{a:1,b:2,c:{d:3,e:4}}' | zq -z 'yield fields(this)' -
+```
+=>
+```mdtest-output
+[["a"],["b"],["c","d"],["c","e"]]
+```
+Easily convert to dotted names if you prefer:
+```mdtest-command
+echo '{a:1,b:2,c:{d:3,e:4}}' | zq -z 'over fields(this) | yield join(this,".")' -
+```
+=>
+```mdtest-output
+"a"
+"b"
+"c.d"
+"c.e"
+```
+A record is expected:
+```mdtest-command
+echo 1 | zq -z 'yield {f:fields(this)}' -
+```
+=>
+```mdtest-output
+{f:error("missing")}
+```
diff --git a/versioned_docs/version-v1.15.0/language/functions/fill.md b/versioned_docs/version-v1.15.0/language/functions/fill.md
new file mode 100644
index 00000000..e32923d7
--- /dev/null
+++ b/versioned_docs/version-v1.15.0/language/functions/fill.md
@@ -0,0 +1,50 @@
+### Function
+
+  **fill** — add null values for missing record fields
+
+### Synopsis
+
+```
+fill(val: any, t: type) -> any
+```
+
+### Description
+
+The _fill_ function adds to the input record `val` any fields that are
+present in the output type `t` but not in the input.
+
+Filled fields are added with a `null` value. Filling is useful when
+you want to be sure that all fields in a schema are present in a record.
+
+If `val` is not a record, it is returned unmodified.
+
+### Examples
+
+_Fill a record_
+```mdtest-command
+echo '{a:1}' | zq -z 'fill(this, <{a:int64,b:string}>)' -
+```
+produces
+```mdtest-output
+{a:1,b:null(string)}
+```
+
+_Fill an array of records_
+```mdtest-command
+echo '[{a:1},{a:2}]' | zq -z 'fill(this, <[{a:int64,b:int64}]>)' -
+```
+produces
+```mdtest-output
+[{a:1,b:null(int64)},{a:2,b:null(int64)}]
+```
+
+_Non-records are returned unmodified_
+```mdtest-command
+echo '10.0.0.1 1 "foo"' | zq -z 'fill(this, <{a:int64,b:int64}>)' -
+```
+produces
+```mdtest-output
+10.0.0.1
+1
+"foo"
+```
diff --git a/versioned_docs/version-v1.15.0/language/functions/flatten.md b/versioned_docs/version-v1.15.0/language/functions/flatten.md
new file mode 100644
index 00000000..b61101bb
--- /dev/null
+++ b/versioned_docs/version-v1.15.0/language/functions/flatten.md
@@ -0,0 +1,26 @@
+### Function
+
+  **flatten** — transform a record into a flattened array
+
+### Synopsis
+
+```
+flatten(val: record) -> [{key:[string],value:<any>}]
+```
+
+### Description
+The _flatten_ function returns an array of records `[{key:[string],value:<any>}]`
+where `key` is a string array of the path of each record field of `val` and
+`value` is the corresponding value of that field.
+If there are multiple types for the leaf values in `val`, then the array value
+inner type is a union of the record types present.
+
+### Examples
+
+```mdtest-command
+echo '{a:1,b:{c:"foo"}}' | zq -z 'yield flatten(this)' -
+```
+=>
+```mdtest-output
+[{key:["a"],value:1},{key:["b","c"],value:"foo"}]
+```
diff --git a/versioned_docs/version-v1.15.0/language/functions/floor.md b/versioned_docs/version-v1.15.0/language/functions/floor.md
new file mode 100644
index 00000000..519cb811
--- /dev/null
+++ b/versioned_docs/version-v1.15.0/language/functions/floor.md
@@ -0,0 +1,28 @@
+### Function
+
+  **floor** — floor of a number
+
+### Synopsis
+
+```
+floor(n: number) -> number
+```
+
+### Description
+
+The _floor_ function returns the greatest integer less than or equal to its argument `n`,
+which must be a numeric type. The return type retains the type of the argument.
+
+### Examples
+
+The floor of various numbers:
+```mdtest-command
+echo '1.5 -1.5 1(uint8) 1.5(float32)' | zq -z 'yield floor(this)' -
+```
+=>
+```mdtest-output
+1.
+-2.
+1(uint8)
+1.(float32)
+```
diff --git a/versioned_docs/version-v1.15.0/language/functions/grep.md b/versioned_docs/version-v1.15.0/language/functions/grep.md
new file mode 100644
index 00000000..b34f8e12
--- /dev/null
+++ b/versioned_docs/version-v1.15.0/language/functions/grep.md
@@ -0,0 +1,73 @@
+### Function
+
+  **grep** — search strings inside of values
+
+### Synopsis
+
+```
+grep(<pattern> [, e: any]) -> bool
+```
+
+### Description
+
+The _grep_ function searches all of the strings in its input value `e`
+(or `this` if `e` is not given)
+using the `<pattern>` argument, which can be a
+[regular expression](../search-expressions.md#regular-expressions),
+[glob pattern](../search-expressions.md#globs), or string.
+If the pattern matches for any string, then the result is `true`. Otherwise, it is `false`.
+
+> Note that string matches are case insensitive while regular expression
+> and glob matches are case sensitive. In a forthcoming release, case sensitivity
+> will be expressible for all three pattern types.
+
+The entire input value is traversed:
+* for records, each field name is traversed and each field value is traversed or descended
+if a complex type,
+* for arrays and sets, each element is traversed or descended if a complex type, and
+* for maps, each key and value is traversed or descended if a complex type.
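+
+So, for example, a string buried inside an array element is still found
+(a sketch mirroring the tested examples below):
+```
+echo '{a:[1,"baz"]}' | zq -z 'grep("baz")' -
+```
+This would output the input record, since the match applies to the element
+`"baz"` nested inside the array.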
+ +### Examples + +_Reach into nested records_ +```mdtest-command +echo '{foo:10}{bar:{s:"baz"}}' | zq -z 'grep("baz")' - +``` +=> +```mdtest-output +{bar:{s:"baz"}} +``` +_It only matches string fields_ +```mdtest-command +echo '{foo:10}{bar:{s:"baz"}}' | zq -z 'grep("10")' - +``` +=> +```mdtest-output +``` +_Match a field name_ +```mdtest-command +echo '{foo:10}{bar:{s:"baz"}}' | zq -z 'grep("foo")' - +``` +=> +```mdtest-output +{foo:10} +``` +_Regular expression_ +```mdtest-command +echo '{foo:10}{bar:{s:"baz"}}' | zq -z 'grep(/foo|baz/)' - +``` +=> +```mdtest-output +{foo:10} +{bar:{s:"baz"}} +``` +_Glob with a second argument_ + +```mdtest-command +echo '{s:"bar"}{s:"foo"}{s:"baz"}{t:"baz"}' | zq -z 'grep(b*, s)' - +``` +=> +```mdtest-output +{s:"bar"} +{s:"baz"} +``` diff --git a/versioned_docs/version-v1.15.0/language/functions/grok.md b/versioned_docs/version-v1.15.0/language/functions/grok.md new file mode 100644 index 00000000..88552737 --- /dev/null +++ b/versioned_docs/version-v1.15.0/language/functions/grok.md @@ -0,0 +1,44 @@ +### Function + +  **grok** — parse a string using a grok pattern + +### Synopsis + +``` +grok(p: string, s: string) -> any +grok(p: string, s: string, definitions: string) -> any +``` + +### Description + +The _grok_ function parses a string `s` using grok pattern `p` and returns +a record containing the parsed fields. The syntax for pattern `p` +is `%{pattern:field_name}` where _pattern_ is the name of the pattern +to match in `s` and _field_name_ is the resultant field name of the capture +value. + +When provided with three arguments, `definitions` is a string +of named patterns in the format `PATTERN_NAME PATTERN` each separated by newlines. +The named patterns can then be referenced in argument `p`. + +#### Included Patterns + +The _grok_ function by default includes a set of builtin named patterns +that can be referenced in any pattern. The included named patterns can be seen +[here](https://raw.githubusercontent.com/brimdata/zed/main/pkg/grok/base.go). + +### Examples + +Parsing a simple log line using the builtin named patterns: +```mdtest-command +echo '"2020-09-16T04:20:42.45+01:00 DEBUG This is a sample debug log message"' | + zq -Z 'yield grok("%{TIMESTAMP_ISO8601:timestamp} %{LOGLEVEL:level} %{GREEDYDATA:message}", this)' - +``` +=> +```mdtest-output +{ + timestamp: "2020-09-16T04:20:42.45+01:00", + level: "DEBUG", + message: "This is a sample debug log message" +} +``` diff --git a/versioned_docs/version-v1.15.0/language/functions/has.md b/versioned_docs/version-v1.15.0/language/functions/has.md new file mode 100644 index 00000000..d1ddefc0 --- /dev/null +++ b/versioned_docs/version-v1.15.0/language/functions/has.md @@ -0,0 +1,49 @@ +### Function + +  **has** — test existence of values + +### Synopsis + +``` +has(val: any [, ... val: any]) -> bool +``` + +### Description + +The _has_ function returns false if any of its arguments are `error("missing")` +and otherwise returns true. +`has(e)` is a shortcut for [`!missing(e)`](missing.md). + +This function is most often used to test the existence of certain fields in an +expected record, e.g., `has(a,b)` is true when `this` is a record and has +the fields `a` and `b`, provided their values are not `error("missing")`. + +It's also useful in shaping when applying conditional logic based on the +presence of certain fields: +``` +switch ( + case has(a) => ... + case has(b) => ... + default => ... 
+)
+```
+
+### Examples
+
+```mdtest-command
+echo '{foo:10}' | zq -z 'yield {yes:has(foo),no:has(bar)}' -
+echo '{foo:[1,2,3]}' | zq -z 'yield {yes: has(foo[0]),no:has(foo[3])}' -
+echo '{foo:{bar:"value"}}' | zq -z 'yield {yes:has(foo.bar),no:has(foo.baz)}' -
+echo '{foo:10}' | zq -z 'yield {yes:has(foo+1),no:has(bar+1)}' -
+echo 1 | zq -z 'yield has(bar)' -
+echo '{x:error("missing")}' | zq -z 'yield has(x)' -
+```
+=>
+```mdtest-output
+{yes:true,no:false}
+{yes:true,no:false}
+{yes:true,no:false}
+{yes:true,no:false}
+false
+false
+```
diff --git a/versioned_docs/version-v1.15.0/language/functions/has_error.md b/versioned_docs/version-v1.15.0/language/functions/has_error.md
new file mode 100644
index 00000000..53fa5226
--- /dev/null
+++ b/versioned_docs/version-v1.15.0/language/functions/has_error.md
@@ -0,0 +1,27 @@
+### Function
+
+  **has_error** — test if a value is or contains an error
+
+### Synopsis
+
+```
+has_error(val: any) -> bool
+```
+
+### Description
+
+The _has_error_ function returns true if its argument is or contains an error.
+_has_error_ is different from _is_error_ in that _has_error_ will recurse
+into the value's leaves to determine if there is an error in the value.
+
+### Examples
+
+```mdtest-command
+echo '{a:{b:"foo"}}' | zq -z 'yield has_error(this)' -
+echo '{a:{b:"foo"}}' | zq -z 'a.x := a.y + 1 | yield has_error(this)' -
+```
+=>
+```mdtest-output
+false
+true
+```
diff --git a/versioned_docs/version-v1.15.0/language/functions/hex.md b/versioned_docs/version-v1.15.0/language/functions/hex.md
new file mode 100644
index 00000000..a7b8921a
--- /dev/null
+++ b/versioned_docs/version-v1.15.0/language/functions/hex.md
@@ -0,0 +1,50 @@
+### Function
+
+  **hex** — encode/decode hexadecimal strings
+
+### Synopsis
+
+```
+hex(b: bytes) -> string
+hex(s: string) -> bytes
+```
+
+### Description
+
+The _hex_ function encodes a Zed bytes value `b` as
+a hexadecimal string or decodes a hexadecimal string `s` into a Zed bytes value.
+
+### Examples
+
+Encode a simple bytes sequence as a hexadecimal string:
+```mdtest-command
+echo '0x0102ff' | zq -z 'yield hex(this)' -
+```
+=>
+```mdtest-output
+"0102ff"
+```
+Decode a simple hex string:
+```mdtest-command
+echo '"0102ff"' | zq -z 'yield hex(this)' -
+```
+=>
+```mdtest-output
+0x0102ff
+```
+Encode the bytes of an ASCII string as a hexadecimal string:
+```mdtest-command
+echo '"hello, world"' | zq -z 'yield hex(bytes(this))' -
+```
+=>
+```mdtest-output
+"68656c6c6f2c20776f726c64"
+```
+Decode a hex string representing ASCII into its string form:
+```mdtest-command
+echo '"68656c6c6f20776f726c64"' | zq -z 'yield string(hex(this))' -
+```
+=>
+```mdtest-output
+"hello world"
+```
diff --git a/versioned_docs/version-v1.15.0/language/functions/is.md b/versioned_docs/version-v1.15.0/language/functions/is.md
new file mode 100644
index 00000000..ae76a8aa
--- /dev/null
+++ b/versioned_docs/version-v1.15.0/language/functions/is.md
@@ -0,0 +1,54 @@
+### Function
+
+  **is** — test a value's type
+
+### Synopsis
+```
+is(t: type) -> bool
+is(val: any, t: type) -> bool
+```
+
+### Description
+
+The _is_ function returns true if the argument `val` is of type `t`. If `val`
+is omitted, it defaults to `this`. The _is_ function is shorthand for `typeof(val)==t`.
+
+### Examples
+
+Test simple types:
+```mdtest-command
+echo '1.' | zq -z 'yield {yes:is(<float64>),no:is(<int64>)}' -
+```
+=>
+```mdtest-output
+{yes:true,no:false}
+```
+
+Test for a given input's record type or "shape":
+```mdtest-command
+echo '{s:"hello"}' | zq -z 'yield is(<{s:string}>)' -
+```
+=>
+```mdtest-output
+true
+```
+If you test a named type with its underlying type, the types are different,
+but if you use the type name or the typeunder function, there is a match:
+```mdtest-command
+echo '{s:"hello"}(=foo)' | zq -z 'yield is(<{s:string}>)' -
+echo '{s:"hello"}(=foo)' | zq -z 'yield is(<foo>)' -
+```
+=>
+```mdtest-output
+false
+true
+```
+
+To test the underlying type, just use `==`:
+```mdtest-command
+echo '{s:"hello"}(=foo)' | zq -z 'yield typeunder(this)==<{s:string}>' -
+```
+=>
+```mdtest-output
+true
+```
diff --git a/versioned_docs/version-v1.15.0/language/functions/is_error.md b/versioned_docs/version-v1.15.0/language/functions/is_error.md
new file mode 100644
index 00000000..3d2801b2
--- /dev/null
+++ b/versioned_docs/version-v1.15.0/language/functions/is_error.md
@@ -0,0 +1,44 @@
+### Function
+
+  **is_error** — test if a value is an error
+
+### Synopsis
+
+```
+is_error(val: any) -> bool
+```
+
+### Description
+
+The _is_error_ function returns true if its argument's type is error.
+`is_error(v)` is a shortcut for `kind(v)=="error"`.
+
+### Examples
+
+A simple value is not an error:
+```mdtest-command
+echo 1 | zq -z 'yield is_error(this)' -
+```
+=>
+```mdtest-output
+false
+```
+
+An error value is an error:
+```mdtest-command
+echo "error(1)" | zq -z 'yield is_error(this)' -
+```
+=>
+```mdtest-output
+true
+```
+
+Convert an error string into a record with an indicator and a message:
+```mdtest-command
+echo '"not an error" error("an error")' | zq -z 'yield {err:is_error(this),message:under(this)}' -
+```
+=>
+```mdtest-output
+{err:false,message:"not an error"}
+{err:true,message:"an error"}
+```
diff --git a/versioned_docs/version-v1.15.0/language/functions/join.md b/versioned_docs/version-v1.15.0/language/functions/join.md
new file mode 100644
index 00000000..78f44501
--- /dev/null
+++ b/versioned_docs/version-v1.15.0/language/functions/join.md
@@ -0,0 +1,35 @@
+### Function
+
+  **join** — concatenate array of strings with a separator
+
+### Synopsis
+
+```
+join(val: [string], sep: string) -> string
+```
+
+### Description
+
+The _join_ function concatenates the elements of string array `val` to create a single
+string. The string `sep` is placed between each value in the resulting string.
+
+#### Example:
+
+Join a symbol array of strings:
+```mdtest-command
+echo '["a","b","c"]' | zq -z 'yield join(this, ",")' -
+```
+=>
+```mdtest-output
+"a,b,c"
+```
+
+Join non-string arrays by first casting:
+```mdtest-command
+echo '[1,2,3] [10.0.0.1,10.0.0.2]' | zq -z 'yield join(cast(this, <[string]>), "...")' -
+```
+=>
+```mdtest-output
+"1...2...3"
+"10.0.0.1...10.0.0.2"
+```
diff --git a/versioned_docs/version-v1.15.0/language/functions/kind.md b/versioned_docs/version-v1.15.0/language/functions/kind.md
new file mode 100644
index 00000000..50f7ff27
--- /dev/null
+++ b/versioned_docs/version-v1.15.0/language/functions/kind.md
@@ -0,0 +1,60 @@
+### Function
+
+  **kind** — return a value's type category
+
+### Synopsis
+
+```
+kind(val: any) -> string
+```
+
+### Description
+
+The _kind_ function returns the category of the type of `val` as a string,
+e.g., "record", "set", "primitive", etc. If `val` is a type value,
+then the type category of the referenced type is returned.
+
+#### Example:
+
+A primitive value's kind is "primitive":
+```mdtest-command
+echo '1 "a" 10.0.0.1' | zq -z 'yield kind(this)' -
+```
+=>
+```mdtest-output
+"primitive"
+"primitive"
+"primitive"
+```
+
+A complex value's kind is its complex type category. Try it on
+these empty values of various complex types:
+```mdtest-command
+echo '{} [] |[]| |{}| 1((int64,string))' | zq -z 'yield kind(this)' -
+```
+=>
+```mdtest-output
+"record"
+"array"
+"set"
+"map"
+"union"
+```
+
+A Zed error has kind "error":
+```mdtest-command
+echo null | zq -z 'yield kind(1/0)' -
+```
+=>
+```mdtest-output
+"error"
+```
+
+A Zed type's kind is the kind of the type:
+```mdtest-command
+echo '<{s:string}>' | zq -z 'yield kind(this)' -
+```
+=>
+```mdtest-output
+"record"
+```
diff --git a/versioned_docs/version-v1.15.0/language/functions/ksuid.md b/versioned_docs/version-v1.15.0/language/functions/ksuid.md
new file mode 100644
index 00000000..d225c6a0
--- /dev/null
+++ b/versioned_docs/version-v1.15.0/language/functions/ksuid.md
@@ -0,0 +1,30 @@
+### Function
+
+  **ksuid** — encode/decode KSUID-style unique identifiers
+
+### Synopsis
+
+```
+ksuid() -> bytes
+ksuid(b: bytes) -> string
+ksuid(s: string) -> bytes
+```
+
+### Description
+
+The _ksuid_ function either encodes a [KSUID](https://github.com/segmentio/ksuid)
+(a byte sequence of length 20) `b` into a Base62 string or decodes
+a KSUID Base62 string into a 20-byte Zed bytes value.
+
+If _ksuid_ is called with no arguments, a new KSUID is generated and
+returned as a bytes value.
+
+#### Example:
+
+```mdtest-command
+echo '{id:0x0dfc90519b60f362e84a3fdddd9b9e63e1fb90d1}' | zq -z 'id := ksuid(id)' -
+```
+=>
+```mdtest-output
+{id:"1zjJzTWWCJNVrGwqB8kZwhTM2fR"}
+```
diff --git a/versioned_docs/version-v1.15.0/language/functions/len.md b/versioned_docs/version-v1.15.0/language/functions/len.md
new file mode 100644
index 00000000..ce7e320c
--- /dev/null
+++ b/versioned_docs/version-v1.15.0/language/functions/len.md
@@ -0,0 +1,43 @@
+### Function
+
+  **len** — the type-dependent length of a value
+
+### Synopsis
+
+```
+len(v: record|array|set|map|type|bytes|string|ip|net|error) -> int64
+```
+
+### Description
+
+The _len_ function returns the length of its argument `v`.
+The semantics of this length depend on the value's type.
+
+Supported types include:
+- record
+- array
+- set
+- map
+- error
+- bytes
+- string
+- ip
+- net
+- type
+
+#### Example:
+
+Take the length of various types:
+
+```mdtest-command
+echo '[1,2,3] |["hello"]| {a:1,b:2} "hello" 10.0.0.1 1' | zq -z 'yield {this,len:len(this)}' -
+```
+=>
+```mdtest-output
+{this:[1,2,3],len:3}
+{this:|["hello"]|,len:1}
+{this:{a:1,b:2},len:2}
+{this:"hello",len:5}
+{this:10.0.0.1,len:4}
+{this:1,len:error({message:"len: bad type",on:1})}
+```
diff --git a/versioned_docs/version-v1.15.0/language/functions/levenshtein.md b/versioned_docs/version-v1.15.0/language/functions/levenshtein.md
new file mode 100644
index 00000000..0786126b
--- /dev/null
+++ b/versioned_docs/version-v1.15.0/language/functions/levenshtein.md
@@ -0,0 +1,25 @@
+### Function
+
+  **levenshtein** — Levenshtein distance
+
+### Synopsis
+
+```
+levenshtein(a: string, b: string) -> int64
+```
+
+### Description
+
+The _levenshtein_ function computes the [Levenshtein
+distance](https://en.wikipedia.org/wiki/Levenshtein_distance) between strings
+`a` and `b`.
+ +### Examples + +```mdtest-command +echo '{a:"kitten",b:"sitting"}' | zq -z 'yield levenshtein(a, b)' - +``` +=> +```mdtest-output +3 +``` diff --git a/versioned_docs/version-v1.15.0/language/functions/log.md b/versioned_docs/version-v1.15.0/language/functions/log.md new file mode 100644 index 00000000..d3965aa1 --- /dev/null +++ b/versioned_docs/version-v1.15.0/language/functions/log.md @@ -0,0 +1,42 @@ +### Function + +  **log** — natural logarithm + +### Synopsis + +``` +log(val: number) -> float64 +``` + +### Description + +The _log_ function returns the natural logarithm of its argument `val`, which +must be numeric. The return value is a float64 or an error. + +### Examples + +The logarithm of various numbers: +```mdtest-command +echo '4 4.0 2.718 -1' | zq -z 'yield log(this)' - +``` +=> +```mdtest-output +1.3862943611198906 +1.3862943611198906 +0.999896315728952 +error({message:"log: illegal argument",on:-1}) +``` + +The largest power of 10 smaller than the input: +```mdtest-command +echo '9 10 20 1000 1100 30000' | zq -z 'yield int64(log(this)/log(10))' - +``` +=> +```mdtest-output +0 +1 +1 +2 +3 +4 +``` diff --git a/versioned_docs/version-v1.15.0/language/functions/lower.md b/versioned_docs/version-v1.15.0/language/functions/lower.md new file mode 100644 index 00000000..14036b1e --- /dev/null +++ b/versioned_docs/version-v1.15.0/language/functions/lower.md @@ -0,0 +1,24 @@ +### Function + +  **lower** — convert a string to lower case + +### Synopsis + +``` +lower(s: string) -> string +``` + +### Description + +The _lower_ function converts all upper case Unicode characters in `s` +to lower case and returns the result. + +### Examples + +```mdtest-command +echo '"Zed"' | zq -z 'yield lower(this)' - +``` +=> +```mdtest-output +"zed" +``` diff --git a/versioned_docs/version-v1.15.0/language/functions/map.md b/versioned_docs/version-v1.15.0/language/functions/map.md new file mode 100644 index 00000000..39b2f5b8 --- /dev/null +++ b/versioned_docs/version-v1.15.0/language/functions/map.md @@ -0,0 +1,40 @@ +### Function + +  **map** — apply a function to each element of an array or set + +### Synopsis + +``` +map(v: array|set, f: function) -> array|set +``` + +### Description + +The _map_ function applies function `f` to every element in array or set `v` and +returns an array or set of the results. Function `f` must be a function that takes +only one argument. `f` may be a [user-defined function](../statements.md#func-statements). + +### Examples + +Upper case each element of an array: + +```mdtest-command +echo '["foo","bar","baz"]' | zq -z 'yield map(this, upper)' - +``` +=> +```mdtest-output +["FOO","BAR","BAZ"] +``` + +Using a user-defined function to convert an epoch float to a time: + +```mdtest-command +echo '[1697151533.41415,1697151540.716529]' | zq -z ' + func floatToTime(x): ( cast(x*1000000000,