Multi-hop declarative data pipelines
Hoptimator is an SQL-based control plane for complex data pipelines.
Hoptimator turns high-level SQL subscriptions into multi-hop data pipelines. Pipelines may involve an auto-generated Flink job (or similar) and any arbitrary resources required for the job to run.
Hoptimator has a pluggable adapter framework, which lets you wire up arbitrary data sources. Adapters loosely correspond to connectors in the underlying compute engine (e.g. Flink Connectors), but they may include custom control plane logic. For example, an adapter may create a cache or a CDC stream as part of a pipeline. This enables a single pipeline to span multiple "hops" across different systems (as opposed to, say, a single Flink job).
Hoptimator's pipelines tend to have the following general shape:
                                _________
topic1 ----------------------> |         |
table2 --> CDC ---> topic2 --> | SQL job | --> topic4
table3 --> rETL --> topic3 --> |_________|
The three data sources on the left correspond to three different adapters:
- topic1 can be read directly from a Flink job, so the first adapter simply configures a Flink connector.
- table2 is inefficient for bulk access, so the second adapter creates a CDC stream (topic2) and configures a Flink connector to read from that.
- table3 is in cold storage, so the third adapter creates a reverse-ETL job to re-ingest the data into Kafka.
To deploy such a pipeline, you only need to write one SQL query, called a subscription. Pipelines are constructed automatically based on subscriptions.
- docker is installed and the docker daemon is running
- kubectl is installed and a cluster is running (minikube can be used for a local cluster)
- helm for Kubernetes is installed
$ make quickstart
... wait a while ...
$ ./bin/hoptimator
> !intro
> !q
Subscriptions are SQL views that are automatically materialized by a pipeline.
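As a rough sketch, a Subscription resource pairs a name with the SQL to materialize. The field names below are illustrative guesses, not the authoritative schema; see deploy/samples/subscriptions.yaml for a real example:

```yaml
apiVersion: hoptimator.linkedin.com/v1alpha1
kind: Subscription
metadata:
  name: products
spec:
  # The SQL view to materialize as a pipeline.
  sql: SELECT * FROM RAWKAFKA."products"
  database: RAWKAFKA
```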
$ kubectl apply -f deploy/samples/subscriptions.yaml
In response, the operator will deploy a Flink job and other resources:
$ kubectl get subscriptions
$ kubectl get flinkdeployments
$ kubectl get kafkatopics
You can verify the job is running by inspecting the output:
$ ./bin/hoptimator
> !tables
> SELECT * FROM RAWKAFKA."products" LIMIT 5;
> !q
Hoptimator-operator is a Kubernetes operator that orchestrates multi-hop data pipelines based on Subscriptions (a custom resource). When a Subscription is deployed, the operator:
- creates a plan based on the Subscription SQL. The plan includes a set of resources that make up a pipeline.
- deploys each resource in the pipeline. This may involve creating Kafka topics, Flink jobs, etc.
- reports Subscription status, which depends on the status of each resource in the pipeline.
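The three steps above can be sketched as follows. The Subscription and Resource types here are hypothetical stand-ins for illustration, not Hoptimator's actual classes:

```java
import java.util.List;

public class ReconcileSketch {
  // Hypothetical stand-ins for the operator's real types.
  record Subscription(String name, String sql) {}
  record Resource(String kind, String name, boolean ready) {}

  // Step 1: plan a pipeline from the Subscription SQL.
  static List<Resource> plan(Subscription sub) {
    // A real planner derives this from the SQL; here we hard-code a typical result.
    return List.of(
        new Resource("KafkaTopic", sub.name() + "-topic", true),
        new Resource("FlinkDeployment", sub.name() + "-job", true));
  }

  // Step 2: deploy each resource in the pipeline.
  static void deploy(List<Resource> pipeline) {
    pipeline.forEach(r -> System.out.println("deploying " + r.kind() + "/" + r.name()));
  }

  // Step 3: the Subscription's status rolls up the status of every resource.
  static boolean subscriptionReady(List<Resource> pipeline) {
    return pipeline.stream().allMatch(Resource::ready);
  }

  public static void main(String[] args) {
    Subscription sub = new Subscription("products", "SELECT * FROM RAWKAFKA.\"products\"");
    List<Resource> pipeline = plan(sub);
    deploy(pipeline);
    System.out.println("ready: " + subscriptionReady(pipeline));
  }
}
```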
The operator is extensible via adapters. Among other responsibilities, adapters can implement custom control plane logic (see ControllerProvider), or they can depend on external operators. For example, the Kafka adapter actively manages Kafka topics using a custom controller, while the Flink adapter defers to flink-kubernetes-operator to manage Flink jobs.
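As a loose illustration of this split, an adapter may contribute custom controllers to the control plane, or contribute none and defer to an external operator. The interfaces below are hypothetical, not Hoptimator's actual ControllerProvider API:

```java
import java.util.List;

public class AdapterSketch {
  // Hypothetical: a controller that actively manages one kind of resource.
  interface Controller { String manages(); }

  // Hypothetical: an adapter contributes zero or more custom controllers.
  interface Adapter { List<Controller> controllers(); }

  static Adapter kafkaAdapter() {
    // Manages Kafka topics with a custom controller.
    return () -> List.of(() -> "KafkaTopic");
  }

  static Adapter flinkAdapter() {
    // Defers to flink-kubernetes-operator, so no custom controllers.
    return List::of;
  }

  public static void main(String[] args) {
    System.out.println(kafkaAdapter().controllers().get(0).manages());
    System.out.println(flinkAdapter().controllers().size());
  }
}
```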
Hoptimator includes a SQL CLI based on sqlline. This is primarily for testing and debugging purposes, but it can also be useful for running ad-hoc queries. The CLI leverages the same adapters as the operator, but it doesn't deploy anything. Instead, queries run as local, in-process Flink jobs.