Benchmark ClickHouse queries #700

tothandras · 2024-03-17T18:07:28Z

Preflight Checklist

I have searched the issue tracker for an issue that matches the one I want to file, without success.

Problem Description

The meters are currently implemented as materialized views on the events table. This has a drawback that meters have to be created first for the events to be processed:

Materialized views in ClickHouse are implemented more like insert triggers. If there’s some aggregation in the view query, it’s applied only to the batch of freshly inserted data. Any changes to existing data of source table (like update, delete, drop partition, etc.) does not change the materialized view.
We do not recommend using POPULATE, since data inserted in the table during the view creation will not be inserted in it.

Proposed Solution

Benchmark:

reading from the events table directly on query (optional, as we already know it's significantly slower)
use projections instead of materialized views

End goal

Back with data that projections can be used or not as an alternative to materialized views.

Additional Information

Seeder: https://github.com/openmeterio/openmeter/blob/main/etc/seed/seed.yaml (make seed)
Grafana K6: https://k6.io/ (load tester)

The text was updated successfully, but these errors were encountered:

GAlexIHU · 2024-03-23T12:39:47Z

I think there's an issue with writing a full system performance test for this, I don't think it can be done without an API change.

TLDR:
The GroupBy functionality couldn't be ported as it uses the column names defined at meter (view) creation, and these names along with the actual aggregations cannot be referenced if we use projections.

Explanation:
At meter creation (alongside the other aggregations) GroupBys are also defined in the DDL under the provided names
e.g.

# Sample meter to count LLM Token Usage
meters:
  - slug: tokens_total
    description: AI Token Usage
    eventType: prompt
    aggregation: SUM
    valueProperty: $.tokens
    groupBy:
      model: $.model # Group1
      type: $.type # Group2

turns to something like

CREATE MATERIALIZED VIEW openmeter.om_default_tokens_total
(

    ...

    `model` String,  -- GroupBy1
    `type` String    -- GroupBy2
)
...
AS SELECT
    ...
    JSON_VALUE(data,
 '$.model') AS model,

    JSON_VALUE(data,
 '$.type') AS type
    ...

The JSONPath descriptions ($.method, $.route in the example) are not passed during the query method. This is an issue because PROJECTIONS cannot store these references. If we want to utilise a PROJECTION we need to query the underlying table, and if the aggregations of the query match on of the defined PROJECTIONs then the query planner might decide to use it. For example, the api_requests_total meter from the above example could be rewritten as:

ALTER TABLE
  openmeter.om_events
ADD
  PROJECTION om_default_api_requests_total (
    SELECT
      -- Further pre-calculted fields can be added here to speed up execution, e.g. (empty(openmeter.om_events.validation_error) = 1)...
      subject,
      tumbleStart(
        time,
        toIntervalMinute(1)
      ) AS windowstart,
      tumbleEnd(
        time,
        toIntervalMinute(1)
      ) AS windowend,
      sumState(
        cast(
          JSON_VALUE(data, '$.tokens'),
          'Float64'
        )
      ) AS value,
      JSON_VALUE(data, '$.model') as model,
      JSON_VALUE(data, '$.type') as type
    /**
      * Note the missing WHERE block compared to the MATERIALIZED VIEW.
      * Where blocks are not currently supported in PROJECTIONs, with their priority on the roadmap being questionable:
      * https://github.com/ClickHouse/ClickHouse/issues/33678
      *
      * This is not necessaraly an issue as the filtering can take place in the actual query (planner still using the PROJECTION for a limited set of results), but it in itself would warrant some modifications:
      * - Meters wouldn't phisically be separate on the DB level. Meters of different namespaces would use the same underlying projections so additonal caution would be needed.
      * - The projections would be populated with a lot of nonsensical data. e.g: in the example above most of the events wouldn't have a `$.model` field. Clickhouse doesn't complain about per say right now but if the group by functionality would be extended in the future it could cause issues.
     */
    GROUP BY
      windowstart,
      windowend,
      subject,
      model,
      type
  );

A query utilising the above projection would look something like this:

-- Simple way of forcing projeciton to be used by mirroring it's SELECT block
WITH proj AS (
  subject,
      tumbleStart(
        time,
        toIntervalMinute(1)
      ) AS windowstart,
      tumbleEnd(
        time,
        toIntervalMinute(1)
      ) AS windowend,
      sumState(
        cast(
          JSON_VALUE(data, '$.tokens'),
          'Float64'
        )
      ) AS value,
      JSON_VALUE(data, '$.model') as model, -- Note how we need to reference the JSONPath here
      JSON_VALUE(data, '$.type') as type -- And here
  FROM
    openmeter.om_events
  -- Note re-introducing the WHERE block. This does not affect the query planning, it still uses the projection.
  WHERE
    (openmeter.om_events.namespace = 'default')
    AND (empty(openmeter.om_events.validation_error) = 1)
    AND (openmeter.om_events.type = 'prompt')
  GROUP BY
    windowstart,
    windowend,
    subject,
    model,
    type
) SELECT windowstart, windowend, sumMerge(value) AS value, subject, model, type
FROM proj
WHERE
  (subject = ?)
  AND windowstart >= ?
  AND windowend <= ?
GROUP BY
  windowstart,
  windowend,
  subject,
  group1,
  group2
ORDER BY
  windowstart

Where the JSONPath descriptors are required once again, but they are not part of the current API, and there's no simple way to contextually access them as they would not be stored anywhere:

/api/v1/meters/{meterIdOrSlug}/query:
  get:
    description: Query meter
    operationId: queryMeter
    tags:
      - Meters
    parameters:
      - $ref: "#/components/parameters/meterIdOrSlug"
      - $ref: "#/components/parameters/queryFrom"
      - $ref: "#/components/parameters/queryTo"
      - $ref: "#/components/parameters/queryWindowSize"
      - $ref: "#/components/parameters/queryWindowTimeZone"
      - $ref: "#/components/parameters/querySubject"
      - $ref: "#/components/parameters/queryGroupBy"
# ...
queryGroupBy:
  name: groupBy
  in: query
  required: false
  description: |
    If not specified a single aggregate will be returned for each subject and time window.
    `subject` is a reserved group by value.
  schema:
    type: array
    items:
      type: string

Based on this I have 2 suggestions in comparing performance:

Create a synthetic comparison of only the query performance with the data structure currently used. Environment setup could happen through the application (creating meters and projections from startup config) guaranteeing data schema consistency, but the benchmarking would happen outside of it by simple query executions.
Introducing a new API / breaking API change where the JSONPath descriptors are passed along with the query. This would most likely make any PR unmergable as is but would allow for full-system benchmarking. It's also a lot more time-consuming.

Of course I might be misunderstanding something about how the system works, but either way I think further clarification is needed on this.

GAlexIHU mentioned this issue Mar 24, 2024

Feat/perf testing #728

Closed

7 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Benchmark ClickHouse queries #700

Benchmark ClickHouse queries #700

tothandras commented Mar 17, 2024 •

edited

Loading

GAlexIHU commented Mar 23, 2024

Benchmark ClickHouse queries #700

Benchmark ClickHouse queries #700

Comments

tothandras commented Mar 17, 2024 • edited Loading

Preflight Checklist

Problem Description

Proposed Solution

End goal

Additional Information

GAlexIHU commented Mar 23, 2024

tothandras commented Mar 17, 2024 •

edited

Loading