@alexluong commented Dec 4, 2025

LogStore Refactor: Stateless Schema Design

Summary

This PR refactors the log storage layer with a stateless schema design - queries return rows directly without aggregation, GROUP BY, or complex joins. The schema is optimized for O(limit) query performance.

Changes:

  • ClickHouse LogStore implementation with stateless schema design
  • New API endpoints (/deliveries, /events) with redesigned request/response schema
  • Portal UX updated to deliveries-centric view
  • Test infrastructure improvements (faster runs, reduced flakiness)

AI Disclosure

This PR was heavily AI-assisted for implementation. However, the following were done manually (some discussed with the team across multiple sessions):

  • Interface design and API surface
  • Database schema design
  • Test suite design and conformance tests
  • QA of API endpoints and Portal UI
  • Code review of all generated code

Note on test changes: There are significant test code changes in this PR. These are intentional refactoring for:

  1. Improving the AI feedback loop (faster test runs)
  2. Adapting to the new API surface

Tests were passing before refactoring. The pattern was: feature tests pass → refactor tests → verify tests still pass.


Review Areas

1. LogStore Interface & Data Layer

Interface

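// LogStore is the storage interface that every driver implements.
type LogStore interface {
    // List methods return one page of rows plus pagination cursors.
    ListEvent(ctx context.Context, req ListEventRequest) (ListEventResponse, error)
    ListDeliveryEvent(ctx context.Context, req ListDeliveryEventRequest) (ListDeliveryEventResponse, error)
    // Retrieve methods return a single record.
    RetrieveEvent(ctx context.Context, req RetrieveEventRequest) (*models.Event, error)
    RetrieveDeliveryEvent(ctx context.Context, req RetrieveDeliveryEventRequest) (*models.DeliveryEvent, error)
    // InsertManyDeliveryEvent writes a batch of delivery attempts.
    InsertManyDeliveryEvent(ctx context.Context, deliveryEvents []*models.DeliveryEvent) error
}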

Request structs support:

  • Cursor pagination (Next, Prev)
  • Time filtering (EventStart, EventEnd for events; Start, End for deliveries)
  • Sorting (SortOrder: asc|desc - sort dimension is fixed per endpoint)
  • Filtering (TenantID, DestinationIDs, Topics, Status, EventID)
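As a rough illustration of the options listed above, a delivery list request could be shaped like this. The field names come from the list; the field types, the `Limit` field, and the `Validate` helper are assumptions made for this sketch and are not the definitions in internal/logstore/driver.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// ListDeliveryEventRequest is a hypothetical shape for the request struct.
// Field names follow the PR description; the types are assumptions.
type ListDeliveryEventRequest struct {
	TenantID       string
	DestinationIDs []string
	Topics         []string
	Status         string
	EventID        string
	Start, End     *time.Time // delivery time range; nil means unbounded
	SortOrder      string     // "asc" or "desc"; empty means the driver default
	Next, Prev     string     // opaque pagination cursors
	Limit          int        // assumed field, not named in the PR description
}

// Validate is an illustrative helper (not from the PR) that rejects
// combinations a driver could not serve sensibly.
func (r ListDeliveryEventRequest) Validate() error {
	if r.Next != "" && r.Prev != "" {
		return errors.New("next and prev cursors are mutually exclusive")
	}
	if r.SortOrder != "" && r.SortOrder != "asc" && r.SortOrder != "desc" {
		return errors.New("sort order must be asc or desc")
	}
	if r.Start != nil && r.End != nil && r.End.Before(*r.Start) {
		return errors.New("end must not be before start")
	}
	return nil
}

func main() {
	req := ListDeliveryEventRequest{
		TenantID:  "tenant_1",
		Status:    "failed",
		SortOrder: "desc",
		Limit:     50,
	}
	fmt.Println("validation error:", req.Validate())
}
```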

ClickHouse Schema

Two tables with stateless query design - no GROUP BY, no aggregation:

-- Events table: each row = one unique event (deduplicated by ReplacingMergeTree)
CREATE TABLE IF NOT EXISTS events (
    event_id String,
    tenant_id String,
    destination_id String,
    topic String,
    eligible_for_retry Bool,
    event_time DateTime64(3),
    metadata String,
    data String,

    INDEX idx_tenant_id tenant_id TYPE bloom_filter GRANULARITY 1,
    INDEX idx_destination_id destination_id TYPE bloom_filter GRANULARITY 1,
    INDEX idx_event_id event_id TYPE bloom_filter GRANULARITY 1,
    INDEX idx_topic topic TYPE bloom_filter GRANULARITY 1
) ENGINE = ReplacingMergeTree
PARTITION BY toYYYYMM(event_time)
ORDER BY (event_time, event_id);

-- Deliveries table: each row = one delivery attempt with embedded event data
CREATE TABLE IF NOT EXISTS deliveries (
    -- Event fields
    event_id String,
    tenant_id String,
    destination_id String,
    topic String,
    eligible_for_retry Bool,
    event_time DateTime64(3),
    metadata String,
    data String,

    -- Delivery fields
    delivery_id String,
    delivery_event_id String,
    status String,
    delivery_time DateTime64(3),
    code String,
    response_data String,
    manual Bool DEFAULT false,
    attempt UInt32 DEFAULT 0,

    INDEX idx_tenant_id tenant_id TYPE bloom_filter GRANULARITY 1,
    INDEX idx_destination_id destination_id TYPE bloom_filter GRANULARITY 1,
    INDEX idx_event_id event_id TYPE bloom_filter GRANULARITY 1,
    INDEX idx_delivery_id delivery_id TYPE bloom_filter GRANULARITY 1,
    INDEX idx_topic topic TYPE bloom_filter GRANULARITY 1,
    INDEX idx_status status TYPE set(100) GRANULARITY 1
) ENGINE = ReplacingMergeTree
PARTITION BY toYYYYMM(delivery_time)
ORDER BY (delivery_time, delivery_id);

Design rationale:

  • deliveries: Denormalized delivery+event rows for O(limit) delivery queries
  • events: Separate table for O(limit) event listing without GROUP BY
  • ReplacingMergeTree: Tolerates duplicate inserts by collapsing rows that share the same ORDER BY key during background merges
  • Monthly partitions: Efficient time-range pruning
  • Bloom filters: Skip granules for point lookups
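To make the O(limit) idea concrete, here is a small in-memory sketch of keyset pagination over rows kept sorted by (delivery_time, delivery_id). It is only an analogy for how a sorted table can serve one page without aggregation; the `row` type and `pageAfter` function are hypothetical and are not the ClickHouse driver's code.

```go
package main

import (
	"fmt"
	"sort"
)

// row is a stand-in for one deliveries row (hypothetical type).
type row struct {
	TimeMs int64  // delivery_time as Unix milliseconds
	ID     string // delivery_id
}

// less orders rows by (TimeMs, ID), mirroring ORDER BY (delivery_time, delivery_id).
func less(a, b row) bool {
	if a.TimeMs != b.TimeMs {
		return a.TimeMs < b.TimeMs
	}
	return a.ID < b.ID
}

// pageAfter returns up to limit rows strictly after cursor, in ascending order.
// rows must already be sorted with less. The cost is a binary search to find
// the cursor position plus reading at most limit rows.
func pageAfter(rows []row, cursor *row, limit int) (page []row, next *row) {
	start := 0
	if cursor != nil {
		// First index whose row sorts strictly after the cursor.
		start = sort.Search(len(rows), func(i int) bool { return less(*cursor, rows[i]) })
	}
	end := start + limit
	if end > len(rows) {
		end = len(rows)
	}
	if end < start {
		end = start // guards against a negative limit
	}
	page = rows[start:end]
	if end < len(rows) && len(page) > 0 {
		last := page[len(page)-1]
		next = &last // more rows remain, so hand back a cursor
	}
	return page, next
}

func main() {
	rows := []row{{1, "a"}, {1, "b"}, {2, "a"}, {3, "a"}}
	page, next := pageAfter(rows, nil, 2)
	fmt.Println(page, next != nil)
	page, next = pageAfter(rows, next, 2)
	fmt.Println(page, next != nil)
}
```

Two rows sharing a timestamp are still ordered deterministically because the ID acts as a tiebreaker, which is why the cursor carries both values.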

Relevant Packages

  • internal/logstore/driver - interface
  • internal/logstore/cursor - cursor encoding
  • internal/logstore/chlogstore - ClickHouse
  • internal/logstore/pglogstore - PostgreSQL
  • internal/logstore/memlogstore - in-memory (reference impl)
  • internal/logstore/drivertest - conformance tests

2. API Layer

Routes

| Method | Path | Handler |
|--------|------|---------|
| GET | /:tenantID/deliveries | ListDeliveries |
| GET | /:tenantID/deliveries/:deliveryID | RetrieveDelivery |
| POST | /:tenantID/deliveries/:deliveryID/retry | RetryDelivery |
| GET | /:tenantID/events | ListEvents |
| GET | /:tenantID/events/:eventID | RetrieveEvent |

Query Parameters (ListDeliveries)

destination_id, event_id, status, topic
start, end (delivery time range)
sort_order (asc|desc)
limit (default 100, max 1000), next, prev
include (event|event.data|response_data)
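The limit and sort_order rules above could be enforced with a small parser like the one below. This is a sketch using only the standard library. Clamping an oversized limit to 1000 (rather than rejecting it) and defaulting sort_order to desc are assumptions; the PR description only states the default and maximum limit and the two allowed sort values. `parseListParams` and `listParams` are hypothetical names.

```go
package main

import (
	"fmt"
	"net/url"
	"strconv"
)

const (
	defaultLimit = 100  // from the PR description
	maxLimit     = 1000 // from the PR description
)

// listParams holds the parsed paging options (hypothetical type).
type listParams struct {
	Limit     int
	SortOrder string
}

// parseListParams reads limit and sort_order from a raw query string.
// Assumptions: limits above maxLimit are clamped, and the default
// sort order is "desc".
func parseListParams(rawQuery string) (listParams, error) {
	q, err := url.ParseQuery(rawQuery)
	if err != nil {
		return listParams{}, err
	}
	p := listParams{Limit: defaultLimit, SortOrder: "desc"}
	if raw := q.Get("limit"); raw != "" {
		n, err := strconv.Atoi(raw)
		if err != nil || n < 1 {
			return listParams{}, fmt.Errorf("invalid limit %q", raw)
		}
		if n > maxLimit {
			n = maxLimit
		}
		p.Limit = n
	}
	if so := q.Get("sort_order"); so != "" {
		if so != "asc" && so != "desc" {
			return listParams{}, fmt.Errorf("invalid sort_order %q", so)
		}
		p.SortOrder = so
	}
	return p, nil
}

func main() {
	p, err := parseListParams("status=failed&limit=5000&sort_order=asc")
	fmt.Println(p.Limit, p.SortOrder, err)
}
```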

Include Pattern

The ?include parameter controls nested object hydration using dot notation:

| include | event field |
|---------|-------------|
| (none) | "event": "evt_123" (ID only) |
| event | "event": {id, topic, time, eligible_for_retry, metadata} |
| event.data | "event": {..., data} (includes payload) |

By default, related entities return as IDs. include=event hydrates the event object without payload. include=event.data additionally includes the (potentially large) event data. This keeps responses lean while allowing single-request hydration when needed.
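A minimal sketch of how the three hydration levels could be produced follows. The `event` type is trimmed to a few fields, and accepting a comma-separated include list is an assumption; neither reflects the actual handler code in internal/apirouter.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// event is a trimmed-down, hypothetical event record.
type event struct {
	ID    string
	Topic string
	Data  json.RawMessage
}

// sampleEvent is example data used only for this demonstration.
var sampleEvent = event{
	ID:    "evt_123",
	Topic: "user.created",
	Data:  json.RawMessage(`{"n":1}`),
}

// renderEvent returns the value of the "event" field for a given include
// parameter: the bare ID by default, an object for "event", and an object
// with the payload for "event.data".
func renderEvent(ev event, include string) any {
	wantEvent, wantData := false, false
	// Assumption: multiple include values are comma-separated.
	for _, part := range strings.Split(include, ",") {
		switch strings.TrimSpace(part) {
		case "event":
			wantEvent = true
		case "event.data":
			wantEvent, wantData = true, true
		}
	}
	if !wantEvent {
		return ev.ID
	}
	out := map[string]any{"id": ev.ID, "topic": ev.Topic}
	if wantData {
		out["data"] = ev.Data
	}
	return out
}

// renderJSON wraps renderEvent in a delivery-like object and encodes it.
func renderJSON(ev event, include string) string {
	b, err := json.Marshal(map[string]any{"event": renderEvent(ev, include)})
	if err != nil {
		return ""
	}
	return string(b)
}

func main() {
	for _, include := range []string{"", "event", "event.data"} {
		fmt.Printf("include=%q -> %s\n", include, renderJSON(sampleEvent, include))
	}
}
```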

Legacy Routes (Deprecated)

These routes are preserved, and their responses carry a Deprecation: true header:

  • GET /:tenantID/destinations/:destID/events
  • GET /:tenantID/destinations/:destID/events/:eventID
  • POST /:tenantID/destinations/:destID/events/:eventID/retry
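One way to attach such a header is a small middleware around the legacy handlers. The sketch below uses the standard net/http package for portability; the router actually used in internal/apirouter may differ, and `deprecated`, `legacyListEvents`, and `probe` are hypothetical names.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// deprecated wraps a handler so every response carries "Deprecation: true".
func deprecated(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Deprecation", "true")
		next.ServeHTTP(w, r)
	})
}

// legacyListEvents stands in for a legacy route handler (hypothetical).
var legacyListEvents = http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
	w.WriteHeader(http.StatusOK)
})

// probe sends a GET request to h in memory and reports the status code
// and the value of the Deprecation header.
func probe(h http.Handler, path string) (int, string) {
	rec := httptest.NewRecorder()
	req := httptest.NewRequest(http.MethodGet, path, nil)
	h.ServeHTTP(rec, req)
	res := rec.Result()
	return res.StatusCode, res.Header.Get("Deprecation")
}

func main() {
	code, dep := probe(deprecated(legacyListEvents), "/tenant_1/destinations/dest_1/events")
	fmt.Println(code, dep)
}
```

Clients can then detect the header and migrate to the new /deliveries and /events routes at their own pace.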

Relevant Packages

  • internal/apirouter

3. Portal UI

  • Deliveries table instead of Events table
  • Click row → delivery details sidebar
  • Retry button on deliveries

Relevant Packages

  • internal/portal/src

4. Test Infrastructure

Performance

Test suite runtime reduced from 1-2 min to 30-40s by replacing fixed sleeps with polling helpers and by reducing timeouts.

| Package | Commit | Change |
|---------|--------|--------|
| cmd/e2e | 19dadb4 | Replace setup sleeps with health check polling |
| cmd/e2e | 0651251 | Replace fixed delay with polling in alert test |
| cmd/e2e | 706db81 | Reorganize test suites, fix testinfra race condition |
| internal/logstore | 381fe6e | Replace fixed sleeps with polling in log tests |
| internal/logstore | 072c0ad | Allow immediate log batch flush with threshold=0 |
| internal/destregistry/destwebhook | 1ed441c | Replace fixed delays with polling |
| internal/destregistry/destawskinesis | 9fe3bd6 | Speed up Kinesis stream waiter |
| internal/idempotence | b8be306 | Reduce test timeouts |
| internal/mqinfra | b8be306 | Reduce test timeouts |
| internal/mqs | b8be306 | Reduce test timeouts |
| internal/models | 7fdb59f | Parallelize ListTenant tests |
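The general shape of replacing a fixed sleep with polling is sketched below. `waitFor` and `pollUntil` are hypothetical helpers written for this illustration; the helpers added in the commits above may have different names and signatures.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// waitFor checks cond immediately and then once per interval until cond
// returns true or ctx is done. It returns nil as soon as cond succeeds,
// so a fast system is not penalized by a worst-case sleep.
func waitFor(ctx context.Context, interval time.Duration, cond func() bool) error {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		if cond() {
			return nil
		}
		select {
		case <-ctx.Done():
			return fmt.Errorf("condition not met: %w", ctx.Err())
		case <-ticker.C:
		}
	}
}

// pollUntil is a convenience wrapper that bounds waitFor with a timeout.
func pollUntil(timeout, interval time.Duration, cond func() bool) error {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()
	return waitFor(ctx, interval, cond)
}

func main() {
	start := time.Now()
	// Simulated readiness: becomes true roughly 30ms after start.
	ready := func() bool { return time.Since(start) > 30*time.Millisecond }

	err := pollUntil(time.Second, 5*time.Millisecond, ready)
	fmt.Println("ready:", err == nil)
}
```

Compared with a fixed `time.Sleep`, the test finishes as soon as the condition holds, and the timeout only matters on failure.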

Conformance Tests

Shared test suite (drivertest) that all LogStore drivers must pass.
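The conformance pattern can be sketched as one set of checks that takes a store constructor, so every driver is exercised by identical assertions. Everything below (the tiny `Store` interface, `memStore`, and `RunConformance`) is a simplified stand-in invented for illustration, not the drivertest package itself.

```go
package main

import (
	"fmt"
	"sort"
)

// Delivery is a minimal stand-in for a delivery record.
type Delivery struct {
	ID     string
	TimeMs int64
}

// Store is a deliberately tiny stand-in for the LogStore interface.
type Store interface {
	Insert(ds []Delivery) error
	List(limit int) ([]Delivery, error) // newest first
}

// memStore is an in-memory implementation keyed by delivery ID.
type memStore struct{ rows map[string]Delivery }

func newMemStore() Store { return &memStore{rows: map[string]Delivery{}} }

func (m *memStore) Insert(ds []Delivery) error {
	for _, d := range ds {
		m.rows[d.ID] = d // duplicate IDs overwrite instead of adding rows
	}
	return nil
}

func (m *memStore) List(limit int) ([]Delivery, error) {
	out := make([]Delivery, 0, len(m.rows))
	for _, d := range m.rows {
		out = append(out, d)
	}
	sort.Slice(out, func(i, j int) bool {
		if out[i].TimeMs != out[j].TimeMs {
			return out[i].TimeMs > out[j].TimeMs
		}
		return out[i].ID > out[j].ID
	})
	if limit >= 0 && limit < len(out) {
		out = out[:limit]
	}
	return out, nil
}

// RunConformance runs the same checks against any Store implementation
// and returns a description of every failed expectation.
func RunConformance(newStore func() Store) []string {
	var failures []string
	s := newStore()
	err := s.Insert([]Delivery{
		{ID: "d1", TimeMs: 1}, {ID: "d2", TimeMs: 2},
		{ID: "d2", TimeMs: 2}, {ID: "d3", TimeMs: 3},
	})
	if err != nil {
		return append(failures, "insert failed: "+err.Error())
	}
	page, err := s.List(2)
	if err != nil {
		return append(failures, "list failed: "+err.Error())
	}
	if len(page) != 2 || page[0].ID != "d3" || page[1].ID != "d2" {
		failures = append(failures, fmt.Sprintf("unexpected first page: %v", page))
	}
	if all, _ := s.List(10); len(all) != 3 {
		failures = append(failures, "duplicate insert was not deduplicated")
	}
	return failures
}

func main() {
	fmt.Println("failures:", RunConformance(newMemStore))
}
```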

Relevant Packages

  • cmd/e2e
  • internal/logstore/drivertest

Migration Notes

  • No breaking changes - legacy routes preserved with deprecation headers
  • Database migrations - auto-migration handles schema updates


@alexluong changed the title from "refactor: logstore" to "refactor: logstore & log api" on Dec 4, 2025
@alexluong alexluong marked this pull request as draft December 4, 2025 15:48
alexluong and others added 8 commits January 17, 2026 01:03

Include last count, status code, and any errors in the timeout
message to help debug flaky test failures.

Co-Authored-By: Claude Opus 4.5 <[email protected]>

Prevents tests from hanging indefinitely if the server doesn't respond.

Co-Authored-By: Claude Opus 4.5 <[email protected]>

- Remove Event ID column from deliveries table
- Remove Event Time from delivery details
- Flatten delivery details into single section (no separate Event section)
- Rename data sections: 'Data', 'Metadata', 'Response'
- Reorder fields: Status, Response Code, Attempt, Topic, Delivered at, IDs

Convert buildEventCursorCondition and buildCursorCondition to use
parameterized queries with ? placeholders instead of string interpolation
to prevent SQL injection vulnerabilities. Add parseTimestampMs helper to
validate timestamp strings before use.

Co-Authored-By: Claude Opus 4.5 <[email protected]>
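For readers unfamiliar with the fix described in the commit above, here is a rough sketch of a parameterized cursor condition. The function names match the commit message, but the signatures, the SQL text, and the use of ClickHouse's `fromUnixTimestamp64Milli` function are assumptions made for illustration rather than the code in chlogstore.

```go
package main

import (
	"fmt"
	"strconv"
)

// parseTimestampMs validates that s is a non-negative integer number of
// Unix milliseconds before it is used anywhere near a query.
func parseTimestampMs(s string) (int64, error) {
	ms, err := strconv.ParseInt(s, 10, 64)
	if err != nil || ms < 0 {
		return 0, fmt.Errorf("invalid timestamp %q", s)
	}
	return ms, nil
}

// buildCursorCondition returns a WHERE fragment with ? placeholders plus
// the arguments to bind. Only the comparison operator, chosen from a fixed
// set, is interpolated; cursor values never enter the SQL text.
func buildCursorCondition(cursorTime, cursorID string, desc bool) (string, []any, error) {
	ms, err := parseTimestampMs(cursorTime)
	if err != nil {
		return "", nil, err
	}
	op := ">"
	if desc {
		op = "<"
	}
	cond := fmt.Sprintf(
		"(delivery_time %s fromUnixTimestamp64Milli(?) OR (delivery_time = fromUnixTimestamp64Milli(?) AND delivery_id %s ?))",
		op, op,
	)
	return cond, []any{ms, ms, cursorID}, nil
}

func main() {
	cond, args, err := buildCursorCondition("1700000000000", "del_42", true)
	fmt.Println(cond)
	fmt.Println(args, err)
}
```

A hostile cursor ID ends up in the argument list and is bound by the database driver, and a hostile timestamp is rejected by `parseTimestampMs` before any query is built.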
alexluong and others added 6 commits January 17, 2026 16:43

The SortBy option (event_time vs delivery_time) added complexity to cursor
pagination and query logic without significant benefit. This change simplifies
the API by always sorting by delivery_time, which is the natural ordering for
delivery-focused queries.

Changes:
- Remove SortBy field from ListDeliveryEventRequest in driver.go
- Simplify memlogstore, pglogstore, chlogstore to always use delivery_time sort
- Remove sort_by query param validation from API handlers
- Remove testCursorMismatchedSortBy and related tests from drivertest
- Update log_handlers_test.go to remove sort_by validation tests

Co-Authored-By: Claude Opus 4.5 <[email protected]>

- Change partition strategy from daily (toYYYYMMDD) to monthly (toYYYYMM)
  to reduce part count and merge pressure
- Reduce bloom filter granularity from 4 to 1 for better index selectivity
  on low-cardinality columns like tenant_id and destination_id

Co-Authored-By: Claude Opus 4.5 <[email protected]>