
RFC: Clickhouse as our only data store for analytics #110

Conversation

@timgl timgl commented May 30, 2023

A bit of a brain dump, but I want to start thinking about whether this is possible at all.

@timgl timgl changed the title Clickhouse as our only data store RFC: Clickhouse as our only data store May 30, 2023
- Read person properties/groups/distinct ids from Clickhouse, with some kind of cache to avoid overloading
- Materialized view that joins the person properties onto the table, perhaps at some interval like an hour or day.
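As a rough sketch of the second bullet (table and column names here are illustrative, not our actual schema): since a true ClickHouse materialized view fires per insert rather than on a timer, the "every hour or day" variant maps more naturally to a scheduled `INSERT ... SELECT`, something like:

```python
# Hypothetical sketch: periodically denormalise person properties onto events.
# Run on a schedule (e.g. hourly) rather than as an insert-triggered MV.
from clickhouse_driver import Client

client = Client(host="localhost")

client.execute(
    """
    INSERT INTO events_with_person_props
    SELECT
        e.uuid,
        e.event,
        e.timestamp,
        e.distinct_id,
        p.properties AS person_properties
    FROM events AS e
    LEFT JOIN person_distinct_id AS pdi ON e.distinct_id = pdi.distinct_id
    LEFT JOIN person AS p ON pdi.person_id = p.id
    WHERE e.timestamp >= now() - INTERVAL 1 HOUR
    """
)
```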

**Decide/feature flags**
Contributor

@jamesefhawkins jamesefhawkins May 30, 2023

Flyby, feel free to ignore:

It feels like it's OK to run postgres as a service for just feature flags, and to write/read there for pure speed reasons, in the same way we may need to create (some specialist service) for (some specialist product) in future. It just happens to be its own DB.

It'd mean some minor inconsistencies between persons for feature flags versus everything else, but that's worth the tradeoff for performance here. It feels worse to have inconsistencies in analytics data / business reporting data, hence I think the other ideas here make more sense.

Contributor

@benjackwhite

"RFC: Clickhouse as our only data store for analytics" - right?

@timgl timgl changed the title RFC: Clickhouse as our only data store RFC: Clickhouse as our only data store for analytics May 31, 2023
Contributor

@neilkakkar neilkakkar left a comment

Trying to figure out the problem here better - I don't get it yet.

Seems so far that the only real issue right now is postgres going down due to analytics load, which brings down the entire app. Is this a fair summary of the problem?

Which is valid, but it seems like a problem that can be solved without trying to move all data to clickhouse (e.g. by using the read replicas once the postgres migration is done; see the sketch below).

The reason I'm very hesitant to go down this route is that it's a lot of effort to make this work for decide / feature flags and hard to get right, with currently no real benefit (as it seems to me).
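A minimal sketch of what the read-replica option could look like in the Django app, assuming a `replica` entry exists in `settings.DATABASES`; the model allowlist here is made up:

```python
# Hypothetical allowlist of read-heavy analytics models; everything else
# (feature flags, app config) keeps hitting the primary.
ANALYTICS_MODELS = {"event", "person", "persondistinctid"}

class AnalyticsReplicaRouter:
    """Route analytics reads to a replica. Add to settings.DATABASE_ROUTERS."""

    def db_for_read(self, model, **hints):
        # Heavy analytics queries go to the replica so they can't take
        # down the primary that serves decide / feature flags.
        if model._meta.model_name in ANALYTICS_MODELS:
            return "replica"
        return "default"

    def db_for_write(self, model, **hints):
        # All writes always go to the primary.
        return "default"
```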


> ## Problem statement
>
> - Having two separate places where we store essentially the same data (person/person_distinct_id/groups) means we often have inconsistencies
Contributor

Postgres is the source of truth, and this inconsistency is that clickhouse data is sometimes wrong - we'll need to figure out a way to fix that for this to work.

> - Our plans with the data warehouse are to allow customers to model all of their entities (persons/groups) in SQL in Clickhouse. For this to work we need to be able to read everything everywhere from clickhouse
Contributor

This analytics data is already in clickhouse, right?

How would they access app data (non-analytics / their business logic) from the warehouse? I assumed this would be a postgres connector, rather than loading this data into the warehouse?

Author

Not sure you saw my demo on Monday, but the idea is people will be ETL'ing all sorts of data (Stripe, CRM, production database) into Clickhouse, and then using that to basically set person/group properties. This means PostHog becomes their source of truth, which would be incredibly powerful.

For the above to be even more powerful, we'd want to give users the option to use that data to serve feature flags and experiments. For that, Clickhouse needs to be the source of truth for person/groups data. Hence this RFC.

> Seems so far that the only real issue right now is postgres going down due to analytics load, which brings down the entire app. Is this a fair summary of the problem?

That's one problem, yes, but the biggest opportunity is this modelling thing, which making Clickhouse our source of truth would enable.

Contributor

This makes sense - I did see the demo!

> and then using that to basically set person/group properties.

Given that whatever operation they do to set person/group properties ends up in our ingestion flow, which sets this on both postgres / clickhouse (sketched below), using these with flags should just work?

unless you're imagining a flow where it's not just person/group properties, but any random ETL tables they generate inside the warehouse? In which case, I guess yeah figuring out a way to query clickhouse for flags becomes integral.
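To make the "sets this on both postgres / clickhouse" point concrete, a collapsed sketch of the dual write; the real pipeline goes through Kafka and the plugin server rather than writing directly, and the table names are illustrative:

```python
import json
import time

def set_person_properties(pg_conn, ch_client, person_uuid: str, properties: dict) -> None:
    """Hypothetical dual write: Postgres stays the flags source of truth,
    ClickHouse gets the same update for analytics."""
    # Postgres: merge the new properties into the existing jsonb blob.
    with pg_conn.cursor() as cursor:
        cursor.execute(
            "UPDATE posthog_person SET properties = properties || %s::jsonb WHERE uuid = %s",
            (json.dumps(properties), person_uuid),
        )
    pg_conn.commit()

    # ClickHouse: append a new row; the merge keeps the highest version
    # per person, so the latest write wins eventually.
    ch_client.execute(
        "INSERT INTO person (id, properties, version) VALUES",
        [(person_uuid, json.dumps(properties), int(time.time()))],
    )
```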

Author

> unless you're imagining a flow where it's not just person/group properties, but any random ETL tables they generate inside the warehouse? In which case, I guess yeah figuring out a way to query clickhouse for flags becomes integral.

I am...

Contributor

One way (you're not going to like it though) might be to copy these random ETL props that are linked to persons into postgres.

We definitely can't go to clickhouse for every decide request: this is way too slow & also way too expensive for clickhouse.

Then comes caching: either cache all person properties from Clickhouse, so access becomes cheap [but this needs a very large cache]; or cache flag responses for people [but this also needs a very large cache & isn't very useful, because any random new anon user will miss the cache, and identified users are queried once anyway].

So caching doesn't seem like a viable option to me (unless we can get a Very Large Cache). Hence... copy all data to postgres & keep it there for fast flag access 🙈
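For reference, a sketch of the first option: person properties cached in Redis in front of ClickHouse, so a decide request only pays the ClickHouse cost on a miss. Key scheme, TTL, and table layout are assumptions, and the point above stands that the working set makes this cache very large:

```python
import json

import redis
from clickhouse_driver import Client

r = redis.Redis()
ch = Client(host="localhost")

CACHE_TTL_SECONDS = 300  # assumed staleness tolerance for flag evaluation

def get_person_properties(team_id: int, distinct_id: str) -> dict:
    """Return person properties for flag evaluation, Redis-first."""
    key = f"person_props:{team_id}:{distinct_id}"
    cached = r.get(key)
    if cached is not None:
        return json.loads(cached)

    # Cache miss: resolve distinct_id -> person and fetch its latest
    # properties from ClickHouse (hypothetical, simplified schema).
    rows = ch.execute(
        """
        SELECT properties FROM person
        WHERE team_id = %(team_id)s
          AND id = (
              SELECT person_id FROM person_distinct_id
              WHERE team_id = %(team_id)s AND distinct_id = %(distinct_id)s
              ORDER BY version DESC LIMIT 1
          )
        ORDER BY version DESC LIMIT 1
        """,
        {"team_id": team_id, "distinct_id": distinct_id},
    )
    props = json.loads(rows[0][0]) if rows else {}
    r.setex(key, CACHE_TTL_SECONDS, json.dumps(props))
    return props
```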
