RFC: Clickhouse as our only data store for analytics #110
# Request for comments: Clickhouse as our only data store

## Problem statement

- Having two separate places where we store essentially the same data (person/person_distinct_id/groups) means we often have inconsistencies
- Our plans for the data warehouse are to allow customers to model all of their entities (persons/groups) in SQL in Clickhouse. For this to work, we need to be able to read everything from Clickhouse, everywhere
> **Comment:** This analytics data is already in Clickhouse, right? How would they access app data (non-analytics / their business logic) from the warehouse? I assumed this would be a Postgres connector, rather than loading this data into the warehouse?
>
> **Reply:** Not sure you saw my demo on Monday, but the idea is people will be ETL'ing all sorts of data (Stripe, CRM, production database) into Clickhouse, and then using that to basically set person/group properties. This means PostHog becomes their source of truth, which would be incredibly powerful. For the above to be even more powerful, we'd want to give users the option to use that data to serve feature flags and experiments. For that, Clickhouse needs to be the source of truth for person/group data. Hence this RFC. That's one problem, yes, but the biggest opportunity is this modelling capability, which making Clickhouse our source of truth would enable.
>
> **Reply:** This makes sense, I did see the demo! Given that whatever operation they do to set person/group properties ends up in our ingestion flow, which sets this on both Postgres and Clickhouse, using these with flags should just work? Unless you're imagining a flow where it's not just person/group properties, but any arbitrary ETL tables they generate inside the warehouse? In which case, I guess yeah, figuring out a way to query Clickhouse for flags becomes integral.
>
> **Reply:** I am... One way (you're not going to like it, though) might be to copy these arbitrary ETL properties that are linked to persons into Postgres. We definitely can't go to Clickhouse for every decide request: this is way too slow and also way too expensive for Clickhouse. Then comes caching: either cache all person properties from Clickhouse, so access becomes cheap (but this needs a very large cache); or cache flag responses for people (but this also needs a very large cache, and isn't very useful because any random new anonymous user will miss the cache, and identified users are queried once anyway). So caching doesn't seem like a viable option to me (unless we can get a Very Large Cache). Hence... copy all the data to Postgres and keep it there for fast flag access 🙈

- When Postgres goes down due to load from this data, it takes down the entire app

## Success criteria
*How do we know if this is successful (i.e. metrics, customer feedback)? What's out of scope? What makes this ambitious?*

We no longer store any person/group data in Postgres.

## Context
I think we could revisit some of our assumptions about what our system needs to do in order to make this possible. For example, I think an hour's or a day's delay for analytics would be acceptable, as long as we still have a real-time view of events coming in, even if they aren't enriched with person data.

## Design

There are a couple of places where we use Postgres:

**Ingestion**

Options:
- Read person properties/groups/distinct IDs from Clickhouse, with some kind of cache to avoid overloading it
- A materialized view that joins the person properties onto the table, perhaps at some interval like an hour or a day

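The first option above (reading from Clickhouse behind a cache) can be sketched as a simple read-through cache with a TTL. This is a minimal illustration under stated assumptions, not PostHog code: `fake_fetch` stands in for a hypothetical Clickhouse lookup by distinct ID, and the TTL value is arbitrary.

```python
import time

class PersonPropertyCache:
    """Read-through cache so ingestion doesn't query Clickhouse per event."""

    def __init__(self, fetch, ttl_seconds=3600):
        self._fetch = fetch      # e.g. a Clickhouse lookup keyed by distinct_id
        self._ttl = ttl_seconds
        self._store = {}         # distinct_id -> (expires_at, properties)

    def get(self, distinct_id):
        entry = self._store.get(distinct_id)
        now = time.monotonic()
        if entry and entry[0] > now:
            return entry[1]      # cache hit: no Clickhouse query issued
        properties = self._fetch(distinct_id)
        self._store[distinct_id] = (now + self._ttl, properties)
        return properties

# Usage: count how often the backing store is actually queried.
calls = []

def fake_fetch(distinct_id):
    # Hypothetical stand-in for the Clickhouse lookup.
    calls.append(distinct_id)
    return {"email": f"{distinct_id}@example.com"}

cache = PersonPropertyCache(fake_fetch)
cache.get("user-1")
cache.get("user-1")  # served from cache; only one backing fetch happened
```

The open question for this option is cache sizing and invalidation when person properties change, which the sketch above does not address.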
**Decide/feature flags**

> **Comment:** Flyby, feel free to ignore: it feels like it's OK to run Postgres as a service for just feature flags, and to write/read there for pure speed reasons, in the same way we may need to create (some specialist service) for (some specialist product) in the future. It just happens to be its own DB. It'd mean some minor inconsistencies between persons for feature flags versus everything else, but that's worth the tradeoff for performance here. It feels worse to have inconsistencies in analytics data / business reporting data, hence the other ideas here I think make more sense.

This will be the hardest, as decide does a ton of queries that all need to complete in under 100ms, whereas Clickhouse takes up to a second or so even for the smallest queries.

We'll probably need to do aggressive caching or precalculation here.

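The precalculation route could be sketched as a periodic bulk job: evaluate every flag against person properties (e.g. exported from Clickhouse), and store only the resulting distinct_id → flags map so that each decide request is an O(1) lookup. The person records, flag key, and predicate below are illustrative assumptions, not real PostHog structures.

```python
def precompute_flag_matches(persons, flags):
    """persons: {distinct_id: properties}; flags: {flag_key: predicate}.

    Returns a lookup table of precomputed flag results, so the hot path
    never touches Clickhouse.
    """
    matches = {}
    for distinct_id, properties in persons.items():
        matches[distinct_id] = {
            key: predicate(properties) for key, predicate in flags.items()
        }
    return matches

# Illustrative data: a hypothetical plan-gated flag.
persons = {
    "user-1": {"plan": "enterprise"},
    "user-2": {"plan": "free"},
}
flags = {"beta-insights": lambda props: props.get("plan") == "enterprise"}

decide_table = precompute_flag_matches(persons, flags)
# decide becomes a dictionary lookup, well inside the <100ms budget:
decide_table["user-1"]["beta-insights"]  # True
decide_table["user-2"]["beta-insights"]  # False
```

The tradeoff, raised in the comments above, is that brand-new anonymous users won't appear in a precomputed table, so some fallback evaluation path is still needed.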
**App**
- To display lists of groups in the interface (this will be trivial to swap out)

## Sprints
*How do we break this into discrete and valuable chunks for the folks shipping it? How do we ensure it's high quality and fast?*

> **Comment:** Postgres is the source of truth, and the inconsistency is that Clickhouse data is sometimes wrong; we'll need to figure out a way to fix that for this to work.