I'm looking to understand the operational concerns of running Flyte at scale. I'm especially concerned about the performance characteristics of Postgres, which can easily become a SPoF and a scaling challenge at even modest amounts of activity over time. In particular, it is not clear what data goes into PG DB rows versus object store blobs. More generally, it would be informative to have benchmarks to guide resource usage. To kick off some questions, consider a story where there are ~100 workflows and ~1000 versioned tasks used to run 1M tasks per day. Some questions might include...
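For a rough sense of the write volume that scenario implies, here is a back-of-envelope sketch; the ~4 state-transition events per task is my own assumption, not a published Flyte figure:

```python
# Back-of-envelope for the scenario above: 1M tasks/day.
# events_per_task (~queued/running/succeeded plus retries) is an
# assumed average, not a number from Flyte's docs.
tasks_per_day = 1_000_000
events_per_task = 4

tasks_per_sec = tasks_per_day / 86_400        # seconds in a day
events_per_sec = tasks_per_sec * events_per_task

print(f"{tasks_per_sec:.1f} task launches/sec")  # ~11.6
print(f"{events_per_sec:.1f} event writes/sec")  # ~46.3
```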
hi @mrgleeco,
Firstly, thank you for raising this question and asking to clarify instead of assuming :). Let me try to answer the question in 3 parts.

**Part 1: Understand general architecture and scaling primitives**
It is crucial to understand the architectural choices in order to understand how things can scale.

**Metadata storage - workflow/task versions and definitions**
Postgres is used as the datastore for workflow versions and to render various UI elements. It is assumed that even at large volume the number of workflows and tasks will not exceed 50k, and versioning frequency will not exceed a daily rate. Also, all primary keys are supposed to be partitioned by …
**Execution state store**
If you carefully look at the architecture, Flyte uses multiple clusters to scale out the state store. We have seen problems with scaling a single KubeAPI server, but this is where Flyte's multi-cluster mode helps: it can scale out across multiple k8s clusters.
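Conceptually, that scale-out looks something like the sketch below. The cluster names and the hash-based placement are purely illustrative, not FlyteAdmin's actual assignment logic:

```python
import hashlib

# Illustrative only: spread workflow executions across several k8s
# clusters so no single KubeAPI server absorbs all the CRD traffic.
CLUSTERS = ["k8s-cluster-a", "k8s-cluster-b", "k8s-cluster-c"]  # hypothetical names

def assign_cluster(execution_id: str) -> str:
    """Stable hash-based placement: the same execution always lands on
    the same cluster, and load spreads roughly evenly across clusters."""
    digest = hashlib.sha256(execution_id.encode()).digest()
    return CLUSTERS[digest[0] % len(CLUSTERS)]

print(assign_cluster("exec-20240101-abc123"))
```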
**Execution observation store**
This is where execution information is stored for visualization and recovery. It is not required that this store be in sync with the state store, but in our opinion, if we do not record what has been done, what is the use of making progress :). The default Flyte installation makes this synchronous: the "observation events" from FlytePropeller are written to the Flyte control plane, which writes them to Postgres. The data model has intentionally been built so that it does not rely on RDBMS features - no large joins, no special indexes - and it is conceivably possible to build it on DynamoDB. Another option would be to buffer the events in an intermediate log (like Kafka) before replicating. Again, we have not yet seen a need for this at most companies - from Lyft to Spotify to many others.
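A minimal sketch of that buffering idea, using an in-process queue as a stand-in for a durable log like Kafka; the event shape and the Postgres writer are hypothetical:

```python
import queue
import threading

# Stand-in for a durable log (e.g. Kafka): events are appended to a
# buffer, and a background consumer replicates them to Postgres, so
# the executor never blocks on the database.
event_log: "queue.Queue[dict]" = queue.Queue()

def emit_event(event: dict) -> None:
    """Called on every state transition; returns immediately."""
    event_log.put(event)

def replicate_to_postgres() -> None:
    """Background consumer: drain the log and persist each event."""
    while True:
        event = event_log.get()
        # write_row(event)  # hypothetical Postgres writer
        print("persisted", event)
        event_log.task_done()

threading.Thread(target=replicate_to_postgres, daemon=True).start()
emit_event({"execution": "exec-abc", "phase": "RUNNING"})
event_log.join()  # wait until the buffered event has been persisted
```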
Also, every execution is independent of every other execution, so the data can easily be deleted if really so desired. Some users do partitioning based on execution start time, which makes old data easy to reap. Flyte does not ship with any special DB management tooling today; this remains an active area of work.
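For the partition-by-start-time approach, here is a sketch of the DDL driven through psycopg2. The table and column names are hypothetical, not Flyte's actual schema:

```python
import psycopg2

# Hypothetical schema: range-partition execution events by start time
# so an old partition can be dropped (reaped) in one cheap statement
# instead of DELETEing rows one by one.
DDL = """
CREATE TABLE IF NOT EXISTS execution_events (
    execution_id text NOT NULL,
    phase        text NOT NULL,
    started_at   timestamptz NOT NULL
) PARTITION BY RANGE (started_at);

CREATE TABLE IF NOT EXISTS execution_events_2024_01
    PARTITION OF execution_events
    FOR VALUES FROM ('2024-01-01') TO ('2024-02-01');
"""

# Run periodically (e.g. by a cron job) to retire a whole month at once.
REAP = "DROP TABLE IF EXISTS execution_events_2024_01;"

with psycopg2.connect("dbname=flyteadmin") as conn:  # DSN is illustrative
    with conn.cursor() as cur:
        cur.execute(DDL)
```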
**Part 1b: Side note - liveness/safety properties**
A goal when designing Flyte was that executions should continue to progress even when the core DB is down for scheduled maintenance, including being down for minutes (this is configurable). All execution progress (starting new tasks) will be paused, but existing long-running tasks - often the case in machine learning and data-processing workflows - will continue to run. Also, all metadata required for tasks to run is actually stored in a blob store, to ensure high availability (albeit at the loss of some performance). That performance can be seriously improved by simple caching.
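That "simple caching" can be as small as memoizing blob reads, since a task definition is immutable once versioned. A sketch, where `fetch_blob` is a hypothetical stand-in for a blob-store client:

```python
from functools import lru_cache

def fetch_blob(uri: str) -> bytes:
    # Placeholder for e.g. an S3/GCS GET returning a serialized task spec.
    return b"serialized-task-spec"

@lru_cache(maxsize=10_000)
def fetch_task_spec(blob_uri: str) -> bytes:
    """Task specs are immutable per version, so a cached read is always
    valid; repeat executions skip the blob-store round trip entirely."""
    return fetch_blob(blob_uri)

print(fetch_task_spec("s3://bucket/task-specs/v1"))
print(fetch_task_spec.cache_info())  # a second call would be a cache hit
```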
**Part 2: Benchmarks, object store vs Postgres**
Currently we do not actively publish any benchmarks, but we are publishing scripts to understand performance much better.

**Part 3: Your specific questions**
- how many tables at day 1? At day 365?
- how many rows at day 1 and at day 365?
- how many indexes? How many are primary vs secondary?
- is there any background archival activity?
- what are some of the recommended PG config settings?
**Future**
Keep an eye out for an exciting update coming soon that makes it much easier to run Flyte on …
Thanks @kumare3 wdyt?