We should implement a batch counterpart to `withStreamInputTable`. The primary use case I have in mind is aggregation of streaming events. Consider a DW with two tables, `impressions` and `clicks`. The business would like to create a higher-level table that includes some aggregated hourly facts, such as click-through rate. To accomplish this, once an hour you need to run a job that scans an hour's worth of clicks and an hour's worth of impressions, and outputs an S3 file with the contents of the summary.
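For concreteness, the hourly roll-up query might look something like the sketch below. This is purely illustrative: the column names (`id`, `impression_id`, `event_hour`), the bucket layout, and the assumption of an Athena-style SQL engine over hour-partitioned tables are placeholders, not part of the existing code.

```typescript
// Hypothetical hour being aggregated; in practice this would be derived
// from the schedule trigger time.
const targetHour = "2019/11/01/00";

// Illustrative roll-up: join an hour of clicks to an hour of impressions
// and compute click-through rate. All table/column names are assumptions.
const summaryQuery = `
    SELECT i.event_hour,
           COUNT(DISTINCT i.id) AS impression_count,
           COUNT(DISTINCT c.id) AS click_count,
           CAST(COUNT(DISTINCT c.id) AS DOUBLE)
               / COUNT(DISTINCT i.id) AS click_through_rate
    FROM impressions i
    LEFT JOIN clicks c ON c.impression_id = i.id
    WHERE i.event_hour = '${targetHour}'
    GROUP BY i.event_hour`;
```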
We can provide an API, `withBatchInputTable`, that enables this sort of scenario using ECS Fargate. Mainly, we can allow the user to define a function to run in a container (the code that issues a query to one or more tables, writes the results to the correct S3 location, and even creates a partition if necessary) and to define the interval that the task should run on.
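To make the proposal concrete, here is one possible shape for the API. Everything here is hypothetical: `db`, the option names, and the helper functions are placeholders for whatever we land on.

```typescript
// Hypothetical usage of the proposed withBatchInputTable; none of this
// exists yet, and all names are placeholders.
db.withBatchInputTable("hourly_summary", {
    schedule: "rate(1 hour)", // how often the batch task runs
    // User-defined function run in the container: query one or more tables,
    // write the results to the correct S3 location, and create a partition
    // if necessary. runQuery/writeResults/addPartition are hypothetical helpers.
    task: async (hour: string) => {
        const rows = await runQuery(summaryQueryFor(hour));
        await writeResults(`s3://my-dw-bucket/hourly_summary/${hour}/`, rows);
        await addPartition("hourly_summary", hour);
    },
});
```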
We can certainly start with a prototype using Lambda. I think that would be a good proof of concept to get the API surface area fleshed out. Eventually we should probably use something like ECS Fargate, given Lambda's memory and disk limitations. ECS Fargate gives you up to 30 GB of RAM, which enables more flexibility in the scale of aggregations and queries that users will be able to do.
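As a sketch of what the Lambda prototype could look like, here is a scheduled function that kicks off the roll-up through Athena. This assumes an Athena-backed warehouse; the database name and output bucket are placeholders.

```typescript
import * as aws from "@pulumi/aws";

// Minimal sketch of the Lambda prototype: a CloudWatch-scheduled function
// that runs the hourly roll-up query via Athena.
const summaryQuery = "SELECT ... FROM impressions ..."; // roll-up query as sketched above

aws.cloudwatch.onSchedule("hourlySummary", "rate(1 hour)", async () => {
    const AWS = await import("aws-sdk");
    const athena = new AWS.Athena();
    // Kick off the query; Athena writes the result file to the S3 output
    // location, which doubles as the summary table's storage.
    await athena.startQueryExecution({
        QueryString: summaryQuery,
        QueryExecutionContext: { Database: "analytics" },        // assumed DB name
        ResultConfiguration: {
            OutputLocation: "s3://my-dw-bucket/hourly_summary/", // assumed bucket
        },
    }).promise();
});
```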
https://github.com/EvanBoyle/pulumi-serverless-db/pull/8/files#diff-f95bdcb0f919d600e736e8e9da74022dR93
Some examples for Fargate, which involve building a Docker image: https://github.com/pulumi/examples/tree/master/aws-ts-hello-fargate
https://www.pulumi.com/blog/get-started-with-docker-on-aws-fargate-using-pulumi/
https://www.pulumi.com/docs/tutorials/aws/ecs-fargate/
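Building on those examples, the eventual Fargate version might look roughly like the following. This uses the classic `@pulumi/awsx` API; exact option names vary between versions, and `./batch-task` is an assumed directory containing a Dockerfile with the user's query-and-upload code.

```typescript
import * as aws from "@pulumi/aws";
import * as awsx from "@pulumi/awsx";

const cluster = new awsx.ecs.Cluster("batch-cluster");

// Task definition whose container image is built from a local Dockerfile.
// The image would contain the user's query-and-write-to-S3 code.
const batchTask = new awsx.ecs.FargateTaskDefinition("hourly-summary", {
    container: {
        image: awsx.ecs.Image.fromPath("hourly-summary", "./batch-task"), // assumed path
        memory: 4096, // MiB; far more headroom available than Lambda allows
        cpu: 1024,
    },
});

// Fire the task once an hour from a CloudWatch schedule.
aws.cloudwatch.onSchedule("runHourlySummary", "rate(1 hour)", async () => {
    await batchTask.run({ cluster });
});
```

Because the work runs in a container, the memory and CPU settings can scale well past Lambda's limits, which is the main motivation for moving to Fargate above.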