Here's some examples of how you can configure your Lambda so that it scales appropriately given certain constraints.
By default, AWS Lambda can scale up to 1000 concurrent executions, which is the default concurrency limit provided by AWS. This limit applies to the sum of all invocations of all functions in the region and can be increased if necessary.
AWS Lambda scales up 60 additional concurrent invocations per minute providing no errors are returned by the function.
Now, let's produce 1000 SNS events to be consumed by a Lambda with default concurrency limitations.
Producer -> SNS -> Consumer
Metric Name | Metric Value |
---|---|
Total Events | 1000 |
Events Completed | 1000 |
Events Failed | 0 |
Time Taken | 60 seconds |
Max. Concurrency | 89 |
Avg. DB Inserts Per Second | 16.6 |
Max. DB Inserts Per Second | 41 |
Min. DB Inserts Per Second | 1 |
You have an event driven system that generates large spikes of concurrent events and now you need to use those events to push data to an external service provider (e.g Intercom, Salesforce), however, they have strict API limitations.
Imagine you're limited to 1 request per second, but throughout the day you have spikes of 50 events per second. How do you ensure that you don't exceed the APIs limits?
What if we limit the number of concurrent invocations on the lambda to 1?
Producer -> SNS -> Consumer
- Lambda
- Timeout -> 20 seconds
- Reserved concurrency -> 1
Metric Name | Metric Value |
---|---|
Total Events | 1000 |
Events Completed | 1000 |
Events Failed | 0 |
Time Taken | 1905 seconds |
Max. Concurrency | 1 |
Avg. DB Inserts Per Second | 1 |
Max. DB Inserts Per Second | 1 |
Min. DB Inserts Per Second | 1 |
Pros:
- SNS will retry a message over 100,000 times.
- No SQS Queue.
Cons:
- High number of wasted throttled attempts, costing money.
- Message order is not guaranteed.
- No visibility into "in flight" messages, hard to see the load and pressure.
The results indicate this is a reasonable approach. Considering SNS will retry 100,000 times, in many cases this is reliable enough.
That said, this approach feels very inefficient, it's like trying to force data into the consumer regardless if it's busy or not.
Let's take what we've learned from scenario #2 to see if we can improve upon it's cons.
SNS supports many subscription types and SQS is one of them. We will now attempt to push the events from SNS to SQS to improve visibility and also, to benefit from it's batching capabilities.
- Lambda
- Timeout -> 10 seconds
- Reserved concurrency -> 1
- SQS
- Visibility timeout -> 10 seconds
- Batch size -> 1 record
Metric Name | Metric Value |
---|---|
Total Events | 1000 |
Events Completed | 895 |
Events Failed | 105 |
Time Taken | 1254 seconds |
Max. Concurrency | 1 |
Avg. DB Inserts Per Second | 1 |
Max. DB Inserts Per Second | 1 |
Min. DB Inserts Per Second | 1 |
- Lambda
- Timeout -> 10 seconds
- Reserved concurrency -> 1
- SQS
- Visibility timeout -> 10 seconds
- Batch size -> 1 record
Metric Name | Metric Value |
---|---|
Total Events | 1000 |
Events Completed | 1000 |
Events Failed | 1000 |
Time Taken | 1450 seconds |
Max. Concurrency | 1 |
Avg. DB Inserts Per Second | 1 |
Max. DB Inserts Per Second | 1 |
Min. DB Inserts Per Second | 1 |
Producer -> SNS -> SQS -> Consumer
Metric Name | Metric Value |
---|---|
Total Events | 1000 |
Events Completed | 1000 |
Events Failed | 1000 |
Time Taken | 261 seconds |
Max. Concurrency | 1 |
Avg. DB Inserts Per Second | 3.759 |
Max. DB Inserts Per Second | 5 |
Min. DB Inserts Per Second | 1 |
You have an event driven system that generates large spikes of concurrent events and now you need to use those events to push data to external service provider (e.g Intercom, Salesforce), however, who have strict API limitations.
Imagine you're limited to 1 request per second, but throughout the day you have spikes of 50 events per second. How do you ensure that you don't exceed the APIs limits?
Sometimes the throttle will cause messages to fail. We set a long visibilty time.... They will end up in the dead letter queue
Events will appear in any Order. FiFo will not help. SNS does not guarentee order ()
You can run this demo in your AWS account. It will cost less than $1.
Create a Postgres instance and add the connection configurations in the Consumer Lambda.
Create the following table:
-- Table representing our "External Service".
-- We will use this to see if we've exceeded our concurrency limits.
create table invocations
(
-- Order in which the message was published to SNS
id integer,
-- What time was the event created by the producer
sent_at timestamp,
-- //
started_at timestamp,
-- Which consumer created the inserted the record
group_id text,
-- Random string representing the consumers handler invocation.
-- Every time the handler is invoked, a new id is generated.
-- Useful to ensure the handler is correctly batching messages from SQS.
-- E.g 1000 / 10 (batch size). We should see 100 unique batch_id's
batch_id text,
-- Random string representing the Lambda instance.
-- Every time a new Lambda is created, a new id is generated.
-- Useful to see how many new lambda instance are required to process the batch.
-- E.g if the throttle is working as expected, we will see one unique cache id.
cache_id text,
-- What time and date was the record inserted.
created_at timestamp default now()
);
Pro Tip: If you don't fancy provisioning an RDS instance, you can host the Postgres instance on your local machine and use ngrok.
docker run -d -e POSTGRES_PASSWORD=mysecretpassword -p 5432:5432 postgres
ngrok tcp 5432
You can now connect to you local Postgres instance like this:
const client = new Client({
user: 'postgres',
database: 'postgres',
password: 'mysecretpassword',
port: 13891,
host: '0.tcp.ngrok.io'
});
Deploying to your account AWS account
- run
cd ./cdk
- Use your own AWS config here
- run
npm run build
- run
npm run diff
- run
npm run deploy
Trigger the producer lambda using a test event in the AWS lambda console:
Visualise the resutls in Grafana
docker run -d -p 3000:3000 grafana/grafana
login: User: admin Password: admin
Add Postgres
TODO: Kinesis example -- If the kinesis invocation fails put message into SQS DL Queue --> Generic SNS => KINESIS? --> SNS publish can publish to Kinesis Directly
Move "Parse event" to typescript with tests?
How do you guarantee the oder of events? SNS event ordering is random..
Look for events that take a long time to process when with SQS and reserved concurrency
Check for duplicate event id's per group
Push DL messages into the DB with a unique group