Benchmarking Lambda Cold Starts
This lab references the code in the aws-connectedcar-dotnet-serverless repository. If you're new to this course, see the introduction for information about setting up your workstation and getting the sample code.
Even when your use cases otherwise fit the capabilities of Lambdas, if you’re using language platforms like .Net or Java then you might still be deterred from using them because of the cold start problem. To come to grips with this issue, in this lab we’re going to benchmark a series of scenarios. Our goal is to pick apart what causes cold starts on these language platforms, and based on that, outline what practical measures you can take to minimize them.
Deploying Mock Lambda Code
We’re going to start our test scenarios by deploying a mock “MockDealer” Lambda that’s hard-coded to return a fake dealer response without calling DynamoDB. Our reason for doing this will become clear shortly.
Step 1: Delete the existing OpenAPI deployment from the console
It’s going to be slightly easier to deploy the mock Lambda for this lab if we work with the SAM version of the sample code. So if you still have the OpenAPI deployment running in AWS, go now to CloudFormation in the console and delete the parent stack. Don’t move on to the next step until the parent and all its child stacks have been removed, as shown below:
Step 2: Add a new MockDealer Lambda in VS Code
Next, open the AdminFunctions.cs class in VS Code, and add a new MockDealer event handler method, using this code:
public async Task<APIGatewayProxyResponse> MockDealer(APIGatewayProxyRequest request, ILambdaContext context)
{
    return await Process(async () =>
    {
        var dealer = new Dealer
        {
            DealerId = Guid.NewGuid().ToString(),
            Name = "Fake Dealer",
            Address = new Core.Shared.Data.Attributes.Address
            {
                StreetAddress = "Fake Street Address",
                City = "Fake City",
                State = "WA",
                ZipCode = "00000"
            },
            StateCode = StateCodeEnum.WA,
            CreateDateTime = DateTime.Now,
            UpdateDateTime = DateTime.Now
        };
        return new APIGatewayProxyResponse
        {
            StatusCode = (int)HttpStatusCode.OK,
            Body = SerializeItem(dealer),
            Headers = ContentResponseHeaders
        };
    }, context);
}
The added code should look like this in the AdminFunctions.cs class:
Next, you need to add a corresponding resource definition for this Lambda in the admin.yaml template, using this code:
MockDealer:
  Type: 'AWS::Serverless::Function'
  Properties:
    FunctionName: !Sub '${ServiceName}_Admin_MockDealer_${EnvironmentName}'
    Handler: ConnectedCar.Lambda::ConnectedCar.Lambda.AdminFunctions::MockDealer
    Description: Function to retrieve a mock dealer
    Role: !Ref LambdaExecutionRoleArn
    Events:
      ApiEvent:
        Type: Api
        Properties:
          Path: '/admin/dealers/mock'
          Method: GET
          RestApiId: !Ref AdminAPI
    AutoPublishAlias: !Ref StageName
This added resource should look like this in the admin.yaml template for the SAM version of the code:
Step 3: Deploy the updated code from the terminal
Next, run through the steps to deploy the SAM version of the sample code. This includes setting the workspacePath and bucket variables in the config.zsh script, and then running the clean.zsh, build.zsh, and deploy.zsh scripts in sequence. As always, the deployment is complete when the stack outputs are successfully displayed in the terminal, like you see below:
Testing Lambda Cold Starts with Mock Code
Step 4: Update the global variables in Postman
As you’ve done previously after a deployment, use the query-outputs.zsh and query-attributes.zsh scripts to get the adminApi and apiKey values you need for the global variables in Postman. Set both the initial and current value columns for these two global variables:
Step 5: Send two requests in sequence for the new MockDealer Lambda from Postman
First, duplicate the “Get Dealer” test in the “Dealers” folder of the “Admin_API” collection. Rename the duplicated test to “Mock Dealer” and replace the “{{dealerId}}” variable at the end of the URL path with the literal “mock”, as highlighted below:
Then, send two requests for this test in sequence, the first of which will, of course, be a cold start for the newly deployed Lambda. This is illustrated in the examples shown below, where the first request took 1114ms, as measured in Postman:
The second request, sent moments later, took only 58ms:
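If you’d rather measure these response times outside Postman, a small script can run the same two-request comparison. This is an illustrative sketch, not part of the sample code: the URL and API key in the commented-out usage are placeholders you’d substitute from your own stack outputs, and the `get_status` helper is hypothetical.

```python
import time
import urllib.request

def timed_call(fn, *args, **kwargs):
    """Run fn and return (result, elapsed time in milliseconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed_ms = (time.perf_counter() - start) * 1000
    return result, elapsed_ms

def get_status(url, api_key):
    """GET the URL with the x-api-key header and return the HTTP status."""
    req = urllib.request.Request(url, headers={"x-api-key": api_key})
    with urllib.request.urlopen(req) as resp:
        return resp.status

# Substitute your own adminApi URL and apiKey values, then send two
# requests back to back -- the first will be the cold start:
# for _ in range(2):
#     print(timed_call(get_status,
#                      "https://<adminApi-host>/Dev/admin/dealers/mock",
#                      "<apiKey>"))
```

Client-side timings like these include network latency on top of what the Lambda itself took, which is why the X-Ray traces in the next step are the better tool for breaking the numbers down.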
Step 6: Review the X-Ray traces for the two Mock Dealer requests
Here’s where we start to unpack the underlying causes for Lambda cold starts. Go to the CloudWatch service in the console, and open the trace for the first of our two requests. You should see something like what’s shown below:
In this trace, you can see a delay of about 400ms between API Gateway’s initial call to invoke the Lambda and the start of the Lambda’s initialization. This initial delay is the time it takes for the service to provision an available VM for the Lambda. The initialization step, which takes about another 400ms, is the time it takes to launch the .Net runtime on this VM. Lastly, the invocation step within the Lambda shows how long the service code took to process the request, which, being the first pass, includes profiling time for the .Net JIT compiler.
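Putting rough numbers on these phases, the total latency observed in Postman decomposes as shown below. The figures are approximations read off the example trace; your own numbers will differ.

```python
# Approximate cold-start phase durations (ms) from the example trace.
total_observed = 1114   # first request, as measured in Postman
provisioning = 400      # delay before initialization begins (VM provisioning)
runtime_init = 400      # launching the .Net runtime on the VM

# What remains is roughly the handler's invocation time,
# including JIT profiling on this first pass:
invocation = total_observed - provisioning - runtime_init
print(invocation)  # -> 314
```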
Open the trace for the second request, which should look like what’s shown here:
This time, there’s only a small delay between API Gateway invoking the Lambda and the invocation step within the Lambda. Since this is the second request, the VM for the Lambda is already provisioned, the .Net runtime is initialized, and the code has been JIT compiled. As a result, the Lambda processes the request much faster.
Testing Lambda Cold Starts with DynamoDB Code
Step 7: Send a Create Dealer request to generate sample data
Now we’re going to find out how much longer cold starts are for requests that include calls to DynamoDB. We’ll be using the “Get Dealers” test in Postman for this, so before anything else, run the “Create Dealer” test to generate some sample data. Once you’ve sent a request for this test you should, of course, see a “201 Created” response in Postman:
Step 8: Send two requests for the Get Dealers Lambda
Now, send two consecutive “Get Dealers” requests from Postman. The first request will be a cold start, which will yield a slow response time, as shown below:
The second request should produce a much faster response, as shown here:
Step 9: Review the X-Ray traces for the two Get Dealers requests
Once more, let’s dig into the X-Ray traces for these two requests. Your traces for the first request should look something like the example shown below:
In this trace, you can see that the delay prior to the initialization step, and the initialization step itself, are in line with what we saw in the previous cold start: both again took about 400ms. Now, however, the processing time includes the “DescribeTable” and “Scan: ConnectedCar_Dealer_Table_Dev” DynamoDB operations. Beyond the time these operations took, there was also a significantly longer delay before any calls to DynamoDB occurred: about twice the roughly 200ms delay we saw in the previous cold start. This is probably attributable to the depth of the execution path into the DynamoDB client libraries, which have to be profiled and JIT compiled, as well as to the client connection setup for DynamoDB.
Moving on, here’s an example trace for the second request:
There’s nothing really surprising about this second trace. The delay after API Gateway invokes the service is only about 15ms, and only a few more milliseconds pass before the scan operation starts in DynamoDB. As in our first test, with this request the Lambda is already provisioned and initialized, and the code is profiled and JIT compiled. In this case the “DescribeTable” operation is also skipped, because the table metadata has been cached in the DynamoDB client context.
Summarizing the Results
Let’s summarize what we’ve learned so far. First, real-world Lambdas that make calls to downstream services like DynamoDB have worse cold starts than mock Lambdas, and many comparisons of cold starts across language platforms fail to benchmark real-world code like this. Second, cold starts are not just about the challenge of provisioning compute on the fly; they’re also about how VM-based compiled language platforms are not optimized for the initial execution of code. In fact, the reverse is true, since profiling and JIT compilation speed up subsequent processing at the expense of initial performance.
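One practical measure that follows from these results is provisioned concurrency, which keeps initialized execution environments warm so that requests skip the provisioning and initialization phases entirely. As a sketch only (not something this lab deploys), the MockDealer resource could be extended like this in admin.yaml; the property requires the AutoPublishAlias that the template already sets, and the concurrency value here is an arbitrary example:

```yaml
MockDealer:
  Type: 'AWS::Serverless::Function'
  Properties:
    # ...existing properties as shown earlier...
    AutoPublishAlias: !Ref StageName
    ProvisionedConcurrencyConfig:
      ProvisionedConcurrentExecutions: 1  # example value; billed while allocated
```

Note that provisioned concurrency incurs an ongoing charge for each warm environment, so it trades cost for latency rather than eliminating the underlying JIT and initialization work.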