Profiling Services with X-Ray Traces
The previous lab showed how you can uncover the root cause of a problem by searching the CloudWatch logs. The trick to searching those logs, however, is knowing which service to focus on. If all you have to start with is an API-level error, X-Ray traces can help you zero in on the downstream service that’s the source of the problem. In this lab we’ll demonstrate how that’s done, using generated trace data for both successful and unsuccessful service calls.
Generating Sample Trace Data
Step 1: Send a “Create Timeslot” request from Postman
To keep things interesting, let’s switch to a different set of tests in the “Admin_API” collection in Postman. Open the “Timeslots” folder, select the “Create Timeslot” test, and send a request. You should see a “201 Created” result, as shown below:
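If you’d rather reproduce this call outside Postman, a rough Python equivalent might look like the sketch below. The base URL, API key header, and body fields are placeholders rather than values from the lab’s collection, so substitute the ones from your own Postman environment.

```python
# Hypothetical equivalent of the Postman "Create Timeslot" request.
# BASE_URL, HEADERS, and the payload fields are placeholders -- take the
# real values from your Postman environment and collection.
import requests

BASE_URL = "https://<api-id>.execute-api.<region>.amazonaws.com/api"  # placeholder
HEADERS = {"x-api-key": "<your-api-key>"}                             # placeholder

payload = {
    "slotDate": "2024-06-01",  # example field names only
    "slotTime": "09:00",
}

response = requests.post(f"{BASE_URL}/timeslots", json=payload, headers=HEADERS)
print(response.status_code)  # expect 201 Created
```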
Step 2: Send a valid “Get Timeslots” request from Postman
Now, select the “Get Timeslots” test and send a request. You should get a “200 OK” response along with a response body that includes JSON for the timeslot item just created:
Step 3: Send an invalid “Get Timeslots” request from Postman
Next, delete the query parameters from the URL for this test and send the request once more. You should now see a “400 Bad Request” error, as below:
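For reference, here is a hedged Python sketch of both “Get Timeslots” calls: the valid one from Step 2 and the invalid one from this step. As before, the base URL, header, and query parameter names are placeholders; use the values from your own collection.

```python
# Hypothetical equivalents of the two "Get Timeslots" requests: one with
# query parameters (200 OK) and one without (400 Bad Request).
import requests

BASE_URL = "https://<api-id>.execute-api.<region>.amazonaws.com/api"  # placeholder
HEADERS = {"x-api-key": "<your-api-key>"}                             # placeholder

# Valid request: query parameters included, as in Step 2.
ok = requests.get(
    f"{BASE_URL}/timeslots",
    params={"startDate": "2024-06-01", "endDate": "2024-06-30"},  # placeholder names
    headers=HEADERS,
)
print(ok.status_code)   # expect 200 OK
print(ok.json())        # should include the timeslot created earlier

# Invalid request: query parameters removed, as in Step 3.
bad = requests.get(f"{BASE_URL}/timeslots", headers=HEADERS)
print(bad.status_code)  # expect 400 Bad Request
```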
Viewing Traces in the Console
Now, let’s go into X-Ray in the console to look at the traces for these last two API requests.
Step 4: Query for traces by URL in the console
If the only traffic in your AWS account is your work for this lab, there won’t be many traces to search through. In a busy environment, though, there will be many, so it’s worth showing how you can filter the list of traces. Navigate to the “Traces” page under the “X-Ray traces” menu in CloudWatch. Open the “Query refiners” panel, select “URL” from the “Refine query by” drop-down, and then type “timeslots” in the search field below it. You should see a URL result in the list immediately below the search field, like this:
Scroll down the page, and you should see individual traces, at the bottom of the page, like this:
What you see above are three API requests. The first, at the very bottom, is the POST that was sent to the endpoint to create the timeslot item. The next trace, one up from the bottom, is the successful GET that was initially sent to the endpoint, while the most recent trace, at the top of the list, is the failed GET request that resulted when we removed the query parameters.
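The same URL-based search can also be scripted against the X-Ray API. The sketch below is a minimal example using boto3’s get_trace_summaries with a filter expression equivalent to the console query; it assumes your AWS credentials and region are already configured, and the one-hour look-back window is an arbitrary choice for illustration.

```python
# Minimal sketch: list recent traces whose URL contains "timeslots",
# mirroring the "Query refiners" search in the console.
from datetime import datetime, timedelta, timezone

import boto3

xray = boto3.client("xray")
end = datetime.now(timezone.utc)
start = end - timedelta(hours=1)  # arbitrary look-back window

paginator = xray.get_paginator("get_trace_summaries")
for page in paginator.paginate(
    StartTime=start,
    EndTime=end,
    FilterExpression='http.url CONTAINS "timeslots"',
):
    for summary in page["TraceSummaries"]:
        # Print the trace ID and the HTTP status it ended with.
        print(summary["Id"], summary.get("Http", {}).get("HttpStatus"))
```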
Step 5: View the successful GET trace in the console
Click on the middle trace in the list, then scroll down to view the “Segments Timeline” as shown below:
Let’s step through what’s shown here. Each node that’s expanded on the left represents a service component that’s part of the flow of execution for the request. In this case, the execution begins with the “ConnectedCar_Admin_API_Dev/api” step, which is the API Gateway stage that receives the REST API request from Postman. Each service component in the trace, represented by the nodes, lists below it the internal calls that it makes, with the call chain indicated by the indentation.
So, the API Gateway stage calls the Lambda service, as shown in the first node. The next node shows the Lambda service invoking the specific Lambda function, whose own calls are displayed in the third node. As you can see from this trace, there is some repetition in what gets displayed: the call to DynamoDB to describe the table, for example, appears under the Lambda node and again under the DynamoDB node that follows it.
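If you want to inspect the same structure programmatically, something like the boto3 sketch below will pull the raw segment documents for a given trace ID (the ID shown in the console’s trace list; the one here is a placeholder). Each returned document corresponds to a node in the Segments Timeline, and its nested subsegments are the indented downstream calls.

```python
# Minimal sketch: fetch the segment documents behind one trace.
import json

import boto3

xray = boto3.client("xray")
trace_id = "1-5f84c7a1-0123456789abcdef01234567"  # placeholder; copy a real ID

result = xray.batch_get_traces(TraceIds=[trace_id])
for trace in result["Traces"]:
    for segment in trace["Segments"]:
        doc = json.loads(segment["Document"])
        # Each document is one node in the Segments Timeline; nested
        # "subsegments" entries are the indented calls beneath it.
        print(doc.get("name"), doc.get("origin"), doc.get("start_time"))
```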
Step 6: View the unsuccessful GET trace in the console
You’ll see the benefit of these traces more clearly when you open the failed GET trace. This trace should look something like this:
At the top of this trace you can see the API Gateway stage, which returned an HTTP 400 Bad Request status, as we saw in Postman. Working your way down the trace, you can see that execution flowed successfully to the Lambda function but went no further: no calls were made from the Lambda to DynamoDB. Note that, from the trace’s point of view, there is no error emanating from the Lambda, because the function caught an exception and returned an HTTP response normally. It’s the absence of any DynamoDB call that lets us infer an exception occurred.
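To make that inference concrete, here is a purely hypothetical sketch of how a handler like this one might behave. It is not the lab’s actual Lambda code, and the table name and parameter names are invented; the point is the shape of the logic. The validation error is caught and converted into a 400 response before any DynamoDB call happens, which is exactly why the trace shows a clean Lambda node and no DynamoDB subsegment.

```python
# Hypothetical handler shape only -- not the lab's real code.
import json

import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("ConnectedCar_Timeslots_Dev")  # placeholder table name


def handler(event, context):
    params = event.get("queryStringParameters") or {}
    try:
        # Placeholder parameter names; a KeyError here means they were missing.
        start_date, end_date = params["startDate"], params["endDate"]
    except KeyError as err:
        # The exception is handled here, so X-Ray records a normal Lambda
        # return (no fault) and no DynamoDB subsegment for this request.
        return {
            "statusCode": 400,
            "body": json.dumps({"message": f"Missing query parameter: {err}"}),
        }

    # Only reached for valid requests -- this is the DynamoDB call that is
    # absent from the failed trace. (Simplified to a scan; a real handler
    # would query between start_date and end_date.)
    result = table.scan()
    return {"statusCode": 200, "body": json.dumps(result["Items"], default=str)}
```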
This trace has provided enough diagnostic information to isolate the problem to the Lambda. Armed with this information, you can now search the associated logs to find the root cause. The other valuable information in these traces is the duration of each call in the execution flow. We’ll find uses for this feature in the sections that follow.
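As a small preview, those durations can also be queried directly: the same get_trace_summaries call used earlier accepts timing keywords in its filter expression. The sketch below, for example, would surface only timeslot traces whose response time exceeded one second; the one-hour window and one-second threshold are arbitrary values chosen for illustration.

```python
# Minimal sketch: find slow "timeslots" traces using a timing filter expression.
from datetime import datetime, timedelta, timezone

import boto3

xray = boto3.client("xray")
end = datetime.now(timezone.utc)
start = end - timedelta(hours=1)  # arbitrary look-back window

slow = xray.get_trace_summaries(
    StartTime=start,
    EndTime=end,
    FilterExpression='http.url CONTAINS "timeslots" AND responsetime > 1',
)
for summary in slow["TraceSummaries"]:
    print(summary["Id"], summary["ResponseTime"])
```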