Orchestrating and tracking complex Microservices in Amazon Web Services

Published

10.3.2019

This article and tutorial will help you build a working solution that covers AWS Lambda, CloudWatch, Step Functions and X-Ray. It covers a wide range of subjects at a surface level to help you get everything working together. The intent is to help you think through your options when architecting your Microservices, so that they are less complex and easier to track and debug.

The 30,000 Foot View: What is the problem we are trying to solve?

To better understand the problem we are trying to solve, let’s take a look at how quickly the complexity of Microservices gets out of hand.

Month 1: A few Microservices

Month 2: More Microservices, that are interrelated

Month 3: The number of Microservices and their dependencies quickly getting out of hand

As you can see, with so many execution paths, many of which run in parallel, it is difficult to know what is happening within your system. Furthermore, when an error occurs, tracing it back to where and why it occurred can be cumbersome. If you use a notification system (like SNS) to communicate between your Microservices, a workflow may stall because one service acted differently than you expected. Understanding which one and why is critical to keeping your system running reliably.

Let’s take for example a simple system where:

A user uploads an image to S3
The S3 bucket sends a notification saying “new file uploaded”
A Microservice receives the notification and adds a watermark to the image, placing it in a new S3 bucket viewable on your website

If the user is waiting for the file to show up on the website and it doesn’t, there is a long list of things to check: Did the S3 bucket send the notification? Was there an error in the Microservice? Is the Microservice still running? Was the image placed in the correct S3 bucket after processing?

Now imagine this in a complex system where events are happening in parallel with many preconditions and post actions.

While building out your Microservices architecture, it’s important to consider how you are going to debug, track and orchestrate complex scenarios.

To set up the infrastructure to solve this problem, I will walk you through:

Setting up a couple of simple Lambda functions
Logging and viewing the logs in CloudWatch
Organizing your Lambdas into Step Functions
Instrumentation and tracing using X-Ray

Lambda

Let’s first set up an IAM role and some simple Lambda functions so we have something to work with. The IAM role allows us to give the Lambda functions access to CloudWatch.

AWS Lambda lets you run code without provisioning or managing servers. You pay only for the compute time you consume — there is no charge when your code is not running. — AWS Lambda

Create an IAM role for your Lambda API

Sign into the AWS Console and go to IAM -> Roles -> Create Role
Select AWS service, choose Lambda as the service you want to create a role for, and click next
Search for and select CloudWatchLogsFullAccess, then click next to continue

Name your role cloud-watch-full-access and click create role
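For reference, choosing Lambda as the service in the steps above attaches a trust policy to the role that lets Lambda assume it. The sketch below builds what that policy document typically looks like as a plain JavaScript object and prints it as JSON. The variable names are just for illustration; compare the output with the Trust relationships tab of your role rather than treating it as authoritative.

```javascript
// Typical trust policy for a role that Lambda functions can assume.
const trustPolicy = {
    Version: "2012-10-17",
    Statement: [{
        Effect: "Allow",
        Principal: { Service: "lambda.amazonaws.com" },
        Action: "sts:AssumeRole"
    }]
};

// Print it in the JSON form that IAM displays
const policyJson = JSON.stringify(trustPolicy, null, 2);
console.log(policyJson);
```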

Create the first Lambda function

Go back to the AWS console and then to Lambda -> Create function
Make sure author from scratch is selected
Name your function fast-lambda, use Node.js 6.10 as your runtime and choose the role we just created under existing role

Finally click Create function

Add some code to your Lambda

Add the below code to your Lambda in the index.js file

exports.handler = (event, context, callback) => {
    console.log("fast lambda started executing");
    
    // set timeout is used to make the lambda wait a set 
    // amount of time before returning
    setTimeout(function () {
        console.log("fast lambda done executing");
        callback(null, 'Done');
    }, 200); // waiting 200ms
};

Click save in the top right corner
Click Test
Enter the event name “testing” and then Create using the defaults (the inputs will be ignored)
Click Test again; you should see the below execution results

Looking at the result there are a couple of things worth noting

Your Lambda executed successfully
It took over 200ms (337.61ms in my case); this was because of the 200ms set timeout we added to the code.
The log output shows the two console.log statements we added
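If you want to check the handler before deploying it, you can also call it directly from Node.js on your own machine. The sketch below is one way to do that: it repeats the fast-lambda handler so it is self-contained, and invokeLocally is a hypothetical helper, not part of any AWS tooling. It passes an empty object as the context, which is enough here because the handler ignores it.

```javascript
// Copy of the fast-lambda handler, repeated so this sketch runs on its own
const handler = (event, context, callback) => {
    console.log("fast lambda started executing");
    setTimeout(function () {
        console.log("fast lambda done executing");
        callback(null, 'Done');
    }, 200); // waiting 200ms
};

// Hypothetical helper: call a callback-style handler and time it
function invokeLocally(fn, event) {
    return new Promise(function (resolve, reject) {
        const started = Date.now();
        fn(event, {}, function (error, result) {
            if (error) {
                reject(error);
                return;
            }
            resolve({ result: result, durationMs: Date.now() - started });
        });
    });
}

const run = invokeLocally(handler, {});
run.then(function (outcome) {
    console.log("result:", outcome.result, "duration:", outcome.durationMs + "ms");
});
```

Locally the duration should land close to 200ms. The longer duration reported in the console presumably includes overhead outside of the timer itself.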

Create another Lambda that’s slow

So that we can use it later to create Step Functions, go back and create another Lambda. Call this one slow-lambda and set the setTimeout delay in the code to 1500ms. You can use the below code.

exports.handler = (event, context, callback) => {
    console.log("slow lambda started executing");
    
    // set timeout is used to make the lambda wait a set 
    // amount of time before returning
    setTimeout(function () {
        console.log("slow lambda done executing");
        callback(null, 'Done');
    }, 1500); // waiting 1500ms
};

Your test response should look something like this, with a duration of over 1500ms

Create one last Lambda that fails by default (optional)

Create a new Lambda called fail-lambda using the same configuration as before but with the below code

exports.handler = (event, context, callback) => {
    var error = new Error("something went wrong");
    callback(error);
};

Your execution result should fail like below

CloudWatch

Amazon CloudWatch is a monitoring service for AWS cloud resources and the applications you run on AWS. You can use Amazon CloudWatch to collect and track metrics, collect and monitor log files, set alarms, and automatically react to changes in your AWS resources. Amazon CloudWatch can monitor AWS resources such as Amazon EC2 instances, Amazon DynamoDB tables, and Amazon RDS DB instances, as well as custom metrics generated by your applications and services, and any log files your applications generate. You can use Amazon CloudWatch to gain system-wide visibility into resource utilization, application performance, and operational health. You can use these insights to react and keep your application running smoothly. — Amazon CloudWatch

Now that we have created and run some Lambda functions, let’s see what their logs look like in CloudWatch

Go back to the AWS console and open CloudWatch -> Logs
You should now see a log group for each Lambda you have run: one for fast-lambda, one for slow-lambda and, if you created it, one for fail-lambda
Open up the fast-lambda log group (/aws/lambda/fast-lambda)
Click the latest log stream
You should see something like this, including the text we logged (fast lambda started executing…)

Step Functions vs Simple Notification Service (SNS)

Before we work with Step Functions, note that you should think through the needs of your application to decide whether Step Functions is the right option for your use case.

Step Functions: If your use case involves “do this, then that” or “if this, then that” functionality (in other words, workflows), then Step Functions is a good option. It also handles retries and basic logic. It is generally considered bad practice to have direct references between Microservices, so Step Functions should be used sparingly, in cases where you are genuinely implementing a workflow.

Simple Notification Service (SNS): Also triggers Lambdas, but it has no workflow-level retry handling or basic logic built in. Use it if you want your Lambdas to react to events happening in other Lambdas.

For this tutorial we are going to focus on Step Functions.
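To make the "retries and basic logic" point concrete, here is a rough sketch of a state machine definition that retries a task and then branches on its result. It is written as a JavaScript object and printed as JSON, so you can paste the output into the Step Functions editor. The state names CheckResult, Finished and Unexpected are hypothetical, the retry numbers are arbitrary, and the arn... value is left elided just like in the rest of this tutorial.

```javascript
// Task state with a retry policy ("handles retries")
const retryingTask = {
    Type: "Task",
    Resource: "arn...fast-lambda",
    Retry: [{
        ErrorEquals: ["States.TaskFailed"],
        IntervalSeconds: 2,  // wait before the first retry
        MaxAttempts: 3,
        BackoffRate: 2.0     // double the wait on each further attempt
    }],
    Next: "CheckResult"
};

// Choice state ("if this, then that"): branch on the Lambda's output
const checkResult = {
    Type: "Choice",
    Choices: [{
        Variable: "$",
        StringEquals: "Done",
        Next: "Finished"
    }],
    Default: "Unexpected"
};

const definition = {
    StartAt: "FastLambda",
    States: {
        FastLambda: retryingTask,
        CheckResult: checkResult,
        Finished: { Type: "Succeed" },
        Unexpected: { Type: "Fail", Error: "UnexpectedResult", Cause: "fast-lambda did not return Done" }
    }
};

console.log(JSON.stringify(definition, null, 3));
```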

Step Functions

AWS Step Functions makes it easy to coordinate the components of distributed applications and microservices using visual workflows. Building applications from individual components that each perform a discrete function lets you scale and change applications quickly. Step Functions is a reliable way to coordinate components and step through the functions of your application. Step Functions provides a graphical console to arrange and visualize the components of your application as a series of steps. — AWS Step Functions

This is where it gets interesting: Step Functions will help you orchestrate your Lambdas and debug any issues that arise.

Run a successful Step Function

From the AWS console open Step Functions
Click create a state machine
Make sure Author from scratch is selected and name your Step Function complex-state-machine
Use the below code to set it up. Make sure you replace each arn… with the ARN of your fast or slow Lambda; if you click in the Resource field, it will give you a choice of your ARNs.

{
   "StartAt":"First",
   "States":{
      "First":{
         "Type":"Parallel",
         "Next":"Done",
         "Branches":[
            {
               "StartAt":"FastLambda",
               "States":{
                  "FastLambda":{
                     "Type":"Task",
                     "Resource":"arn...fast-lambda",
                     "End":true
                  }
               }
            },
            {
               "StartAt":"SlowLambda",
               "States":{
                  "SlowLambda":{
                     "Type":"Task",
                     "Resource":"arn...slow-lambda",
                     "End":true
                  }
               }
            }
         ]
      },
      "Done":{
         "Type":"Pass",
         "End":true
      }
   }
}

What this configuration does is create a Step Function that runs two Lambdas (your fast and slow Lambdas) in parallel.
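As a rough local analogy (this is not how Step Functions is implemented), a Parallel state behaves a lot like Promise.all: every branch starts together, the state waits for all of them, and its output is an array with one entry per branch in branch order. The sketch below uses a hypothetical fakeLambda stand-in with the same delays as our fast and slow Lambdas.

```javascript
// Stand-in for a Lambda: resolves with "Done" after delayMs
function fakeLambda(name, delayMs) {
    return new Promise(function (resolve) {
        setTimeout(function () {
            console.log(name + " done executing");
            resolve("Done");
        }, delayMs);
    });
}

const started = Date.now();

// Both branches start together; the result is ready when the slowest one finishes
const execution = Promise.all([
    fakeLambda("fast-lambda", 200),
    fakeLambda("slow-lambda", 1500)
]).then(function (results) {
    return { results: results, elapsedMs: Date.now() - started };
});

execution.then(function (outcome) {
    console.log(outcome.results, outcome.elapsedMs + "ms");
});
```

The elapsed time should be roughly that of the slow branch rather than the sum of both, which is the benefit of running the two Lambdas in parallel.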

Click create state machine
Click New execution and then click start execution (the inputs will be ignored)
After a few seconds you should see this:

Play around with the Execution details section to see results of the Step Function

Run a failing Step Function

Click complex-state-machine in the page’s breadcrumbs at the top
Click State machine details and then Copy to new
Name the state machine failing-complex-state-machine
Update the code to use the fail-lambda instead of the slow-lambda this time (remember to change the ARNs, using the fail-lambda as the 2nd one)

{
   "StartAt":"First",
   "States":{
      "First":{
         "Type":"Parallel",
         "Next":"Done",
         "Branches":[
            {
               "StartAt":"fast-lambda",
               "States":{
                  "fast-lambda":{
                     "Type":"Task",
                     "Resource":"arn...fast-lambda",
                     "End":true
                  }
               }
            },
            {
               "StartAt":"fail-lambda",
               "States":{
                  "fail-lambda":{
                     "Type":"Task",
                     "Resource":"arn...fail-lambda",
                     "End":true
                  }
               }
            }
         ]
      },
      "Done":{
         "Type":"Pass",
         "End":true
      }
   }
}

Click New execution and start execution
You should see something like this:

Click the output tab under Execution details, and you should see your error message

As you can see, Step Functions is an easy-to-use service that helps you orchestrate and organize Lambdas, giving you visual insight and information about problem areas in your system.
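In the failing example, an error in one branch fails the whole Parallel state and, with it, the execution. If you would rather route failures to a state of your own, one option is a Catch block on the Parallel state. The sketch below builds such a definition as a JavaScript object and prints it as JSON. The HandleFailure state and the $.error result path are hypothetical choices, and the arn... values are left elided as before.

```javascript
const definition = {
    StartAt: "First",
    States: {
        First: {
            Type: "Parallel",
            Next: "Done",
            // Catch any error from either branch and move on to HandleFailure
            Catch: [{
                ErrorEquals: ["States.ALL"],
                ResultPath: "$.error", // keep the error details in the state input
                Next: "HandleFailure"
            }],
            Branches: [
                {
                    StartAt: "fast-lambda",
                    States: {
                        "fast-lambda": { Type: "Task", Resource: "arn...fast-lambda", End: true }
                    }
                },
                {
                    StartAt: "fail-lambda",
                    States: {
                        "fail-lambda": { Type: "Task", Resource: "arn...fail-lambda", End: true }
                    }
                }
            ]
        },
        // Placeholder: a real workflow might notify someone or clean up here
        HandleFailure: { Type: "Pass", End: true },
        Done: { Type: "Pass", End: true }
    }
};

console.log(JSON.stringify(definition, null, 3));
```

With this in place the execution should end in HandleFailure instead of failing outright, so treat it as a deliberate design choice: a caught error no longer shows up as a failed execution.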

X-Ray

AWS X-Ray helps developers analyze and debug production, distributed applications, such as those built using a microservices architecture. With X-Ray, you can understand how your application and its underlying services are performing to identify and troubleshoot the root cause of performance issues and errors. X-Ray provides an end-to-end view of requests as they travel through your application, and shows a map of your application’s underlying components. You can use X-Ray to analyze both applications in development and in production, from simple three-tier applications to complex microservices applications consisting of thousands of services. — AWS X-Ray

Go back to the AWS console and then to the Lambda section
Click on fast-lambda and scroll down to Debugging and error handling
Check the Enable active tracing box and click Save in the top right
Do the same with the slow-lambda and the fail-lambda
That’s it; your Lambdas now include instrumentation to view tracing data
Go back to your Step Functions and re-execute both of them so that we can see data about how they ran
Go back to the AWS console and click AWS X-Ray. You should see something similar to below (if you don’t see data, give it a minute or so)

What does this tell us?

Our fast Lambda took between 204ms and 617ms
Our slow Lambda took between 1.55s and 1.93s
Our fail Lambda fails every time

If you click on one of the circles you can see additional details over time, including how often a Lambda passes or fails.

When you have a complex system with lots of Microservices, X-Ray allows you to easily identify slow running services, errors, throttling issues, etc. It gives you invaluable insight into how your system is running with minimal effort.

Hopefully this was helpful in understanding how all these pieces fit together. Please note that this tutorial walks you through setting up configuration through the AWS console for simplicity; in practice, you should use CloudFormation to provision and configure all your infrastructure.

Ajit is an AWS Certified Solutions Architect and a member of the One Six Solutions team.