Orchestrating and tracking complex Microservices in Amazon Web Services

Published

10.3.2019

Products mentioned

Application Modernization

This article and tutorial will help you build a working solution that covers AWS Lambda, CloudWatch, Step Functions and X-Ray. It covers a wide range of subjects at a surface level so that you can get everything working together. The intent is to help you think through your options when architecting Microservices, so that they are less complex and easier to track and debug.

The 30,000 Foot View: What is the problem we are trying to solve?

To understand the problem we are trying to solve better, let’s take a look at how the complexity of Microservices quickly gets out of hand.

Month 1: A few Microservices

 
 

Month 2: More Microservices, that are interrelated

 
 

Month 3: The number of Microservices and their dependencies quickly getting out of hand

 
 

As you can see, with so many execution paths, many of which run in parallel, it is difficult to know what is happening within your system. Furthermore, when an error occurs, tracing it back to what happened and why can be cumbersome. If you use a notification system (like SNS) to communicate between your Microservices, a workflow may stall because one service behaved differently than you expected. Understanding which one and why is critical to keeping your system running reliably.

Let’s take for example a simple system where:

  • A user uploads an image to S3
  • The S3 bucket sends a notification saying “new file uploaded”
  • A Microservice receives the notification and adds a watermark to the image, placing it in a new S3 bucket viewable on your website

If the user is waiting for the file to show up on the website and it doesn’t, there is a long list of things to check: Did the S3 bucket send the notification? Was there an error in the Microservice? Is the Microservice still running? Was the image placed in the correct S3 bucket after processing?
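To make the first of those checks concrete, here is a minimal sketch of how the watermarking Microservice could record which upload each notification refers to. The `Records[].s3.bucket.name` and `Records[].s3.object.key` fields are part of the S3 event notification format; the function name, the bucket name and the sample event are assumptions made for illustration.

```javascript
"use strict";

// Extract the bucket and key from each record of an S3 notification.
// S3 URL-encodes object keys in event payloads and uses "+" for spaces,
// so the key is decoded before it is logged or used.
function describeS3Event(event) {
    return (event.Records || []).map(function (record) {
        return {
            bucket: record.s3.bucket.name,
            key: decodeURIComponent(record.s3.object.key.replace(/\+/g, " "))
        };
    });
}

// Hypothetical sample event, trimmed to the fields used here.
const sampleEvent = {
    Records: [
        { s3: { bucket: { name: "uploads" }, object: { key: "photos/my+cat.png" } } }
    ]
};

const uploads = describeS3Event(sampleEvent);
uploads.forEach(function (upload) {
    console.log("notification received for " + upload.bucket + "/" + upload.key);
});
```

Logging the bucket and key as soon as the notification arrives answers "did the S3 bucket send the notification?" directly from the logs.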

Now imagine this in a complex system where events are happening in parallel with many preconditions and post actions.

While building out your Microservices architecture, it’s important to consider how you are going to debug, track and orchestrate complex scenarios.

To set up the infrastructure to solve this problem I will walk you through:

  • Setting up a couple of simple Lambda functions
  • Logging and viewing the logs in CloudWatch
  • Organizing your Lambdas into Step Functions
  • Instrumentation and tracing using X-Ray

Lambda

Let’s first set up an IAM role and some simple Lambda functions so we have something to work with. The IAM role allows us to give the Lambda access to CloudWatch.

AWS Lambda lets you run code without provisioning or managing servers. You pay only for the compute time you consume — there is no charge when your code is not running. — AWS Lambda

Create an IAM role for your Lambda API

  • Sign into the AWS Console and go to IAM -> Roles -> Create Role
  • Select AWS service, choose Lambda as the service you want to create a role for, and click next
  • Search for and select CloudWatchLogsFullAccess, then click next to continue
 
 
  • Name your role cloud-watch-full-access and click create role

Create the first Lambda function

  • Go back to the AWS console and then to Lambda -> Create function
  • Make sure author from scratch is selected
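  • Name your function fast-lambda, use Node.js 6.10 as your runtime (AWS has since retired this runtime, so choose a currently supported Node.js version if 6.10 is not offered) and choose the role we just created under existing role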
 
 
  • Finally click Create function

Add some code to your Lambda

  • Add the below code to your Lambda in the index.js file
exports.handler = (event, context, callback) => {
    console.log("fast lambda started executing");
    
    // set timeout is used to make the lambda wait a set 
    // amount of time before returning
    setTimeout(function () {
        console.log("fast lambda done executing");
        callback(null, 'Done');
    }, 200); // waiting 200ms
};
  • Click save in the top right corner
  • Click Test
  • Enter the event name “testing” and then Create using the defaults (the inputs will be ignored)
  • Click Test again, you should see the below execution results
 
 

Looking at the result there are a couple of things worth noting

  • Your Lambda executed successfully
  • It took over 200ms (337.61ms in my case) because of the 200ms setTimeout delay we added to the code
  • The log output shows the two console.log statements we added
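If you would like to see the same behaviour without going through the console, the sketch below invokes the handler locally and measures how long it takes. The `invokeLocally` helper is hypothetical and is not part of any AWS tooling; it only mimics the `(event, context, callback)` call that Lambda makes.

```javascript
"use strict";

// The fast-lambda handler from above. In Lambda's index.js this function
// would be assigned to exports.handler.
const handler = (event, context, callback) => {
    console.log("fast lambda started executing");
    setTimeout(function () {
        console.log("fast lambda done executing");
        callback(null, "Done");
    }, 200); // waiting 200ms
};

// Hypothetical helper: call a callback-style handler and resolve with its
// result plus the measured wall-clock duration in milliseconds.
function invokeLocally(fn, event) {
    return new Promise(function (resolve, reject) {
        const startedAt = Date.now();
        fn(event, {}, function (error, result) {
            if (error) {
                reject(error);
                return;
            }
            resolve({ result: result, durationMs: Date.now() - startedAt });
        });
    });
}

const invocation = invokeLocally(handler, {});
invocation.then(function (outcome) {
    console.log(outcome.result + " after " + outcome.durationMs + "ms");
});
```

The local duration should land close to 200ms. The console figure is higher because it also includes the time Lambda spends starting the function.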

Create another Lambda that’s slow

  • So that we can use it later to create Step Functions, go back and create another Lambda. Call this one slow-lambda and set the setTimeout delay to 1500ms. You can use the below code.
exports.handler = (event, context, callback) => {
    console.log("slow lambda started executing");
    
    // set timeout is used to make the lambda wait a set 
    // amount of time before returning
    setTimeout(function () {
        console.log("slow lambda done executing");
        callback(null, 'Done');
    }, 1500); // waiting 1500ms
};
  • Your test response should look something like this, with a duration of over 1500ms
 
 

Create one last Lambda that fails by default (optional)

  • Create a new Lambda called fail-lambda using the same configuration as before but the below code
exports.handler = (event, context, callback) => {
    var error = new Error("something went wrong");
    callback(error);
};
  • Your execution result should fail like below
 
 

CloudWatch

Amazon CloudWatch is a monitoring service for AWS cloud resources and the applications you run on AWS. You can use Amazon CloudWatch to collect and track metrics, collect and monitor log files, set alarms, and automatically react to changes in your AWS resources. Amazon CloudWatch can monitor AWS resources such as Amazon EC2 instances, Amazon DynamoDB tables, and Amazon RDS DB instances, as well as custom metrics generated by your applications and services, and any log files your applications generate. You can use Amazon CloudWatch to gain system-wide visibility into resource utilization, application performance, and operational health. You can use these insights to react and keep your application running smoothly. Source: Amazon CloudWatch

Now that we have created and run some Lambda functions, let’s see what their logs look like in CloudWatch

  • Go back to the AWS console and open CloudWatch -> Logs
  • You should now see a log group for each Lambda you ran: one for fast-lambda, one for slow-lambda, and one for fail-lambda if you created it
  • Open up the fast-lambda log group (/aws/lambda/fast-lambda)
  • Click the latest log stream
  • You should see something like this, including the text we logged (fast lambda started executing…)
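Plain text log lines become hard to search once many Lambdas write to CloudWatch. One option, sketched here as an assumption rather than something the tutorial requires, is to log one JSON object per line and include the request ID. `context.awsRequestId` is a real property of the Lambda context object; the `logLine` helper and its field names are hypothetical.

```javascript
"use strict";

// Build one JSON log line. The output field names (level, message,
// requestId) are this sketch's own choice, not a CloudWatch requirement.
function logLine(level, message, context, extra) {
    return JSON.stringify(Object.assign({
        level: level,
        message: message,
        requestId: context && context.awsRequestId
    }, extra));
}

// Same behaviour as fast-lambda, with structured log lines. In Lambda's
// index.js this function would be assigned to exports.handler.
const handler = (event, context, callback) => {
    console.log(logLine("info", "fast lambda started executing", context));
    setTimeout(function () {
        console.log(logLine("info", "fast lambda done executing", context, { waitedMs: 200 }));
        callback(null, "Done");
    }, 200);
};
```

With lines in this shape, a CloudWatch Logs filter pattern such as `{ $.level = "error" }` can select entries by field instead of by free text.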
 
 

Step Functions vs Simple Notification Service (SNS)

Before we work with Step Functions, it’s worth noting that whether it is the right option depends on the needs of your application, so think those through first.

Step Functions: If your use case involves “do this, then that” or “if this, then that” functionality (in other words, workflows), then Step Functions is a good option. It also handles retries and basic logic. It is generally considered bad practice to have direct references between Microservices, so use Step Functions sparingly and only where you are implementing a workflow.

Simple Notification Service (SNS): SNS can also trigger Lambdas, but it has no built-in workflow logic such as branching or per-step retry policies. Use it if you want your Lambdas to react to events happening in other Lambdas.

For this tutorial we are going to focus on Step Functions.
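To show what "retries and basic logic" can look like, here is a minimal sketch that builds a small state machine definition as a JavaScript object and prints it as JSON. `Retry`, `Catch`, `ErrorEquals`, `IntervalSeconds`, `MaxAttempts`, `BackoffRate`, `Choice` and `Default` are Amazon States Language fields. The state names, the retry numbers and the `$.status` field are assumptions: the Lambdas in this tutorial return the string 'Done', so the Choice state below would only work with a Lambda that returns an object containing `status`.

```javascript
"use strict";

// Wrap a Lambda ARN in a Task state with a retry policy and a fallback.
// The retry values are illustrative, not recommendations.
function taskWithRetry(resourceArn, nextState, fallbackState) {
    return {
        Type: "Task",
        Resource: resourceArn,
        Retry: [
            { ErrorEquals: ["States.TaskFailed"], IntervalSeconds: 2, MaxAttempts: 3, BackoffRate: 2.0 }
        ],
        Catch: [
            { ErrorEquals: ["States.ALL"], Next: fallbackState }
        ],
        Next: nextState
    };
}

// "do this, then that" (Watermark -> CheckResult) followed by
// "if this, then that" (the Choice state). State names are hypothetical.
const definition = {
    StartAt: "Watermark",
    States: {
        Watermark: taskWithRetry("arn...fast-lambda", "CheckResult", "NotifyFailure"),
        CheckResult: {
            Type: "Choice",
            Choices: [
                { Variable: "$.status", StringEquals: "ok", Next: "Done" }
            ],
            Default: "NotifyFailure"
        },
        NotifyFailure: { Type: "Fail", Error: "WatermarkFailed", Cause: "See execution history" },
        Done: { Type: "Succeed" }
    }
};

console.log(JSON.stringify(definition, null, 2));
```

With SNS, the retry counting and the branching would have to live inside your Lambda code. Here they are declared in the workflow itself.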

Step Functions

AWS Step Functions makes it easy to coordinate the components of distributed applications and microservices using visual workflows. Building applications from individual components that each perform a discrete function lets you scale and change applications quickly. Step Functions is a reliable way to coordinate components and step through the functions of your application. Step Functions provides a graphical console to arrange and visualize the components of your application as a series of steps. — AWS Step Functions

This is where it gets interesting: Step Functions will help you orchestrate your Lambdas and debug any issues that arise.

Run a successful Step Function

  • From the AWS console open Step Functions
  • Click create a state machine
  • Make sure Author from scratch is selected and name your Step Function complex-state-machine
  • Use the below code to set it up (make sure you replace each arn… with the ARNs of your fast and slow Lambdas. If you click in the resource field, it will offer a choice of your ARNs)
{
   "StartAt":"First",
   "States":{
      "First":{
         "Type":"Parallel",
         "Next":"Done",
         "Branches":[
            {
               "StartAt":"FastLambda",
               "States":{
                  "FastLambda":{
                     "Type":"Task",
                     "Resource":"arn...fast-lambda",
                     "End":true
                  }
               }
            },
            {
               "StartAt":"SlowLambda",
               "States":{
                  "SlowLambda":{
                     "Type":"Task",
                     "Resource":"arn...slow-lambda",
                     "End":true
                  }
               }
            }
         ]
      },
      "Done":{
         "Type":"Pass",
         "End":true
      }
   }
}

This configuration creates a Step Function that runs two Lambdas (your fast and slow Lambdas) in parallel, then moves on to the Done state once both have finished.
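If you end up with more than a couple of branches, writing this JSON by hand gets repetitive. As a sketch, the same shape can be generated from a map of state names to ARNs. The `buildParallelStateMachine` function is hypothetical, and the `arn...` placeholders still need to be replaced with your own ARNs.

```javascript
"use strict";

// Build the same shape as the definition above from a map of
// state name -> Lambda ARN. Each entry becomes one parallel branch
// containing a single Task state.
function buildParallelStateMachine(lambdaArnsByName) {
    const branches = Object.keys(lambdaArnsByName).map(function (name) {
        const states = {};
        states[name] = { Type: "Task", Resource: lambdaArnsByName[name], End: true };
        return { StartAt: name, States: states };
    });
    return {
        StartAt: "First",
        States: {
            First: { Type: "Parallel", Next: "Done", Branches: branches },
            Done: { Type: "Pass", End: true }
        }
    };
}

const stateMachine = buildParallelStateMachine({
    FastLambda: "arn...fast-lambda",
    SlowLambda: "arn...slow-lambda"
});

console.log(JSON.stringify(stateMachine, null, 3));
```

Swapping the second entry for fail-lambda produces the failing variant used later in this tutorial.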

  • Click create state machine
  • Click New execution and then click start execution (the inputs will be ignored)
  • After a few seconds you should see this:
 
 
  • Play around with the Execution details section to see results of the Step Function

Run a failing Step Function

  • Click complex-state-machine in the page’s breadcrumbs at the top
  • Click State machine details and then Copy to new
  • Name the state machine failing-complex-state-machine
  • Update the code so that this time it uses the fail-lambda instead of the slow-lambda (remember to change the ARNs, using the fail-lambda as the second one)
{
   "StartAt":"First",
   "States":{
      "First":{
         "Type":"Parallel",
         "Next":"Done",
         "Branches":[
            {
               "StartAt":"fast-lambda",
               "States":{
                  "fast-lambda":{
                     "Type":"Task",
                     "Resource":"arn...fast-lambda",
                     "End":true
                  }
               }
            },
            {
               "StartAt":"fail-lambda",
               "States":{
                  "fail-lambda":{
                     "Type":"Task",
                     "Resource":"arn...fail-lambda",
                     "End":true
                  }
               }
            }
         ]
      },
      "Done":{
         "Type":"Pass",
         "End":true
      }
   }
}
  • Click New execution and start execution
  • You should see something like this:
 
 
  • Click the output tab under Execution details, and you should see your error message
 
 

As you can see, Step Functions is an easy-to-use service that helps you orchestrate and organize Lambdas, giving you visual insight into problem areas in your system.
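If you want to handle that error output in code rather than read it in the console, a small helper can turn it into a one-line summary. This is a sketch under an assumption about the shape: a failed Lambda task is taken to surface as an object with `Error` and `Cause`, where `Cause` is a JSON string containing `errorMessage` and `errorType`. Check the output of your own failed execution before relying on it. The `describeFailure` function and the sample object are hypothetical.

```javascript
"use strict";

// Summarize a failure as "<Error>: <message>".
function describeFailure(failure) {
    let cause;
    try {
        cause = JSON.parse(failure.Cause);
    } catch (parseError) {
        // Cause may be plain text in other failure modes, so keep it as-is.
        cause = { errorMessage: String(failure.Cause) };
    }
    return failure.Error + ": " + cause.errorMessage;
}

// Hypothetical output, shaped like what fail-lambda is assumed to produce.
const sampleFailure = {
    Error: "Error",
    Cause: JSON.stringify({ errorType: "Error", errorMessage: "something went wrong" })
};

console.log(describeFailure(sampleFailure)); // prints "Error: something went wrong"
```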


X-Ray

AWS X-Ray helps developers analyze and debug production, distributed applications, such as those built using a microservices architecture. With X-Ray, you can understand how your application and its underlying services are performing to identify and troubleshoot the root cause of performance issues and errors. X-Ray provides an end-to-end view of requests as they travel through your application, and shows a map of your application’s underlying components. You can use X-Ray to analyze both applications in development and in production, from simple three-tier applications to complex microservices applications consisting of thousands of services. — AWS X-Ray

  • Go back to the AWS console and then to the Lambda section
  • Click on fast-lambda and scroll down to Debugging and error handling
  • Check the Enable active tracing box and click Save in the top right
  • Do the same with the slow-lambda and the fail-lambda
  • That’s it, your Lambdas now include instrumentation to view tracing data
  • Go back to your Step Functions and re-execute both of them so that we can see data about how they ran
  • Go back to the AWS console and click AWS X-Ray. You should see something similar to below (if you don’t see data, give it a minute or so)
 
 

What does this tell us?

  • Our fast Lambda took between 204ms and 617ms
  • Our slow Lambda took 1.55s to 1.93s
  • Our fail Lambda fails every time
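The three observations above are a per-service summary of individual traces. The sketch below shows that kind of aggregation using the timings from this run. The record shape and the `summarize` function are hypothetical and do not call the X-Ray API, and the fail-lambda duration is invented because the article does not report one.

```javascript
"use strict";

// Hypothetical trace records. The fast and slow timings are the ones
// reported above; the fail-lambda duration is a made-up placeholder.
const traces = [
    { service: "fast-lambda", durationMs: 617, failed: false },
    { service: "fast-lambda", durationMs: 204, failed: false },
    { service: "slow-lambda", durationMs: 1550, failed: false },
    { service: "slow-lambda", durationMs: 1930, failed: false },
    { service: "fail-lambda", durationMs: 20, failed: true }
];

// Group records by service and track count, failures and min/max duration.
function summarize(records) {
    const summary = {};
    records.forEach(function (record) {
        const entry = summary[record.service] ||
            (summary[record.service] = { count: 0, failures: 0, minMs: Infinity, maxMs: 0 });
        entry.count += 1;
        entry.failures += record.failed ? 1 : 0;
        entry.minMs = Math.min(entry.minMs, record.durationMs);
        entry.maxMs = Math.max(entry.maxMs, record.durationMs);
    });
    return summary;
}

const summary = summarize(traces);
Object.keys(summary).forEach(function (service) {
    const entry = summary[service];
    console.log(service + ": " + entry.minMs + "ms to " + entry.maxMs + "ms, " +
        entry.failures + " of " + entry.count + " failed");
});
```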

If you click on one of the circles you can see additional details over time, including how often a Lambda passes or fails.

 
 

When you have a complex system with lots of Microservices, X-Ray allows you to easily identify slow running services, errors, throttling issues, etc. It gives you invaluable insight into how your system is running with minimal effort.


Hopefully this was helpful in understanding how all these pieces fit together. Please note that this tutorial walks you through configuration in the AWS console for simplicity; in practice you should use CloudFormation to provision and configure all your infrastructure.
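As a starting point for that, here is a minimal sketch of a CloudFormation template for the role and fast-lambda, built as a JavaScript object and printed as JSON. The resource types, property names and managed policy ARNs are real CloudFormation and IAM identifiers. The logical names (`LambdaRole`, `FastLambda`) and the `nodejs18.x` runtime value are assumptions, so substitute whichever runtime is currently supported. The console adds X-Ray permissions for you when you enable active tracing; in a template they have to be attached explicitly, which is why a second managed policy appears here.

```javascript
"use strict";

// Sketch of a CloudFormation template for fast-lambda and its role.
// The runtime is a parameter because supported Node.js versions change.
function buildTemplate(runtime) {
    return {
        AWSTemplateFormatVersion: "2010-09-09",
        Resources: {
            LambdaRole: {
                Type: "AWS::IAM::Role",
                Properties: {
                    // Lets the Lambda service assume this role.
                    AssumeRolePolicyDocument: {
                        Version: "2012-10-17",
                        Statement: [{
                            Effect: "Allow",
                            Principal: { Service: "lambda.amazonaws.com" },
                            Action: "sts:AssumeRole"
                        }]
                    },
                    ManagedPolicyArns: [
                        "arn:aws:iam::aws:policy/CloudWatchLogsFullAccess",
                        "arn:aws:iam::aws:policy/AWSXRayDaemonWriteAccess"
                    ]
                }
            },
            FastLambda: {
                Type: "AWS::Lambda::Function",
                Properties: {
                    FunctionName: "fast-lambda",
                    Runtime: runtime,
                    Handler: "index.handler",
                    Role: { "Fn::GetAtt": ["LambdaRole", "Arn"] },
                    // Equivalent of ticking "Enable active tracing".
                    TracingConfig: { Mode: "Active" },
                    Code: {
                        ZipFile: [
                            "exports.handler = (event, context, callback) => {",
                            "    setTimeout(() => callback(null, 'Done'), 200);",
                            "};"
                        ].join("\n")
                    }
                }
            }
        }
    };
}

console.log(JSON.stringify(buildTemplate("nodejs18.x"), null, 2));
```

The printed JSON could be saved to a file and deployed as a stack. The slow and fail Lambdas would follow the same pattern as `FastLambda`.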


Ajit is an AWS Certified Solutions Architect and a member of the One Six Solutions team.