FAUN — Developer Community 🐾

We help developers learn and grow by keeping them up with what matters. 👉 www.faun.dev

Follow publication

100 Days of DevOps — Day 1(Introduction to CloudWatch Metrics)

--

Check the updated 101 Days of DevOps Course

Course Registration link: https://www.101daysofdevops.com/register/

Course Link: https://www.101daysofdevops.com/courses/101-days-of-devops/

YouTube link: https://www.youtube.com/user/laprashant/videos

Welcome to Day 1 of 100 Days of DevOps, I would like to start this journey with one of the most critical concepts in DevOps Monitoring and Alerting.

Problem Statement

  • Create a CloudWatch alarm that sends an email using SNS notification when CPU Utilization is more than 70%.
  • Creating a Status Check Alarm to check System and Instance failure and send an email using SNS notification

Solution

This can be achieved via in one of the three ways

  • AWS Console
  • AWS CLI
  • Terraform

NOTE: This is not the complete list and there are more ways to achieve the same

What is CloudWatch?

AWS CloudWatch is a monitoring service to monitor AWS resources, as well as the applications that run on AWS.

As per official documentation

Amazon CloudWatch monitors your Amazon Web Services (AWS) resources and the applications you run on AWS in real time. You can use CloudWatch to collect and track metrics, which are variables you can measure for your resources and applications.

Reference:

EC2/Host Level Metrics that CloudWatch monitors by default consist of

  • CPU
  • Network
  • Disk
  • Status Check

There are two types of status check

  • System status check: Monitor the AWS System on which your instance runs. It either requires AWS involvement to repair or you can fix it by yourself by just stop/start the instance(in case of EBS volumes).Examples of problems that can cause system status checks to fail
* Loss of network connectivity
* Loss of system power
* Software issues on the physical host
* Hardware issues on the physical host that impact network reachability
  • Instance status check: Monitor the software and network configuration of an individual instance. It checks/detects problems that require your involvement to repair.
* Incorrect networking or startup configuration
* Exhausted memory
* Corrupted filesystem
* Incompatible kernel

NOTE

  • Memory/RAM utilization is custom metrics.
  • By default, EC2 monitoring is 5 minutes intervals but we can always enable detailed monitoring(1 minutes interval, but that will cost you some extra $$$)

Reference

P.S: CloudWatch can be used on premise too. We just need to install the SSM(System Manager) and CloudWatch agent.

Enough of the theoretical concept, let setup first CloudWatch alarm

  • Scenario1: We want to create a CloudWatch alarm that sends an email using SNS notification when CPU Utilization is more than 70%

Solution1: Setup a CPU Usage Alarm using the AWS Management Console

  • Open the CloudWatch console at https://console.aws.amazon.com/cloudwatch/.
  • In the navigation pane, choose Alarms, Create Alarm.
  • Go to Metric → Select metric → EC2 → Per-Instance-Metrics → CPU Utilization → Select metric
Define the Alarm as follows* Type the unique name for the alarm(eg: HighCPUUtilizationAlarm)
* Description of the alarm
* Under whenever,choose >= and type 70, for type 2. This specify that the alarm is triggered if the CPU usage is above 70% for two consecutive sampling period
* Under Additional settings, for treat missing data as, choose bad(breaching threshold), as missing data points may indicate that the instance is down
* Under Actions, for whenever this alarm, choose state is alarm. For Send notification to, select an exisiting SNS topic or create a new one
* To create a new SNS topic, choose new list, for send notification to, type a name of SNS topic(for eg: HighCPUUtilizationThreshold) and for Email list type a comma-seperated list of email addresses to be notified when the alarm changes to the ALARM state.
* Each email address is sent to a topic subscription confirmation email. You must confirm the subscription before notifications can be sent.
* Click on Create Alarm

Solution2: Setup CPU Usage Alarm using the AWS CLI

  • Create an alarm using the put-metric-alarm command
aws cloudwatch put-metric-alarm --alarm-name cpu-mon --alarm-description "Alarm when CPU exceeds 70 percent" --metric-name CPUUtilization --namespace AWS/EC2 --statistic Average --period 300 --threshold 70 --comparison-operator GreaterThanThreshold  --dimensions "Name=InstanceId,Value=i-12345678" --evaluation-periods 2 --alarm-actions arn:aws:sns:us-east-1:111122223333:MyTopic --unit Percent
  • Using the command line, we can test the Alarm by forcing an alarm state change using a set-alarm-state command
  • Change the alarm-state from INSUFFICIENT_DATA to OK
# aws cloudwatch set-alarm-state --alarm-name "cpu-monitoring" --state-reason "initializing" --state-value OK
  • Change the alarm-state from OK to ALARM
# aws cloudwatch set-alarm-state --alarm-name "cpu-monitoring" --state-reason "initializing" --state-value ALARM
  • Check if you have received an email notification about the alarm

GITHUB link:

Solution3: Setup CPU Usage Alarm using the Terraform

#cloudwatch.tfresource "aws_cloudwatch_metric_alarm" "cpu-utilization" {
alarm_name = "high-cpu-utilization-alarm"
comparison_operator = "GreaterThanOrEqualToThreshold"
evaluation_periods = "2"
metric_name = "CPUUtilization"
namespace = "AWS/EC2"
period = "120"
statistic = "Average"
threshold = "80"
alarm_description = "This metric monitors ec2 cpu utilization"
alarm_actions = [ "${aws_sns_topic.alarm.arn}" ]
dimensions {
InstanceId = "${aws_instance.my_instance.id}"
}
}

GitHub link:

Scenario2: Create a status check alarm to notify when an instance has failed a status check

Solution1: Creating a Status Check Alarm Using the AWS Console

  1. Open the Amazon EC2 console at https://console.aws.amazon.com/ec2/.
  2. In the navigation pane, choose Instances.
  3. Select the instance, choose the Status Checks tab, and choose to Create Status Check Alarm.
* You can create new SNS notification or use the exisiting one(I am using the existing one create in earlier example of high CPU utilization)
* In
Whenever, select the status check that you want to be notified about(options Status Check Failed(Any), Status Check Failed(Instance) and Status Check Failed(System)
* In
For at least, set the number of periods you want to evaluate and in consecutive periods, select the evaluation period duration before triggering the alarm and sending an email.
* In
Name of alarm, replace the default name with another name for the alarm.
* Choose
Create Alarm.

Solution2: To create a status check alarm via AWS CLI

  • Use the put-metric-alarm command to create the alarm
aws cloudwatch put-metric-alarm --alarm-name StatusCheckFailed-Alarm-for-test-instance --metric-name StatusCheckFailed --namespace AWS/EC2 --statistic Maximum --dimensions Name=InstanceId,Value=i-1234567890abcdef0 --unit Count --period 300 --evaluation-periods 2 --threshold 1 --comparison-operator GreaterThanOrEqualToThreshold --alarm-actions arn:aws:sns:us-west-2:111122223333:my-sns-topic

GitHub link:

Solution3:To create a status check alarm via Terraform

resource "aws_cloudwatch_metric_alarm" "instance-health-check" {
alarm_name = "instance-health-check"
comparison_operator = "GreaterThanOrEqualToThreshold"
evaluation_periods = "1"
metric_name = "StatusCheckFailed"
namespace = "AWS/EC2"
period = "120"
statistic = "Average"
threshold = "1"
alarm_description = "This metric monitors ec2 health status"
alarm_actions = [ "${aws_sns_topic.alarm.arn}" ]
dimensions {
InstanceId = "${aws_instance.my_instance.id}"
}
}

GitHub link:

Looking forward from you guys to join this journey and spend a minimum an hour every day for the next 100 days on DevOps work and post your progress using any of the below medium.

Reference

Follow us on Twitter 🐦 and Facebook 👥 and join our Facebook Group 💬.

To join our community Slack 🗣️ and read our weekly Faun topics 🗞️, click here⬇

If this post was helpful, please click the clap 👏 button below a few times to show your support for the author! ⬇

--

--

Published in FAUN — Developer Community 🐾

We help developers learn and grow by keeping them up with what matters. 👉 www.faun.dev

Written by Prashant Lakhera

AWS Community Builder, Ex-Redhat, Author, Blogger, YouTuber, RHCA, RHCDS, RHCE, Docker Certified,4XAWS, CCNA, MCP, Certified Jenkins, Terraform Certified, 1XGCP

Responses (8)