In this article I present a demonstration of how to use CloudWatch Metrics and CloudWatch Alarms to monitor CPU utilization for an AWS RDS database consisting of a primary read/write instance and a read-only replica. To make this demonstration repeatable I've used Infrastructure as Code (IaC) via the AWS Cloud Development Kit (CDK).
The vast majority of applications are of little value without the ability to save, update, and retrieve data, which is usually managed through a traditional relational database management system (RDBMS). Because data drives so many business processes, the database often becomes a system bottleneck as many applications compete for the same highly sought-after data. The cloud has made great strides in helping to alleviate this bottleneck by simplifying the process of scaling databases.
When it comes to options for scaling you generally have two approaches. The first is to scale vertically by increasing the size of the database instance, effectively adding CPU, RAM, and storage. This traditional approach often works quite well. The second is to scale horizontally: dedicate one database instance as a primary capable of serving both reads and writes, then replicate from the primary to standby read-only replicas, which can offload applications' read traffic as well as serve as a fast failover option if the primary goes down.
By sending writes to a primary database instance and delegating reads to read replicas you increase your scaling options. You can vertically scale the primary to optimize its ability to handle writes, and you can both vertically and horizontally scale the pool of read replicas. However, when you have two sets of databases serving different use cases you also need to monitor their metrics independently to support proper resource planning and utilization. This is what AWS CloudWatch metrics and alarms give us.
I've made the complete demo system available on my GitHub account. It uses the AWS CDK to provision an RDS Aurora PostgreSQL cluster of one writer and one reader instance within a VPC of public subnets spread over two availability zones, along with three CPU utilization CloudWatch metrics and alarms at the cluster, writer instance, and reader instance granularity.
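To give a feel for the architecture before walking through the deployment, here is a rough sketch of how the VPC and Aurora cluster can be provisioned inside the stack's constructor. This is an abbreviated illustration rather than the repo's exact code: the construct IDs, engine version, and credential handling are assumptions, and it presumes the usual aws-cdk-lib imports (cdk, ec2, rds).

// VPC of public subnets spread across two availability zones.
const vpc = new ec2.Vpc(this, 'Vpc', {
  maxAzs: 2,
  subnetConfiguration: [{ name: 'public', subnetType: ec2.SubnetType.PUBLIC }]
});

// Aurora PostgreSQL cluster with two instances: one writer and one read replica.
const dbCluster = new rds.DatabaseCluster(this, 'DbCluster', {
  engine: rds.DatabaseClusterEngine.auroraPostgres({
    version: rds.AuroraPostgresEngineVersion.VER_13_6
  }),
  instances: 2,
  instanceProps: {
    vpc,
    vpcSubnets: { subnetType: ec2.SubnetType.PUBLIC },
    publiclyAccessible: true
  },
  defaultDatabaseName: props.dbName,
  credentials: rds.Credentials.fromPassword(
    props.dbUsername,
    cdk.SecretValue.unsafePlainText(props.dbPassword)
  )
});

// Allow direct connections to PostgreSQL from the safe IP passed into the stack.
dbCluster.connections.allowDefaultPortFrom(ec2.Peer.ipv4(props.safeIp));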
To get started first clone the repo and change directories into the project root.
git clone https://github.com/amcquistan/aws-cdk-aurora-cw-alarms.git
cd aws-cdk-aurora-cw-alarms
Next, install the TypeScript project's dependencies.
npm install
Edit the bin/aws-cdk-aurora-cw-alarms.ts file to, at the very least, include your home IP address, which allows you to connect directly to the PostgreSQL database from your laptop. You can use the https://checkip.amazonaws.com endpoint to determine your IP address. These values can be exported as shell environment variables or placed directly in the code as shown below.
#!/usr/bin/env node
import 'source-map-support/register';
import * as cdk from 'aws-cdk-lib';
import { AwsCdkAuroraCwAlarmsStack } from '../lib/aws-cdk-aurora-cw-alarms-stack';

const app = new cdk.App();
new AwsCdkAuroraCwAlarmsStack(app, 'AwsCdkAuroraCwAlarmsStack', {
  dbName: process.env.DB_NAME || 'appdb',
  dbUsername: process.env.DB_USERNAME || 'appuser',
  dbPassword: process.env.DB_PASSWORD || 'SomePassword',
  safeIp: process.env.SAFE_IP || 'YOUR_IP/32'
});
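For example, to supply these values through the environment instead of editing the defaults, you could export them in your shell before deploying (the variable names match the process.env lookups above; the values shown are placeholders).

export DB_NAME=appdb
export DB_USERNAME=appuser
export DB_PASSWORD='SomePassword'
export SAFE_IP="$(curl -s https://checkip.amazonaws.com)/32"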
Next build the project.
npm run build
If necessary, bootstrap the CDK application to your AWS account and region.
cdk bootstrap aws://YOUR_AWS_ACCOUNT_ID/YOUR_AWS_REGION
Then deploy like so.
cdk deploy
After around 5 minutes the infrastructure will be provisioned in your AWS account, and you will see the names of the CloudWatch metrics and alarms along with the connection endpoints for the writer and reader instances.
Example output.
Outputs:
AwsCdkAuroraCwAlarmsStack.DbCpuAlarmName = AwsCdkAuroraCwAlarmsStack-DbCpuAlarmF24BD102-1OHL609TGDDQF
AwsCdkAuroraCwAlarmsStack.DbCpuMetricName = CPUUtilization
AwsCdkAuroraCwAlarmsStack.DbReaderCpuAlarmName = AwsCdkAuroraCwAlarmsStack-DbReaderCpuAlarm1E48F5B7-N3GY2EVTS0RZ
AwsCdkAuroraCwAlarmsStack.DbReaderCpuMetricName = CPUUtilization
AwsCdkAuroraCwAlarmsStack.DbReaderEndpoint = awscdkauroracwalarmsstack-dbcluster224236ef-l7zczj6qxf8l.cluster-ro-cwiohwi3ip7q.us-east-2.rds.amazonaws.com
AwsCdkAuroraCwAlarmsStack.DbWriterCpuAlarmName = AwsCdkAuroraCwAlarmsStack-DbWriterCpuAlarm156ACAFD-B3PAP6CLEXX0
AwsCdkAuroraCwAlarmsStack.DbWriterCpuMetricName = CPUUtilization
AwsCdkAuroraCwAlarmsStack.DbWriterEndpoint = awscdkauroracwalarmsstack-dbcluster224236ef-l7zczj6qxf8l.cluster-cwiohwi3ip7q.us-east-2.rds.amazonaws.com
Below is a code snippet from lib/aws-cdk-aurora-cw-alarms-stack.ts which establishes the CloudWatch metrics and alarms to track and alert on CPU utilization for the database.
// CPU Utilization CloudWatch Metric Averaged over 2 mins
// for DB Cluster, Writer Instance and Reader Instance
const dbCpuMetric = dbCluster.metricCPUUtilization({
  period: Duration.minutes(2)
});

const dbWriterCpuMetric = dbCluster.metricCPUUtilization({
  period: Duration.minutes(2),
  dimensionsMap: {
    Role: 'WRITER',
    DBClusterIdentifier: dbCluster.clusterIdentifier
  }
});

const dbReaderCpuMetric = dbCluster.metricCPUUtilization({
  period: Duration.minutes(2),
  dimensionsMap: {
    Role: 'READER',
    DBClusterIdentifier: dbCluster.clusterIdentifier
  }
});
// CloudWatch alarms for CPU Utilization metrics set to
// alarm state when CPU is over 25 percent at
// DB Cluster, Writer and Reader instances
const dbCpuAlarm = dbCpuMetric.createAlarm(this, 'DbCpu25Alarm', {
  evaluationPeriods: 1,
  alarmDescription: "Cluster CPU Over 25 Percent",
  threshold: 25
});

const dbWriterCpuAlarm = dbWriterCpuMetric.createAlarm(this, 'DbWriterCpu25Alarm', {
  evaluationPeriods: 1,
  alarmDescription: "Writer CPU Over 25 Percent",
  threshold: 25
});

const dbReaderCpuAlarm = dbReaderCpuMetric.createAlarm(this, 'DbReaderCpu25Alarm', {
  evaluationPeriods: 1,
  alarmDescription: "Reader CPU Over 25 Percent",
  threshold: 25
});
First, this snippet creates metrics that track CPU utilization over two-minute windows for the individual primary writer instance, for the pool of reader instances (only one in this example), and for the cluster as a whole (writer and readers together). Then it creates an alarm for each metric so an alarm state can be triggered at the cluster, writer, and reader-pool level whenever CPU utilization exceeds 25 percent for one period of the underlying two-minute metric.
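The metric and alarm names shown earlier in the deploy output are surfaced with CfnOutput constructs. Here is a minimal sketch for the writer pair; the construct IDs simply mirror the output keys, and it assumes CfnOutput is imported from 'aws-cdk-lib' (see the repo for the full set).

// Surface the writer alarm and metric names in the cdk deploy output.
new CfnOutput(this, 'DbWriterCpuAlarmName', { value: dbWriterCpuAlarm.alarmName });
new CfnOutput(this, 'DbWriterCpuMetricName', { value: dbWriterCpuMetric.metricName });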
With the infrastructure provisioned I can now experiment with how the metrics and alarms respond to CPU load imposed on the writer instance. I'll use the psql shell client, but you could also use a GUI-based client like Postico or pgAdmin if you prefer.
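With psql, that connection looks something like the following, substituting the DbWriterEndpoint value from the deploy output (psql will prompt for the password).

psql -h YOUR_DB_WRITER_ENDPOINT -U appuser -d appdb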
Once connected to the writer (aka cluster) endpoint I issue the following query, which generates a 50 million row series of numbers, buckets them into categories, derives a random value for each row, and then calculates some summary stats for each category.
WITH cte AS (
  SELECT
    CASE
      WHEN s < 100 THEN 'tens'
      WHEN s >= 100 AND s < 1000 THEN 'hundreds'
      WHEN s >= 1000 AND s < 1000000 THEN 'thousands'
      WHEN s >= 1000000 THEN 'millions'
    END AS name,
    ceil(s * random()) AS derived_value
  FROM generate_series(1, 50000000, 1) AS s
)
SELECT
  avg(derived_value),
  min(derived_value),
  max(derived_value),
  sum(derived_value),
  max(derived_value) - min(derived_value) AS range,
  (min(derived_value) + max(derived_value)) / 2 AS midpoint,
  name
FROM cte
GROUP BY name;
After waiting for the query to finish executing I go to the AWS CloudWatch Alarms console and have a look at the three alarms I created.
First I look at the writer-instance-specific alarm and see a spike in its CPU from a baseline of about 10 percent to just over the 25 percent threshold (around 26 percent), putting it into an alarm state.
Next I inspect the read replica pool alarm and note that there is no change from its baseline CPU utilization of around 10 percent. This makes sense because I issued the query on the writer instance, whose compute resources are completely isolated from the reader instance's.
Finally I inspect the alarm for the cluster as a whole, which tracks the overall CPU utilization across the writer and all readers. I see a modest increase from a baseline of about 10 percent CPU to around 18 percent. Mathematically this also makes sense: the reader stayed at its baseline of roughly 10 percent and the writer's CPU was elevated to about 26 percent, which averages out to roughly 18 percent between them.
By breaking the CPU metrics up along the lines of the primary writer instance and the read replicas I am able to quantitatively plan and adjust for the data access patterns of my database. For example, if I see that my primary instance is receiving a lot of CPU load but my read replica is relatively unfazed, I can further investigate read vs write IOPS to see if applications are unnecessarily burdening the primary with read-heavy workloads. In that case I could request that the read-only traffic be spread out to the read replicas. If instead I see that the CPU load is due to the primary being overwhelmed with writes, my course of action would be to scale the instance up to a larger size or look for indexes that could be added to improve update and delete statements.
On the other hand, if the CPU utilization is high among the read replicas I also have some targeted options to pursue. I could add more replicas to horizontally scale the traffic across more nodes. I could also scale up the size of the read replicas if the overall read request volume is relatively low and the load comes from a few expensive queries, though I'd probably also look into different indexing or data modeling strategies to improve read performance.
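In either case, the same CDK pattern used for the CPU metrics above can be pointed at IOPS to help confirm where the load is coming from. Here is a sketch that reuses the cluster construct and Role dimension from earlier; ReadIOPS and WriteIOPS are standard RDS CloudWatch metric names, but this snippet is illustrative rather than part of the demo repo.

// Read IOPS across the reader pool and write IOPS on the writer,
// averaged over the same 2 minute window used for the CPU metrics.
const dbReaderIopsMetric = dbCluster.metric('ReadIOPS', {
  period: Duration.minutes(2),
  dimensionsMap: {
    Role: 'READER',
    DBClusterIdentifier: dbCluster.clusterIdentifier
  }
});

const dbWriterIopsMetric = dbCluster.metric('WriteIOPS', {
  period: Duration.minutes(2),
  dimensionsMap: {
    Role: 'WRITER',
    DBClusterIdentifier: dbCluster.clusterIdentifier
  }
});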
In this article I put forth a demo of how to implement CloudWatch metrics and alarms for an AWS RDS database. I showed that by differentiating the CPU metrics across the two data access patterns made available by a primary database and its read replicas, you can better understand resource consumption across an RDS cluster as a whole.
As always, I thank you for reading and please feel free to ask questions or critique in the comments section below.