Aurora PostgreSQL Slow Query Logging and CloudWatch Alarms via AWS CDK

By Adam McQuistan in DevOps  02/19/2022 Comment

Introduction

In this article I discuss the benefits of architecting observability into your AWS Aurora PostgreSQL deployments through the use of CloudWatch Logs, Metric Filters, and Alarms. To make the discussion more concrete I provide a working implementation of a Aurora PostgreSQL cluster enabled with slow query capture exported to CloudWatch logs. To make things actionable I use log metrics filter to identify and alarm on the presence of slow queries captured in CloudWatch Logs. By using these managed services for capturing and automating telemetry based insights I can achive greater operational stability. 

To round out the discussion I briefly highlight the use of native PostgreSQL tools such as the EXPLAIN and ANALYZE commands to perform root cause analysis of poor performing queries. To aid in reproducability of this tutorial all AWS resources are defined using Infrastructure as Code (IaC) via AWS Cloud Development Kit (CDK).

All code for this tutorial can be viewed and cloned for experimentation on my GitHub account.

Overview of Solution Architecture

Above is a diagram of the solution architecture identifying the core AWS services and resources that are implemented in this tutorial. Below I've called out each AWS service and given a brief description of how each will be used to accomplish the end goal of capturing slow performing queries in CloudWatch logs resulting in a alarm being sent via email to notify the recipient that the query should be investigated.

  • Aurora PostgreSQL is a relational database service that AWS offers as a proprietary implementation of the PostgreSQL engine
  • CloudWatch Logs are a managed service offered to collect, store, and search logs from cloud services and applications
  • CloudWatch Metrics are another managed service used to collect and store telemetry specific to AWS services and applications that can then be used for visual analysis or integrated into alarms
  • CloudWatch Alarms are a managed service which respond to CloudWatch metrics resulting in an action being taken when telemetry values cross a defined threshold
  • Simple Notification Service (SNS) is a publish and subscribe notification service for integration and automation of various AWS services

Enabling Slow Query Logging with RDS Parameter Groups

A typical PostgreSQL database implementation will provide the ability to specify settings within a file named postgresql.conf such as log_min_duration_statement.  Any queries that take longer than this setting value to execute will be logged to the main postgresql.log file.

In the case of Aurora PostgreSQL (along with regular RDS PostgreSQL) AWS does not provide access to the postgresql.conf file directly because its a managed service under tight control of AWS engineers. Instead you use Parameter Groups where you can specify settings for your desired implementation such as the Aurora PostgreSQL specific log_min_duration_statement in milliseconds and attach it to your Aurora cluster during deployment.

Below is a code snippet from lib/aws-cdk-aurora-stack.ts CDK IaC which creates a Cluster Parameter Group configured to log queries that take 1.5 seconds or longer to execute.

const rdsEngine = rds.DatabaseClusterEngine.auroraPostgres({
  version: rds.AuroraPostgresEngineVersion.VER_13_4
});
const rdsParamGroup = new rds.ParameterGroup(this, 'SlowQueryParamsGrp', {
  engine: rdsEngine,
  description: 'Aurora PostgreSQL Cluster Parameter Group with Slow Query Logging',
  parameters: {
    log_min_duration_statement: '1500'
  }
});

this.dbCluster = new rds.DatabaseCluster(this, 'DbCluster', {
  engine: rdsEngine,
  cloudwatchLogsExports: ['postgresql'],
  cloudwatchLogsRetention: logs.RetentionDays.THREE_DAYS,
  parameterGroup: rdsParamGroup,
  credentials: rds.Credentials.fromPassword(
    props.dbUsername,
    SecretValue.plainText(props.dbPassword),
  ),
  defaultDatabaseName: props.dbName,
  instances: 1,
  instanceProps: {
    vpc,
    publiclyAccessible: true,
    securityGroups: [ dbSecurityGroup ],
    vpcSubnets: { subnetType: ec2.SubnetType.PUBLIC }
  }
});

new CfnOutput(this, 'DbClientEndpoint', {
  value: `postgresql://${props.dbUsername}:${props.dbPassword}@${this.dbCluster.clusterEndpoint.hostname}:5432/${props.dbName}`
});
new CfnOutput(this, 'ClusterLogGroup', {
  value: `/aws/rds/cluster/${this.dbCluster.clusterIdentifier}/postgresql`
});

I also configure the CDK DatabaseCluster construct to export the postgresql log to CloudWatch and add the optional retention value of 3 days to minimize my CloudWatch costs but, in a production setting you'll likely want to set this to a value like 5-10 times higher.

Once deployed the Aurora PostgreSQL logs are exported to a CloudWatch log group with a naming convention of /aws/rds/cluster/cluster-name/postgresql

I've also included a CloudFormation output of its name for easy identification.

Defining the CloudWatch Log Metric Filter and Alarm

In the last section I showed how to enable slow query logging for queries taking over 1.5 seconds (this is probably a bit low but it makes demonstration easier) as well as how to export the postgresql log to CloudWatch. Now I'd like to shift gears and focus on how to make the presence of slow queries being logged actionable so it draws attention and can be further investigated.

I define a CloudWatch Log Metric Filter which will match on the presence of the text "duration" which is what will prefix all logged slow queries. Then I can extend the metric filter by building an CloudWatch Alarm on top of it to send a SNS topic notification to my email letting me know a slow query has been identified.

Below is a code snippet from the lib/aws-cdk-cloudwatch-stack.ts CDK IaC which accomplishes what I described above.

const dbLogGroup = logs.LogGroup.fromLogGroupName(
  this,
  'DbLogGroup',
  `/aws/rds/cluster/${props.dbCluster.clusterIdentifier}/postgresql`
);

const filterMetric = dbLogGroup.addMetricFilter('SlowQueryMetricFilter', {
  filterPattern: logs.FilterPattern.anyTerm('duration'),
  metricName: 'SlowQuery',
  metricNamespace: 'AuroraPostgreSQL',
  defaultValue: 0
});

const metric = filterMetric.metric()
const alarm = metric.createAlarm(this, 'SlowQueryAlarm', {
  threshold: 1,
  evaluationPeriods: 1
});
const snsTopic = new sns.Topic(this, 'SlowQueryTopic', {
  displayName: 'Slow Query Alarm Topic'
});
snsTopic.addSubscription(new sns_subs.EmailSubscription(props.email));
alarm.addAlarmAction(new cw_actions.SnsAction(snsTopic));

Deploying the Solution to AWS

This complete code for this project lives in my GitHub account under a branch named slow-query-logging as a AWS CDK application. Since its a AWS CDK application recreating the demo in the AWS Cloud is very simple once you have the cdk installed per the AWS docs.

git clone --branch slow-query-logging https://github.com/amcquistan/aws-cdk-aurora-cw-alarms.git
cd aws-cdk-aurora-cw-alarms

You will want to copy the project's .env.example file and fill in values specific to you such as your IP address (use https://checkip.amazonaws.com to find IP), database credentials of your choosing along with an email address to send slow query alerts to.

cp .env.example .env # then update .env variables

Then install dependencies and build the project.

npm install && npm run build

If you are deploying to a new AWS account or region you'll likely want to bootstrap the project to your environment.

cdk bootstrap YOUR_AWS_ACCOUNT_ID/YOUR_AWS_REGION

Then all thats left is to run the deploy command.

cdk deploy --all

This will output a couple of useful bits of information such as the Aurora Database client connection string. Take note of the connection string with the CloudFormation output named "AwsCdkAuroraStack.DbClientEndpoint" as that will be useful later.

Database Schema Definition and Data Seeding for Demonstration

In this section I create a couple of tables representing products and orders then load them with data to query for demonstration purposes. In the linked GitHub repo you can find a file named schema.sql which contains the below SQL DDL and DML statements.

-- schema.sql

DROP TABLE IF EXISTS orders;
DROP TABLE IF EXISTS products;

CREATE TABLE products (
    id INTEGER PRIMARY KEY,
    name VARCHAR(20) NOT NULL,
    price NUMERIC(5, 2) NOT NULL
);

CREATE TABLE orders (
    id SERIAL PRIMARY KEY,
    product_id INTEGER NOT NULL,
    quantity INTEGER NOT NULL,
    FOREIGN KEY (product_id)
        REFERENCES products (id)
);


INSERT INTO products (id, name, price)
VALUES 
(1000, 'Bubble Wrap', 4.99),
(2000, 'Fluffernutter', 2.99),
(3000, 'Chapstick', 1.99),
(4000, 'Post-its', 5.99),
(5000, 'Bubble Gum', 0.99),
(6000, 'Toothbrush', 3.99),
(7000, 'Socks', 6.99),
(8000, 'Velcro', 3.99);

SELECT setseed(0.23);

INSERT INTO orders (product_id, quantity)
SELECT 
    CASE mod(ceil(random() * s)::integer, 8)
        WHEN 0 THEN 1000
        WHEN 1 THEN 2000
        WHEN 2 THEN 3000
        WHEN 3 THEN 4000
        WHEN 4 THEN 5000
        WHEN 5 THEN 6000
        WHEN 6 THEN 7000
        WHEN 7 THEN 8000
    END AS product_id,
    ceil(random() * 100)::integer + 1 AS quantity
FROM generate_series(1, 3000000, 1) AS s;

Using the CDK / CloudFormation DbClientEndpoint output as a environment variable in my CLI terminal I can use the psql client to execute the schema.sql script like so. Of course you could another PostgreSQL compatible client such as PGAdmin or Postico also.

psql $DbClientEndpoint --file schema.sql

Imposing Load on the Database to Test CloudWatch Metric Filter Alarm

Ok now for the fun part. This is where I'll impose some load on the database with a terribly written query to calculate total revenue over all orders for each product using a correlated subquery within the SELECT statement. Please do not use this pattern for anything other than being mean to your database :)

Using the database client connection string output from the CDK stack I'll start up the psql shell but, you could use something like PGAdmin or Postico as well. From the psql client I'll issue my terribly written query with the EXPLAIN and ANALYZE keywords to get a read out of the query plan and execution metrics like so.

EXPLAIN ANALYZE
SELECT 
  p.*,
  (
    SELECT sum(quantity) * p.price FROM orders o
    WHERE o.product_id = p.id
  ) as product_revenue
FROM products p;

Which gives me the following output.

                                                           QUERY PLAN                                                           
--------------------------------------------------------------------------------------------------------------------------------
 Seq Scan on products p  (cost=0.00..43723630.00 rows=800 width=106) (actual time=256.235..1860.804 rows=8 loops=1)
   SubPlan 1
     ->  Aggregate  (cost=54654.50..54654.51 rows=1 width=32) (actual time=232.594..232.594 rows=1 loops=8)
           ->  Seq Scan on orders o  (cost=0.00..53717.00 rows=375000 width=4) (actual time=0.004..206.376 rows=375000 loops=8)
                 Filter: (product_id = p.id)
                 Rows Removed by Filter: 2625000
 Planning Time: 1.445 ms
 Execution Time: 1860.890 ms
(8 rows)

As you can see from the output provided by EXPLAIN and ANALYZE the query took 1.8 seconds which is above the 1.5 second threshold I set for the log_min_duration_statement paramter in the Parameter Group which resulted in a log being captured and exported to CloudWatch logs as shown below.

I should see that the Metric Filter enabled CloudWatch Alarm goes into an Alarm state within the next 5 minutes and subsequently receive an email notification of the alarm state transition.

At this point I've flipped my developer cap on backwards and am in total DBA mode looking to optimize this query. Lets pretend for the moment that I don't see any issues with the overall structure of the query and its aweful use of the correlated subquery within the SELECT construct. Instead I'm going to hone in on the fact that the primary filter applied as listed in the output of the EXPLAIN ANALYZE is "Seq Scan on orders ... Filter: (product_id = p.id)". A reasonable thing to try here is placing an index on the filter column of product_id in the orders table to avoid doing a full orders table scan for each row of the products table.

CREATE INDEX orders_product_fk ON orders(product_id);

Now let me reissue the same query and see if there are any improvements.

                                                                      QUERY PLAN                                                                       
-------------------------------------------------------------------------------------------------------------------------------------------------------
 Seq Scan on products p  (cost=0.00..20816574.00 rows=800 width=106) (actual time=109.755..924.354 rows=8 loops=1)
   SubPlan 1
     ->  Aggregate  (cost=26020.68..26020.69 rows=1 width=32) (actual time=115.537..115.537 rows=1 loops=8)
           ->  Bitmap Heap Scan on orders o  (cost=4178.68..25083.18 rows=375000 width=4) (actual time=13.024..89.969 rows=375000 loops=8)
                 Recheck Cond: (product_id = p.id)
                 Heap Blocks: exact=129736
                 ->  Bitmap Index Scan on orders_product_fk  (cost=0.00..4084.93 rows=375000 width=0) (actual time=10.486..10.486 rows=375000 loops=8)
                       Index Cond: (product_id = p.id)
 Planning Time: 0.213 ms
 Execution Time: 924.500 ms
(10 rows)

The filter condition now uses a "BitMap Index Scan" and the overall query execution time is under 1 second. Nice improvement right?

Conclusion

In this article I demonstrated how to instrument observability into a AWS Aurora PostgreSQL deployment utilizing Cloud Watch logging, metric filters, and alarms paired with SNS for automated notifications in response to the presence of slow executing queries. Collectively this combination of observability and alerting begins to give the power of operating in control when running services in the AWS Cloud. 

As always, I thank you for reading and please feel free to ask questions or critique in the comments section below.

Share with friends and colleagues

[[ likes ]] likes

Community favorites for DevOps

theCodingInterface