theCodingInterface

Creating PostgreSQL Hot Standby Log Shipped Replica using WAL-E

By Adam McQuistan in PostgreSQL 10/21/2020 Comment

Introduction

In this article I demonstrate how to setup a log shipped replica of a Postgres v11 master database creating a warm standby replica for High Availability and read scaling utilizing the WAL-E program. WAL-E is a Python based program designed to facilitate performing continuous archiving of PostgreSQL WAL files and base backups.

For this article I will be using two separate Linux based Cloud Virtual Private Servers (AWS Linux AMI 2 specifically but both Ubuntu and CentOS can be used interchangably) along with an AWS S3 bucket.

PostgreSQL Master Server Setup

In this section I walk though installing and configuring the AWS Linux AMI 2 based master server installing the PostgreSQL and WAL-E dependencies, build and populate the master database with source data, and initiate continuous archving to an AWS S3 bucket. Note that although I am using the AWS Linux AMI distro the process is essentially identical for Ubuntu and CentOS / Fedora and I link to other articles on The Coding Interface that detail how to install on those distros.

First I update the newly launched Linux server and install the necessary linux packages for Postgres and the WAL-E dependencies.

sudo yum update -y

Installing Postgres (see link below for instructions on installing on Ubuntu / Centos).

sudo amazon-linux-extras install epel -y
sudo amazon-linux-extras install postgresql11 -y
sudo yum install postgresql-server postgresql-contrib -y

For instructions specific to installing PostgreSQL on Ubuntu and CentOS see my earlier article here on The Coding Interface, Developer's Guide to PostgreSQL on Linux: Installation

Install Python3 and PIP as well as WAL-E and the very useful envdir for working with file based environment variables.

sudo yum install python3 lzop pv -y
curl https://bootstrap.pypa.io/get-pip.py -o get-pip.py
sudo python3 get-pip.py
sudo python3 -m pip install envdir
sudo python3 -m pip install wal-e[aws]

Initialize, start and enable Postgres.

sudo su - # become root user
su - postgres
pg_ctl init
exit
systemctl start postgresql
systemctl enable postgresql

Create a Postgres user and database for interacting with the master source database.

su - postgres
createuser -e -l --pwprompt pguser
createdb -e --owner=pguser pgdb

Create some functions and stored procedures for generating test data to work with in the replication process.

CREATE OR REPLACE FUNCTION random_between(low INT, high INT) 
   RETURNS INT AS
$$
BEGIN
   RETURN floor(random()* (high-low + 1) + low);
END;
$$ language 'plpgsql' STRICT;


CREATE OR REPLACE FUNCTION random_letter() 
   RETURNS TEXT AS
$$
DECLARE
    letter_idx INTEGER := 0;
    letters TEXT := 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ';
BEGIN
    SELECT random_between(0, LENGTH(letters) - 1) INTO letter_idx; 
    RETURN SUBSTRING(letters, letter_idx, 1);
END;
$$ language 'plpgsql' STRICT;

CREATE TABLE IF NOT EXISTS random_numbers (
    id serial PRIMARY KEY,
    number INTEGER NOT NULL,
    created_at TIMESTAMP NOT NULL
);


CREATE TABLE IF NOT EXISTS random_letters (
    id serial PRIMARY KEY,
    letter VARCHAR(1) NOT NULL,
    created_at TIMESTAMP NOT NULL
);


CREATE OR REPLACE PROCEDURE generate_randomness() 
language plpgsql
AS $$
DECLARE
    num INTEGER;
BEGIN
    SELECT random_between(0, 100) INTO num;
    
    INSERT INTO random_numbers (number, created_at)
    SELECT random_between(0, 100), NOW()
    FROM generate_series(1, num) as s;
    
    INSERT INTO random_letters (letter, created_at)
    SELECT random_letter(), NOW()
    FROM generate_series(1, num) as s;
    
    COMMIT;

END; $$

I placed the above functions and stored procedure into a file called initdb.sql and loaded it into the database like so.

psql -U pguser -d pgdb -f initdb.sql

Create a cron task as postgres user to generate random numbers and letter data every minute.

*/1 * * * * psql -U pguser -d pgdb -c "CALL generate_randomness();"

In order for WAL-E to interact with S3 an IAM user with programmatic access is required. I've created a IAM user in my AWS account named wal-e with S3FullAccess permission and gathered up the access key ID and secret. See AWS IAM Docs if you are unsure how to do this.

I also setup a S3 bucket named tci-wal-e-demo in the US-WEST (Oregon) region to work with.

Now over on the Master Linux server I create a directory for containing my IAM user credentials and S3 bucket info in the form of [Python envdir](https://envdir.readthedocs.io/en/latest/api.html) environment variable files as shown below.

Setup S3 Bucket in AWS for Containing Backup Data. I've named my bucket tci-wal-e-demo

With the IAM user and S3 bucket created I can configure WAL-E along with envdir back on the master server.

sudo mkdir -p /etc/wal-e/env
sudo chown -R postgres:postgres /etc/wal-e
sudo su - # become root user
su - postgres # become postgres user
cd /etc/wal-e/env
echo "PUT-YOUR-ACCESS-KEY-ID-HERE" > AWS_ACCESS_KEY_ID
echo "PUT-YOUR-ACCESS-KEY-SECRET-HERE" > AWS_SECRET_ACCESS_KEY
echo "YOUR-BUCKET-REGION" > AWS_REGION
echo "s3://YOUR-BUCKET-NAME/dbname" > WALE_S3_PREFIX

This is what the /etc/wal-e/env directory should look like now.

ls -l
total 16
-rw-r--r-- 1 postgres postgres 21 Oct 21 15:32 AWS_ACCESS_KEY_ID
-rw-r--r-- 1 postgres postgres 10 Oct 21 15:32 AWS_REGION
-rw-r--r-- 1 postgres postgres 41 Oct 21 15:32 AWS_SECRET_ACCESS_KEY
-rw-r--r-- 1 postgres postgres 20 Oct 21 15:32 WALE_S3_PREFIX

At this point I can configure the Master Postgres database instance to perform continuous WAL archiving and generate a Base Backup with WAL-E (run as postgres user)

First I update /var/lib/pgsql/data/postgresql.conf settings

wal_level = replica
archive_mode = on
archive_command = 'envdir /etc/wal-e/env wal-e wal-push %p'
archive_timeout = 60

This requires a restart of the postgresql service to take effect

systemctl restart postgresql

Now I push a base backup (as postgres user) utilizing the following wal-e command. However, I have found that in production systems it is best to wrap this command in a shell script and run the shell script with nohup as a background job because the process can take a fair amount of time and it's possible that the ssh connection to the server may fail.

envdir /etc/wal-e/env wal-e backup-push /var/lib/pgsql/data

That being said I create the following shell script named wal-e-push.sh with the following contents.

#!/bin/bash

echo "starting wal-e backup-push"

envdir /etc/wal-e/env wal-e backup-push /var/lib/pgsql/data

echo "wal-e backup-push complete"

Then make the file executable and kick it off using nuhup as so.

chmod +x wal-e-push.sh
nohup ./wal-e-push.sh &

I can then cat or tail the nohup.out script watching for the completed message to appear indicating the backup has been successfully pushed.

$ cat nohup.out
starting wal-e backup-push
wal_e.main   INFO     MSG: starting WAL-E
        DETAIL: The subcommand is "backup-push".
        STRUCTURED: time=2020-10-21T16:43:13.441974-00 pid=13164
wal_e.operator.backup INFO     MSG: start upload postgres version metadata
        DETAIL: Uploading to s3://tci-wal-e-demo/pgdb/basebackups_005/base_000000010000000000000002_00000040/extended_version.txt.
        STRUCTURED: time=2020-10-21T16:43:14.619558-00 pid=13164
wal_e.operator.backup INFO     MSG: postgres version metadata upload complete
        STRUCTURED: time=2020-10-21T16:43:14.675762-00 pid=13164
wal_e.worker.upload INFO     MSG: beginning volume compression
        DETAIL: Building volume 0.
        STRUCTURED: time=2020-10-21T16:43:14.764043-00 pid=13164
wal_e.worker.upload INFO     MSG: begin uploading a base backup volume
        DETAIL: Uploading to "s3://tci-wal-e-demo/pgdb/basebackups_005/base_000000010000000000000002_00000040/tar_partitions/part_00000000.tar.lzo".
        STRUCTURED: time=2020-10-21T16:43:15.193014-00 pid=13164
wal_e.worker.upload INFO     MSG: finish uploading a base backup volume
        DETAIL: Uploading to "s3://tci-wal-e-demo/pgdb/basebackups_005/base_000000010000000000000002_00000040/tar_partitions/part_00000000.tar.lzo" complete at 18202.3KiB/s.
        STRUCTURED: time=2020-10-21T16:43:15.520506-00 pid=13164
NOTICE:  pg_stop_backup complete, all required WAL segments have been archived
wal-e backup-push complete

Afterwards I verify backups are on S3 using the wal-e backup-list command like so.

$ envdir /etc/wal-e/env wal-e backup-list
wal_e.main   INFO     MSG: starting WAL-E
        DETAIL: The subcommand is "backup-list".
        STRUCTURED: time=2020-10-21T16:44:23.915389-00 pid=13218
name	last_modified	expanded_size_bytes	wal_segment_backup_start	wal_segment_offset_backup_start	wal_segment_backup_stop	wal_segment_offset_backup_stop
base_000000010000000000000002_00000040	2020-10-21T16:43:17.000Z		000000010000000000000002	00000040

Postgres Hot Standby Replica Setup

Ok now a separate Linux server (again I'm using the AWS Linux 2 AMI but Ubuntu / CentOS can be treated the same except where I mention otherwise) I can set it up as a log shipped replica in hot standby mode. Warm standby could also be done, the only difference is that hot standby allows for read only connections while warm standby don't allow client connections.

To start I update the server and install Postgres 11.

sudo yum update -y
sudo amazon-linux-extras install epel postgresql11 -y
sudo yum install postgresql-server postgresql-contrib -y

Again, see Developer's Guide to PostgreSQL on Linux: Installation for specifics on installing on Ubuntu or Centos

Install Python3, PIP, envdir and WAL-E. Note that in the replica I don't have to install the pv program since thats only required for generating the base backup in the master server.

sudo yum install python3 lzop -y
curl https://bootstrap.pypa.io/get-pip.py -o get-pip.py
sudo python3 get-pip.py
sudo python3 -m pip install envdir wal-e[aws]

Next I configure WAL-E in the same manner I did for the master.

sudo mkdir -p /etc/wal-e/env
sudo chown -R postgres:postgres /etc/wal-e
sudo su -
su - postgres
cd /etc/wal-e/env
echo "PUT-YOUR-ACCESS-KEY-ID-HERE" > AWS_ACCESS_KEY_ID
echo "PUT-YOUR-ACCESS-KEY-SECRET-HERE" > AWS_SECRET_ACCESS_KEY
echo "YOUR-BUCKET-REGION" > AWS_REGION
echo "s3://YOUR-BUCKET-NAME/dbname" > WALE_S3_PREFIX

Verify WAL-E is setup by listing available backups

envdir /etc/wal-e/env wal-e backup-list --detail

should see output like this

wal_e.main   INFO     MSG: starting WAL-E
        DETAIL: The subcommand is "backup-list".
        STRUCTURED: time=2020-10-21T17:12:48.605932-00 pid=12751
name	last_modified	expanded_size_bytes	wal_segment_backup_start	wal_segment_offset_backup_start	wal_segment_backup_stop	wal_segment_offset_backup_stop
base_000000010000000000000017_00000040	2020-10-20T05:28:04.000Z	33969406	000000010000000000000017	00000040	000000010000000000000017	00029456

For a RHEL based system the default Postgres Data Directory lives at /var/lib/pgsql/data and as you can see it is empty since Postgres has not yet been initialized or restored from an existing backup.

cd /var/lib/pgsql/data
ls -l
total 0

I temporarily initialize the Postgres Data Directory so I can gather the config files to be used after restoring the base physical backup.

pg_ctl init # run as postgres user
cd /var/lib/pgsql
cp /var/lib/pgsql/data/postgresql.conf .
cp /var/lib/pgsql/data/pg_hba.conf .
cp /var/lib/pgsql/data/pg_ident.conf .

Now I can nuke data directory since it will be replaced by the restored base backup.

rm -r /var/lib/pgsql/data/*

With the default data directory back to pristine, uninitialized condition, I fetch the base backup from S3 using WAL-E with the backup-fetch argument. However, for most production systems this command may take a while to execute since it's potentially pulling large amounts of data from S3 and extracting it into the specified directory I wrap the command in a shell script so I can run it with the nohup utility as a background job protecting against a SSH shell connection failure.

#!/bin/bash

echo "Starting wal-e backup-fetch"

envdir /etc/wal-e/env wal-e backup-fetch --blind-restore /var/lib/pgsql/data/ LATEST

echo "wal-e backup-fetch complete"

I execute the shell script using nohup as a background jobs like so.

nohup ./wal-e-fetch.sh &

I can then watch for the job to finish by using cat or tail on the nohup.out file watching for the "wal-e backup-fetch complete" message to appear telling me the backup has been successfully fetched and extracted.

As you can see the backup is now fully extracted into the specified data directory.

ls -l /var/lib/pgsql/data

Showing

-rw------- 1 postgres postgres  253 Oct 20 05:28 backup_label
drwx------ 6 postgres postgres   54 Oct 21 17:14 base
-rw------- 1 postgres postgres   30 Oct 20 05:26 current_logfiles
drwx------ 2 postgres postgres 4096 Oct 21 17:14 global
drwx------ 2 postgres postgres   32 Oct 21 17:14 log
drwx------ 2 postgres postgres    6 Oct 20 02:23 pg_commit_ts
drwx------ 2 postgres postgres    6 Oct 20 02:23 pg_dynshmem
drwx------ 4 postgres postgres   68 Oct 21 17:14 pg_logical
drwx------ 4 postgres postgres   36 Oct 21 17:14 pg_multixact
drwx------ 2 postgres postgres   18 Oct 21 17:14 pg_notify
drwx------ 2 postgres postgres    6 Oct 20 02:23 pg_replslot
drwx------ 2 postgres postgres    6 Oct 20 02:23 pg_serial
drwx------ 2 postgres postgres    6 Oct 20 02:23 pg_snapshots
drwx------ 2 postgres postgres    6 Oct 20 05:26 pg_stat
drwx------ 2 postgres postgres    6 Oct 20 05:27 pg_stat_tmp
drwx------ 2 postgres postgres   18 Oct 21 17:14 pg_subtrans
drwx------ 2 postgres postgres    6 Oct 20 02:23 pg_tblspc
drwx------ 2 postgres postgres    6 Oct 20 02:23 pg_twophase
-rw------- 1 postgres postgres    3 Oct 20 02:23 PG_VERSION
drwx------ 2 postgres postgres    6 Oct 20 05:28 pg_wal
drwx------ 2 postgres postgres   18 Oct 21 17:14 pg_xact
-rw------- 1 postgres postgres   88 Oct 20 02:23 postgresql.auto.conf

I now copy the previously collected default config files into the newly restored data directory.

cp /var/lib/pgsql/postgresql.conf data/
cp /var/lib/pgsql/pg_hba.conf data/
cp /var/lib/pgsql/pg_ident.conf data/
cd /var/lib/pgsql/data

In postgresql.conf I update the hot_standby config parameter to on like so.

hot_standby = on

Next I create a recovery.conf file in the data directory which specifies how Postgres should restore the database once started.

restore_command = 'envdir /etc/wal-e/env wal-e wal-fetch %f %p'
standby_mode = on

Start and enable Postgres via systemd service file (as root)

systemctl start postgresql
systemctl enable postgresql

I should now be able to compare queries on the master database to those on the hot standby and see the same (or nearly the same up to 60 seconds of lag).

Promoting Replica to Master

Additionally, if the master server were to crash I could promote the hot standby with the following command to become the master server enabling it to accept writes.

pg_ctl promote -D /var/lib/pgsql/data

Giving the following output.

waiting for server to promote....... done
server promoted

Conclusion

In this article I detailed how to establish a log shipped replica as a hot standby in a Master / Replica Postgres version 11 enviroment. I should say that I've tested and verfied this same procedure will work for versions 8.6, 9, 10, 11 and I presume it will work for version 12 also.

As always, thanks for reading and please do not hesitate to critique or comment below.

AWS AWS S3 PostgreSQL Databases Linux DevOps

Share with friends and colleagues

[[ likes ]] likes