Disclaimer: Proximity Corporation provided the information in this article and it was deemed accurate as of 30 May 2007. Apple Inc. is not responsible for the article's content. This article is provided as is and may or may not be updated in the future.
Important: If upgrading from a version of artbox prior to 3.2.6, make sure you follow all the upgrade notes for each intervening release.
As of version 1.0.6 of the pxpginstall package, there is built-in support for backup and point-in-time recovery (PITR) of the postgres database. Documentation of how it works follows.
As of version 1.0.9 it also supports failover to a standby database using PITR.
The backup system provided here lets you recover from losing the main database storage, from a severe database corruption, or even from a catastrophic user error. Its advantages over pg_dump are that you should be able to get back online in minutes rather than hours, and that you should be able to get back to a very recent state of the database, to within seconds before losing it, rather than only to the last time you did a dump.
The failover support provided here lets you set up a dedicated standby database server (the secondary), which continuously keeps itself up to date with the primary. The database on the secondary is not available for use while it is in standby mode.
If anything goes wrong with the primary, or it's necessary to shut it down for whatever reason, the secondary can be brought up and running in a state identical to the state the primary was in when it went down. The process of bringing the secondary up should be very quick (seconds after someone initiates it).
PITR (Point In Time Recovery) is a postgres feature which allows recovery of a database to a particular point in time by using the write-ahead log (WAL) files. The WAL keeps a record of every transaction. Its original purpose was robustness against power failures and the like: the database can ensure reliability by syncing just the one file (namely the current WAL file) to disk, rather than having to endure the cost of syncing all its open files to disk all the time. If a database goes down unexpectedly, for whatever reason, the next time it starts up it tries to recover by playing forward whatever WAL files it can find. The same mechanism can also be used to recover in other circumstances, which is what allows us to do both backups and failover.
The configuration of both backup and failover is set up in the file /var/lib/pgsql/pitr.conf.
An example of its contents follows:
# $Id: pitr.conf,v 1.1 2005/09/09 05:52:27 tim Exp $
# A set of variables used to configure the use of PITR for
# backup and/or failover
# This file consists of a set of variable assignments, of the form
# <variable>=<value>
# note the lack of space around the equals sign
# do we want pitr backup enabled? 1 or 0
PX_DB_BACKUP_ENABLE=1
# path to archive, where we put the backup wal files
PX_DB_ARCHIVE_PATH=/var/lib/pgsql/backups
# path of the directory above the database directory
PX_DB_DATA_PATH=/var/lib/pgsql
# name of the database directory
PX_DB_DATA_DIR=data
# do we want failover enabled? 1 or 0
PX_DB_FAILOVER_ENABLE=1
# where we (the primary) write wal files so they can be fed to the secondary
PX_DB_FAILOVER_WRITE_PATH=/var/lib/pgsql/failover_write
# where we read wal files from if we are the secondary
PX_DB_FAILOVER_READ_PATH=/var/lib/pgsql/failover_read
The comments in the file are hopefully reasonably self-explanatory. Note that backup and failover can each be enabled or disabled independently; set the corresponding variable to 0 to disable it.
For backup, the location to store backup files can be altered from the default if desired, for example if you want to put the files on some remote networked storage.
For failover, both primary and secondary machines need a version of this file. PX_DB_FAILOVER_WRITE_PATH is where the primary will put its failover files. PX_DB_FAILOVER_READ_PATH is that same directory, as seen from the secondary's point of view; the secondary will read failover files from that directory.
PITR is enabled using the archive_command postgres configuration setting. If no form of PITR is being used, that should be set to archive_command='' in postgresql.conf.
To enable it, set it to archive_command = '/usr/sbin/pitr_archive_wal.rb %p %f'
Restarting the database server is required after changing this setting (which of course also requires shutting down artbox in the meantime).
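For example, one plausible sequence, using the service commands that appear elsewhere in this article (an artbox start counterpart to the artbox stop command is assumed here):

artbox stop
service postgresql stop
service postgresql start
artbox start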
As it is now possible to disable backup and/or failover in the pitr.conf file, you could keep PITR enabled in the archive_command setting as above, but disable both backup and failover in pitr.conf; this wouldn't be too bad, it would just mean that the pitr_archive_wal.rb script would be run every now and then and would do nothing.
To enable backup, enable pitr as above and set PX_DB_BACKUP_ENABLE=1 in the pitr.conf file. To enable failover, set PX_DB_FAILOVER_ENABLE=1 in the pitr.conf file. Set them to 0 to disable them.
Periodically, for example via a nightly cron job, the postgres data directories should be backed up. The script /usr/sbin/pitr_backup.sh, when run as user postgres, will do this job. Sometimes tar will emit complaints to the effect that some file or other changed while it was reading it; this is entirely expected.
This script will always make the backup snapshot (even if backups are disabled). If failover is enabled, it will also copy the snapshot to the failover directory. In other words, if you want failover but don't want backups, you have to run this script anyway, and you will get backup snapshots made.
Old backup files can be purged using /usr/sbin/pitr_purge.rb, which should also be run from cron. It accepts an optional argument: the number of days of data to keep. If you don't specify one, it defaults to seven, so anything older than seven days is thrown out. This script can be run either as root or as postgres.
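For example, a hypothetical crontab for the postgres user (the times and the 14-day retention period here are illustrative, not recommendations):

0 2 * * * /usr/sbin/pitr_backup.sh
30 3 * * * /usr/sbin/pitr_purge.rb 14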
Thought needs to be given to how long files should be kept. If the ability to restore to some very old state is required, then separate provision needs to be made to copy selected snapshots somewhere they will be archived.
If you had to restore from a backup, do the following. First, if the existing database is still actually running, shut it down with the command:
artbox stop
service postgresql stop
Next, have a look in /var/lib/pgsql/backups/data:
ls -lrt /var/lib/pgsql/backups/data
What you should see in this directory is one or more tarballs, with filenames of the form YYYYMMDDhhmm.tar.gz
The YYYYMMDDhhmm bit is the backup label. Find the most recent one, and use its backup label to run /usr/sbin/pitr_restore.sh <backup label> as the user postgres. You can expect this to take a few minutes.
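For example, if the most recent tarball were 200705290215.tar.gz (an illustrative label), you would run, as user postgres:

/usr/sbin/pitr_restore.sh 200705290215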
Once it has finished, you should be able to restart the database server (service postgresql start) and then restart artbox.
Note: It helps to have plenty of disk space available when restoring, since an extra copy of the database may be created (the old data directory is moved aside, not deleted).
Artbox keeps the backups in a peer directory to the database data directory, /var/lib/pgsql/backups. Data snapshots are in /var/lib/pgsql/backups/data and archived write-ahead log files are in /var/lib/pgsql/backups/wal. At the time artbox takes a snapshot, it grabs any required WAL files that might not have been archived yet and puts them in /var/lib/pgsql/backups/pending_wal.
The pitr_purge script throws out any pending_wal files that are already in wal, as well as throwing out anything older than the specified or default number of days.
The restore script moves aside the existing database data directory, if any, and recreates it by untarring the tarball from /var/lib/pgsql/backups/data. The archived WAL files are moved back during database startup. The expected result is that the database is restored to its state at the time of the last archived WAL segment, i.e. the most recent file in /var/lib/pgsql/backups/wal.
The restore script also tries to use any WAL files remaining from the old installation from just before the database went down. If these files are known to be damaged, the old data directory should be removed or moved aside before restoring.
The file /var/lib/pgsql/pitr.conf specifies the location of the backup archive. This could be set to somewhere else by editing this file.
The backup system described here is short-term and has no built-in allowance for off-site backups, which means that by itself it is not a complete solution for securing a customer's data. What is needed in addition is to periodically make copies of the contents of /var/lib/pgsql/backups and store those copies remotely, or on tape. This could be done on any sort of schedule: the more often it's done, the better the ability to recover to a recent state of the database.
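As a minimal sketch of this periodic-copy approach, assuming a remote host named backuphost reachable over ssh by the postgres user (the host name and destination path are hypothetical), a cron job could run something like:

rsync -a /var/lib/pgsql/backups/ postgres@backuphost:/srv/artbox-backups/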
Another option would be to make clever use of some remote filesystem and symlinks, and have the /var/lib/pgsql/backups directory (or another directory, if you've edited pitr.conf) actually be on a remote system.
If the original server is itself unrecoverable, you can restore on a similarly configured machine running a version of postgres that is binary-compatible with the one that was running on the original machine. Create a /var/lib/pgsql/backups directory there and restore into it the most recent copy you have of that directory.
Then do the restore process as described above.
A limitation of postgres that is due to be removed in version 8.1 is that GiST indexes (which include tsearch2 indexes) are not covered by the write-ahead log. This means that restoring the database as described above will not by itself restore the state of the text index; an extra step is required. After following the above instructions for restoring, start up:
psql px pxdb
and then run:
reindex table pxentity;
\q
Note that in PostgreSQL 8.1 this step is not required.
You need to choose some shared directory, which can be accessed by both primary and secondary, on a partition which has a reasonable amount of free disk space. There are various ways to achieve a shared directory, such as NFS or shared arrays, but it's probably a good idea to make sure the directory is not on the local disk of the primary, since one of the things we want to be able to recover from is loss of the primary's disks.
You also need to ensure that both primary and secondary can access the directory, and that user and group IDs are consistent enough between the two machines that the same user/group will be identified on both. The database files normally belong to the user postgres, group postgres, so as long as you ensure that this user exists on both machines with the same user number, and that the sharing is set up to let this user access the directory, you should be fine.
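For example, if the shared directory were an NFS export, a hypothetical /etc/fstab entry on each machine (the server name and paths here are illustrative) might look like:

nfsserver:/exports/pgfailover  /mnt/pgfailover  nfs  defaults  0 0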
With this directory chosen, enable failover in the pitr.conf and set PX_DB_FAILOVER_WRITE_PATH to point to it. Then make sure that the usual pitr_backup.sh script has been run at least once (as user postgres) since you enabled failover (see the Performing a Backup Snapshot section above).
Once you have the primary database running, with failover enabled, you then need to set the secondary going. Edit the pitr.conf file on the secondary to set PX_DB_FAILOVER_READ_PATH to point to the shared directory discussed above.
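For example, if the shared directory were mounted at the hypothetical /mnt/pgfailover on both machines, the relevant pitr.conf lines would be, on the primary:

PX_DB_FAILOVER_ENABLE=1
PX_DB_FAILOVER_WRITE_PATH=/mnt/pgfailover

and on the secondary:

PX_DB_FAILOVER_ENABLE=1
PX_DB_FAILOVER_READ_PATH=/mnt/pgfailover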
The secondary needs to already have a /var/lib/pgsql directory, since that's where the pitr.conf file is read from. This directory should be owned by postgres:postgres. You don't need a data directory there to start with, although the postgres rpm quite likely made one for you at installation time. It doesn't hurt if there is one there, as the standby script will move it aside.
The version of PostgreSQL on the secondary must match that of the primary in at least the first two numbers: for example, 8.1.2 on the primary and 8.1.3 on the secondary is okay, but 8.1.3 on the primary and 8.0.3 on the secondary is not. The hardware architectures on the two machines must also match; for example, Intel to PowerPC won't work. This is the same criterion as for whether you can upgrade a database without needing a dump/restore. The best option, of course, is to have identical versions of postgres on the two machines.
If any database is running on the secondary, stop it now using: service postgresql stop.
Look in the failover read directory, in the snapshot subdirectory. You should see a set of files with filenames of the form YYYYMMDDhhmm.tar.gz.
The YYYYMMDDhhmm bit is the backup label. Choose the most recent file, and use its backup label in the following command as user postgres:
/usr/sbin/pitr_failover_start_standby <backup label>
Once this completes, start up the database using: service postgresql start.
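Putting these two steps together, with an illustrative backup label of 200705290215 (run the first command as user postgres):

/usr/sbin/pitr_failover_start_standby 200705290215
service postgresql start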
From this point on, the database will be in recovery mode indefinitely. Every time the primary finishes a WAL file, it will be fed to the secondary, so the secondary will bring itself up to date. When no WAL file is available, the secondary just waits for it. The database on the secondary will refuse all connection requests in this state.
If anything goes wrong with the secondary, or you need to stop it for some reason, you certainly can. Just start again at the beginning of this section. There is no particular point to restarting the standby periodically, but if you need to you can restart it from any more recent snapshot.
Should you need to bring the secondary database up into production then there are two paths to follow. In the case where you are carefully and deliberately bringing the primary down, under control, then you should run the following on the primary, as user postgres, after stopping its database:
/usr/sbin/pitr_finalise_archive.rb
This ensures that the most recent WAL file, which may not have been completed, gets copied to the failover directory. The result of this should be that the secondary will notice it has an incomplete WAL file, and complete the recovery. If it hasn't come up after a minute or two, you could try running the command mentioned below for the uncontrolled failover case, since it seems conceivable that if you were incredibly unlucky with your timing, the last WAL file copied across by the above might actually be complete.
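Putting the controlled case together, one plausible sequence is to stop artbox and the database first (using the stop commands shown earlier in this article), then run the finalise script as user postgres:

artbox stop
service postgresql stop
/usr/sbin/pitr_finalise_archive.rb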
In the uncontrolled case, where the primary has simply been lost, you proceed to bring the secondary up as best it can. This means the database will not be completely up to date, as the last WAL file might not be available; how far behind it is depends on how active the database has been recently.
You bring it up by running the following on the secondary, as user postgres:
/usr/sbin/do_pitr_failover.sh
What this does is write a file named FAILOVER_NOW into the failover read directory. The restore script will recognize this as the signal to stop the recovery, and bring the database up fully. The database should come up quite quickly.
Once you've followed either of the above two approaches, verify that the database is up and running by attempting to connect to it using:
psql px pxdb
As soon as you can do this, the database is ready for use, and you can point the artbox storeserver and the order management system at it, for example by reconfiguring them.
Once your database is running again, you need to start planning for the future. Your former secondary is now your primary. Assuming that either the old primary or its replacement is going to be brought back online at some point, you will want that box to become the secondary, so you'll need to start again at the "Setting up a failover primary" section and go from there. The failover directory will need to be cleaned up before you can properly re-use it; it's probably a good idea to use a completely different directory for failover from now on. At the least, you need to remove the FAILOVER_NOW file.
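For example, continuing the hypothetical shared directory from earlier:

rm /mnt/pgfailover/FAILOVER_NOW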
From this point on, either the old secondary and primary will reverse roles forevermore, or you will revert the old secondary to being a secondary again, which means you have to do this failover process one more time: get the old primary or its replacement up to date with the new primary, and then fail over to it so it becomes the primary again.
A common scenario is that a postgres database is installed in the usual /var/lib/pgsql, but the data subdirectory is a symlink to some other drive or partition. You can handle that case with careful configuration. In the pitr.conf file (which is always in /var/lib/pgsql), set PX_DB_DATA_PATH to point to the directory the database actually lives in, i.e. the direct, non-symlinked path to the data. This still needs to be the directory above the actual data directory. The PX_DB_DATA_DIR variable still needs to be the name of the directory within PX_DB_DATA_PATH that contains the database. It is important, by the way, for this directory name to be the same on both primary and secondary.
For example, if you have a big disk mounted at /var/big/disk, the actual database data directory is at /var/big/disk/data, and you have /var/lib/pgsql/data symlinked to /var/big/disk/data, then as long as you set PX_DB_DATA_PATH to /var/big/disk, everything should work.
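In pitr.conf, that example corresponds to:

PX_DB_DATA_PATH=/var/big/disk
PX_DB_DATA_DIR=data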
Note: If the actual data directory is on some shared array, such as with a SAN, make sure that the two database servers do not use the same data directory.