Artbox Failover Solution

Disclaimer: Proximity Corporation provided the information in this article and it was deemed accurate as of 30 May 2007. Apple Inc. is not responsible for the article's content. This article is provided as is and may or may not be updated in the future.

Highly Available Shared Disk Solution

This solution provides a highly available Artbox at relatively low cost. Two Dell PowerEdge 2850 servers [1] are connected to a shared SCSI disk array (Dell PowerVault 220 [1]). In terms of this solution, "highly available" specifically means that if one of the PowerEdge 2850s becomes disconnected from the shared storage (be it a hardware or software issue), the other box will automatically take over the load.

For the shared storage, direct-attached SCSI was chosen as it offered the best price and performance for a two-box solution. Network Attached Storage (NAS) is not an option for this solution, as it would not be able to offer the performance required for the database. Fibre Channel solutions, whilst they provide more flexibility, are prohibitively expensive and would not fit within some organizations' price constraints.

Hardware

Software

Configuration Notes

Limitations

Installation

  1. Install Red Hat ES [1] on each of the servers. Update with the latest patches if possible. Partition the RAID 1 internal system disks as you see fit. Give each one a fixed IP address and a DNS entry.
  2. Install artbox on each machine, as you normally would on a single box solution. Use the latest version available.
  3. Disable all artbox components from starting up on the default runlevel. For example:

    /sbin/chkconfig artbox off
    

    This needs to be done for the following init scripts:

    artbox  artbox-clipcopy  artbox-copyd  artbox-lds  artbox-newsroomd auto_pg_autovacuum postgresql
    

    Stop any of these services that are currently running.
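
    For example, a minimal sketch that disables and stops all of the listed services in one pass:

     for svc in artbox artbox-clipcopy artbox-copyd artbox-lds \
                artbox-newsroomd auto_pg_autovacuum postgresql; do
         /sbin/chkconfig $svc off
         /etc/init.d/$svc stop
     done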

  4. Attach a dongle to each box. The pxlicensed service must be running on each box, and each box will require a separate license.
  5. Configure one of the RAID cards to have a different ID. Connect the PowerVault to the two PowerEdges. From the primary box, create 4 partitions (see the sketch after the list below):
    • 2 x 10 MB; no filesystem required (they are RAW, for quorum)
    • 1 x 30 GB; format as ext3 (for DB)
    • 1 partition with the rest of the hard drive capacity; format as ext3 (for data)
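
    A minimal sketch of the partitioning and formatting, assuming the shared array appears
    as /dev/sdb and the DB and data partitions end up as /dev/sdb3 and /dev/sdb4 (device
    and partition names may differ on your hardware):

     # create the four partitions interactively (2 x 10 MB, 1 x 30 GB, 1 x remainder)
     fdisk /dev/sdb

     # format only the DB and data partitions; the two small quorum
     # partitions stay raw
     mkfs.ext3 /dev/sdb3
     mkfs.ext3 /dev/sdb4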
  6. Mount the DB and data partitions somewhere on the first box. Write some data to them to check, then unmount. Do the same on the second box; you should see the previously written data. The partitions must never be mounted on both boxes at the same time, otherwise data corruption will occur.
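
    For example, using temporary mount points and the partition names assumed above:

     # on the primary box
     mkdir -p /tmp/db /tmp/data
     mount /dev/sdb3 /tmp/db
     mount /dev/sdb4 /tmp/data
     touch /tmp/db/testfile /tmp/data/testfile
     umount /tmp/db /tmp/data

     # repeat the mount on the second box; testfile should be visible there,
     # then unmount again before continuing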
  7. Make a directory on each box for both the DB partition and the Data partition. An example:

       /mnt/data   (for data partition)
       /mnt/db     (for database partition)
    

  8. On the primary box, mount the database partition on the directory created for it. Move the database to it:

       mv /var/lib/pgsql /mnt/db/
    

    Make a symlink from the new location to the old:

       ln -s /mnt/db/pgsql /var/lib/pgsql
    

    Unmount the DB partition.

    On the secondary box, remove /var/lib/pgsql and make the same symlink. Mounting the DB partition is not required there.
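
    Putting the step together, a minimal sketch (assuming the DB partition is /dev/sdb3,
    as in the earlier sketch):

     # primary box
     mount /dev/sdb3 /mnt/db
     mv /var/lib/pgsql /mnt/db/
     ln -s /mnt/db/pgsql /var/lib/pgsql
     umount /mnt/db

     # secondary box (no mount required)
     rm -rf /var/lib/pgsql
     ln -s /mnt/db/pgsql /var/lib/pgsql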

  9. Subscribe to, download and install the Red Hat Cluster Manager [1] RPMs. These should be the following (the version numbers might be different):
    • clumanager-1.2.22-2.i386.rpm
    • piranha-0.7.10-2.i386.rpm
    • rh-cs-en-3-2.noarch.rpm
    • ipvsadm-1.21-9.ipvs108.i386.rpm
    • redhat-config-cluster-1.0.3-1.noarch.rpm
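
    Once downloaded, the packages can be installed together, for example:

     rpm -Uvh clumanager-*.rpm piranha-*.rpm rh-cs-en-*.rpm \
              ipvsadm-*.rpm redhat-config-cluster-*.rpm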
  10. Now would be a good time to read the cluster manager documentation that was installed. It can be found here:

     /usr/share/doc/rh-cs-en-3
    

  11. As per the documentation, create the raw devices file and start the service. This needs to be done on each box. As an example, here is the file from the prototype Sydney setup (the partitions will vary):

     [root@artbox02 root]# cat /etc/sysconfig/rawdevices
     # raw device bindings
     # format:  <rawdev> <major> <minor>
     #          <rawdev> <blockdev>
     # example: /dev/raw/raw1 /dev/sda1
     #          /dev/raw/raw2 8 5
     /dev/raw/raw1   /dev/sdb1
     /dev/raw/raw2   /dev/sdb2
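
    To bind the raw devices and have the bindings persist across reboots, the standard
    Red Hat rawdevices service can be enabled and started on each box:

     /sbin/chkconfig rawdevices on
     /etc/init.d/rawdevices start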
    

  12. Now configure the cluster. The following files can be used as a base (the username/password is the default xenobox pair):

    artbox02.proximity.com.au:/etc/clustermanager.xml
    artbox02.proximity.com.au:/etc/samba/smb.conf.media
    

    Get each of those files (see the example at the end of this step). In clustermanager.xml, you will need to modify the following two lines:

    <cluquorumd loglevel="6" pinginterval="" tiebreaker_ip="192.168.128.2"/>
    

    Replace the tiebreaker_ip with the IP address of hades for testing.

    <service_ipaddress broadcast="192.168.128.255" id="0" ipaddress="192.168.128.217" netmask="255.255.255.0"/>

    Replace the ipaddress with a spare IP address. This is the virtual address that will be used to contact Artbox and the Samba service.

    In the smb.conf.media file, modify as appropriate for the setup (the 'path' might be the only variable that requires changing for testing).
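
    One possible way to fetch the two base files from artbox02 (assuming root SSH access) is:

     scp root@artbox02.proximity.com.au:/etc/clustermanager.xml /etc/
     scp root@artbox02.proximity.com.au:/etc/samba/smb.conf.media /etc/samba/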

  13. Edit /usr/artbox/conf/artbox.conf on each machine. Modify the following lines:

     LDS_ADDRESS={whatever is the HA virtual address}
     PX_LICENCE_SERVER={the local machine}
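
    As an illustration only, using the virtual service address from step 12; the local
    hostname artbox01 is hypothetical and should be replaced with this machine's own name:

     LDS_ADDRESS=192.168.128.217
     PX_LICENCE_SERVER=artbox01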
    

  14. Start the cluster manager on each box:

     /etc/init.d/clumanager start
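
    If the cluster manager should also start automatically at boot (not covered above,
    but a common choice), it can be enabled with:

     /sbin/chkconfig clumanager on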
    

Monitoring

To monitor the system, and the states of the various managed processes, the following X tool can be used:

redhat-config-cluster

This can be run on either the primary or the secondary machine. As it is an X program, it can also be run remotely, for example from a Mac OS X desktop.
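
For example, one way to run it remotely over SSH with X11 forwarding (the hostname below is illustrative):

  ssh -X root@artbox01 redhat-config-cluster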

Testing

Node disconnection from the HA cluster can be simulated by shutting down a node or by pulling the network cable from it. In the case of the latter, the HA software realizes that it cannot contact the tiebreaker IP and does a hard, forced reboot. In both cases, the secondary node should take over within a few seconds. Either method can, and should, be tried.
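
A rough sketch of both simulation methods (run on the node being failed; the interface name is an assumption, and downing it only approximates pulling the cable):

  # clean shutdown of the node
  /sbin/shutdown -h now

  # or approximate a network failure by downing the interface
  /sbin/ifdown eth0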

The following should be tested when a node disconnection is simulated:

  1. Searching assets (click on 'search' every few seconds). There should be a period of about 10-15 seconds when 'connection refused' is reported by the Java UI, but after that searching should work as before.
  2. Uploading to a device.
  3. Copying to the Samba share (into a watch folder).
  4. Copying from the Samba share.
  5. Copying from one device to another in the middle of a scan when a number of jobs are 'queued.'

Artbox will not put jobs that are in the RUN or WAIT state into the FAIL state automatically. The consequence is that jobs that were in the RUN or WAIT state prior to the failover cannot be retried automatically.

Upgrading

This process is a little more complex than a usual upgrade, as both nodes need to be upgraded.

  1. Use the 'redhat-config-cluster' program to stop all of the 'Artbox' processes. Everything except Postgresql and auto_pg_autovacuum should be stopped.
  2. On the primary node, upgrade the artbox rpms as normal. Note that postgres should be running on this box.
  3. Using the 'redhat-config-cluster' program, start all of the previously stopped processes.
  4. On the secondary box, there should be no artbox or postgres processes running. Upgrade the rpms as normal. The artbox rpm will print a number of errors to the effect that it is unable to connect to the database; these can safely be ignored. The software should now be up to date.
  5. Test the HA by intentionally failing the primary node.

1. Important: Mention of third-party websites and products is for informational purposes only and constitutes neither an endorsement nor a recommendation. Apple assumes no responsibility with regard to the selection, performance or use of information or products found at third-party websites. Apple provides this only as a convenience to our users. Apple has not tested the information found on these sites and makes no representations regarding its accuracy or reliability. There are risks inherent in the use of any information or products found on the Internet, and Apple assumes no responsibility in this regard. Please understand that a third-party site is independent from Apple and that Apple has no control over the content on that website. Please contact the vendor for additional information.

Published Date: Feb 20, 2012