Network Server: AIX System Dumps

This article describes AIX system dumps on the Network Server 500 and Network Server 700.
AIX generates a "system dump" when a severe error occurs, such as a kernel panic or hardware failure. Users with root privileges can initiate system dumps as well. A system dump creates a picture of your system's memory contents. System administrators and programmers can generate a dump and analyze its contents when debugging new applications, device drivers, and other kernel extensions. In addition, a system dump may be initiated from the keyboard when the keyswitch is in the service position, using the key sequence CTRL-Option-NumPad1. This may be useful when the system appears to be hung (of course, the system must be responsive enough to handle keyboard input).

If your system stops with an "888" number flashing in the LCD display, the system has generated a dump and saved it to a dump device. Your dump device holds the information that a system dump generates, whether generated by a system or a user. You can copy this information to tape and deliver the data to your service provider for analysis.

NOTE: The system cannot recover from a dump. You must restart the server after a dump has been taken.

Dump Device Configuration
-------------------------
By default, the system dump will be placed on the system paging space, /dev/hd6. You can check the dump settings by using the command "sysdumpdev". The following are the default settings:
# sysdumpdev -l
primary              /dev/hd6
secondary            /dev/sysdumpnull
copy directory       /var/adm/ras
forced copy flag     TRUE
always allow dump    FALSE


Although it is not recommended, you can change the dump device and other designations using the "sysdumpdev" command. Since a system dump may occur at any time, any other dump device must be dedicated to this purpose. The system paging space is an ideal dump device: it is not needed after the dump takes place, yet the space is still available for the system to use at other times. System installation should correctly configure the size of the paging space to be at least as large as system memory (since the system dump is a snapshot of system memory, this is the worst-case scenario). To check the size of the system paging space, execute the following command:
# lsps -s
Total Paging Space   Percent Used
      128MB               27%
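
The sizing rule described above (paging space at least as large as system memory) can be checked with a small script. The following is a minimal sketch, assuming the data line of "lsps -s" begins with a value such as 128MB. The function name paging_covers_memory is hypothetical, and the memory size is passed in by hand rather than queried from the system.

```shell
# Hypothetical helper (not part of AIX): reads `lsps -s` output on stdin and
# compares the total paging space with the memory size given as $1, in MB.
# ASSUMPTION: the data line starts with a value of the form "<digits>MB".
paging_covers_memory() {
    mem_mb=$1
    # Take the first field of the first line that looks like "<digits>MB".
    paging_mb=$(awk '$1 ~ /^[0-9]+MB$/ { sub(/MB$/, "", $1); print $1; exit }')
    if [ -z "$paging_mb" ]; then
        echo "could not parse lsps output" >&2
        return 2
    fi
    if [ "$paging_mb" -ge "$mem_mb" ]; then
        echo "ok: paging space ${paging_mb}MB >= memory ${mem_mb}MB"
    else
        echo "too small: paging space ${paging_mb}MB < memory ${mem_mb}MB"
        return 1
    fi
}
```

On a live system this might be run as "lsps -s | paging_covers_memory 64", where 64 stands for the installed memory in MB.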


You can get an estimate of the system dump size using the -e option to sysdumpdev.
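
To compare that estimate with the paging space size reported in MB, the byte count can be converted. The following is a rough sketch. It assumes that "sysdumpdev -e" prints a single line ending in the size in bytes; the exact message text shown in the comment is an assumption and may differ on your AIX release. The function name dump_estimate_mb is hypothetical.

```shell
# Hypothetical helper (not part of AIX): reads `sysdumpdev -e` output on
# stdin and prints the estimate rounded up to whole megabytes.
# ASSUMPTION: the output ends with the size in bytes, for example
#   "0453-041 Estimated dump size in bytes: 9437184"
dump_estimate_mb() {
    awk '$NF ~ /^[0-9]+$/ {
        # Round up so a partial megabyte still counts as a full one.
        printf "%d\n", int(($NF + 1048575) / 1048576)
        exit
    }'
}
```

On a live system this might be run as "sysdumpdev -e | dump_estimate_mb".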

NOTE: You cannot permanently set the dump device to a logical volume not in the rootvg volume group. During boot time, when the dump device is configured, only the root volume group is accessible.

System Dump Recovery
--------------------
While a system dump is taking place, status codes are displayed in the LCD display -- see System Dump LCD Status Codes below. When the dump is complete, a 0c0 status code displays if the dump was user-initiated; a flashing 888 displays if the dump was system-initiated. After the dump is complete, the system halts. You must restart the system to recover the data from the dump device. During system startup, the dump data is copied to the directory designated as the copy directory, by default /var/adm/ras; the filename will be "vmcore.#", where # increases with the number of active dumps.
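
After a restart, the copied dump can be located by its numeric suffix. The following is a minimal sketch, assuming the highest-numbered vmcore file is the most recent one. The function name latest_vmcore is hypothetical.

```shell
# Hypothetical helper (not part of AIX): prints the highest-numbered
# vmcore.N file in the directory given as $1. The default, /var/adm/ras,
# is the default copy directory.
latest_vmcore() {
    dir=${1:-/var/adm/ras}
    best=
    best_n=-1
    for f in "$dir"/vmcore.*; do
        [ -f "$f" ] || continue          # skip the unexpanded pattern
        n=${f##*.}                       # text after the last dot
        case $n in
            ''|*[!0-9]*) continue ;;     # ignore non-numeric suffixes
        esac
        if [ "$n" -gt "$best_n" ]; then
            best_n=$n
            best=$f
        fi
    done
    # Returns non-zero when no numbered vmcore file was found.
    [ -n "$best" ] && echo "$best"
}
```

The suffixes are compared as numbers, so vmcore.10 ranks above vmcore.2, which a plain alphabetical listing would get wrong.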

If there is insufficient space in the designated filesystem, the system will prompt you for removable media (such as a tape) on which to copy the dump data. After system startup has completed, copy the dump from the external media to a filesystem that has sufficient free space. Enter the following to copy the dump from /dev/rmt0, the default tape device:

# tar -x

After rebooting in normal mode and copying the system dump from tape if necessary, use the snap command to copy the dump data and other system configuration information to a blank tape for delivery to your service provider.

# snap -gfkD -o /dev/rmt0

If you recovered the system dump from tape, you must now append the dump data to the tar archive created by the snap command. The snap command will have failed to find the dump data in the copy directory.

# tar -r dump_file

NOTE: The AIX documentation describes two uses of the reset button related to system dumps which do not apply to Apple hardware. On some IBM systems, if the keyswitch is in the service position, the user may initiate a system dump by pressing the reset button. After a system-initiated dump, on some IBM systems, the user may access additional status codes by repeatedly pressing the reset button.

NOTE: If the dump fails and upon reboot you see an error log entry with the label "DSI_PROC" or "ISI_PROC", and the Detailed Data area shows an "EXVAL" of "00000005", this is probably a paging space I/O error. If the paging space is the dump device or on the same hard drive as the dump device, your dump may have failed due to a problem with that hard drive. You should run diagnostics against that disk.

NOTE: AIX supports designating a tape device as the primary dump device, but that capability does not currently work on the Apple Network Server.

System Dump LCD Status Codes
----------------------------
The system LCD may display the following three-digit codes during a system dump.

0c0 - The dump completed successfully. Send the dump data to your service provider.

0c2 - A system-initiated or user-requested dump is not finished. Wait one minute for the dump to complete and for the operator panel display value to change. If the operator panel display value changes, find the new value on this list. If the value does not change, then the dump did not complete due to an unexpected error. Report the problem to your service provider.

0c3 - The dump is inhibited.

0c4 - The dump did not complete successfully. A partial dump was written to the dump device, but there is not enough space on the dump device to contain the entire dump. To prevent this problem from occurring again, you must increase the size of your dump device.

0c5 - A system-initiated or user-requested dump did not complete. Wait one minute for the dump to complete and for the operator panel display value to change. If the operator panel display value changes, find the new value on this list. If the value does not change, then the dump did not complete due to an unexpected error. Report the problem to your service provider.

0c7 - A network dump is in progress, and the host is waiting for the server to respond. The value in the operator panel display should alternate between 0c7 and 0c2 or 0c9. If the value does not change, then the dump did not complete due to an unexpected error. Report the problem to your service provider.

0c8 - The dump device has been disabled. The current system configuration does not designate a device for the requested dump. Use the sysdumpdev command to configure the dump device.

0c9 - A dump started by the system did not complete. Wait one minute for the dump to complete and for the operator panel display value to change. If the operator panel display value changes, find the new value on this list. If the value does not change, then the dump did not complete due to an unexpected error.
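
For quick reference, the codes listed above can be kept in a small lookup script. The following is only a sketch. The function name lcd_dump_status is hypothetical, and the one-line summaries are condensed from the descriptions in this list rather than taken from AIX itself.

```shell
# Hypothetical helper (not part of AIX): prints a one-line summary for a
# dump-related LCD code, condensed from the list in this article.
lcd_dump_status() {
    case $1 in
        0c0) echo "dump completed successfully" ;;
        0c2) echo "dump not finished; wait for the display to change" ;;
        0c3) echo "dump is inhibited" ;;
        0c4) echo "partial dump; dump device too small" ;;
        0c5) echo "dump did not complete" ;;
        0c7) echo "network dump in progress; waiting for the server" ;;
        0c8) echo "dump device disabled; configure one with sysdumpdev" ;;
        0c9) echo "system-started dump did not complete" ;;
        *)   echo "not a dump status code in this list: $1" >&2
             return 1 ;;
    esac
}
```

For example, "lcd_dump_status 0c4" prints the summary for a partial dump, and any code not in the list returns a non-zero status.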
Published Date: Feb 18, 2012