Please keep in mind that performance tuning is an art, and at present we
do not have any numbers as to what constitutes healthy performance on the
Network Servers. We will have better information after we and our
customers have spent considerable time exercising these boxes.
Also, consider that any good UNIX tuning guide runs between 200 and 400 pages, and IBM offers a three- to four-day class in this area. The InfoExplorer List of Books also contains "AIX Versions 4.1 Problem Solving Guide and Reference." This FAQ is only a brief introduction.
Four Areas To Check
===================
If an AIX system appears to be slow, there are four general areas that need to be examined over time before making any suggestions to improve performance: CPU Usage, Memory Usage, Disk and Local Peripheral I/O Performance, and Network Performance.
CPU Usage
"ps aux" will show you the memory usage of processes presently running.
# ps aux
USER        PID %CPU %MEM   SZ  RSS    TTY STAT    STIME     TIME COMMAND
root        516 78.1  0.0    0    4      - A      Apr 08  4617:22 kproc
root       3254 19.7  7.0 1756 1756   rcm0 A      Apr 08  1165:17 /usr/lpp/X11/bin
The columns of interest are SZ and RSS. Processes in UNIX consist of text (code), data, and stack segments. SZ is a measure of the virtual memory allocated for the data and stack segments of a running process, plus the text segment if it is not shared code. RSS is a measure of the actual memory allocated to a process. Processes that are using a large percentage of the available memory might be candidates for program optimization, or might be jobs that could be run when system use is low by using the cron or batch facilities. Also, nice or renice could be used to lower these processes' priorities.
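For example, a job could be started at a reduced priority with nice, or a running process demoted with renice. The PID and command below are placeholders only; see the nice and renice man pages for the exact syntax on your release:

# nice -n 10 /usr/local/bin/bigjob &
# renice -n 5 -p 3254

A larger nice value means a lower scheduling priority; only root can raise a process's priority back up.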
"iostat" can tell you in general if CPU usage is high. If it is, sar -q will show you the run queue size under the heading runq-sz
# sar -q 1 2

AIX einstein 1 4 002AC5884C00    04/12/96

20:25:40 runq-sz %runocc swpq-sz %swpocc
20:25:41     3.0     100     5.0     100
20:25:42     1.0     100     6.0     100

Average      2.0      99     5.5      99
"w" can tell you the load average, a count of the run queue size averaged over roughly the last 1, 5, and 15 minutes. It can indicate whether the CPU can handle the number of processes that are attempting to run at any one time. If it is too high, some jobs may be good candidates to be run at times when system use is low, by using the cron or batch facilities. Also, nice or renice could be used to lower their priority.
# w
  08:26PM   up 4 days,  2:39,  4 users,  load average: 1.16, 0.63, 0.42
User     tty      login@    idle    JCPU   PCPU  what
root     lft0     Thu07PM   4days      0      0  /usr/sbin/getty
root     pts/0    Thu07PM   1day       0      0  /bin/ksh
root     pts/1    Thu07PM   2:15      17      0  /bin/ksh
root     pts/2    08:19PM   0         44      0  w
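To defer work to off-peak hours, a one-time job can be handed to at, or a recurring one given a crontab entry. The job path below is a placeholder only:

# echo "/usr/local/bin/bigjob" | at 2am
# crontab -e

then add a line such as the following to run the job at 2:00 AM every day:

0 2 * * * /usr/local/bin/bigjob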
Another method of controlling the usage of system resources by processes is the /etc/security/limits file. Please see the man page on limits for more details.
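As a rough sketch, /etc/security/limits is a stanza file: the default stanza applies to everyone and can be overridden with per-user stanzas. The attribute names below are real, but the values are examples only; units and defaults are covered in the limits man page:

default:
        fsize = 2097151
        cpu = -1
        data = 262144
        stack = 65536
        rss = 65536

guest:
        cpu = 600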
Memory Usage
"vmstat" will show memory usage. Remember to throw out the first entry, since it is the cumulative activity since the system was booted.
# vmstat 2
kthr     memory             page               faults        cpu
 r  b   avm   fre  re  pi  po  fr  sr  cy   in   sy   cs  us sy id wa
 0  0  6623  1613   0   0   0   0   0   0    2  659  252  20  2 78  0
 0  0  6623  1613   0   0   0   0   0   0    2  814  314  25  4 71  0
"vmstat" can indicate whether a high paging rate is slowing down the system. The pi and po fields under the page heading are of particular importance. pi by itself may be meaningless, since some processes page in at start time. po, on the other hand, could indicate paging problems if the count is large; this may mean that more memory is needed if all the present processes must run at the same time.
Possible solutions are to run some jobs at later times using the cron and/or batch facilities. If code is written in house, it might help to verify that code optimization techniques, such as shared libraries, are being used, as sketched below.
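As a sketch (the program name is a placeholder), a binary can be rebuilt with the optimizer turned on, and its loader section examined to confirm which shared libraries it uses:

# cc -O -o app app.c
# dump -H app

On AIX the cc driver links against the shared libc by default, so dump -H should list libc.a among the dependents.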
Make sure that there is sufficient paging space on all the disks in the system. As a general rule, paging should be spread across the first four or five disks on a system to minimize paging problems.
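Paging spaces can be checked and adjusted with the commands below; the sizes, volume group, and disk names are examples only, and sizes are given in logical partitions:

# lsps -a
# chps -s 4 hd6
# mkps -n -a -s 16 rootvg hdisk2

lsps -a lists every paging space and its percent used; chps -s grows an existing one; mkps creates a new one, with -n activating it immediately and -a activating it at each restart.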
Disk and Local Peripheral I/O Performance
"iostat" can be used to determine disk usage. Remember to throw out the first entry, since it is the cumulative activity since the system was booted.
# iostat

tty:      tin     tout    avg-cpu:  % user   % sys   % idle   % iowait
          0.0      0.9              20.2     1.8     77.9     0.1

Disks:        % tm_act     Kbps      tps     Kb_read   Kb_wrtn
hdisk0           0.2        0.8      0.2      152435    141661
cd0              0.0        0.0      0.0        1050         0
"iostat" can indicate whether disk usage is well balanced. It may be possible to increase performance by moving heavily used logical volumes from a heavily used disk to a less used one.
If the disk usage is well balanced, "iostat" can also indicate possible SCSI or disk hardware problems.
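A sketch of rebalancing, with placeholder logical volume and disk names: list what lives on the busy disk, then move one of its busy logical volumes to a quieter disk in the same volume group. migratepv works while the system stays up:

# lspv -l hdisk0
# migratepv -l lv05 hdisk0 hdisk1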
Network Performance
"netstat" can indicate that there are excessive network errors. The Ierrs and Oerrs columns from "netstat -i" are of particular interest here. Ierrs and Oerrs should not be greater than 1% of Ipkts or Opkts, respectively. With Ethernet, the Coll (collision) column should generally not be more than 5 or 10 percent of the output packets. (There is some question as to whether AIX is keeping track of this, which we need to review.) High counts may be an indication of faulty network components or of network congestion.
# netstat -i
Name  Mtu    Network     Address            Ipkts Ierrs    Opkts Oerrs  Coll
lo0   16896  <Link>                         12470     0    12491     0     0
lo0   16896  127         loopback           12470     0    12491     0     0
en0   1500   <Link>      0.5.2.54.a3.11    146177     0     1520     0     0
en0   1500   17.104.96   einstein          146177     0     1520     0     0
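For more detail, "netstat -v" dumps the adapter device-driver statistics, and on most BSD-derived systems "netstat -I en0 5" (the interface name and interval are examples) prints per-interval packet and error counts so a problem can be watched as it happens:

# netstat -v | more
# netstat -I en0 5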
"nfsstat" can indicate that there are excessive network errors. These can be caused by overloaded NFS servers, or by network congestion or hardware problems.
# nfsstat

Server rpc:
calls      badcalls   nullrecv   badlen     xdrcall
0          0          0          0          0

Server nfs:
calls      badcalls
0          0
null       getattr    setattr    root       lookup     readlink   read
0 0%       0 0%       0 0%       0 0%       0 0%       0 0%       0 0%
wrcache    write      create     remove     rename     link       symlink
0 0%       0 0%       0 0%       0 0%       0 0%       0 0%       0 0%
mkdir      rmdir      readdir    fsstat
0 0%       0 0%       0 0%       0 0%

Client rpc:
calls      badcalls   retrans    badxid     timeout    wait       newcred
0          0          0          0          0          0          0

Client nfs:
calls      badcalls   nclget     nclsleep
0          0          0          0
null       getattr    setattr    root       lookup     readlink   read
0 0%       0 0%       0 0%       0 0%       0 0%       0 0%       0 0%
wrcache    write      create     remove     rename     link       symlink
0 0%       0 0%       0 0%       0 0%       0 0%       0 0%       0 0%
mkdir      rmdir      readdir    fsstat
0 0%       0 0%       0 0%       0 0%
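On the client side, retrans, badxid, and timeout are the usual places to start. A common rule of thumb is that retransmissions above a few percent of total calls deserve attention: if badxid is nearly as large as timeout, the server is answering but slowly; if badxid stays near zero while timeouts climb, requests are probably being lost in the network. The client and server sides can also be examined separately:

# nfsstat -c
# nfsstat -s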
Which Processes to Kill
What processes can I safely kill on my system to perhaps make its performance a little faster?
If your system is not a router and you do not have more than one router on your network segment, there is no need to run routed.
Unless you need to let other people on the network use the rwho or ruptime commands to see your system, there is no need to run rwhod.
If you are not an NFS server, there is no need to be running nfsd, rpc.mountd, rpc.statd, or rpc.lockd.
If a machine is never going to mount NFS file systems as a client, there is no need for it to be running the biods, rpc.statd, or rpc.lockd.
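On AIX the NFS daemons run under the System Resource Controller, so the group can be inspected and stopped together; to keep them from starting at boot, comment out the corresponding lines in /etc/rc.nfs:

# lssrc -g nfs
# stopsrc -g nfs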
If you are running your Apple Network Server strictly as a server, you may want to kill the CDE interface.
We have found that the graphical screen savers in the CDE interface use a considerable amount of processing bandwidth. Setting the Screen Saver in the Style Manager's Screen section to "Blank Screen" is probably best here.
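To go a step further on a machine used strictly as a server, CDE's dtconfig utility can disable the desktop auto-start entirely (-e re-enables it); the running dtlogin session still has to be stopped, or the machine rebooted, for this to take effect:

# /usr/dt/bin/dtconfig -d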