Communication on a network (using "ping", "telnet", "rlogin", and so on) requires that both hardware and software be configured correctly. A hardware failure or a software communication problem can cause anything from a slowdown in communication to 100% packet loss (no communication).
The "ping" command is often used to check network configuration. Ping involves both machines: the ping command runs on the sending machine, and the TCP/IP software on the receiving machine answers it. When you ping another host, the sending machine dispatches packets (ICMP echo requests), and the receiving machine responds by sending packets (echo replies) back. If Machine A sends a packet but the packet never reaches Machine B, you will see 100% packet loss on Machine A. If Machine B receives the packet and sends its reply back by a different route, and the reply gets lost, you will also see 100% packet loss on Machine A.
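As a rough illustration of what ping puts on the wire: each packet is an ICMP echo request (type 8), and the receiving machine answers with an echo reply (type 0) that carries the same identifier, sequence number, and data (RFC 792). The Python sketch below only builds and checks these messages; actually sending them requires a raw socket and normally root authority, which it deliberately leaves out. The function names are illustrative, not part of any AIX interface.

```python
import struct


def internet_checksum(data):
    """RFC 1071 one's-complement checksum over 16-bit big-endian words."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length data with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    # Fold any carries above 16 bits back into the low 16 bits.
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def build_echo_request(identifier, sequence, payload):
    """Build an ICMP echo request (type 8, code 0) with a valid checksum."""
    header = struct.pack("!BBHHH", 8, 0, 0, identifier, sequence)
    checksum = internet_checksum(header + payload)
    header = struct.pack("!BBHHH", 8, 0, checksum, identifier, sequence)
    return header + payload


def is_echo_reply(packet, identifier, sequence):
    """True if packet is an intact echo reply (type 0) matching our request."""
    # A message with a correct checksum sums to zero when re-checked.
    if len(packet) < 8 or internet_checksum(packet) != 0:
        return False
    msg_type, code, _, ident, seq = struct.unpack("!BBHHH", packet[:8])
    return msg_type == 0 and code == 0 and ident == identifier and seq == sequence
```

If no matching reply ever comes back, ping counts that request as lost, which is what the "packet loss" figure above reports.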
The problem-determination strategy that follows is divided into three parts:
* Checking the Hardware
* Checking the Environment
* Checking the Configuration
Checking the Hardware
=====================
Hardware Tip 1
--------------
Ensure that all plugs are secured and screwed down on the adapters.
Hardware Tip 2
--------------
View the status of existing adapters and interfaces (the adapter is the physical hardware; the interface is the software that enables communication on that hardware):
Execute the following to check the adapters and interfaces: lsdev -C | pg
The following adapters may be listed:
ent# Standard Ethernet Adapter or High Performance Ethernet Adapter
tok# Token Ring High Performance Adapter
Verify that the adapter you are using is "Available". The term "Available" indicates that the Network Server recognized that this adapter was ready for use. If the adapter is "Defined", then you need to verify that your hardware is installed correctly. The term "Defined" indicates that the Network Server at one time knew it had available hardware in that slot but currently cannot identify that it has the hardware.
The following interfaces may be listed:
en# Standard Ethernet Network Interface
et# IEEE 802.3 Ethernet Network Interface
Verify that the interface you are using is "Available". If it is listed as "Defined", the interface is not configured. The Standard Ethernet Adapter and the High Performance Ethernet Adapter can use either the en# or the et# interface. (The interface type determines which protocol, standard Ethernet or IEEE 802.3, is used on the Ethernet adapter.)
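If you save the lsdev -C output to a file, a few lines of Python can pick out the network adapters and interfaces that are not "Available". This sketch assumes only that each line begins with the device name followed by its state (for example, "ent0 Available ..."); the function names are hypothetical, and the sample lines in any test input are made up.

```python
import re

# Adapter (ent#, tok#) and interface (en#, et#, tr#) names discussed in this document.
NETWORK_DEVICE = re.compile(r"^(?:ent|tok|en|et|tr)\d+$")


def network_device_states(lsdev_output):
    """Map each network adapter/interface name to its state, e.g. 'Available'.

    Assumes each line starts with the device name followed by its state.
    """
    states = {}
    for line in lsdev_output.splitlines():
        fields = line.split()
        if len(fields) >= 2 and NETWORK_DEVICE.match(fields[0]):
            states[fields[0]] = fields[1]
    return states


def not_available(lsdev_output):
    """Sorted names of network devices whose state is anything but 'Available'."""
    return sorted(name for name, state in network_device_states(lsdev_output).items()
                  if state != "Available")
```

For example, after "lsdev -C > lsdev.out" (the file name is arbitrary), not_available(open("lsdev.out").read()) lists the devices to investigate.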
Hardware Tip 3
--------------
Check the error report by executing the following: errpt -a | more
Look at the Date/Time line. The error log is displayed in LIFO order (last in, first out), so the most recently logged error is displayed first. If its date is not today's date, your current problem is probably not a logged hardware error. If it is today's date, check the ERROR LABEL field for errors such as:
Ethernet
--------
ENT_ERR2
ENT_ERR4
ENT_ERR6
The entries for the errors above generally identify them as hardware related. Recheck that all plugs are secured and screwed down on the adapters. You may also want to reseat the adapters in their slots (proceed with caution), then ping again and check whether any new errors are reported.
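When the error report is long, you can save it (for example, errpt -a > errpt.out; the file name is arbitrary) and count the Ethernet labels listed above. This Python sketch searches for the label names anywhere in the text, so it does not depend on the exact errpt layout. It does not check the Date/Time line, which you still need to read yourself. The function name is hypothetical.

```python
import re

# ERROR LABEL values listed above for Ethernet hardware errors.
ETHERNET_ERROR_LABELS = ("ENT_ERR2", "ENT_ERR4", "ENT_ERR6")


def count_ethernet_errors(errpt_text):
    """Count whole-word occurrences of each Ethernet error label in errpt output."""
    # \b stops ENT_ERR2 from also matching a longer label such as ENT_ERR21.
    return {label: len(re.findall(r"\b" + label + r"\b", errpt_text))
            for label in ETHERNET_ERROR_LABELS}
```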
Checking the Environment
========================
Environment Tip 1
-----------------
Execute the following to check the network memory buffer (mbuf) statistics: netstat -m
The output will look similar to the following:
26 mbufs in use:
16 mbuf cluster pages in use
70 Kbytes allocated to mbufs
0 requests for mbufs denied
0 calls to protocol drain routines
Kernel malloc statistics: . . .
If the "requests for mbufs denied" line shows something other than "0", your system may be exhibiting an "mbufs full" problem. Refer to the IBM AIX Version 3.2/4.1 Performance Monitoring and Tuning Guide (SC23-2365-03).
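The check described above is easy to script against saved netstat -m output. This Python sketch (function names hypothetical) pulls the count from the "requests for mbufs denied" line and returns None if that line is missing.

```python
import re


def mbuf_requests_denied(netstat_m_output):
    """Return the number on the 'requests for mbufs denied' line, or None if absent."""
    match = re.search(r"^[ \t]*(\d+)[ \t]+requests for mbufs denied",
                      netstat_m_output, re.MULTILINE)
    return int(match.group(1)) if match else None


def mbufs_full_suspected(netstat_m_output):
    """True when any denied mbuf requests suggest an 'mbufs full' problem."""
    denied = mbuf_requests_denied(netstat_m_output)
    return denied is not None and denied > 0
```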
Environment Tip 2
-----------------
Determine which machine is having the communication failure:
From Machine A, ping Machine B. On Machine B, execute the following: arp -a
The output will look similar to:
ausvm3.austin.ibm.com (129.35.26.21) at 10:0:5a:ac:22:71 [token ring]
rt=a40:22a1:c211:bb11:d3a0
cia.austin.ibm.com (129.35.22.192) at 10:0:5a:a8:e1:9d [token ring]
risc.austin.ibm.com (129.35.28.168) at 10:0:5a:9:2c:b1 [token ring]
rt=830:22a1:c211:2270
ausname1.austin.ibm.com (129.35.17.2) at 10:0:5a:a8:2b:92 [token ring]
rt=a40:22a1:c211:bb11:cff0
Check the listing for Machine A's hostname and IP address. If Machine A is NOT in the list, packets never get from Machine A to Machine B: either Machine A is the problem, or something between Machine A and Machine B is the problem. If Machine B DOES have Machine A in the list, either Machine B is the problem or the return path to Machine A is the problem; go back to the beginning of this fax and work through the steps on Machine B. (This check assumes that both machines are on the same local network. When a router sits between them, Machine A does not appear in Machine B's ARP table even when packets arrive.)
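To check Machine B's ARP table for Machine A without reading it line by line, the following Python sketch parses arp -a output in the format shown above (hostname, IP address in parentheses, "at", hardware address) and skips the token ring routing ("rt=") lines. It assumes that format; the function names are hypothetical.

```python
import re

# Matches e.g. "cia.austin.ibm.com (129.35.22.192) at 10:0:5a:a8:e1:9d [token ring]".
ARP_ENTRY = re.compile(
    r"^(\S+)\s+\((\d{1,3}(?:\.\d{1,3}){3})\)\s+at\s+"
    r"([0-9a-fA-F]{1,2}(?::[0-9a-fA-F]{1,2}){5})"
)


def arp_entries(arp_output):
    """Map IP address -> (hostname, hardware address) for each complete entry."""
    entries = {}
    for line in arp_output.splitlines():
        match = ARP_ENTRY.match(line.strip())
        if match:  # "rt=" routing lines and entries without a hardware address are skipped
            host, ip, hw = match.groups()
            entries[ip] = (host, hw)
    return entries


def in_arp_table(arp_output, host_or_ip):
    """True if an entry matches the given hostname or IP address."""
    target = host_or_ip.lower()
    return any(target in (ip, host.lower())
               for ip, (host, _) in arp_entries(arp_output).items())
```

On Machine B, save the output with "arp -a > arp.out" (file name arbitrary) and call in_arp_table with Machine A's IP address or hostname.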
Environment Tip 3
-----------------
If NIS is running, it may interfere with pinging by hostname. To simplify problem determination, you may want to stop NIS until ping and telnet are working. Then, once you can ping, restart NIS and ping again. If the problem returns, review your NIS configuration for correctness.
To disable NIS, start smit with "smit communications" and choose the following:
NFS:
Network Information Service (NIS)
Start / Stop Configured NIS Daemons
Then choose the appropriate stop items from those displayed:
Stop the Server Daemon, ypserv
Stop the Client Daemon, ypbind
Stop the yppasswdd Daemon
Stop the ypupdated Daemon
Environment Tip 4
-----------------
Verify that your netmask is correct. (A full discussion of netmasks is outside the scope of this document.)
If your address is | If your netmask is | You can access machines with these IP addresses without additional routing information |
110.120.130.140 | 255.255.255.0 | 110.120.130.* |
110.120.130.140 | 255.255.0.0 | 110.120.*.* |
110.120.130.140 | 255.0.0.0 | 110.*.*.* |
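The table follows from a bitwise AND of each address with the netmask: two addresses are on the same network when they agree in every bit the netmask sets. This Python sketch (function names hypothetical) shows the AND explicitly and also uses the standard ipaddress module, which performs the same arithmetic.

```python
import ipaddress


def same_network(my_address, netmask, target):
    """True if target is reachable without additional routing information."""
    network = ipaddress.ip_interface(f"{my_address}/{netmask}").network
    return ipaddress.ip_address(target) in network


def same_network_bitwise(my_address, netmask, target):
    """The same test written as an explicit AND of each address with the mask."""
    mask = int(ipaddress.IPv4Address(netmask))
    mine = int(ipaddress.IPv4Address(my_address))
    theirs = int(ipaddress.IPv4Address(target))
    return (mine & mask) == (theirs & mask)


def network_of(my_address, netmask):
    """The local network in prefix form, e.g. '110.120.0.0/16'."""
    return str(ipaddress.ip_interface(f"{my_address}/{netmask}").network)
```

An address outside the local network needs a route or a default gateway, which is why Configuration Tip 7 points to gateway and route problems.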
Checking the Configuration
==========================
Configuration Tip 1
-------------------
To verify that the hostname is still the correct hostname for this machine, execute the following: hostname
The string returned should be the hostname of this machine. If the name returned is not what you expected, run "smit tcpip" and choose the following to set the hostname:
Further Configuration
Hostname
Configuration Tip 2
-------------------
Verify that the IP address is what is expected by executing the following: host your_hostname
The output should be similar to: zcomm1.austin.ibm.com is 129.35.31.99
If the output is not what you expected, correct the IP address configured for this adapter or check name resolution (see the tips that follow).
Configuration Tip 3
-------------------
Check to see if you are running Domain Name Service (DNS):
If /etc/resolv.conf exists, you are using DNS. Disable DNS temporarily by renaming this file:
mv /etc/resolv.conf /etc/resolv.conf.hold
If you can now ping, something is wrong with your DNS configuration. (With DNS disabled, hostnames are looked up in the /etc/hosts file, so you may have to add the IP address and hostname of the machine you are trying to ping to that file.) When you have finished testing, restore the file with: mv /etc/resolv.conf.hold /etc/resolv.conf
Configuration Tip 4
-------------------
Examine the /etc/hosts file. Verify that your hostname appears in the file only once and that the file is not corrupted. If your hostname is listed with two IP addresses, the first matching entry in the file determines which IP address is used. Also check for duplicate IP addresses.
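The /etc/hosts checks in this tip can be sketched in Python. The functions below (hypothetical names) parse the usual "IP-address hostname [aliases]" lines, ignore comments, report hostnames listed with more than one IP address and IP addresses that appear on more than one line, and return the first matching address, the one a top-to-bottom file lookup would use. Comparing names case-insensitively is an assumption of this sketch.

```python
def parse_hosts(text):
    """Return (ip, [names]) pairs from /etc/hosts text, skipping comments and blanks."""
    entries = []
    for line in text.splitlines():
        fields = line.split("#", 1)[0].split()  # drop anything after '#'
        if len(fields) >= 2:
            entries.append((fields[0], [name.lower() for name in fields[1:]]))
    return entries


def first_address(text, hostname):
    """IP address of the first entry listing hostname, or None if it is absent."""
    for ip, names in parse_hosts(text):
        if hostname.lower() in names:
            return ip
    return None


def find_duplicates(text):
    """Return (hostnames with more than one IP, IPs appearing on more than one line)."""
    ips_by_name = {}
    lines_by_ip = {}
    for ip, names in parse_hosts(text):
        lines_by_ip[ip] = lines_by_ip.get(ip, 0) + 1
        for name in names:
            ips_by_name.setdefault(name, set()).add(ip)
    dup_names = sorted(name for name, ips in ips_by_name.items() if len(ips) > 1)
    dup_ips = sorted(ip for ip, count in lines_by_ip.items() if count > 1)
    return dup_names, dup_ips
```

Run it on the contents of the file, for example find_duplicates(open("/etc/hosts").read()).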
Configuration Tip 5
-------------------
Execute the following to ensure that software is loaded correctly: lppchk -v
This runs for a while and then returns to the prompt. Any error messages indicate a possible installation or update problem; correct the error and then try pinging again.
Configuration Tip 6
-------------------
Ping by hostname, then by IP address. Both should respond in the same manner. If they don't, check the /etc/hosts file again for duplicates.
Configuration Tip 7
-------------------
Ping other machines, routers, and so on. If the ping fails for only one machine, your machine could have one of the following:
* a gateway problem
* a route problem
Configuration Tip 8
-------------------
To verify the adapter configuration, execute the following: netstat -i
The above command should produce output similar to the following:
Name | Mtu | Network | Address | Ipkts | Ierrs | Opkts | Oerrs | Coll |
lo0 | 1536 | <Link> | - | 149827 | 0 | 149827 | 0 | 0 |
lo0 | 1536 | 127 | localhost.xxxxx | 149827 | 0 | 149827 | 0 | 0 |
tr0 | 1492 | <Link> | - | 5603085 | 48642 | 89675 | 0 | 0 |
tr0 | 1492 | 129.35.16 | xxxxxx.xxxxxx.x | 5603085 | 48642 | 89675 | 0 | 0 |
Some fields and values you may see in the above output are:
tr0 Represents token ring interface
en0 Represents standard Ethernet interface
et0 Represents IEEE 802.3 Ethernet interface
lo0 Represents the loopback mechanism
Ierrs/Oerrs Shows errors for incoming and outgoing packets
If you see only lo0, or if there is an "*" next to your interface, you need to configure the interface again.
Nonzero Oerrs counts are a bad sign and may point to a hardware error.
Ierrs generally indicate that your interface is receiving packets for which it does not recognize the format and is discarding them.
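The netstat -i checks above can also be applied to saved output. This Python sketch (function names hypothetical) assumes whitespace-separated columns with the five counters Ipkts, Ierrs, Opkts, Oerrs, and Coll at the end of each line, and a "*" appended to the name of an interface that needs attention; it reports the conditions described above but makes no judgment about how many Ierrs are too many.

```python
def parse_netstat_i(output):
    """Return one dict per data row of netstat -i output."""
    rows = []
    for line in output.splitlines():
        fields = line.split()
        if len(fields) < 6 or fields[0] == "Name":
            continue  # blank line or header
        try:
            ipkts, ierrs, opkts, oerrs, coll = (int(f) for f in fields[-5:])
        except ValueError:
            continue  # not a data row
        name = fields[0]
        rows.append({"name": name.rstrip("*"), "down": name.endswith("*"),
                     "ipkts": ipkts, "ierrs": ierrs, "opkts": opkts,
                     "oerrs": oerrs, "coll": coll})
    return rows


def interface_warnings(output):
    """List the conditions this tip describes: only lo0, '*' marks, Oerrs, Ierrs."""
    by_name = {}
    for row in parse_netstat_i(output):
        # An interface appears once per network entry; keep its first row,
        # but treat it as marked if any of its rows carries a '*'.
        if row["name"] in by_name:
            by_name[row["name"]]["down"] |= row["down"]
        else:
            by_name[row["name"]] = dict(row)
    warnings = []
    if set(by_name) <= {"lo0"}:
        warnings.append("only lo0 is listed: configure the interface again")
    for name, row in by_name.items():
        if row["down"]:
            warnings.append(f"{name}: marked with '*': configure the interface again")
        if row["oerrs"]:
            warnings.append(f"{name}: {row['oerrs']} output errors: possible hardware error")
        if row["ierrs"]:
            pct = 100.0 * row["ierrs"] / max(row["ipkts"], 1)
            warnings.append(f"{name}: {row['ierrs']} input errors ({pct:.1f}% of input packets)")
    return warnings
```

Applied to the sample output shown earlier, it would flag only the Ierrs count on tr0.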
If you have checked everything, ping still does not work, and you are running Ethernet, try reversing protocols (en0 to et0 and vice versa).
As a last resort, you can remove the interface and adapter and start again. To remove them from the command line, execute:
ifconfig <interface> detach
rmdev -d -l <interface>
rmdev -d -l <adapter>
Then you will need to reconfigure the adapters and interfaces. You can do that in any of these ways:
* Reboot, or
* In smit, choose:
Devices
Configure Devices Added After IPL
or
* From the command line, execute: cfgmgr
The above procedures will configure interfaces in a defined state and adapters in an available state. Use normal procedures to customize the configuration for your system.