AIX: TCP/IP Communication Failure, Cannot "ping"

Tips for diagnosing a network communication failure when "ping" does not work.
Communication on a network (using "ping", "telnet", "rlogin", and so on) requires that both hardware and software be configured correctly. If you have either a hardware failure or a software communication problem, you may see anything from a slowdown in communication to 100% packet loss (no communication).

The "ping" command is often used to check network configuration. "ping" is a two-part application: one part runs on the sending machine and the other on the receiving machine. When you send a ping to another host, the software dispatches packets. The packets activate a process at the receiving machine that responds by sending the packets back. If Machine A sends the packet, but the packet never reaches Machine B, you will see 100% packet loss on Machine A. If Machine B receives the packet and sends it back by a different route, and the packet gets lost on the way, you will also see 100% packet loss on Machine A.

The problem determination strategy is divided into the following sections:

* Checking the Hardware
* Checking the Environment
* Checking the Configuration


Checking the Hardware
=====================

Hardware Tip 1
--------------
Ensure that all plugs are secured and screwed down on the adapters.

Hardware Tip 2
--------------
View the status of existing adapters and interfaces (the adapter is the physical hardware; the interface is the software that enables communication on that hardware):

Execute the following to check the adapters and interfaces: lsdev -C | pg

The following adapters may be listed:

ent# Standard Ethernet Adapter or High Performance Ethernet Adapter

tok# Token Ring High Performance Adapter

Verify that the adapter you are using is "Available". The term "Available" indicates that the Network Server recognized that this adapter was ready for use. If the adapter is "Defined", then you need to verify that your hardware is installed correctly. The term "Defined" indicates that the Network Server at one time knew it had available hardware in that slot but currently cannot identify that it has the hardware.

The following interfaces may be listed:

en# Standard Ethernet Network Interface
et# IEEE 802.3 Ethernet Network Interface

Verify that the interface you are using is "Available". If it is listed as "Defined", then you do not have your interface configured. The Standard Ethernet Adapter and the High Performance Ethernet Adapter can use either the en# or the et# interface. (These designate which protocols are available on the Ethernet style adapters.)
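The status check above can be scripted. The following is a minimal sketch, assuming that the first two columns of "lsdev -C" output are the device name and its state. The function name not_available and the sample lines (including the location codes) are illustrative and were not captured from a real system.

```shell
# Print the name and state of every network adapter or interface that is
# not "Available". Reads "lsdev -C"-style lines (name, state, ...) on stdin.
not_available() {
    awk '$1 ~ /^(ent|tok|en|et|tr)[0-9]+$/ && $2 != "Available" { print $1, $2 }'
}

# Illustrative sample input (hypothetical values).
sample_lsdev='ent0 Available 00-00-0E Standard Ethernet Adapter
tok0 Defined   00-02    Token Ring High Performance Adapter
en0  Available          Standard Ethernet Network Interface
et0  Defined            IEEE 802.3 Ethernet Network Interface
hdisk0 Available 00-08-00-0,0 SCSI Disk Drive'

printf '%s\n' "$sample_lsdev" | not_available
```

On an AIX system the same filter would be fed from the real command, for example "lsdev -C | not_available". Any device it prints needs the hardware or interface checks described above.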

Hardware Tip 3
--------------
Check the error report by executing the following: errpt -a | more

Look at the Date/Time line. The error log is in LIFO order (last in, first out), so the last error logged is the first one displayed. If the most recent entry is not from today, you may not have a hardware error. If it is from the current date, check the ERROR LABEL field for errors such as:

Ethernet
--------
ENT_ERR2
ENT_ERR4
ENT_ERR6

The above errors will generally mention that the error is hardware related. Reverify that all plugs are secured and screwed down on the adapters. You may want to reseat the adapters in their slots (proceed with caution) and then ping again and see if any more errors are reported.
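To scan a long error report for the Ethernet labels listed above, a small filter can pair each label with its Date/Time line. This is only a sketch: it assumes each entry in the "errpt -a" output starts with a line of dashes and contains one line ending in the label and one line beginning with "Date/Time:". The function name ent_errors and the abridged sample entries are hypothetical.

```shell
# Print "LABEL | date" for every error-report entry whose label starts with
# ENT_ERR. Reads "errpt -a"-style text on stdin.
ent_errors() {
    awk '
        function emit() {
            if (label ~ /^ENT_ERR/) print label " | " when
            label = ""; when = ""
        }
        /^-+$/         { emit(); next }     # a dashed line starts a new entry
        /LABEL:/       { label = $NF }      # last field on the label line
        /^Date\/Time:/ { sub(/^Date\/Time:[ \t]*/, ""); when = $0 }
        END            { emit() }
    '
}

# Abridged, hypothetical sample entries.
sample_errpt='---------------------------------------------------------------------------
ERROR LABEL:    ENT_ERR2
Date/Time:      Sat Feb 18 09:15:02
---------------------------------------------------------------------------
ERROR LABEL:    SAMPLE_OTHER
Date/Time:      Fri Feb 17 22:40:11
---------------------------------------------------------------------------
ERROR LABEL:    ENT_ERR6
Date/Time:      Thu Feb 16 03:01:45'

printf '%s\n' "$sample_errpt" | ent_errors
```

On an AIX system the input would come from "errpt -a | ent_errors". Compare the printed dates with today's date before concluding that the adapter hardware is at fault.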


Checking the Environment
========================

Environmental Tip 1
-------------------
Execute the following to check the network statistics: netstat -m

The beginning of the output will look similar to the following:

26 mbufs in use:
16 mbuf cluster pages in use
70 Kbytes allocated to mbufs
0 requests for mbufs denied
0 calls to protocol drain routines

Kernel malloc statistics: . . .

If the "requests for mbufs denied" line has something other than "0", your system may be exhibiting an "mbufs full" problem. Refer to the IBM AIX Performance Monitoring and Tuning Guide (SC23-2365-03).
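The denied-request counter can be pulled out of the output automatically. The following is a minimal sketch, assuming the line format shown in the sample above; the function name mbufs_denied is illustrative.

```shell
# Print the number of denied mbuf requests from "netstat -m"-style text on stdin.
mbufs_denied() {
    awk '/requests for mbufs denied/ { print $1 }'
}

# Sample taken from the output shown above.
sample_netstat_m='26 mbufs in use:
16 mbuf cluster pages in use
70 Kbytes allocated to mbufs
0 requests for mbufs denied
0 calls to protocol drain routines'

denied=$(printf '%s\n' "$sample_netstat_m" | mbufs_denied)
if [ "${denied:-0}" -gt 0 ]; then
    echo "possible mbufs full problem: $denied requests denied"
else
    echo "no mbuf requests denied"
fi
```

On an AIX system, replace the sample with the live command: "netstat -m | mbufs_denied".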

Environmental Tip 2
-------------------
Determine which machine is having the communication failure:

From Machine A, ping Machine B. On Machine B, execute the following: arp -a

The output will look similar to:

ausvm3.austin.ibm.com (129.35.26.21) at 10:0:5a:ac:22:71 [token ring]
rt=a40:22a1:c211:bb11:d3a0

cia.austin.ibm.com (129.35.22.192) at 10:0:5a:a8:e1:9d [token ring]

risc.austin.ibm.com (129.35.28.168) at 10:0:5a:9:2c:b1 [token ring]
rt=830:22a1:c211:2270

ausname1.austin.ibm.com (129.35.17.2) at 10:0:5a:a8:2b:92 [token ring]
rt=a40:22a1:c211:bb11:cff0


Check the listing for Machine A's hostname and IP address. If Machine A is NOT in the list, then packets never get from Machine A to Machine B. Either Machine A is the problem or something between Machine A and Machine B is the problem. If Machine B DOES have Machine A in the list, then either Machine B is the problem, or the return path to Machine A is the problem. In that case, go back to the beginning of this document and work through the steps on Machine B.
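The lookup in the ARP listing can be done with a single search. This is a sketch under the assumption that each entry shows the IP address in parentheses, as in the sample output above; the function name in_arp_table is illustrative.

```shell
# Succeed (exit status 0) if the given IP address appears in "arp -a"-style
# text read from stdin. The parentheses keep 129.35.26.2 from matching
# 129.35.26.21.
in_arp_table() {
    grep -F -q "($1)"
}

# Two entries from the sample output above.
sample_arp='ausvm3.austin.ibm.com (129.35.26.21) at 10:0:5a:ac:22:71 [token ring]
cia.austin.ibm.com (129.35.22.192) at 10:0:5a:a8:e1:9d [token ring]'

if printf '%s\n' "$sample_arp" | in_arp_table 129.35.26.21; then
    echo "Machine A reached this host: check Machine B or the return path"
else
    echo "Machine A not seen: check Machine A or the path to Machine B"
fi
```

On Machine B the input would come from "arp -a", with Machine A's IP address as the argument.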

Environmental Tip 3
-------------------
If NIS is running, it may interfere with pinging by hostname. You may want to disable this option until ping and telnet are working to simplify problem determination. Then, once you can ping, enable NIS and see if you have ping problems. If you do, your NIS configuration needs to be reviewed for correctness.

To disable NIS, start smit with "smit communications" and choose the following:

NFS:
Network Information Service (NIS)
Start / Stop Configured NIS Daemons

Then choose the appropriate stop items from those displayed:

Stop the Server Daemon, ypserv
Stop the Client Daemon, ypbind
Stop the yppasswdd Daemon
Stop the ypupdated Daemon

Environmental Tip 4
-------------------
Verify that your netmask is correct. (A full discussion of a netmask is outside the scope of this document.)

If your Address is   If your Netmask is   You can access machines with these IP addresses
                                          without additional routing information
------------------   ------------------   -----------------------------------------------
110.120.130.140      255.255.255.0        110.120.130.*
110.120.130.140      255.255.0.0          110.120.*.*
110.120.130.140      255.0.0.0            110.*.*.*
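The rule behind the table is that two addresses are on the same network when they are identical after each is combined with the netmask by a bitwise AND. The following sketch works through that arithmetic in the shell. The function names ip_to_int and same_subnet are illustrative, and the second address (110.120.7.7) is a made-up example.

```shell
# Convert a dotted-decimal IPv4 address to an integer.
# The body runs in a subshell so the IFS change stays local.
ip_to_int() (
    IFS=.
    set -- $1
    echo $(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))
)

# Succeed if address $1 and address $2 are on the same network under netmask $3.
same_subnet() {
    a=$(ip_to_int "$1"); b=$(ip_to_int "$2"); m=$(ip_to_int "$3")
    [ $(( a & m )) -eq $(( b & m )) ]
}

for mask in 255.255.255.0 255.255.0.0 255.0.0.0; do
    if same_subnet 110.120.130.140 110.120.7.7 "$mask"; then
        echo "$mask: reachable without additional routing"
    else
        echo "$mask: additional routing information needed"
    fi
done
```

With the first netmask the two addresses differ in the third octet, so a route is needed; with the other two netmasks they fall inside the same network.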



Checking the Configuration
==========================

Configuration Tip 1
-------------------
To verify that the hostname is still the correct hostname for this machine, execute the following: hostname

The string returned should be the hostname of this machine. If the name returned was not what was expected, run "smit tcpip" and choose the following to set the hostname.

Further Configuration
Hostname

Configuration Tip 2
-------------------
Verify that the IP address is what is expected by executing the following: host your_hostname

The output should be similar to: zcomm1.austin.ibm.com is 129.35.31.99

If the output is not what was expected, you need to correctly configure the IP address for this adapter or check the name resolution (see steps below).

Configuration Tip 3
-------------------
Check to see if you are running Domain Name Service (DNS):

If /etc/resolv.conf exists, then you are using DNS. Disable DNS by renaming this file to some other filename:

mv /etc/resolv.conf /etc/resolv.conf.hold

If you can now ping, then something is wrong with the DNS configuration. (In the /etc/hosts file, you may have to add the IP address and hostname of the machine you are trying to ping.)

Configuration Tip 4
-------------------
Examine the /etc/hosts file. Verify that your hostname is in the file only once and that there is no corruption in the file. If your hostname is listed with two IP addresses, the first matching entry in the file determines the IP address that is used. Also, check for a duplicate IP address.
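Duplicates are easy to miss by eye in a long hosts file. The following sketch reports hostnames that appear on more than one line and IP addresses that appear more than once. It assumes the usual layout of an IP address followed by one or more names, with "#" starting a comment. The function names dup_hostnames and dup_ips and the sample entries marked hypothetical are illustrative.

```shell
# Print each hostname or alias that appears on more than one line,
# followed by the IP addresses it is listed with (in file order).
dup_hostnames() {
    awk '
        { sub(/#.*/, "") }                  # strip comments
        NF >= 2 {
            for (i = 2; i <= NF; i++) {
                count[$i]++
                ips[$i] = ips[$i] " " $1
            }
        }
        END { for (h in count) if (count[h] > 1) print h ":" ips[h] }
    ' "$@" | sort
}

# Print each IP address that appears on more than one line.
dup_ips() {
    awk '
        { sub(/#.*/, "") }
        NF >= 2 { n[$1]++ }
        END { for (ip in n) if (n[ip] > 1) print ip }
    ' "$@" | sort
}

sample_hosts='127.0.0.1     loopback localhost      # loopback entry
129.35.31.99  zcomm1.austin.ibm.com zcomm1
129.35.31.100 zcomm1                  # hypothetical duplicate hostname
129.35.22.192 cia.austin.ibm.com cia
129.35.22.192 cia2                    # hypothetical duplicate IP address'

printf '%s\n' "$sample_hosts" | dup_hostnames
printf '%s\n' "$sample_hosts" | dup_ips
```

Both functions also accept a file name, so on a real system they could be run as "dup_hostnames /etc/hosts" and "dup_ips /etc/hosts".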

Configuration Tip 5
-------------------
Execute the following to ensure that software is loaded correctly: lppchk -v

This will execute for a while and then come to a prompt. If any error messages are displayed, it indicates a possible install or update problem; correct the error and then try pinging.

Configuration Tip 6
-------------------
Ping by hostname, then by IP address. Both should respond in the same manner. If they don't, check the /etc/hosts file again for duplicates.

Configuration Tip 7
-------------------
Ping other machines, routers, etc. If only one machine is failing on the ping, your machine could have one of the following:

a gateway problem
a route problem

Configuration Tip 8
-------------------
Verify the adapter configuration by executing the following: netstat -i

The above command should produce output similar to the following:
Name  Mtu   Network    Address          Ipkts    Ierrs  Opkts   Oerrs  Coll
lo0   1536  <Link>     -                149827   0      149827  0      0
lo0   1536  127        localhost.xxxxx  149827   0      149827  0      0
tr0   1492  <Link>     -                5603085  48642  89675   0      0
tr0   1492  129.35.16  xxxxxx.xxxxxx.x  5603085  48642  89675   0      0

Some fields and values you may see in the above output are:

tr0          Represents the token ring interface
en0          Represents the standard Ethernet interface
et0          Represents the IEEE 802.3 Ethernet interface
lo0          Represents the loopback mechanism
Ierrs/Oerrs  Shows errors for incoming and outgoing packets

If you see only lo0, or if there is an "*" next to your interface, you need to configure the interface again.

A nonzero Oerrs count indicates a problem and may point to a hardware error.

Ierrs generally indicate that your interface is receiving packets whose format it does not recognize and is discarding them.
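The error columns can be checked with a short filter. This sketch assumes the nine-column layout shown in the sample output (Name, Mtu, Network, Address, Ipkts, Ierrs, Opkts, Oerrs, Coll) and counts columns from the right so that a blank Address field does not shift the result. The function name iface_errors is illustrative.

```shell
# Print each interface (once) whose Ierrs or Oerrs count is nonzero.
# Reads "netstat -i"-style text, including the header line, on stdin.
iface_errors() {
    awk 'NR > 1 && !seen[$1]++ && (($(NF-3) + 0) > 0 || ($(NF-1) + 0) > 0) {
        print $1 ": Ierrs=" $(NF-3) " Oerrs=" $(NF-1)
    }'
}

# Sample taken from the output shown above.
sample_netstat_i='Name  Mtu   Network    Address          Ipkts    Ierrs  Opkts   Oerrs  Coll
lo0   1536  <Link>     -                149827   0      149827  0      0
lo0   1536  127        localhost.xxxxx  149827   0      149827  0      0
tr0   1492  <Link>     -                5603085  48642  89675   0      0
tr0   1492  129.35.16  xxxxxx.xxxxxx.x  5603085  48642  89675   0      0'

printf '%s\n' "$sample_netstat_i" | iface_errors
```

For the sample above, only tr0 is reported, because of its Ierrs count. On an AIX system the input would come from "netstat -i | iface_errors".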

If you have checked everything and the ping is still not working and you are running Ethernet, try reversing protocols (en0 to et0 and vice versa).

As a final try, you can remove the interface and adapter and try starting again. You can do this from the command line:

ifconfig <interface> detach
rmdev -d -l <interface>
rmdev -d -l <adapter>

Then you will need to reconfigure the adapters and interfaces. You can do that in any of these ways:

* Reboot, or

* In smit, choose:

Devices
Configure Devices Added After IPL

or

* From the command line, execute: cfgmgr

The above procedures will configure interfaces in a defined state and adapters in an available state. Use normal procedures to customize the configuration for your system.
Published Date: Feb 18, 2012