The central issue is that printers, file servers, and mail servers drop in and out of the Chooser. This brings up a couple of questions. How many of each type of device do you have in each zone? Could you do an Inter*Poll of the affected zones for each type of device in question and send us the information? Here is some information and a possible answer:
This is how the Chooser really works in a nutshell:
The Chooser sends out a Name Binding Protocol (NBP) packet looking for all devices of type XXXXXX (for example, type LaserWriter). It sets up a buffer of 512 bytes for the responses. The responses look like:
Field               Size      Example
device name length  1 byte
device name         variable  MyLaser-Hands off
type name length    1 byte
type name           variable  LaserWriter
zone field length   1 byte
zone name field     variable  probably *
The Chooser gets such a packet back for each device, i.e., each LaserWriter. When the 512-byte buffer is full of these packets, it stops looking for device names to display. This means that some LaserWriters might not be displayed immediately. If you leave the Chooser window open, however, the Chooser continues to send out NBP lookups every 1.47 seconds. Different LaserWriters could respond more quickly each time. In this case, you may see the Chooser show and hide various devices.
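To make the buffer behavior concrete, here is a minimal Python sketch that follows the response layout exactly as described above (three length-prefixed strings per device). The function names and the sample device names are made up for illustration, and any other fields a real NBP reply may carry are ignored.

```python
def pack_entry(name, type_name, zone="*"):
    """Encode one response as three length-prefixed strings."""
    out = bytearray()
    for field in (name, type_name, zone):
        raw = field.encode("ascii")
        out.append(len(raw))      # 1-byte length
        out += raw                # variable-length text
    return bytes(out)

def fill_buffer(entries, size=512):
    """Append packed entries until the next one no longer fits."""
    buf = bytearray()
    for entry in entries:
        packed = pack_entry(*entry)
        if len(buf) + len(packed) > size:
            break                 # buffer full: later devices are not listed
        buf += packed
    return bytes(buf)

def parse_buffer(buf):
    """Decode a buffer back into (name, type, zone) tuples."""
    entries, pos = [], 0
    while pos < len(buf):
        fields = []
        for _ in range(3):
            length = buf[pos]
            fields.append(buf[pos + 1:pos + 1 + length].decode("ascii"))
            pos += 1 + length
        entries.append(tuple(fields))
    return entries

# Hypothetical zone with 40 printers, all answering the same lookup.
devices = [(f"Printer Room {i:02d}", "LaserWriter", "*") for i in range(40)]
shown = parse_buffer(fill_buffer(devices))
print(len(shown), "of", len(devices), "devices fit in 512 bytes")
```

Which devices make it into the buffer depends on the order in which replies arrive, which is why the list can change from one lookup to the next.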
This means that the number of devices the Chooser can show really depends on how long the type name (like "LaserWriter") is and how long the device names are.
The figure of about 18 devices for the 512-byte buffer is an average, based on device names being about 13 or 14 characters long and the type name being about 10 or 11 characters long.
In System 7.0, the Chooser's buffer size is increased to 1024 bytes. This means that, on average, about 36 devices can be displayed.
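The averages quoted above can be checked with a little arithmetic. The Python sketch below assumes each entry costs three length bytes plus the text of the device name, the type name, and a one-character zone name ("*"); the function names are illustrative only.

```python
def entry_size(name_len, type_len, zone_len=1):
    """Bytes used by one response: three length bytes plus the three strings."""
    return 3 + name_len + type_len + zone_len

def capacity(buffer_size, name_len, type_len, zone_len=1):
    """Number of whole entries that fit in the buffer."""
    return int(buffer_size // entry_size(name_len, type_len, zone_len))

# Average lengths taken from the text: 13-14 and 10-11 characters.
print(capacity(512, 13.5, 10.5))    # 512-byte buffer
print(capacity(1024, 13.5, 10.5))   # 1024-byte buffer (System 7.0)
```

Shorter device names raise the count and longer ones lower it, so the real limit in a given zone may differ from these averages.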
There is a way to affect the manner in which the lookup is done, which could help in some environments, especially in wide-area-network environments where slow data links may be used. If you modify the GNRL resource in the Chooser document (AppleShare, for example), it will affect every NBP lookup that is done from the Chooser for that type of device.
The Chooser uses the two bytes of this resource as the NBP lookup interval and retry count for the current NBP transaction. The default of 0705 tells the Chooser to send five NBP lookup requests at an interval of seven 8-tick units, which is a little under one second. This process is repeated in an infinite loop, until the user closes the Chooser.
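As a rough illustration, the two bytes can be decoded as follows. This Python sketch assumes the high byte is the interval in units of 8 ticks (a tick being 1/60 second) and the low byte is the number of lookups; that assumption reproduces the 10.6-second wait shown for the value 5002 in the example that follows, and gives just under one second for the default. The function name is hypothetical.

```python
TICKS_PER_SECOND = 60   # assumption: one Macintosh tick is 1/60 second
TICKS_PER_UNIT = 8      # assumption: the interval byte counts 8-tick units

def decode_gnrl(value):
    """Split a 16-bit GNRL value into (interval in seconds, lookup count)."""
    interval_units = (value >> 8) & 0xFF   # high byte: interval
    count = value & 0xFF                   # low byte: number of lookups
    interval_seconds = interval_units * TICKS_PER_UNIT / TICKS_PER_SECOND
    return interval_seconds, count

for raw in (0x0705, 0x5002):
    seconds, count = decode_gnrl(raw)
    print(f"{raw:04X}: {count} lookups, {seconds:.2f} s apart")
```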
Chooser Event Flow Example:
User opens Chooser and selects the AppleShare CDEV
GNRL resource -4096 loaded value = (5002)
NBP lookup mechanism started
NBP Loop:
    Get NBP ID for this transaction
    (Note: All NBP requests and replies for this loop will use this ID)
    Send first lookup (NBP ID = "New")
    Collect and display responses from the NBP lookup ID "New"
    Wait 10.6 seconds
    Send second lookup (NBP ID still = "New")
    Collect and display responses from the NBP lookup ID "New"
    Wait 10.6 seconds
    Discard all buffers and data associated with NBP ID "New"
    (Note: If a response is received for NBP lookup ID = "New"
    after this point, the reply data would be discarded and the
    device would not be added to the list in the Chooser)
    Do some other misc. cleanup (approx. time 1 sec)
    goto NBP Loop
End NBP Loop:
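The timing of one pass through this loop can be laid out with a short Python sketch. It only models the schedule, not real network traffic. The interval of 640 ticks (10.67 seconds) assumes the 50 in the GNRL value is hexadecimal and counts 8-tick units, and the one-second cleanup is the approximate figure from the flow above. The function name is illustrative.

```python
def lookup_cycle(interval_s, count, cleanup_s=1.0):
    """Return (time, event) pairs for one pass through the NBP loop."""
    events, t = [], 0.0
    for n in range(1, count + 1):
        events.append((t, f"send lookup {n}"))
        t += interval_s                      # collect replies while waiting
        events.append((t, f"end of wait for lookup {n}"))
    events.append((t, "discard buffers for this NBP ID"))
    t += cleanup_s
    events.append((t, "cleanup done, start next cycle"))
    return events

for when, what in lookup_cycle(640 / 60, 2):
    print(f"{when:6.2f} s  {what}")
```

A reply has to arrive before the "discard" step of the cycle that requested it in order to be listed.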
With the retry timer set to such a large value the multiple retry count is really not necessary. On the other hand, it doesn't hurt either, and it effectively increases the time we'll wait for NBP replies to over 20 seconds for the current transaction. The idea behind the retry count is to send several lookup requests out in quick succession (default < 1 sec.), in case there are devices which were unable to respond because they were busy or because the previous packet never reached them.
The reason that increasing the interval timer helps in the case of remote servers is directly tied to the way the NBP mechanism works. The Chooser maintains only 1 NBP lookup request at a time, tracking all replies to that request by way of the NBP ID mechanism. Replies that are received that do not match the current request ID are discarded.
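The effect of the ID check on a slow reply can be sketched as follows. This Python class is a hypothetical model, not the Chooser's actual code: it assumes the ID is a small wrapping counter and that the device list is cleared whenever a new transaction starts.

```python
class LookupSession:
    """Tracks a single outstanding NBP lookup, as the Chooser is described to do."""

    def __init__(self):
        self.current_id = 0
        self.devices = []

    def start_new_lookup(self):
        # Assumption: the ID is an 8-bit counter that wraps around.
        self.current_id = (self.current_id + 1) % 256
        self.devices = []          # data for the old ID is discarded
        return self.current_id

    def handle_reply(self, nbp_id, device_name):
        """Accept a reply only if it carries the current lookup ID."""
        if nbp_id != self.current_id:
            return False           # late reply to an old request: dropped
        if device_name not in self.devices:
            self.devices.append(device_name)
        return True

session = LookupSession()
old_id = session.start_new_lookup()
session.handle_reply(old_id, "Local Server")
new_id = session.start_new_lookup()
# A remote server answering the first request over a slow link arrives too late.
print(session.handle_reply(old_id, "Remote Server"))
```

A longer interval keeps the same ID current for longer, which gives distant devices more time to answer before their replies stop matching.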
The request ID is maintained only for the current NBP request; the interval and retry counters for this request can be tuned via the GNRL resource. In other words, if you set the retry counter to 10 and the interval timer to 50 (hexadecimal), the NBP ID would be maintained for 10 requests at an interval of 10.6 seconds. The GNRL resource is documented in "Inside Macintosh, Vol. 4", page 216.
AARP Issue
----------
The symptom you describe, with the Apple Internet router AARPing for the same node over and over, is probably attributable to the router being overloaded and unable to accept the AARP responses. You mention that the router is getting a fair number of overflow errors; this leads me to believe that the routers are indeed overloaded.
Which ports on the routers are getting the overflow errors, and how many are reported over the course of a day or a week? Overflow errors are caused by the router being too busy to process all incoming packets on an interface. The network interface chip set can detect that a packet was available but that the processor was too busy to get the packet from the interface before the next packet arrived. There may be some other issues related to your environment, but this is a good starting point.
Novell VAPs Sending to Random Routers
-------------------------------------
AppleTalk Phase 2 does provide an enhancement that lets your node cache network-to-router pairs for use when determining where to route a packet destined for a remote network. When the AppleTalk protocol DDP receives a packet from a remote network, it strips off the data-link source address of the packet. This is the address of the last router on the route from the original network. This router should generally be the optimal router in terms of hop count back to the original network. You can then use that router for any future transactions to that network.
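The caching idea can be sketched in a few lines of Python. The class, method names, and addresses are hypothetical and only illustrate the network-to-router mapping described above. The fallback to a default router for networks not yet in the cache is an assumption.

```python
class BestRouterCache:
    """Remembers, per remote network, the router that last delivered a packet from it."""

    def __init__(self, default_router):
        self.default_router = default_router
        self.routers = {}          # network number -> data-link address of router

    def note_incoming(self, source_network, datalink_source):
        """Record the last-hop router seen on a packet from a remote network."""
        self.routers[source_network] = datalink_source

    def next_hop(self, dest_network):
        """Pick the cached router for a network, or fall back to the default."""
        return self.routers.get(dest_network, self.default_router)

cache = BestRouterCache(default_router="router-A")
cache.note_incoming(source_network=1200, datalink_source="router-C")
print(cache.next_hop(1200))   # cached entry for network 1200
print(cache.next_hop(3400))   # no entry yet for network 3400
```

A node without such a cache has no record of which router is closest to a given network, so it may hand packets to any router it knows about.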
Now that I've explained all of that, you're saying, "Okay, that sounds right, but the Novell server is not doing what it's supposed to." The real story is that this enhancement is an optional, implementation-specific addition that is not required by the AppleTalk protocol. In this case, it would be normal for the Novell server to act the way that you described, if Novell did not implement the "Best Router" enhancement.
Conclusions
-----------
We first need to get a handle on the environment that you have, including numbers of devices per network and per zone. We also need to take a close look at the statistics from the Apple Internet routers and at the average load on the various network segments, both as measured by the Internet Router and as measured by a network monitor/analyzer.
A close look at traffic patterns could also be helpful in determining where to best segment the network if it becomes necessary. It may be that the traffic on the main ring is so heavy that the Macintosh routers can't keep up. We don't really have any statistical, benchmark, or historic information that would tell us when the use of an AIR on a token ring is not going to offer the best performance. At this point, we just need to collect all the information and then take it one step at a time.
The LANalyzer traces wouldn't do us any good, because we have a Network General Sniffer. If we really need to see some trace data, we'll do some tests using a Sniffer on your end, or the LANalyzer data could be saved to an ASCII file and shipped to us on a tape. I don't think we need to worry about the trace data yet.