[nSLUG] linux home or workplace automation and Universal Powerline Bus

George N. White III gnwiii at gmail.com
Sun Sep 21 16:35:22 ADT 2008


On Sun, Sep 21, 2008 at 3:20 PM, Daniel Morrison <draker at gmail.com> wrote:

> 2008/9/21 George N. White III <gnwiii at gmail.com>:
>> On Sat, Sep 20, 2008 at 9:15 PM, Daniel Morrison <draker at gmail.com> wrote:
>
>> One of the problems with setting ports to full-100 was that equipment
>> moves around a lot and the windows drivers for the most common PC
>> interface (Intel) didn't work with fixed settings until last January's update.
>
> Vendor bugs... <sigh>
> All the more reason to use auto-neg!
>
>> This is what is supposed to be done, but sometimes the PC still ends up
>> set for half-duplex, and sometimes the switches get set for autonegotiation.
>> I suspect the fixed settings may be lost after updates or maybe just reboots
>> of the switches.
>
> If the settings are changing after they have been 'set', the only thing I
> can assume is that they have not been set properly.  On Cisco switches
> it's important to 'write' after making a change, so that it's not lost on
> reboot.  (Or maybe another sysadmin's configuration practices are
> overwriting your changes??)
>
>>> I think the best option is to set auto-negotiation on everywhere, unless
>>> there is a specific bug with some vendor's equipment, in which case set
>>> the best rate at both ends, and document it!
>>
>> I think that is true if you have lots of systems (Win XP) that don't handle
>> fixed settings, but in principle fixed settings everywhere should
>> be more reliable.
>
> Using fixed settings everywhere would certainly be more reliable than
> auto-negotiation everywhere -- except that it is very difficult for staff
> to keep track of and configure every port correctly, _especially_ when
> equipment moves around.  In this situation, unless there is strict
> adherence to policy and documentation, it very quickly changes from 'more
> reliable' to 'shot in the dark'.

Apple's advice (use either fixed-half or autoneg) should ensure that
a system configured for autoneg would still work with a fixed-half switch.
This is for the Mac Pro, which does have two interfaces, so maybe
two half-1000 connections are good enough for most applications.

>> We tried that, but in practice we keep finding
>> duplex mis-matches, so Apple may be right that if we used fixed
>> it has to be half-duplex.
>
> ??? If the equipment can do full, there's no reason not to use it.  If
> Apple's default after failed auto-negotiation is half, then they're asking
> you to set to half on the switch so that new equipment will work "out of
> the box".  But unless their 'forced mode' settings are faulty, why not use
> full duplex?  Again, if everything is just set to auto-negotiate it may
> work out better.
>
>> Within the past month I have caught two different
>> linux distros and Mac OSX (all using Intel interfaces) connected at
>> half-duplex despite having been configured for full.
>
> Don't know what to tell you.  I don't trust any distribution's 'automatic'
> method for configuring an interface... all this non-standard
> /etc/sysconfig/network/ garbage and weird 'ethtool' programs that do who
> knows what.  If there's any doubt, get mii-diag and configure it manually
> early in the boot (and maybe again late in the boot, if you can't disable
> the distro's own broken configuration).  mii-diag is at
>   ftp://ftp.scyld.com/pub/diag/
> Compile commands are at the end of mii-diag.c.  You'll also want libmii.c
> for full functionality.

<http://www.mjmwired.net/kernel/Documentation/networking/e100.txt>
advises using ethtool.
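
Since the half-duplex connections were only caught by chance, a periodic
check might help pin down when the setting changes.  This is only a rough
sketch: it assumes ethtool is installed and prints a "Duplex:" line for the
interface, and the function names and the eth0 interface name are made up
for illustration.

```shell
# Read `ethtool IFACE` output on stdin and print the value of its Duplex line.
duplex_of() {
    sed -n 's/.*Duplex: *//p' | head -n 1
}

# Warn (and return non-zero) unless the interface reports full duplex.
check_duplex() {
    iface=$1
    duplex=$(ethtool "$iface" 2>/dev/null | duplex_of)
    if [ "$duplex" != "Full" ]; then
        echo "WARNING: $iface duplex is '${duplex:-unknown}', expected Full"
        return 1
    fi
}

# Forcing the setting by hand (as root) would look like:
#   ethtool -s eth0 speed 100 duplex full autoneg off
# and a cron job could then run:
#   check_duplex eth0
```

Logging the warning with a timestamp would show whether the change lines up
with the "cable disconnect" messages.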

I wonder if the unwanted changes are coming from udev/hal and networkmanager
(which I don't intentionally use on the systems in question, but could have
crept in via some update as they seem to be the default these days)
dealing with the "cable disconnect".  If I was using NM on a system with
multiple interfaces I would expect it to try starting another interface (like
really unplugging a laptop cable).  I assume something similar happens
on Apple laptops, so maybe the Mac Pro gets the same treatment as their
laptops.

>> With SGI we build a custom kernel with fixed-100 set, and never find
>> them using half, so it looks like using ethtool is not enough --
>> something that happens on the switch must be causing the interface to
>> reset to half duplex.
>
> Sorry that I'm coming on a bit strong here, it just seems to me that weird
> things are happening to you which should not happen.  If I understand what
> you just wrote above: SGI is forced to 100-full in the kernel, and always
> works.  Linux boxes are forced with ethtool, and get reset somehow to
> half-duplex.

The SGIs still have "ethernet cable disconnected" messages, but it
gets "connected" after a few seconds and the duplex settings are still
good.  There have been a few times when the system lost the default
route, so wouldn't talk to anything that wasn't on the same subnet.

> This does not suggest to me that the problem is on the switch.  The
> problem is on the Linux box!
>
> Do you use dhcp?  Could dhcpcd be resetting the interface?  Maybe add an
> ethtool or mii-diag command to force the interface in the dhcpcd if-up
> script.

Linux is a mix of fixed and dhcp, SGI and Apple are fixed.

> Unlike on a Cisco switch, where the config is saved with 'write', AFAIK
> there is no way to permanently set the mode on Linux (although compiling
> out any other modes from the kernel driver as you've done on SGI is a neat
> trick!) (also, some NICs have a configurable 'default' that can be written
> to NVRAM.  Don't know which ones though...).  Anyway, you're dependent on
> a run-time configuration program to set the mode for you.  Any power loss
> to the interface could result in it reverting to its non-forced default
> -- which is often auto-negotiate.  But if the switch is forced, it's a
> crap shoot again.

I suppose I should check the linux drivers and try setting options when
the module is loaded.  If that fails, I can think about building custom
kernels.  I see only a few options for e100, but for e1000:

$ modinfo e1000
[...]
description:    Intel(R) PRO/1000 Network Driver
[...]
parm:           TxDescriptors:Number of transmit descriptors (array of int)
parm:           RxDescriptors:Number of receive descriptors (array of int)
parm:           Speed:Speed setting (array of int)
parm:           Duplex:Duplex setting (array of int)
parm:           AutoNeg:Advertised auto-negotiation setting (array of int)
parm:           FlowControl:Flow Control setting (array of int)
parm:           XsumRX:Disable or enable Receive Checksum offload (array of int)
parm:           TxIntDelay:Transmit Interrupt Delay (array of int)
parm:           TxAbsIntDelay:Transmit Absolute Interrupt Delay (array of int)
parm:           RxIntDelay:Receive Interrupt Delay (array of int)
parm:           RxAbsIntDelay:Receive Absolute Interrupt Delay (array of int)
parm:           InterruptThrottleRate:Interrupt Throttling Rate (array of int)
parm:           SmartPowerDownEnable:Enable PHY smart power down (array of int)
parm:           KumeranLockLoss:Enable Kumeran lock loss workaround (array of int)
parm:           copybreak:Maximum size of packet that is copied to a new buffer on receive (uint)
parm:           debug:Debug level (0=none,...,16=all) (int)
parm:           eeprom_bad_csum_allow:int
parm:           eeprom_bas_csum_allow:Allow bad eeprom checksums
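
Going by the Speed and Duplex parameters listed above, a load-time setting
could be generated along these lines.  This is a sketch, not something I
have tested on these machines: the duplex codes (0 = auto, 1 = half,
2 = full) are taken from the kernel's e1000.txt, and the helper name and the
modprobe.d file name are my own assumptions (some distros use
/etc/modprobe.conf instead).

```shell
# Print a modprobe "options" line forcing speed/duplex for the e1000 driver.
# Duplex codes per the driver documentation: 0 = auto, 1 = half, 2 = full.
e1000_options() {
    speed=$1
    case $2 in
        half) duplex=1 ;;
        full) duplex=2 ;;
        *) echo "usage: e1000_options SPEED half|full" >&2; return 1 ;;
    esac
    printf 'options e1000 Speed=%s Duplex=%s\n' "$speed" "$duplex"
}

# Example (as root; the target file name is an assumption):
#   e1000_options 100 full > /etc/modprobe.d/e1000.conf
```

The parameters are arrays, so a machine with several e1000 ports would
presumably need comma-separated values, one per port.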

What is a Kumeran lock?

> Do machines ever hibernate or sleep?  That might power cycle the
> interface, but fail to re-run ethtool when it wakes up.

No, they tend to be pretty busy.

>> Maybe if the switch configuration is updated the port goes
>> to auto-neg temporarily and then back to the fixed setting (when
>> we check, it is always fixed).
>
> Seems unlikely... guess it would depend by switch manufacturer though.
> Even if it did... this wouldn't affect the other end if it's properly
> forced!
>
>> What would this do for other machines on that switch -- could it explain
>
> If the loop is entirely within one switch, it (crosses fingers) shouldn't
> cause any trouble for other users.  If the loop involves two or more
> switches, then all their uplink connections are part of the loop, so
> traffic between switches may be momentarily suspended.  But there shouldn't
> be any 'ethernet disconnected' messages to individual ports, I don't
> think.
>
>> "ethernet disconnected" messages?  Does anybody know what causes
>> those (when the cable has not been touched, at least on the user end).
>
> - duplex mismatch
> - speed mismatch
> - bad/flakey NIC
> - bad/flakey switch port
> - bad/flakey patch panel or cable/BIX job/mis-wiring/overlong cable run
> - network driver bugs
> - network storm
> - MAC address conflict (?)
>
> (just my initial ideas).

SGI never came up with a good answer, but said it occurs with some switches
and can often be fixed with a new system board, but we have seen the issue
with 4 different systems with 5 different system boards.   The SGIs use a very
early 100baseT chipset, so it is not surprising that it has a few
idiosyncrasies.

> I wanted to add 'heavy network load' to that list, but it's really an
> exacerbating factor.  Many of the above list may not be noticed until a
> large amount of data is pushed through the pipe.

Correct, but much of our work is I/O bound, and more so as processors
get faster.

>>> Or, if for some reason (power glitch?) some intermediate switch or
>>> transceiver flickers out for a moment, it may take the devices on either
>>> side 30-40 seconds to re-enable the link.
>>
>> I think that is what we get, plus the load from all those PC's booting at once.
>
> Definitely I would be very concerned about your '800 desktops rebooting at
> once' issue.  If there's any way you could arrange staggered boots, either
> by staggering the power, or maybe... each system does a 'sleep
> <myroomnumber>/100' early in the boot sequence!

I'll suggest that.
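
For what it's worth, the 'sleep <myroomnumber>/100' idea could be sketched
as below.  The function name and the ROOM_NUMBER variable are hypothetical;
how each desktop would learn its room number is left open.

```shell
# Boot delay in seconds derived from a room number (integer division by 100).
boot_delay() {
    echo $(( $1 / 100 ))
}

# Early in the boot sequence (ROOM_NUMBER is a hypothetical per-machine setting):
#   sleep "$(boot_delay "$ROOM_NUMBER")"
```

With integer division, room 312 waits 3 seconds, so all machines whose room
numbers share the same hundreds still start together; a different divisor
would be needed for a wider spread.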

>> Thanks.  It certainly gives me some ideas for questions to ask, but in the
>> end it will take a lot to make me comfortable with relying on switched
>> ethernet for time-critical control functions during power "events".
>> STP should increase the chances that the systems that run on generator
>> will still have internet connectivity, so you want it, but 30-40 second
>> interruptions are not good when trying to get non-critical systems to
>> shut down.  In practice, however, some of them will probably be hung on
>> some network process and won't shut down until the network comes
>> back.
>
> Although I appreciate your concerns, it's still confusing.  The
> reliability of switched ethernet may be debatable, but the aspect of
> "power events" should have ZERO effect on that debate.  If it does, your
> UPS/generator/PDU systems have got issues.  STP should not need to do
> anything during power events, unless part of your network equipment is not
> on a UPS -- in which case: that's your trouble.

There is a UPS in every wiring closet, but a number have failed when put to
the test, and they don't get generator power.   I have seen lots of UPS
failures, so I know that it takes some real effort to keep them reliable
(e.g., buy the expensive models that handle big spikes and have good
self-tests, and make sure batteries are replaced on schedule).  The problem
is that an ethernet outage is a significant risk when you are trying to
shed A/C load so that the machine room doesn't overheat before the A/C can
catch up.  (This is not the main machine room with the big databases, but a
smaller room with machines used to run numerical models, remote sensing and
other data collection systems that need to keep going 7/24.)

> Just my (strongly voiced I realize!) two cents... it may very well be that
> I'm uninformed on something important, so feel free to jump in guys! (Who
> am I kidding, of course you will).



-- 
George N. White III <aa056 at chebucto.ns.ca>
Head of St. Margarets Bay, Nova Scotia


