[nSLUG] XEN & Heartbeat

Dop Ganger nslug at fop.ns.ca
Thu Jun 4 09:06:37 ADT 2009


On Wed, 3 Jun 2009, Michael Crawford wrote:

> On Wed, Jun 3, 2009 at 4:27 PM, Hatem Nassrat <hnassrat at gmail.com> wrote:
>> I have briefly looked at fault tolerance during my MACS at dalhousie
>> and I remember reading about systems which can even handle hardware
>> failures by having standby replicas of the hardware within the machine
>> itself.
>
> While it's quite likely an urban legend, it was said that one could
> fire a shotgun through a Tandem box without it going down.

It would be possible to do that, but you'd have to cross your fingers that
you don't take out more than half of any individual set of components (i.e.
if both connections from one CPU to the Dynabus go, you're toast for that
CPU; keep going and the machine runs out of processors to migrate to and
goes down).
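
To make that failure logic concrete, here is a rough toy model in Python. It
is only a sketch of the reasoning above, not of how a real Tandem works: the
class name, the link names "X" and "Y", and the two-CPU setup are all
assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Cpu:
    name: str
    # Assumption: each CPU has two independent bus links, labelled X and Y.
    bus_links: dict = field(default_factory=lambda: {"X": True, "Y": True})

    def alive(self) -> bool:
        # A CPU stays usable while at least one of its bus links works.
        return any(self.bus_links.values())


def machine_up(cpus) -> bool:
    # The machine stays up while there is a CPU left to migrate work to.
    return any(cpu.alive() for cpu in cpus)


cpus = [Cpu("cpu0"), Cpu("cpu1")]

cpus[0].bus_links["X"] = False   # one link gone: cpu0 is still reachable
assert cpus[0].alive()

cpus[0].bus_links["Y"] = False   # both links gone: cpu0 is toast
assert not cpus[0].alive() and machine_up(cpus)

cpus[1].bus_links["X"] = False
cpus[1].bus_links["Y"] = False   # no processors left: the machine goes down
assert not machine_up(cpus)
```

The point is just that a single hit is survivable, while losing every member
of one redundant set is not.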

> Tandem specialized in redundant, fault-tolerant hardware.  But they
> were also very expensive, so I don't think they're around anymore.

They were bought out by Compaq, which was in turn bought by HP (slightly
ironic, since I believe the guys that set up Tandem originally escaped from
HP). Last I heard the NonStop stuff was all on Itanium, so god only knows
how that works now (the Tandem hardware was extremely proprietary when I got
a chance to poke at it, rather longer ago than I care to admit).

> There are lots of hardware products designed to enable various kinds
> of fault-tolerance, but again they are all very expensive.

The law of diminishing returns hits quickly. Most people do not need the
level of guaranteed availability that Tandems and their ilk provide; for
high availability, a fully hot-swappable (disk/CPU/RAM) machine is enough.
I have some AIX machines in the data centre with multi-year uptimes, as
hardware is just swapped out when it fails.

> Whether the expense is worth it depends on what the cost of your
> downtime or data loss would be.

Requiring this level of availability is rare: it takes a combination of
lots of money and a major risk if there's an outage, which is why Tandems
were used a lot in banking transaction handling, for example. For stateless
web services and the like, my inclination would be to scale horizontally
with cheap Linux boxes in an LVS-style setup.

And regarding your later comment about spec docs, I am in complete
agreement. It reminds me of http://www.fastcompany.com/node/28121/print?
which expresses very well why the somewhat casual attitude to documentation
prevalent in the industry leads to buggy and unreliable software in the
first place.

Cheers... Dop.


