[nSLUG] XEN & Heartbeat
nslug at fop.ns.ca
Thu Jun 4 09:06:37 ADT 2009
On Wed, 3 Jun 2009, Michael Crawford wrote:
> On Wed, Jun 3, 2009 at 4:27 PM, Hatem Nassrat <hnassrat at gmail.com> wrote:
>> I have briefly looked at fault tolerance during my MACS at dalhousie
>> and I remember reading about systems which can even handle hardware
>> failures by having standby replicas of the hardware within the machine
> While it's quite likely an urban legend, it was said that one could
> fire a shotgun through a Tandem box without it going down.
It would be possible to do that, but you'd have to cross your fingers that
you don't take out more than half of any individual set of components (i.e.
if both connections from one CPU to the Dynabus go, you're toast for that CPU;
keep going and the machine runs out of processors to migrate to and goes
> Tandem specialized in redundant, fault-tolerant hardware. But they
> were also very expensive, so I don't think they're around anymore.
They were bought out by HP (slightly ironic, since I believe the guys that
set up Tandem originally escaped from HP). Last I heard the NonStop stuff
was all on Itanium so god only knows how that works now (the Tandem
hardware was extremely proprietary when I got a chance to poke at it
rather longer ago than I care to admit).
> There are lots of hardware products designed to enable various kinds
> of fault-tolerance, but again they are all very expensive.
The law of diminishing returns hits quickly. Most people do not need the
level of guaranteed availability that Tandems and their ilk provide; for
high availability, a fully hot-swappable (disk/CPU/RAM) machine is
enough - I have some AIX machines in the data centre with multi-year
uptimes, as hardware is just swapped out when it fails.
> Whether the expense is worth it depends on what the cost of your
> downtime or data loss would be.
Requiring this level of availability is rare: it takes both lots of money
and a major risk if there's an outage, which is why Tandems were used
a lot in banking transaction handling, for example. For stateless web
services and the like my inclination would be to scale horizontally with
cheap Linux boxes in an LVS-style setup.
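
To make the horizontal-scaling idea concrete, here is a rough sketch of what a round-robin director does when a box dies. It is a toy illustration only, not LVS itself: the class name RoundRobinPool, its methods, and the host names web1..web3 are all made up for the example, and a real setup would do this in the kernel with health checks driven by something like Heartbeat.

```python
from itertools import cycle


class RoundRobinPool:
    """Toy stand-in for a director doing round-robin scheduling.

    Hypothetical names throughout; this is not the LVS API.
    """

    def __init__(self, backends):
        self.backends = list(backends)
        self.healthy = set(self.backends)
        # Endless iterator over the backends in their configured order.
        self._ring = cycle(self.backends)

    def mark_down(self, backend):
        # Called when a health check fails; the box is skipped from now on.
        self.healthy.discard(backend)

    def mark_up(self, backend):
        # Called when a repaired box passes its health check again.
        if backend in self.backends:
            self.healthy.add(backend)

    def pick(self):
        # Return the next healthy backend in rotation.
        if not self.healthy:
            raise RuntimeError("no healthy backends left")
        for backend in self._ring:
            if backend in self.healthy:
                return backend


pool = RoundRobinPool(["web1", "web2", "web3"])
first_four = [pool.pick() for _ in range(4)]
pool.mark_down("web2")  # simulate a cheap box failing
after_failure = [pool.pick() for _ in range(3)]
```

Because the service is stateless, losing web2 only shrinks the rotation; nothing has to be migrated, which is why cheap boxes are good enough here.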
And regarding your later comment on spec docs, I am in complete agreement
- it reminds me of http://www.fastcompany.com/node/28121/print? which
expresses very well why the somewhat casual attitude to documentation
prevalent in the industry leads to buggy and unreliable software in the