Hosting again

Mon Oct 29 19:51:19 GMT 2007

On 29/10/2007, Martin A. Brooks <martin at antibodymx.net> wrote:
>
> Alistair McGlinchy wrote:
> > You've just hit one of my favourite niggles.  Why do you want redundant
> > networks and sites AND redundant memory, processors and power supplies
> > within a site? [*]  This is a waste of money, you end up spending 4 times the
> > money for only N/2 resilience. If you want extra resilience, add another
> > site or buy components with better MTBF.
>
> I disagree.  A second power supply for one of our mail filtering nodes
> costs about £100.  Another server costs about £2000.
> We're spread across 4 different physical locations, none of which I want
> to visit because a server has been downed for want of a working PSU.

Eh?  You have three other working sites, so the users don't see a problem.
You have to go to the site to fix the fan at some point anyway, so I'm not
sure what your "visit" point adds.

Let's assume the MTBF for a server is 6 months, i.e. one fault every 6
months. I'd also assume that a fan could be fixed in a server in less than
four hours, but it sounds like you're driving to the site yourself, so assume
1 day to fix the fan. That gives a 1/182 = 0.55% chance of a second server
failure during the fault-fix outage.
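
A minimal sketch of that estimate in Python. The function name is made up for
illustration, and it assumes failures arrive at a constant rate, so the chance
of a second failure is roughly the fix window divided by the MTBF:

```python
def second_failure_probability(mtbf_days: float, fix_days: float) -> float:
    """Approximate chance that another server fails during the fix window.

    Assumption: a constant failure rate of one fault per `mtbf_days`, with a
    window short enough that window / MTBF is a fair approximation.
    """
    return fix_days / mtbf_days


# 6 months taken as 182 days, 1 day to fix the fan (figures from the text).
p = second_failure_probability(mtbf_days=182.0, fix_days=1.0)
print(f"{p:.2%}")  # prints 0.55%
```

With a four-hour fix instead of a full day, the same function gives a
proportionally smaller risk.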

Scenario 1: Two fans per server:  If one fan fails the server stays up and
an engineer fixes the fault within the next 24 hours and service is
restored.

Scenario 2: One fan per server:  If the fan fails, the server goes down but
the service runs at the other site automatically. The service is at risk of
going offline until the engineer has completed the fix.

Suppose the second server does go down sometime during the 24hr fix. On
average this fault will last 12 hours.  So in order for the extra £100 fan
to be cost effective you need to be making profit from this server at a rate
greater than £100/0.55%  = £18,250 in 12 hours.   So unless you're making
£13 million *profit* per year off this service, you're losing on the deal.
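
The break-even sum can be sketched in Python as below. The function and
variable names are hypothetical, and the 6-month MTBF is taken as 182.5 days,
which is what reproduces the £18,250 figure:

```python
def break_even_profit(part_cost: float, p_second_failure: float,
                      outage_hours: float) -> tuple[float, float]:
    """Profit needed for a redundant part to pay for itself.

    Returns (profit at stake per outage, the same rate scaled to a year).
    Assumption: the part only earns its keep if a second failure actually
    happens during the fix window, with probability `p_second_failure`.
    """
    per_outage = part_cost / p_second_failure
    outages_per_year = 365 * 24 / outage_hours
    return per_outage, per_outage * outages_per_year


# £100 part, 1-in-182.5 chance of a second failure, 12-hour average outage.
per_outage, per_year = break_even_profit(100.0, 1 / 182.5, 12.0)
print(f"£{per_outage:,.0f} per 12 hours")  # prints £18,250 per 12 hours
print(f"£{per_year:,.0f} per year")        # prints £13,322,500 per year
```

Plugging in a longer MTBF or a shorter fix time pushes the break-even profit
even higher.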

Add in the fact that more fans means
- more things to break and hence need site visits (something you were keen
on reducing)
- more spares needed in the back office
- more power consumption.
Then do you want dual RAID controllers, dual SAN connections, dual ethernet
for redundant NICs (or multi-homed and redundant NICs)?  At £100 - £1000 a
pop you're almost halfway to a new server.

One more point that needs to be added. How many times have you logged on to
a server to get a wall message or pop-up saying "Foo is busted" and discovered
it has been so for the last 4 weeks? The real problem with resilience is
that if you don't monitor for it, you don't have it.   Spotting that servers
are down is easy; getting the SuperCool2000's fan monitoring software to
integrate with your systems management tool is hard.

Cheers

Alistair