Hardware Reliability

Simon Wilcox essuu at ourshack.com
Mon Jun 8 11:17:29 BST 2009


On 8/6/09 10:40, duncan.garland at ntlworld.com wrote:
> ---- Raphael Mankin <raph at mankin.org.uk> wrote: 
>> On Sun, 2009-06-07 at 12:13 +0100, Duncan Garland wrote:
>>
>>> I wonder if the problem can be approached from the other end. I wonder if
>>> there is a design standard (ISO or such like) which states that a
>>> manufacturer should aim for an MTBF of whatever.
>>>
>>> I'll let you know if I find anything.
>> MTBF, when quoted, is largely meaningless. The figures are computed,
>> purely theoretical. No-one actually runs a sufficiently large number of
>> items for long enough to get meaningful statistics. If they did, they
>> would miss the market. 
>>
>> Imagine having to run, say, 10000 disk drives for five years in order to
>> get meaningful MTBFs before you could put them on sale.
>>
>> Only people like Google, Microsoft or Yahoo actually have sufficient
>> data, and all they can tell you that is *useful* is that some
>> manufacturers are, in the long term, better than others. Nothing about
>> models that are not obsolete.

> Calculated MTBF figures are not meaningless because they show what
> the manufacturer expected. The manufacturers base their warranty
> programmes and even whether or not to go into production on them. Do you
> know where I can get some?

They're mostly meaningless though. An MTBF is only a mean, and we know 
that some drives fail within days of installation, so other drives must 
last years past the date the MTBF might suggest.
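
To put rough numbers on that, here is a minimal sketch. It assumes a
constant failure rate (the exponential model, which by construction
ignores the early failures mentioned above), and the 1,000,000 hour MTBF
is a hypothetical figure picked for illustration, not one taken from any
datasheet. The function names are my own.

```python
import math

HOURS_PER_YEAR = 24 * 365  # 8760, ignoring leap years


def annual_failure_probability(mtbf_hours: float) -> float:
    """Chance that one unit fails within a year of continuous use,
    assuming a constant failure rate of 1 / mtbf_hours."""
    return 1.0 - math.exp(-HOURS_PER_YEAR / mtbf_hours)


def fraction_surviving(mtbf_hours: float, hours: float) -> float:
    """Fraction of a large population still running after `hours`,
    under the same constant-failure-rate assumption."""
    return math.exp(-hours / mtbf_hours)


mtbf = 1_000_000  # hypothetical quoted MTBF, in hours

print(f"Fails within a year: {annual_failure_probability(mtbf):.2%}")
print(f"Still running at the MTBF: {fraction_surviving(mtbf, mtbf):.1%}")
```

Under this model a quoted MTBF translates into a yearly failure
probability for a population, and only about 37% of units would still be
running at the MTBF itself, so it says very little about how long any
one drive will last.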

Also, hardware itself is rarely the only factor these days. Software 
faults in firmware are just as likely to cause downtime (in my 
experience) and, as far as I know, that's not allowed for in any MTBF 
calculations.

My rule of thumb is that most kit installed in a datacentre will last 3 
years if it lasts a week, but once you start seeing disk errors you 
should plan to replace the disks. In my experience, replacing fault-free 
kit just because it's a certain age usually ends up causing more 
problems, not fewer, as a percentage of new kit will die in the first 
week of operation.

If your kit isn't in a datacentre (you didn't say how much kit or what 
sort of location you're interested in) then you're more likely to see 
fan faults or motherboard issues from sucking in half a pound of dead 
skin cells than you are to see a hard drive failing.

S.

