Ahead of the current™
Piercing the Reliability Myth – The Math Matters
Posted on: August 31st, 2012 by Bill

I’m going to apologize to my readers for this one – it’s long and mathematics-based. But it does assert that there’s math to refute or support certain design or operational habits. Here’s what happens when you spend a lot of hours on a plane and you have smart and challenging colleagues to kick some ideas around with.

This all started about three months ago over a casual lunch with David Shirmacher, the SVP of Technical Operations at Digital Realty Trust (as if Dave and I have ever done anything casually). Dave is one of my favorite guys in the business. We’ve worked together for a long time, and he’s got one of the best minds in the field. The subject of reliability and maintenance frequency came up during lunch at the Salt House in San Francisco. The debate focused on both the frequency of maintenance activities and the complexity of each specific, multi-step activity. That got me going on where this all started and what quantitative basis there is for the industry’s well-ingrained behaviors.

 While pundits sit and preach about system reliability, they fail to recognize one salient point:  It’s not about reliability, it’s about availability. 

 No one cares what happens behind the curtain in Oz if your data keeps right on flowing.

 An old saying in the data center business is that you don’t design systems for how they work – you design them for how they fail and how they are to be maintained.  While individual system focus has driven the reliability of components to very high Sigma levels, it’s how applications, platforms, systems and facilities are brought together, interact and are changed that drives your application availability.  So, what are the arguments and where did this all begin?

Reliability calculations were born in 1966 during the Space Race in the form of MILSPEC Handbook 472, at the behest of the military’s aerospace programs. On page 2, 472 defines the tenets we use today – Mean Time Between Failures (MTBF) and Mean Time To Repair (MTTR). Go ahead and Google “MILSPEC 472”; the handbook is available as a free PDF download on several websites. Another excellent resource for the most modern treatment of 472 is BICSI International Supplemental Information 001, Methodology for Selecting Data Center Design Class Utilizing Performance Criteria. It’s available on the BICSI website at https://www.bicsi.org/dcsup/Default.aspx (it’s free, but you have to fill out a request form). There, we’ve moved to performance-based metrics backed by the mathematics in this article.

MILSPEC 472 was the first effort to develop the basics of system and component uptime, and it remains a seminal work. While the scorecards, matrices and network analysis methods are still quite relevant today, 472 does have its drawbacks. It focuses on the examination of one system, not a series of physically separated but logically and operationally joined systems. Nor does 472 consider concurrent maintenance and operations risk; in the data center world, routine human intervention is both required and carries a varying risk proposition.

Our availability assumptions have been reinforced by the historical performance of a variety of system architectures and features. This manifests itself in the anecdotal Sigma levels you see published for a variety of electrical and mechanical system architectures. However, those Sigma levels only consider the MTBF and MTTR for the system, with no outside influences such as manually-initiated maintenance operations. While 472 was developed for electronics, aircraft and spacecraft, a system failure in those assemblies – short of a catastrophic failure from a series of events or force majeure (like a missile to the fuselage) – simply means the system is offline, to be replaced by a squadron spare. In the data center business, that squadron spare is the instantly-available redundant system that stands ready to assume a processing, storage, network, power or cooling load should a platform, component or connection fail.

When you actually look at the probability mathematics on page 1-16, it’s a compounding multiplier of:

 P(A1) x P(A2) x P(A3)…

While this drives the steady state to very high Sigma levels and a near-zero theoretical outage value, it says nothing about manually-induced incidents resulting from maintenance or modifications, or, in the case of software, new code loads. And it’s that frequency of change, coupled with the competency of the effort, that has the most profound effect on your system availability.
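To see how that compounding multiplier plays out, here’s a minimal sketch. The availability figures are placeholders I made up for illustration, not published values for any particular piece of gear:

```python
# Sketch: compounding series availability, P(A1) x P(A2) x P(A3)...
# The availability figures below are illustrative placeholders only.

from functools import reduce

component_availability = {
    "utility_plus_generator": 0.999999,
    "ups_system":             0.9999995,
    "distribution":           0.9999999,
    "cooling":                0.999999,
}

# Steady-state availability of the series chain is the product of the parts.
series_availability = reduce(lambda a, b: a * b, component_availability.values())

print(f"Series availability:  {series_availability:.7f}")
print(f"Theoretical downtime: {(1 - series_availability) * 525600:.2f} minutes/year")
```

On paper the chain looks bulletproof – which is exactly the problem: that product knows nothing about the hands that touch the gear.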

Most facility guys happily follow the manufacturers’ edicts for maintenance frequency. Most facilities will run their generators monthly. But if you experience a real power outage, does that count as the gen test? If you test weekly, would it be better to test monthly, or vice versa? If you test monthly, would it be better to test quarterly, assuming your batteries, fuel and other support systems appear to be in excellent health? The fact is that every time you have to maintain a piece of equipment, you change the facility’s steady-state operation.

Let’s swing into the math. You may possess exceptional topological redundancy in your system and find comfort in the Law of Large Numbers (LLN) when looking to minimize incidents in your facility. LLN tangentially recognizes the fact that one activity in any large pool will simply go haywire – the dreaded Black Swan event. Or as I call it, Bad S*** Happens to Everyone at Least Once.

LLN speaks to normalized returns over a large data sample pool and a long period of time. While the Black Swan event is not likely to occur, best practice is to change our behavior and design choices to reduce exposure to the “Swan” as much as practical. Don’t get too comfortable. Let’s now throw in the Gambler’s Fallacy. The Gambler’s Fallacy, or the fallacy of the maturity of chances, is rooted in the false belief that when deviations from the expected outcome are observed in repeated independent trials of some random process (like a die roll), future deviations in the opposite direction become more likely.

Take a moderately large set of repeating events, like a 16-sided die rolled 16 times chasing a single win, or the 16 quarterly maintenance cycles on your 4 UPS modules in a year. One might expect that, having had no failure so far in the set, the probability of a failure would decrease as you proceed through the finite count. That’s not true. Each event is independent, and the outcome of the succeeding event is not known until it’s over.
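If you want to convince yourself of that, a quick simulation settles it. This is a throwaway sketch, using the same 1-in-16 odds as the die above:

```python
# Sketch: the Gambler's Fallacy in miniature. Each trial is independent,
# so the chance of a "hit" on the next roll is the same whether or not
# the previous rolls were clean.

import random

TRIALS = 1_000_000
SIDES = 16          # the 16-sided die from the example
hits_after_clean_streak = 0
streaks_observed = 0

for _ in range(TRIALS):
    rolls = [random.randint(1, SIDES) == 1 for _ in range(4)]
    if not any(rolls[:3]):            # first three rolls were all clean
        streaks_observed += 1
        if rolls[3]:                  # did the fourth roll hit anyway?
            hits_after_clean_streak += 1

print(f"P(hit on roll 4 | rolls 1-3 clean) ~ "
      f"{hits_after_clean_streak / streaks_observed:.4f}  (expected 1/16 = 0.0625)")
```

No matter how clean the streak, the next trial still comes up about 6.25% of the time.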

That 16-sided die stands in for the requisite but separate maintenance or switching activities you undertake in your data center every year. The chance of scoring a success somewhere in the series is:

1 – (chance of missing on any single event) ^ (number of events remaining in the series)

The math of the dice versus the UPS modules is very different for one reason – the number of transactions within the activity. The chance of scoring the win somewhere in a 16-roll series is 1 – (15/16)^16 = 64.39%. Miss the first roll, and the chance of a win in the remaining 15 rolls drops to 1 – (15/16)^15 = 62.02%, and so forth. But when you look at a single UPS module undergoing quarterly maintenance, you have to consider the individual steps (transactions) within the activity as noted in the MOP. It’s not a single event. Assume 10 major switching or maintenance steps per activity and four activities per year – 40 transactions for that one module – with a 1-in-40 chance of a misstep on any given transaction. The chance of getting through the whole year without a single incident is (39/40)^40 = 36.3%; after the first transaction goes cleanly, the chance of a clean run through the remaining 39 is (39/40)^39 = 37.3%. Stretch that across the hundreds of transactions a facility racks up over its life and the chance of a flawless record drops to essentially zero. Herein lies the math verifying that less maintenance is better.
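For the skeptical, here is the same napkin math as a short sketch. The 1-in-16 and 1-in-40 odds are the illustrative assumptions from above, not measured failure rates – plug in your own incident history to make it real:

```python
# Sketch of the napkin math above. A "hit" is the event of interest:
# the winning face on the die, or an incident during a maintenance step.
# p_miss is the chance any single trial passes WITHOUT a hit.

def p_at_least_one_hit(p_miss: float, k: int) -> float:
    """Chance of at least one hit somewhere in the next k independent trials."""
    return 1 - p_miss ** k

def p_clean_run(p_miss: float, k: int) -> float:
    """Chance of getting through all k trials without a single hit."""
    return p_miss ** k

# The 16-sided die, rolled 16 times, chasing one winning face:
print(f"Win somewhere in 16 rolls:        {p_at_least_one_hit(15/16, 16):.2%}")  # ~64.4%
print(f"Win somewhere in the last 15:     {p_at_least_one_hit(15/16, 15):.2%}")  # ~62.0%

# One UPS module: 10 switching steps per quarterly MOP = 40 steps/year.
# Assume (purely for illustration) a 1-in-40 chance of a misstep per step.
print(f"Incident-free year (40 steps):    {p_clean_run(39/40, 40):.1%}")   # ~36.3%
print(f"Incident-free decade (400 steps): {p_clean_run(39/40, 400):.4%}")  # essentially zero
```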

We all acknowledge that zero maintenance is not acceptable, but neither is over-maintenance. A balance must be struck between component count, maintenance frequency and system complexity to yield the lowest risk to uptime. Predictive maintenance certainly addresses activity frequency, since service is applied only when the system requires it. Less frequent but still necessary maintenance, or having fewer components to maintain, might actually be your best approach. Regardless of when a maintenance activity is undertaken, each activity poses a risk. And the way the mathematics works, the total count of these maintenance activities has the most significant impact on system availability. The fact is that past success or failure does not promise future success or failure – each remains possible in every independent event. That one event might be the one that drops only the “+1” of the service, or it may be the one that drops the load or the facility. The Black Swan has come to roost.

Some low-risk activities, like checking oil on the gens or non-invasive IR scanning, pose little risk to the concurrent maintenance and operation of the system. Taking a UPS module offline, by contrast, takes dozens of steps in an MOP to complete successfully. Each of those tasks carries a different weight in the uptime calculation. Performing a closed-transition key-interlock circuit breaker transfer poses a far greater risk than pressing the “UPS Module to Static Bypass” button (as long as you choose the correct module). The key point here is to reduce both the total number of activities or changes of state on any given system and the riskier activities if you wish to radically reduce your uptime risk.
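To see why the risky steps dominate, here’s a minimal sketch; the per-step risk figures are invented purely for illustration:

```python
# Sketch: not all MOP steps carry the same weight. The per-step risk
# figures below are invented for illustration only.

mop_steps = [
    ("Verify load and alarms (read-only)",            0.0001),
    ("Transfer UPS module to static bypass",          0.0020),
    ("Rack out module input breaker",                 0.0050),
    ("Closed-transition key-interlock breaker swap",  0.0150),
    ("Return module to normal operation",             0.0020),
]

# The activity survives only if every step goes cleanly.
p_activity_clean = 1.0
for step, p_fail in mop_steps:
    p_activity_clean *= (1 - p_fail)

print(f"Chance the whole MOP completes cleanly: {p_activity_clean:.2%}")

# Drop the single riskiest step and the picture changes noticeably:
p_without_worst = p_activity_clean / (1 - max(p for _, p in mop_steps))
print(f"Same MOP without the riskiest step:     {p_without_worst:.2%}")
```

In this made-up example, eliminating the one closed-transition transfer does more for the activity’s odds than trimming every benign step combined.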

 Let’s look at a Tier/Class III 2N UPS data center with a single generator and no swing generator.  Assume:

  • One dedicated generator plus one redundant (N+1) generator.
  • Two UPS modules per “A” and “B” side, for a total of four modules, distributed parallel.
  • Quarterly UPS module invasive maintenance.
  • Annual invasive UPS module maintenance.
  • One UPS module automatically going to static bypass annually.
  • Monthly non-invasive generator maintenance.
  • Quarterly non-invasive generator maintenance.
  • Annual invasive generator maintenance.
  • One Utility outage per two years.
  • Two circuit breaker switching operations per year.
  • Annual IR scan on the main boards, 48 breakers total.

This total event profile consists of 66 minor maintenance activities and 197 major maintenance or system changes of state per year. Over a ten-year period, that equals over 2,600 activities or system changes. If you reduce the system content to one module per side (from 4 to 2 on the total module count) and change the UPS modules and generators to one major maintenance cycle per year, you drop the counts to 35 minor activities and 6 major activities, for a total of 41 activities per year.
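The book-keeping behind those totals is simple enough to script. A sketch follows – the per-activity step counts are my own assumptions for illustration, so the tallies won’t land exactly on the 66/197 split above (which rests on more detailed MOP breakdowns), but the method is the same:

```python
# Sketch of the book-keeping behind the event profile. The per-activity
# step counts here are assumptions for illustration only.

baseline = [
    # (description, activities per year, invasive state-changing steps per activity)
    ("UPS quarterly invasive maintenance, 4 modules",    4 * 4,  10),
    ("UPS annual invasive maintenance, 4 modules",       4 * 1,  10),
    ("Generator monthly non-invasive checks, 2 gens",    2 * 12, 0),
    ("Generator quarterly non-invasive checks, 2 gens",  2 * 4,  0),
    ("Generator annual invasive maintenance, 2 gens",    2 * 1,  8),
    ("Utility outage, one per two years",                0.5,    4),
    ("Planned breaker switching operations",             2,      2),
    ("Annual IR scan of 48 main-board breakers",         1,      0),
]

minor = sum(n for _, n, steps in baseline if steps == 0)
major = sum(n * steps for _, n, steps in baseline if steps > 0)

print(f"Minor (non-invasive) activities per year: {minor:.0f}")
print(f"Major state changes per year:             {major:.0f}")
print(f"Ten-year exposure:                        {(minor + major) * 10:.0f}")
```

Rerun the tally with two modules instead of four and annual rather than quarterly invasive maintenance, and you can watch the exposure collapse – which is exactly the point of the next paragraph.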

If you simply make a minor topology change and modify your generator maintenance habits, you can reduce your exposure to human-induced incidents by 64.6%. Having two fewer UPS modules in the center alone decreases critical load switching operations by ONE HUNDRED AND TWELVE (112) activities per year, or 1,120 switching operations over a 10-year period. This is seriously good stuff!

 So, here’s what we see:

  1. Fewer components mean fewer maintenance activities – thereby reducing the risk of downtime from human error and intervention. Non-invasive scanning methods, like ExerTherm, will allow you to monitor your system without opening cabinets or exposing your staff to arc flash hazards.
  2. To avoid killing a “side” when maintaining the UPS bypass, utilize a cross-tie between the systems at the UPS input or output in a 2N system.
  3. There’s a reason that most of the more sophisticated data center wholesale operators use Trane Intellipaks.  While each company puts their own “pimp” on the units, they are simple, robust and, as we say about a 1970s GM car, anyone can fix them.
  4. The more dynamic the system is, the more likely it is to drop a load. Follow the suggestion of the ANSI/BICSI 002 data center best practices manual – use only one-step transfers or switching operations to arrive at the next steady state in any maintenance or failure mode of operation. Do not make the automatic system response depend on something else happening. In other words, avoid complexity. The KISS principle certainly applies to data center design.
  5. The likelihood of new code failing is inversely proportional to the degree of burn-in and test/debug it underwent. State machines are very important. Unfortunately, unless coders are classically trained in how to develop a complete state machine, “putting things back to the normal state” in any given failure scenario is not a common practice (see the sketch after this list). Most software developers focus only on the most likely outcomes. The successful ones try to create the unexpected (see Netflix’s Chaos Monkey). Don’t forget that any of those PLCs in your switchgear or paralleling gear require software development as well. Buyer beware if design, test and commissioning do not follow best practices for software development. Outages will occur.
  6. Most of the truly epic data center outages have not been due to power or cooling – they’ve been due to the network or software. Knight Trading, the Security Pacific ATM outage, the Bank America Securities outage and the “fail-to-failover” at the central offices during 9/11 were all root-caused to the network architecture or software systems.
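On point 5, “complete” means every failure path has an explicit route back to the normal state, not just the happy path. Here’s a minimal sketch of what that looks like – the states, events and transitions are invented for illustration, not pulled from any real UPS controller:

```python
# Sketch: a deliberately complete state machine for a UPS controller.
# Every state, including the failure state, has an explicit transition
# back toward NORMAL -- nothing is left as an undefined dead end.
# States and events are invented for illustration.

from enum import Enum, auto

class State(Enum):
    NORMAL = auto()
    STATIC_BYPASS = auto()
    MAINTENANCE_BYPASS = auto()
    FAULT = auto()

# (current state, event) -> next state. Any pair not listed is rejected.
TRANSITIONS = {
    (State.NORMAL, "overload"):                     State.STATIC_BYPASS,
    (State.NORMAL, "begin_maintenance"):            State.MAINTENANCE_BYPASS,
    (State.NORMAL, "internal_fault"):               State.FAULT,
    (State.STATIC_BYPASS, "load_normal"):           State.NORMAL,
    (State.STATIC_BYPASS, "internal_fault"):        State.FAULT,
    (State.MAINTENANCE_BYPASS, "end_maintenance"):  State.NORMAL,
    (State.FAULT, "fault_cleared"):                 State.STATIC_BYPASS,  # path back toward NORMAL
}

def step(state: State, event: str) -> State:
    nxt = TRANSITIONS.get((state, event))
    if nxt is None:
        # Undefined combinations are refused loudly instead of silently ignored.
        raise ValueError(f"No transition defined for {state.name} on '{event}'")
    return nxt

# Walk a failure scenario all the way back to normal:
s = State.NORMAL
for ev in ["internal_fault", "fault_cleared", "load_normal"]:
    s = step(s, ev)
print(s)   # State.NORMAL
```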
