Fails

When our services are failing, it hurts. It puts knots in our stomachs. And when we can't immediately tell our customers what the problem is, we want to pull our hair out.

We work hard every day planning and managing risk so that our services don't fail. But they do fail sometimes, and when they do it is emotionally and physically painful. It hurts our pride, and we know it hurts our clients and, perhaps worse, costs them money.

At the same time, we have worked hard in our business, with our clients and our staff, to be realistic and to pay attention to the math.

Unfortunately, machines break. More to the point, complex systems made up of many kinds of resources, tools and programs fail with certainty.

Part of our work is predicting that certainty. Surprisingly, our historical data and adherence to industry best practices let us predict failure rates fairly reliably, even as we do our very best to avoid failures altogether.

For our current $50-$100 monthly SureDesk price, we target 99.9% reliability. Our trailing 12-month uptime is 99.75%, which works out to roughly 13 hours more annual downtime than our target, but is within our expectations given the size and age of our SureDesk 3.0 service. Our SureMail service had similar uptime in its first years and has been fully functional more than 99.9% of the time over the last 18 months.

Our target uptime is a function of the amount of redundancy built into the system, as well as our expertise managing it. We already have a great deal of redundancy and failover, along with the testing that goes with it. But to achieve 99.99% uptime or better, which means less than one hour of expected downtime per year, we would need to at least triple our entire current infrastructure. That would mean an exact duplicate of everything we currently have set up in a second location, plus a built and tested process to fail over to that location without interrupting our users.
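For anyone who wants to check our math, here is how those percentages translate into downtime, assuming a 24x7 year of 8,760 hours:

99.90% uptime = 8,760 hours x 0.0010, or about 8.8 hours of downtime per year
99.75% uptime = 8,760 hours x 0.0025, or about 21.9 hours of downtime per year
99.99% uptime = 8,760 hours x 0.0001, or about 53 minutes of downtime per year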

We would love to put this in place, and indeed plan to do so, but the cost of that level of service will need to exceed our current pricing.

Given this reality, our commitment for now is to use our current expertise and resources to deliver the best realistic uptime we can with the redundancy we are able to implement at our current price point, and to keep iterating on those resources and failover processes until we can reliably deliver our 99.9% target.

In yesterday's case, a backup process that touches one of our disk subsystems was running a faulty job that consumed substantially all of our network and disk resources. The key problem is that our normal error monitoring and alarms did not catch the malfunctioning behavior, and it took us almost a full work day to identify the culprit and shut those processes down.

Over the next week we will improve our alerts around this process, and we can make three absolute guarantees:

  1. This particular problem will never cause the same outage again.
  2. Something we don't plan for will invariably bite us where it hurts from time to time, and we'll tell you about it when it does.
  3. We will continue to use the latest technology and expertise to ensure that we keep offering, as we do now, the best, most flexible and reliable cloud services available anywhere at the $50-$100 price point we're committed to serving.

We appreciate your comments, feedback and suggestions anytime.
