Our Net Went Down for an Hour, and I Survived!

or

How I stopped worrying and learned to love the cloud


In recent weeks and months, nearly every major cloud provider seems to have had some kind of outage.  Whenever a cloud service demonstrates unreliability, it sends customers, potential customers, journalists, and commentators into a brief tizzy of hand-wringing concern.  The concern is understandable, but nearly always overstated.  There is a media effect at work -- cloud outages are public events, while outages of on-premises services usually go unnoticed outside the company that suffers them.

This isn't to say that cloud outages aren't important, but rather that we need to be a bit clearer in our thinking about them.  Outages in information services are never good, whether or not those services are cloud based.  But such outages are also inevitable, cloud or no.  Statistical thinking can be unnatural and challenging, but anecdotes shed no light on the most important question:  Is there any reason to believe that cloud services will be less reliable, overall, than services provided on premises?

At a superficial level, the answer seems to be yes.  After all, the simplest way to provide a cloud service is to take the same service and software that was previously running on premises, and move it to a cloud environment.  It seems inevitable that this change would make the service less reliable, because you've added a new potential failure mode (wide area network outages) without losing any existing failure modes.  However, this is neither the correct nor even the usual way of moving a service to the cloud.
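To put rough numbers on that "superficial" argument, here's a minimal sketch.  It assumes components fail independently, so that a service needing all of them has an availability equal to the product of theirs.  Every figure in it is made up purely for illustration; none is measured data from any provider.

```python
from math import prod

def series_availability(components):
    """Availability of a service that needs every component to be up,
    assuming the components fail independently of one another."""
    return prod(components)

# Hypothetical availabilities (fraction of time each component is up).
on_prem_stack = [0.999, 0.998]   # e.g. servers and local administration
wan_link = 0.9995                # the added wide area network dependency

on_prem = series_availability(on_prem_stack)
lifted = series_availability(on_prem_stack + [wan_link])

print(f"on premises:       {on_prem:.4%}")
print(f"same stack + WAN:  {lifted:.4%}")
```

Adding a component can only lower the product, which is exactly the naive case against the cloud.  The calculation only holds, though, if nothing else in the stack changes when the service moves.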

Cloud services might instead be expected to be more reliable, overall, than non-cloud services, if and when there are aspects of cloud services that increase reliability more than the risk of occasional network outages decreases it.  I see two major areas where it is reasonable to hope for such a gain:

-- Professional, centralized administration.   Any sysadmin can tell you that, remarkably often, a serious problem or outage turns out to be due to ripple effects from a naive or mistaken action by a junior, amateur, or overburdened administrator, often in a remote location.  A system is only as reliable as its weakest link.  If a move to the cloud reduces the chances of outages triggered by distributed, overburdened, or undertrained administrators, it will therefore improve the overall reliability of the system.  The centralized, professional administration of cloud services isn't just a money saver for clients, but can also increase service reliability.

-- Rationalization of diverse IT services.  In most enterprises, IT services have grown over time as a hodgepodge of loosely or poorly interconnected services.  But nearly every time one of your company's people takes on the job of integrating disconnected systems, they're doing a bare-bones job of it for reasons of time, training, and sometimes inclination.  In contrast, a cloud provider has a tremendous motivation to get system integration right, once and for all, so that it doesn't have to do it over and over again.  The result is that if you move a set of services to the cloud, you are likely to end up with a simpler, more manageable, and more reliable underlying implementation.  (That, in fact, is why Mimecast has been slower than some providers to add non-email data to our archive; we aren't willing to just bolt on some older mechanism, but are holding out for tight, designed-in integration.)

Sounds great, right?  Not so fast.  The real problem is that there are inevitably both good and bad providers in the cloud.  What I've just written will be true of the best of them, I firmly believe.  But if you make the wrong choice and go with a less skillful cloud provider, you might actually see a decrease in reliability when you move to the cloud.  So the real bottom line is that moving to the cloud could make things better or worse, depending on the choices you make.  But how do you choose?

One way NOT to choose is based on the most recent anecdotes about service outages.  The best of services will still have occasional outages, and the worst will still have long stretches of apparent reliability.  When you're trying to gauge the relative frequency of fundamentally rare events, you can't trust a short sample period.  The solitary fact that Microsoft had a recent outage of Office 365, or Google of Google Docs, etc., tells you nothing useful for deciding whether or not to rely on their service.
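Here's a quick sketch of why a short sample misleads.  It assumes, purely for illustration, that outages arrive as a Poisson process, and it compares two hypothetical providers whose outage rates I've invented: a "good" one averaging one outage a year and a "bad" one averaging four.

```python
from math import exp

def p_no_outage(rate_per_year, years):
    """Probability of zero outages in a window, assuming outages
    follow a Poisson process with the given average yearly rate."""
    return exp(-rate_per_year * years)

# Hypothetical outage rates, not data about any real provider.
good_rate = 1.0   # outages per year
bad_rate = 4.0    # outages per year
window = 0.25     # a three-month observation period, in years

p_good_looks_bad = 1 - p_no_outage(good_rate, window)
p_bad_looks_good = p_no_outage(bad_rate, window)

print(f"good provider has an outage this quarter: {p_good_looks_bad:.1%}")
print(f"bad provider has a clean quarter:         {p_bad_looks_good:.1%}")
```

Under these assumptions, the good provider has an outage in roughly one quarter out of five, and the bad provider gets through more than a third of its quarters spotlessly.  A single quarter's headlines can easily rank them the wrong way round.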

To be fair, even Mimecast's near-decade record of high reliability, though about as long-term as cloud data gets, is a smaller sample size than I'd like to see.   The cloud industry is new enough that there's not nearly enough useful data, especially given the skyrocketing quantities and rates of data that cloud vendors are currently absorbing.

Ultimately, I believe there's no substitute for a deep look at the inside of your cloud provider.  But since most customers can't do that -- and wouldn't really want to if they could -- we need independent auditors and standards to which cloud providers can be held.  This is why Mimecast has been active in extending the ISO 27000 standards for cloud services, for example.  We want it to be easy for our customers to get independent, expert confirmation of our operational excellence.

Recently, there's been another suggested solution, the so-called "supercloud" architecture <a href=http://www.itpro.co.uk/636067/all-hail-the-supercloud>
described by Tom Brewster in ITPro.</a>  In this model, applications would migrate seamlessly among independent cloud providers in the event that one suffered an outage.  While this sounds wonderful, I am very skeptical that it could be made to work in practice for most IT services.  While it's easy to allow compute-intensive processes to migrate, as shown by such services as SETI@Home, things get much trickier for data-intensive services.  A service that includes archiving, such as Mimecast's, can't really migrate to another provider unless the data migrates as well, which in our case means petabytes of data.  It's not realistic to migrate that much data after the outage occurs, of course, but it also isn't practical to constantly replicate it across independent providers in anticipation of a rare outage.
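For a sense of scale, here's a back-of-the-envelope sketch of the transfer time alone.  The one-petabyte size and the fully dedicated 10 Gbit/s link are assumptions I've picked for illustration, and the calculation ignores protocol overhead, verification, and re-indexing.

```python
def transfer_days(size_bytes, link_bits_per_second):
    """Days needed to push size_bytes over a link running flat out,
    ignoring protocol overhead and any other traffic."""
    seconds = size_bytes * 8 / link_bits_per_second
    return seconds / 86_400  # seconds in a day

# Hypothetical figures: one petabyte over a dedicated 10 Gbit/s link.
one_petabyte = 1e15   # bytes
link_speed = 10e9     # bits per second

days = transfer_days(one_petabyte, link_speed)
print(f"{days:.1f} days per petabyte")
```

Even in this best case, each petabyte takes more than nine days to move, far longer than any outage you'd be migrating to escape.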

Ultimately, if you can achieve reliability using multiple cloud providers, you can get the same reliability, more affordably, from a single provider that uses a similarly redundant architecture.   Is your cloud provider doing that?  It's hard to tell, and I know of no better way to judge than independent standards and audits, and long-term availability data of the kind most companies won't be able to offer for another decade.  
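The reliability half of that claim is simple arithmetic, sketched below.  It assumes the redundant copies fail independently, which is a strong assumption in real life, and the 99.9% figure is invented for illustration.

```python
from math import prod

def redundant_availability(replicas):
    """Availability of a service that stays up as long as at least one
    replica is up, assuming the replicas fail independently."""
    return 1 - prod(1 - a for a in replicas)

# Hypothetical: two replicas, each available 99.9% of the time.
# The formula is the same whether the replicas are two independent
# providers or two independent sites run by a single provider.
two_replicas = redundant_availability([0.999, 0.999])

print(f"one replica:  {0.999:.4%}")
print(f"two replicas: {two_replicas:.4%}")
```

The math doesn't care who owns the replicas; what matters is whether their failures really are independent, and that is precisely the part an outsider can't easily verify.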

Given the advantages the cloud can offer today, it's probably a mistake to avoid the cloud (though life-critical services have special needs from cloud providers).  Whether services become more reliable or less so, in the end the change will likely be small, and will at worst be offset by the other benefits of cloud computing.