What’s in your SLA?

People have been weighing public (hosted) against private (on-premises) cloud solutions for some time in the messaging world, and at increasing rates for database and other application workloads.  I’m often surprised at how many people either don’t know the contents and implications of their service provider’s service level agreement (SLA), or fail to adjust the architecture of the private cloud solution accordingly before directly comparing cost.

Here are my five lessons for evaluating SaaS, PaaS, and IaaS provider SLAs:

Lesson 1: Make sure that what’s important to you is covered in the SLA

Lesson 2: Make sure that the availability guarantee is what you require of the service

Lesson 3: Evaluate the gap between a service outage’s cost to business and the financial relief from the provider

Lesson 4: Architect public and private clouds to the same levels of availability for cost estimate purposes

Lesson 5: Layer availability features onto private clouds for business requirement purposes

I’ll use the Office 365 SLA to explore this topic – not because I want to pick on Microsoft,  but because it’s a very typical SLA, and one of the services it offers (email) is so universal that it’s easy to translate the SLA’s components into the business value that you’re purchasing from them.

Defining availability

The math is simple.  It’s a 99.9% uptime guarantee with a periodicity of one month:

Monthly Uptime % = ((total minutes in the month – downtime minutes) ÷ total minutes in the month) × 100

If that number falls below 99.9, then they have not met their guarantee.  For what it’s worth, during a 30 day month, the limit will be about 44 minutes of downtime before they enter the penalty, or about 8.7 hours per year.
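Working the guarantee backwards into a downtime allowance is straightforward.  A quick sketch, assuming the 99.9% figure and a 30-day month (an average-length month pushes the numbers slightly higher, to the ~44 minutes and ~8.7 hours above):

```python
# Translate an uptime guarantee into a monthly/yearly downtime allowance.
guarantee = 99.9  # percent uptime guaranteed per month

minutes_per_month = 30 * 24 * 60           # 43,200 minutes in a 30-day month
allowed_fraction = (100 - guarantee) / 100  # fraction of the month that may be down

downtime_per_month = minutes_per_month * allowed_fraction  # minutes
downtime_per_year = downtime_per_month * 12 / 60           # hours

print(f"{downtime_per_month:.1f} minutes/month")  # 43.2 minutes/month
print(f"{downtime_per_year:.1f} hours/year")      # 8.6 hours/year
```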

But what does “Downtime” mean?  Well, it’s stated clearly for each service.  This is the definition of downtime for Exchange Online:

“Any period of time when end users are unable to send or receive email with Outlook Web Access.”

Here’s what’s missing:

  • Data:  The mailbox can be completely empty of email the user has previously sent and received.  In fact, the email can disappear as soon as they receive it.  As long as they can log in via OWA, the service is considered to be “up”.
  • Clients:  Fat Outlook, BlackBerry, and Exchange ActiveSync (iPhone/iPad/Windows Phone, and most Android) clients are not covered in any way under the SLA.

Lesson 1: Make sure that what’s important to you is covered in the SLA

Lesson 2: Make sure that the availability guarantee is what you require of the service

Balancing SLA penalties with business impact

My Internet service is important to me.  When it’s down, I lose more productivity than the $1/day or so I spend on it.  Likewise, email services are probably worth more than the $8/month/user or so that you might pay your provider for them.  That doesn’t mean that you should spend more than you need for email services.  But it does mean that if you do suffer an extended or widespread outage, there will likely be a large gap between the productivity cost of the downtime and the financial relief you’ll see from the provider in the form of free services.


Callahan Auto Parts also offers a guarantee

I’ll put this in real numbers.  Let’s say I have a 200 person organization.  I might pay $1600/month for email services from a provider.  If my email is down for a full day during a 30-day month, my organization experiences about 96.7% uptime for that month, and as a result is entitled to a 50% service credit from the provider, worth about $800.


The actual cost of my downtime will very likely exceed $800.  To calculate that cost we need the number of employees, the loaded cost per hour for the average employee, and the productivity cost of the loss of email services.  For our example of 200 employees, let’s imagine a $50/hour average loaded cost to business and a 25% loss of productivity when email is down:

200 employees x $50 cost per hour x .25 productivity loss x 8 hour outage = $20,000 of lost productivity

Subtract the $800 in free services the organization will receive the next month, and the organization’s liability is $19,200 for that outage.
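That gap calculation is easy to parameterize for your own organization.  A sketch, assuming a 25% productivity loss and the other illustrative figures above (none of these are real pricing):

```python
# Estimate the gap between an outage's productivity cost and the SLA credit.
employees = 200
loaded_cost_per_hour = 50.0   # average loaded cost per employee, per hour (assumed)
productivity_loss = 0.25      # fraction of productivity lost while email is down
outage_hours = 8
sla_credit = 800.0            # e.g., 50% of a $1600 monthly bill

lost_productivity = employees * loaded_cost_per_hour * productivity_loss * outage_hours
gap = lost_productivity - sla_credit

print(f"Lost productivity: ${lost_productivity:,.0f}")  # Lost productivity: $20,000
print(f"Uncovered gap:     ${gap:,.0f}")                # Uncovered gap:     $19,200
```

Plug in your own headcount, loaded cost, and productivity-loss estimate; the point is that the credit rarely moves the needle.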

Now how do you fill that gap?  I’m not entirely sure.  It could be just the risk of doing business – after all, the business would just absorb that cost if they were hosting email internally and suffered an outage.  If the risk and impact were large enough, I would probably seek to hedge against it – exploring options to bring services in-house quickly, or even looking to an insurance company to defray the cost of outages – if Merv Hughes can insure his mustache for $370,000, then surely you can insure the availability of your IT services.  Regardless, it’s wise not to confuse a “financially backed guarantee” with actual insurance or assurance against outage.

File Photo:  What a $370k mustache may look like.  Strong.

Lesson 3: Evaluate the gap between a service outage’s cost to business and the financial relief from the provider

Comparing Apples to Oranges


See what I did there?

Doing a cost comparison between public cloud designed to deliver 99.9% availability and a private cloud designed to provide 99.99% or 99.999% availability makes little sense, but I see people do it very frequently.  Usually it’s because the internal IT group’s mandate is to “make it as highly available as possible within the budget”.  So I’ll see a private cloud solution with redundancy at every level, capabilities to quickly recover from logical corruption, and automated failover between sites in the event of a regional failure, compared to a public cloud solution that provides nothing but a slim guarantee of 99.9% availability.  In this instance, it’s obvious why the public cloud provider is less expensive, even without factoring in efficiencies of scale.
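As a rough illustration of why those extra nines cost money, here’s how redundancy changes availability.  This is a simplified model that assumes copies fail independently – something real systems rarely achieve, as the gmail story later in this post shows:

```python
# Availability math for redundant components (independent-failure assumption).
def parallel(avail: float, copies: int) -> float:
    """Availability of N redundant copies: the system is down only
    when every copy is down simultaneously."""
    return 1 - (1 - avail) ** copies

single = 0.99  # one component at two nines
print(f"1 copy:   {single:.6f}")               # 1 copy:   0.990000
print(f"2 copies: {parallel(single, 2):.6f}")  # 2 copies: 0.999900
print(f"3 copies: {parallel(single, 3):.6f}")  # 3 copies: 0.999999
```

Each added copy buys roughly two more nines on paper, but each copy also roughly doubles the hardware and licensing bill – which is exactly the cost the redundancy-everywhere private cloud carries and the bare 99.9% public offering doesn’t.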

To illustrate this, I usually refer to Maslow’s handy-dandy Hierarchy of Needs, customized for IT high availability.


Single Site and Multi-site Hierarchies of Need

If I want to make an accurate comparison between a public cloud provider’s service and pricing and what I can do internally, I often have to strip out a lot of the features that an internal deployment normally delivers.  Here are the steps:

  1. Architect for equivalence.  If I have a public cloud provider offering just three nines and no option for site to site failover for my database services, I might just do a standalone database server.  Maybe I’d add a cheap rapid recovery solution (like snapshots or clones) to hedge against complete storage failure and cluster at the hypervisor layer to provide some level of hardware redundancy.  If my cloud provider offers disaster recovery, I’d figure out what their target RPO/RTO is and insert some solution that matches that capability.
  2. Do a baseline price comparison.  Once I’ve got similar solutions to compare, I can compare price.  We’ll call this the price of entry.
  3. Add capabilities to the private cloud solution after the baseline.  I only start layering features that add availability and flexibility to the solution after I’ve obtained my baseline price.  Only then can I illustrate the true cost of those features, and compare them to the business benefits.

Lesson 4: Architect public and private clouds to the same levels of availability for cost estimate purposes

Lesson 5: Layer availability features onto private clouds for business requirement purposes

S#!t Happens (Or: What we can learn from the latest gmail outage)

So here's evidently what happened:

Some time around February 27, gmail was affected (mid-upgrade) by a bug that effectively deleted the mail data associated with about 40,000 email accounts.  Now, Google maintains multiple copies of users' data, so this bug affected all the available copies of the data for these users.  Google had the foresight to back up their data rather than relying on data replication as its sole protection against data loss, but that backup data resides on tape, which clearly takes time to restore.  Just to give you an idea of how much time, you need an idea of the scale of the data loss.  If each of those users had 5GB of data in their mailboxes, the restore operation involves about 200TB of data - not unmanageable, but clearly something that would take on the order of days to weeks to restore unless something really very cool is used.

One of the interesting aspects of the restore process is that users report having no access to the email services while their data is being restored.  An Exchange administrator would have the ability to spin up some dial tone databases and use something like recovery storage groups or a more robust tool like Ontrack PowerControls to merge data from the backup sets back into the dial tone databases.
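To get a feel for those timescales, here's a back-of-the-envelope restore estimate.  The mailbox size comes from the scenario above; the throughput figure is a purely hypothetical assumption, not anything Google has published:

```python
# Back-of-the-envelope tape restore time for a large data-loss event.
accounts = 40_000
gb_per_mailbox = 5                           # assumed average mailbox size
total_tb = accounts * gb_per_mailbox / 1000  # total data to restore, in TB

restore_rate_tb_per_hour = 1.0  # hypothetical aggregate tape restore throughput

hours = total_tb / restore_rate_tb_per_hour
print(f"{total_tb:.0f} TB -> ~{hours / 24:.1f} days at "
      f"{restore_rate_tb_per_hour} TB/hour")  # 200 TB -> ~8.3 days at 1.0 TB/hour
```

Even a generous aggregate throughput puts the restore in "days, not hours" territory, which matches what users experienced.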

Now, this is not a schadenfreude post.  I have a lot of respect for Google, especially around how they've transformed messaging, and provided consumers with a very viable and attractive alternative to what was a pretty miserable corner of IT when it was first introduced in 2004 (my, how time flies, huh?). They've delivered a remarkably reliable infrastructure for a massive number of users at an incredible price.

However, as a technologist, I'd like to look at what happened, as well as the users' reactions to get an idea of how I can architect messaging systems so that when stuff inevitably hits the fan, the impact can be minimized.

  1. Avoid backups at your own risk.  It's tempting, especially with three, four, or five copies, to think to yourself "Well, how many copies do I need before I don't need to back up?"  The fact is that all of those copies are in a single failure domain.  As my friend and colleague Jim Cordes says, "This will work, up to the point where it won't."  In this case, a storage bug (likely associated with Google Filesystem) created data loss.  But it could as easily have been an application bug, administrator error, or a security breach.  Fortunately, the data also resided outside of the failure domain (on tape).  It's generally advisable that critical data be available outside the context of the application.
  2. Users don't just care about service availability - they care about their data too.  Many people live in their email accounts.  Whether we like it or not, their email account is where they keep their most important data.  So make sure your SLAs (either internal or with a service provider) cover data availability and not just service availability.
  3. Users don't just care about service and data availability - they care about metadata too. Complaints about the loss of starred emails and labels abound.  This shouldn't surprise us.  If people live in their email accounts, then they'll organize them.  Think of it like a filing cabinet.  If your "backup" of your filing cabinet entails copying everything and putting it all in a fireproof canister, then "restoring" those files to a usable state where you can actually find something is going to be a problem.  This could be a problem for folks who use a compliance archive as a last-ditch resort for data restore.
  4. Set distinct SLAs for service and data/metadata availability.  "Distinct" doesn't mean "different" in this context.  With a robust email solution you can get service availability up quickly and cheaply after a disaster.  Getting the actual data back is the longer pole in the tent, and where the bulk of investment is required.  If you tier your data through the use of archives, cost can be mitigated by assigning different SLAs for the active and archive data.
  5. Make sure your backup/restore solution meets SLAs for data/metadata availability.  As we've seen, even the best-run organizations running top notch software can experience data loss, even when multiple copies of the data are deployed.  If the copy of the data outside of the application failure domain takes 100 hours to restore, then that's the SLA you can sign.  If a disaster requires that you restore many terabytes of data and you have a data availability SLA of under a day, then it's advisable to look at hardware-based snapshots or bookmarks as a solution.
  6. Fast backups ≠ Fast restores.  There are many solutions out there that help people meet aggressive backup windows (incremental forever with synthetic fulls are widely available, for example).  Administrators and managers are well-advised to examine the restore speeds of these solutions.  Basically, if the solution calls for the full dataset to be moved from one place to another in order for it to be used, then you need to examine the restore speed.
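The last two points boil down to a simple feasibility check: can the restore path move the full dataset back within the SLA window?  A sketch, with all parameters hypothetical:

```python
# Check whether a restore path can meet a data-availability SLA.
def restore_meets_sla(dataset_tb: float, restore_tb_per_hour: float,
                      sla_hours: float) -> bool:
    """True if moving the full dataset back fits inside the SLA window."""
    return dataset_tb / restore_tb_per_hour <= sla_hours

# 50 TB from tape at 0.5 TB/hour against a 24-hour data-availability SLA:
print(restore_meets_sla(50, 0.5, 24))   # False - look at snapshots/bookmarks
# A snapshot "restore" that just promotes a bookmark moves data far faster:
print(restore_meets_sla(50, 500, 24))   # True
```

Run the check with your restore speed, not your backup speed - as the last point above notes, they are rarely the same number.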