What’s in your SLA?

People have been considering and comparing public (hosted) and private (on-premises) cloud solutions for some time in the messaging world, and at increasing rates for database and other application workloads.  I’m often surprised at how many people either don’t know the contents and implications of their service provider’s service level agreement (SLA), or fail to adjust the architecture of their private cloud solution before directly comparing cost.

Here are my five lessons for evaluating SaaS, PaaS, and IaaS provider SLAs:

Lesson 1: Make sure that what’s important to you is covered in the SLA

Lesson 2: Make sure that the availability guarantee is what you require of the service

Lesson 3: Evaluate the gap between a service outage’s cost to business and the financial relief from the provider

Lesson 4: Architect public and private clouds to the same levels of availability for cost estimate purposes

Lesson 5: Layer availability features onto private clouds for business requirement purposes

I’ll use the Office 365 SLA to explore this topic – not because I want to pick on Microsoft,  but because it’s a very typical SLA, and one of the services it offers (email) is so universal that it’s easy to translate the SLA’s components into the business value that you’re purchasing from them.

Defining availability

The math is simple.  It’s a 99.9% uptime guarantee with a periodicity of one month:

image

If that number falls below 99.9, then they have not met their guarantee.  For what it’s worth, during a 30 day month, the limit will be about 43 minutes of downtime before they enter the penalty, or about 8.7 hours per year.
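
For anyone who wants to check those limits, here is a minimal Python sketch of the arithmetic.  It assumes a 99.9% guarantee (the public cloud figure used later in this post) and treats every minute of the period as covered time; the function names are mine, not anything taken from the SLA.

```python
MINUTES_PER_DAY = 24 * 60


def downtime_budget_minutes(uptime_pct: float, days: float) -> float:
    """Minutes of downtime allowed before uptime falls below uptime_pct."""
    total_minutes = days * MINUTES_PER_DAY
    return total_minutes * (1 - uptime_pct / 100)


def measured_uptime_pct(downtime_minutes: float, days: float) -> float:
    """Uptime percentage for a period with the given minutes of downtime."""
    total_minutes = days * MINUTES_PER_DAY
    return (total_minutes - downtime_minutes) / total_minutes * 100


monthly_minutes = downtime_budget_minutes(99.9, days=30)
yearly_hours = downtime_budget_minutes(99.9, days=365) / 60
print(f"{monthly_minutes:.1f} minutes per 30 day month")
print(f"{yearly_hours:.2f} hours per year")
```

The same function shows what a stricter target would allow, for example `downtime_budget_minutes(99.99, days=30)`.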

But what does “Downtime” mean?  Well, it’s stated clearly for each service.  This is the definition of downtime for Exchange Online:

“Any period of time when end users are unable to send or receive email with Outlook Web Access.”

Here’s what’s missing:

  • Data:  The mailbox can be completely empty of email the user has previously sent and received.  In fact the email can disappear as soon as they receive it.  As long as they can log in via OWA, the service is considered to be “up”.
  • Clients:  Fat Outlook, BlackBerry, and Exchange ActiveSync (iPhone/iPad/Winmopho, and most Android) clients are not covered in any way under the SLA.

Lesson 1: Make sure that what’s important to you is covered in the SLA

Lesson 2: Make sure that the availability guarantee is what you require of the service

Balancing SLA penalties with business impact

My Internet service is important to me.  When it’s down, I lose more productivity than the $1/day or so I spend on it.  Likewise, email services are probably worth more than the $8/month/user or so that you might pay your provider for them.  That doesn’t mean that you should spend more than you need for email services.  But it does mean that if you do suffer an extended or widespread outage, there will likely be a large gap between the productivity cost of the downtime and the financial relief you’ll get from the provider in the form of free services.

[Image: Callahan Auto Parts also offers a guarantee]

I’ll put this in real numbers.  Let’s say I have a 200 person organization.  I might pay $1600/month for email services from a provider.  If my email is down for a day during the month, my organization experiences about 96.7% uptime for that month, and as a result, my organization is entitled to a service credit from the provider worth about $800, or half a month of free email.

image

The actual cost of my downtime will very likely exceed $800.  To calculate that cost we need the number of employees, the loaded cost per hour for the average employee, and the productivity cost of the loss of email services.  For our example of 200 employees, let’s imagine a $50/hour average loaded cost to business and a 25% loss of productivity when email is down:

200 employees x $50 cost per hour x .25 productivity loss x 8 hour outage = $20,000 of lost productivity

Subtract the $800 in free services the organization will receive the next month, and the organization’s liability is $19,200 for that outage.
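
Here is a small Python sketch of that gap calculation, so you can plug in your own numbers.  The inputs are the assumptions from the example (200 employees, $50/hour, a 25% productivity loss, an 8 hour outage, $800 of credit); the function and parameter names are just illustrative.

```python
def outage_gap(employees: int,
               loaded_cost_per_hour: float,
               productivity_loss: float,
               outage_hours: float,
               service_credit: float) -> tuple[float, float]:
    """Return (lost productivity, amount not covered by the provider's credit)."""
    lost = employees * loaded_cost_per_hour * productivity_loss * outage_hours
    return lost, lost - service_credit


lost, uncovered = outage_gap(
    employees=200,
    loaded_cost_per_hour=50.0,
    productivity_loss=0.25,   # fraction of productivity lost while email is down
    outage_hours=8,
    service_credit=800.0,
)
print(f"Lost productivity: ${lost:,.0f}")
print(f"Not covered by the credit: ${uncovered:,.0f}")
```

The result is very sensitive to the productivity assumption: a 75% loss instead of 25% triples the lost productivity to $60,000, while the credit stays at $800.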

Now how do you fill that gap?  I’m not entirely sure.  It could be just the risk of doing business – after all, the business would just absorb that cost if they were hosting email internally and suffered an outage.  If the risk and impact were large enough, I would probably seek to hedge against it – exploring options to bring services in house quickly, or even looking to an insurance company to defray the cost of outages – if Merv Hughes can insure his mustache for $370,000, then surely you can insure the availability of your IT services.  Regardless, it’s wise not to confuse a “financially backed guarantee” with actual insurance or assurance against outage.

File Photo:  What a $370k mustache may look like.  Strong.

Lesson 3: Evaluate the gap between a service outage’s cost to business and the financial relief from the provider

Comparing Apples to Oranges

image

See what I did there?

Doing a cost comparison between a public cloud designed to deliver 99.9% availability and a private cloud designed to provide 99.99% or 99.999% availability makes little sense, but I see people do it very frequently.  Usually it’s because the internal IT group’s mandate is to “make it as highly available as possible within the budget”.  So I’ll see a private cloud solution with redundancy at every level, capabilities to quickly recover from logical corruption, and automated failover between sites in the event of a regional failure, compared to a public cloud solution that provides nothing but a slim guarantee of 99.9% availability.  In this instance, it’s obvious why the public cloud provider is less expensive, even without factoring in efficiencies of scale.

To illustrate this, I usually refer to Maslow’s handy-dandy Hierarchy of Needs, customized for IT high availability.

[Images: Single Site and Multi-site Hierarchies of Need]

If I want to make an accurate comparison between a public cloud provider’s service and pricing and what I can do internally, I often have to strip out a lot of the services that are normally delivered internally.  Here are the steps:

  1. Architect for equivalence.  If I have a public cloud provider just offering 3 9’s and no option for site to site failover, for my database services, I might just do a standalone database server.  Maybe I’d add a cheap rapid recovery solution (like snapshots or clones) to hedge against complete storage failure and cluster at the hypervisor layer to provide some level of hardware redundancy.  If my cloud provider offers disaster recovery, I’d figure out their target RPO/RTO and insert some solution that matches that capability.
  2. Do a baseline price comparison.  Once I’ve got similar solutions to compare, I can compare price.  We’ll call this the price of entry.
  3. Add capabilities to the private cloud solution after the baseline.  I only start layering features that add availability and flexibility to the solution after I’ve obtained my baseline price.  Only then can I illustrate the true cost of those features, and compare them to the business benefits.

Lesson 4: Architect public and private clouds to the same levels of availability for cost estimate purposes

Lesson 5: Layer availability features onto private clouds for business requirement purposes

SQL Licensing and Virtualization: Lemons and Lemonade

Virtualizing SQL Server is not a new conversation for anybody who’s been around the technology over the last few years.  But recently there’s been a new twist to the conversation regarding licensing.

Microsoft has changed the licensing model for SQL Server Enterprise Edition 2012 (SQLEE).  It’s important to remind people here that Enterprise Edition is the only way to get high availability with SQL Server.  With SQL 2008, you could license Enterprise Edition under the “Server+CAL” model.  This meant that customers could buy virtually unlimited SQL processing with Enterprise features and, by limiting the number of clients directly accessing SQL, they could limit the cost of the SQL licensing.

The Lemons (SQL Licensing Changes)

“Per core” licensing is the only option available to SQL 2012 EE customers.  In a virtual environment, it’s “per vCPU”.  What’s more, any server licensed for SQLEE needs a minimum of 4 licenses, regardless of the number of cores or vCPUs in use.  This is no surprise, and I don’t think any of us should hold this against Microsoft.  SQL Server is a very robust database platform, and its primary competitor has used this licensing model for many years now; Microsoft has been undervaluing their technology for the better part of a decade.

But let’s take a look at the kind of effect this can have on licensing costs.  Most of the time I see physical SQL Servers with 16 cores.  This makes sense – a decent server with 4 quad core processors is of pretty good value from almost any server vendor.  Through savvy application layer design, customers can limit the number of clients and devices directly querying the database; normally I see between 4 and 10 clients per server.  The list price for SQL 2008 EE was $8,500 per server, plus $150 per CAL.  So you could walk out the door with a highly available SQL Server for $10,000.  With SQL 2012 EE at $6,874 per core, this same server would cost $109,984 – nearly an 11x increase in price.

image

Even if you don’t have a processor intensive workload, this can still be costly.  Let’s say you can rip out 3 of those 4 processors and still run efficiently.  After all, not all business critical workloads are processor intensive, right?  But even the smallest possible server is going to cost nearly 3x the licensing of a 2008 Server+CAL model.
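
The two comparisons above can be reproduced with a few lines of Python.  This is only a sketch using the list prices quoted in this post and the 10 CAL case; the constant and function names are illustrative, and real quotes will differ with volume agreements.

```python
SQL2008_EE_PER_SERVER = 8_500   # list price per server, as quoted above
SQL2008_PER_CAL = 150           # list price per CAL, as quoted above
SQL2012_EE_PER_CORE = 6_874     # list price per core, as quoted above
MIN_CORE_LICENSES = 4           # minimum licenses per licensed server


def cost_2008_server_cal(cals: int) -> int:
    """SQL 2008 EE under the Server+CAL model."""
    return SQL2008_EE_PER_SERVER + cals * SQL2008_PER_CAL


def cost_2012_per_core(cores: int) -> int:
    """SQL 2012 EE under per core licensing, honoring the 4 license minimum."""
    return max(cores, MIN_CORE_LICENSES) * SQL2012_EE_PER_CORE


baseline = cost_2008_server_cal(cals=10)
for cores in (16, 4):
    cost = cost_2012_per_core(cores)
    print(f"{cores:>2} cores: ${cost:,} vs ${baseline:,} ({cost / baseline:.2f}x)")
```

Because of the 4 license minimum, `cost_2012_per_core(1)` and `cost_2012_per_core(4)` return the same figure, which is why removing processors only helps down to 4 cores.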

image

The Sugar (All-you-can-eat SQL Server Enterprise)

Microsoft has embraced virtualization with its new SQL licensing model.  I might need HA for a workload, but it might not be processor intensive.  The graph below shows a prime example, where the CPU at peak is 31% busy.

image

This particular server has 16 cores, and as we’ve seen, cores = money in the SQL 2012 world.  If this customer could find a way to use those idle cycles for other SQL workloads, then they can save some significant money ($63,000 to be exact).  There are a few different ways to do this:

  • Putting multiple databases in a single instance
  • Putting multiple instances on a single physical machine
  • Putting multiple virtual machines on a single physical host

Microsoft has embraced virtualization here, because if you license all the processors on a given physical machine for SQL Server Enterprise, you can put all the SQL Server Enterprise VMs you want on that physical machine.  It doesn’t matter which hypervisor you’re using: VMware, Hyper-V, Xen, or Joe’s Hyperific Emporium and Bait Shack’s HyperVisor.  As long as it’s on the SVVP list, you can do it.  The two big caveats are:

  • If a VM is running SQL Server, either all the cores on the physical box hosting it must be licensed, or the VM must be licensed in a per vCPU manner
  • If you have gone with the core model in a virtualized environment, you need Software Assurance, or you are limited to one physical server move per 90 days.

The Water (Sub-Cluster licensing)

One of the biggest mistakes people make is licensing all the cores in a hypervisor cluster, or going to the trouble of building a dedicated cluster for SQL Server.  The first is far more expensive than it needs to be, and the second eats into the flexibility and value proposition of virtualization.  Let’s say I have an 8 node cluster, with 16 cores each.  This would cost about $880,000 to license (128 cores at $6,874 each).  I might be tempted to create a smaller cluster just dedicated to SQL Server.  But then I would be creating a processing silo that couldn’t be shared amongst my other workloads.

The best option by far is sub-cluster licensing.  I could license just one of my servers and run as many VMs as can fit on that server.  I can still have HA, because passive nodes in a failover cluster do not need to be licensed.  If I move all my SQL VMs en masse to another host, I still have to license only one host.  Density is the key here.  Most people will do a 4:1 vCPU:Core ratio and still get by just fine.  Some of Microsoft’s models do up to a 12:1 vCPU:Core ratio, and if you can assure performance with that model, you can get absolutely fantastic savings.  This graph compares 4 physical servers with the minimum of 4 cores each to a virtualized infrastructure with 4 vCPUs per VM and a conservative 4:1 vCPU:core ratio.  Most shops I see will consolidate more aggressively for more savings.
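
Here is a rough Python sketch of the arithmetic behind that comparison.  It assumes the $6,874 per core list price quoted earlier, one VM replacing each of the 4 physical servers, and a single host sized to exactly the cores those VMs need, with every core on that host licensed.  Those sizing assumptions and the function names are mine, so treat the output as an illustration and not a quote.

```python
import math

PRICE_PER_CORE = 6_874      # SQL 2012 EE list price per core, as quoted earlier
MIN_CORE_LICENSES = 4       # minimum licenses per licensed server


def physical_core_licenses(servers: int, cores_per_server: int) -> int:
    """Core licenses when every physical server is licensed on its own."""
    return servers * max(cores_per_server, MIN_CORE_LICENSES)


def host_core_licenses(vms: int, vcpus_per_vm: int, vcpus_per_core: int) -> int:
    """Core licenses for one host sized to carry the VMs at a given density.

    Assumes all cores on that one host are licensed and no other host is.
    """
    cores_needed = math.ceil(vms * vcpus_per_vm / vcpus_per_core)
    return max(cores_needed, MIN_CORE_LICENSES)


whole_cluster = 8 * 16 * PRICE_PER_CORE                        # 8 nodes x 16 cores
physical = physical_core_licenses(4, 4) * PRICE_PER_CORE       # 4 standalone servers
sub_cluster = host_core_licenses(4, 4, 4) * PRICE_PER_CORE     # 4 VMs, 4 vCPUs, 4:1

print(f"License the whole cluster:  ${whole_cluster:,}")
print(f"4 physical 4 core servers:  ${physical:,}")
print(f"One licensed host at 4:1:   ${sub_cluster:,}")
```

Raising `vcpus_per_core` toward 12 lets more VMs share the same licensed cores, which is where the more aggressive savings come from, provided performance holds up.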

image

The Lemonade (Enterprise Edition for All)

Being able to choose SQL features like failover clustering and availability groups based on the workload rather than the licensing cost is by far the best aspect of this approach.  Imagine you have 8 SQL Standard Edition servers and you can consolidate those on a virtual platform with 2 vCPUs per VM and a 4:1 vCPU:Core ratio.  You can get Enterprise features for those instances, and STILL save money on licensing.

image

The Blender (Performance Metrics)

For most folks, it’s not a matter of whether you can benefit from the approach, it’s how much you can benefit from it.  Everybody has some spare CPU on their physical servers, and a renewed server infrastructure would only create more spare CPU cycles.  So do some simple data gathering.  Look at how much CPU you have left over today.  Do some math to figure out how much spare CPU you’d have on a new compute infrastructure.  I’ve stepped through this exercise with numerous customers, and they’ve all been surprised at the extent to which this can not just save money, but add features and functionality to their environment.
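
The data gathering exercise can start as simply as the Python sketch below.  It takes each server’s core count and peak CPU utilization, adds up the cores that are actually busy at peak, and pads the total with headroom.  The 25% headroom, the second and third servers in the sample fleet, and the function name are all hypothetical; it also assumes new cores are no faster than old ones and that all peaks coincide, which errs on the conservative side.

```python
import math

MIN_CORE_LICENSES = 4   # minimum licenses per licensed server


def cores_after_consolidation(servers, headroom=0.25):
    """Estimate the cores needed to host every workload on one platform.

    servers:  list of (core_count, peak_cpu_fraction) pairs
    headroom: fraction of the new platform to leave idle at peak
    """
    busy_cores = sum(cores * peak for cores, peak in servers)
    needed = math.ceil(busy_cores / (1 - headroom))
    return max(needed, MIN_CORE_LICENSES)


# First entry is the 16 core, 31% peak server from this post; the rest are made up.
fleet = [(16, 0.31), (8, 0.20), (8, 0.45)]

licensed_today = sum(max(cores, MIN_CORE_LICENSES) for cores, _ in fleet)
needed = cores_after_consolidation(fleet)
print(f"Cores licensed if each server stays physical: {licensed_today}")
print(f"Cores needed after consolidation:             {needed}")
```

A real assessment would use sustained performance counters over a representative period, and would check memory and I/O as well as CPU.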

Option 4: Third Party Replication (Or: How Stella Got Her Single Copy Cluster Back)

This is the fifth and final post in a series about the various options to achieve HA and DR with Exchange 2010.  In the first, I broke the DAG into its basic components (Active Manager and DAG replication).  In the second, I gave a quick overview of Native DAG.  In the third, I covered a hybrid approach that combined DAG replication and Active Manager for local HA, and array/SAN based replication for remote site recovery.  In the fourth, I described an option that deploys Exchange in a standalone configuration and leverages a hypervisor to achieve local high availability.

This one will cover Exchange’s Third Party Replication.  This one definitely has a lot of cool factor in it.  It actually leverages DAG (in the form of Active Manager) with array or SAN-based replication technology.  You get all the automatic failover, live patching, Exchange-aware coolness of DAG, zero data loss, and you only have to deploy one copy of the data at each site.

image

Although synchronous operation is possible with both Options 2 and 3, this is the only option shown that combines synchronous replication with automatic failover.

This option will use a plugin from EMC to coordinate the replication engine with Active Manager. This would be either Replication Enabler for Exchange 2010 (free!), or AutoStart 5.3 with the Exchange 2010 module. This option can also use a virtualization platform like Hyper-V or VMware, but a hypervisor is not necessary to leverage the benefits of this option.

Here are the cost factors:

  • Storage: 2
  • Network: 2

Architects and managers will typically consider this option when:

  • Lossless, automatic failover is required
  • Minimal hardware footprint is desired
  • There is minimal latency between the sites
  • A hardware VSS protection scheme is available for rapid recovery in the event of database corruption
  • Live patching is required

Option 3: Virtualized Host Clustering

This is the fourth in a series of posts about the various options to achieve HA and DR with Exchange 2010.  In the first, I broke the DAG into its basic components (Active Manager and DAG replication).  In the second, I gave a quick overview of Native DAG.  In the third, I covered a hybrid approach that combined DAG replication and Active Manager for local HA, and array/SAN based replication for remote site recovery.

I call this option “Virtualized Host Clustering”, because like the Virtualized Local DAG option, this option leverages a hypervisor, but it leverages the HA capabilities of the hypervisor instead.  The Exchange mailbox role is deployed in a VM as a standalone server.  This is by far the least expensive option from all perspectives: acquisition, operation, complexity, footprint, and power.

image

As you can see, there are some clear cost benefits to this option. You are replicating only one copy of the database, and since the HA function is handled by the hypervisor, a second copy of the database is unnecessary. It’s also the most flexible option – the full suite of workload management tools provided by the hypervisor can be used, and we add a fourth potential replication engine – VPLEX.

I suppose it’s also worth noting that one does not necessarily need a virtualization layer to accomplish this. With the right operational recovery plan to replace the server and restore from backup, three nines (99.9% availability) could easily be achieved without any HA facility whatsoever (either at the application or hypervisor layer).

Here are the cost factors broken down:

  • Storage: 2
  • Network: 2 (SRDF, MirrorView), .5 (RecoverPoint)

Administrators and managers will typically choose this option when they want to:

  • Minimize complexity of the deployment
  • Use advanced virtualization features such as Live Migration/vMotion, DRS, etc.
  • Achieve consistency with other line of business applications
  • Control failover with scripts or tools like VMware Site Recovery Manager
  • Control their RPO from zero data loss to minutes
  • Have multiple recovery points at each site
  • Control bandwidth utilized by replication
  • Meet failover requirements not achievable with native Exchange’s Best Copy Selection

This solution is not without its drawbacks, however.  Here are a couple of things to consider:

  • This is the only option where the administrator does not have the ability to do non-disruptive patching.  However, one should consider that boot times are pretty quick on virtual machines.  It’s very possible to achieve four 9’s (99.99% availability) with this solution despite the lack of live patching, and reboots can be scheduled for non-peak hours.
  • Since only one copy of the data is available at each site, a rapid recovery mechanism for logical and physical failure modes is well advised.  This is usually achieved through hardware based snapshots or bookmarks at minimal cost.  It’s usually a good idea to have a rapid recovery scheme outside of the context of the application anyway, for a variety of reasons.
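
To put the four 9’s claim in perspective, here is a quick Python sketch of how many patch reboots fit inside a yearly downtime budget.  The four minute reboot is a hypothetical figure for illustration, not a measurement, and the sketch counts every reboot against availability even though many organizations exclude scheduled maintenance.

```python
MINUTES_PER_YEAR = 365 * 24 * 60


def annual_downtime_budget(availability_pct: float) -> float:
    """Minutes of downtime per year allowed at a given availability level."""
    return MINUTES_PER_YEAR * (1 - availability_pct / 100)


def reboots_within_budget(availability_pct: float, minutes_per_reboot: float) -> int:
    """Whole reboots that fit in the yearly budget if nothing else goes wrong."""
    return int(annual_downtime_budget(availability_pct) // minutes_per_reboot)


budget = annual_downtime_budget(99.99)
reboots = reboots_within_budget(99.99, minutes_per_reboot=4)   # 4 minutes is assumed
print(f"Four 9's allows about {budget:.1f} minutes per year")
print(f"That covers {reboots} reboots of 4 minutes each")
```

Under those assumptions a monthly patch reboot fits, but it consumes most of the budget, which is one more reason to pair this option with a rapid recovery mechanism.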

Option 2: Virtualized Local DAG

This is the third in a series of posts about the various options to achieve HA and DR with Exchange 2010.  In the first, I broke the DAG into its basic components (Active Manager and DAG replication).  In the second, I gave a quick overview of Native DAG.  In this post, I’m going to cover a pretty popular option for folks who’ve already made the decision to virtualize their Exchange environment.

I call it “Virtualized Local DAG” because it uses DAG replication and Active Manager for local HA, but third party technologies for remote replication and failover.

This configuration utilizes a hypervisor such as Hyper-V or VMware. A two member DAG is created, and then both copies are replicated from one site to the other. It’s important to note that because HA is dealt with by Exchange natively, any HA or workload migration features offered by the hypervisor (such as VMware HA, DRS, Live Migration and vMotion) should be disabled for the mailbox VMs (other Exchange roles can utilize those features).

image

As for the network cost, it’s going to be more expensive operationally – you’re replicating two copies of the databases and two copies of the logs, and the storage cost is identical to Native DAG Replication.  However, this can be mitigated through the use of interesting compression and data reduction techniques made available by products like RecoverPoint.

  • Storage: 4
  • Network: 4 (SRDF, MirrorView), 1 (RecoverPoint)

Why would anyone choose this option? Well first, the network cost isn’t much more than Native DAG replication when you need to make sure you can reseed your databases in a reasonable amount of time. And if you use a replication appliance like RecoverPoint, you can end up using even less bandwidth than Native DAG replication, owing to the very good compression and data reduction an appliance like that offers.

Basically, for customers who've decided to virtualize Exchange, and already have working replication technologies for other applications, this option offers a lot of flexibility, integration with their data center strategies, and costs nothing extra in terms of storage footprint when compared to traditional DAG.

Ultimately, people may choose this option when there’s a need for or concern about:

  • Uptime while patching
  • Simplified and coordinated failover of all Exchange roles and services
  • The ability of the WAN to absorb re-seeding operations that will occur more frequently with log-based replication
  • Better control of failover operations than what native Exchange’s Best Copy Selection can provide
  • Full site failover of multiple applications (such as provided by PowerShell scripting with Hyper-V or VMware Site Recovery Manager)
  • Consistency with other line of business applications
  • Controllable RPO
  • Synchronous replication (loss-less cross-site failover)
  • Multiple recovery points at each site
  • Controllable bandwidth utilization
  • Compliance and inclusion in a DR plan that includes other applications
  • Business requirements that can’t be met by DAG replication or Active Manager