What’s in your SLA?

People have been considering and comparing public (hosted) and private (on-premises) cloud solutions for some time in the messaging world, and at increasing rates for database and other application workloads.  I’m often surprised at how many people either don’t know the contents and implications of their service provider’s service level agreement (SLA), or fail to adjust the architecture of the private cloud solution to match before directly comparing cost.

Here are my five lessons for evaluating SaaS, PaaS, and IaaS provider SLAs:

Lesson 1: Make sure that what’s important to you is covered in the SLA

Lesson 2: Make sure that the availability guarantee is what you require of the service

Lesson 3: Evaluate the gap between a service outage’s cost to business and the financial relief from the provider

Lesson 4: Architect public and private clouds to the same levels of availability for cost estimate purposes

Lesson 5: Layer availability features onto private clouds for business requirement purposes

I’ll use the Office 365 SLA to explore this topic – not because I want to pick on Microsoft,  but because it’s a very typical SLA, and one of the services it offers (email) is so universal that it’s easy to translate the SLA’s components into the business value that you’re purchasing from them.

Defining availability

The math is simple.  It’s a 99.9% uptime guarantee with a periodicity of one month:

image

If that number falls below 99.9, then they have not met their guarantee.  For what it’s worth, during a 30-day month, the limit is about 43 minutes of downtime before they enter the penalty, or about 8.8 hours per year.
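
To sanity-check those figures, here is a minimal Python sketch of the downtime budget implied by an uptime guarantee.  The function names are made up for illustration, and the uptime formula is a simplified assumption based on elapsed minutes; the provider’s SLA may count downtime differently (per user, for example).

```python
def downtime_budget_minutes(uptime_pct, period_hours):
    """Minutes of downtime allowed in a period before the guarantee is missed."""
    return period_hours * 60 * (1 - uptime_pct / 100)


def measured_uptime_pct(period_hours, downtime_minutes):
    """Uptime percentage for a period, given total downtime in minutes."""
    total_minutes = period_hours * 60
    return (total_minutes - downtime_minutes) / total_minutes * 100


month_budget = downtime_budget_minutes(99.9, 30 * 24)   # 30-day month
year_budget = downtime_budget_minutes(99.9, 365 * 24)   # 365-day year

print(f"Monthly budget: {month_budget:.1f} minutes")
print(f"Yearly budget:  {year_budget / 60:.2f} hours")
print(f"Uptime after a full day down: {measured_uptime_pct(30 * 24, 24 * 60):.1f}%")
```

Swapping in 99.99 or 99.999 for the first argument shows how quickly the budget shrinks as you add nines.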

But what does “Downtime” mean?  Well, it’s stated clearly for each service.  This is the definition of downtime for Exchange Online:

“Any period of time when end users are unable to send or receive email with Outlook Web Access.”

Here’s what’s missing:

  • Data:  The mailbox can be completely empty of email the user has previously sent and received.  In fact the email can disappear as soon as they receive it.  As long as they can log in via OWA, the service is considered to be “up”.
  • Clients:  Fat Outlook, BlackBerry, and Exchange ActiveSync (iPhone/iPad/Winmopho, and most Android) clients are not covered in any way under the SLA.

Lesson 1: Make sure that what’s important to you is covered in the SLA

Lesson 2: Make sure that the availability guarantee is what you require of the service

Balancing SLA penalties with business impact

My Internet service is important to me.  When it’s down, I lose more productivity than the $1/day or so I spend on it.  Likewise, email services are probably worth more than the $8/month/user or so that you might pay your provider for it.  That doesn’t mean that you should spend more than you need for email services.  But it does mean that if you do suffer an extended or widespread outage, there will likely be a large gap between the productivity cost of the downtime and the financial relief you’ll see from the provider in the form of free services.

image

Callahan Auto Parts also offers a guarantee

I’ll put this in real numbers.  Let’s say I have a 200-person organization.  I might pay $1600/month for email services from a provider.  If my email is down for a day during the month, my organization experiences roughly 96.7% uptime for that month, and as a result, my organization is entitled to a service credit from the provider worth about $800, or half of that month’s bill.

image

The actual cost of my downtime will very likely exceed $800.  To calculate that cost we need the number of employees, the loaded cost per hour for the average employee, and the productivity cost of the loss of email services.  For our example of 200 employees, let’s imagine a $50/hour average loaded cost to business and a 25% loss of productivity when email is down:

200 employees x $50 cost per hour x .25 productivity loss x 8 hour outage = $20,000 of lost productivity

Subtract the $800 in free services the organization will receive the next month, and the organization’s liability is $19,200 for that outage.
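
If you want to plug in your own numbers, the calculation is easy to script.  This is a minimal Python sketch under the example’s assumptions (200 employees, $50/hour loaded cost, a 25% productivity loss, an 8-hour outage, and an $800 credit); the function and parameter names are hypothetical.

```python
def outage_gap(employees, loaded_cost_per_hour, productivity_loss,
               outage_hours, service_credit):
    """Return (lost productivity, gap left after the provider's credit)."""
    lost = employees * loaded_cost_per_hour * productivity_loss * outage_hours
    return lost, lost - service_credit


lost, gap = outage_gap(
    employees=200,
    loaded_cost_per_hour=50.0,
    productivity_loss=0.25,   # fraction of productivity lost while email is down
    outage_hours=8,
    service_credit=800.0,
)
print(f"Lost productivity: ${lost:,.0f}")
print(f"Gap after credit:  ${gap:,.0f}")
```

The productivity loss is the softest input here, so it is worth running the numbers with a low and a high estimate rather than a single value.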

Now how do you fill that gap?  I’m not entirely sure.  It could be just the risk of doing business – after all, the business would just absorb that cost if they were hosting email internally and suffered an outage.  If the risk and impact were large enough, I would probably seek to hedge against it – exploring options to bring services in house quickly, or even looking to an insurance company to defray the cost of outages – if Merv Hughes can insure his mustache for $370,000, then surely you can insure the availability of your IT services.  Regardless, it’s wise not to confuse a “financially backed guarantee” with actual insurance or assurance against outage.

File Photo:  What a $370k mustache may look like.  Strong.

Lesson 3: Evaluate the gap between a service outage’s cost to business and the financial relief from the provider

Comparing Apples to Oranges

image

See what I did there?

Doing a cost comparison between a public cloud designed to deliver 99.9% availability and a private cloud designed to provide 99.99% or 99.999% availability makes little sense, but I see people do it very frequently.  Usually it’s because the internal IT group’s mandate is to “make it as highly available as possible within the budget”.  So I’ll see a private cloud solution with redundancy at every level, capabilities to quickly recover from logical corruption, and automated failover between sites in the event of a regional failure, compared to a public cloud solution that provides nothing but a slim guarantee of 99.9% availability.  In this instance, it’s obvious why the public cloud provider is less expensive, even without factoring in efficiencies of scale.

To illustrate this, I usually refer to Maslow’s handy-dandy Hierarchy of Needs, customized for IT high availability.

image image

Single Site and Multi-site Hierarchies of Need

If I want to make an accurate comparison between a public cloud provider’s service and pricing and what I can do internally, I often have to strip out a lot of the services that are normally delivered internally.  Here are the steps:

  1. Architect for equivalence.  If I have a public cloud provider just offering 3 9’s and no option for site-to-site failover for my database services, I might just do a standalone database server.  Maybe I’d add a cheap rapid recovery solution (like snapshots or clones) to hedge against complete storage failure, and cluster at the hypervisor layer to provide some level of hardware redundancy.  If my cloud provider offers disaster recovery, I’d figure out their target RPO/RTO and insert some solution that matches that capability.
  2. Do a baseline price comparison.  Once I’ve got similar solutions to compare, I can compare price.  We’ll call this the price of entry.
  3. Add capabilities to the private cloud solution after the baseline.  I only start layering features that add availability and flexibility to the solution after I’ve obtained my baseline price.  Only then can I illustrate the true cost of those features, and compare them to the business benefits.

Lesson 4: Architect public and private clouds to the same levels of availability for cost estimate purposes

Lesson 5: Layer availability features onto private clouds for business requirement purposes

Is it time to say goodbye to Jetstress?

The short answer?  “Yes”  The long answer?  “Yyyyyesssss”

But first let me get this out of the way:  If you want to run Jetstress against any storage configuration I come up with, feel free.  I wouldn’t put it forward if I weren’t confident it could handle the workload. 

Prior to 2007, Jetstress REALLY mattered.  You had 500MB mailboxes that could easily generate 2-5 IO/s per mailbox.  Cached clients were rare – so storage latency was the primary driver of customer complaints.  Over the last ten years, Microsoft has put a lot of effort into making Exchange a much more storage-friendly application, and they’ve succeeded.  Today you have 0.1 IO/s per mailbox, and it’s spread over 2-5 GB.  Exchange is now Just Another Workload.  So why are we spending all this time and money (not to mention implementation delays) using an unwieldy purpose-built testing tool for something that’s Just Another Workload?

With Exchange 2010 and its very modest IO profile, I question the value of Jetstress as opposed to other testing tools.  The level of effort and sheer amount of time required to create the databases, replicate them, and then run the test are significant.  It can run to weeks for a reasonably sized deployment.  Yes, you get assurance that your storage rig is operating properly, but you can get that assurance from tools like Iometer, which can take seconds to set up, and mere hours to complete.

For all the effort and time involved in a Jetstress run, I just expect more.  I’d expect that my entire infrastructure would be validated.  I’d expect assurance that I have enough RAM and CPU in my virtual machines, that my network is up to snuff, access to my domain controllers and global catalog servers is sufficient… but I don’t get any of that with Jetstress.

If I’m going to put the kind of time and effort into my testing that Jetstress requires, I’m going to fire up an entire infrastructure, use Loadgen, and verify my entire configuration – not just my storage.  On the other hand, if I’m going to test my storage independently from my server and network, I would:

Roll my own Exchange IO test with Iometer in 30 minutes

  • Set up your storage on your production mailbox server
  • Determine the file sizes for your database and logs.  You can find that on the LUN Requirements tab of the Exchange Mailbox Storage Calculator

image

  • Using fsutil (a built-in Windows command-line tool), create files called iobw.tst sized according to the DB Size + Overhead and Log Size + Overhead.  For our example, we’re looking at 1595 GB database files and 34 GB log files.  This part is not strictly necessary, but I like it.  Creating a file called iobw.tst in the root directory of the target will prevent Iometer from creating thick files that occupy the entire LUN.  Note that fsutil takes the file length in bytes.
    • fsutil file createnew e:\iobw.tst 1712618209280 <-- simulated 1595 GB database file
    • fsutil file createnew f:\iobw.tst 36507222016 <-- simulated 34 GB log file
  • Download Iometer and the Exchange 2010 .icf file I’ve created.  Launch Iometer and open the .icf file.
    • If you’re using mount points instead of drive letters, download the latest Iometer release candidate for mount point support
  • Determine the target IO throughput for the databases and logs.  This can be determined from the Role requirements tab of the Exchange 2010 Storage calculator

image 

  • Modify the transfer delay in “Exchange 2010 DB Workload” Global Access specification so it will generate the desired number of IOs. The math is: 1000 ÷ target IO/s. Our example requires 30 IO/s per database, and 1000 ÷ 30=33.3.  So we’ll set it to 33.  The original in the icf file is 25, which would generate 40 IO/s.

image

  • Modify the transfer delay in the “Exchange 2010 Log Workload” Global Access Specification so it will generate the desired number of IOs.  Our example requires 7 IO/s per log LUN, and 1000 ÷ 7=142.9, so we’ll set it to 143.  The original in the .icf file is 100, which would generate about 10 IO/s.

image

  • Assign the DB Worker and BDM Worker to the database LUNs
  • Assign the Log Worker to the Log LUNs
  • Click the Green Flag and start.  Let it run for 5-10 minutes for a quick sanity check, and stop it.  Make sure it’s driving the IO you want at the latencies you expect, and you’re not gated by CPU or anything like that.
  • Start a perfmon data collection (perfcollect is good for this).
  • Modify the run time on the test tab to however long you’d like.  I recommend at least a few hours.

image

  • Take a nap
  • Go for a run
  • Eat some food
  • Watch some TV
  • When the test completes, open up your perfmon file, look at your disk latencies, make sure they’re steady, there were no spikes, and there were no aberrations in number of IO/s
    • If you’re an EMC customer and use perfcollect, zip up the perfcollect data collection and send it to your TC, or reseller TC, and ask for a WPA (miTrend) report on the server(s).  You’ll get a nicely formatted report with graphs and tables and twenty-seven 8×10 color glossy pictures with circles and arrows and a paragraph on the back of each one
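
The two bits of arithmetic in the steps above (the byte lengths that fsutil expects, and the transfer delay for a target IO rate) can be sketched in a few lines of Python.  The helper names are hypothetical, and the sketch assumes the calculator’s “GB” means binary gigabytes (1024³ bytes); adjust the constant if your sizing uses decimal units.

```python
GIB = 1024 ** 3  # assumption: the calculator's "GB" means binary gigabytes


def fsutil_length_bytes(size_gb):
    """Length argument for 'fsutil file createnew', which expects bytes."""
    return size_gb * GIB


def transfer_delay_ms(target_iops):
    """Transfer delay in ms for a target IO rate, using 1000 / target IO/s."""
    return round(1000 / target_iops)


# Example values from the walkthrough: 1595 GB database, 34 GB log,
# 30 IO/s per database and 7 IO/s per log LUN.
for path, size_gb in ((r"e:\iobw.tst", 1595), (r"f:\iobw.tst", 34)):
    print(f"fsutil file createnew {path} {fsutil_length_bytes(size_gb)}")

for name, iops in (("DB", 30), ("Log", 7)):
    print(f"{name} workload: transfer delay {transfer_delay_ms(iops)} ms")
```

The script only prints the commands and delay values; it does not touch any disks, so it is safe to run anywhere before you paste the output onto the mailbox server.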

Using this method, you can easily get in and out of testing mode within 36 total hours, and your own time will be less than an hour of setup and analysis.  That translates into weeks of time that your users can spend enjoying your cool new messaging infrastructure.

Enterprise Flash Drives: not just performance

I often encounter the misconception that EFDs are not beneficial unless you either need to reduce latencies below what traditional disks can get you, or are short-stroking your disks in order to maintain performance.  So I figured I’d go through the three general use cases I talk about with EFDs:

  1. Do more stuff
  2. Do the same stuff faster
  3. Do the same stuff, but with less gear

These are not mutually exclusive.  In most cases, EFDs allow people to do more stuff, faster, with less gear.  But your goals for EFDs will certainly flavor how to best deploy them.

Do more stuff (increase scale)

Let’s use an entirely contrived order processing system (like a trading desk). Let’s say this system can support 1,000 trades a minute. But during peak trading times, you’re getting more trades than you can process.

The business case for EFDs here would be that you can increase revenue by processing more orders.  Here’s an example where six EFDs supported seven times the transactions of six traditional fibre channel drives.  And this is a perfect example of how increased scale and reduced latency are not mutually exclusive – the response times on the EFD drives were 7 times lower than the response times on the spinning disks.

image

source

Note that this case and the next one both depend on how much your EFDs cost, and how much productivity improvement you’re going to see from their deployment.  That’s significantly different from the third case further down.

Do the same stuff faster (reduce latency):

Let’s take a large manual order-entry system, where user wait time for a query is 5 seconds, and users do about one query per minute. Let’s say the performance gate in this scenario is storage and it’s getting about 5-7 ms latency (about as good as you can get with a performance HDD at scale due to rotational latency).

The business case for EFDs here would be that employees in this role spend about 8% of their time waiting on the database. If you can reduce that to 1.6%, you realize massive productivity improvements.
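
The wait-time percentage is just wait time multiplied by query rate.  Here is a minimal Python sketch; the function name is hypothetical, and the 1-second “after” figure is an assumed wait time chosen to land near the reduced percentage above, not a measured result.

```python
def wait_fraction(wait_seconds, queries_per_minute):
    """Fraction of working time spent waiting on queries."""
    return wait_seconds * queries_per_minute / 60


before = wait_fraction(5, 1)   # 5-second waits, one query per minute
after = wait_fraction(1, 1)    # hypothetical 1-second waits on faster storage

print(f"Before: {before:.1%} of time waiting")
print(f"After:  {after:.1%} of time waiting")
```

Multiply the difference by headcount and loaded cost per hour to turn it into a dollar figure for the business case.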

Here’s some data: Note the graphs are not on the same scale.

image image

source

Do the same stuff, but with less gear (decrease footprint): 

Let’s say that you’ve got an application that’s fat and happy residing on ninety 10k performance HDDs.  You’re not short-stroking them too badly, but it’s still taking about $6,000 a year to power them, $2,000 a year to cool them, about 18U of rack space to store them, not to mention the maintenance cost associated with them.  But the fact is that in most environments, only part of the data set is actually frequently accessed.  An example of this might be an order processing or inventory system that may go back years or even decades.  The old data isn’t accessed very frequently at all, while the newer data is getting hammered.

Using either manual or automated tiering, you put the older data on cheaper, denser, and more power efficient nl-SAS drives, while the most frequently accessed data can reside on faster, less dense (but still more power efficient) EFDs.

Here’s some supporting data.  Now, the performance was slightly increased, but the bulk of the savings came in the form of footprint – acquisition, power, management, and so forth.  Had the goal been to increase scale or reduce response time, then the deployment method would have differed.

image   image

source

Best Practices for Windows Mount Points

Mount points – they’re understandably popular.  And although they’ve been around for quite a while, some people have questions around their implementation.  Yes, they’re really easy to set up, but you should follow some guidelines so you can leverage advanced technologies down the road.

For the uninitiated, here’s a quick overview of mount points:

Traditionally, a Windows volume (disk drive, LUN, etc.) had to be mounted to a drive letter for Windows to access it.  This results in a limitation of 24 hard drives that can be mounted to a system.  To overcome this limitation, Microsoft introduced the ability to mount a drive on an empty directory within any NTFS filesystem.  You can try it yourself next time you format a USB stick.  It’s quite neat – now you can have a virtually unlimited number of drives attached to your system.  You know, like UNIX.  ;-)  It also makes it easier to find stuff.

Here’s an example:  Let’s say I have a database with 2 data files, and 1 transaction log, and I want to keep them on separate drives for recovery and performance purposes.  Instead of mounting them to letters g:\, h:\ and i:\, I can do this:

  • g:\dbname\dbfile01
  • g:\dbname\dbfile02
  • g:\dbname\tlog

Where dbfile01, dbfile02, and tlog are all empty directories on which a drive is mounted (mount point).  The directory structure is clear to anyone who looks at it, and if I need to add more drives to my database I just create a directory in dbname, and put a drive on it.  I try not to put the mount points on my system drive (c:\), although I’m allowed to.  The reason is that the mount point looks just like a directory – you have to know that it’s a mount point.  When I have it on another drive letter, it’s clear that there’s another drive there. 

So what’s a nested mount point?  As the name implies, it’s where you have a mount point within a mount point.  In our case, the dbname directory could be a mount point.

Are nested mount points categorically a bad idea?  Absolutely not – in fact it’s a pretty common practice.  The key is that there shouldn’t be unique data higher in the directory structure than a mount point.  For example, this is a bad idea:

  • g:\dbname\dbfile
  • g:\dbname\dbfile\tlog

Where dbfile and tlog are both mount points.  First, it wouldn’t occur to another DBA that tlog is actually a different drive.  It also has implications for backup and recovery if you’re trying to leverage a volume-based backup and recovery system.  The reason for this becomes clear when you start playing through a recovery scenario.  Let’s say I want to restore my database, but keep my transaction log around for replay.  VSS and the SQL Virtual Device Interface (VDI) back up and restore at the disk level rather than the file level.  So when I restore, I’ve completely replaced not just g:\dbname\dbfile, but also g:\dbname\dbfile\tlog, where the transaction log lives.

I haven’t necessarily lost any data – I can just go find the tlog drive, mount it up, and continue my recovery.  But it clearly poses problems if I want to automate the process.

So this causes all sorts of confusion around whether nested mount points are supported.  In the case of Replication Manager, the answer is yes, nested mount points are supported, as long as the volumes in the application set are not nested within each other.  To give examples, this is supported:

  • g:\dbname\dbfile
  • g:\dbname\tlog

Where dbname, dbfile, and tlog are all mount points, but only dbfile and tlog are part of the application set being protected.

On the other hand, this is not supported:

  • g:\dbname\dbfile
  • g:\dbname\dbfile\tlog

Where dbfile and tlog are mount points and both part of the application set being protected.
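
If you have a long list of mount points, the “not nested within each other” rule is easy to check with a script before you build an application set.  This is a minimal Python sketch, not anything Replication Manager provides: the function name is hypothetical, and it only compares path strings rather than querying Windows for real mount points.

```python
from itertools import permutations
from pathlib import PureWindowsPath


def nested_pairs(mount_points):
    """Return (outer, inner) pairs where one listed mount point sits inside another."""
    paths = [PureWindowsPath(p) for p in mount_points]
    return [
        (str(outer), str(inner))
        for outer, inner in permutations(paths, 2)
        if outer in inner.parents
    ]


supported = [r"g:\dbname\dbfile", r"g:\dbname\tlog"]
unsupported = [r"g:\dbname\dbfile", r"g:\dbname\dbfile\tlog"]

print(nested_pairs(supported))     # no nesting inside the set
print(nested_pairs(unsupported))   # tlog is nested under dbfile
```

Pass in only the volumes that belong to the application set; a parent mount point such as g:\dbname that is not being protected should be left out of the list.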

If you have any lessons learned about mount points you think are worth sharing, post a comment below!

SharePoint Conference Season

Hi all,

While May, June and August are the era for the big platform events such as EMC World, TechEd, and VMworld, October is the season for my two major application events.

I am happy to announce that EMC are proud Gold Sponsors at:-

  • SharePoint Conference USA – Anaheim, California – Oct 3-6
  • SQL PASS Summit – Seattle, WA – Oct 11-14

EMC @ the SharePoint Conference

  • Large booth where key experts from the EMC Business Units will be able to describe to you how to make your life easier with SharePoint
  • Demonstrations, mini-lectures, and Q&As
  • Free give-aways.  Yes, again, like TechEd, we will have free t-shirts and on the final day many, many cash spot-prizes for wearing your EMC T-shirt

Two Sessions

Speaker(s):  James Baldwin, Eyal Sharon  (James & Eyal show)
Level: 200
Understand technical best practices to design and deploy a virtualized SharePoint that leverages FAST Search. Understand how to design a flexible and robust architecture that supports your advanced collaboration requirements. Understand how to architect a solution that addresses IT challenges for data growth, application availability and simplified management that also enables your users to find and leverage the right business information to make better decisions.

Speaker(s):  Matt Roberts, Nate Treloar
Level: 300
Demonstrate how to integrate external video metadata generation services with native SharePoint Search capabilities

Don’t forget Europe!

The European SharePoint Conference is taking place in Berlin, Germany   – October 17-20.

I will be there presenting the following session:-

Optimize, Store and Protect SharePoint 2010 Server…Best Practices     Wednesday 15:00 – Session W21

Learn about the critical best practices and considerations for optimizing and growing SharePoint farms, and storing user data efficiently and securely, while backing up TBs of data in minutes. RBS (Remote BLOB Storage) and virtualization are just two of the many techniques discussed in this session. Realize the considerations for providing fast, automated disaster recovery for the entire SharePoint environment through SAN-based technology.

EMC @ SQL PASS Summit

We will have something kinda special at the SQL PASS Summit.  Can’t say more.

But what I can say…

  • Large booth area in the Pavilion, with SQL experts from EMC including two heroes from our team, Tony Wu and Bruce Ye, travelling all the way from Shanghai.
  • Demos, booths, best practices and most importantly application-led conversations around:
    • SQL Server scalability – Infrastructure
    • Optimized Data Protection
    • High availability to where? Same SAN? Same site? Next door? Next state? Next country?  – All of the above
    • Something Flashy
  • Proven Solutions around high-speed SQL deployments, one of which is in build right now with Michael and David in our Cork labs.

Hope to see you there.

James.