Windows Server 2012 Storage Features

Windows Server 2012 introduces many new built-in features and capabilities that help IT organizations lower costs. Last week, Michael Otey posted a blog about the new Windows 2012 storage features, and I wanted to add some information about EMC capabilities and integrations with this new technology.

EMC is leading the Datacenter transformation and Big Data, and currently supports many of the new features and capabilities found in Windows 2012.

One example is data deduplication, a new feature that runs in the background, is designed to avoid affecting primary workloads, and includes the option to schedule the process for a volume or for files.

The best candidates for deduplication are file shares, software deployment shares and virtualization deployment shares such as VHD libraries. Applications with continuous data access, like Microsoft SQL Server and Exchange Server, are not good candidates for deduplication.

Deduplication can also be a great benefit for the backup and restore process. Microsoft provides a VSS writer for backing up and restoring deduplicated data.  For customers who run data deduplication on Windows Server 2012 with supported EMC arrays, Windows will use the deduplication processes directly from the array, extending the storage feature through to Windows.

The SMB 3.0 protocol is another feature that brings new capabilities, including performance and high availability improvements. This is great for new implementations and deployments, and in the near future we will see applications such as Hyper-V and SQL Server implemented on SMB file shares in our Datacenters where EMC storage is present.

We can also expect to see different DR solutions that leverage some of this functionality. For instance, reduced latency over the WAN is one of the performance improvements in this new SMB version.

Overall, the new features of the Windows 2012 SMB protocol include:

  • SMB Transparent Failover
  • SMB Scale Out
  • SMB Multichannel
  • SMB Direct
  • SMB Encryption
  • VSS for SMB file shares

These new features and capabilities provide flexibility and reliability to Windows Server 2012 and Hyper-V deployments with SMB-based storage on EMC storage arrays.  EMC fully supports SMB 3.0 within its unified storage platforms, such as EMC VNX.

For more information about Windows 2012 SMB, please visit http://support.microsoft.com/kb/2709568

Thin Provisioning technology provides efficiency for storage provisioning and business applications. This new feature of Windows 2012 is integrated with EMC arrays, so that EMC virtual provisioning gives storage administrators flexibility in deploying storage to Windows Server 2012 and Hyper-V hosts.

Windows Server 2012 can detect thin-provisioned storage on EMC storage arrays and reclaim unused space, including when Windows Server 2012 is deployed within a Hyper-V virtual machine.

Offloaded Data Transfer (ODX) is a feature that lets you move data more efficiently within a shared storage array. This reduces CPU and network resource consumption on the physical host and increases data movement speed. This is great functionality in virtual environments when we have to move virtual machines between different locations.

Windows Server 2012 and EMC intelligent arrays make ODX-enabled file functions transparent to applications, which means that Windows Server 2012 and Hyper-V hosts that use EMC storage arrays automatically optimize file copy and move functions without administrator intervention.

More information about ODX is available at http://msdn.microsoft.com/en-us/library/windows/desktop/hh848056%28v=vs.85%29.aspx

EMC is excited to support these and many more new features found in Windows Server 2012!

For more information, be sure to visit the Everything Microsoft at EMC Community.

What's in your SLA?

People have been considering and comparing public (hosted) and private (on-premises) cloud solutions for some time in the messaging world, and at increasing rates for database and other application workloads.  I’m often surprised at how many people either don’t know the contents and implications of their service provider’s service level agreement (SLA), or fail to adjust the architecture of their private cloud solution before directly comparing cost.

Here are my five lessons for evaluating SaaS, PaaS, and IaaS provider SLAs:

Lesson 1: Make sure that what’s important to you is covered in the SLA

Lesson 2: Make sure that the availability guarantee is what you require of the service

Lesson 3: Evaluate the gap between a service outage’s cost to business and the financial relief from the provider

Lesson 4: Architect public and private clouds to the same levels of availability for cost estimate purposes

Lesson 5: Layer availability features onto private clouds for business requirement purposes

I’ll use the Office 365 SLA to explore this topic – not because I want to pick on Microsoft,  but because it’s a very typical SLA, and one of the services it offers (email) is so universal that it’s easy to translate the SLA’s components into the business value that you’re purchasing from them.

Defining availability

The math is simple.  It’s a 99.9% uptime guarantee with a periodicity of one month:

image

If that number falls below 99.9, then they have not met their guarantee.  For what it’s worth, during a 30 day month, the limit will be about 43 minutes of downtime before they enter the penalty, or about 8.8 hours per year.
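Here is a rough sketch of that arithmetic in Python, taking a 99.9% guarantee (the level that works out to roughly three-quarters of an hour of downtime in a 30 day month). The function name is made up for illustration and is not part of any provider's SLA:

```python
def allowed_downtime_minutes(uptime_percent: float, period_days: float) -> float:
    """Minutes of downtime a service can accrue in the period
    before it drops below the guaranteed uptime percentage."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (100.0 - uptime_percent) / 100.0


monthly = allowed_downtime_minutes(99.9, 30)   # 30 day month
yearly = allowed_downtime_minutes(99.9, 365)   # full year

print(f"{monthly:.1f} minutes per month")      # 43.2 minutes per month
print(f"{yearly / 60:.2f} hours per year")     # 8.76 hours per year
```

Plugging in 99.99 or 99.999 instead shows how quickly the allowance shrinks with each added nine.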

But what does “Downtime” mean?  Well, it’s stated clearly for each service.  This is the definition of downtime for Exchange Online:

“Any period of time when end users are unable to send or receive email with Outlook Web Access.”

Here’s what’s missing:

  • Data:  The mailbox can be completely empty of email the user has previously sent and received.  In fact the email can disappear as soon as they receive it.  As long as they can log in via OWA, the service is considered to be “up”.
  • Clients:  Fat Outlook, BlackBerry, and Exchange ActiveSync (iPhone/iPad/Windows Phone, and most Android) clients are not covered in any way under the SLA.

Lesson 1: Make sure that what’s important to you is covered in the SLA

Lesson 2: Make sure that the availability guarantee is what you require of the service

Balancing SLA penalties with business impact

My Internet service is important to me.  When it’s down, I lose more productivity than the $1/day or so I spend on it.  Likewise, email services are probably worth more than the $8/month/user or so that you might pay your provider for it.  That doesn’t mean that you should spend more than you need for email services.  But it does mean that if you do suffer an extended or widespread outage, there will likely be a large gap between the productivity cost of the downtime and the financial relief you’ll see from the provider in the form of free services.

image

Callahan Auto Parts also offers a guarantee

I’ll put this in real numbers.  Let’s say I have a 200 person organization.  I might pay $1600/month for email services from a provider.  If my email is down for a day during the month, my organization experiences about 96% uptime for that month, and as a result, my organization is entitled to free services from the provider worth about $800, or half of that month’s bill.

image

The actual cost of my downtime will very likely exceed $800.  To calculate that cost we need the number of employees, the loaded cost per hour for the average employee, and the productivity cost of the loss of email services.  For our example of 200 employees, let’s imagine a $50/hour average loaded cost to business and a 25% loss of productivity when email is down:

200 employees x $50 cost per hour x .25 productivity loss x 8 hour outage = $20,000 of lost productivity

Subtract the $800 in free services the organization will receive the next month, and the organization’s liability is $19,200 for that outage.
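If you want to run your own numbers, here is a minimal sketch of the same calculation. It assumes the 25% productivity loss and one 8 hour business day of outage from the example; the function and variable names are only illustrative:

```python
def outage_productivity_cost(employees: int,
                             loaded_cost_per_hour: float,
                             productivity_loss: float,
                             outage_hours: float) -> float:
    """Estimated cost of lost productivity during an outage.

    productivity_loss is the fraction of work that does NOT get done
    while the service is down (0.25 means a 25% loss)."""
    return employees * loaded_cost_per_hour * productivity_loss * outage_hours


cost = outage_productivity_cost(200, 50.0, 0.25, 8)
service_credit = 800.0          # free services received from the provider
gap = cost - service_credit     # what the business absorbs

print(f"Lost productivity: ${cost:,.0f}")   # Lost productivity: $20,000
print(f"Uncovered gap:     ${gap:,.0f}")    # Uncovered gap:     $19,200
```

The productivity loss is the input with the most uncertainty, so it is worth trying a range of values before drawing conclusions.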

Now how do you fill that gap?  I’m not entirely sure.  It could be just the risk of doing business – after all, the business would just absorb that cost if they were hosting email internally and suffered an outage.  If the risk and impact were large enough, I would probably seek to hedge against it – exploring options to bring services in house quickly, or even looking to an insurance company to defray the cost of outages – if Merv Hughes can insure his mustache for $370,000, then surely you can insure the availability of your IT services.  Regardless, it’s wise not to confuse a “financially backed guarantee” with actual insurance or assurance against outage.

File Photo:  What a $370k mustache may look like.  Strong.

Lesson 3: Evaluate the gap between a service outage’s cost to business and the financial relief from the provider

Comparing Apples to Oranges

image

See what I did there?

Doing a cost comparison between public cloud designed to deliver 99.9% availability and a private cloud designed to provide 99.99% or 99.999% availability makes little sense, but I see people do it very frequently.  Usually it’s because the internal IT group’s mandate is to “make it as highly available as possible within the budget”.  So I’ll see a private cloud solution with redundancy at every level, capabilities to quickly recover from logical corruption, and automated failover between sites in the event of a regional failure, compared to a public cloud solution that provides nothing but a slim guarantee of 99.9% availability.  In this instance, it’s obvious why the public cloud provider is less expensive, even without factoring in efficiencies of scale.

To illustrate this, I usually refer to Maslow’s handy-dandy Hierarchy of Needs, customized for IT high availability.

image image

Single Site and Multi-site Hierarchies of Need

If I want to make an accurate comparison between a public cloud provider’s service and pricing and what I can do internally, I often have to strip out a lot of the services that are normally delivered internally.  Here are the steps:

  1. Architect for equivalence.  If I have a public cloud provider just offering 3 9’s and no option for site to site failover, for my database services, I might just do a standalone database server.  Maybe I’d add a cheap rapid recovery solution (like snapshots or clones) to hedge against complete storage failure and cluster at the hypervisor layer to provide some level of hardware redundancy.  If my cloud provider offers disaster recovery, I’d figure out what their target RPO/RTO is and insert some solution that matches that capability.
  2. Do a baseline price comparison.  Once I’ve got similar solutions to compare, I can compare price.  We’ll call this the price of entry.
  3. Add capabilities to the private cloud solution after the baseline.  I only start layering features that add availability and flexibility to the solution after I’ve obtained my baseline price.  Only then can I illustrate the true cost of those features, and compare them to the business benefits.

Lesson 4: Architect public and private clouds to the same levels of availability for cost estimate purposes

Lesson 5: Layer availability features onto private clouds for business requirement purposes

Perfcollect video series

Perfcollect is really easy to use, but generates a bunch of interesting data which can be bewildering to a new user of the tool.  So I’ve started a little video series on the topic, which perhaps will blossom into a series on Windows performance analysis and triage in general.

I’ve got three up now, all linked off the perfcollect page.  The videos are hosted on EMC’s Everything Microsoft at EMC site – you don’t need to have a login to watch the videos.

Enterprise Flash Drives: not just performance

I often encounter the misconception that EFDs are not beneficial unless you need to either reduce latencies below what traditional disks can get you, or you’re short-stroking your disk in order to maintain performance.  So I figured I’d go through the three general use cases I talk about with EFDs:

  1. Do more stuff
  2. Do the same stuff faster
  3. Do the same stuff, but with less gear

These are not mutually exclusive.  In most cases, EFDs allow people to do more stuff, faster, with less gear.  But your goals for EFDs will certainly flavor how to best deploy them.

Do more stuff (increase scale)

Let’s use an entirely contrived order processing system (like a trading desk). Let’s say this system can support 1,000 trades a minute. But during peak trading times, you’re getting more trades than you can process.

The business case for EFDs here would be that you can increase revenue by processing more orders.  Here’s an example where six EFDs supported seven times the transactions of six traditional Fibre Channel drives.  And this is a perfect example of how increased scale and reduced latency are not mutually exclusive – the response times on the EFDs were 7 times lower than the response times on the spinning disks.

image

source

Note that both of the cases thus far really depend on how much your EFDs cost, and how much productivity improvement you’re going to see from their deployment.  That’s significantly different than this one:

Do the same stuff faster (reduce latency):

Let’s take a large manual order-entry system, where user wait time for a query is 5 seconds, and users do about one query per minute. Let’s say the performance gate in this scenario is storage and it’s getting about 5-7 ms latency (about as good as you can get with a performance HDD at scale due to rotational latency).

The business case for EFDs here would be that employees in this role spend about 8% of their time waiting on the database. If you can reduce that to 1.6%, you realize massive productivity improvements.
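The percentages fall straight out of the wait time and the query rate. Here is a quick sketch; the 1 second wait after the move to EFDs is my own assumption, picked because it lands close to the 1.6% figure, and the function name is illustrative:

```python
def wait_fraction(wait_seconds: float, queries_per_minute: float) -> float:
    """Fraction of working time a user spends waiting on queries."""
    return wait_seconds * queries_per_minute / 60.0


before = wait_fraction(5, 1)   # 5 second wait, one query per minute
after = wait_fraction(1, 1)    # assumed 1 second wait on EFDs

# Minutes handed back to each user over an 8 hour shift
minutes_saved = (before - after) * 8 * 60

print(f"before: {before:.1%}, after: {after:.1%}")
print(f"{minutes_saved:.0f} minutes saved per user per shift")
```

Multiply the minutes saved by headcount and loaded cost per hour and you have the productivity side of the business case.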

Here’s some data.  Note that the graphs are not on the same scale.

image image

source

Do the same stuff, but with less gear (decrease footprint): 

Let’s say that you’ve got an application that’s fat and happy residing on ninety 10k performance HDDs.  You’re not short-stroking them too badly, but it’s still taking about $6,000 a year to power them, $2,000 a year to cool them, about 18U of rack space to store them, not to mention the maintenance cost associated with them.  But the fact is that in most environments, only part of the data set is actually frequently accessed.  An example of this might be an order processing or inventory system that may go back years or even decades.  The old data isn’t accessed very frequently at all, while the newer data is getting hammered.

Using either manual or automated tiering, you put the older data on cheaper, denser, and more power efficient NL-SAS drives, while the most frequently accessed data can reside on faster, less dense (but still more power efficient) EFDs.

Here’s some supporting data.  Now, the performance was slightly increased, but the bulk of the savings came in the form of footprint – acquisition, power, management, and so forth.  Had the goal been to increase scale or reduce response time, then the deployment method would have differed.

image   image

source

SQL Server Logical Disk Layout

A few weeks ago, I was talking to a meetup of SQL Server folks here in the Hartford area about storage and SQL Server. We were going through the ideal storage layout of SQL server, and someone in the group summarized it as "so I need a minimum of five disks (LUNs) for a SQL Server?" I'd never really thought of it in those terms, so let's get down to the thinking:

First, this has mostly to do with logical layout of data. Aside from separating logs and databases on different physical media, it has nothing to do with physical layout of data. Even if you have only a couple of disks to allocate to the server, you can still logically partition it so that it's easy to evaluate performance later on.

As with almost everything else related to SQL Server, it's going to depend on the workload. The key aspects are:

  • Performance sensitivity – This is not necessarily a performance intensive workload – it could be nearly idle. The key question is "if performance of this application suffers, which business processes will suffer, and how many users will suffer?" If there's a chance that someone important is going to call you up and ask you to fix a performance problem with this SQL Server, it's performance sensitive. Even if you're resource constrained, it would help to have things laid out so you can evaluate performance without reconfiguring the database (which in reality involves nearly as much effort as migrating to entirely new storage).
  • Recoverability – This would be dictated by whether you'll need to be able to perform up to the minute recovery of the database. If you do, you'll need to consider the physical layout of the data so that the failure of a physical disk or RAID group doesn't take out the database and its transaction logs at the same time.

Here's a quick version of the rules:

  1. If your database is not performance sensitive and you do not need the ability to recover data up to the last transaction, then you only need one or two:
    1. OS/Apps
    2. Databases and transaction logs
  2. If the data in your database is critical enough that you'll want to be able to recover up to the latest transaction, you'll need three.
    1. OS
    2. Databases
    3. Transaction logs (you also need to make sure that these are physically separated from the databases, so that you don't lose data from both of them at the same time).
  3. If your database is performance sensitive (regardless of how critical the data is), you need a minimum of five, and possibly more:
    1. OS
    2. System databases (other than tempdb)
    3. User databases
    4. User database transaction logs
    5. One for tempdb and its logs
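To make the decision logic explicit, here is a small sketch of those three rules as a function. It is only an illustration of the rules of thumb above; the function name and parameters are hypothetical, and it returns the minimum layout, not a final design:

```python
from typing import List


def recommended_luns(performance_sensitive: bool,
                     needs_point_in_time_recovery: bool) -> List[str]:
    """Return the minimum logical disk (LUN) layout for a SQL Server."""
    if performance_sensitive:
        # Rule 3: separate everything so each piece can be measured alone
        return [
            "OS",
            "System databases (other than tempdb)",
            "User databases",
            "User database transaction logs",
            "tempdb and its logs",
        ]
    if needs_point_in_time_recovery:
        # Rule 2: logs must survive the loss of the database disks
        return [
            "OS",
            "Databases",
            "Transaction logs (physically separate from databases)",
        ]
    # Rule 1: neither concern applies
    return ["OS/Apps", "Databases and transaction logs"]


for lun in recommended_luns(performance_sensitive=True,
                            needs_point_in_time_recovery=False):
    print(lun)
```

Performance sensitivity is checked first because the five-LUN layout already keeps transaction logs on their own disk; the physical separation of logs from databases still has to be handled on the storage side.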

The principle behind this separation is that you can evaluate the performance of these components independently of each other. If I have my user databases mixed with tempdb, and I'm having performance issues, I have no real way of telling which database is generating the IO, and which database is starving. All I know is that performance stinks equally on both databases. More importantly for transaction logs, you want to avoid contention between the IO that log flushes create and normal user IO, and you also want to make sure that transaction log disks get a write response time of well under 10 milliseconds.

These five LUNs are a starting point – I often see systems with a dozen or even a couple dozen LUNs. What's the reason for adding LUNs beyond these five?

  • Additional segregation of unrelated workloads. I can put two different databases on two different LUNs.
  • Segregation of related, but different workloads. For example, I could put my non-clustered indexes on a different disk than my data files.
  • Simply adding queues – each disk gets an additional queue.
  • Increase the granularity of restore options if you're using hardware-based snapshots, clones, or CDP. In this case, the management boundary is the LUN itself, so if you want to do rapid restore of a failed or corrupt data file using this method, then you would restore all the databases on that LUN together (whether they are corrupt or not). There are ways to do selective restore in this case, but it can take longer to perform the restore.

Remember that multiple LUNs can actually share the same physical disks, so your workloads can still step on each other if that's the case. However, it makes the troubleshooting process much easier. And if you're using the right technology, fixing the problems can be completely seamless.