Mobilize SharePoint with Syncplicity

End users love file sync and share utilities – they help them do their jobs efficiently.  Enterprises love their Enterprise Content Management (ECM) infrastructure – it enables compliance with any number of data governance and eDiscovery regulations and policies.  Wouldn’t it be great if you didn’t have to choose?  Wouldn’t it be great if you could give your end users the functionality they want while retaining the ECM functionality you’ve already invested in?

Some folks will just invest in a new service to provide that sync and share functionality, then figure out how to apply their data governance policies.  This approach has a couple of pitfalls:

  • File collaboration is just one of the applications that ECMs like SharePoint support, and integration between file collaboration and other applications provides significant value.
  • Enterprises have real data governance needs, whether it’s DLP, legal compliance, eDiscovery, HIPAA or any number of other requirements.  Many enterprises have built their data governance models around ECMs they already have in place.  Re-establishing data governance capabilities around a specific service can be difficult (or impossible), and introduces path dependency on that service.  Nobody likes lock-in.

With Syncplicity, you don’t have to choose.  The Syncplicity SharePoint Connector allows you to seamlessly extend mobile access to your ECM systems like SharePoint and Documentum.  You can also choose to augment those systems with native sync and share functionality provided by the core Syncplicity service.  It’s all part of Syncplicity’s “no silos” approach.

A picture is worth a thousand words:

 

[Image: Sharepoint-account-in-device.png – a SharePoint account in the Syncplicity mobile client]

 

See what we did there?  You can extend access to SharePoint to iOS, Android, and Windows Phone users.  There’s a fantastic (but brief) writeup on the functionality here.

The (somewhat) surprising economics of Office 365

Wikibon recently published some results of a study analyzing the economics of Unified Communications as a Service, with Office 365 as a case study.

Here are a few things that jumped out at me:

On-premises deployments cost significantly less than Office 365.

This might seem counterintuitive for people who look at just the subscription costs.  It’s important to remember that cloud service offerings reduce, but do not eliminate, the need for on-premises services.  You’ll still need front-line helpdesk services and even some level of in-house Exchange expertise to service a sizable user population.  Moving to the cloud also increases costs in some areas, like networking, and the upfront cost of migration can be significantly higher than simply moving to a new version of Exchange.  Wikibon’s post includes a detailed cost breakdown.

The “cost crossover” point is 300 users

The notion that “owning” becomes more economical than “renting” once you use enough of a resource is intuitive.  However, I’ve always thought that the crossover point was somewhere around 2,000 users.  Wikibon’s analysis shows it happens far earlier, at around 300 users.

You should upgrade to the 2013 suite now

The UC suite isn’t just Exchange – it’s SharePoint, Lync, and the rest of it.  There are marked productivity improvements over the 2010 suite coming from enhanced communication and collaboration options, as well as much improved mobile support.  It’s easy to imagine how features like one-click video calls, mobile Lync clients, and enhanced mobile support in SharePoint increase productivity.  And 5% may not sound like a lot, but it can mean millions to an organization.  An organization with a $100/hour loaded employee cost will see roughly $1,200/month per employee in productivity improvements, at an outside cost of $30/month.  Even if the UC 2013 suite delivers a fraction of that, it’s a very quick ROI.

DAS still doesn’t make sense for the UC 2013 suite

“DAS vs SAN” debates are just so, like, 2010.  Most customers I talk to now deploy applications on some sort of converged system – whether it’s a reference architecture like VSPEX or an engineered system like Vblock.  But even if you’re going to roll your own infrastructure, it helps the bottom line to think about the entire ecosystem before you deploy.  Just like SaaS doesn’t eliminate the need for on-premises equipment and expertise, the fact that you can put Exchange on direct-attached storage doesn’t negate the benefits of shared storage.  Nor do Exchange’s features extend throughout the entire UC suite.  It’s good to ponder the following when planning the hardware infrastructure:

  • Footprint and cost reduction from features like thin provisioning
  • Agility from being able to scale performance and capacity independently
  • Ability to integrate the disaster recovery and backup policies and capabilities of the entire UC stack
  • Oh, and it’s less expensive

Final thoughts

Wikibon’s blog post is a pretty interesting read.  Of course, even if the economics don’t work out, you may still have good reasons to move to a public cloud – perhaps you feel that service levels will be better with an external provider.  Or perhaps you just don’t want to be in the business of providing unified communications services to your users.  If that’s the case, be aware that there are other offerings than Office 365.  EMC and its partners provide cost-competitive, feature-rich hosted unified communication solutions for environments of any size.

Write Order Fidelity, Consistency Groups, and Databases. Oh my.

WARNING:  GEEK BLOG POST

When you talk to a storage vendor about asynchronous block replication, your first two questions should be:

  1. Do you preserve write order fidelity within a single LUN?
  2. Can you preserve write order fidelity between multiple LUNs?

Consistency Group (CG) technology is cool.  When you put all your databases and associated logs in a CG, you can replicate asynchronously and still have your database come up at the DR site every time.  When you don’t have it, you need to enforce consistency by putting the database into a state in which it can be backed up while still mounted.  With SQL, this means using a VDI or VSS requestor to enter that state, taking a snapshot with a hardware provider, and finally replicating that snapshot.

It’s not that snap-and-replicate is a bad thing – people have been doing it for years.  But it does limit your achievable recovery point objective (RPO) to roughly double the interval at which you can comfortably quiesce and replicate your database: the snapshot in flight when disaster strikes is unusable, so you fall back to the previous one.  If you can quiesce and replicate every four hours, your worst-case RPO is about eight.  It also limits your achievable recovery time objective (RTO), because extra steps are often needed to recover your database.

This is all tribal knowledge amongst storage and database folks.  But people often don’t know why – either because storage and database administrators are mortal enemies, or because they speak different languages.

So here’s why:

Let’s start with a concept known as “Write Order Fidelity” (WOF).  When applied to asynchronous remote replication technology, it means that writes at the disaster recovery (DR) site are applied in the same order as they were applied at the production site.

[Image: Async replication without WOF]

In the instance above, when you try to attach that database at the DR site, it will be wholly inconsistent and may not attach.  Worse, it could attach successfully and leave you running on a corrupt database.

WOF preservation looks like this:

[Image: Async replication with WOF]

In this case, you’re replicating asynchronously, but the writes are applied at the DR site in the same order they were applied at the production site.  So at any given time, the data at the DR site looks as if the server at the production site had simply stopped working.  There’s data loss, and transactions may need to be rolled back, but that’s an automatic, normal operation for a database like SQL, the JET database backing Exchange, or Oracle.  In fact, that’s what SQL does every time there’s an unplanned cluster failover.
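If you prefer code to animations, here’s a toy sketch in Python (entirely hypothetical writes – not any vendor’s replication engine) of why WOF matters.  With WOF, a mid-stream link failure leaves the DR copy as a clean prefix of the write stream; without it, the DR copy can hold a commit without the writes that preceded it:

    import random

    # A hypothetical three-write transaction, as (block, data) pairs.
    writes = [(1, "begin txn"), (2, "update row"), (3, "commit txn")]

    def replicate(writes, preserve_wof, cutoff):
        # Apply writes to the DR copy; the link fails after `cutoff` writes.
        stream = list(writes) if preserve_wof else random.sample(writes, len(writes))
        dr = {}
        for block, data in stream[:cutoff]:
            dr[block] = data
        return dr

    # With WOF, losing the tail still leaves a crash-consistent prefix:
    print(replicate(writes, preserve_wof=True, cutoff=2))   # blocks 1 and 2
    # Without WOF, the commit can arrive before the writes it commits:
    print(replicate(writes, preserve_wof=False, cutoff=2))  # e.g. blocks 3 and 1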

But why don’t we need WOF with synchronous replication?  That’s an interesting question.  First, WOF is implied with true synchronous replication.  Second, true synchronous replication commits each write at the DR site before it is considered complete at the production site:

[Image: Sync replication – WOF is always enforced]

In this case, the DR site is always completely in sync with the production site: writes must be acknowledged at the DR site before they are considered “applied” at the production site.  Of course, this is the optimal situation – replicated data with no potential for data loss.  However, it comes at a cost: any network latency you have is added to the storage latency.  So, in effect, the distance you can replicate is limited by the storage latency your application can tolerate.  For those of you keeping notes, you generally want to keep write latencies to your transaction logs under 10 ms, which makes for a pretty limited distance.
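Here’s a back-of-the-envelope sketch (Python, with assumed numbers) of why that 10 ms budget limits distance.  Light in fiber covers roughly 5 microseconds per kilometer one way, so every kilometer costs about 10 microseconds of round trip – before you count switches, routers, and protocol overhead:

    latency_budget_ms = 10.0   # max tolerable transaction log write latency
    array_latency_ms  = 2.0    # assumed local array service time
    rtt_ms_per_km     = 0.01   # ~10 microseconds of round trip per km of fiber

    max_km = (latency_budget_ms - array_latency_ms) / rtt_ms_per_km
    print(f"Best-case replication distance: ~{max_km:.0f} km")   # ~800 km

Real deployments come in well under that best-case figure once equipment latency and retransmits enter the picture.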

So that’s the reason behind the first question you should ask your storage vendor.  Now, what about the second – write order fidelity among multiple LUNs?

It turns out that most people follow their database vendor’s advice and put their database and transaction logs on separate LUNs.  The reasons are somewhat outside the scope of this post, but in general it’s to ensure recoverability in the event of a lost LUN.  It’s also for performance: your transaction log is sensitive only to write latency and is always sequential in nature, whereas your database is more sensitive to read latency and can be random, sequential, or anything in between.

The function of preserving write order fidelity across multiple LUNs is generally performed by a “Consistency Group” (CG) in EMC parlance.  Usually other vendors will use that term – I don’t believe it’s trademarked.  CG technology is integrated into RecoverPoint, SRDF/A and even MirrorView/A.  Remember, it’s not needed with any true synchronous technology.  But most people have asynchronous replication requirements.

And Consistency Groups are really, really important for databases

This has to do with the ACID properties of the databases in wide use today (if you want a brief but cool read on the history of the modern database, wander on over here).  Specifically, it has to do with the atomicity part of the ACID properties: if part of a database transaction fails – no matter the reason – the entire transaction gets rolled back.

That’s one of the big reasons the transaction log even exists.  Lots of storage people think the log is there only for rolling forward in the event of a failure.  Not true.  It can also be used to roll back in the event that a transaction fails.  In fact, storage failure is not the only reason a transaction might fail – go look at the ACID properties for some of the others.

So anyway, with atomicity in mind, consider the following scenario:  You’re replicating asynchronously, and you’ve verified that your storage vendor honors write order fidelity within a single LUN.  However, write order fidelity is not honored among multiple LUNs, and you’ve followed best practices in separating your databases and logs.  A failure scenario might look like this:

[Image: Multiple LUNs without consistency group technology]

In this case, the database is slightly “ahead” of the transaction log.  The RDBMS (like SQL or Oracle) would say, “well I’ve got only part of a transaction here.  No problem.  I’ll roll it back.  I’ll refer to my transaction log to see how I might achieve exactly that”.

Keep in mind I don’t write software for a living.  I’m paraphrasing.

However, when it refers to the transaction log, it doesn’t find anything relevant to rolling back the transaction.  In my snazzy animation, it needs data from blocks six and nine to roll back the transaction.  The RDBMS promptly gives up and goes for a latte, leaving you to restore from a backup.

Enter a consistency group.  As I’ve mentioned, this technology enforces write order fidelity across multiple disks.  So you can have your cake and eat it too.

[Image: Multiple LUNs with consistency group technology]

In this case we see a failure happen mid-transaction.  Of course, this can happen at any time, even without any sort of remote replication.  However, if the database and transaction log are in the same consistency group, the transaction log will always have the data necessary to automatically roll back the transaction and begin processing.
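Here’s the same idea as a toy Python sketch (hypothetical rows and undo records – not a real RDBMS).  Recovery succeeds whenever every replicated database change has its undo record on the replicated log, which a consistency group guarantees by stopping both LUNs at the same point in time:

    db_lun  = [("row1", "new value")]   # data page write, lands on the DB LUN
    log_lun = [("row1", "old value")]   # undo record, lands on the log LUN

    def recover(db_replicated, log_replicated):
        # Roll back any change whose undo record made it to the DR site.
        undo = dict(log_replicated)
        for row, _ in db_replicated:
            if row not in undo:
                return "can't roll back -- restore from backup"
        return "rolled back, open for business"

    # In a consistency group, both LUNs stop at the same moment in time:
    print(recover(db_lun, log_lun))   # consistent: transaction rolls back
    print(recover([], []))            # consistent: nothing to undo
    # Without one, the DB LUN can run "ahead" of the log LUN:
    print(recover(db_lun, []))        # time to find that backup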

That’s about all there is to it.  When I call this “crash consistency”, the emphasis is on “consistency”.  As long as all the data associated with the database (logs and DB file) is consistent, the RDBMS will be able to recover.  It’s a normal, regular, everyday operation that happens whenever a fault is sensed within a SQL Server resource group.  Emphasizing “crash” as in “car wreck” is misleading.

Lastly, it’s only a matter of time before someone at Pixar notices my awesome animations and calls me with some sort of really cool job offer.  So I’m not sure how long I’ll be around here.

What’s in your SLA?

People have been considering and comparing public (hosted) and private (on-premises) cloud solutions for some time in the messaging world, and at increasing rates for database and other application workloads.  I’m often surprised at how many people either don’t know the contents and implications of their service provider’s service level agreement (SLA), or fail to adjust the architecture of the private cloud solution before directly comparing costs.

Here are my five lessons for evaluating SaaS, PaaS, and IaaS provider SLAs:

Lesson 1: Make sure that what’s important to you is covered in the SLA

Lesson 2: Make sure that the availability guarantee is what you require of the service

Lesson 3: Evaluate the gap between a service outage’s cost to business and the financial relief from the provider

Lesson 4: Architect public and private clouds to the same levels of availability for cost estimate purposes

Lesson 5: Layer availability features onto private clouds for business requirement purposes

I’ll use the Office 365 SLA to explore this topic – not because I want to pick on Microsoft,  but because it’s a very typical SLA, and one of the services it offers (email) is so universal that it’s easy to translate the SLA’s components into the business value that you’re purchasing from them.

Defining availability

The math is simple.  It’s a 99.9% uptime guarantee with a periodicity of one month:

Monthly uptime % = (total minutes in the month − downtime minutes) ÷ total minutes in the month × 100

If that number falls below 99.9, then they have not met their guarantee.  For what it’s worth, during a 30-day month, the limit is about 44 minutes of downtime before they enter the penalty – or about 8.7 hours per year.
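If you want to check the arithmetic yourself, here’s a two-line helper (a quick Python sketch of mine – not Microsoft’s math):

    def allowed_downtime_minutes(uptime_pct, days):
        # Minutes in the period times the fraction not covered by the guarantee.
        return days * 24 * 60 * (1 - uptime_pct / 100)

    print(round(allowed_downtime_minutes(99.9, 30), 1))        # ~43.2 minutes/month
    print(round(allowed_downtime_minutes(99.9, 365) / 60, 1))  # ~8.8 hours/year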

But what does “Downtime” mean?  Well, it’s stated clearly for each service.  This is the definition of downtime for Exchange Online:

“Any period of time when end users are unable to send or receive email with Outlook Web Access.”

Here’s what’s missing:

  • Data:  The mailbox can be completely empty of the email the user has previously sent and received.  In fact, the email can disappear as soon as they receive it.  As long as they can log in via OWA, the service is considered to be “up”.
  • Clients:  Fat Outlook, BlackBerry, and Exchange ActiveSync clients (iPhone/iPad/Windows Phone, and most Android) are not covered in any way under the SLA

Lesson 1: Make sure that what’s important to you is covered in the SLA

Lesson 2: Make sure that the availability guarantee is what you require of the service

Balancing SLA penalties with business impact

My Internet service is important to me.  When it’s down, I lose more productivity than the $1/day or so I spend on it.  Likewise, email services are probably worth more than the $8/month/user or so you might pay your provider for them.  That doesn’t mean you should spend more than you need on email services.  But it does mean that if you suffer an extended or widespread outage, there will likely be a large gap between the productivity cost of the downtime and the financial relief you’ll see in the form of free services from the provider.

[Image: Callahan Auto Parts also offers a guarantee]

I’ll put this in real numbers.  Let’s say I have a 200-person organization paying $1,600/month for email services from a provider.  If my email is down for a day during the month, my organization experiences about 97% uptime for that month and, as a result, is entitled to a service credit from the provider worth about $800 – half the monthly bill.


The actual cost of my downtime will very likely exceed $800.  To calculate that cost, we need the number of employees, the loaded cost per hour for the average employee, and the productivity cost of losing email services.  For our example of 200 employees, let’s imagine a $50/hour average loaded cost to the business and a 25% loss of productivity when email is down:

200 employees x $50 cost per hour x .25 productivity loss x 8 hour outage = $20,000 of lost productivity

Subtract the $800 in free services the organization will receive the next month, and the organization’s liability is $19,200 for that outage.
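Here’s that arithmetic as a small Python sketch, using the same assumed numbers, so you can plug in your own:

    employees         = 200
    loaded_cost_hr    = 50.0    # average loaded cost per employee-hour
    productivity_loss = 0.25    # fraction of productivity lost without email
    outage_hours      = 8
    sla_credit        = 800.0   # roughly half the $1,600 monthly bill

    outage_cost = employees * loaded_cost_hr * productivity_loss * outage_hours
    print(f"Lost productivity: ${outage_cost:,.0f}")              # $20,000
    print(f"Uncovered gap:     ${outage_cost - sla_credit:,.0f}") # $19,200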

Now how do you fill that gap?  I’m not entirely sure.  It could just be a risk of doing business – after all, the business would simply absorb that cost if it were hosting email internally and suffered an outage.  If the risk and impact were large enough, I would probably seek to hedge against it – exploring options to bring services in-house quickly, or even looking to an insurance company to defray the cost of outages.  If Merv Hughes can insure his mustache for $370,000, then surely you can insure the availability of your IT services.  Regardless, it’s wise not to confuse a “financially backed guarantee” with actual insurance or assurance against outage.

File Photo:  What a $370k mustache may look like.  Strong.

Lesson 3: Evaluate the gap between a service outage’s cost to business and the financial relief from the provider

Comparing Apples to Oranges

[Image: apples and oranges.  See what I did there?]

Doing a cost comparison between a public cloud designed to deliver 99.9% availability and a private cloud designed to provide 99.99% or 99.999% availability makes little sense, but I see people do it frequently.  Usually it’s because the internal IT group’s mandate is to “make it as highly available as possible within the budget”.  So I’ll see a private cloud solution with redundancy at every level, capabilities to quickly recover from logical corruption, and automated failover between sites in the event of a regional failure, compared to a public cloud solution that provides nothing but a slim guarantee of 99.9% availability.  In that instance, it’s obvious why the public cloud provider is less expensive, even without factoring in economies of scale.

To illustrate this, I usually refer to Maslow’s handy-dandy Hierarchy of Needs, customized for IT high availability.

[Images: Single-site and multi-site hierarchies of need]

If I want to make an accurate comparison between a public cloud provider’s service and pricing and what I can do internally, I often have to strip out a lot of the services that are normally delivered internally.  Here are the steps:

  1. Architect for equivalence.  If a public cloud provider offers just 3 9’s and no option for site-to-site failover, then for my database services I might just spec a standalone database server.  Maybe I’d add a cheap rapid recovery solution (like snapshots or clones) to hedge against complete storage failure, and cluster at the hypervisor layer to provide some level of hardware redundancy.  If my cloud provider offers disaster recovery, I’d figure out their target RPO/RTO and insert a solution that matches that capability.
  2. Do a baseline price comparison.  Once I’ve got similar solutions to compare, I can compare price.  We’ll call this the price of entry.
  3. Add capabilities to the private cloud solution after the baseline.  I only start layering features that add availability and flexibility to the solution after I’ve obtained my baseline price.  Only then can I illustrate the true cost of those features, and compare them to the business benefits.

Lesson 4: Architect public and private clouds to the same levels of availability for cost estimate purposes

Lesson 5: Layer availability features onto private clouds for business requirement purposes

Is it time to say goodbye to Jetstress?

The short answer?  “Yes”  The long answer?  “Yyyyyesssss”

But first let me get this out of the way:  If you want to run Jetstress against any storage configuration I come up with, feel free.  I wouldn’t put it forward if I weren’t confident it could handle the workload. 

Prior to 2007, Jetstress REALLY mattered.  You had 500 MB mailboxes that could easily drive 2-5 IO/s per mailbox.  Cached clients were rare, so storage latency was the primary driver of customer complaints.  Over the last ten years, Microsoft has put a lot of effort into making Exchange a much more storage-friendly application, and they’ve succeeded.  Today you have 0.1 IO/s per mailbox, spread over 2-5 GB.  Exchange is now Just Another Workload.  So why are we spending all this time and money (not to mention implementation delays) on an unwieldy, purpose-built testing tool for something that’s Just Another Workload?

With Exchange 2010 and its very modest IO profile, I question the value of Jetstress as opposed to other testing tools.  The level of effort and sheer amount of time required to create the databases, replicate them, and then run the test are significant.  It can run to weeks for a reasonably sized deployment.  Yes, you get assurance that your storage rig is operating properly, but you can get that assurance from tools like Iometer, which takes minutes to set up and mere hours to complete.

For all the effort and time involved in a Jetstress run, I just expect more.  I’d expect my entire infrastructure to be validated.  I’d expect assurance that I have enough RAM and CPU in my virtual machines, that my network is up to snuff, that access to my domain controllers and global catalog servers is sufficient… but I don’t get any of that with Jetstress.

If I’m going to put that kind of time and effort into my testing, I’m going to fire up an entire infrastructure, use Loadgen, and verify my entire configuration – not just my storage.  On the other hand, if I’m going to test my storage independently from my server and network, I would:

Roll my own Exchange IO test with Iometer in 30 minutes

  • Set up your storage on your production mailbox server
  • Determine the file sizes for your database and logs.  You can find them on the LUN Requirements tab of the Exchange Mailbox Storage Calculator

[Image: LUN Requirements tab of the Exchange Mailbox Storage Calculator]

  • Using fsutil (a built-in Windows command-line tool), create files called iobw.tst sized according to the DB Size + Overhead and Log Size + Overhead values.  For our example, we’re looking at 1595 GB database files and 34 GB log files.  This part is not strictly necessary, but I like it: creating a file called iobw.tst in the root directory of the target prevents Iometer from creating thick files that occupy the entire LUN.  (The byte counts are sanity-checked in the sketch just after the commands.)
    • fsutil file createnew e:\iobw.tst 1712618209280 <----------simulated 1595 GB database file
    • fsutil file createnew f:\iobw.tst 36507222016 <-------------simulated 34 GB log file
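fsutil takes a size in bytes, so the counts above are just binary gigabytes converted to bytes – here’s a quick Python check of that arithmetic (a convenience of mine, not part of the procedure):

    GiB = 1024 ** 3    # bytes per binary gigabyte
    print(1595 * GiB)  # 1712618209280 -> the database file
    print(34 * GiB)    #   36507222016 -> the log file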
  • Download Iometer and the Exchange 2010 .icf file I’ve created.  Launch Iometer and open the .icf file.
    • If you’re using mount points instead of drive letters, download the latest Iometer Release candidate for mount point support
  • Determine the target IO throughput for the databases and logs.  You can find this on the Role Requirements tab of the Exchange 2010 Mailbox Storage Calculator

[Image: Role Requirements tab of the Exchange 2010 Mailbox Storage Calculator]

  • Modify the transfer delay in the “Exchange 2010 DB Workload” Global Access Specification so it will generate the desired number of IOs.  The math is: 1000 ÷ target IO/s = transfer delay in milliseconds.  Our example requires 30 IO/s per database, and 1000 ÷ 30 = 33.3, so we’ll set it to 33.  The original in the .icf file is 25, which would generate 40 IO/s.

[Image: “Exchange 2010 DB Workload” access specification in Iometer]

  • Modify the transfer delay in the “Exchange 2010 Log Workload” Global Access Specification so it will generate the desired number of IOs.  Our example requires 7 IO/s per log LUN, and 1000 ÷ 7 = 142.8, so we’ll set it to 143.  The original in the .icf file is 100, which would generate about 10 IO/s.  (A quick helper for this arithmetic follows the screenshot.)

[Image: “Exchange 2010 Log Workload” access specification in Iometer]
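The transfer-delay math from the last two steps boils down to one line – here it is as a quick Python helper (my convenience function, not part of Iometer):

    def transfer_delay_ms(target_iops):
        # One IO per delay interval: 1000 ms divided by the target IO/s.
        return round(1000 / target_iops)

    print(transfer_delay_ms(30))  # 33  -> “Exchange 2010 DB Workload”
    print(transfer_delay_ms(7))   # 143 -> “Exchange 2010 Log Workload”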

  • Assign the DB Worker and BDM Worker to the database LUNs
  • Assign the Log Worker to the Log LUNs
  • Click the green flag to start.  Let it run for 5-10 minutes as a quick sanity check, then stop it.  Make sure it’s driving the IO you want at the latencies you expect, and that you’re not gated by CPU or anything like that.
  • Start a perfmon data collection (perfcollect is good for this).
  • Modify the run time on the Test Setup tab to however long you’d like.  I recommend at least a few hours.

[Image: Iometer Test Setup tab]

  • Take a nap
  • Go for a run
  • Eat some food
  • Watch some TV
  • When the test completes, open your perfmon file and look at your disk latencies.  Make sure they’re steady, that there were no spikes, and that there were no aberrations in the number of IO/s
    • If you’re an EMC customer and use perfcollect, zip up the perfcollect data collection and send it to your TC, or reseller TC, and ask for a WPA (miTrend) report on the server(s).  You’ll get a nicely formatted report with graphs and tables and twenty-seven 8x10 color glossy pictures with circles and arrows and a paragraph on the back of each one

Using this method, you can get in and out of testing mode easily within 36 total hours, and your hands-on time will be less than an hour of setup and analysis.  That translates into weeks your users can spend enjoying your cool new messaging infrastructure.