Community for Sharing Storage Resource Management (SRM) Custom Reports

February 21, 2014 Corporate IT, EMC SRM, EMC VMAX No comments

EMC SRM 3.0

I was in an EMC and EMC Partner training class on SRM 3.0 (Storage Resource Management Suite) today, and came away very excited about one of the latest software assets in EMC’s portfolio (SRM home page).  SRM 3.0 is a comprehensive tool for reporting on the capacity and performance of the largest storage enterprises in the world.  Using passive or standards-based active discovery, SRM can provide comprehensive capacity and performance reporting and trending across your entire IT stack: VMs, physical servers, SAN infrastructure, and storage elements.

EMC is becoming very good at providing a simple and easy out-of-the-box experience for its software products while preserving a very rich underlying functionality and flexibility that elevates EMC products above the competition. This balance is very evident in SRM 3.0, where predefined active dashboards show summary data, and allow click-through to the underlying metrics and detailed reports. (Watch a demo video here).

During the class lecture (my first hands-on experience with the tool), I was able to modify one of the standard storage pool reports to show the data reduction associated with Thin Provisioning on VMAX and VNX. I discussed some of these formulas in a previous post (Some EMC VMAX Storage Reporting Formulas).

One of the best parts of SRM 3.0 is the ability to export and import Report Templates. As such, I exported the report I customized, and posted the XML here for you to download and import into your own SRM 3.0 installation.

I’d like to invite all of you to share the reports you have customized for your environment with the community, so that we can engage the power of everyone to advance the art of storage resource management.

My first python and JSON code

December 11, 2013 Devops, EMC VMAX 3 comments , , , ,

I’m not a developer. There I said it.

I’m a presales technologist, architect, and consultant.

But I do have a BS in Computer Science. Way back there in my hind brain, there exists the ability to lay down some LOC.

I was inspired today by a peer of mine Sean Cummins (@scummins), another Principal within EMC’s global presales organization. He posted an internal writeup of connecting to the VMAX storage array REST API to pull statistics and configuration information. Did you know there is a REST API for a Symmetrix?! Sad to say most people don’t.

He got me hankering to try something, so I plunged into Python for the first time, and as my first example project, I attached to Google’s public geocoding API to map street addresses to Lat/Lng coordinates (since I don’t have a VMAX in the basement).

So here it is. I think it’s a pretty good first project to learn a few new concepts. I’ll figure out the best way to parse a JSON package eventually. Anyone have any advice?

###################
# Example usage
###################
$ ./geocode.py
Enter your address:  2850 Premiere Parkway, Duluth, GA                 
2850 Premiere Parkway, Duluth, GA 30097, USA is at
lat: 34.002958
lng: -84.092877

###################
# First attempt at parsing Google's rest api
###################
#!/usr/bin/python

import requests      # module to make HTTP calls
import json          # module to parse JSON data

addr_str = raw_input("Enter your address:  ")

maps_url = "https://maps.googleapis.com/maps/api/geocode/json"
is_sensor = "false"      # do you have a GPS sensor?

payload = {'address': addr_str, 'sensor': is_sensor}

r = requests.get(maps_url,params=payload)

# store the json object output
maps_output = r.json()

# create a string in a human readable format of the JSON output for debugging
#maps_output_str = json.dumps(maps_output, sort_keys=True, indent=2)
#print(maps_output_str)

# once you know the format of the JSON dump, you can create some custom
# list + dictionary parsing logic to get at the data you need to process

# store the top level dictionary
results_list = maps_output['results']
result_status = maps_output['status']

formatted_address = results_list[0]['formatted_address']
result_geo_lat = results_list[0]['geometry']['location']['lat']
result_geo_lng = results_list[0]['geometry']['location']['lng']

print("%s is at\nlat: %f\nlng: %f" % (formatted_address, result_geo_lat, result_geo_lng))
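To partially answer my own question above: one safer pattern is to check the status field before digging into the nested lists and dictionaries, so a bad address returns a clean failure instead of throwing an IndexError. A minimal sketch (the function name and shape are mine, assuming the response format shown above):

```python
def parse_geocode(maps_output):
    """Pull (formatted_address, lat, lng) out of a Google geocode response.

    Returns None when the API reports anything other than 'OK' or the
    response doesn't have the expected shape.
    """
    if maps_output.get('status') != 'OK':
        return None
    results = maps_output.get('results', [])
    if not results:
        return None
    first = results[0]
    location = first['geometry']['location']
    return (first['formatted_address'], location['lat'], location['lng'])
```

The main script then becomes `parsed = parse_geocode(r.json())` followed by a simple `if parsed is None:` error branch.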

Enterprise Storage Scale Out vs Scale Up

November 22, 2013 Corporate IT, EMC VMAX No comments

A lot of what we hear in the cloud washed IT conversation today is that scale out architectures are the preferred method of deployment over scale up designs. As the theory goes, scale out architectures provide more and more compute connectivity and storage capacity as you add nodes.

Most vendor marketing has jumped onto the scale out bandwagon, leaving implementations that also scale up indistinguishable from the rest. Properly conceived scale out architectures allow us to add additional discrete workloads to an existing system by adding nodes to the underlying infrastructure. The additional nodes accommodate the processing required to support the additional separate workloads. This kind of share-nothing design allows us to stack a large quantity of discrete workloads together and manage the entire system as one entity, even though it can be very hard to balance workload across the (siloed) nodes.

But what about the large quantity of monolithic workloads that need scalable performance but cannot be distributed across independent nodes? Take for example a single database instance that begins life with just a few users and is supported well on a single node of a scale out architecture. As the instance grows and more users are added, the database expands beyond the confines of the single node. In many pure scale out implementations, adding additional nodes is ineffectual: the additional nodes do not share in the burden of the workload from the first node. They are designed merely to accommodate additional discrete workloads.

EMC scale-out-and-up products such as VMAX and Isilon provide the cost benefits of a scale out model with the performance benefits of a scale up model. Purchasing only a few nodes initially, IT gains the cost benefit of starting small; as workloads increase, additional nodes can be brought online, and data and workload are automatically redistributed across all nodes. This creates a share-everything approach that supports the ability to scale up discrete applications while also providing the ability to scale out, stacking additional applications into the array along with the first. Software-based controls provide for multi-tenancy, tiering and performance optimization, workload priorities, and quality of service, as opposed to scale out architectures where rigid hardware-defined partitions separate components of the solution.

I hope this helps demystify “scale out” a little. Just because you can manage a cluster of independent nodes doesn’t mean it’s an integrated system. Unless workload tasks are shared across the entire resource pool, scale out does not confer the benefits of scale up; scale-up-and-out designs provide the best of both worlds.

Some EMC VMAX Storage Reporting Formulas

November 21, 2013 Corporate IT, EMC VMAX 7 comments

What goes around comes around. Back in the day everything was done on the command line. Storage administrators had their scripts, their hands never had to leave the keyboard, they could provision storage quickly and efficiently, and life was good. Then came the age of the easy-to-use GUI. Non-storage administrators could deploy their own capacity, and vendors espoused how much easier and more effective IT generalists could be at managing multiple petabytes of capacity with the mouse. Now we’re into the age of the API. Storage, along with all other IT resource domains, has its reporting API, and developers are now writing code to automatically report on and provision storage in the enterprise. And many shops have realized that the command line is as effective an API for automating reports and doing rapid deployments as any REST based API could be.

So I’m getting a lot of questions again how to do some basic reporting on a VMAX storage array from command line generated or API generated data. Customers are writing their own reporting dashboards to support multiple frameworks.

The most fundamental question is how much of my storage array have I consumed? Assuming one is inquiring about capacity and not performance, there are two aspects to that question. The first question is from the business, “How much capacity has been allocated to my hosts?” In other words how much capacity do my hosts perceive they have been assigned? In an EMC VMAX array this has to do with the Subscription rate. For better or for worse we provide a Subscription percentage of a pool that has been Subscribed, as opposed to an absolute value of capacity.

To calculate how much of your array has been Subscribed, sum the products of the Pool % Subscribed by Pool Total Available GB to get the total capacity Subscribed to the hosts.

symcfg list -thin -pool -detail

 

EFD Pool Subs % * EFD Pool Total GBs = EFD Subscribed GB
FC Pool Subs % * FC Pool Total GB = FC Subscribed GB
SATA Pool Subs % * SATA Pool Total GB = SATA Subscribed GB
Sum the totals above = Array Subscribed GB

Using the example above, this would be:

0% * 4,402.4 = 0.0  EFD Subscribed GB
177% * 77,297.4 = 136,816.398 FC Subscribed GB
73% * 157,303.7 = 114,831.701 SATA Subscribed GB
251,648.099 Array Subscribed GB

Then compare that number to the sum of the Total GBs provided by each of your three pools to reveal the ratio of the subscribed array capacity versus the total available capacity.

If your Array Subscribed GB is larger than your Array Total GB, tread carefully: you have oversubscribed your array’s capacity!

Subscribed GB / Array Total GB = Array Subscription Rate
251,648.099 / 239,003.5 = 105.29% oversubscribed in this case!

To understand how much capacity has been physically allocated on the array after all data reduction technologies have done their job (thin provisioning, compression, de-dupe), simply sum the Used GBs capacity from each of your pools to get an overall array consumed capacity. Compare this number to the previously mentioned subscription capacity to show the benefit of data reduction technologies in the array.

Sum of Used GBs above is ( 3855.6 + 38708.4 + 43987.6 ) = 86,551.6 GB
251,648.099 Subscribed GB / 86,551.6 Consumed GB = 2.9 : 1 Data Reduction provided by Thin Provisioning!  Not bad at all!
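The whole calculation is easy to script once your reporting pipeline has the per-pool numbers. A sketch in Python using the example figures above (pool names and values come from this particular array; adapt the parsing to your own symcli or API output):

```python
# Per-pool figures as reported by the array (Subs %, Total GBs, Used GBs)
pools = {
    'EFD':  {'subs_pct': 0,   'total_gb': 4402.4,   'used_gb': 3855.6},
    'FC':   {'subs_pct': 177, 'total_gb': 77297.4,  'used_gb': 38708.4},
    'SATA': {'subs_pct': 73,  'total_gb': 157303.7, 'used_gb': 43987.6},
}

# Sum of (Pool Subs % * Pool Total GB) = Array Subscribed GB
subscribed_gb = sum(p['subs_pct'] / 100.0 * p['total_gb'] for p in pools.values())

total_gb = sum(p['total_gb'] for p in pools.values())   # Array Total GB
used_gb = sum(p['used_gb'] for p in pools.values())     # Array Consumed GB

subscription_rate = subscribed_gb / total_gb   # > 1.0 means oversubscribed
thin_savings = subscribed_gb / used_gb         # data reduction ratio, ~2.9:1 here
```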

Likewise sum the free capacity measures from each pool together to get the available capacity remaining in the array.

There you have it. If you’re not into the GUI, just a little math provides the dashboard data you need to show your executives the capacity utilization of the array.

Happy hacking!

Array Sizing based on ITaaS Principles

November 21, 2013 Corporate IT, EMC VMAX No comments

There are two hats we wear when creating an IT service offering.  The first is the customer-facing marketing hat where customer needs are defined, and service offerings are advertised at various price levels in a catalog.  The second hat is the backend product management hat where the front-facing service catalog is translated to various reference architectures at well defined cost levels.

The goal when running IT as a business (aka: ITaaS) is that the fully burdened costs associated with the reference architecture are recouped over a reasonable period of time by the pricing set in the service catalog.

A few rules to live by in this world view related to enterprise storage:

  • The customer (business user) only has visibility to the Service Catalog.  All communications with the customer are based on the language used in the Catalog.
  • The customer (business user) never sees the Reference Architecture that supports each of the Services listed in the Catalog.  Never have customer communications about arrays, drive mix, RAID type, methods, processes, etc.
  • If the customer is unhappy with the service they are receiving, an evaluation must be made as to whether or not they are receiving the expected Service Level Objective as defined in the Service Catalog.  If not… it’s IT’s problem.  If they are and are still unhappy, then a discussion must follow about moving them to a different Tier of service as defined in the Service Catalog.
  • If a differentiated Tier of service does not exist in the Service Catalog, and there is a defined customer need for that class of service, then an end-to-end service definition must occur.  IT must create and price a new offering in the Service Catalog.  IT must also create a corresponding new definition in the Reference Architecture to explain how that service will be accomplished and at what cost.

What does this have to do with sizing a storage array?

Whether or not IT is living out the utopian vision of ITaaS, some concrete realities emerge from the ITaaS vision that show up in array sizing, namely:  Subscribed vs. Allocated capacity planning, and performance planning measured in IOPS/GB.

Warning:  There are lots of different considerations when it comes to array performance:  Write%, Skew, Cache Hit Rate, Sequentiality, Block Size, app sensitivity to Latency vs. Throughput, advanced data protection services etc.  Let’s take these for granted for now and focus on IOPS/GB, because frankly that’s how the Service Providers do it.

How much capacity do I need to purchase?

Customer facing Service Catalog capacity planning should start with the Subscribed capacity.  This defines the promise IT makes to the user about how much capacity is available to store data and at what tier. This is the advertised capacity to which pricing is set.

IT internal Reference Architecture capacity planning should worry more about the Consumed capacity after any available data reduction technologies do their work (thin, de-dupe, compress, etc).  This is the actual capacity (plus buffer) to be purchased to which costs are allocated.

What kind of performance do I need?

Customer facing Service Catalog performance planning can be done by SLA or SLO.  Do you want to establish a worst case scenario floor that you promise to exceed every time (SLA)?  Or do you establish a best case performance target ceiling you promise to try and deliver (SLO)?  The world is moving away from SLAs to SLOs.  The IOPS/GB performance level is defined as the IOPS to the Host divided by the Subscribed GB (IOPS / GBs).

IT internal Reference Architecture performance planning should be done against IOPS to the Host divided by the Thin Consumed GB (IOPS / GBc).

An IOPS / GB sizing example that leads to poor performance

A Business customer has an existing application in a legacy storage environment.  They’ve got 40TB of capacity assigned and are doing about 4,000 IOPS to the array.

That’s 4,000 / 40,000 = .1 IOPS/GBs (Host IOPS per Subscribed GigaByte).

A purchase is made for 56 x 1TB SATA drives RAID 6 6+2 providing ~40TB capacity, and  4,480 IOPS (56 * 80 IOPS) to fit the .1 IOPS/GB workload profile rather well (ignoring too many factors).

But … after Thin Provisioning, only 40% of the capacity is consumed (16TB), and the IT Finance Dept says, wow, I’ve still got another 24TB of capacity available.  Let’s use it!!  The problem is, they’ve consumed all their IOPS, leaving a lot of capacity stranded and unable to be used.

An IOPS / GB sizing example that leads to good performance

A Business customer has an existing application in a legacy storage environment.  They’ve got 40TB of capacity assigned and are doing about 4,000 IOPS to the array.  Planning for Thin Provisioning, we expect only the first 40% (16TB) will actually be Consumed.

That’s 4,000 / (40,000 * 40%) = .25 IOPSh/GBc (Host IOPS per Consumed GigaByte).

A purchase is made for 42 x 600GB 10K drives RAID 5 3+1 providing ~17TB capacity, and  ~5,000 IOPS (42 x 120) to fit the workload profile rather well (ignoring many factors, see “Disk IOPS” discussion further down).  This also assumes the risk that Thin Provisioning or other data reduction technologies are able to do their job effectively and allow you to purchase less capacity than in a “thick” world.

Over time, the app performs well, and is more properly balanced in both the IOPS and TB dimensions.
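The two sizing examples above differ only in which capacity figure goes in the denominator, which is easy to capture in a small helper (a sketch; the function name and parameters are mine):

```python
def iops_density(host_iops, subscribed_gb, thin_consumption=1.0):
    """Host IOPS per GB of capacity.

    Pass thin_consumption < 1.0 to plan against expected Consumed GB
    (IOPSh/GBc) instead of Subscribed GB (IOPS/GBs).
    """
    return float(host_iops) / (subscribed_gb * thin_consumption)

# The poor plan sizes against the full Subscribed capacity:
poor_plan = iops_density(4000, 40000)          # 0.1 IOPS/GBs
# The good plan sizes against the 40% expected to actually be Consumed:
good_plan = iops_density(4000, 40000, 0.40)    # 0.25 IOPSh/GBc
```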

What if your boss says you must purchase the Thick capacity?

Some pointy haired bosses who don’t trust that Thin Provisioning will provide real benefit will say that you must purchase the “Thick” capacity amount due to various sociological / political / economic conditions within your organization.

In this case, make two assumptions:

  1. Your goal is to reach 80% capacity consumed over the life of your array
  2. Thin will cause 40% Consumption of every Subscribed GB

Use the formula Host IOPS  / ( Thick GB * 40% ) to arrive at your target IOPS/GB for the entire array’s worth of capacity!  This will be a much bigger  number than before, and will equate to a much higher performance solution.  The trick is, when Thin Provisioning (or other data reduction) kicks in, you’ll have so much head room, that someone will want to stack additional applications onto the array to consume all the capacity.

You’ll need the additional performance to accommodate this inevitable consolidation.

Use the above calculated IOPS/GB * (Thick GB * 80%) to reveal the target Host IOPS needed to achieve full efficient use of the installed capacity.  It’s never a good idea to consume more than 80% of a presumably over-subscribed resource.  It’s ok to target your performance to equate to 80% capacity consumption.
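The two formulas applied to the running example look like this (a sketch; the 40% and 80% figures are the two assumptions above):

```python
thick_gb = 40000          # capacity the boss makes you buy (GB)
host_iops = 4000          # workload measured on the legacy array

# Assumption 2: Thin Provisioning consumes only 40% of every Subscribed GB
target_density = host_iops / (thick_gb * 0.40)         # 0.25 IOPS/GB

# Assumption 1: plan performance for 80% of the thick capacity being consumed
target_host_iops = target_density * (thick_gb * 0.80)  # 8,000 Host IOPS
```

That 8,000 Host IOPS figure is the headroom you’ll need when the inevitable consolidation fills the array.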

Let’s add RAID Type and Read/Write % into the mix

So far, I’ve been focused on “Host IOPS” when discussing IO Density.  Now that we’ve calculated the Host IOPS and Capacity properly, how do we configure an array to achieve a certain Host IOPS Density?  We need to translate Host IOPS down into “Disk IOPS.”
By Disk IOPS, I mean the amount of work the underlying Flash or HDD Spindles must do to accomplish the Host IOPS goal.  I won’t have room to discuss all of the potential factors that go into a model like this, but we’ll get started down the path.

FYI, some of those factors are: block size, cache benefits, data reduction (benefit or overhead), replication, etc.

Most implementations of the traditional RAID types RAID-1, RAID-5, and RAID-6 carry with them some form of write IO penalty. When a host writes to a RAID-1 mirrored device, 2 IOs are generated by the backend storage: the write to the first spindle, and the second write to its mirror pair.  So we say that the write penalty for RAID-1 is 2.  When a host writes to a RAID-5 protected device and the write I/O is not the full stripe width, potentially 4 IOs must happen on the backend of the array: one read I/O of the parity portion, a second read I/O of the data stripe, parity calculations in memory, then a third IO to write the parity portion, and a fourth and final IO to write the data stripe. So we say that RAID-5 has a 4 IO write penalty. For RAID-6, there must be a read of the first parity, a read of the second parity, and a read of the data stripe, and after the parity calculations are done three writes must occur to finish the IO. So we say that RAID-6 carries a 6 IO write penalty.

Coupled with RAID type, we must know the read/write percentage of our workload. We assume that reads do not carry any penalty to the backend spindles, so for every host read there is a single disk I/O.  Obviously at this point we are not factoring in any cache benefits.  We must multiply every host write by the write IO penalty previously described. So knowing that we need to support a 10,000 IO workload which is 60% writes, we can calculate how many IOPS the backend disks must be able to provide.

IOPS to Disk  = %READS * Host-IOPS + %WRITES * Host-IOPS *  RAID-Penalty

Where the RAID-Penalty is 2, 4, or 6 as described above.

Now that we know how many disk IOPS our spindles must be able to produce, how many spindles do we need? Use the following chart to reference how many IOPS per spindle technology type can be provided with no more than one outstanding queued I/O to the spindle (aka:  good response times).

2500 IOPS per EFD (200GB EFD = 12.5 IOPS/GB)
180 IOPS per 15K RPM spindle (2.5″ or 3.5″ doesn’t matter)  (300GB = .6 IOPS/GB)
120 IOPS per 10K RPM spindle (2.5″ or 3.5″ doesn’t matter)  (600GB = .2 IOPS/GB)
80 IOPS per 7.2K RPM spindle (typically SATA)  (2TB = .04 IOPS/GB)

A Final Example

A Business customer has an existing application in a legacy storage environment.  They’ve got 40TB of capacity assigned and are doing about 4,000 IOPS to the array.  Planning for Thin Provisioning, we expect only the first 40% (16TB) will actually be Consumed.  Our pointy haired boss is going to require that we purchase all 40TB of capacity… hmph.

That’s 4,000 / (40,000 * 40%) = .25 IOPSh/GBc (Host IOPS per Consumed GigaByte).

Now factoring in the inevitable growth that we know will happen due to all of the free space that will remain in the array, we say:

.25 IOPS/GB * (40,000 * 80%) = 8,000 Host IOPS that the array solution will need to support over its lifespan

Ignoring the complexity of multi-tiered solutions, let’s look for a single spindle type that will meet our needs.  Our IOPS/GB target is .25.  From the previous table, it looks like a 600GB 10K is .2 and a 300GB 15K is .6… Hmmm, what about a 600GB 15K drive?  That would be:

180 IOPS / 600GB = .3 IOPS/GB.  It looks like a match!

Since this is a consolidated general purpose VMware farm that we’re supporting, let’s assume the Read/Write ratio will be 40% Reads and 60% Writes (this is VERY consistent across VMware clusters, FYI).  Also, since we’re using fast FC spindles, let’s see what a RAID-5 3+1 solution will look like, shall we?

Here’s our formula repeated from above:
IOPS to Disk  = %READS * Host-IOPS + %WRITES * Host-IOPS *  RAID-Penalty

40% * 8,000 + 60% * 8,000 * 4 = 22,400 IOPS to Disk
Wow, that’s a lot of write penalty, since we only needed 8,000 Host IOPS.  Our 600GB 15K RPM drive can do 180 IOPS per spindle, so 22,400 / 180 = ~124 spindles… let’s round up to 128, since that’s a nice power of 2, to achieve the performance needed to support our hosts.

Now, does this provide enough capacity?  Well, 128 * 600GB * 3/4 (RAID Overhead) = 57,600 GB.  It looks like we’ve over-shot our capacity numbers a little.  Perhaps a 300GB drive would be more efficient?  Maybe a 300GB 10K?  Maybe we should look at a 3 tier approach with a little EFD, some FC, and most of the capacity in SATA?
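The tail end of the example, scripted (a sketch using the spindle table above):

```python
import math

disk_iops_needed = 22400        # from the RAID-5 write-penalty math above
drive_iops = 180                # 600GB 15K RPM spindle
drive_gb = 600

min_spindles = math.ceil(disk_iops_needed / float(drive_iops))  # ~125
spindles = 128                  # rounded up to a friendly power of two

# RAID-5 3+1 keeps 3 of every 4 drives' worth of capacity for data
usable_gb = spindles * drive_gb * 3 / 4.0                       # 57,600 GB
```

The capacity check surfaces the ~57.6TB overshoot called out above, which is the cue to iterate on drive size or tiering.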

And this dear reader is an exercise for you ;-)

Take this to the extreme with all flash arrays

Always-on in-line deduplication provided by quality All Flash Arrays takes this notion to the extreme.  These arrays don’t allocate anything unless it’s globally unique data.  A single EMC XtremIO X-brick provides 7.5TB of usable capacity with 70TB of logical de-duped capacity.  This 10X increase in the potential IOPS/GB density requires Flash at the core and innovative workload management algorithms to enable extreme performance against a very limited set of allocated data regions.

What about the benefits of cache?

It is a dangerous and complex game to try and figure out the benefit of cache in any enterprise storage array. Cache serves two purposes. First, cache provides a buffer to accept spikes in host generated workload for asynchronous destage to the backend media after acknowledgment to the host. This function provides good latency to the host while leveling out the workload characteristics seen by the backend spindles, but it does not necessarily reduce the amount of work that the spindles must do overall.  The second function of cache is to legitimately reduce the amount of work the backend spindles must accomplish. This is done by caching host writes for subsequent reuse, and by prefetching reads that are predicted for near future use.  In my experience, prefetched reads reduce backend spindle work to about 30% of what it would normally be.  Accommodating host rewrites to active cached data before it is destaged to the backend saves the spindles from having to do any of that work.  The question is: how do you predict the amount of read cache hits, read cache misses, prefetched reads, writes to backend, and re-writes from any given host workload?  These are array-specific implementation details that depend on your specific host workload profile, and they make it very dangerous to assume much cache benefit.  Just let it come naturally and be the icing on the cake of a good performing solution.

Shining some light on All Flash Array performance

October 25, 2013 Corporate IT No comments

The industry is finally seeing widespread adoption of All Flash Arrays (AFA) now that the cost of flash technology has made these things reasonably affordable for Enterprise customers.  These represent the next technological jump in storage technology, one that will cause storage professionals to unlearn what we have learned about performance, cost to value calculations, and capacity planning.

The first arrays to market used existing architectures just without any moving parts.

Not much thought was put into their designs; flash simply replaced spinning disk, to questionable long term results.  They continue to use inefficient legacy RAID schemes and hot spares. They continue to use legacy processor-plus-drive-shelf architectures that limit scalability.  If they introduced deduplication (a natural fit for AFAs), it was an afterthought post-process that didn’t always execute under load.

IDC has recently released a report titled All-Flash Array Performance Testing Framework, written by Dan Iacono, that very clearly outlines the new performance gotchas storage professionals need to watch out for when evaluating AFAs.  It’s easy to think AFAs are a panacea of better performance.  While it’s hard not to achieve results better than spinning media arrays, IDC does a great job outlining the limitations of flash and how to create performance test frameworks that uncover the “horror stories,” as IDC puts it, in the lab before purchases are made for production.

Defining characteristics of flash based storage

You can’t overwrite a cell of flash memory like you’d overwrite a sector of a magnetic disk drive.  The first time a cell is written, the operation occurs very quickly: the data is simply stored in the cell, basically as fast as a read operation.  For every subsequent re-write, though, you must first erase a block of cells and then program them again with the new data to be stored.  This adds latency to every write IO after the first, and testing should account for it by making sure enough re-writes occur to uncover the performance of the device over time.

Flash wears out over time.  Each time a flash cell is erased and re-written it incurs a bit of damage or “wear.”  Flash media is rated by how many of these program/erase (PE) cycles can occur before the cell is rendered inoperable.  SLC flash is typically rated at 100,000 PE cycles.  Consumer MLC (cMLC) is rated around 3,000, while enterprise MLC (eMLC) must pass higher quality standards to be rated for 30,000 PE cycles.  Most drives provide a wear-leveling algorithm that spreads writes evenly across the drive to mitigate this.  Workload patterns, though, might cause certain cells to be overwritten more than others, so this is not a panacea in all cases.
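To get a rough feel for what those PE ratings mean in practice, here’s a back-of-envelope endurance estimate (entirely my own sketch; it assumes perfect wear leveling and lets you fold in a write amplification factor if you have one):

```python
def drive_life_years(capacity_gb, pe_cycles, daily_write_gb, write_amp=1.0):
    """Very rough drive lifetime estimate.

    Assumes writes are spread perfectly evenly across all cells; real
    wear leveling and write amplification will change the answer.
    """
    lifetime_write_gb = capacity_gb * pe_cycles / write_amp
    return lifetime_write_gb / daily_write_gb / 365.0

# A 200GB eMLC drive (30,000 PE cycles) absorbing 500GB of writes a day:
years = drive_life_years(200, 30000, 500)   # ~33 years under ideal assumptions
```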

Erase before write activity can lock out reads for the same blocks of cells until the write completes.  Different AFA vendors handle data protection in different ways, but in many cases, mixed read/write workload environments will exhibit greatly reduced IOPS and higher latencies than the 100% read hero numbers most vendors espouse.  This is yet another reason to do realistic workload testing to reset your own expectations prior to production usage.

How these flash limitations manifest in AFA solutions.

Performance degrades over time.  Some AFA solutions will have great performance when capacity is lightly consumed, but over time, performance will greatly diminish unless the implementation overcomes the erase-before-write and cell locking issues.  Look for technologies that are designed to scale, with architectures that overcome the cell locking issues inherent to native flash.

Garbage collection routines crush the array.  Under load, the garbage collection routines used to clean up cells marked for erasure can, if not handled properly, crush array performance.  In IDC’s testing, this led to wildly fluctuating AFA performance: sometimes good, sometimes horrible.  Not all arrays exhibit this behavior, and only testing will separate the good from the bad (because the vendors won’t tell you).

$ per usable GB is surprisingly inflated due to inefficient thin + dedup or best practice requirements to leave unused capacity in the array.  Comparing the cost of the raw installed capacity of each array is the wrong way to measure the true cost of the array.  Make sure you look at the true usable capacity expectations after RAID protection, thin provisioning, deduplication, spare capacity, requirements to leave free space available, or other mysterious system capacity overheads imposed but undisclosed by the vendor.  The metric you’re after is dollar per usable GB.

Check out the IDC report. It’s a great education about AFAs, and provides a fantastic blueprint to use when testing AFA vendors against each other.

Overselling Public Cloud Idealism to Enterprise IT Private Cloud Customers

October 4, 2013 Corporate IT No comments

Private Cloud in Enterprise IT has a lot of strong value propositions. Running Enterprise IT more like a Service Provider creates a huge potential win for everyone involved: transforming to a just-in-time financial model, removing fragility and risk through more automated systems deployments, being more responsive to business needs and quicker to market through radical standardization, and so on. This added value of Private Cloud has been discussed over the last several years, and should be well understood at the rate customers are adopting the architectures.

I’m beginning to see that the hype around cloud is causing Enterprise IT decision makers to overlook the basic blocking and tackling of performance analysis, resiliency, availability, recoverability, etc. Developers by and large don’t have these skill sets. It’s the infrastructure architects, engineers, and operators that must continue to provide these aspects of Enterprise computing.

Here are a few idealistic (and perhaps false) characteristics of the Public Cloud that will not be readily available to most Enterprise IT within the next 5 years, and are being currently oversold to the Enterprise buyer.

Cloud is Automatically More Resilient to Failure

Does the “cloud” provide any additional reliability to the application? No, not as such. It’s still a bunch of technology sitting in some datacenter somewhere. Drives fail, nodes fail, connectivity fails, power fails, tornadoes and floods happen, etc. It’s very dangerous to assume that just because we call an infrastructure a “cloud,” it’s any more resilient to failure than legacy infrastructure designs. “Cloud” infrastructures can quickly become very large baskets in which to put all our eggs. I’m not going to say the chance of failure is any greater, but certainly the impact of failure can be much more widespread and difficult to recover from. Site failures still take out large chunks of infrastructure unless traditional D/R solutions are in place to provide Business Restart, or next-gen active-active infrastructures provide Business Continuity.

I’ve discussed this assumption with many directors of IT who tell me the applications they run are the most unstable aspect of their IT environment. The promise of “cloud” is that the resilience issues will be handled by the application itself. Why do we honestly expect Enterprise application developers to quickly transform themselves into savvy cloud-scale availability experts in the near term? Applications will continue to be buggy and unstable. Enterprises will continue to invest in products that provide infrastructure-level reliability, recoverability, and continuity.

Scale Out of Any Given Application or Database

Let’s address the traditional three-tier (presentation, app logic, database) model of Enterprise application deployment. If you’re lucky, the application you’ve purchased for use in the Enterprise allows a certain level of scale out. The Presentation and App tiers are now designed to allow additional nodes to support additional workload, but the database tier is a monolith. I say if you’re lucky, because many COTS apps are still entirely monolithic in nature, and do not follow this “standard model.”

Can you take a monolithic workload and put it on a “cloud” and have it magically adopt a scale out capability the cloud infrastructure provides? No, of course not. If the app is not aware of additional nodes it can’t “scale out.” It can’t inherently load balance, etc. The best you can do is virtualize this application with other monolithic applications and consolidate them onto a common infrastructure. We’ve all been doing this for years.

… and What About Scale Up?

Think about that monolithic database tier, or that other app (you know the one) that demands a larger footprint than the small node-based scale-out architectures can provide. Enterprise IT has traditionally been a Scale Up environment. Consolidation has been in play for a long time, prompting the development of large infrastructure elements to hold these individual components. Times are shifting to more and more scale-out models, but it’ll take years for most of the Scale Up architectures to retire, and many never will. The best infrastructures can scale up and out, providing node-based designs that allow for affordable growth while architecturally adding additional firepower to existing apps so they can scale up to meet additional demand.

Cloud is Automatically Less Expensive

Scale-out promises deferred spend as needs arise, but ultimately if you want to purchase a pool of resources (build it and they will come), you’re necessarily over-purchasing for what you need today. The discipline of Capacity Planning is more important than ever, because an important counterbalance is at play: the agility of having a pool of expensive resources sitting idle vs. the cost benefit of just-in-time purchases, whose slower deployments may lengthen “time to serve” for applications.

Scale is an important factor in the cost of “cloud.” Many enterprises will need to invest significantly in their first foray into a converged cloud infrastructure, because it starts off so small. Small in my mind today is anything less than 100TB. Much of the financial benefit the public cloud providers gain is due to their large scale. Cost per GB / IOP / Socket / Port goes down quickly when those fixed costs are prorated to an ever-growing population of application customers.
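
The proration effect is simple arithmetic, but it’s worth seeing the numbers. Here’s a quick back-of-the-envelope sketch (the fixed cost and capacity figures below are invented for illustration, not vendor pricing):

```python
# Hypothetical illustration: a fixed infrastructure investment
# prorated across a growing usable-capacity base. Figures are made up.

FIXED_COST = 500_000.0  # one-time cost of the converged stack ($)

def cost_per_gb(usable_tb: float) -> float:
    """Fixed cost divided by usable capacity in GB."""
    return FIXED_COST / (usable_tb * 1024)

for tb in (50, 100, 500, 1000):
    print(f"{tb:>5} TB usable -> ${cost_per_gb(tb):.2f}/GB")
```

The same fixed spend that looks painful at 50TB becomes a rounding error at petabyte scale, which is exactly the advantage the large public providers enjoy.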

Enterprise IT Professionals Automatically Know How to Deploy and Operationalize Cloud Models

The transformation and retooling of our people is a large obstacle to Private Cloud models. Silicon Valley can create all kinds of new infrastructure technologies to host our virtual machine environments, but it’ll take years before most Enterprise IT shops are capable enough to utilize them. The bridge to the future is running existing technology stacks in new ways. The technical skills of our people can still be leveraged, and the new processes involved in Private Cloud deployments can be experimented with and refined. Ultimately new software-driven architectures will supplant the current hardware-infrastructure-based models, but not until they provide the same site-level continuity capabilities enjoyed by IT today.

I hope I didn’t rain on your cloud. I do absolutely understand that the Cloud / Social / Mobile / Big Data world is the 3rd Platform of computing technology. I fully embrace the added value to the Enterprise Business Users of the Cloud models. I just think that we can’t assume Public Cloud hype (much of which isn’t true anyway) will be fully applicable to Enterprise Private Cloud in the near term.

Thoughts?

EMC Xtrem VMAX Integration

March 28, 2013 Corporate IT, EMC VMAX No comments

EMC XtremSF (formerly VFCache) is a PCIe flash card deployed in the server to dramatically improve application performance by reducing latency and accelerating throughput. XtremSF can be used as a local storage device (think RAM disk) to accelerate read/write IOPS, but as persistent Direct Attached Storage (DAS) its data exists only on the individual server. Coupled with XtremSW, the solution becomes an ultra-high-performance read cache, allowing writes to pass through to back-end shared storage arrays and providing incredible performance while the data remains part of the enterprise data protection solution.

One of the issues with server-side caching is that administrators have yet another user interface to deploy to get performance statistics and availability alerts.

With XtremSF, EMC has integrated the communication channel with existing intelligent arrays like the EMC VMAX to provide statistics and reporting through the existing Unisphere user interface storage administrators are already using. As seen in the screenshot above, Unisphere detects and displays the performance metrics associated with the server-side cache in addition to the metrics associated directly with the VMAX array. In addition, the XtremSF cards will leverage this communication channel with the VMAX to send out EMC dial-home event notifications in case of trouble.

Expect to see more of this “vertical” integration between the Xtrem family and the VMAX and VNX families in the future. The vision is to provide not only a common user interface, but also a vertically integrated Fully Automated Storage Technology (FAST) solution that places the right data in the right place, just before the right time that it’s needed (predictive).

Let’s discuss ITaaS rather than IaaS vs. PaaS

July 31, 2012 Corporate IT No comments , ,

Whether or not the infrastructure teams know it, almost all business development teams have investigated or are actively working with a Public Cloud provider to spin up new app/dev environments outside of IT control.  This “Shadow IT” is becoming pervasive and on the surface represents a clear and present danger to the relevance of IT to the business.  At a deeper level IT will always be vital to the Enterprise as the broker of IT services (whether public or private), as the security / risk / availability / compliance manager (governance), and as a strategic consultant helping the business map to emerging technological capabilities (consulting).

IT must transform into managing itself as a business to deploy services with more agility, become easier to do business with, and deliver the brokerage, governance, and consulting benefits.  This way it will compete successfully against Shadow IT by offering the agility the Business requires.  IT will need a new Business Model focused on:

  • delivering the service of application hosting (not just boxes, switches, and bodies)
  • being market driven: first listening to their customers, then building solutions to meet their needs
  • focusing on Profit & Loss to measure efficiency of delivery and value of offerings
  • brokering service across internal services or public external cloud services while maintaining control and knowledge of which data assets reside in each environment

The enabling technology already exists for the most part.  Public Cloud is great and has its purpose to fulfill Business needs, but should be governed by IT policy, not done in the “shadows.”  Other workloads are a better fit for the resiliency and security risk mitigation of Private Cloud infrastructures.  Over the next few years all Enterprise IT shops will be managing a Hybrid Cloud combining the most cost-effective elements of Public and Private systems to conduct their business.

This transformation is as much about People and Processes as it is about new technologies.  IT will need “front office” sales and product management capabilities to sell its own services to the internal business customers.  New roles to build and report on service level compliance are important to show value to the business and maintain operational control.  IT also needs to maintain presence with an Office of the CTO to help Business users stay on top of emerging trends and capabilities.

IaaS or PaaS isn’t as relevant to Enterprise IT shops as “IT as a Service” (ITaaS).  It’s much more important that IT itself can transform into a service-level-integrated cloud broker, priced as a consumed service under GRC management, than whether IaaS or PaaS is the proper development stack for any given business unit.  If IT runs itself as a Service, it is capable of making the proper technology decisions for any new need that emerges.
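
“Priced as a consumed service” can be made concrete with a small showback sketch. The service names and rates below are invented for illustration; a real catalog would price each line item against comparable public cloud offerings:

```python
# Hypothetical showback calculation for an internal IT service catalog.
# Rates and service names are made up for illustration only.

RATES = {
    "vm_hours":   0.05,  # $ per VM-hour
    "storage_gb": 0.10,  # $ per GB-month provisioned
    "backup_gb":  0.03,  # $ per GB-month protected
}

def monthly_charge(usage: dict) -> float:
    """Sum metered usage times the catalog rate for each service."""
    return sum(RATES[svc] * qty for svc, qty in usage.items())

# One VM running all month (720 hours) with 500GB of protected storage:
bill = monthly_charge({"vm_hours": 720, "storage_gb": 500, "backup_gb": 500})
print(f"Monthly charge: ${bill:.2f}")
```

The point isn’t the arithmetic; it’s that once IT can meter and rate its own services this way, the Business can compare internal and public options on equal footing.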

So what does the Hybrid Cloud model look like?  First and most obviously, it does include a standardized-infrastructure Private Cloud combined with Public Cloud services.  The integration and federation between them is best accomplished with products like VMware’s vFabric Connector, which allows deployment across infrastructure providers.  Leveraging virtualization, the orchestration and security compliance features can be implemented.

Legacy apps are containerized and run on the virtualized hybrid infrastructure.  “Next Gen” apps are SaaS purchased or developed specifically for PaaS, preferably against a common framework like Cloud Foundry, which allows deployment against internal or external scale-out PaaS with data and app mobility across different PaaS providers.  And let’s not forget the presentation of all of these apps to the next generation of user access devices (mobile, tablets, and PCs), providing credentialed user access to the various Hybrid Cloud assets on any device from anywhere.

So where is your IT organization?  Can you effectively present a web based services catalog that offers your internal business customers a competitively priced monthly consumption rate model of various services?  Does that model measure itself against public providers?  Do you have an internal IT sales team that promotes IT consumption to the business, or does IT hoard a scarce technology resource pool fearing the business will grow?  Is your IT org an effective consultant to your business users about emerging capabilities?  Let’s chat!

Silicon Valley IT vs Enterprise IT

July 26, 2012 Corporate IT No comments , , , ,

@RFFlores recently posted an excellent article describing the difference he sees in “Silicon Valley PaaS” and “Enterprise PaaS” when thinking about what comes after a foundational IaaS.

He points out that “while enterprises do release new apps, the vast majority of their effort is expended on evolving the current applications that run their business,” and “the level of standardization dictated by Silicon Valley PaaS isn’t realistic for most large companies.”  Go read his article for more on this.

I’d like to extend that thinking into the nature and differences of IT as a whole within the two euphemistic categories, “Silicon Valley IT” and “Enterprise IT.”  The Clouderati are in many ways talking past the ITIListas.  They live in separate worlds, and as we’ll see have differing resources and methodologies.  The methods of the Clouderati are proper for that world, and the methods of the ITIListas are proper for theirs.  The real question is in what form will new ideas in Service Management govern the transition from these two emerging extremes into a hybrid IT operational model.

Silicon Valley IT is characterized by several key assumptions that make the hyper standardized Clouderati rhetoric work well:

  • The Business is the technology / data and not just supported by it.  Business processes, customer engagement, the product for sale, and the distribution model are all inextricably tied to one another through the application(s) built to run across the internet.
  • There are 1’s of hyper-standardized scale-out infrastructures required to be supported.  The zealous push to drive out cost ensures that everything runs on a single infrastructure model.  This infrastructure could be private cloud (Facebook), public cloud (Netflix), or a hybrid of both.  The infrastructure is built and ready (or provisioned from the Public Cloud on demand) for new projects to leverage and reuse.
  • Apps are developed in house as scale-out web services to take advantage of the scale-out infrastructure.  All applications must necessarily conform to the scale-out model of processing, are generally x86-based, and leverage only modest scale up in terms of number of cores and amount of memory.
  • The tight linkage between the Business and the Technology produces a DevOps group that scales out as the infrastructure grows.  The classic example (can that term be used?) is Netflix that has gone to a DevOps model where each business unit creates and manages their own piece of the loosely coupled production application set.

Enterprise IT is characterized by several opposing assumptions that make the hyper standardized cloud rhetoric nearly impossible to achieve for an ITILista:

  • The Business processes were developed prior to the technology revolution, and IT is used to augment or support them.  The business deals with objects or people in the “real world” and is dependent on many factors outside the realm of software. 
  • There are 10’s to 100’s of very distinct and customized infrastructure silos required by COTS software vendors.  IT is seen as a cost center, and funding is provided on a project by project basis often without regard for leveraging existing assets purchased outside the project budget.  New infrastructure stacks are architected, built, and run to support every major new initiative.
  • Apps are as often purchased and customized as developed in house. They may run on different CPU architectures, require wholly different CPU / Memory / Storage / Network resources, and require a significant systems integration capability to function with other systems.
  • There is a ship-load of skillsets, MBO’s, metrics, and goals that have been developed to manage the diversity of platforms required.  That ship is indeed hard to turn.

Look for future posts exploring how the various cloud players are looking to transform ITSM into simply Service Management of Hybrid Public + Private technology services.