Eight Cloud and Big Data Predictions for 2014

clip_image002Prediction has always been a fun exercise. It forces you to take a step back from your day to day activities, look at the market “crystal ball” and figure out what the future looks like. This year I found this exercise to be particularly difficult as the amount of new innovation and conflicting trends that are taking place at the same time can be very confusing.

After spending quite some time reading market analyses from different sources (see references at the end) and compiling them with all the things I’ve seen and heard throughout the year, I came to the following observation on how 2014 will look for Cloud and Big Data.  I’m happy to exchange thoughts on that regard and learn how you envision 2014 will look.


Public and Private Cloud in 2014

1. Amazon continues to distance itself while Google and Microsoft are catching up

Amazon continues to lead in public cloud and distance itself from the rest of the market. This year it reached 5 times the size of other cloud vendors combined, as noted in a recent Gartner report. Google and Microsoft are slowly closing the gap as the closest alternatives, but at a much slower pace than one would expect given the significant investment of both companies in this area.


2. 2014 will be the year of Enterprise Clouds

Traditional enterprise players, such as IBM, HP, Cisco and Red Hat, continue to fight for the remaining share of the market, mostly around enterprise adoption. Yet, enterprises have been slower to embrace and execute on private cloud strategy. Part of the reason for slow adoption is the gap between the solution provided by most of the regular contenders – who are still competing on selling an end to end story – and the reality that most enterprises are looking for Open and Hybrid cloud strategy, especially given that there is no clear winner.

3. OpenStack will be the most popular choice for Enterprise Clouds in 2014

OpenStack is now high on the radar for Enterprise Cloud, mostly threatening VMware’s strong leadership in that domain. Most of the main contenders have embraced a strong OpenStack strategy, including Red Hat, Ubuntu, Suse, HP, IBM and Cisco, with Cisco taking a surprising leading position over IBM and HP, according to a recent Forrester survey.
VMware’s response of embracing OpenStack is still questionable, as the transition to OpenStack is not only about technical integration, but also involves a big shift in the value chain. More enterprises are less keen on paying high cost for hypervisor licenses, which is to date the main revenue channel in the VMware pie.

4. Native OpenStack alternatives will disrupt many of the existing cloud solutions

The rapid adoption of OpenStack will also disrupt many of the existing cloud solutions that were built in a pre-OpenStack world. New solutions that take a more native-to-OpenStack approach will pop up and replace many of the current solutions. A good example of this is Project Solum, which aims to provide a native alternative to existing PaaS solutions, such as CloudFoundry and OpenShift as I outlined in this recent post. I expect that will see more of these disruptions expanding to other frameworks in 2014.

5. Orchestration & Automation will be the next big thing in 2014

Having said all that, the remaining challenge of enterprises is to break the IT bottleneck. This bottleneck is created by IT-centric decision-making processes, a.k.a “IaaS First Approach,” in which IT is focused on building a private cloud infrastructure – a process that takes much longer than anticipated when compared with a more business/application-centric approach. One of the ways to overcome that challenge is to abstract the infrastructure and allow other departments within the organization to take a parallel path towards the cloud, while ensuring future compatibility with new development in the IT-led infrastructure.

DevOps has definitely been a key example for a business-led initiative that determines the speed of innovation and, thus, competitiveness of many organizations. The move to DevOps forces many organizations to go through both cultural and technology changes in making the biz and application more closely aligned, not just in goals, but in processes and tools as well.

Enterprises with legacy environments and customers take a two-step approach. They first tackle continuous delivery – automating the packages of their software deliverable and keeping tighter control over new production rollout. Once enterprises feel comfortable with their process and environment, they will automate the entire deployment process into production.

Configuration management, orchestration and workflow automation become key enablers in enterprise transition to cloud, and will gain much attention in 2014 and 2015. Again, Amazon has already recognized that need, introducing into the space a new offering called OpsWorks. It is only a matter of time until will see similar offerings embedded with other cloud providers in both public and private clouds.

As the focus moves to orchestration with greater focus on standardization of deployments and packages, DSL and API become equally important. Existing standards, such as TOSCA and CAMP, will undergo massive modifications to make them simpler to use and fit the cloud environment.

Big Data Predictions

There are many advancements happening within the Big Data/NoSQL domain that I’m not going to touch on. I will focus on two main areas that are close to my work.

6. 2014 will see a major increase of Big Data in the Cloud

Cloud infrastructure is closing the gap with the existing data center, as we see the emergence of support for Bare Metal, High CPU, High Memory and Flash-Disk. Many cloud providers are already offering Big Data as Service, such as Elastic Map Reduce and Redshift, and are continuously expanding their offering on that regard.

This advancement in cloud infrastructure removes almost all of the technical barriers for running I/O intensive workloads, such as Big Data analytics, on the cloud. I expect that in 2014, running Big Data analytics on the cloud will become the first choice for any new project, while the use of Big Data in non-cloud environments will be minimized to only niche use cases with extreme regulations and security constraints.

7. In-Memory Data backed by flash-disk will become a popular choice for Real-Time Big Data analytics
Flash disk pricing changes the economics behind the cost/performance ratio of disk-based solutions, making it possible to achieve the performance of an In-Memory-based solution at a price closer to a disk-based solution. So far, attempts have been made to use flash disks as a fast disk alternative to magnetic drives. This path inherits many of the limitations of a disk-based drive and, therefore, doesn’t capture the full potential of flash disk which, like RAM, can provide highly parallel access to data.

At the same time, memory-based solutions like In-Memory Database or In-Memory Data Grid cover a small niche in the Big Data ecosystem, mostly due to the high cost/performance ratio.

I believe that the combination of flash drive, In-Memory data-grids and databases at the front-end change that dynamic and will make memory-based solutions backed by flash disk much more attractive to a bigger niche, specifically as it relates to real-time analytics of Big Data.

In some cases, the combination of memory-based solutions is also integrated with existing Big Data frameworks and, thus, provides seamless performance acceleration. A good example is GigaSpaces’ integration with Storm and GridGain’s integration with Hadoop.

8. Real-time analytics turns mainstream

The most interesting indication that real-time analytics is becoming mainstream is Amazon’s support for real-time analytics, with many of the existing analytics solutions providing real-time analytics capabilities as built-in parts of their reports. Another good example of that is Google Analytics Real Time View.

New disruptive forces in Cloud worth watching in 2014
Disruptive technologies change the market landscape in ways that are difficult to anticipate by their very nature. Rather than predicting their impact, I felt it would be simpler to list them.

  • Networking - The networking segment of IT is experiencing a major disruption as of late with the move to Software Defined Network, OpenStack Neutron project and Network Function Virtualization. These developments will change networking from not only a technology perspective, but they are also driving a completely different business model which will be based on utilization or a subscription-based model.
  • Linux Container - Linux containers are gaining interest both as light-weight software packaging as well as VMS’s. The most commonly used use case for Linux containers has as an underlying container for PaaS. With the introduction of new projects, such as Docker – which makes the Linux container easy to use, we will see wider and more pervasive use of containers as high performance virtualization, as packaging tools in continuous deployment scenarios, etc.
  • Bare Metal Cloud - Bare metal cloud has been a small niche in the cloud space, mostly due to the fact that it is usually offered in a static configuration setup. Bare metal clouds provide the same degree of elasticity to bare metal devices. With OpenStack, for example, you can spawn a new bare metal device just as you would provision any other VM. The development would reduce one of the last remaining barriers for bringing mission critical applications to the cloud.
  • OpenStack as an Innovation Accelerator - Many items on the list of disruptive technologies that I listed above are not that new, and to a certain degree, have existed for years, like in the case of Linux Container. Often, a disruptive force needs to gain certain critical mass before it can break into massive adoption. OpenStack creates an ecosystem that provides a platform for many users to integrate new technologies in a way that could be consumed by end users immediately and that plays a major role in the acceleration of adoption of many of those disruptive technologies.

Where do we go from here?

There are many individual technologies and trends that have a disruptive force behind them. Having said that, I think that it is the intersection between those technologies that holds the most promising potential. Here are few examples:

  • Big Data and the Cloud Application Orchestration - Big Data analytics for operational information could serve as the new “brain” behind a new class of orchestration engine that would combine artificial intelligence decision-making based on trends and historical analysis and handle complex failure and scaling scenarios automatically.
  • Putting Network and Applications together - Putting network and applications together holds a lot of promise in the way we scale applications across regions and multiple sites, as well as how we control application SLA’s in a shared environment. For example, giving priority to customer-facing services versus batch analytics or optimizing the network routing based on the locality of the data, etc.

(via: Nati Shalom’s Blog)


We Finally Cracked The 10K Problem – This Time For Managing Servers With 2000x Servers Managed Per Sysadmin

In 1999 Dan Kegel issued a big hairy audacious challenge to web servers:

It’s time for web servers to handle ten thousand clients simultaneously, don’t you think? After all, the web is a big place now.

This became known as the C10K problem. Engineers solved the C10K scalability problems by fixing OS kernels and moving away from threaded servers like Apache to event-driven servers like Nginx and Node.

Today we are considering an even bigger goal, how to support 10 Million Concurrent Connections, which requires even more radical techniques.

No similar challenge was issued for managing servers in a datacenter, but according toDave Neary from Red Hat, in a recent FLOSS Weekly episode, we have passed the 10K barrier for server management with 10,000 or more servers managed per sysadmin.

Should We Let This Milestone Pass Without Mention?

Absolutely not! It’s a stunning accomplishment with 200x-2000x increases in productivity. Dave said he remembered in the 1990s it took one sysadmin to manage 4 or 5 Windows servers. A Linux sysadmin could manage 50 to 60 servers.

Now companies are managing over 10,000 servers per sysadmin. This huge change is rooted both in IaaS, treating a datacenter as an elastic programmable resource, divorcing operations from infrastructure deployment, and in the DevOps revolution, with its emphasis on tools, culture, automation, metrics, sharing of resources, and infrastructure as code.

What Will It Take To Manage 10 Million Servers Per Sysadmin?

Who might know? Google of course.

As James Hamilton says, Counting Servers is Hard, but Microsoft says they have 1 million servers, and Google is planning for 10 million servers, so it may take a while before we can get to 10 million servers per sysadmin.

But when it does happen the base will be built on:

At a high level the approach of scaling to 10 million connections per server and scaling 10 million machines per sysadmin are the same: scalability is specialization.

But at lower level they differ completely. Scaling to 10 million connections is about removing layers and doing the work yourself. Scaling to 10 million servers is all about putting the intelligence into smarter and smarter layers. A lot like how human body utilizes trillions of individual components mediated by many autonomous systems all directed by a parallelized and decentralized brain.

(source: HighScalability.com)

Google On Latency Tolerant Systems: Making A Predictable Whole Out Of Unpredictable Parts

In Taming The Long Latency Tail we covered Luiz Barroso’s exploration of the long tail latency (some operations are really slow) problems generated by large fanout architectures (a request is composed of potentially thousands of other requests). You may have noticed there weren’t a lot of solutions. That’s where a talk I attended, Achieving Rapid Response Times in Large Online Services (slide deck), byJeff Dean, also of Google, comes in:

In this talk, I’ll describe a collection of techniques and practices lowering response times in large distributed systems whose components run on shared clusters of machines, where pieces of these systems are subject to interference by other tasks, and where unpredictable latency hiccups are the norm, not the exception.

The goal is to use software techniques to reduce variability given the increasing variability in underlying hardware, the need to handle dynamic workloads on a shared infrastructure, and the need to use large fanout architectures to operate at scale.

Two Forces Motivate Google’s Work On Latency Tolerance:

  • Large fanout architectures. Satisfying a search request can involve thousands of machines. The core idea is that small performance hiccups on a few machines causes higher overall latencies and the more machines the worse the tail latency. The statistics of the situation are highly unintuitive. Only 1% of requests will take over a second with a server that has a 1ms average response time and a one second 99th percentile latency. If a request has to access 100 servers, now 63% of all requests will take over a second.
  • Resource sharing. One of the surprising aspects of Google’s architecture is they use most of their machines as generalized job execution engines. Not even Google has infinite resources, so they share resources. This also fits where their “the data warehouse is the computer” philosophy. Every node runs linux and participates with a scheduling daemon that schedules work across the cluster. Jobs of various types (CPU intensive, MapReduce, etc)  will be running on the same machine that runs a Bigtable tablet server. Jobs are not isolated on servers. Jobs impact each other and those impacts must be managed.

Others try to reduce variability by overprovisioning or running only like workloads on machines. Google makes use of all their resources, but sharing disk, CPU, network, etc, increases variability. Jobs will burst CPU, memory, or network usage, so running a combination of background activities on machines, especially with large fanouts, leads to unpredictable results.

You may think complete and total control would solve the problem, but the more you tighten your grip, the more star systems will slip through your fingers. At a small scale careful control can work, but the disadvantages of large fanout architectures and shared resource execution models must be countered with good design.

Fault Tolerant Vs Latency Tolerant Systems

Dean makes a fascinating analogy between creating fault tolerant and latency tolerant systems. Fault tolerant systems make a reliable whole out of unreliable parts by provisioning extra resources. In the same way latency tolerant systems can make a predictable whole out of unpredictable parts. The difference is in the time scales:

  • variability: 1000s of disruptions/sec, scale of milliseconds
  • faults: 10s of failures per day, scale of tens of seconds

Tolerating variability requires having a fine trigger so the response can be immediate. Your scheduling system must be able to make decisions within these real-time time frames.

Mom And Apple Pie Techniques For Managing Latency

These techniques the “general good engineering practices” for managing latency:

  • Prioritize request queues and network traffic. Do the most important work first.
  • Reduce head-of-line blocking. Break large requests into a sequence of small requests. This time slices a large request rather than let it block all other requests.
  • Rate limit activity. Drop or delay traffic that exceeds a specified rate.
  • Defer expensive activity until load is lower.
  • Synchronize disruptions. Regular maintenance and monitoring tasks should not be randomized. Randomization means at any given time there will be slow machines in a computation. Instead, run tasks at the same time so the latency hit is only taken during a small window.

Cross Request Adaptation Strategies

The idea behind these strategies is to examine recent behavior and take action to improve latency of future requests within tens of seconds or minutes. The strategies are:

  • Fine-grained dynamic partitioning. Partition large datasets and computations. Keep more than 1 partition per machine (often 10-100/machine). Partitions make it easy to assign work to machines and react to changing situations as the partitions can be recovered on failure or replicated or moved as needed.
  • Load balancing. Load is shed in few percent increments. Shifting load can be prioritized when the imbalance is severe. Different resource dimensions can be overloaded: memory, disk, or CPU. You can’t just look at CPU load because they may all have the same load. Collect distributions of each dimension, make histograms, and try to even out work to machines by looking at std deviations and distributions.

    Different resource dimensions can be overloaded: memory, disk, or CPU. You can’t just look at CPU load because they may all have the same load. Collect distributions of each dimension, make histograms, and try to even out work to machines by looking at std deviations and distributions.

  • Selective partitioning. Make more replicas of heavily used items. This works for static or dynamic content. Important documents or Chinese documents, for example, can be replicated to handle greater query loads.
  • Latency-induced probation. When a server is slow to respond it could be because of interference caused by jobs running on the machine. So make a copy of the partition and move it to another machine, still sending shadow copies of the requests to the server. Keep measuring latency and when the latency improves return the partition to service.

My notes on this section are really bad, so that’s all I have for this part of the talk. Hopefully we can fill it in as more details become available.

Within-Request Adaptation Strategies

The idea behind these strategies is to fix a slow request as it is happening. The strategies are:

  • Canary requests
  • Backup requests with cross-server cancellation
  • Tainted results

Backup Requests With Cross-Server Cancellation

Backup requests are the idea of sending requests out to multiple replicas, but in a particular way. Here’s the example for a read operation for a distributed file system client:

  • send request to first replica
  • wait 2 ms, and send to second replica
  • servers cancel request on other replica when starting read

A request could wait in a queue stuck behind an expensive query or a packet could be dropped, so if a reply is not returned quickly other replicas are tried. Responses come back faster if requests sit in multiple queues.

Cancellation reduces the frequency at which redundant work occurs.

You might think this is just a lot of extra traffic, but remember, the goal is to squeeze down the 99th percentile distribution, so the backup requests, even with what seems like a long wait time really bring down the tail latency and standard deviation. Requests in the 99th percentile take so long that the wait time is short in comparison.

Bigtable, for example, used backup requests after two milliseconds, which dramatically dropped the 99th percentile by 43 percent on an idle system. With a loaded system the reduction was 38 percent with only one percent extra disk seeks. Backups requests with cancellations gives the same distribution as an unloaded cluster.

With these latency tolerance techniques you are taking a loaded cluster with high variability and making it perform like an unloaded cluster.

There are many variations of this strategy. A backup request could be marked with a lower priority so it won’t block real work. Send to a third cluster after a longer delay.Wait times could be adjusted so requests are sent when the wait time hits the 90th percentile.

Tainted Results – Proactively Abandon Slow Subsystems

  • Tradeoff completeness for responsiveness. Under load noncritical subcomponents can be dropped out. It’s better, for example, to search a smaller subset of pages or skip spelling corrections than it is to be slow.
  • Do not cache results. You don’t want users to see tainted results.
  • Set cutoffs dynamically based on recent measurements.

(via HighScalability.com)