The Stunning Scale Of AWS And What It Means For The Future Of The Cloud

James Hamilton, VP and Distinguished Engineer at Amazon, and longtime blogger of interesting stuff, gave an enthusiastic talk at AWS re:Invent 2014 on AWS Innovation at Scale. He’s clearly proud of the work they are doing and it shows.

James shared a few eye-popping stats about AWS:

  • 1 million active customers
  • All 14 other cloud providers combined have 1/5th the aggregate capacity of AWS (estimate by Gartner)
  • 449 new services and major features released in 2014
  • Every day, AWS adds enough new server capacity to support all of Amazon’s global infrastructure when it was a $7B annual revenue enterprise (in 2004).
  • S3 has 132% year-over-year growth in data transfer
  • 102Tbps network capacity into a datacenter.

The major theme of the talk is that the cloud is a different world. It’s a special environment that allows AWS to do great things at scale, things you can’t do, which is why the transition from on-premises x86 servers to the public cloud is happening at a blistering pace. With so many scale-driven benefits to the public cloud, it’s a transition that can’t be stopped. The cloud will keep getting more reliable, more functional, and cheaper at a rate that you can’t begin to match with your limited resources, generalist gear, bloated software stacks, slow supply chains, and outdated innovation paradigms.

That’s the PR message at least. But one thing you can say about Amazon is they are living it. They are making it real. So a dose of doubt is healthy, but extrapolating out the lines of fate would also be wise.

One of the fickle finger of fate advantages AWS has is resources. At one million customers they have the scale to keep the engine of expansion and improvement going. Profits aren’t being taken out, money is being reinvested. This is perhaps the most important advantage of scale.

But money without smarts is simply waste. Amazon wants you to know they have the smarts. We’ve heard how Google and Facebook build their own gear; Amazon does too. They build their own networking gear, networking software, and racks, and they work with Intel to get faster versions of processors than are available on the market. The key is they know everything and control everything about their environment, so they can build simpler gear that does exactly what they want, which turns out to be cheaper and more reliable in the end.

Complete control allows quality metrics to be built into everything. Metrics drive a constant quality increase in all parts of the system, which is why against all odds AWS is getting more reliable as the pace of innovation quickens. Great pools of actionable data turned into knowledge is another huge advantage of scale.

Another thing AWS can do that you can’t is the Availability Zone architecture itself. Each AZ is its own datacenter and AZs within a region are located very close together. This reduces messaging latencies, which means state can be synchronously replicated between AZs, which greatly improves availability compared to the typical approach where redundant datacenters are very far apart.

It’s a talk rich with information and…well, spunk. The real meta-theme of the talk is how Amazon consciously uses scale to their competitive advantage. For Amazon scale isn’t just an expense to be dealt with, scale is a resource to exploit, if you know how.

Here’s my gloss of James Hamilton’s incredible talk…

Everything In The Talk Has A Foundation In Scale

  • Every day, AWS adds enough new server capacity to support all of Amazon’s global infrastructure when it was a $7B annual revenue enterprise (in 2004).

  • 365 days a year, component manufacturers have to get gear to server and storage manufacturers, the server and storage manufacturers have to produce the gear and push it into the logistics channel, it has to get from the logistics channel over to the correct datacenter, it has to arrive at the loading dock, people have to be there to wheel the racks to the right place in the DC, there has to be power, cooling, and networking, the app stack has to be loaded up, it has to be tested, and it has to be released to customers.

  • S3 usage: 132% year-over-year growth in data transfer; EC2 usage: 99% YoY usage growth; AWS overall business: over 1 million active customers.

  • All 14 other cloud providers combined have 1/5th the aggregate capacity of AWS (estimate by Gartner)

  • With over a million customers you are in a rich ecosystem. You have your pick of software vendors, if you have a problem someone has likely had it before, and it’s easier to get your job done fast.

  • Such high growth means Amazon has the resources to keep reinvesting and innovating by increasing breadth and depth of services they offer.

  • Big transitions generally occur when the economics are far superior, like mainframes to UNIX servers and then UNIX servers to x86 servers. These transitions usually take 10+ years. What’s different about the transition from on-premises x86 to the cloud is the speed at which it is happening. The speed of the cloud transition is a function of great economic value along with the low friction for adoption. You don’t need to buy software, you don’t need to buy hardware, you can just do it.

There Are Big Cost Problems In Networking

  • Networking is a red alert situation across the industry. It’s the perfect storm.

  • Problem #1: The cost of networking is escalating relative to the cost of all other equipment. It’s anti-Moore’s law. All other gear is going down in cost, networking is getting relatively more expensive over time. Relative monthly costs: servers: 57%; networking equipment: 8%; power distribution and cooling: 18%; power: 13%; other: 4%.

  • Problem #2: At the same time networking is getting more expensive, the ratio of networking to compute is going up. That’s partly because Moore’s law is working (still) with servers and compute density is going up. Partly it’s because as the cost of compute falls the amount of advanced analytics performed goes up and analytics are network intensive. Solving big problems using a large number of servers requires a lot of networking. Network traffic has moved east-west rather than the traditional north-south direction.

  • Amazon’s solution 5 years ago was data driven and radical: they built their own networking designs. Special routers were built. A team was hired to build the protocol stack all the way to the top. And they deployed all this themselves in their network. All services worldwide run on this gear.

    • This strategy turned out to be a lot cheaper. Just the support contract for networking gear was running tens of millions of dollars.

    • Availability went up. The source of the improvement was simplicity. The problem AWS was trying to solve was simpler than the problem enterprise gear tries to solve. Enterprise gear must adhere to a lot of complicated specs that go unused and only make the system more fragile. Implementing just the functionality that was required meant a much simpler system, which led to higher availability. Any way to win is a good way to win.

    • A cornucopia of metrics. They measure everything. The rule is if a customer has a bad experience using their system their metrics must show it. This means metrics are getting better all the time because customer problems drive better metrics. Once you have metrics that accurately reflect customer experience then goals can be set on making the system better. Every week improvements are made to make things better. If the code didn’t start off better, it gets better over time.

    • Testability. Their gear was better because they tested it better. Enterprise gear is hard to test at scale. They created a $40 million test bed of 8000 servers (3 megawatts). But since this was the cloud they effectively rented the servers for a few months, so it was relatively cheap.

Networking Explained Layer By Layer, From The Very Top To The Network Interface Card

AWS Worldwide Network Backbone

  • 11 AWS regions worldwide. Choose which ones to use by nearness to customers or required jurisdictional boundaries.

  • Private fiber links interconnect most of the major regions. This avoids all the capacity planning problems (Amazon can do better capacity planning), peering issues, and buffering problems that occur on public links. Running their own network is faster, more reliable, cheaper, and lower latency.

Example AWS Region (US East (Northern Virginia))

  • All regions have at least two availability zones. US East has five AZs.

  • Redundant paths run to transit centers.

  • Each region has redundant transit centers. A transit center connects private links to other AWS regions, private links to AWS Direct Connect customers, and to the internet through peering and paid transit.

  • If one AZ fails all the other AZs keep working.

  • Metro-area DWDM links between AZs

  • 82,864 fiber strands in a region

  • AZs are less than 2ms apart and usually less than 1ms apart. From a latency perspective they are fairly close, within a few kilometers. Far enough apart for safety, close enough for good latency.

  • 25Tbps peak traffic between AZs

  • AWS offers AZs because:

    • With a single hardened datacenter the best reliability you’ll get is around 99.9%, over a mix of applications and a long period of time. High reliability requires running in two datacenters. Traditionally datacenter diversity means two datacenters that are very far apart, because it’s not cost effective to keep datacenters close together. This means longer latencies. LA to New York is 74ms round trip. Committing to an SSD is 1 to 2ms. You can’t wait 70+ milliseconds for a transaction to commit. Which means applications commit locally and then replicate to the second datacenter. In a failure case this design loses data during the failover. While a true failure is rare, like a building burning down, transient failures are more common, like a load balancer problem for example. So would you fail over if your connection was down for 3 minutes? No, because data would be lost and it would take a long time to recover that data from other sources. So you lose availability for common events. (A small worked latency sketch follows after this list.)

    • AZs are milliseconds apart so you can commit to both at the same time. That means if you fail over, a customer won’t be able to tell because the data replication was synchronous. It’s invisible. It’s hard to write code for this model so you won’t do it for everything. And for some apps a concern about multi-AZ failures might also prevent you from using multiple AZs, but for the rest of your apps this is a very powerful model. It’s more costly, but it gives AWS certain advantages.
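Here is a minimal back-of-the-envelope sketch in Python using the latency figures quoted above (the exact numbers are illustrative, not measurements); it shows why a synchronous commit across distant datacenters is impractical while a synchronous commit to a nearby AZ is barely noticeable:

```python
# Illustrative latency arithmetic only; figures follow the talk's rough numbers.
SSD_COMMIT_MS = 1.5          # local commit to SSD, roughly 1-2 ms
CROSS_COUNTRY_RTT_MS = 74.0  # e.g. LA <-> New York round trip
INTER_AZ_RTT_MS = 1.0        # AZs are usually < 1-2 ms apart

def sync_commit_ms(replication_rtt_ms):
    """Commit locally, then wait for the remote copy to acknowledge."""
    return SSD_COMMIT_MS + replication_rtt_ms

print("sync commit to a distant datacenter: %.1f ms" % sync_commit_ms(CROSS_COUNTRY_RTT_MS))
print("sync commit to a nearby AZ:          %.1f ms" % sync_commit_ms(INTER_AZ_RTT_MS))
```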

Example AWS Availability Zone

  • An Availability Zone is always a datacenter in a completely independent building.

  • Amazon has 28+ datacenters. The plus means additional datacenters are added within an AZ to extend its capacity. Otherwise you would be forced to split your app across AZs, even if you didn’t want to.

  • Some AZs have fairly substantial datacenters.

  • DCs in an AZ are less than ¼ms apart.

Example Datacenter

  • AWS datacenters are purposely not gigantic. A single datacenter is 25-30 megawatts, with between 50,000 and 80,000 servers.

  • The return on datacenter largeness diminishes. The advantage of datacenter scale as you build bigger and bigger goes down. Early advantages are huge. Later advantages are small. Going from 2000 to 2500 racks is a little better. A tiny datacenter is too expensive. A really large datacenter is only marginally more expensive per rack than a medium datacenter.

  • Risk increases with larger datacenters. If something goes wrong and the datacenter is destroyed, the blast radius is too big and the loss is too large.

  • 102Tbps network capacity into a datacenter.

Example Rack, Server & NIC

  • The only thing that matters for latency is the software latency at either end of a connection. Latency between two servers when sending a message:

    • Your app -> guest OS -> hypervisor -> NIC : the latency is milliseconds

    • through the NIC : the latency is microseconds

    • over the fiber : the latency is nanoseconds

  • SR-IOV (Single Root I/O Virtualization) allows a NIC to provide virtualized network cards in hardware. Each guest gets its own network card. The benefit is > 2x average latency reduction and > 10x latency jitter improvement. This means outliers are down to 1/10th of what they were. SR-IOV is being deployed now on newer instance types and will eventually be everywhere. The hard part wasn’t adding SR-IOV, it was adding the isolation, metering, DDoS protection, and capacity limits that make SR-IOV useful in a cloud environment.

AWS Custom Server & Storage Designs

  • The cost of a negative situation is not high, so expensive unneeded protection can be removed. Servers are designed for what they do, not for a general population of users. Amazon knows exactly what environment the server will run in and will know exactly when something goes wrong, so the servers can be designed with less engineering headroom. The cost of server failure is not that big for them. They are already on site and are very good at replacing hard disks, etc. So a lot of the carefulness in enterprise equipment is not necessary.

  • Processors can be pushed harder. They know the cooling requirements, they influence the mechanical design, they just design good servers, so they can get more performance out of a server. Through a partnership with Intel, Amazon has processors that run faster than can be bought on the open market.

  • An example is the design for their own storage rack. It weighs over a ton, is 19” wide, and holds 864 disk drives. For some workloads this is a wonderful, game-changing design that helps them get better prices in some areas.

Power Infrastructure

  • Amazon has designed and built their own power substations. It only saves a little money, but they can build them much faster. Utility companies are not used to dealing with the rate AWS is growing at, so they had to build their own.

  • Three 100% carbon-neutral regions: US West (Oregon), AWS GovCloud (US), EU (Frankfurt)

Rapid Pace Of Innovation

  • 449 new services and major features released in 2014. 24 in 2008, 48 in 2009, 61 in 2010, 82 in 2011, 159 in 2012, 280 in 2013.

  • AWS is getting more reliable as the pace of innovation quickens. Their goal is to make available to customers the same tools that they use to achieve this rate of innovation and high quality.

(via HighScalability.com)

Spark Sets New Record in Sort Performance

Databricks, a company founded by the creators of Apache Spark, has recently announced a new record in the Daytona GraySort contest using the Spark processing engine. The Daytona GraySort contest is a third-party benchmark measuring how fast a system can sort 100 terabytes of data. Databricks posted a throughput of 4.27 TB/min over a cluster of 206 machines for their official run, which constitutes a 3x performance improvement using 10x fewer machines when compared to the previous record submitted by Yahoo! running Hadoop MapReduce.

In a blog post announcing their submission to the Daytona GraySort contest, Databricks explained some of the technological improvements recently introduced to Spark that allowed it to sustain such a large throughput.

Spark 1.1 introduced a new shuffle implementation called sort-based shuffle. The previous shuffle implementation required an in-memory buffer for each partition in the shuffle, which led to notable memory overhead. The new sort-based shuffle requires only one in-memory buffer at a time. This significantly reduced the memory usage and allowed considerably more tasks to be run concurrently on the same hardware.
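For readers who want to see where this knob lives, here is a minimal PySpark sketch (the job itself is hypothetical) that selects the sort-based shuffle explicitly via the spark.shuffle.manager setting; in Spark 1.1 the default was still the hash-based shuffle, and later releases made sort the default:

```python
from pyspark import SparkConf, SparkContext

# "sort" selects the sort-based shuffle described above; "hash" selects the
# older implementation that keeps one in-memory buffer per partition.
conf = (SparkConf()
        .setAppName("sort-shuffle-demo")
        .set("spark.shuffle.manager", "sort"))
sc = SparkContext(conf=conf)

# reduceByKey forces a shuffle: records are repartitioned by key across the
# cluster, which is exactly where the shuffle implementation matters.
pairs = sc.parallelize(range(1000000)).map(lambda x: (x % 1000, 1))
counts = pairs.reduceByKey(lambda a, b: a + b)
print(counts.count())

sc.stop()
```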

In addition to the new shuffle algorithm, the network module was revamped based on Netty’s native Epoll socket transport, which maintains its own pool of memory, bypassing the JVM’s memory allocator and reducing the impact of garbage collection. The new network module was then used to build an external shuffle service to allow shuffled files to be served even during garbage collection pauses in the main Spark executor.

Finally, Spark 1.1 included TimSort as its new default sorting algorithm. TimSort is derived from merge sort and insertion sort and performs better than quicksort on most real-world datasets, especially datasets that are partially ordered.
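As an aside, CPython’s built-in sort is itself Timsort, so a quick illustrative experiment (not from the article) is to time it on nearly sorted versus fully shuffled input of the same size:

```python
import random
import time

# Timsort exploits existing runs of ordered data, so lightly perturbed
# input sorts noticeably faster than fully shuffled input.
n = 2_000_000
nearly_sorted = list(range(n))
for _ in range(n // 1000):                   # swap a small fraction of elements
    i, j = random.randrange(n), random.randrange(n)
    nearly_sorted[i], nearly_sorted[j] = nearly_sorted[j], nearly_sorted[i]

shuffled = list(range(n))
random.shuffle(shuffled)

for name, data in (("nearly sorted", nearly_sorted), ("shuffled", shuffled)):
    start = time.perf_counter()
    sorted(data)
    print("%s: %.2fs" % (name, time.perf_counter() - start))
```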

All of these improvements allowed the Spark cluster to sustain 3 GB/s/node I/O activity during the map phase, and 1.1 GB/s/node network activity during the reduce phase, which effectively saturated the 10 Gbps Ethernet link (10 Gbps is only 1.25 GB/s).

Spark is an advanced execution engine born out of research done at the AMPLab at UC Berkeley. It allows programs to run up to 10x faster than Hadoop MapReduce when data is on disk, and up to 100x faster when data resides in memory. Spark supports programs written in Java, Scala or Python and uses familiar functional programming constructs to build data processing flows.

Spark has garnered significant attention as a next-generation execution platform for Hadoop and is seen by some as a replacement for MapReduce. It graduated to a top-level Apache project in February and since then has been included in the Cloudera, Hortonworks and MapR Hadoop distributions. More recently, Hortonworks announced they will support running Hive on Spark as part of their Stinger.next initiative.

Databricks was founded in 2013 as a commercial entity supporting Spark and its associated projects. Those projects include Spark Streaming for stream processing, Spark SQL for querying Hive data and MLlib for machine learning.

(Via InfoQ.com)

Amazon Announces EC2 Container Service For Managing Docker Containers On AWS

Earlier this year I wrote about container computing and enumerated some of the benefits that you get when you use it as the basis for a distributed application platform: consistency & fidelity, development efficiency, and operational efficiency. Because containers are lighter in weight and have less memory and computational overhead than virtual machines, they make it easy to support applications that consist of hundreds or thousands of small, isolated “moving parts.” A properly containerized application is easy to scale and maintain, and makes efficient use of available system resources.

Introducing Amazon EC2 Container Service

In order to help you to realize these benefits, we are announcing a preview of our new container management service, EC2 Container Service (or ECS for short). This service will make it easy for you to run any number of Docker containers across a managed cluster of Amazon Elastic Compute Cloud (EC2) instances using powerful APIs and other tools. You do not have to install cluster management software, purchase and maintain the cluster hardware, or match your hardware inventory to your software needs when you use ECS. You simply launch some instances in a cluster, define some tasks, and start them. ECS is built around a scalable, fault-tolerant, multi-tenant base that takes care of all of the details of cluster management on your behalf.

By the way, don’t let the word “cluster” scare you off! A cluster is simply a pool of compute, storage, and networking resources that serves as a host for one or more containerized applications. In fact, your cluster can even consist of a single t2.micro instance. In general, a single mid-sized EC2 instance has sufficient resources to be used productively as a starter cluster.

EC2 Container Service Benefits
Here’s how this service will help you to build, run, and scale Docker-based applications:

  • Easy Cluster Management – ECS sets up and manages clusters made up of Docker containers. It launches and terminates the containers and maintains complete information about the state of your cluster. It can scale to clusters that encompass tens of thousands of containers across multiple Availability Zones.
  • High Performance – You can use the containers as application building blocks. You can start, stop, and manage thousands of containers in seconds.
  • Flexible Scheduling – ECS includes a built-in scheduler that strives to spread your containers out across your cluster to balance availability and utilization. Because ECS provides you with access to complete state information, you can also build your own scheduler or adapt an existing open source scheduler to use the service’s APIs.
  • Extensible & Portable – ECS runs the same Docker daemon that you would run on-premises. You can easily move your on-premises workloads to the AWS cloud, and back.
  • Resource Efficiency – A containerized application can make very efficient use of resources. You can choose to run multiple, unrelated containers on the same EC2 instance in order to make good use of all available resources. You could, for example, decide to run a mix of short-term image processing jobs and long-running web services on the same instance.
  • AWS Integration – Your applications can make use of AWS features such as Elastic IP addresses, resource tags, and Virtual Private Cloud (VPC). The containers are, in effect, a new base-level building block in the same vein as EC2 and S3.
  • Secure – Your tasks run on EC2 instances within an Amazon Virtual Private Cloud. The tasks can take advantage of IAM roles, security groups, and other AWS security features. Containers run in a multi-tenant environment and can communicate with each other only across defined interfaces. The containers are launched on EC2 instances that you own and control.

Using EC2 Container Service
ECS was designed to be easy to set up and to use!

You can launch an ECS-enabled AMI and your instances will be automatically checked into your default cluster. If you want to launch into a different cluster you can specify it by modifying the configuration file in the image, or passing in User Data on launch. To ECS-enable a Linux AMI, you simply install the ECS Agent and the Docker daemon.

ECS will add the newly launched instance to its capacity pool and run containers on it as directed by the scheduler. In other words, you can add capacity to any of your clusters by simply launching additional EC2 instances in them!

The ECS Agent will be available in open source form under an Apache license. You can install it on any of your existing Linux AMIs and call registerContainerInstances to add them to your cluster.

Here are a few vocabulary items to help you to get familiar with the terminology used by ECS:
  • Cluster – A cluster is a pool of EC2 instances in a particular AWS Region, all managed by ECS. One cluster can contain multiple instance types and sizes, and can reside within one or more Availability Zones.
  • Scheduler – A scheduler is associated with each cluster. The scheduler is responsible for making good use of the resources in the cluster by assigning containers to instances in a way that respects any placement constraints and simultaneously drives as much parallelism as possible, while also aiming for high availability.
  • Container – A container is a packaged (or “Dockerized,” as the cool kids like to say) application component. Each EC2 instance in a cluster can serve as a host to one or more containers.
  • Task Definition – A JSON file that defines a Task as a set of containers. Fields in the file define the image for each container, convey memory and CPU requirements, and also specify the port mappings that are needed for the containers in the task to communicate with each other. (A sketch of such a file follows after this list.)
  • Task – A task is an instantiation of a Task Definition consisting of one or more containers, defined by the work that they do and their relationship to each other.
  • ECS-Enabled AMI – An Amazon Machine Image (AMI) that runs the ECS Agent and dockerd. We plan to ECS-enable the Amazon Linux AMI and are working with our partners to similarly enable their AMIs.
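To make the Task Definition entry concrete, here is a minimal sketch of the kind of file described above, built as a Python structure and written out as JSON. The field names (image, cpu, memory, portMappings) follow the description in the list but are illustrative; check the ECS documentation for the exact schema.

```python
import json

# Hypothetical task definition: a single web container with the fields the
# vocabulary list calls out. Names and values are placeholders.
task_definition = {
    "family": "hello-web",
    "containerDefinitions": [
        {
            "name": "web",
            "image": "nginx:latest",      # Docker image to run
            "cpu": 256,                   # CPU units reserved for the container
            "memory": 128,                # memory in MiB
            "portMappings": [
                {"containerPort": 80, "hostPort": 80},
            ],
        }
    ],
}

with open("hello-web-task.json", "w") as f:
    json.dump(task_definition, f, indent=2)
```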

EC2 Container Service includes a set of APIs that are both simple and powerful. You can create, describe, and destroy clusters and you can register EC2 instances therein. You can create task definitions and initiate and manage tasks.

Here is the basic set of steps that you will follow in order to run your application on ECS. I am making the assumption that you have already Dockerized your application by breaking it down into fine-grained components, each described by a Dockerfile and each running nicely on your existing infrastructure. There are plenty of good resources online to help you with this process. Many popular applications and application components have already been Dockerized and can be found on Docker Hub. You can use ECS with any public or private Docker repository that you can access. Ok, so here are the steps (a minimal API sketch follows after the list):

  1. Create a cluster, or decide to use the default one for your account in the target Region.
  2. Create your task definitions and register them with the cluster.
  3. Launch some EC2 instances and register them with the cluster.
  4. Start the desired number of copies of each task.
  5. Monitor the overall utilization of the cluster and the overall throughput of your application, and make adjustments as desired. For example, you can launch and then register additional EC2 instances in order to expand the cluster’s pool of available resources.
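A minimal sketch of those steps using the boto3 ECS client as it exists today (the preview API described in this post may differ; the cluster and task names are made up, and step 3 is assumed to have been done by launching ECS-enabled instances into the cluster):

```python
import boto3

ecs = boto3.client("ecs", region_name="us-east-1")

# Step 1: create a cluster (or rely on the account's default cluster).
ecs.create_cluster(clusterName="demo-cluster")

# Step 2: register a task definition; containerDefinitions follows the
# structure sketched earlier.
ecs.register_task_definition(
    family="hello-web",
    containerDefinitions=[
        {
            "name": "web",
            "image": "nginx:latest",
            "cpu": 256,
            "memory": 128,
            "portMappings": [{"containerPort": 80, "hostPort": 80}],
        }
    ],
)

# Step 3: ECS-enabled EC2 instances launched into the cluster register
# themselves automatically and add capacity.

# Step 4: start the desired number of copies of the task.
ecs.run_task(cluster="demo-cluster", taskDefinition="hello-web", count=2)
```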

EC2 Container Service Pricing and Availability
The service is launching today in preview form. If you are interested in signing up, click here to join the waiting list.

There is no extra charge for ECS. As usual, you pay only for the resources that you use.

(Via the Amazon blog)

Google Launches Managed Service For Running Docker-Based Applications On Its Cloud Platform

Google today announced the alpha launch of Google Container Engine, a new managed service for building and running Docker container-based applications on its cloud platform.

Docker is probably the hottest technology in developer circles these days — it’s almost impossible to have a discussion with a developer without it coming up — and Google’s Cloud Platform team has decided to go all in on this technology that makes it easier for developers to run distributed applications.

In essence, this new service is a “Cluster-as-a-Service” platform based on Google’s open source Kubernetes project. Kubernetes, which helps developers manage their container clusters, is based on Google’s own work with containers in its massive data centers. In this new service, Kubernetes dynamically manages the different Docker containers that make up an application for the user.

Google says the combination of “fast booting, efficient VM hosts and seamless virtualized network integration” will make its cloud computing service “the best place to run container-based applications.” The company’s competitors would likely argue with that, but none of them offer a similar service at this point.

Google initially launched support for Docker images in May as part of its new Managed VM service. These managed VMs are coming out of Google’s limited alpha — with the addition of auto-scaling support — and with that, developers can now use Docker containers on Google’s platform without having to jump through any hoops. Managed VMs will remain in beta for now, though, which in Google’s new language means there is no access control and that charges may be waived, but there is no SLA or technical support obligation either.

And remember, because this new service is officially in alpha, it isn’t feature-complete and the whole infrastructure could melt down at any minute.

(Via TechCrunch.com)

Is Red Hat the new Oracle?

Locking in? Trying to replicate its Linux success in the cloud, Red Hat said it will not support Red Hat Linux customers who run a non-Red Hat OpenStack distribution.

Everyone knows that Red Hat, the king of enterprise Linux, is banking on OpenStack as its next big opportunity. And most figured it would be aggressive in competing with rival OpenStack distributions — from Canonical, HP, Suse, and others.

What we didn’t necessarily know until the Wall Street Journal (registration required) reported it Tuesday night is that Red Hat – which makes its money selling support and maintenance for its open-source products – would refuse to support users of Red Hat Enterprise Linux who also run non-Red Hat versions of OpenStack.

Since Red Hat accounts for more than 60 percent of the paid enterprise Linux market, that policy could stem adoption of rival OpenStack distributions. It could also irritate customers — many of whom don’t like the specter of vendor lock-in.

In this policy — which the company confirmed to the Journal — Red Hat seems to have ripped a page out of Oracle’s playbook. The database giant, as it expanded into other software areas, decided that it would not support customers running  non-Oracle virtualization, non-Oracle Linux etc., unless the customer could prove that its issue originated in the Oracle part of the stack. That went over like a lead balloon with users. Since Oracle announced its own OpenStack distribution this week — and it also fields its own Linux distribution — the stage is set for dueling single-source OpenStack implementations going forward. Needless to say, that could ding OpenStack’s promise of no-vendor-lock-in.

The Journal also reported that Red Hat employees were told to stop working with Mirantis, an OpenStack systems integrator that late last year started offering its own OpenStack distribution. Red Hat’s president of products and technologies, Paul Cormier, told the paper that Red Hat would not bring a competitor into its accounts.

Reached late Tuesday for comment, a Red Hat spokeswoman noted that OpenStack “is not simply a layered product on top of Linux — [RHEL] is tightly integrated into and part of OpenStack. It is much more complex and intertwined than, say, Microsoft choosing to run PowerPoint on iOS.” I will update this story with additional Red Hat comment when it becomes available. Mirantis could not be reached for comment.

This news, which came out of the OpenStack Summit in Atlanta, may unsettle corporate customers wary of committing too much of their IT budget to any one vendor. But it’s hardly surprising. All of these vendors, while pledging the open-source goodness of OpenStack, also want to expand their own reach in customers’ shops. OpenStack competitors have been wary of Red Hat for quite some time, expecting it to try to replicate its dominance in the enterprise Linux realm in the cloud with OpenStack.

To be sure this problem is sort of theoretical now, as IDC Analyst Al Gillen pointed out. “From a practical standpoint, this is a non-issue today since there are precious few OpenStack clouds in production use,” he said via email. He agreed that this smacks of Oracle’s support policies but attributed Red Hat’s action more to the “immaturity and fast release of OpenStack more than anything else.”

Mark Shuttleworth, founder of Canonical, which competes with Red Hat both in Linux and OpenStack, recently acknowledged that the battle front has moved from the single-node Linux server realm — where Red Hat won — to multi-node cloud deployments where Red Hat’s enterprise software licensing mentality could put off customers. This new policy will test how compliant big customers will be to such enterprise sales tactics in the age of cloud.

(Via Gigaom.com)

Eight Cloud and Big Data Predictions for 2014

Prediction has always been a fun exercise. It forces you to take a step back from your day-to-day activities, look at the market “crystal ball” and figure out what the future looks like. This year I found this exercise to be particularly difficult, as the amount of new innovation and the conflicting trends taking place at the same time can be very confusing.

After spending quite some time reading market analyses from different sources (see references at the end) and compiling them with all the things I’ve seen and heard throughout the year, I came to the following observations on how 2014 will look for Cloud and Big Data. I’m happy to exchange thoughts in that regard and learn how you envision 2014 will look.

 

Public and Private Cloud in 2014

1. Amazon continues to distance itself while Google and Microsoft are catching up

Amazon continues to lead in public cloud and distance itself from the rest of the market. This year it reached 5 times the size of other cloud vendors combined, as noted in a recent Gartner report. Google and Microsoft are slowly closing the gap as the closest alternatives, but at a much slower pace than one would expect given the significant investment of both companies in this area.

2. 2014 will be the year of Enterprise Clouds

Traditional enterprise players, such as IBM, HP, Cisco and Red Hat, continue to fight for the remaining share of the market, mostly around enterprise adoption. Yet, enterprises have been slower to embrace and execute on a private cloud strategy. Part of the reason for the slow adoption is the gap between the solution provided by most of the regular contenders – who are still competing on selling an end-to-end story – and the reality that most enterprises are looking for an open and hybrid cloud strategy, especially given that there is no clear winner.

3. OpenStack will be the most popular choice for Enterprise Clouds in 2014

OpenStack is now high on the radar for Enterprise Cloud, mostly threatening VMware’s strong leadership in that domain. Most of the main contenders have embraced a strong OpenStack strategy, including Red Hat, Ubuntu, Suse, HP, IBM and Cisco, with Cisco taking a surprising leading position over IBM and HP, according to a recent Forrester survey.
VMware’s response of embracing OpenStack is still questionable, as the transition to OpenStack is not only about technical integration, but also involves a big shift in the value chain. Enterprises are less keen on paying a high cost for hypervisor licenses, which to date is the main revenue channel in the VMware pie.

4. Native OpenStack alternatives will disrupt many of the existing cloud solutions

The rapid adoption of OpenStack will also disrupt many of the existing cloud solutions that were built in a pre-OpenStack world. New solutions that take a more native-to-OpenStack approach will pop up and replace many of the current solutions. A good example of this is Project Solum, which aims to provide a native alternative to existing PaaS solutions, such as CloudFoundry and OpenShift, as I outlined in this recent post. I expect that we will see more of these disruptions expanding to other frameworks in 2014.

5. Orchestration & Automation will be the next big thing in 2014

Having said all that, the remaining challenge for enterprises is to break the IT bottleneck. This bottleneck is created by IT-centric decision-making processes, a.k.a. the “IaaS First Approach,” in which IT is focused on building a private cloud infrastructure – a process that takes much longer than anticipated when compared with a more business/application-centric approach. One of the ways to overcome that challenge is to abstract the infrastructure and allow other departments within the organization to take a parallel path towards the cloud, while ensuring future compatibility with new developments in the IT-led infrastructure.

DevOps has definitely been a key example of a business-led initiative that determines the speed of innovation and, thus, the competitiveness of many organizations. The move to DevOps forces many organizations to go through both cultural and technology changes to make the business and application sides more closely aligned, not just in goals, but in processes and tools as well.

Enterprises with legacy environments and customers take a two-step approach. They first tackle continuous delivery – automating the packaging of their software deliverables and keeping tighter control over new production rollouts. Once enterprises feel comfortable with their process and environment, they will automate the entire deployment process into production.

Configuration management, orchestration and workflow automation become key enablers in the enterprise transition to cloud, and will gain much attention in 2014 and 2015. Again, Amazon has already recognized that need, introducing into the space a new offering called OpsWorks. It is only a matter of time until we see similar offerings embedded with other cloud providers in both public and private clouds.

As the focus moves to orchestration, with a greater focus on standardization of deployments and packages, DSLs and APIs become equally important. Existing standards, such as TOSCA and CAMP, will undergo massive modifications to make them simpler to use and a better fit for the cloud environment.

Big Data Predictions

There are many advancements happening within the Big Data/NoSQL domain that I’m not going to touch on. I will focus on two main areas that are close to my work.

6. 2014 will see a major increase of Big Data in the Cloud

Cloud infrastructure is closing the gap with the existing data center, as we see the emergence of support for bare metal, high CPU, high memory and flash-disk instances. Many cloud providers are already offering Big Data as a Service, such as Elastic MapReduce and Redshift, and are continuously expanding their offerings in that regard.

This advancement in cloud infrastructure removes almost all of the technical barriers for running I/O intensive workloads, such as Big Data analytics, on the cloud. I expect that in 2014, running Big Data analytics on the cloud will become the first choice for any new project, while the use of Big Data in non-cloud environments will be minimized to only niche use cases with extreme regulations and security constraints.

7. In-Memory Data backed by flash-disk will become a popular choice for Real-Time Big Data analytics
Flash disk pricing changes the economics behind the cost/performance ratio of disk-based solutions, making it possible to achieve the performance of an In-Memory-based solution at a price closer to a disk-based solution. So far, attempts have been made to use flash disks as a fast disk alternative to magnetic drives. This path inherits many of the limitations of a disk-based drive and, therefore, doesn’t capture the full potential of flash disk which, like RAM, can provide highly parallel access to data.

At the same time, memory-based solutions like In-Memory Database or In-Memory Data Grid cover a small niche in the Big Data ecosystem, mostly due to the high cost/performance ratio.

I believe that the combination of flash drives with In-Memory data grids and databases at the front end changes that dynamic and will make memory-based solutions backed by flash disk much more attractive to a bigger niche, specifically as it relates to real-time analytics of Big Data.

In some cases, the combination of memory-based solutions is also integrated with existing Big Data frameworks and, thus, provides seamless performance acceleration. A good example is GigaSpaces’ integration with Storm and GridGain’s integration with Hadoop.

8. Real-time analytics turns mainstream

The most interesting indication that real-time analytics is becoming mainstream is Amazon’s support for real-time analytics, with many of the existing analytics solutions providing real-time analytics capabilities as built-in parts of their reports. Another good example of that is Google Analytics Real Time View.

New disruptive forces in Cloud worth watching in 2014
Disruptive technologies change the market landscape in ways that are difficult to anticipate by their very nature. Rather than predicting their impact, I felt it would be simpler to list them.

  • Networking – The networking segment of IT is experiencing a major disruption as of late with the move to Software Defined Networking, the OpenStack Neutron project and Network Function Virtualization. These developments will change networking not only from a technology perspective; they are also driving a completely different business model, based on utilization or subscription.
  • Linux Containers – Linux containers are gaining interest both as lightweight software packaging and as lightweight VMs. The most common use case for Linux containers so far has been as the underlying container for PaaS. With the introduction of new projects such as Docker – which makes Linux containers easy to use – we will see wider and more pervasive use of containers for high-performance virtualization, as packaging tools in continuous deployment scenarios, etc.
  • Bare Metal Cloud - Bare metal cloud has been a small niche in the cloud space, mostly due to the fact that it is usually offered in a static configuration setup. Newer bare metal clouds provide the same degree of elasticity for bare metal devices. With OpenStack, for example, you can spawn a new bare metal device just as you would provision any other VM. This development would remove one of the last remaining barriers to bringing mission-critical applications to the cloud.
  • OpenStack as an Innovation Accelerator – Many of the disruptive technologies listed above are not that new and, to a certain degree, have existed for years, as in the case of Linux containers. Often, a disruptive force needs to gain a certain critical mass before it can break into massive adoption. OpenStack creates an ecosystem that provides a platform for many users to integrate new technologies in a way that can be consumed by end users immediately, and that plays a major role in accelerating the adoption of many of those disruptive technologies.

Where do we go from here?

There are many individual technologies and trends that have a disruptive force behind them. Having said that, I think that it is the intersection between those technologies that holds the most promising potential. Here are few examples:

  • Big Data and the Cloud Application Orchestration - Big Data analytics for operational information could serve as the new “brain” behind a new class of orchestration engine that would combine artificial intelligence decision-making based on trends and historical analysis and handle complex failure and scaling scenarios automatically.
  • Putting Network and Applications together – Putting network and applications together holds a lot of promise in the way we scale applications across regions and multiple sites, as well as how we control application SLA’s in a shared environment. For example, giving priority to customer-facing services versus batch analytics or optimizing the network routing based on the locality of the data, etc.

(via: Nati Shalom’s Blog)

 

We Finally Cracked The 10K Problem – This Time For Managing Servers With 2000x Servers Managed Per Sysadmin

In 1999 Dan Kegel issued a big hairy audacious challenge to web servers:

It’s time for web servers to handle ten thousand clients simultaneously, don’t you think? After all, the web is a big place now.

This became known as the C10K problem. Engineers solved the C10K scalability problems by fixing OS kernels and moving away from threaded servers like Apache to event-driven servers like Nginx and Node.
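For readers who have not seen the pattern, here is a minimal illustrative echo server in the event-driven style, using Python’s selectors module so a single thread multiplexes every connection over epoll/kqueue instead of dedicating a thread to each one:

```python
import selectors
import socket

sel = selectors.DefaultSelector()

def accept(listener):
    # A new client connected; register it for read events.
    conn, _ = listener.accept()
    conn.setblocking(False)
    sel.register(conn, selectors.EVENT_READ, handle)

def handle(conn):
    # Echo whatever the client sent; close on EOF.
    data = conn.recv(4096)
    if data:
        conn.send(data)
    else:
        sel.unregister(conn)
        conn.close()

listener = socket.socket()
listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
listener.bind(("0.0.0.0", 8080))
listener.listen(1024)
listener.setblocking(False)
sel.register(listener, selectors.EVENT_READ, accept)

while True:
    # One thread, one loop, many thousands of sockets.
    for key, _ in sel.select():
        key.data(key.fileobj)
```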

Today we are considering an even bigger goal, how to support 10 Million Concurrent Connections, which requires even more radical techniques.

No similar challenge was issued for managing servers in a datacenter, but according to Dave Neary from Red Hat, in a recent FLOSS Weekly episode, we have passed the 10K barrier for server management with 10,000 or more servers managed per sysadmin.

Should We Let This Milestone Pass Without Mention?

Absolutely not! It’s a stunning accomplishment with 200x-2000x increases in productivity. Dave said he remembered in the 1990s it took one sysadmin to manage 4 or 5 Windows servers. A Linux sysadmin could manage 50 to 60 servers.

Now companies are managing over 10,000 servers per sysadmin. This huge change is rooted both in IaaS, treating a datacenter as an elastic programmable resource, divorcing operations from infrastructure deployment, and in the DevOps revolution, with its emphasis on tools, culture, automation, metrics, sharing of resources, and infrastructure as code.
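A trivial sketch of what “an elastic programmable resource” means in practice: capacity is added with an API call instead of a purchase order. This uses the boto3 EC2 client; the AMI ID, instance type and count are placeholders.

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Ask the IaaS layer for ten more servers; no loading dock required.
response = ec2.run_instances(
    ImageId="ami-12345678",     # placeholder AMI
    InstanceType="m3.medium",
    MinCount=10,
    MaxCount=10,
)
print("launched", len(response["Instances"]), "instances")
```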

What Will It Take To Manage 10 Million Servers Per Sysadmin?

Who might know? Google of course.

As James Hamilton says, Counting Servers is Hard, but Microsoft says they have 1 million servers, and Google is planning for 10 million servers, so it may take a while before we can get to 10 million servers per sysadmin.

But when it does happen the base will be built on:

At a high level the approaches to scaling to 10 million connections per server and to managing 10 million machines per sysadmin are the same: scalability is specialization.

But at a lower level they differ completely. Scaling to 10 million connections is about removing layers and doing the work yourself. Scaling to 10 million servers is all about putting the intelligence into smarter and smarter layers. A lot like how the human body utilizes trillions of individual components mediated by many autonomous systems, all directed by a parallelized and decentralized brain.

(source: HighScalability.com)