Cloudflare protects customers from cache poisoning

A few days ago, Cloudflare — along with the rest of the world — learned of a “practical” cache poisoning attack. In this post I’ll walk through the attack and explain how Cloudflare mitigated it for our customers. While any web cache is vulnerable to this attack, Cloudflare is uniquely able to take proactive steps to defend millions of customers.

In addition to the steps we’ve taken, we strongly recommend that customers update their origin web servers to mitigate vulnerabilities. Some popular vendors have applied patches that can be installed right away, including Drupal, Symfony, and Zend.

How a shared web cache works

Say a user requests a cacheable file, index.html. We first check if it’s in cache, and if it’s not, we fetch it from the origin and store it. Subsequent users can request that file from our cache until it expires or gets evicted.

Although contents of a response can vary slightly between requests, customers may want to cache a single version of the file to improve performance.


(See this support page for more info about how to cache HTML with Cloudflare.)

How do we know it’s the same file? We create something called a “cache key” which contains several fields, for example:

  • HTTP Scheme
  • HTTP Host
  • Path
  • Query string

In general, if the URL matches, and our customer has told us that a file is cacheable, we will serve the cached file to subsequent users.

How a cache poisoning attack works

In a cache poisoning attack, a malicious user crafts an HTTP request that tricks the origin into producing a “poisoned” version of index.html with the same cache key as an innocuous request. This file may get cached and served to other users.


We take this vulnerability very seriously, because an attacker with no privileges may be able to inject arbitrary data or resources into customer websites.

So how do you trick an origin into producing unexpected output? It turns out that some origins send back data from HTTP headers that are not part of the cache key.

To give one example, we might observe origin behavior like:

(Figure: An HTTP response that reflects back data in an HTTP request header)

Because this data is returned, unescaped, from the origin, it can be used in scary ways:

(Figure: An HTTP response that reflects back malicious JavaScript from an HTTP request header)

Game over — the attacker can now get arbitrary JavaScript to execute on this webpage.
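The mechanics can be illustrated with a toy model. The sketch below is a minimal simulation and not Cloudflare's implementation: toy_origin, ToyCache and the header name X-Example-Header are hypothetical names invented for this illustration. The cache key uses only the scheme, host, path and query string, so a response shaped by an unkeyed header is stored once and then served to later visitors.

```python
from typing import Dict, Tuple

CacheKey = Tuple[str, str, str, str]


def toy_origin(path: str, headers: Dict[str, str]) -> str:
    """Hypothetical origin that reflects a request header, unescaped."""
    reflected = headers.get("X-Example-Header", "")
    return f"<html><body>Hello {reflected}</body></html>"


class ToyCache:
    """Toy shared cache keyed only on scheme, host, path and query string."""

    def __init__(self) -> None:
        self.store: Dict[CacheKey, str] = {}

    def fetch(self, scheme: str, host: str, path: str, query: str,
              headers: Dict[str, str]) -> str:
        key = (scheme, host, path, query)  # request headers are NOT in the key
        if key not in self.store:
            # Cache miss: ask the origin and store whatever comes back.
            self.store[key] = toy_origin(path, headers)
        return self.store[key]


cache = ToyCache()

# The attacker requests the page first, with a script in the unkeyed header.
attacker_headers = {"X-Example-Header": "<script>alert(1)</script>"}
cache.fetch("https", "www.example.com", "/index.html", "", attacker_headers)

# A later visitor sends no special header but still gets the poisoned copy.
victim_page = cache.fetch("https", "www.example.com", "/index.html", "", {})
print("<script>" in victim_page)
```

The victim's request is entirely ordinary; the poisoned response is served because both requests map to the same cache key.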

Notifying customers who are at risk

As soon as we learned about this new vulnerability, we wanted to see if any of our customers were vulnerable. We scanned all of our enterprise customer websites and checked if they echoed risky data. We immediately notified these customers about the vulnerability and advised them to update their origin.

Blocking the worst offenders

The next step was to block all requests that contain obviously malicious content — like JavaScript — in an HTTP header. Examples of this include a header with suspicious characters like < or >.

We were able to deploy these changes immediately for all customers who use our WAF. But we weren’t done yet.
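A check of this kind could look roughly like the following. This is an illustrative assumption about the shape of such a rule, not Cloudflare's actual WAF rule; the function name has_suspicious_header and the header names in the usage lines are hypothetical, and the only characters screened are the < and > mentioned above.

```python
from typing import Mapping

SUSPICIOUS_CHARS = ("<", ">")


def has_suspicious_header(headers: Mapping[str, str]) -> bool:
    """Return True if any header value contains '<' or '>'."""
    return any(
        char in value
        for value in headers.values()
        for char in SUSPICIOUS_CHARS
    )


# Hypothetical requests: the first would be blocked, the second allowed.
print(has_suspicious_header({"X-Example-Header": "<script>alert(1)</script>"}))
print(has_suspicious_header({"User-Agent": "ExampleBrowser/1.0"}))
```

A filter this simple only catches the obvious cases, which is why further defenses are still needed.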

A more subtle attack

There are other versions of the attack that could trick a client into downloading an unwanted but innocuous-looking resource, with harmful consequences.

Many requests that have traveled through another proxy before reaching Cloudflare contain the X-Forwarded-Host header. Some origins may rely on this value to serve web pages. For example:

(Figure: An HTTP request may look innocuous, but contain malicious data that gets reflected by an origin)

In this case, there’s no way to just block requests with this X-Forwarded-Host header, because it may have a valid purpose. However, we need to ensure that we don’t return this content to any users who didn’t request it!

There are a few ways we could defend against this type of attack. An obvious first answer is to just disable cache. This isn’t a great solution, though, as disabling cache would result in a tremendous amount of traffic on customer origin servers, which defeats the purpose of using Cloudflare.

Another option is to always include every HTTP header and its value in the cache key. However, there are many headers, and many different innocuous values (e.g. User-Agent). If we always included them in our default cache key, performance would degrade, because different users asking for the same content would get different copies, when they could all be effectively served with one.

Solution: include “interesting” header values in the cache key

Instead, we decided to change our cache keys for a request only if we think it may influence the origin response. Our default cache key got a bunch of new values:

  • HTTP Scheme
  • HTTP Host
  • Path
  • Query string
  • X-Forwarded-Host header
  • X-Host header
  • X-Forwarded-Scheme header

In order to prevent unnecessary cache sharding, we only include these header values when they differ from what’s in the URL or Host header. For example, if the HTTP Host is www.example.com, and X-Forwarded-Host is also www.example.com, we will not add the X-Forwarded-Host header to the cache key. Of course, it’s still crucial that applications do not send back data from any other headers!
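The rule described above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not Cloudflare's code: build_cache_key and INTERESTING_HEADERS are hypothetical names, header names are assumed to arrive in the exact spelling shown, and values are compared case-insensitively.

```python
from typing import Dict, Tuple

# Headers from the list above, mapped to the request value each is compared with.
INTERESTING_HEADERS = {
    "X-Forwarded-Host": "host",
    "X-Host": "host",
    "X-Forwarded-Scheme": "scheme",
}


def build_cache_key(scheme: str, host: str, path: str, query: str,
                    headers: Dict[str, str]) -> Tuple[str, ...]:
    """Build a cache key, adding a header only when it differs from the URL/Host."""
    baseline = {"scheme": scheme.lower(), "host": host.lower()}
    key = [baseline["scheme"], baseline["host"], path, query]
    for name, compared_with in INTERESTING_HEADERS.items():
        value = headers.get(name)
        if value is not None and value.lower() != baseline[compared_with]:
            # The header may influence the origin response, so shard on it.
            key.append(f"{name}={value.lower()}")
    return tuple(key)


plain = build_cache_key("https", "www.example.com", "/index.html", "", {})
same = build_cache_key("https", "www.example.com", "/index.html", "",
                       {"X-Forwarded-Host": "www.example.com"})
odd = build_cache_key("https", "www.example.com", "/index.html", "",
                      {"X-Forwarded-Host": "evil.example.net"})
print(plain == same)
print(plain == odd)
```

Requests whose header matches the Host share one cache entry, while a request carrying a different X-Forwarded-Host gets its own entry and cannot overwrite the copy served to everyone else.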

One side effect of this change is that customers who use these headers, and also rely on Purge by URL, may need to specify more headers in their Purge API calls. You can read more detail in this support page.


Cloudflare is committed to protecting our customers. If you notice anything unusual with your account, or have more questions, please contact Cloudflare Support.

(Source link: https://blog.cloudflare.com/cache-poisoning-protection/)

Google’s Cloud Pub/Sub Real-Time Messaging Service Is Now In Public Beta


Google is launching the first public beta of Cloud Pub/Sub today, its backend messaging service that makes it easier for developers to pass messages between machines and to gather data from smart devices. It’s basically a scalable messaging middleware service in the cloud that allows developers to quickly pass information between applications, no matter where they’re hosted. Snapchat is already using it for its Discover feature and Google itself is using it in applications like its Cloud Monitoring service.

Pub/Sub was in alpha for quite a while. Google first (quietly) introduced it at its I/O developer conference last year, but it never made a big deal about the service. Until now, the service was in private alpha, but starting today, all developers can use the service.



Using the Pub/Sub API, developers can create up to 10,000 topics (that’s the entity the application sends its messages to) and send up to 10,000 messages per second. Google says notifications should go out in under a second “even when tested at over 1 million messages per second.”

The typical use cases for this service, Google says, include balancing workloads in network clusters, implementing asynchronous workflows, logging to multiple systems, and data streaming from various devices.

During the beta period, the service is available for free. Once it comes out of beta, developers will have to pay $0.40 per million for the first 100 million API calls each month. Users who need to send more messages will pay $0.25 per million for the next 2.4 billion operations (that’s about 1,000 messages per second) and $0.05 per million for messages above that.
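To make the tiers concrete, here is a small worked calculation. It is a sketch that assumes the prices apply marginally, meaning each rate covers only the operations that fall inside its tier, which is how the wording above reads. The names PRICING_TIERS and monthly_cost are hypothetical.

```python
# (tier size in operations, price in US dollars per million operations)
PRICING_TIERS = [
    (100_000_000, 0.40),      # first 100 million
    (2_400_000_000, 0.25),    # next 2.4 billion
    (None, 0.05),             # everything above that
]


def monthly_cost(operations: int) -> float:
    """Return the monthly charge in dollars for a number of operations."""
    remaining = operations
    total = 0.0
    for tier_size, price_per_million in PRICING_TIERS:
        in_tier = remaining if tier_size is None else min(remaining, tier_size)
        total += in_tier / 1_000_000 * price_per_million
        remaining -= in_tier
        if remaining == 0:
            break
    return round(total, 2)


for ops in (100_000_000, 2_500_000_000, 3_500_000_000):
    print(ops, monthly_cost(ops))
```

Under that assumption, filling the first two tiers (2.5 billion operations) costs $40 + $600 = $640 a month. As a sanity check on the "about 1,000 messages per second" figure, 2.4 billion operations spread over a 30-day month (2,592,000 seconds) works out to roughly 926 per second.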

Now that Pub/Sub has hit beta — and Google even announced the pricing for the final release — chances are we will see a full launch around Google I/O this summer.


(via Techcrunch.com)

Linux Containers and the Future Cloud

Linux-based container infrastructure is an emerging cloud technology based on fast and lightweight process virtualization. It provides its users an environment as close as possible to a standard Linux distribution. As opposed to para-virtualization solutions (Xen) and hardware virtualization solutions (KVM), which provide virtual machines (VMs), containers do not create other instances of the operating system kernel. Due to the fact that containers are more lightweight than VMs, you can achieve higher densities with containers than with VMs on the same host (practically speaking, you can deploy more instances of containers than of VMs on the same host).

Another advantage of containers over VMs is that starting and shutting down a container is much faster than starting and shutting down a VM. All containers under a host are running under the same kernel, as opposed to virtualization solutions like Xen or KVM where each VM runs its own kernel. Sometimes the constraint of running under the same kernel in all containers under a given host can be considered a drawback. Moreover, you cannot run BSD, Solaris, OS X or Windows in a Linux-based container, and sometimes this fact also can be considered a drawback.

The idea of process-level virtualization in itself is not new, and it already was implemented by Solaris Zones as well as BSD jails quite a few years ago. Other open-source projects implementing process-level virtualization have existed for several years. However, they required custom kernels, which was often a major setback. Full and stable support for Linux-based containers on mainstream kernels by the LXC project is relatively recent, as you will see in this article. This makes containers more attractive for the cloud infrastructure. More and more hosting and cloud services companies are adopting Linux-based container solutions. In this article, I describe some open-source Linux-based container projects and the kernel features they use, and show some usage examples. I also describe the Docker tool for creating LXC containers.

The underlying infrastructure of modern Linux-based containers consists mainly of two kernel features: namespaces and cgroups. There are six types of namespaces, which provide per-process isolation of the following operating system resources: filesystems (MNT), UTS, IPC, PID, network and user namespaces (user namespaces allow mapping of UIDs and GIDs between a user namespace and the global namespace of the host). By using network namespaces, for example, each process can have its own instance of the network stack (network interfaces, sockets, routing tables and routing rules, netfilter rules and so on).

Creating a network namespace is very simple and can be done with the following iproute command: ip netns add myns1. With the ip netns command, it also is easy to move one network interface from one network namespace to another, to monitor the creation and deletion of network namespaces, to find out to which network namespace a specified process belongs and so on. Quite similarly, when using the MNT namespace, when mounting a filesystem, other processes will not see this mount, and when working with PID namespaces, you will see by running the ps command from that PID namespace only processes that were created from that PID namespace.
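Namespace membership also can be inspected from a script. The sketch below does not use the ip netns command; instead it reads the symbolic links that the kernel exposes under /proc/<pid>/ns, whose targets look like net:[4026531992]. The function names namespace_ids and same_namespace are hypothetical, and reading the links of another user's process may require extra privileges.

```python
import os
from typing import Dict


def namespace_ids(pid: str = "self", proc_root: str = "/proc") -> Dict[str, str]:
    """Map each namespace type of a process to its kernel identifier.

    Each entry under <proc_root>/<pid>/ns is a symbolic link whose target
    looks like 'net:[4026531992]'.
    """
    ns_dir = os.path.join(proc_root, pid, "ns")
    return {
        name: os.readlink(os.path.join(ns_dir, name))
        for name in sorted(os.listdir(ns_dir))
    }


def same_namespace(pid_a: str, pid_b: str, ns_type: str = "net",
                   proc_root: str = "/proc") -> bool:
    """Return True if two processes share the given namespace type."""
    return (namespace_ids(pid_a, proc_root)[ns_type]
            == namespace_ids(pid_b, proc_root)[ns_type])


# Only meaningful on a Linux host that exposes /proc/<pid>/ns.
if os.path.isdir("/proc/self/ns"):
    for ns_type, ident in namespace_ids().items():
        print(ns_type, ident)
```

Two processes are in the same namespace of a given type when their identifiers match, so comparing a container's init process with a host process shows which namespaces the container has unshared.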

The cgroups subsystem provides resource management and accounting. It lets you define easily, for example, the maximum memory that a process may use. This is done by using cgroups VFS operations. The cgroups project was started by two Google developers, Paul Menage and Rohit Seth, back in 2006, and it initially was called “process containers”. Neither namespaces nor cgroups intervene in critical paths of the kernel, and thus they do not incur a high performance penalty, except for the memory cgroup, which can incur significant overhead under some workloads.

Linux-Based Containers

Basically, a container is a Linux process (or several processes) that has special features and that runs in an isolated environment, configured on the host. You might sometimes encounter terms like Virtual Environment (VE) and Virtual Private Server (VPS) for a container.

The features of this container depend on how the container is configured and on which Linux-based container is used, as Linux-based containers are implemented differently in several projects. I mention the most important ones in this article:

  • OpenVZ: the origins of the OpenVZ project are in a proprietary server virtualization solution called Virtuozzo, which originally was started by a company called SWsoft, founded in 1997. In 2005, a part of the Virtuozzo product was released as an open-source project, and it was called OpenVZ. Later, in 2008, SWsoft merged with a company called Parallels. OpenVZ is used for providing hosting and cloud services, and it is the basis of the Parallels Cloud Server. Like Virtuozzo, OpenVZ also is based on a modified Linux kernel. In addition, it has command-line tools (primarily vzctl) for management of containers, and it makes use of templates to create containers for various Linux distributions. OpenVZ also can run on some unmodified kernels, but with a reduced feature set. The OpenVZ project is intended to be fully mainlined in the future, but that could take quite a long time.
  • Google containers: in 2013, Google released the open-source version of its container stack, lmctfy (which stands for Let Me Contain That For You). Right now, it’s still in the beta stage. The lmctfy project is based on using cgroups. Currently, Google containers do not use the kernel namespaces feature, which is used by other Linux-based container projects, but using this feature is on the Google container project roadmap.
  • Linux-VServer: an open-source project that was first publicly released in 2001, it provides a way to partition resources securely on a host. The host should run a modified kernel.
  • LXC: the LXC (LinuX Containers) project provides a set of userspace tools and utilities to manage Linux containers. Many LXC contributors are from the OpenVZ team. As opposed to OpenVZ, it runs on an unmodified kernel. LXC is fully written in userspace and supports bindings in other programming languages like Python, Lua and Go. It is available in most popular distributions, such as Fedora, Ubuntu, Debian and more. Red Hat Enterprise Linux 6 (RHEL 6) introduced Linux containers as a technical preview. You can run Linux containers on architectures other than x86, such as ARM (there are several how-tos on the Web for running containers on Raspberry Pi, for example).

I also should mention the libvirt-lxc driver, with which you can manage containers. This is done by defining an XML configuration file and then running virsh start, virsh console and virsh destroy to run, access and destroy the container, respectively. Note that there is no common code between libvirt-lxc and the userspace LXC project.

LXC Container Management

First, you should verify that your host supports LXC by running lxc-checkconfig. If everything is okay, you can create a container by using one of several ready-made templates for creating containers. In lxc-0.9, there are 11 such templates, mostly for popular Linux distributions. You easily can tailor these templates according to your requirements, if needed. So, for example, you can create a Fedora container called fedoraCT with:

lxc-create -t fedora -n fedoraCT


The container will be created by default under /var/lib/lxc/fedoraCT. You can set a different path for the generated container by adding the --lxcpath PATH option.

The -t option specifies the name of the template to be used (fedora in this case), and the -n option specifies the name of the container (fedoraCT in this case). Note that you also can create containers of other distributions on Fedora, for example of Ubuntu (you need the debootstrap package for it). Not all combinations are guaranteed to work.

You can pass parameters to lxc-create after adding --. For example, you can create an older release of several distributions with the -R or -r option, depending on the distribution template. To create an older Fedora container on a host running Fedora 20, you can run:

lxc-create -t fedora -n fedora19 -- -R 19


You can remove the installation of an LXC container from the filesystem with:

lxc-destroy -n fedoraCT


For most templates, when a template is used for the first time, several required package files are downloaded and cached on disk under /var/cache/lxc. These files are used when creating a new container with that same template, and as a result, creating a container that uses the same template will be faster next time.

You can start the container you created with:

lxc-start -n fedoraCT


And stop it with:

lxc-stop -n fedoraCT


The signal used by lxc-stop is SIGPWR by default. In order to use SIGKILL in the earlier example, you should add -k to lxc-stop:

lxc-stop -n fedoraCT -k


You also can start a container as a dæmon by adding -d, and then log in to it with lxc-console, like this:

lxc-start -d -n fedoraCT
lxc-console -n fedoraCT


The first lxc-console that you run for a given container will connect you to tty1. If tty1 already is in use (because that’s the second lxc-console that you run for that container), you will be connected to tty2 and so on. Keep in mind that the maximum number of ttys is configured by the lxc.tty entry in the container configuration file.

You can make a snapshot of a non-running container with:

lxc-snapshot -n fedoraCT


This will create a snapshot under /var/lib/lxcsnaps/fedoraCT. The first snapshot you create will be called snap0; the second one will be called snap1 and so on. You can restore the snapshot at a later time with the -r option, for example:

lxc-snapshot -n fedoraCT -r snap0 restoredFedoraCT


You can list the snapshots with:

lxc-snapshot -L -n fedoraCT


You can display the running containers by running:

lxc-ls --active


Managing containers also can be done via scripts, using scripting languages. For example, this short Python script starts the fedoraCT container:


import lxc

# Look up the existing container by name, then start it.
container = lxc.Container("fedoraCT")
container.start()


Container Configuration

A default config file is generated for every newly created container. This config file is created, by default, in /var/lib/lxc/<containerName>/config, but you can alter that using the --lxcpath PATH option. You can configure various container parameters, such as network parameters, cgroups parameters, device parameters and more. Here are some examples of popular configuration items for the container config file:

  • You can set various cgroups parameters by setting values to the lxc.cgroup.[subsystem name] entries in the config file. The subsystem name is the name of the cgroup controller. For example, configuring the maximum memory a container can use to be 256MB is done by setting lxc.cgroup.memory.limit_in_bytes to 256M.
  • You can configure the container hostname by setting lxc.utsname.
  • There are five types of network interfaces that you can set with the lxc.network.type parameter: empty, veth, vlan, macvlan and phys. Using veth is very common in order to be able to connect a container to the outside world. By using phys, you can move network interfaces from the host network namespace to the container network namespace.
  • There are features that can be used for hardening the security of LXC containers. You can prevent specified system calls from being called from within a container by setting a secure computing mode, or seccomp, policy with the lxc.seccomp entry in the configuration file. You also can remove capabilities from a container with the lxc.cap.drop entry. For example, setting lxc.cap.drop = sys_module will create a container without the CAP_SYS_MODULE capability. Trying to run insmod from inside this container will fail. You also can define AppArmor and SELinux profiles for your container. You can find examples in the LXC README and in man 5 lxc.conf.
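As an illustration of how the items above fit together, the following short script renders a handful of them in the key = value form used by the container config file. It is only a sketch: render_lxc_config and example_settings are hypothetical names, and the values shown (the hostname, veth, 256M, sys_module) are example choices rather than recommended settings.

```python
from typing import Dict


def render_lxc_config(settings: Dict[str, str]) -> str:
    """Render key/value settings as 'key = value' lines."""
    return "\n".join(f"{key} = {value}" for key, value in settings.items()) + "\n"


# Example values only; every key is one of the entries described above.
example_settings = {
    "lxc.utsname": "fedoraCT",
    "lxc.network.type": "veth",
    "lxc.cgroup.memory.limit_in_bytes": "256M",
    "lxc.cap.drop": "sys_module",
}

print(render_lxc_config(example_settings), end="")
```

The resulting lines could be merged by hand into the generated config file for the container; the script itself does not touch any file.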


Docker

Docker is an open-source project that automates the creation and deployment of containers. Docker first was released in March 2013 with Apache License Version 2.0. It started as an internal project by a Platform-as-a-Service (PaaS) company called dotCloud at the time, and now called Docker Inc. The initial prototype was written in Python; later the whole project was rewritten in Go, a programming language that was developed first at Google. In September 2013, Red Hat announced that it would collaborate with Docker Inc. for Red Hat Enterprise Linux and for the Red Hat OpenShift platform. Docker requires Linux kernel 3.8 (or above). On RHEL systems, Docker runs on the 2.6.32 kernel, as necessary patches have been backported.

Docker utilizes the LXC toolkit and as such is currently available only for Linux. It runs on distributions like Ubuntu 12.04, 13.04; Fedora 19 and 20; RHEL 6.5 and above; and on cloud platforms like Amazon EC2, Google Compute Engine and Rackspace.

Docker images can be stored on a public repository and can be downloaded with the docker pull command—for example, docker pull ubuntu or docker pull busybox.

To display the images available on your host, you can use the docker images command. You can narrow the command for a specific type of images (fedora, for example) with docker images fedora.

On Fedora, running a Fedora docker container is simple; after installing the docker-io package, you simply start the docker dæmon with systemctl start docker, and then you can start a Fedora docker container with docker run -i -t fedora /bin/bash.

Docker has git-like capabilities for handling containers. Changes you make in a container are lost if you destroy the container, unless you commit your changes (much like you do in git) with docker commit <containerId> <containerName/containerTag>. These images can be uploaded to a public registry, and they are available for downloading by anyone who wants to download them. Alternatively, you can set up a private Docker repository.

Docker is able to create a snapshot using the kernel device mapper feature. In earlier versions, before Docker version 0.7, it was done using AUFS (union filesystem). Docker 0.7 adds “storage plugins”, so people can switch between device mapper and AUFS (if their kernel supports it), so that Docker can run on RHEL releases that do not support AUFS.

You can create images by running commands manually and committing the resulting container, but you also can describe them with a Dockerfile. Just like a Makefile will compile code into a binary executable, a Dockerfile will build a ready-to-run container image from simple instructions. The command to build an image from a Dockerfile is docker build. There is a tutorial about Dockerfiles and their command syntax on the Docker Web site. For example, the following short Dockerfile is for installing the iperf package for a Fedora image:

FROM fedora
RUN yum install -y iperf


You can upload and store your images for free on the Docker public index. Just like with GitHub, storing public images is free and just requires you to register an account.

The Checkpoint/Restore Feature

The CRIU (Checkpoint/Restore in userspace) project is implemented mostly in userspace, and there are more than 100 little patches scattered in the kernel for supporting it. There were several attempts to implement Checkpoint/Restore in kernel space solely, some of them by the OpenVZ project. The kernel community rejected all of them though, as they were too complex.

The Checkpoint/Restore feature enables saving a process state in several image files and restoring this process from the point at which it was frozen, on the same host or on a different host at a later time. This process also can be an LXC container. The image files are created using Google’s protocol buffer (PB) format. The Checkpoint/Restore feature enables performing maintenance tasks, such as upgrading a kernel or hardware maintenance on that host after checkpointing its applications to persistent storage. Later on, the applications are restored on that host.

Another feature that is very important in HPC is load balancing using live migration. The Checkpoint/Restore feature also can be used for creating incremental snapshots, which can be used after a crash occurs. As mentioned earlier, some kernel patches were needed for supporting CRIU; here are some of them:

  • A new system call named kcmp() was added; it compares two processes to determine if they share a kernel resource.
  • A socket monitoring interface called sock_diag was added to UNIX sockets in order to be able to find the peer of a UNIX domain socket. Before this change, the ss tool, which relied on parsing of /proc entries, did not show this information.
  • A TCP connection repair mode was added.
  • A procfs entry was added (/proc/PID/map_files).

Let’s look at a simple example of using the criu tool. First, you should check whether your kernel supports Checkpoint/Restore, by running criu check --ms. Look for a response that says "Looks good."

Basically, checkpointing is done by:

criu dump -t <pid>


You can specify a folder where the process state files will be saved by adding -D folderName.

You can restore with criu restore <pid>.


In this article, I’ve described what Linux-based containers are, and I briefly explained the underlying cgroups and namespaces kernel features. I have discussed some Linux-based container projects, focusing on the promising and popular LXC project. I also looked at the LXC-based Docker engine, which provides an easy and convenient way to create and deploy LXC containers. Several hands-on examples showed how simple it is to configure, manage and deploy LXC containers with the userspace LXC tools and the Docker tools.

Due to the advantages of the LXC and the Docker open-source projects, and due to the convenient and simple tools to create, deploy and configure LXC containers, as described in this article, we presumably will see more and more cloud infrastructures that will integrate LXC containers instead of using virtual machines in the near future. However, as explained in this article, solutions like Xen or KVM have several advantages over Linux-based containers and still are needed, so they probably will not disappear from the cloud infrastructure in the next few years.


Thanks to Jérôme Petazzoni from Docker Inc. and to Michael H. Warfield for reviewing this article.


Google Containers: https://github.com/google/lmctfy

OpenVZ: http://openvz.org/Main_Page

Linux-VServer: http://linux-vserver.org

LXC: http://linuxcontainers.org

libvirt-lxc: http://libvirt.org/drvlxc.html

Docker: https://www.docker.io

Docker Public Registry: https://index.docker.io

(Via LinuxJournal.com)

AWS OpsWorks in the Virtual Private Cloud

I am pleased to announce support for using AWS OpsWorks with Amazon Virtual Private Cloud (Amazon VPC). AWS OpsWorks is a DevOps solution that makes it easy to deploy, customize and manage applications. OpsWorks provides helpful operational features such as user-based ssh management, additional CloudWatch metrics for memory and load, automatic RAID volume configuration, and a variety of application deployment options. You can optionally use the popular Chef automation platform to extend OpsWorks using your own custom recipes. With VPC support, you can now take advantage of the application management benefits of OpsWorks in your own isolated network. This allows you to run many new types of applications on OpsWorks.

For example, you may want a configuration like the following, with your application servers in a private subnet behind a public Elastic Load Balancer (ELB). This lets you control access to your application servers. Users communicate with the Elastic Load Balancer which then communicates with your application servers through the ports you define. The NAT allows your application servers to communicate with the OpsWorks service and with Linux repositories to download packages and updates.

To get started, we’ll first create this VPC. For a shortcut to create this configuration, you can use a CloudFormation template. First, navigate to the CloudFormation console and select Create Stack. Give your stack a name, provide the template URL http://cloudformation-templates-us-east-1.s3.amazonaws.com/OpsWorksinVPC.template and select Continue. Accept the defaults and select Continue. Create a tag with a key of “Name” and a meaningful value. Then create your CloudFormation stack.

When your CloudFormation stack’s status shows “CREATE_COMPLETE”, take a look at the outputs tab; it contains several IDs that you will need later, including the VPC and subnet IDs.

You can now create an OpsWorks stack to deploy a sample app in your new private subnet. Navigate to the AWS OpsWorks console and click Add Stack. Select the VPC and private subnet that you just created using the CloudFormation template.

Next, under Add your first layer, click Add a layer. In the Layer type box, select PHP App Server. Select the Elastic Load Balancer created by the CloudFormation template to attach it to the layer, and then click Add layer.

Next, in the layer’s Actions column click Edit. Scroll down to the Security Groups section and select the Additional Group with OpsWorksSecurityGroup in the name. Click the + symbol, then click Save.

Next, in the navigation pane, click Instances, accept the defaults, and then click Add an Instance. This creates the instance in the default subnet you set when you created the stack.

Under PHP App Server, in the row that corresponds to your instance, click start in the Actions column.

You are now ready to deploy a sample app to the instance you created. An app represents code you want to deploy to your servers. That code is stored in a repository, such as Git or Subversion. For this example, we’ll use the SimplePHPApp application from the Getting Started walkthrough. First, in the navigation pane, click Apps. On the Apps page, click Add an app. Type a name for your app, scroll down, and set Repository URL to git://github.com/amazonwebservices/opsworks-demo-php-simple-app.git, and Branch/Revision to version1. Accept the defaults for the other fields.

When all the settings are as you want them, click Add app. When you first add a new app, it isn’t deployed yet to the instances for the layer. To deploy your app to the instance in the PHP App Server layer, under Actions, click Deploy.

Once your deployment has finished, in the navigation pane, click Layers. Select the Elastic Load Balancer for your PHP App Server layer. The ELB page shows the load balancer’s basic properties, including its DNS name and the health status of the associated instances. A green check indicates the instance has passed the ELB health checks (this may take a minute). You can then click on the DNS name to connect to your app through the load balancer.

You can try these new features with a few clicks of the AWS Management Console. To learn more about how to launch OpsWorks instances inside a VPC, see the AWS OpsWorks Developer Guide.

You may also want to sign up for our upcoming AWS OpsWorks Webinar on September 12, 2013 at 10:00 AM PT. The webinar will highlight common use cases and best practices for how to set up AWS OpsWorks and Amazon VPC.

— Chris Barclay, Senior Product Manager

(source: Amazon Web Services blog)


More Database Power – 20,000 IOPS for MySQL With the CR1 Instance

If you are a regular reader of this blog, you know that I am a huge fan of the Amazon Relational Database Service (RDS). Over the course of the last couple of years, I have seen that my audiences really appreciate the way that RDS obviates a number of tedious yet mission-critical tasks that are an inherent responsibility when running a relational database. There’s no joy to be found in keeping operating systems and database engines current, creating and restoring backups, scaling hardware up and down, or creating an architecture that provides high availability.

Today we are making RDS even more powerful by adding a new high-end database instance class. The new db.cr1.8xlarge instance type gives you plenty of memory, CPU power, and network throughput to allow your MySQL 5.6 applications to perform at up to 20,000 IOPS. This is a 60% improvement over the previous high-water mark of 12,500 IOPS and opens the door to database-driven applications that are even more demanding than before. Here are the specs:

  • 64-bit platform
  • 244 GB of RAM
  • 88 ECU (32 hyperthreaded virtual cores each delivering 2.75 ECU)
  • High-performance networking

This new instance type is available in the US East (Northern Virginia), US West (Oregon), EU (Ireland), and Asia Pacific (Tokyo) Regions and you can start using it today!

(source: Amazon Web Services blog)

OpenStack Grizzly Architecture

As OpenStack has continued to mature, it has become more complicated in some ways but radically simplified in others. From a deployer’s view, each service has become easier to deploy, with more sensible defaults and the proliferation of cloud distributions. However, the architect’s view of OpenStack has actually gotten more complicated: new services have been added and new ways of integrating them are now feasible.

As an aid to architects who are new to OpenStack, this post updates my OpenStack Folsom Architecture blog and revisits my Intro to OpenStack Architecture (Grizzly Edition) presentation from Portland, with a few clarifications and updates.

OpenStack Components

There are currently seven core components of OpenStack: Compute, Object Storage, Identity, Dashboard, Block Storage, Network and Image Service. Let’s look at each in turn:

  • Object Store (codenamed “Swift“) allows you to store or retrieve files (but not mount directories like a fileserver). Several companies provide commercial storage services based on Swift. These include KT, Rackspace (from which Swift originated) and Hewlett-Packard. Swift is also used internally at many large companies to store their data.
  • Image Store (codenamed “Glance“) provides a catalog and repository for virtual disk images. These disk images are most commonly used in OpenStack Compute.
  • Compute (codenamed “Nova“) provides virtual servers upon demand. Rackspace and HP provide commercial compute services built on Nova and it is used internally at companies like Mercado Libre, Comcast, Best Buy and NASA (where it originated).
  • Dashboard (codenamed “Horizon“) provides a modular web-based user interface for all the OpenStack services. With this web GUI, you can perform most operations on your cloud like launching an instance, assigning IP addresses and setting access controls.
  • Identity (codenamed “Keystone“) provides authentication and authorization for all the OpenStack services. It also provides a service catalog of services within a particular OpenStack cloud.
  • Network (which used to be named “Quantum” but is in the process of being renamed due to a trademark issue) provides “network connectivity as a service” between interface devices managed by other OpenStack services (most likely Nova). The service works by allowing users to create their own networks and then attach interfaces to them. Quantum has a pluggable architecture to support many popular networking vendors and technologies.
  • Block Storage (codenamed “Cinder“) provides persistent block storage to guest VMs. This project was born from code originally in Nova (the nova-volume service that has been deprecated). While this was originally a block storage only service, it has been extended to NFS shares.

In addition to these core projects, there are also a number of non-core projects that will be included in future OpenStack releases.

Conceptual Architecture

The OpenStack project as a whole is designed to “deliver(ing) a massively scalable cloud operating system.” To achieve this, each of the constituent services is designed to work together to provide a complete Infrastructure as a Service (IaaS). This integration is facilitated through public application programming interfaces (APIs) that each service offers (and in turn can consume). While these APIs allow each of the services to use another service, they also allow an implementer to switch out any service as long as they maintain the API. These are (mostly) the same APIs that are available to end users of the cloud.

Conceptually, you can picture the relationships between the services as follows:

OpenStack Grizzly Conceptual Architecture

  • Dashboard provides a web front end to the other OpenStack services
  • Compute stores and retrieves virtual disks (“images”) and associated metadata in the Image Store (“Glance”)
  • Network provides virtual networking for Compute.
  • Block Storage provides storage volumes for Compute.
  • Image Store can store the actual virtual disk files in the Object Store
  • All the services authenticate with Identity

This is a stylized and simplified view of the architecture, assuming that the implementer is using all of the services together in the most common configuration. However, OpenStack does not mandate an all-or-nothing approach. Many implementers only deploy the pieces that they need. For example, Swift is a popular object store for cloud service providers, even if they deploy another cloud compute infrastructure.

The diagram also only shows the “operator” side of the cloud — it does not picture how consumers of the cloud may actually use it. For example, many users will access object storage heavily (and directly).
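As a rough illustration, the relationships listed above can be written down as a small dependency map. The service names follow this post; the DEPENDS_ON dictionary and the deploy_order helper are hypothetical names invented for this sketch and are not part of OpenStack.

```python
# Hypothetical sketch: the conceptual relationships as a dependency map.
# Keys are services; values are the services they rely on.
DEPENDS_ON = {
    "Dashboard": ["Compute", "Image Store", "Object Store",
                  "Network", "Block Storage", "Identity"],
    "Compute": ["Image Store", "Network", "Block Storage", "Identity"],
    "Image Store": ["Object Store", "Identity"],
    "Network": ["Identity"],
    "Block Storage": ["Identity"],
    "Object Store": ["Identity"],
    "Identity": [],
}


def deploy_order(depends_on):
    """Return the services ordered so each one comes after its dependencies."""
    ordered, seen = [], set()

    def visit(service):
        if service in seen:
            return
        seen.add(service)
        for dependency in depends_on[service]:
            visit(dependency)
        ordered.append(service)

    for service in depends_on:
        visit(service)
    return ordered


print(deploy_order(DEPENDS_ON))
```

Dropping a key from the map (and from the lists that mention it) models a partial deployment, such as running only the object store alongside Identity.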

Logical Architecture

As you can imagine, the logical architecture is far more complicated than the conceptual architecture shown above. As with any service-oriented architecture, diagrams quickly become “messy” trying to illustrate all the possible combinations of service communications. The diagram below illustrates the most common architecture of an OpenStack-based cloud. However, as OpenStack supports a wide variety of technologies, it does not represent the only architecture possible.

OpenStack Grizzly Logical Architecture

This picture is consistent with the conceptual architecture above in that:

  • End users interact through a common web interface or directly to each service through their API
  • All services authenticate through a common source (facilitated through Keystone)
  • Individual services interact with each other through their APIs (except where privileged administrator commands are necessary) — including the user’s web interface

In the sections below, we’ll delve into the architecture for each of the services.


Dashboard

Horizon is a modular Django web application that provides an end user and cloud operator interface to OpenStack services.

OpenStack Horizon Screenshot

The interface has user screens for:

  • Quota and usage information
  • Instances to operate cloud virtual machines
  • Volume management to control creation, deletion and connectivity to block storage
  • Image and snapshot to upload and control virtual images, which are used to backup and boot new instances
  • Access and security to manage keypairs and security groups (firewall rules)

In addition to the user screens, it also provides an interface for cloud operators. The operator interface sees across the entire cloud and adds some configuration focused screens such as:

  • Flavors to define service catalog offerings of CPU, memory and boot disk storage
  • Projects to provide logical groups of user accounts
  • Users to administer user accounts
  • System Info to view services running in the cloud and quotas applied to projects

The Grizzly edition of Horizon adds a few new features as well as significant refactoring to the user experience:

  • Networking (see the new network topology diagrams)
  • Direct image upload to Glance
  • Support for flavor extra specs
  • Migrate instances to other compute hosts
  • User experience improvements

The Horizon architecture is fairly simple. Horizon is usually deployed via mod_wsgi in Apache. The code itself is separated into a reusable Python module that holds most of the logic (interactions with the various OpenStack APIs) and a presentation layer (to make it easily customizable for different sites).

From a network architecture point of view, this service will need to be customer accessible as well as be able to talk to each service’s public APIs. If you wish to use the administrator functionality (i.e. for other services), it will also need connectivity to their Admin API endpoints (which should not be customer accessible).


Compute

Nova is the most complicated and distributed component of OpenStack. A large number of processes cooperate to turn end user API requests into running virtual machines. Among Nova’s more prominent features are:

  • Starting, resizing, stopping and querying virtual machines (“instances”)
  • Assigning and removing public IP addresses
  • Attaching and detaching block storage
  • Adding, modifying and deleting security groups
  • Showing instance consoles
  • Snapshotting running instances

There are several changes to the architecture in this release. These changes include the deprecation of nova-network and nova-volume as well as the decoupling of nova-compute from the database (through the no-compute-db feature). All of these changes are optional (the old code is still available to be used), but the old code paths are slated to disappear soon.

Below is a list of these processes and their functions:

  • nova-api is a family of daemons (nova-api, nova-api-os-compute, nova-api-ec2, nova-api-metadata or nova-api-all) that accept and respond to end user compute API calls. It supports the OpenStack Compute API, Amazon’s EC2 API and a special Admin API (for privileged users to perform administrative actions). It also initiates most of the orchestration activities (such as running an instance) and enforces some policy (mostly quota checks). Different daemons allow Nova to implement different APIs (Amazon EC2, OpenStack Compute, Metadata) or combinations of APIs (nova-api starts both the EC2 and OpenStack APIs).
  • The nova-compute process is primarily a worker daemon that creates and terminates virtual machine instances via the hypervisor’s APIs (XenAPI for XenServer/XCP, libvirt for KVM or QEMU, VMwareAPI for VMware, etc.). New to the Grizzly release is the return of Hyper-V (thanks to the Cloudbase Solutions guys for the comment). The process by which it does so is fairly complex but the basics are simple: accept actions from the queue and then perform a series of system commands (like launching a KVM instance) to carry them out while updating state in the database through nova-conductor. Please note that the use of nova-conductor is optional in this release, but does greatly increase security.
  • The nova-scheduler process is conceptually the simplest piece of code in OpenStack Nova: it takes a virtual machine instance request from the queue and determines where it should run (specifically, which compute server host it should run on). In practice, it is now one of the most complex.
  • A new service called nova-conductor has been added to this release. It mediates access to the database for other daemons (only nova-compute in this release) to provide greater security.
  • The queue provides a central hub for passing messages between daemons. This is usually implemented with RabbitMQ today, but could be any AMQP message queue (such as Apache Qpid), or ZeroMQ.
  • The SQL database stores most of the build-time and run-time state for a cloud infrastructure. This includes the instance types that are available for use, instances in use, networks available and projects. Theoretically, OpenStack Nova can support any database supported by SQLAlchemy but the only databases currently being widely used are sqlite3 (only appropriate for test and development work), MySQL and PostgreSQL.
  • Nova also provides console services to allow end users to access their virtual instance’s console through a proxy. This involves several daemons (nova-console, nova-xvpvncproxy, nova-spicehtml5proxy and nova-consoleauth).

Nova interacts with many other OpenStack services: Keystone for authentication, Glance for images and Horizon for the web interface. The Glance interactions are central. The API process can upload to and query Glance, while nova-compute will download images for use in launching instances.
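The division of labor between these processes can be sketched in a few lines of single-threaded Python. This is only a toy model under stated assumptions: the function and class names are invented for illustration, the "most free RAM" placement rule is an assumption rather than Nova's real scheduling logic, and an in-process queue stands in for the message broker.

```python
import queue


class Conductor:
    """Stands in for nova-conductor: the only object that touches the 'database'."""

    def __init__(self):
        self.db = {}  # instance name -> state record

    def update(self, name, **fields):
        self.db.setdefault(name, {}).update(fields)


def api_request(scheduler_queue, name, ram_mb):
    """Stands in for nova-api: accept a request and place it on the queue."""
    scheduler_queue.put({"name": name, "ram_mb": ram_mb})


def schedule(scheduler_queue, compute_queues, free_ram_mb):
    """Stands in for nova-scheduler: pick a host (assumed rule: most free RAM)."""
    request = scheduler_queue.get()
    host = max(free_ram_mb, key=free_ram_mb.get)
    free_ram_mb[host] -= request["ram_mb"]
    compute_queues[host].put(request)
    return host


def compute_worker(host, compute_queue, conductor):
    """Stands in for nova-compute: take an action off the queue and carry it out."""
    request = compute_queue.get()
    # A real worker would call the hypervisor API here before reporting state.
    conductor.update(request["name"], host=host, state="ACTIVE")


free = {"host-a": 2048, "host-b": 8192}  # hypothetical hosts and free RAM in MB
sched_q = queue.Queue()
compute_qs = {host: queue.Queue() for host in free}
conductor = Conductor()

api_request(sched_q, "web-1", ram_mb=4096)
chosen = schedule(sched_q, compute_qs, free)
compute_worker(chosen, compute_qs[chosen], conductor)
print(conductor.db)
```

Note that the worker never writes to the state dictionary directly; routing every update through the conductor object mirrors the database decoupling described above.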

Object Store

OpenStack’s Object Store (“Swift”) is designed to provide large scale storage of data that is accessible via APIs. Unlike a traditional file server, it is completely distributed, storing multiple copies of each object to achieve greater availability and scalability. Swift provides the following user functionality:

  • Stores and retrieves objects (files)
  • Sets and modifies metadata on objects (tags)
  • Versions objects
  • Serves static web pages and objects via HTTP. In fact, the diagrams in this blog post are being served out of Rackspace’s Swift service.

The Swift architecture is very distributed to prevent any single point of failure as well as to scale horizontally. It includes the following components:

  • Proxy server (swift-proxy-server) accepts incoming requests via the OpenStack Object API or just raw HTTP. It accepts files to upload, modifications to metadata or container creation. In addition, it will also serve files or container listings to web browsers. The proxy server may utilize an optional cache (usually deployed with memcache) to improve performance.
  • Account servers manage accounts defined with the object storage service.
  • Container servers manage a mapping of containers (i.e. folders) within the object store service.
  • Object servers manage actual objects (i.e. files) on the storage nodes.

There are also a number of periodic processes that run to perform housekeeping tasks on the large data store. The most important of these are the replication services, which ensure consistency and availability throughout the cluster. Other periodic processes include auditors, updaters and reapers. Authentication for the object store service is handled through configurable WSGI middleware (which will usually be Keystone).
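To make the idea of "multiple copies of each object" concrete, here is a deliberately simplified placement sketch. It is not Swift's actual placement algorithm; the node names, the replica count of three and the replica_nodes function are all assumptions made for illustration.

```python
import hashlib


def replica_nodes(object_path, nodes, replicas=3):
    """Pick distinct nodes for an object, starting from a hash of its path."""
    digest = hashlib.sha256(object_path.encode("utf-8")).hexdigest()
    start = int(digest, 16) % len(nodes)
    count = min(replicas, len(nodes))
    # Walk the node list from the hashed position so copies land on different nodes.
    return [nodes[(start + offset) % len(nodes)] for offset in range(count)]


# Hypothetical storage nodes.
nodes = ["storage-1", "storage-2", "storage-3", "storage-4", "storage-5"]
placement = replica_nodes("/account/container/photo.jpg", nodes)
print(placement)
```

Because the placement depends only on the object path and the node list, any proxy can compute where an object lives without consulting a central index, and losing one node still leaves other copies reachable.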

To learn more about Swift, head over to the SwiftStack website and read their OpenStack Swift Architecture.

Image Store

OpenStack Image Store centralizes virtual images for users and other cloud services:

  • Stores public and private images that users can utilize to start instances
  • Users can query and list available images for use
  • Delivers images to Nova to start instances
  • Snapshots from running instances can be stored so that virtual machines can be backed up

The Glance architecture has stayed relatively stable since the Cactus release.

  • glance-api accepts Image API calls for image discovery, image retrieval and image storage.
  • glance-registry stores, processes and retrieves metadata about images (size, type, etc.).
  • A database to store the image metadata. Like Nova, you can choose your database depending on your preference (but most people use MySQL or SQLite).
  • A storage repository for the actual image files. In the diagram above, Swift is shown as the image repository, but this is configurable. In addition to Swift, Glance supports normal filesystems, RADOS block devices, Amazon S3 and HTTP. Be aware that some of these choices are limited to read-only usage.

There are also a number of periodic processes that run on Glance to support caching. The most important of these are the replication services, which ensure consistency and availability throughout the cluster. Other periodic processes include auditors, updaters and reapers.

As you can see from the diagram in the Conceptual Architecture section, Glance serves a central role to the overall IaaS picture. It accepts API requests for images (or image metadata) from end users or Nova components and can store its disk files in the object storage service, Swift.
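The split between image metadata and image data can be sketched as follows. This is a toy model, not Glance code: the class names are hypothetical, the backends are in-memory stand-ins, and the read-only store simply illustrates the caveat that some backends cannot accept uploads.

```python
class WritableStore:
    """Hypothetical in-memory stand-in for a writable storage backend."""

    def __init__(self):
        self._blobs = {}

    def put(self, image_id, data):
        self._blobs[image_id] = data

    def get(self, image_id):
        return self._blobs[image_id]


class ReadOnlyStore:
    """Hypothetical stand-in for a backend that only serves existing images."""

    def __init__(self, blobs):
        self._blobs = dict(blobs)

    def put(self, image_id, data):
        raise PermissionError("this store is read-only")

    def get(self, image_id):
        return self._blobs[image_id]


class ImageService:
    """Keeps metadata in a registry and delegates the bytes to a store."""

    def __init__(self, store):
        self.store = store
        self.registry = {}  # image id -> metadata (size, type)

    def upload(self, image_id, data, disk_format):
        self.store.put(image_id, data)  # raises before metadata is recorded
        self.registry[image_id] = {"size": len(data), "type": disk_format}

    def show(self, image_id):
        return self.registry[image_id]

    def download(self, image_id):
        return self.store.get(image_id)


service = ImageService(WritableStore())
service.upload("cirros", b"\x00" * 1024, "qcow2")
print(service.show("cirros"))
```

Swapping WritableStore for another object with the same put and get methods changes where the bytes live without touching the metadata path.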


Identity

Keystone provides a single point of integration for OpenStack policy, catalog, token and authentication:

  • Authenticates users and issues tokens for access to services
  • Stores users and tenants for role-based access control (RBAC)
  • Provides a catalog of the services (and their API endpoints) in the cloud
  • Creates policies across users and services

Architecturally, Keystone is very simple:

  • keystone handles API requests as well as providing configurable catalog, policy, token and identity services.
  • Each Keystone function has a pluggable backend which allows different ways to use the particular service. Most support standard backends like LDAP or SQL, as well as Key Value Stores (KVS).

Most people will use this as a point of customization for their current authentication services.
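The token and catalog functions described above can be illustrated with a small in-memory model. It is a sketch under obvious assumptions: the IdentityService class, the example user and the endpoint URL are invented, and passwords are compared in plain text purely to keep the example short.

```python
import secrets


class IdentityService:
    """Toy model of authentication, tokens, roles and a service catalog."""

    def __init__(self, users, catalog):
        self._users = users      # username -> (password, tenant, roles)
        self._catalog = catalog  # service name -> endpoint URL
        self._tokens = {}        # token -> {"user", "tenant", "roles"}

    def authenticate(self, username, password):
        record = self._users.get(username)
        # Plain-text comparison is for illustration only.
        if record is None or record[0] != password:
            raise PermissionError("invalid credentials")
        token = secrets.token_hex(16)
        self._tokens[token] = {
            "user": username,
            "tenant": record[1],
            "roles": record[2],
        }
        return token, dict(self._catalog)

    def validate(self, token, required_role):
        """Called by other services to check a token presented by a user."""
        info = self._tokens.get(token)
        return info is not None and required_role in info["roles"]


identity = IdentityService(
    users={"alice": ("s3cret", "demo", ["member"])},
    catalog={"compute": "http://compute.example.test:8774/v2"},
)
token, catalog = identity.authenticate("alice", "s3cret")
print(identity.validate(token, "member"))
print(identity.validate(token, "admin"))
```

The user store and token store here are plain dictionaries; in the pluggable-backend model described above, each would sit behind its own interchangeable backend.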


Network

Quantum provides “network connectivity as a service” between interface devices
managed by other OpenStack services (most likely Nova). Its main features are:

  • Users can create their own networks and then attach server interfaces to them
  • Pluggable backend architecture lets users take advantage of commodity gear or vendor supported equipment
  • Extensions allow additional network services like load balancing

Like many of the OpenStack services, Quantum is highly configurable due to its
plug-in architecture. These plug-ins accommodate different networking equipment
and software. As such, the architecture and deployment can vary dramatically.

  • quantum-server accepts API requests and then routes them to the
    appropriate quantum plugin for action.
  • Quantum plugins and agents perform the actual work such as plugging and
    unplugging ports, creating networks or subnets and IP addressing. These
    plugins and agents differ depending on the vendor and technologies used in the
    particular cloud. Quantum ships with plugins and agents for: Cisco virtual and
    physical switches, Nicira NVP product, NEC OpenFlow products, Open vSwitch,
    Linux bridging and the Ryu Network Operating System. Midokura also provides a plug-in for Quantum integration. The common agents are L3 (layer 3), DHCP (dynamic host IP addressing) and vendor specific plug-in agent(s).
  • Most Quantum installations will also make use of a messaging queue to route
    information between the quantum-server and various agents as well as a
    database to store networking state for particular plugins.

Quantum will interact mainly with Nova, where it will provide networks and
connectivity for its instances. Florian Otel has written a very thorough article on implementing Open vSwitch if you are looking for an example of Quantum in action.
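The server-plus-plugin split described in this section can be sketched as a simple dispatcher. The class names, the in-memory plugin and the two action names are hypothetical; a real plugin would drive switches or agents instead of a dictionary.

```python
class InMemoryPlugin:
    """Hypothetical plugin that records networks and ports in a dictionary."""

    def __init__(self):
        self.networks = {}  # network name -> list of attached devices

    def create_network(self, name):
        self.networks[name] = []
        return name

    def create_port(self, network, device):
        self.networks[network].append(device)
        return f"{network}:{device}"


class NetworkServer:
    """Stands in for quantum-server: accept a request and route it to the plugin."""

    ACTIONS = ("create_network", "create_port")  # allow-list of routed actions

    def __init__(self, plugin):
        self.plugin = plugin

    def handle(self, action, **params):
        if action not in self.ACTIONS:
            raise ValueError(f"unsupported action: {action}")
        return getattr(self.plugin, action)(**params)


server = NetworkServer(InMemoryPlugin())
server.handle("create_network", name="private")
port = server.handle("create_port", network="private", device="vm-1")
print(port)
```

Because the server only knows the action names, replacing InMemoryPlugin with a vendor-specific class that offers the same methods changes the backend without changing the API-facing code.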

Block Storage

Cinder separates out the persistent block storage functionality that was
previously part of OpenStack Compute (in the form of nova-volume) into its own
service. The OpenStack Block Storage API allows for manipulation of volumes,
volume types (similar to compute flavors) and volume snapshots:

  • Create, modify and delete volumes
  • Snapshot or backup volumes
  • Query volume status and metadata

Its architecture follows the Quantum model, which provides for a northbound API and vendor plugins underneath it.

  • cinder-api accepts API requests and routes them to cinder-volume
    for action.
  • cinder-volume acts upon the requests by reading or writing to the
    Cinder database to maintain state, interacting with other processes (like
    cinder-scheduler) through a message queue and directly upon block
    storage providing hardware or software. It can interact with a variety of
    storage providers through a driver architecture. Currently, there are included drivers for IBM (XIV, Storwize and SVC), SolidFire, Scality, Coraid appliances, RADOS block storage (Ceph), Sheepdog, NetApp, Windows Server 2012 iSCSI, HP (LeftHand and 3PAR), Nexenta appliances, Huawei (T series and Dorado storage systems), Zadara VPSA, Red Hat’s GlusterFS, EMC (VNX and VMAX arrays), Xen and Linux iSCSI.
  • Much like nova-scheduler, the cinder-scheduler daemon picks the optimal
    block storage provider node to create the volume on.
  • cinder-backup is a new service that backs up the data from a volume (not a full snapshot) to a backend service. Currently, the only shipping backend service is Swift.
  • Cinder deployments will also make use of a messaging queue to route
    information between the cinder processes as well as a database to store volume state.
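The scheduler's job of picking a backend can be sketched as below. What counts as "optimal" is not spelled out here, so this sketch assumes the simplest rule: choose the backend with the most free capacity that can still fit the volume. The backend names and both functions are hypothetical.

```python
def pick_backend(backends, size_gb):
    """Return the backend with the most free space that fits the volume."""
    candidates = {
        name: free_gb for name, free_gb in backends.items() if free_gb >= size_gb
    }
    if not candidates:
        raise RuntimeError("no backend has enough free capacity")
    return max(candidates, key=candidates.get)


def create_volume(backends, volumes, name, size_gb):
    """Schedule a volume, reserve its capacity and record its state."""
    backend = pick_backend(backends, size_gb)
    backends[backend] -= size_gb
    volumes[name] = {"backend": backend, "size_gb": size_gb, "status": "available"}
    return volumes[name]


backends = {"ceph-1": 500, "lvm-1": 120}  # hypothetical free capacity in GB
volumes = {}
print(create_volume(backends, volumes, "data", 100))
```

In this model the volumes dictionary plays the role of the database that stores volume state, and the backends dictionary is the capacity view the scheduler would consult.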

Like Quantum, Cinder will mainly interact with Nova, providing volumes for its instances.

Future Projects

In the next version of OpenStack (“Havana” which is due in the Fall of 2013), two new projects will be brought into the fold:

  • Ceilometer is a metering project. The project offers metering information and the ability to code more ways to know what has happened on an OpenStack cloud. While it provides metering, it is not a billing project. A full billing solution requires metering, rating, and billing. Metering lets you know what actions have taken place, rating enables pricing and line items, and billing gathers the line items to create a bill to send to the consumer and collect payment. For users that also want a billing package, BillingStack is another open source project that provides payment gateway and other billing features. Ceilometer is available as a preview now.
  • Heat provides a REST API to orchestrate multiple cloud applications implementing standards such as AWS CloudFormation.
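The distinction between metering, rating and billing drawn in the Ceilometer item can be shown with a short pipeline. All data here is made up for illustration: the sample records, meter names and prices are assumptions, and none of this reflects Ceilometer's or BillingStack's actual data model.

```python
from collections import defaultdict

# Metering: raw records of what happened (hypothetical samples).
samples = [
    {"tenant": "acme", "meter": "instance_hours", "quantity": 10},
    {"tenant": "acme", "meter": "storage_gb_hours", "quantity": 200},
    {"tenant": "globex", "meter": "instance_hours", "quantity": 3},
]

# Rating input: a price per unit for each meter, in cents (made-up values).
PRICES_CENTS = {"instance_hours": 5, "storage_gb_hours": 1}


def rate(samples, prices):
    """Rating: turn metered quantities into priced line items."""
    return [
        {
            "tenant": sample["tenant"],
            "meter": sample["meter"],
            "cents": sample["quantity"] * prices[sample["meter"]],
        }
        for sample in samples
    ]


def bill(line_items):
    """Billing: gather line items into one total per tenant."""
    totals = defaultdict(int)
    for item in line_items:
        totals[item["tenant"]] += item["cents"]
    return dict(totals)


print(bill(rate(samples, PRICES_CENTS)))
```

Only the samples list corresponds to what a metering project supplies; the price table and the two functions belong to the rating and billing stages that sit outside it.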

Looking beyond the “Havana” release, OpenStack is slated to see the addition of two more projects in the Spring of 2014 (for the newly named “Icehouse” release):

  • Reddwarf is a database as a service offering that provides MySQL databases within OpenVZ containers upon demand.
  • Ironic is the aptly named project that uses OpenStack to deploy bare metal servers instead of virtualized cloud instances.

There are also a number of related but unofficial projects:

  • Moniker that provides DNS-as-a-service for OpenStack
  • Marconi which is a message queueing service
  • Savanna to provision Hadoop clusters on OpenStack
  • Murano that allows a non-experienced user to deploy reliable Windows based environments
  • Convection is a task or workflow service to execute command sequences or long running jobs

(Source: solinea.com)


Storm-YARN Released as Open Source

At Yahoo! we have worked on the convergence of Storm with Hadoop, as mentioned in our earlier post. We are pleased to announce that Storm-YARN has been released as open source. Storm-YARN enables Storm applications to utilize the computational resources in a Hadoop cluster along with accessing Hadoop storage resources such as HBase and HDFS.

Collocating real-time processing with batch processing offers a number of advantages over segregated clusters.

  • It provides a huge potential for elasticity. Real-time processing will rarely produce a constant and predictable load. As such, Storm needs more resources to keep up with spikes in demand. Collocating Storm with batch processing allows Storm to steal resources from batch jobs when needed and give them back when demand subsides. The Storm-YARN effort lays the groundwork to make this possible.
  • Many applications use Storm for low-latency processing and Map/Reduce for batch processing while sharing data between Storm and Map/Reduce. By placing Storm physically closer to the data source and/or other components in the same pipeline we can reduce network transfers and in turn the total cost of acquiring the data.

Launch Storm Cluster
To launch a Storm cluster managed by YARN, you simply execute:

storm-yarn launch <storm-yarn.yaml>

storm-yarn.yaml is the standard storm configuration file with YARN specification parameters including master.initial-num-supervisors (the initial number of supervisors to be launched) and master.container.size-mb (the memory size of the container to be allocated for each supervisor).
Figure 1 illustrates the execution of the storm-yarn command. Storm-YARN asks YARN’s Resource Manager to launch a Storm Application Master. The Application Master then launches a Storm nimbus server and a Storm UI server locally. It also uses YARN to find resources for the supervisors and launch them.
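As a small illustration of the two YARN-specific settings named above, the following sketch renders them as configuration lines and builds the launch command as an argument list. The values 2 and 1024 are arbitrary examples, the helper names are hypothetical, and the command is only printed, not executed.

```python
def render_config(num_supervisors, container_size_mb):
    """Render the two YARN-specific settings as YAML-style lines."""
    return (
        f"master.initial-num-supervisors: {num_supervisors}\n"
        f"master.container.size-mb: {container_size_mb}\n"
    )


def launch_command(config_path):
    """Build the argument list for the launch command shown earlier."""
    return ["storm-yarn", "launch", config_path]


# Example values only; choose numbers that match your cluster.
print(render_config(2, 1024), end="")
print(" ".join(launch_command("storm-yarn.yaml")))
```

These two lines would sit alongside the regular Storm settings in the same file, since storm-yarn.yaml is described as the standard Storm configuration file with the YARN parameters added.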

Figure 1: Launch Storm Cluster with Hadoop YARN

Execute Storm Topologies
You can communicate with the Storm cluster the same as with a standalone Storm cluster, through the storm command.

storm jar <topology_jar>

Because nimbus is running on a node picked by YARN, you may need to specify that node on the command line by setting the nimbus.host config.

As illustrated in Figure 2, each Storm supervisor will launch worker processes within its container. These Storm worker processes can access Hadoop datasets stored in HDFS, HBase, etc.

Figure 2: Submit and Execute Storm Topologies

Open Source Release
Yahoo! has decided to release Storm-YARN code under the Apache 2.0 License. The code is available at https://github.com/yahoo/storm-yarn. This alpha release enables members of the community to jointly make Storm-YARN a high-quality product. Please try it out and let us know what you think.

If you are interested in contributing, please feel free to submit proposals as issues, sign an Apache style CLA and contribute your code.

Additional details on Storm-YARN will be shared during our Storm-on-YARN: Convergence of Low-Latency and Big-Data talk at the 2013 Hadoop Summit North America on June 26, 2013, 11:20 am under the Future of Apache Hadoop track. We look forward to seeing you there.

Derek Dagit has implemented a significant portion of the Storm-YARN release. We thank him for making this early release available as open source.

Bobby Evans is a software developer at Yahoo! and a Hadoop PMC member at the Apache Software Foundation.

Andy Feng is a Distinguished Architect at Yahoo! and a Core Contributor to the Storm project. He leads the architecture design and development of the next-gen big-data platform, which supports a variety of application patterns (Batch, Microbatch, Streaming, Query).

(via Yahoo.com)