Linux Containers and the Future Cloud

Linux-based container infrastructure is an emerging cloud technology based on fast and lightweight process virtualization. It provides its users with an environment as close as possible to a standard Linux distribution. As opposed to para-virtualization solutions (Xen) and hardware virtualization solutions (KVM), which provide virtual machines (VMs), containers do not create other instances of the operating system kernel. Because containers are more lightweight than VMs, you can achieve higher densities with containers than with VMs on the same host (practically speaking, you can deploy more instances of containers than of VMs on the same host).

Another advantage of containers over VMs is that starting and shutting down a container is much faster than starting and shutting down a VM. All containers under a host run under the same kernel, as opposed to virtualization solutions like Xen or KVM where each VM runs its own kernel. Sometimes the constraint of running all containers on a given host under the same kernel can be considered a drawback. Moreover, you cannot run BSD, Solaris, OS X or Windows in a Linux-based container, and sometimes this fact also can be considered a drawback.

The idea of process-level virtualization in itself is not new, and it already was implemented by Solaris Zones as well as BSD jails quite a few years ago. Other open-source projects implementing process-level virtualization have existed for several years. However, they required custom kernels, which was often a major setback. Full and stable support for Linux-based containers on mainstream kernels by the LXC project is relatively recent, as you will see in this article. This makes containers more attractive for the cloud infrastructure. More and more hosting and cloud services companies are adopting Linux-based container solutions. In this article, I describe some open-source Linux-based container projects and the kernel features they use, and show some usage examples. I also describe the Docker tool for creating LXC containers.

The underlying infrastructure of modern Linux-based containers consists mainly of two kernel features: namespaces and cgroups. There are six types of namespaces, which provide per-process isolation of the following operating system resources: mount points (MNT), hostname and domain name (UTS), System V IPC (IPC), process IDs (PID), the network stack and users (user namespaces allow mapping of UIDs and GIDs between a user namespace and the global namespace of the host). By using network namespaces, for example, each process can have its own instance of the network stack (network interfaces, sockets, routing tables and routing rules, netfilter rules and so on).

Creating a network namespace is very simple and can be done with the following iproute command: ip netns add myns1. With the ip netns command, it also is easy to move a network interface from one network namespace to another, to monitor the creation and deletion of network namespaces, to find out to which network namespace a specified process belongs and so on. Quite similarly, when you mount a filesystem inside an MNT namespace, other processes will not see that mount, and when you run the ps command inside a PID namespace, you will see only the processes that were created in that PID namespace.
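
The following short sketch shows some of the most common ip netns operations; these need root privileges, and eth1 is only a placeholder for an interface that actually exists on your host:


ip netns add myns1                # create the namespace
ip netns list                     # list the named namespaces on the host
ip netns exec myns1 ip link show  # run a command inside the namespace
ip link set eth1 netns myns1      # move interface eth1 into myns1
ip netns monitor                  # watch namespace create/delete events
ip netns delete myns1             # remove the namespace

 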

The cgroups subsystem provides resource management and accounting. It lets you define easily, for example, the maximum memory that a process may use. This is done by using cgroups VFS operations. The cgroups project was started by two Google developers, Paul Menage and Rohit Seth, back in 2006, and it initially was called “process containers”. Neither namespaces nor cgroups intervene in critical paths of the kernel, and thus they do not incur a high performance penalty, except for the memory cgroup, which can incur significant overhead under some workloads.
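
For illustration, here is a minimal sketch of those VFS operations using the memory controller; the mount point and group name are only conventions and may differ on your distribution (many distributions mount the controllers for you already):


mount -t cgroup -o memory none /sys/fs/cgroup/memory  # skip if already mounted
mkdir /sys/fs/cgroup/memory/demo                      # create a new cgroup
echo 256M > /sys/fs/cgroup/memory/demo/memory.limit_in_bytes
echo $$ > /sys/fs/cgroup/memory/demo/tasks            # move the current shell into it

 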

Linux-Based Containers

Basically, a container is a Linux process (or several processes) that has special features and that runs in an isolated environment, configured on the host. You might sometimes encounter terms like Virtual Environment (VE) and Virtual Private Server (VPS) for a container.

The features of this container depend on how the container is configured and on which Linux-based container is used, as Linux-based containers are implemented differently in several projects. I mention the most important ones in this article:

  • OpenVZ: the origins of the OpenVZ project are in a proprietary server virtualization solution called Virtuozzo, which originally was started by a company called SWsoft, founded in 1997. In 2005, a part of the Virtuozzo product was released as an open-source project, and it was called OpenVZ. Later, in 2008, SWsoft merged with a company called Parallels. OpenVZ is used for providing hosting and cloud services, and it is the basis of the Parallels Cloud Server. Like Virtuozzo, OpenVZ also is based on a modified Linux kernel. In addition, it has command-line tools (primarily vzctl) for management of containers, and it makes use of templates to create containers for various Linux distributions. OpenVZ also can run on some unmodified kernels, but with a reduced feature set. The OpenVZ project is intended to be fully mainlined in the future, but that could take quite a long time.
  • Google containers: in 2013, Google released the open-source version of its container stack, lmctfy (which stands for Let Me Contain That For You). Right now, it’s still in the beta stage. The lmctfy project is based on using cgroups. Currently, Google containers do not use the kernel namespaces feature, which is used by other Linux-based container projects, but using this feature is on the Google container project roadmap.
  • Linux-VServer: an open-source project that was first publicly released in 2001, it provides a way to partition resources securely on a host. The host should run a modified kernel.
  • LXC: the LXC (LinuX Containers) project provides a set of userspace tools and utilities to manage Linux containers. Many LXC contributors are from the OpenVZ team. As opposed to OpenVZ, it runs on an unmodified kernel. LXC is fully written in userspace and supports bindings in other programming languages like Python, Lua and Go. It is available in most popular distributions, such as Fedora, Ubuntu, Debian and more. Red Hat Enterprise Linux 6 (RHEL 6) introduced Linux containers as a technical preview. You can run Linux containers on architectures other than x86, such as ARM (there are several how-tos on the Web for running containers on the Raspberry Pi, for example).

I also should mention the libvirt-lxc driver, with which you can manage containers. This is done by defining an XML configuration file and then running virsh start, virsh console and virsh destroy to run, access and destroy the container, respectively. Note that there is no common code between libvirt-lxc and the userspace LXC project.
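
As a rough sketch (the container name, memory size and init path below are only placeholders), a minimal libvirt-lxc domain definition could look like this:


<domain type='lxc'>
  <name>myContainer</name>
  <memory>262144</memory>
  <os>
    <type>exe</type>
    <init>/bin/sh</init>
  </os>
  <devices>
    <console type='pty'/>
  </devices>
</domain>


After saving this as myContainer.xml, the container is defined, started, accessed and destroyed with the libvirt LXC driver like so:


virsh -c lxc:/// define myContainer.xml
virsh -c lxc:/// start myContainer
virsh -c lxc:/// console myContainer
virsh -c lxc:/// destroy myContainer

 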

LXC Container Management

First, you should verify that your host supports LXC by running lxc-checkconfig. If everything is okay, you can create a container by using one of several ready-made templates for creating containers. In lxc-0.9, there are 11 such templates, mostly for popular Linux distributions. You easily can tailor these templates according to your requirements, if needed. So, for example, you can create a Fedora container called fedoraCT with:


lxc-create -t fedora -n fedoraCT

 

The container will be created by default under /var/lib/lxc/fedoraCT. You can set a different path for the generated container by adding the --lxcpath PATH option.

The -t option specifies the name of the template to be used (fedora in this case), and the -n option specifies the name of the container (fedoraCT in this case). Note that you also can create containers of other distributions on Fedora, for example of Ubuntu (you need the debootstrap package for it). Not all combinations are guaranteed to work.

You can pass parameters to lxc-create after adding --. For example, you can create an older release of several distributions with the -R or -r option, depending on the distribution template. To create an older Fedora container on a host running Fedora 20, you can run:


lxc-create -t fedora -n fedora19 -- -R 19

 

You can remove the installation of an LXC container from the filesystem with:


lxc-destroy -n fedoraCT

 

For most templates, when a template is used for the first time, several required package files are downloaded and cached on disk under /var/cache/lxc. These files are used when creating a new container with that same template, and as a result, creating a container that uses the same template will be faster next time.

You can start the container you created with:


lxc-start -n fedoraCT

 

And stop it with:


lxc-stop -n fedoraCT

 

The signal used by lxc-stop is SIGPWR by default. In order to use SIGKILL in the earlier example, you should add -k to lxc-stop:


lxc-stop -n fedoraCT -k

 

You also can start a container as a dæmon by adding -d, and then log in to it with lxc-console, like this:


lxc-start -d -n fedoraCT
lxc-console -n fedoraCT

 

The first lxc-console that you run for a given container will connect you to tty1. If tty1 already is in use (because that’s the second lxc-console that you run for that container), you will be connected to tty2 and so on. Keep in mind that the maximum number of ttys is configured by the lxc.tty entry in the container configuration file.

You can make a snapshot of a non-running container with:


lxc-snapshot -n fedoraCT

 

This will create a snapshot under /var/lib/lxcsnaps/fedoraCT. The first snapshot you create will be called snap0; the second one will be called snap1 and so on. You can restore the snapshot at a later time with the -r option—for example:


lxc-snapshot -n fedoraCT -r snap0 restoredFedoraCT

 

You can list the snapshots with:


lxc-snapshot -L -n fedoraCT

 

You can display the running containers by running:


lxc-ls --active

 

Managing containers also can be done via scripts, using scripting languages. For example, this short Python script starts the fedoraCT container:


#!/usr/bin/python3

import lxc

container = lxc.Container("fedoraCT")
container.start()

 

Container Configuration

A default config file is generated for every newly created container. This config file is created, by default, in /var/lib/lxc/<containerName>/config, but you can alter that using the --lxcpath PATH option. You can configure various container parameters, such as network parameters, cgroups parameters, device parameters and more. Here are some examples of popular configuration items for the container config file (a sample config snippet follows the list):

  • You can set various cgroups parameters by setting values to the lxc.cgroup.[subsystem name] entries in the config file. The subsystem name is the name of the cgroup controller. For example, configuring the maximum memory a container can use to be 256MB is done by setting lxc.cgroup.memory.limit_in_bytes to be 256MB.
  • You can configure the container hostname by setting lxc.utsname.
  • There are five types of network interfaces that you can set with the lxc.network.type parameter: empty, veth, vlan, macvlan and phys. Using veth is very common in order to be able to connect a container to the outside world. By using phys, you can move network interfaces from the host network namespace to the container network namespace.
  • There are features that can be used for hardening the security of LXC containers. You can prevent specified system calls from being called from within a container by setting a secure computing mode, or seccomp, policy with the lxc.seccomp entry in the configuration file. You also can remove capabilities from a container with the lxc.cap.drop entry. For example, setting lxc.cap.drop = sys_module will create a container without the CAP_SYS_MODULE capability. Trying to run insmod from inside this container will fail. You also can define AppArmor and SELinux profiles for your container. You can find examples in the LXC README and in man 5 lxc.conf.
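
Pulling several of these entries together, a container config might contain lines like the following sketch; the bridge name, the memory limit and the seccomp policy path are placeholders for values that make sense on your host:


lxc.utsname = fedoraCT
lxc.tty = 4

# network: a veth pair attached to a host bridge named br0 (placeholder)
lxc.network.type = veth
lxc.network.link = br0
lxc.network.flags = up

# cgroups: cap the container's memory usage at 256MB
lxc.cgroup.memory.limit_in_bytes = 256M

# hardening: drop module loading and apply a seccomp policy file
lxc.cap.drop = sys_module
lxc.seccomp = /etc/lxc/seccomp.policy

 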

Docker

Docker is an open-source project that automates the creation and deployment of containers. Docker was first released in March 2013 under the Apache License, Version 2.0. It started as an internal project at dotCloud, a Platform-as-a-Service (PaaS) company that has since been renamed Docker Inc. The initial prototype was written in Python; later, the whole project was rewritten in Go, a programming language first developed at Google. In September 2013, Red Hat announced that it would collaborate with Docker Inc. for Red Hat Enterprise Linux and for the Red Hat OpenShift platform. Docker requires Linux kernel 3.8 (or above). On RHEL systems, Docker runs on the 2.6.32 kernel, as the necessary patches have been backported.

Docker utilizes the LXC toolkit and as such is currently available only for Linux. It runs on distributions like Ubuntu 12.04, 13.04; Fedora 19 and 20; RHEL 6.5 and above; and on cloud platforms like Amazon EC2, Google Compute Engine and Rackspace.

Docker images can be stored on a public repository and can be downloaded with the docker pull command—for example, docker pull ubuntu or docker pull busybox.

To display the images available on your host, you can use the docker images command. You can narrow the command to a specific type of image (fedora, for example) with docker images fedora.

On Fedora, running a Fedora docker container is simple; after installing the docker-io package, you simply start the docker dæmon with systemctl start docker, and then you can start a Fedora docker container with docker run -i -t fedora /bin/bash.
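
Put together, the whole sequence on a Fedora host looks roughly like this (run as root, or prefix each command with sudo):


yum install docker-io
systemctl start docker
docker run -i -t fedora /bin/bash

 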

Docker has git-like capabilities for handling containers. Changes you make in a container are lost if you destroy the container, unless you commit your changes (much like you do in git) with docker commit <containerId> <containerName/containerTag>. These images can be uploaded to a public registry, and they are available for downloading by anyone who wants to download them. Alternatively, you can set up a private Docker repository.
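
For example, a minimal commit-and-push sequence might look like the following sketch; the container ID and the repository name here are only placeholders:


docker commit 1a2b3c4d5e6f myuser/myfedora
docker push myuser/myfedora

 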

Docker is able to create a snapshot using the kernel device mapper feature. In earlier versions, before Docker version 0.7, it was done using AUFS (a union filesystem). Docker 0.7 added “storage plugins”, so people can switch between device mapper and AUFS (if their kernel supports it), which allows Docker to run on RHEL releases that do not support AUFS.

You can create images by running commands manually and committing the resulting container, but you also can describe them with a Dockerfile. Just like a Makefile will compile code into a binary executable, a Dockerfile will build a ready-to-run container image from simple instructions. The command to build an image from a Dockerfile is docker build. There is a tutorial about Dockerfiles and their command syntax on the Docker Web site. For example, the following short Dockerfile is for installing the iperf package for a Fedora image:


FROM fedora
MAINTAINER Rami Rosen
RUN yum install -y iperf
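
 

Assuming the Dockerfile above is saved in the current directory, building and tagging the image is then a single command (the tag name here is just an example):


docker build -t fedora-iperf .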

 

You can upload and store your images on the Docker public index. Just like with GitHub, storing public images is free and only requires you to register an account.

The Checkpoint/Restore Feature

The CRIU (Checkpoint/Restore In Userspace) project is implemented mostly in userspace, and there are more than 100 little patches scattered in the kernel for supporting it. There were several attempts to implement Checkpoint/Restore solely in kernel space, some of them by the OpenVZ project. The kernel community rejected all of them, though, as they were too complex.

The Checkpoint/Restore feature enables saving a process state in several image files and restoring this process from the point at which it was frozen, on the same host or on a different host at a later time. This process also can be an LXC container. The image files are created using Google’s protocol buffer (PB) format. The Checkpoint/Restore feature enables performing maintenance tasks, such as upgrading a kernel or hardware maintenance on that host after checkpointing its applications to persistent storage. Later on, the applications are restored on that host.

Another feature that is very important in HPC is load balancing using live migration. The Checkpoint/Restore feature also can be used for creating incremental snapshots, which can be used after a crash occurs. As mentioned earlier, some kernel patches were needed for supporting CRIU; here are some of them:

  • A new system call named kcmp() was added; it compares two processes to determine if they share a kernel resource.
  • A socket monitoring interface called sock_diag was added to UNIX sockets in order to be able to find the peer of a UNIX domain socket. Before this change, the ss tool, which relied on parsing of /proc entries, did not show this information.
  • A TCP connection repair mode was added.
  • A new procfs entry was added (/proc/PID/map_files).

Let’s look at a simple example of using the criu tool. First, you should check whether your kernel supports Checkpoint/Restore, by running criu check --ms. Look for a response that says "Looks good."

Basically, checkpointing is done by:


criu dump -t <pid>

 

You can specify a folder where the process state files will be saved by adding -D folderName.

You can restore with criu restore <pid>.
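
Putting the two together, a checkpoint/restore round trip might look like the following sketch (run as root); the PID and the directory are placeholders, and the exact restore syntax varies somewhat between criu versions:


criu dump -t 1234 -D /tmp/ckpt  # freeze PID 1234 and write its image files to /tmp/ckpt
criu restore -D /tmp/ckpt       # later, recreate the process from those image files

 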

Summary

In this article, I’ve described what Linux-based containers are, and I briefly explained the underlying cgroups and namespaces kernel features. I have discussed some Linux-based container projects, focusing on the promising and popular LXC project. I also looked at the LXC-based Docker engine, which provides an easy and convenient way to create and deploy LXC containers. Several hands-on examples showed how simple it is to configure, manage and deploy LXC containers with the userspace LXC tools and the Docker tools.

Due to the advantages of the LXC and the Docker open-source projects, and due to the convenient and simple tools to create, deploy and configure LXC containers, as described in this article, we presumably will see more and more cloud infrastructures that will integrate LXC containers instead of using virtual machines in the near future. However, as explained in this article, solutions like Xen or KVM have several advantages over Linux-based containers and still are needed, so they probably will not disappear from the cloud infrastructure in the next few years.

Acknowledgements

Thanks to Jérôme Petazzoni from Docker Inc. and to Michael H. Warfield for reviewing this article.

Resources

Google Containers: https://github.com/google/lmctfy

OpenVZ: http://openvz.org/Main_Page

Linux-VServer: http://linux-vserver.org

LXC: http://linuxcontainers.org

libvirt-lxc: http://libvirt.org/drvlxc.html

Docker: https://www.docker.io

Docker Public Registry: https://index.docker.io

(Via LinuxJournal.com)

Docker – a Linux Container.

(Today I read an article about Docker. It's a good thing, and I want to share it with you.)

About Docker

Docker is an open-source engine that automates the deployment of any application as a lightweight, portable, self-sufficient container that will run virtually anywhere.

Docker containers can encapsulate any payload, and will run consistently on and between virtually any server. The same container that a developer builds and tests on a laptop will run at scale, in production, on VMs, bare-metal servers, OpenStack clusters, public instances, or combinations of the above.

Common use cases for Docker include:

  • Automating the packaging and deployment of applications
  • Creation of lightweight, private PAAS environments
  • Automated testing and continuous integration/deployment
  • Deploying and scaling web apps, databases and backend services

Background

Fifteen years ago, virtually all applications were written using well-defined stacks of services and deployed on a single monolithic, proprietary server. Today, developers build and assemble applications using a multiplicity of the best available services, and must be prepared for those applications to be deployed across a multiplicity of different hardware environments, including public, private, and virtualized servers.

Figure 1: The Evolution of IT

This sets up the possibility for:

  • Adverse interactions between different services and “dependency hell”
  • Challenges in rapidly migrating and scaling across different hardware
  • The impossibility of managing a matrix of multiple different services deployed across multiple different types of hardware

Figure 2: The Challenge of Multiple Stacks and Multiple Hardware Environments

Or, viewed as a matrix, we can see that there is a huge number of combinations and permutations of applications/services and hardware environments that need to be considered every time an application is written or rewritten. This creates a difficult situation for both the developers who are writing applications and the folks in operations who are trying to create a scalable, secure, and highly performant operations environment.

Figure 3: Dynamic Stacks and Dynamic Hardware Environments Create an NxN Matrix

How to solve this problem? A useful analogy can be drawn from the world of shipping. Before 1960, most cargo was shipped break bulk. Shippers and carriers alike needed to worry about bad interactions between different types of cargo (e.g. if a shipment of anvils fell on a sack of bananas). Similarly, transitions between different modes of transport were painful. Up to half the time to ship something could be taken up as ships were unloaded and reloaded in ports, and in waiting for the same shipment to get reloaded onto trains, trucks, etc. Along the way, losses due to damage and theft were large. And, there was an n X n matrix between a multiplicity of different goods and a multiplicity of different transport mechanisms.

Figure 4: Analogy: Shipping Pre-1960

Fortunately, an answer was found in the form of a standard shipping container. Any type of goods, from pistachios to Porsches, can be packaged inside a standard shipping container. The container can then be sealed, and not re-opened until it reaches its final destination. In between, the containers can be loaded and unloaded, stacked, transported, and efficiently moved over long distances. The transfer from ship to gantry crane to train to truck can be automated, without requiring a modification of the container. Many authors credit the shipping container with revolutionizing both transportation and world trade in general. Today, 18 million standard containers carry 90% of world trade.

Figure 5: Solution to Shipping Challenge Was a Standard Container

To some extent, Docker can be thought of as an intermodal shipping container system for code.

Figure 6: The Solution to Software Shipping is Also a Standard Container System

Docker enables any application and its dependencies to be packaged up as a lightweight, portable, self-sufficient container. Containers have standard operations, thus enabling automation. And, they are designed to run on virtually any Linux server. The same container that a developer builds and tests on a laptop will run at scale, in production, on VMs, bare-metal servers, OpenStack clusters, public instances, or combinations of the above.

In other words, developers can build their application once, and then know that it can run consistently anywhere. Operators can configure their servers once, and then know that they can run any application.

Why Should I Care (For Developers)

Build once…run anywhere

“Docker interests me because it allows simple environment isolation and repeatability. I can create a run-time environment once, package it up, then run it again on any other machine. Furthermore, everything that runs in that environment is isolated from the underlying host (much like a virtual machine). And best of all, everything is fast and simple.”

Why Should I Care (For Devops)

Configure once…run anything

  • Make the entire lifecycle more efficient, consistent, and repeatable
  • Increase the quality of code produced by developers
  • Eliminate inconsistencies between development, test, production, and customer environments
  • Support segregation of duties
  • Significantly improve the speed and reliability of continuous deployment and continuous integration systems
  • Because the containers are so lightweight, address significant performance, cost, deployment, and portability issues normally associated with VMs

What are the Main Features of Docker

It is useful to compare the main features of Docker to those of shipping containers. (See the analogy above).

  • Content Agnostic: a physical container can hold almost any kind of cargo; Docker can encapsulate any payload and its dependencies.
  • Hardware Agnostic: a physical container's standard shape and interface allow it to move from ship to train to semi-truck to warehouse to crane without being modified or opened; Docker uses operating system primitives (e.g. LXC) to run consistently on virtually any hardware (VMs, bare metal, OpenStack, public IaaS, etc.) without modification.
  • Content Isolation and Interaction: with physical containers there is no worry about anvils crushing bananas, and containers can be stacked and shipped together; Docker provides resource, network, and content isolation and avoids dependency hell.
  • Automation: standard interfaces make it easy to automate loading, unloading, moving, etc.; Docker offers standard operations to run, start, stop, commit, search, etc., perfect for devops (CI, CD, autoscaling, hybrid clouds).
  • Highly efficient: physical containers require no opening or modification and are quick to move between waypoints; Docker containers are lightweight, carry virtually no performance or start-up penalty, and are quick to move and manipulate.
  • Separation of duties: the shipper worries about the inside of the box and the carrier worries about the outside; with Docker, the developer worries about code and Ops worries about infrastructure.

Figure 7: Main Docker Features

For a more technical view of features, please see the following:

  • Filesystem isolation: each process container runs in a completely separate root filesystem.
  • Resource isolation: system resources like cpu and memory can be allocated differently to each process container, using cgroups.
  • Network isolation: each process container runs in its own network namespace, with a virtual interface and IP address of its own.
  • Copy-on-write: root filesystems are created using copy-on-write, which makes deployment extremely fast, memory-cheap and disk-cheap.
  • Logging: the standard streams (stdout/stderr/stdin) of each process container are collected and logged for real-time or batch retrieval.
  • Change management: changes to a container’s filesystem can be committed into a new image and re-used to create more containers. No templating or manual configuration required.
  • Interactive shell: docker can allocate a pseudo-tty and attach to the standard input of any container, for example to run a throwaway interactive shell.

What are the Basic Docker Functions

Docker makes it easy to build, modify, publish, search, and run containers. The diagram below should give you a good sense of the Docker basics. With Docker, a container comprises both an application and all of its dependencies. Containers can either be created manually or, if a source code repository contains a Dockerfile, automatically. Subsequent modifications to a baseline Docker image can be committed to a new image using the docker commit function and then pushed to a central registry.

Containers can be found in a Docker Registry (either public or private), using Docker Search. Containers can be pulled from the registry using Docker Pull and can be run, started, stopped, etc. using Docker Run commands. Notably, the target of a run command can be your own servers, public instances, or a combination.
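
In command form, that lifecycle looks roughly like this; the image and repository names are only examples:

docker search ubuntu                        # find images in a registry
docker pull ubuntu                          # download an image
docker run -i -t ubuntu /bin/bash           # run a container from it
docker commit <containerId> myuser/myimage  # commit your changes as a new image
docker push myuser/myimage                  # push the image to a registry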

Figure 8: Basic Docker Functions

For a full list of functions, please go to: http://docs.docker.io/en/latest/commandline/

Docker runs three ways:

  • as a daemon to manage LXC containers on your Linux host (sudo docker -d)
  • as a CLI which talks to the daemon’s REST API (docker run …)
  • as a client of repositories that let you share what you’ve built (docker pull, docker commit)
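
Those three modes correspond to invocations like these (the busybox image is just an example):

sudo docker -d                 # the daemon managing LXC containers on the host
docker run busybox echo hello  # the CLI talking to the daemon's REST API
docker pull busybox            # the client side of a shared image repository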

How Do Containers Work? (And How are they Different From VMs)

A container comprises an application and its dependencies. Containers serve to isolate processes, which run in userspace on the host’s operating system.

This differs significantly from traditional VMs. Traditional hardware virtualization (e.g. VMware, KVM, Xen, EC2) aims to create an entire virtual machine. Each virtualized application contains not only the application (which may be only tens of MB) and the binaries and libraries needed to run it, but also an entire guest operating system (which may measure in the tens of GB).

The picture below captures the difference.

Figure 9: Containers vs. Traditional VMs

Since all of the containers share the same operating system (and, where appropriate, binaries and libraries), they are significantly smaller than VMs, making it possible to store hundreds of containers on a physical host (versus a strictly limited number of VMs). In addition, since they utilize the host operating system, restarting a container does not mean restarting or rebooting an operating system. Thus, containers are much more portable and much more efficient for many use cases.

With Docker Containers, the efficiencies are even greater. With a traditional VM, each application, each copy of an application, and each slight modification of an application requires creating an entirely new VM.

As shown above, a new application on a host need only have the application and its binaries/libraries. There is no need for a new guest operating system.

If you want to run several copies of the same application on a host, you do not even need to copy the shared binaries.

Finally, if you make a modification of the application, you need only copy the differences.

Figure 10: Mechanism to Make Docker Containers Lightweight

This not only makes it efficient to store and run containers, it also makes it extremely easy to update applications. As shown in the next figure, updating a container only requires applying the differences.

Figure 11: Modifying and Updating Containers

What is the Relationship between Docker and dotCloud?

Docker is an open-source implementation of the deployment engine which powers dotCloud, a popular Platform-as-a-Service. It benefits directly from the experience accumulated over several years of large-scale operation and support of hundreds of thousands of applications and databases. dotCloud is the chief sponsor of the Docker project, and dotCloud’s CTO is the original architect and current overall maintainer. While several dotCloud employees work on Docker full-time, Docker is a true community project, with hundreds of non-dotCloud contributors and a completely open design philosophy. All pulls, pushes, forks, bugs, issues, and roadmaps are available for viewing, editing, and commenting on GitHub.

What Are Some Cool Use Cases For Docker?

Docker is a powerful tool for many different use cases. Here are some great early use cases for Docker, as described by members of our community.

  • Build your own PaaS: Dokku, a Docker-powered mini-Heroku and the smallest PaaS implementation you’ve ever seen (http://bit.ly/191Tgsx)
  • Web Based Environment for Instruction: JiffyLab, a web-based environment for the instruction, or lightweight use of, Python and the UNIX shell (http://bit.ly/12oaj2K)
  • Easy Application Deployment: Deploy Java Apps With Docker = Awesome (http://bit.ly/11BCvvu); Running Drupal on Docker (http://bit.ly/15MJS6B); Installing Redis on Docker (http://bit.ly/16EWOKh)
  • Create Secure Sandboxes: Docker makes creating secure sandboxes easier than ever (http://bit.ly/13mZGJH)
  • Create your own SaaS: Memcached as a Service (http://bit.ly/11nL8vh)
  • Automated Application Deployment: Push-button Deployment with Docker (http://bit.ly/1bTKZTo)
  • Continuous Integration and Deployment: Next Generation Continuous Integration & Deployment with dotCloud’s Docker and Strider (http://bit.ly/ZwTfoy)
  • Lightweight Desktop Virtualization: Docker Desktop: Your Desktop Over SSH Running Inside Of A Docker Container (http://bit.ly/14RYL6x)

More things you would like to know:

Getting started with Docker

Full instructions, code and documentation, along with an interactive tutorial, are available on the Docker Web site.

Getting a copy of the source code

The Docker project is hosted on GitHub.

Contribute to the Docker community

Head on over to the Docker community page.

(via Docker.io)

 

Advanced Hard Drive Caching Techniques

With the introduction of the solid-state Flash drive, performance came to the forefront for data storage technologies. Prior to that, software developers and server administrators needed to devise methods by which they could increase I/O throughput to storage, most of which resulted in low-capacity caching to random access memory (RAM) or a RAM drive. Although not as fast as RAM, the Flash drive was almost a dream come true, but it had its limitations—one of which was its low capacities packaged in the NAND-based chips. The traditional spinning disk drive provided the desired capacities but lacked speedy accessibility. Even with the 6Gb SATA protocol, sequential data access at best performed at approximately 150MB per second (or MB/s) for both read and write operations, while random access varied between 2–5MB/s, as the seeking across multiple sectors laid out in multiple tracks across multiple spinning platters proved to be an extremely disruptive bottleneck. The solid-state drive (SSD), with no movable components, significantly decreased these access latencies, thus rendering this bottleneck almost nonexistent.

Even today, the consumer SSD cannot compare to the capacities provided by the magnetic hard disk drive (or HDD), which is why in this article I intend to introduce readers to proven methods for obtaining near SSD performance with the traditional HDD. Multiple open-source projects exist that can achieve this, all but one of which utilizes an SSD as a caching node, and the other caches to RAM. The device drivers I cover here are dm-cache, FlashCache and the RapidDisk/RapidCache suite; I also briefly discuss bcache and EnhanceIO.

Note:

To build the kernel modules shown in this article, you need to have either the full kernel source or the kernel headers installed for your current kernel image revision.

In my examples, I am using a commercial SATA III (6Gbps) SSD with an average performance of the following:

  • Sequential read: 231MB/s
  • Sequential write: 74MB/s
  • Random read: 230MB/s
  • Random write: 72MB/s

This SSD provides the caching layer for a slower mechanical SATA III HDD that performs at the following:

  • Sequential read: 115MB/s
  • Sequential write: 72MB/s
  • Random read: 2MB/s
  • Random write: 2MB/s

In my environment, the SSD is labeled as /dev/sdb, and the HDD is /dev/sda3. These are non-intrusive transparent caching solutions intended to achieve the performance benefits of SSDs. They can be added and removed to existing storage targets without issue or data loss (assuming that all cached data has been flushed to disk successfully). Also, all the examples here showcase a write-back caching scheme with the exception of RapidCache, which instead will be used in write-through mode. In write-back mode, newly written data is cached but not immediately written to the destination target. Write-through mode always will write new data to the target while still maintaining it in cache for future reads.

Note:

The benchmarks shown here were obtained by using FIO, a file I/O benchmarking and test tool designed for data storage technologies. It is maintained by Linux kernel developer Jens Axboe. Unless noted otherwise, all captured I/O is written at the typical 4KB page size, asynchronously to the storage target 32 transfers at a time (that is, queue depth).
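
For reference, a typical invocation matching those parameters might look like the following sketch; the target directory, job name, file size and runtime are placeholders:

$ sudo fio --name=4k-randwrite --directory=/mnt/cache --size=1G \
       --rw=randwrite --bs=4k --ioengine=libaio --iodepth=32 \
       --direct=1 --runtime=60 --time_based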

dm-cache

dm-cache has been around for quite some time—at least since 2006. It originally made its debut as a research project developed by Dr Ming Zhao through his summer internship at IBM research. The dm-cache module just recently was integrated into the Linux kernel tree as of version 3.9. Whether you choose to enable it in a recently downloaded kernel or compile it from the official project site, the results will be the same. To load the module, you need to invoke modprobe or insmod:

$ sudo modprobe dm-cache

Now that the module is loaded, you need to inform that module about which drive to point to for the cache and which to point to for the destination. The dm-cache project site provides a Perl script called dmc-setup.pl to simplify this process. For example, if I wanted to use the entire SSD in write-back caching mode with a 4KB block size, I would type:

$ sudo perl dmc-setup.pl -o /dev/sda3 -c /dev/sdb -n cache -b 8 -w 

This script is a wrapper to the equivalent dmsetup command below:

$ echo 0 20971520 cache /dev/sda3 /dev/sdb 0 8 65536 16 1 | 
 ↪dmsetup create cache

The dm-cache documentation hosted on the project site provides details on each parameter field, so I don’t cover them here.

You may notice that in both examples, I named the mapping to both drives “cache”. So, when I need to access the drive mapping, I must refer to it as “cache”.

The following mapping passes all data requests to the caching driver, which in turn performs the necessary magic to process the requests either by handling it entirely out of cache or both the cache and the slower device:

$ ls -l /dev/mapper
total 0
lrwxrwxrwx 1 root root       7 Jun 30 12:10 cache -> ../dm-0
crw------- 1 root root 10, 236 Jun 30 11:52 control

Just like with any other device-mapper-enabled target, I also can pull up detailed mapping data:

$ sudo dmsetup status cache
0 20971520 cache stats: reads(83), writes(0), 
 ↪cache hits(0, 0.0),replacement(0), replaced dirty blocks(0)

$ sudo dmsetup table cache
0 20971520 cache conf: capacity(256M), associativity(16), 
 ↪block size(4K), write-back

If the target drive already is formatted with data on it, you just need to mount it; otherwise, format it to your specified filesystem:

$ sudo mke2fs -F /dev/mapper/cache 

Remember, these solutions are non-intrusive, so if you have existing data that needs to remain on that disk drive, skip the above step and go straight to mounting it for data accessibility:

$ sudo mount /dev/mapper/cache /mnt/cache
$ df | grep cache
/dev/mapper/cache  10321208 1072632   8724288  11% /mnt/cache

When using a benchmarking utility, the numbers will vary. On read operations, it is wholly dependent on whether the desired data resides in cache or whether the module needs to retrieve it from the slower disk. On write operations, it depends on the Flash technology itself, and whether it needs to go through a typical programmable erase (PE) cycle to write the new data. Regardless of this, the random read/write access to the slower drive has been increased significantly:

  • Sequential read: 105MB/s
  • Sequential write: 50MB/s
  • Random read: 67MB/s
  • Random write: 51MB/s

You can continue monitoring the cache status by typing:

$ sudo dmsetup status cache 
0 20971520 cache stats: reads(301319), writes(353216), 
 ↪cache hits(24485, 0.3),replacement(345972), 
 ↪replaced dirty blocks(92857)

To remove the cache mapping, unmount the drive and invoke dmsetup:

$ sudo umount /mnt/cache
$ sudo dmsetup remove cache

FlashCache

FlashCache is a project developed and maintained by Facebook. It was inspired by dm-cache. Much like dm-cache, it too is built from the device-mapper framework. It currently is hosted on GitHub and can be cloned from there. The repository encompasses the kernel module and administration utilities. Once built and installed, load the kernel module and in a similar fashion to the previous examples, create a mapping of the SSD and HDD:

$ sudo modprobe flashcache
$ sudo flashcache_create -p back -b 8 cache /dev/sdb /dev/sda3
cachedev cache, ssd_devname /dev/sdb, disk_devname /dev/sda3 
 ↪cache mode WRITE_BACK block_size 8, md_block_size 8, 
 ↪cache_size 0
FlashCache metadata will use 223MB of your 3944MB main memory

The flashcache_create administration utility is similar to the dmc-setup.pl Perl script used for dm-cache. It is a wrapper utility designed to simplify the dmsetup process. As with the dm-cache module, once the mapping has been created, you can view mapping details by typing:

$ sudo dmsetup table cache
0 20971520 flashcache conf:
    ssd dev (/dev/sdb), disk dev (/dev/sda3) cache mode(WRITE_BACK)
    capacity(57018M), associativity(512), data block size(4K) 
     ↪metadata block size(4096b)
    skip sequential thresh(0K)
    total blocks(14596608), cached blocks(83), cache percent(0)
    dirty blocks(0), dirty percent(0)
    nr_queued(0)
Size Hist: 4096:83 
$ sudo dmsetup status cache
0 20971520 flashcache stats: 
    reads(83), writes(0)
    read hits(0), read hit percent(0)
    write hits(0) write hit percent(0)
    dirty write hits(0) dirty write hit percent(0)
    replacement(0), write replacement(0)
    write invalidates(0), read invalidates(0)
    pending enqueues(0), pending inval(0)
    metadata dirties(0), metadata cleans(0)
    metadata batch(0) metadata ssd writes(0)
    cleanings(0) fallow cleanings(0)
    no room(0) front merge(0) back merge(0)
    disk reads(83), disk writes(0) ssd reads(0) ssd writes(83)
    uncached reads(0), uncached writes(0), uncached IO requeue(0)
    disk read errors(0), disk write errors(0) ssd read errors(0) 
     ↪ssd write errors(0)
    uncached sequential reads(0), uncached sequential writes(0)
    pid_adds(0), pid_dels(0), pid_drops(0) pid_expiry(0)

Mount the mapping for file accessibility:

$ sudo mount /dev/mapper/cache /mnt/cache

Using the same benchmarking utility, observe the differences between FlashCache and the previous module:

  • Sequential read: 284MB/s
  • Sequential write: 72MB/s
  • Random read: 284MB/s
  • Random write: 71MB/s

The numbers look more like the native SSD performance. However, I want to note that this article is not intended to prove that one solution performs better than the other, but instead to enlighten readers of the many methods you can use to accelerate data access to existing and slower configurations.

To unmount and remove the drive mapping, type the following in the terminal:

$ sudo umount /mnt/cache
$ sudo dmsetup remove /dev/mapper/cache

RapidDisk and RapidCache

Currently at version 2.9, RapidDisk is an advanced Linux RAM disk whose features include the capabilities to allocate RAM dynamically as a block device, use it as standalone disk drives, or even map it as caching nodes to slower local disk drives via RapidCache (the latter of which was inspired by FlashCache and uses the device-mapper framework). RAM handles the data storage by allocating memory pages as they are needed. It is a volatile form of storage, so if power is removed or if the computer is rebooted, all data stored within RAM will not be preserved. This is why the RapidCache module was designed to handle only read-through/write-through caching, which means that whatever is intended to be written to the slower storage device will be cached to RapidCache and written immediately to the hard drive. And, if data is being requested from the hard drive and it does not pre-exist in the RapidCache node, it will read the data from the slower device and then cache it to the RapidCache node. This method will retain the same write performance as the hard drive, but significantly increase sequential and random access read performance to cached data.

Once the package, which consists of two kernel modules and an administration utility, is built and installed, you need to insert the modules by typing the following on the command line:

$ sudo modprobe rxdsk
$ sudo modprobe rxcache

Let’s assume that you’re running on a computer that contains 4GB of RAM, and you confidently can say that at least 1GB of that RAM is never used by the operating system and its applications. Using RapidDisk to create a RAM drive of 1GB in size, you would type:

$ sudo rxadm --attach 1024

Remember, RapidDisk will not pre-allocate this storage. It will allocate RAM only as it is used.

A quick benchmark test of just the RAM drive produces some overwhelmingly fast results with 4KB I/O transfers:

  • Sequential read: 1.6GB/s
  • Sequential write: 1.6GB/s
  • Random read: 1.3GB/s
  • Random write: 1.1GB/s

It produces the following with 1MB I/O transfers:

  • Sequential read: 4.9GB/s
  • Sequential write: 4.3GB/s
  • Random read: 4.9GB/s
  • Random write: 4.0GB/s

Impressive, right? To utilize such a speedy RAM drive as a caching node to a slower drive, a mapping must be created, where /dev/rxd0 is the node used to access the RAM drive, and /dev/mapper/rxc0 is the node used to access the mapping of the two drives:

$ sudo rxadm --rxc-map rxd0 /dev/sda3 4

You can get a list of attached devices and mappings by typing:

$ sudo rxadm --list
rxadm 2.9
Copyright 2011-2013 Petros Koutoupis

List of rxdsk device(s):

 RapidDisk Device 1: rxd0
    Size: 1073741824

List of rxcache mapping(s):

 RapidCache Target 1: rxc0
0 20971519 rxcache conf:
    rxd dev (/dev/rxd0), disk dev (/dev/sda3) mode (WRITETHROUGH)
    capacity(1024M), associativity(512), block size(4K)
    total blocks(262144), cached blocks(0)
 Size Hist: 512:663 

As with the previous device-mapper-based solutions, you even can list detailed information of the mapping by typing:

$ sudo dmsetup table rxc0
0 20971519 rxcache conf:
    rxd dev (/dev/rxd0), disk dev (/dev/sda3) mode (WRITETHROUGH)
    capacity(1024M), associativity(512), block size(4K)
    total blocks(262144), cached blocks(0)
 Size Hist: 512:663 

$ sudo dmsetup status rxc0
0 20971519 rxcache stats: 
    reads(663), writes(0)
    cache hits(0) replacement(0), write replacement(0)
    read invalidates(0), write invalidates(0)
    uncached reads(663), uncached writes(0)
    disk reads(663), disk writes(0)
    cache reads(0), cache writes(0)

Format the mapping if needed and mount it:

$ sudo mount /dev/mapper/rxc0 /mnt/cache

A benchmark test produces the following results:

  • Sequential read: 794MB/s
  • Sequential write: 70MB/s
  • Random read: 901MB/s
  • Random write: 2MB/s

Notice that the write performance is not very great, and that’s because it is not meant to be. Write-through mode promises only faster read performance of cached data and consistent write performance to the original drive. The read performance, however, shows significant improvement when accessing cached data.

To remove the mapping and detach the RAM drive, type the following:

$ sudo umount /mnt/cache
$ sudo rxadm --rxc-unmap rxc0
$ sudo rxadm --detach rxd0

Other Solutions Worth Mentioning

bcache:

bcache is relatively new to the hard drive caching scene. It offers all the same features and functionalities as the previous solutions, with the exception that it can map one or more SSDs as the cache for one or more HDDs instead of one volume to one volume. The project’s maintainer does, however, tout its superiority over the other solutions when it comes to data access performance from the cache. From what I can tell, bcache is unlike the previous solutions in that it does not rely on the device-mapper framework and instead is a standalone module. At the time of this writing, it is set to be integrated into release 3.10 of the Linux kernel tree. Unfortunately, I haven’t had the opportunity or the appropriate setup to test bcache. As a result, I haven’t been able to dive any deeper into this solution and benchmark its performance.

EnhanceIO:

EnhanceIO is an SSD caching solution produced by STEC, Inc., and hosted on GitHub. It was greatly inspired by the work done by Facebook for FlashCache, and although it’s open-source, a commercial version is offered by the company for those seeking additional support. STEC did not simply modify a few lines of code of FlashCache and republish it. Instead, STEC rewrote the write-back caching logic while also improving other areas, such as memory footprint, failure handling and more. As with bcache, I haven’t had the opportunity to install and test EnhanceIO.

Summary

These solutions are intended to provide users with near SSD speeds and HDD capacities at a significantly reduced cost. From the data center to your home office, these solutions can be deployed almost anywhere. They also can be tuned to operate more appropriately in their intended environments. Some of them even offer a variety of caching algorithm options, such as Least Recently Used (LRU), Most Recently Used (MRU), hybrids of the two or just a simple first-in first-out (FIFO) caching scheme. The first three options can be expensive regarding performance, as they require the tracking of cached data sets for what has been accessed and how recently in order to determine whether to discard it. FIFO, however, functions as a circular buffer in which the oldest cached data set will be discarded first. With the exception of RapidCache, the SSD-focused modules also preserve metadata of the cache to ensure that any disruptions, including power cycles/outages, don’t compromise the integrity of the data.

Resources

dm-cache: http://visa.cs.fiu.edu/tiki/dm-cache

FlashCache: https://github.com/facebook/flashcache

EnhanceIO: https://github.com/stec-inc/EnhanceIO

bcache: http://bcache.evilpiepirate.org

RapidDisk: http://www.rapiddisk.org

FIO Git Repository: http://git.kernel.dk/?p=fio.git;a=summary

Wikipedia Page on Caching Algorithms: http://en.wikipedia.org/wiki/Cache_algorithms

 (source: LinuxJournal.com)

Time-Saving Tricks on the Command Line

I remember the first time a friend of mine introduced me to Linux and showed me how I didn’t need to type commands and path names fully—I could just start typing and use the Tab key to complete the rest. That was so cool. I think everybody loves Tab completion because it’s something you use pretty much every minute you spend in the shell. Over time, I discovered many more shortcuts and time-saving tricks, many of which I have come to use almost as frequently as Tab completion.

In this article, I highlight a set of tricks for common situations that make a huge difference for me:

  • Working in screen sessions: core features that will get you a long way.
  • Editing the command line: moving around quickly and editing quickly.
  • Viewing files or man pages using less.
  • E-mailing yourself relevant log snippets or alerts triggered by events.

While reading the article, it would be best to have a terminal window open so you can try using the tips right away. All the tips should work in Linux, UNIX and similar systems without any configuration.

Working in Screen Sessions

Screen has been covered in Linux Journal before (see Resources), but to put it simply, screen lets you have multiple “windows” within a single terminal application. The best part is that you can detach and reattach to a running screen session at any time, so you can continue your previous work exactly where you left off. This is most useful when working on a remote server.

Luckily, you really don’t need to master screen to benefit from it greatly. You already can enjoy its most useful benefits by using just a few key features, namely the following:

  • screen -R projectx: reattach to the screen session named “projectx” or create it fresh now.
  • Ctrl-a c: create a new window.
  • Ctrl-a n: switch to the next window.
  • Ctrl-a p: switch to the previous window.
  • Ctrl-a 0: switch to the first window; use Ctrl-a 1 for the second window, and so on.
  • Ctrl-a w: view the list of windows.
  • Ctrl-a d: detach from this screen session.
  • screen -ls: view the list of screen sessions.

Note: in the above list, “Ctrl-a c” means pressing the Ctrl and a keys at the same time, followed by c. Ctrl-a is called the command key, and all screen commands start with this key sequence.

Let me show all of these in the context of a realistic example: debugging a Django Web site on my remote hosting server, which usually involves the following activities:

  • Editing the configuration file.
  • Running some commands (performing Django operations).
  • Restarting the Web site.
  • Viewing the Web site logs.

Of course, I could do all these things one by one, but it’s a lot more practical to have multiple windows open for each. I could use multiple real terminal windows, but reopening them every time I need to do this kind of work would be tedious and slow. Screen can make this much faster and easier.

Starting Screen:

Before you start screen, it’s good to navigate to the directory where you expect to do most of your work first. This is because new windows within screen will all start in that directory. In my example, I first navigate to my Django project’s directory, so that when I open new screen windows, the relevant files will be right there in front of me.

There are different ways of starting screen, but I recommend this one:


screen -R mysite

 

When you run this the first time, it creates a screen session named “mysite”. Later you can use this same command to reconnect to this session again. (The -R flag stands for reattach.)

Creating Windows:

Now that I’m in screen, let’s say I start editing the configuration of the Django Web site:


vim mysite/settings.py

 

Let’s say I made some changes, and now I want to restart the site. I could exit vim or put it in the background in order to run the command to restart the site, but I anticipate I will need to make further changes right here. It’s easier just to create a new window now, using the screen command Ctrl-a c.

It’s easy to create another window every time you start doing something different from your current activity. This is especially useful when you need to change the directory between commands. For example, if you have script files in /some/long/path/scripts and log files in /other/long/path/logs, then instead of jumping between directories, just keep a separate window for each.

In this example, first I started looking at the configuration files. Next, I wanted to restart the Web site. Then I wanted to run some Django commands, and then I wanted to look at the logs. All these are activities I tend to do many times per debugging session, so it makes sense to create a separate window for each activity.

The cost of creating a new window is so small, you can do it without thinking. Don’t interrupt your current activity; fire up another window with Ctrl-a c and rock on.

Switching between Windows:

The windows you create in screen are numbered starting from zero. You can switch to a window by its number—for example, jump to the first window with Ctrl-a 0, the second window with Ctrl-a 1 and so on. It’s also very convenient to switch to the next and previous windows with Ctrl-a n and Ctrl-a p, respectively.

Listing Your Windows:

If you’re starting to lose track of which window you are in, check the list of windows with Ctrl-a w or Ctrl-a “. The former shows the list of windows in the status line (at the bottom) of the screen, showing the current window marked with a *. The latter shows the list of windows in a more user-friendly format as a menu.

Detaching from and Reattaching to a Session:

The best time-saving feature of screen is reattaching to existing sessions. You can detach cleanly from the current screen session with Ctrl-a d. But you don’t really need to. You could just as well simply close the terminal window.

The great thing about screen sessions is that whatever way you disconnected from them, you can reattach later. At the end of the day, you can shut down your local PC without closing a remote screen session and come back to it the next day by running the same command you used to start it, as in this example with screen -R mysite.

You might have multiple screen sessions running for different purposes. You can list them all with:


screen -ls

 

If you are disconnected from screen abruptly, sometimes it may think you are still in an attached state, which will prevent you from reattaching with the usual command screen -R label. In that case, you can append a -D flag to force detach from any existing connections—for example:


screen -R label -D

 

Learning More about Screen:

If you want to learn more, see the man page and the links in the Resources section. The built-in cheat sheet of shortcuts also comes handy, and you can view it with Ctrl-a ?.

I also should mention one of screen’s competitors: tmux. I chose screen in this article because in my experience, it is more available in systems I cannot control. You can do everything I covered above with tmux as well. Use whichever is available in the remote system in which you find yourself.

Finally, you can get the most out of screen when working on a remote system—for example, over an SSH session. When working locally, it’s probably more practical to use a terminal application with tabs. That’s not exactly the same thing, but probably close enough.

Editing the Command Line

Many highly practical shortcuts can make you faster and more efficient on the command line in different ways:

  • Find and re-run or edit a long and complex command from the history.
  • Edit much more quickly than just using the backspace key and retyping text.
  • Move around much faster than just using the left- and right-arrow keys.

Finding a Command in the History:

If you want to repeat a command you executed recently, it may be easy enough just to press the up-arrow key a few times until you find it. If the command was more than only a few steps ago though, this becomes unwieldy. Very often, it’s much more practical to use the Ctrl-r shortcut instead to find a specific command by a fragment.

To search for a command in the past, press Ctrl-r and start typing any fragment you remember from it. As you type, the most recent matching line will appear on the command line. This is an incremental search, which means you can keep typing or deleting letters, and the matched command will change dynamically.

Let’s try this with an example. Say I ran these commands yesterday, which means they are still in my recent history but too far away simply to use the up arrow:


...
cd ~/dev/git/github/bashoneliners/
. ~/virtualenv/bashoneliners/bin/activate
./run.sh pip install --upgrade django
git push beta master:beta
git push release master:release
git status
...

 

Let’s say I want to activate the virtualenv again. That’s a hassle to type again, because I have to type at least a few characters at each path segment, even with Tab completion. Instead, it’s a lot easier to press Ctrl-r and start typing “activate”.

For a slightly more complex example, let’s say I want to run a git push command again, but I don’t remember exactly which one. So I press Ctrl-r and start typing “push”. This will match the most recent command, but I actually want the one before that, and I don’t remember a better fragment to type. The solution is to press Ctrl-r again, in the middle of my current search, as that jumps to the next matching command.

This is extremely useful, saving not only typing time, but often thinking time as well. Imagine one of those long one-liners where you processed a text file through a long sequence of pipes with sed, awk, Perl and whatnot; or an rsync command with many flags, filters and exclusions; or complex loops using “for” and “while”. You can bring those back to your command line quickly using Ctrl-r and some fragment you remember from them.

Here are a few other things to note:

  • The search is case-sensitive.
  • You can abort the search with Ctrl-c.
  • To edit the line before running it, press any of the arrow keys.

This trick can be even more useful if you pick up some new habits. For example, when referring to a path you use often, type the absolute path rather than a relative path. That way, the command will be reusable later from any directory.

Moving Around Quickly and Editing Quickly:

Basic editing on the command line involves moving around with the arrow keys and deleting characters with Backspace or Delete. When there are more than only a few characters to move or delete, using these basic keys is just too slow. You can do the same much faster by knowing just a handful of interesting shortcuts:

  • Ctrl-w: cut text backward until space.
  • Esc-Backspace: cut one word backward.
  • Esc-Delete: cut one word forward.
  • Ctrl-k: cut from current position until the end of the line.
  • Ctrl-y: paste the most recently cut text.

Not only is it faster to delete portions of a line chunk by chunk like this, but an added bonus is that text deleted this way is saved in a register so that you can paste it later if needed. Take, for example, the following sequence of commands:


git init --bare /path/to/repo.git
git remote add origin /path/to/repo.git

 

Notice that the second command uses the same path at the end. Instead of typing that path twice, you could copy and paste it from the first command, using this sequence of keystrokes:

  1. Press the up arrow to bring back the previous command.
  2. Press Ctrl-w to cut the path part: “/path/to/repo.git”.
  3. Press Ctrl-c to cancel the current command.
  4. Type git remote add origin, and press Ctrl-y to paste the path.

Some of the editing shortcuts are more useful in combination with moving shortcuts:

  • Ctrl-a: jump to the beginning of the line.
  • Ctrl-e: jump to the end of the line.
  • Esc-b: jump one word backward.
  • Esc-f: jump one word forward.

Jumping to the beginning is very useful if you mistype the first words of a long command. You can jump to the beginning much faster than with the left-arrow key.

Jumping forward and backward is very practical when editing the middle part of a long command, such as the middle of long path segments.

Putting It All Together:

A good starting point for learning these little tricks is to stop some old inefficient habits:

  • Don’t clear the command line with the Backspace key. Use Ctrl-c instead.
  • Don’t delete long arguments with the Backspace key. Use Ctrl-w instead.
  • Don’t move to the beginning or the end of the line using the left- and right-arrow keys. Jump with Ctrl-a and Ctrl-e instead.
  • Don’t move over long terms using the arrow keys. Jump over terms with Esc-b and Esc-f instead.
  • Don’t press the up arrow 20 times to find a not-so-recent previous command. Jump to it directly with Ctrl-r instead.
  • Don’t type anything twice on the same line. Copy it once with Ctrl-w, and reuse it many times with Ctrl-y instead.

Once you get the hang of it, you will start to see more and more situations where you can combine these shortcuts in interesting ways and minimize your typing.

Learning More about Command-Line Editing:

If you want to learn more, see the bash man page and search for “READLINE”, “Commands for Moving” and “Commands for Changing Text”.

Viewing Files or man Pages with less

The less command is a very handy tool for viewing files, and it’s the default application for viewing man pages in many modern systems. It has many highly practical shortcuts that can make you faster and more efficient in different ways:

  • Searching forward and backward.
  • Moving around quickly.
  • Placing markers and jumping to markers.

Searching Forward and Backward:

You can search forward for some text by typing / followed by the pattern to search for. To search backward, use ? instead of /. The search pattern can be a basic regular expression. If your terminal supports it, the search results are highlighted with inverted foreground and background colors.

You can jump to the next result by pressing n, and to the previous result by pressing N. The direction of next and previous is relative to the direction of the search itself. That is, when searching forward with /, pressing n will move you forward in the file, and when searching backward with ?, pressing n will move you backward in the file.

If you use the vim editor, you should feel right at home, as these shortcuts work the same way as in vim.

Searching is case-sensitive by default, unless you specify the -i flag when starting less. When reading a file, you can toggle between case-sensitive and insensitive modes by typing -i.

Moving Around Quickly:

Here are a couple shortcuts to help you move around quickly:

  • g: jump to the beginning of the file.
  • G: jump to the end of the file.
  • space: move forward by one window.
  • b: move backward by one window.
  • d: move down by a half-window.
  • u: move up by a half-window.

Using Markers:

Markers are extremely useful in situations when you need to jump between two or more different parts within the same file repeatedly.

For example, let’s say you are viewing a server log with initialization information near the beginning of the file and some errors somewhere in the middle. You need to switch between the two parts while trying to figure out what’s going on, but using search repeatedly to find the relevant parts is very inconvenient.

A good solution is to place markers at the two locations so you can jump to them directly. Markers work much as in the vim editor: you can mark the current position by pressing m followed by a lowercase letter, and you can jump to a marker by pressing ' followed by the same letter. In this example, I would mark the initialization part with mi and the part with the error with me, so that I could jump to them easily with 'i and 'e. I chose the letters as the initials of what the locations represent, so I can remember them easily.

Learning More Shortcuts:

If you are interested in more, see the man page for the less command. The built-in cheat sheet of shortcuts also comes in handy; you can view it by pressing h.

E-mailing Yourself

When working on a remote server, getting data back to your PC can be inconvenient sometimes—for example, when your PC is NAT-ed and the server cannot connect to it directly with rsync or scp. A quick alternative might be sending data by e-mail instead.

Another good scenario for e-mailing yourself is to use alerts triggered by something you were waiting for, such as a crashed server coming back on-line or other particular system events.

E-mailing a Log Snippet:

Let’s say you found the log of errors crashing your remote service, and you would like to copy it to your PC quickly. Let’s further assume the relevant log spans multiple pages, so it would be inconvenient to copy and paste it from the terminal window. Let’s say you can extract the relevant part using a combination of the head, tail and grep commands. You could save the log snippet in a file and run rsync on your local PC to copy it, or you could just mail it to yourself by simply piping it to this command:


mailx -s 'error logs' me@example.com

 

Depending on your system, the mailx command might be different, but the parameters are probably the same: -s specifies the subject (optional), the remaining arguments are the destination e-mail addresses, and the standard input is used as the message body.
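
As a concrete sketch, assuming the relevant errors live in a hypothetical /var/log/myservice.log, the extraction and the mail can be combined into a single pipeline:

# Hypothetical log path and pattern; adjust to your own service
grep -i 'error' /var/log/myservice.log | tail -n 200 | \
    mailx -s 'error logs' me@example.com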

Triggering an E-mail Alert after a Long Task

When you run a long task, such as copying a large file, it can be annoying to wait and keep checking whether it has finished. It’s better to arrange to trigger an e-mail to yourself when the copying is complete—for example:


the_long_task; date | mailx -s 'job done' me@example.com

 

That is, when the long task has completed, the e-mail command will run. In this example, the message body simply will be the output of the date command. In a real situation, you probably will want to use something more interesting and relevant as the message—for example ls -lh on the file that was copied or even multiple commands grouped together like this:


the_long_task; { df -h; tail some.log; } | \
    mailx -s 'job done' me@example.com

 

Triggering an E-mail Alert by Any Kind of Event:

Have you ever been in one of the following situations?

  • You are waiting for crashed serverX to come back on-line.
  • You are tailing a server log, waiting for a user to test your new evolution, which will trigger a particular entry in the log.
  • You are waiting for another team to deploy an updated .jar file.

Instead of staring at the screen or checking repeatedly whether the event you are waiting for has happened, you could use this kind of one-liner:


while :; do date; CONDITION && break; sleep 300; \
done; MAILME

 

This is essentially an infinite loop, with an appropriate CONDITION in the middle to exit the loop and, thus, trigger the e-mail command. Inside the loop, I print the date, just so that I can see the loop is alive, and sleep for five minutes (300 seconds) in each cycle to avoid overloading the machine I’m on.

CONDITION can be any shell command, and its exit code will determine whether the loop should exit. For the situations outlined above, you could write the CONDITION like this:

  • ping -c1 serverX: emit a single ping to serverX. If it responds, ping will exit with success, ending the loop.
  • grep pattern /path/to/log: search for the expected pattern in the log. If the pattern is found, grep will exit with success, ending the loop.
  • find /path/to/jar -newer /path/to/jar.marker: this assumes that before starting the infinite loop, you created a marker file like this: touch -r /path/to/jar /path/to/jar.marker in order to save a copy of the exact same timestamp as the .jar file you want to monitor. The find command will exit with success after the .jar file has been updated.
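
For example, plugging the ping condition from the list above into the loop skeleton gives a complete one-liner (serverX and the recipient address are placeholders):

# Placeholder host and address; the date calls just show the loop is alive
while :; do date; ping -c1 serverX && break; sleep 300; done; \
    echo 'serverX is back on-line' | mailx -s 'serverX is up' me@example.com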

In short, don’t wait for a long-running task or some external event. Set up an infinite loop, and alert yourself by e-mail when there is something interesting to see.

Conclusion

All the tips in this article are standard features and should work on Linux, UNIX and similar systems. I have barely scratched the surface here, highlighting the minimal set of features in each area that should provide the biggest bang for your buck. Once you get used to using them, these little tricks will make you a real ninja in the shell, jumping around and getting things done lightning fast with minimal typing.

Resources

Type man screen.

Type man bash, and search for “READLINE”, “Commands for Moving” and “Commands for Changing Text”.

Type man less.

(via Linuxjournal.com)

Queueing in the Linux Network Stack

Packet queues are a core component of any network stack or device. They allow for asynchronous modules to communicate, increase performance and have the side effect of impacting latency. This article aims to explain where IP packets are queued on the transmit path of the Linux network stack, how interesting new latency-reducing features, such as BQL, operate and how to control buffering for reduced latency.

Figure 1. Simplified High-Level Overview of the Queues on the Transmit Path of the Linux Network Stack

Driver Queue (aka Ring Buffer)

Between the IP stack and the network interface controller (NIC) lies the driver queue. This queue typically is implemented as a first-in, first-out (FIFO) ring buffer (http://en.wikipedia.org/wiki/Circular_buffer)—just think of it as a fixed-sized buffer. The driver queue does not contain the packet data. Instead, it consists of descriptors that point to other data structures called socket kernel buffers (SKBs, http://vger.kernel.org/%7Edavem/skb.html), which hold the packet data and are used throughout the kernel.

Figure 2. Partially Full Driver Queue with Descriptors Pointing to SKBs

The input source for the driver queue is the IP stack that queues IP packets. The packets may be generated locally or received on one NIC to be routed out another when the device is functioning as an IP router. Packets added to the driver queue by the IP stack are dequeued by the hardware driver and sent across a data bus to the NIC hardware for transmission.

The reason the driver queue exists is to ensure that whenever the system has data to transmit it is available to the NIC for immediate transmission. That is, the driver queue gives the IP stack a location to queue data asynchronously from the operation of the hardware. An alternative design would be for the NIC to ask the IP stack for data whenever the physical medium is ready to transmit. Because responding to this request cannot be instantaneous, this design wastes valuable transmission opportunities resulting in lower throughput. The opposite of this design approach would be for the IP stack to wait after a packet is created until the hardware is ready to transmit. This also is not ideal, because the IP stack cannot move on to other work.

Huge Packets from the Stack

Most NICs have a fixed maximum transmission unit (MTU), which is the biggest frame that can be transmitted by the physical media. For Ethernet, the default MTU is 1,500 bytes, but some Ethernet networks support Jumbo Frames (http://en.wikipedia.org/wiki/Jumbo_frame) of up to 9,000 bytes. Inside the IP network stack, the MTU can manifest as a limit on the size of the packets that are sent to the device for transmission. For example, if an application writes 2,000 bytes to a TCP socket, the IP stack needs to create two IP packets to keep the packet size less than or equal to a 1,500 MTU. For large data transfers, the comparably small MTU causes a large number of small packets to be created and transferred through the driver queue.

In order to avoid the overhead associated with a large number of packets on the transmit path, the Linux kernel implements several optimizations: TCP segmentation offload (TSO), UDP fragmentation offload (UFO) and generic segmentation offload (GSO). All of these optimizations allow the IP stack to create packets that are larger than the MTU of the outgoing NIC. For IPv4, packets as large as the IPv4 maximum of 65,536 bytes can be created and queued to the driver queue. In the case of TSO and UFO, the NIC hardware takes responsibility for breaking the single large packet into packets small enough to be transmitted on the physical interface. For NICs without hardware support, GSO performs the same operation in software immediately before queueing to the driver queue.

Recall from earlier that the driver queue contains a fixed number of descriptors that each point to packets of varying sizes. Since TSO, UFO and GSO allow for much larger packets, these optimizations have the side effect of greatly increasing the number of bytes that can be queued in the driver queue. Figure 3 illustrates this concept in contrast with Figure 2.

Figure 3. Large packets can be sent to the NIC when TSO, UFO or GSO are enabled. This can greatly increase the number of bytes in the driver queue.

Although the focus of this article is the transmit path, it is worth noting that Linux has receive-side optimizations that operate similarly to TSO, UFO and GSO and share the goal of reducing per-packet overhead. Specifically, generic receive offload (GRO, http://vger.kernel.org/%7Edavem/cgi-bin/blog.cgi/2010/08/30) allows the NIC driver to combine received packets into a single large packet that is then passed to the IP stack. When the device forwards these large packets, GRO allows the original packets to be reconstructed, which is necessary to maintain the end-to-end nature of the IP packet flow. However, there is one side effect: when the large packet is broken up, it results in several packets for the flow being queued at once. This “micro-burst” of packets can negatively impact inter-flow latency.

Starvation and Latency

Despite its necessity and benefits, the queue between the IP stack and the hardware introduces two problems: starvation and latency.

If the NIC driver wakes to pull packets off of the queue for transmission and the queue is empty, the hardware will miss a transmission opportunity, thereby reducing the throughput of the system. This is referred to as starvation. Note that an empty queue when the system does not have anything to transmit is not starvation—this is normal. The complication associated with avoiding starvation is that the IP stack that is filling the queue and the hardware driver draining the queue run asynchronously. Worse, the duration between fill or drain events varies with the load on the system and external conditions, such as the network interface’s physical medium. For example, on a busy system, the IP stack will get fewer opportunities to add packets to the queue, which increases the chances that the hardware will drain the queue before more packets are queued. For this reason, it is advantageous to have a very large queue to reduce the probability of starvation and ensure high throughput.

Although a large queue is necessary for a busy system to maintain high throughput, it has the downside of allowing for the introduction of a large amount of latency.

Figure 4 shows a driver queue that is almost full with TCP segments for a single high-bandwidth, bulk traffic flow (blue). Queued last is a packet from a VoIP or gaming flow (yellow). Interactive applications like VoIP or gaming typically emit small packets at fixed intervals that are latency-sensitive, while a high-bandwidth data transfer generates a higher packet rate and larger packets. This higher packet rate can fill the queue between interactive packets, causing the transmission of the interactive packet to be delayed.

Figure 4. Interactive Packet (Yellow) behind Bulk Flow Packets (Blue)

To illustrate this behaviour further, consider a scenario based on the following assumptions:

  • A network interface that is capable of transmitting at 5 Mbit/sec or 5,000,000 bits/sec.
  • Each packet from the bulk flow is 1,500 bytes or 12,000 bits.
  • Each packet from the interactive flow is 500 bytes.
  • The depth of the queue is 128 descriptors.
  • There are 127 bulk data packets and one interactive packet queued last.

Given the above assumptions, the time required to drain the 127 bulk packets and create a transmission opportunity for the interactive packet is (127 * 12,000) / 5,000,000 = 0.304 seconds (304 milliseconds for those who think of latency in terms of ping results). This amount of latency is well beyond what is acceptable for interactive applications, and this does not even represent the complete round-trip time—it is only the time required to transmit the packets queued before the interactive one. As described earlier, the size of the packets in the driver queue can be larger than 1,500 bytes, if TSO, UFO or GSO are enabled. This makes the latency problem correspondingly worse.

Large latencies introduced by over-sized, unmanaged queues is known as Bufferbloat (http://en.wikipedia.org/wiki/Bufferbloat). For a more detailed explanation of this phenomenon, see the Resources for this article.

As the above discussion illustrates, choosing the correct size for the driver queue is a Goldilocks problem—it can’t be too small, or throughput suffers; it can’t be too big, or latency suffers.

Byte Queue Limits (BQL)

Byte Queue Limits (BQL) is a new feature in recent Linux kernels (> 3.3.0) that attempts to solve the problem of driver queue sizing automatically. This is accomplished by adding a layer that enables and disables queueing to the driver queue based on calculating the minimum queue size required to avoid starvation under the current system conditions. Recall from earlier that the smaller the amount of queued data, the lower the maximum latency experienced by queued packets.

It is key to understand that the actual size of the driver queue is not changed by BQL. Rather, BQL calculates a limit of how much data (in bytes) can be queued at the current time. Any bytes over this limit must be held or dropped by the layers above the driver queue.

A real-world example may help provide a sense of how much BQL affects the amount of data that can be queued. On one of the author’s servers, the driver queue size defaults to 256 descriptors. Since the Ethernet MTU is 1,500 bytes, this means up to 256 * 1,500 = 384,000 bytes can be queued to the driver queue (TSO, GSO and so forth are disabled, or this would be much higher). However, the limit value calculated by BQL is 3,012 bytes. As you can see, BQL greatly constrains the amount of data that can be queued.

BQL reduces network latency by limiting the amount of data in the driver queue to the minimum required to avoid starvation. It also has the important side effect of moving the point where most packets are queued from the driver queue, which is a simple FIFO, to the queueing discipline (QDisc) layer, which is capable of implementing much more complicated queueing strategies.

Queueing Disciplines (QDisc)

The driver queue is a simple first-in, first-out (FIFO) queue. It treats all packets equally and has no capabilities for distinguishing between packets of different flows. This design keeps the NIC driver software simple and fast. Note that more advanced Ethernet and most wireless NICs support multiple independent transmission queues, but similarly, each of these queues is typically a FIFO. A higher layer is responsible for choosing which transmission queue to use.

Sandwiched between the IP stack and the driver queue is the queueing discipline (QDisc) layer (Figure 1). This layer implements the traffic management capabilities of the Linux kernel, which include traffic classification, prioritization and rate shaping. The QDisc layer is configured through the somewhat opaque tc command. There are three key concepts to understand in the QDisc layer: QDiscs, classes and filters.

The QDisc is the Linux abstraction for traffic queues, which are more complex than the standard FIFO queue. This interface allows the QDisc to carry out complex queue management behaviors without requiring the IP stack or the NIC driver to be modified. By default, every network interface is assigned a pfifo_fast QDisc (http://lartc.org/howto/lartc.qdisc.classless.html), which implements a simple three-band prioritization scheme based on the TOS bits. Despite being the default, the pfifo_fast QDisc is far from the best choice, because it defaults to having very deep queues (see txqueuelen below) and is not flow aware.

The second concept, which is closely related to the QDisc, is the class. Individual QDiscs may implement classes in order to handle subsets of the traffic differently. For example, the Hierarchical Token Bucket (HTB, http://lartc.org/manpages/tc-htb.html) QDisc allows the user to configure multiple classes, each with a different bitrate, and direct traffic to each as desired. Not all QDiscs have support for multiple classes. Those that do are referred to as classful QDiscs, and those that do not are referred to as classless QDiscs.

Filters (also called classifiers) are the mechanism used to direct traffic to a particular QDisc or class. There are many different filters of varying complexity. The u32 filter (http://www.lartc.org/lartc.html#LARTC.ADV-FILTER.U32) is the most generic, and the flow filter is the easiest to use.
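
To make these three concepts concrete, here is a small sketch: a classful HTB QDisc with two classes and a u32 filter that steers SSH traffic into the slower class (eth0, the rates and the port are purely illustrative):

# Classful HTB root QDisc; unclassified traffic falls into class 1:20
tc qdisc add dev eth0 root handle 1: htb default 20

# Two classes with different rates
tc class add dev eth0 parent 1: classid 1:10 htb rate 1mbit
tc class add dev eth0 parent 1: classid 1:20 htb rate 4mbit

# u32 filter: packets with destination port 22 go to class 1:10
tc filter add dev eth0 parent 1: protocol ip prio 1 u32 \
    match ip dport 22 0xffff flowid 1:10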

Buffering between the Transport Layer and the Queueing Disciplines

In looking at the figures for this article, you may have noticed that there are no packet queues above the QDisc layer. The network stack places packets directly into the QDisc or else pushes back on the upper layers (for example, socket buffer) if the queue is full. The obvious question that follows is what happens when the stack has a lot of data to send? This can occur as the result of a TCP connection with a large congestion window or, even worse, an application sending UDP packets as fast as it can. The answer is that for a QDisc with a single queue, the same problem outlined in Figure 4 for the driver queue occurs. That is, the high-bandwidth or high-packet rate flow can consume all of the space in the queue causing packet loss and adding significant latency to other flows. Because Linux defaults to the pfifo_fast QDisc, which effectively has a single queue (most traffic is marked with TOS=0), this phenomenon is not uncommon.

As of Linux 3.6.0, the Linux kernel has a feature called TCP Small Queues that aims to solve this problem for TCP. TCP Small Queues adds a per-TCP-flow limit on the number of bytes that can be queued in the QDisc and driver queue at any one time. This has the interesting side effect of causing the kernel to push back on the application earlier, which allows the application to prioritize writes to the socket more effectively. At the time of this writing, it is still possible for single flows from other transport protocols to flood the QDisc layer.

Another partial solution to the transport layer flood problem, which is transport-layer-agnostic, is to use a QDisc that has many queues, ideally one per network flow. Both the Stochastic Fairness Queueing (SFQ, http://crpppc19.epfl.ch/cgi-bin/man/man2html?8+tc-sfq) and Fair Queueing with Controlled Delay (fq_codel, http://linuxmanpages.net/manpages/fedora18/man8/tc-fq_codel.8.html) QDiscs fit this problem nicely, as they effectively have a queue per network flow.
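
Replacing the default pfifo_fast with one of these is a one-liner; a sketch, assuming the interface is eth0:

tc qdisc replace dev eth0 root fq_codel
tc -s qdisc show dev eth0    # verify the change and watch the queue statistics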

How to Manipulate the Queue Sizes in Linux

Driver Queue:

The ethtool command (http://linuxmanpages.net/manpages/fedora12/man8/ethtool.8.html) is used to control the driver queue size for Ethernet devices. ethtool also provides low-level interface statistics as well as the ability to enable and disable IP stack and driver features.

The -g flag to ethtool displays the driver queue (ring) parameters:


# ethtool -g eth0
Ring parameters for eth0:
Pre-set maximums:
RX:        16384
RX Mini:   0
RX Jumbo:  0
TX:        16384
Current hardware settings:
RX:        512
RX Mini:   0
RX Jumbo:  0
TX:        256

 

You can see from the above output that the driver for this NIC defaults to 256 descriptors in the transmission queue. Early in the Bufferbloat investigation, it often was recommended to reduce the size of the driver queue in order to reduce latency. With the introduction of BQL (assuming your NIC driver supports it), there no longer is any reason to modify the driver queue size (see below for how to configure BQL).

ethtool also allows you to view and manage optimization features, such as TSO, GSO, UFO and GRO, via the -k and -K flags. The -k flag displays the current offload settings, and -K modifies them.

As discussed above, some optimization features greatly increase the number of bytes that can be queued in the driver queue. You should disable these optimizations if you want to optimize for latency over throughput. It’s doubtful you will notice any CPU impact or throughput decrease when disabling these features unless the system is handling very high data rates.
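
For example, a sketch of inspecting and disabling the offloads on a hypothetical eth0 (exact feature names vary slightly between drivers):

ethtool -k eth0 | grep -E 'segmentation|offload'   # current offload settings
ethtool -K eth0 tso off gso off gro off            # favour latency over throughput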

Byte Queue Limits (BQL):

The BQL algorithm is self-tuning, so you probably don’t need to modify its configuration. BQL state and configuration can be found in a /sys directory based on the location and name of the NIC. For example: /sys/devices/pci0000:00/0000:00:14.0/net/eth0/queues/tx-0/byte_queue_limits.
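
The /sys/class/net symlink usually is the easiest way to reach these files; a sketch, assuming eth0 and queue tx-0:

cd /sys/class/net/eth0/queues/tx-0/byte_queue_limits
ls           # typically: hold_time  inflight  limit  limit_max  limit_min
cat limit    # the current BQL limit in bytes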

To place a hard upper limit on the number of bytes that can be queued, write the new value to the limit_max file:


echo "3000" > limit_max

 

What Is txqueuelen?

Often in early Bufferbloat discussions, the idea of statically reducing the NIC transmission queue was mentioned. The txqueuelen field in the ifconfig command’s output or the qlen field in the ip command’s output shows the current size of the transmission queue:


$ ifconfig eth0
eth0    Link encap:Ethernet  HWaddr 00:18:F3:51:44:10 
       inet addr:69.41.199.58  Bcast:69.41.199.63  Mask:255.255.255.248
       inet6 addr: fe80::218:f3ff:fe51:4410/64 Scope:Link
       UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
       RX packets:435033 errors:0 dropped:0 overruns:0 frame:0
       TX packets:429919 errors:0 dropped:0 overruns:0 carrier:0
       collisions:0 txqueuelen:1000 
       RX bytes:65651219 (62.6 MiB)  TX bytes:132143593 (126.0 MiB)
       Interrupt:23

$ ip link
1: lo:  mtu 16436 qdisc noqueue state UNKNOWN 
   link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: eth0:  mtu 1500 qdisc pfifo_fast state UP qlen 1000
   link/ether 00:18:f3:51:44:10 brd ff:ff:ff:ff:ff:ff

 

The length of the transmission queue in Linux defaults to 1,000 packets, which is a large amount of buffering, especially at low bandwidths.

The interesting question is what queue does this value control? One might guess that it controls the driver queue size, but in reality, it serves as a default queue length for some of the QDiscs. Most important, it is the default queue length for the pfifo_fast QDisc, which is itself the default. The “limit” argument on the tc command line can be used to ignore the txqueuelen default, as in the sketch below.
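
For instance, a plain pfifo QDisc with an explicit 100-packet limit sidesteps txqueuelen entirely (eth0 and the value are illustrative):

tc qdisc replace dev eth0 root pfifo limit 100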

The length of the transmission queue is configured with the ip or ifconfig commands:


ip link set txqueuelen 500 dev eth0

 

Queueing Disciplines:

As introduced earlier, the Linux kernel has a large number of queueing disciplines (QDiscs), each of which implements its own packet queues and behaviour. Describing the details of how to configure each of the QDiscs is beyond the scope of this article. For full details, see the tc man page (man tc). You can find details for each QDisc in man tc qdisc-name (for example, man tc htb or man tc fq_codel).

TCP Small Queues:

The per-socket TCP queue limit can be viewed and controlled with the following /proc file: /proc/sys/net/ipv4/tcp_limit_output_bytes.
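
A quick way to inspect the current value, as a sketch:

sysctl net.ipv4.tcp_limit_output_bytes
# equivalently:
cat /proc/sys/net/ipv4/tcp_limit_output_bytes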

You should not need to modify this value in any normal situation.

Oversized Queues Outside Your Control

Unfortunately, not all of the over-sized queues that will affect your Internet performance are under your control. Most commonly, the problem will lie in the device that attaches to your service provider (such as DSL or cable modem) or in the service provider’s equipment itself. In the latter case, there isn’t much you can do, because it is difficult to control the traffic that is sent toward you. However, in the upstream direction, you can shape the traffic to slightly below the link rate. This will stop the queue in the device from having more than a few packets. Many residential home routers have a rate limit setting that can be used to shape below the link rate. Of course, if you use Linux on your home gateway, you can take advantage of the QDisc features to optimize further. There are many examples of tc scripts on-line to help get you started.
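
As a minimal sketch, assuming a 1 Mbit/s uplink on eth0, a simple token bucket filter shapes egress to just below the link rate so the modem’s queue stays short:

# Shape to 900 kbit/s, slightly under the assumed 1 Mbit/s uplink
tc qdisc add dev eth0 root tbf rate 900kbit burst 1600 latency 50ms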

Summary

Queueing in packet buffers is a necessary component of any packet network, both within a device and across network elements. Properly managing the size of these buffers is critical to achieving good network latency, especially under load. Although static queue sizing can play a role in decreasing latency, the real solution is intelligent management of the amount of queued data. This is best accomplished through dynamic schemes, such as BQL and active queue management (AQM, http://en.wikipedia.org/wiki/Active_queue_management) techniques like CoDel. This article has outlined where packets are queued in the Linux network stack and how queueing-related features are configured, and it has provided some guidance on achieving low latency.

Acknowledgements

Thanks to Kevin Mason, Simon Barber, Lucas Fontes and Rami Rosen for reviewing this article and providing helpful feedback.

Resources

Controlling Queue Delay: http://queue.acm.org/detail.cfm?id=2209336

Bufferbloat: Dark Buffers in the Internet:http://cacm.acm.org/magazines/2012/1/144810-bufferbloat/fulltext

Bufferbloat Project: http://www.bufferbloat.net

Linux Advanced Routing and Traffic Control How-To (LARTC):http://www.lartc.org/howto

(source Linuxjournal.com)

 

Linux Kernel 3.10 Released

Linus Torvalds has announced the release of Linux Kernel 3.10:

So I delayed this by a day, considering whether to do another -rc, but decided that there wasn’t enough upside. Sure, it hasn’t been as quiet as I’d like, and we had this long discussion about an inode list locking scalability issue over the last week or two, but in the end that issue turned out to not be new, and while we may end up back-porting the eventual resolution to 3.10, it wasn’t a reason to delay the release.

Similarly, while I might wish for fewer pull requests during the late rc’s (and particularly the ones that came in Friday evening -inconvenient for a weekend release), at some point delaying things doesn’t really help things, and just makes the pent up demand for the next merge window worse.

In other words, I could really have gone either way, but decided that there wasn’t enough reason to break the normal pattern of “rc7 is the last rc before the release”. So here goes..

The appended changelog is (as usual) just the changes since the last rc. This time mainly from the networking pull (which includes drivers and core networking, as well as bluetooth), the rest were pretty small and scattered. We’ve got some arch updates, some acpi/pm fixes, and a scattering of other random fixes..

In the bigger picture (ie since 3.9) this release has been pretty typical and not particularly prone to problems, despite my waffling about the exact release date. As usual, the bulk patch-wise is all drivers (pretty much exactly two thirds), while the rest is evenly split between arch updates and “misc”. No major new subsystems this
time around, although there are individual new features. As usual, I’m sure H-Online and kernelnewbies will do better writeups of the details..

Linus

Linux 3.9 brought more filesystem enhancements for Btrfs, XFS and ext4, including better LZO compression, improvements to power management and ARM SoC support, and got rid of CONFIG_EXPERIMENTAL.

Linux 3.10 brings the following key changes:

  • Timer-free multitasking (nearly tickless operation) – Up to now, Linux used preemptive multitasking, in which a hardware timer fires at regular intervals (“ticks”) and can forcefully pause any program to run an OS routine that decides which task should run next. This multitasking method poses problems for the CPUs of laptops and mobile devices, which require inactivity to enter low-power modes. Because preemptive multitasking fires the timer often (1,000 times per second in a typical Linux kernel) even when the system is not doing anything, the CPUs could not save as much power as was possible. Virtualization added even more problems, since each VM runs its own timer. This Linux release adds support for not firing the timer (running tickless) even when tasks are running. It’s not actually fully tickless in this release: the timer still fires once per second, full tickless mode is disabled when a CPU runs more than one process, and one CPU must be kept running with full ticks to allow the other CPUs to go into tickless mode. You can read ‘(Nearly) full tickless operation in 3.10‘ and the Documentation for details.
  • Bcache, a block layer cache for SSD caching – Bcache allows SSDs to cache other block devices. It does writeback caching (in addition to write-through caching) and is filesystem-agnostic. By default, it won’t cache sequential I/O, just random reads and writes. It can be used for desktops, servers, high-end storage arrays and perhaps even embedded systems. For more details, read the documentation or visit the wiki.
  • Btrfs: smaller extents – Btrfs has incorporated a new key type for metadata extent references that uses disk space more efficiently, reducing each extent reference from 51 bytes to 33 bytes per tree block. In practice, this results in a 30-35% decrease in the size of the extent tree, which means fewer copy-on-write operations and larger parts of the extent tree kept in memory, making heavy metadata operations go much faster. It can be enabled with mkfs or with btrfstune -x.
  • XFS metadata checksums – An experimental implementation of metadata CRC32c checksums. These metadata checksums are part of a bigger project to implement what the XFS developers have called “self-describing metadata”, which aims to solve verification scalability (fsck takes too long to verify petabyte-scale filesystems with billions of inodes). This feature is experimental and requires using experimental xfsprogs. For more information, you can read the metadata Documentation.
  • SysV IPC scalability improvements – Linux used to lock overly large ranges, and it used a single IPC lock per IPC semaphore array. Most workloads never cared, but some did. This release splits out the locking and adds per-semaphore locks for greater scalability of the IPC semaphore code. Microbenchmarks show improvements of more than 10x in some cases.
  • rwsem locking scalability improvements – The rwsem (“reader-writer semaphore”) locking scheme, used in many places in the Linux kernel, had performance problems because of strict, serialized, FIFO sequential write-ownership of the semaphore. In Linux 3.9, an “opportunistic lock stealing” patch was merged to fix it for the slow path; in 3.10, opportunistic lock stealing has been implemented in the fast path as well, improving the performance of pgbench by double digits in some cases.
  • mutex locking scalability improvements – The mutex locking scheme, used widely in the Linux kernel, has received some scalability improvements thanks to the use of fewer atomic operations and some queuing changes that reduce cacheline contention.
  • TCP optimization: tail loss probe – This release adds the TCP Tail Loss Probe algorithm, which aims to reduce the tail latency of short transactions.
  • ARM big.LITTLE support – Support for b.L processing has been added to 3.10. See commit.
  • MIPS KVM support – KVM/MIPS supports MIPS32R2 and beyond. Read the release notes for details. See commit.
  • tracing: tracing snapshots, stack tracing – The tracing framework has got the ability to allow several tracing buffers, which can be used to take snapshots of the main tracing buffer. These tracing snapshots can be triggered manually or with function probes. It’s also possible to cause a stack trace to be traced in the ring buffer when a given function is called.

Further details on Linux 3.10 are available on Kernelnewbies.org.

(source: cnx-software.com)

 

Paper: MegaPipe: A New Programming Interface For Scalable Network I/O

The paper MegaPipe: A New Programming Interface for Scalable Network I/O (video, slides) hits the common theme that if you want to go faster, you need a better car design, not just a better driver. So that’s why the authors started with a clean slate and designed a network API from the ground up with support for concurrent I/O, a requirement for achieving high performance while scaling to large numbers of connections per thread, multiple cores, etc. What they created is MegaPipe, “a new network programming API for message-oriented workloads to avoid the performance issues of BSD Socket API.”

The result: MegaPipe outperforms baseline Linux between 29% (for long connections) and 582% (for short connections). MegaPipe improves the performance of a modified version of memcached between 15% and 320%. For a workload based on real-world HTTP traces, MegaPipe boosts the throughput of nginx by 75%.

What’s this most excellent and interesting paper about?

Message-oriented network workloads, where connections are short and/or message sizes are small, are CPU intensive and scale poorly on multi-core systems with the BSD Socket API. We present MegaPipe, a new API for efficient, scalable network I/O for message-oriented workloads. The design of MegaPipe centers around the abstraction of a channel, a per-core, bidirectional pipe between the kernel and user space, used to exchange both I/O requests and event notifications. On top of the channel abstraction, we introduce three key concepts of MegaPipe: partitioning, lightweight socket (lwsocket), and batching.

We implement MegaPipe in Linux and adapt memcached and nginx. Our results show that, by embracing a clean-slate design approach, MegaPipe is able to exploit new opportunities for improved performance and ease of programmability. In microbenchmarks on an 8-core server with 64 B messages, MegaPipe outperforms baseline Linux between 29% (for long connections) and 582% (for short connections). MegaPipe improves the performance of a modified version of memcached between 15% and 320%. For a workload based on real-world HTTP traces, MegaPipe boosts the throughput of nginx by 75%.

Performance with Small Messages:

Small messages result in greater relative network I/O overhead in comparison to larger messages. In fact, the per-message overhead remains roughly constant and thus, independent of message size; in comparison with a 64 B message, a 1 KiB message adds only about 2% overhead due to the copying between user and kernel on our system, despite the large size difference.

Partitioned listening sockets:

Instead of a single listening socket shared across cores, MegaPipe allows applications to clone a listening socket and partition its associated queue across cores. Such partitioning improves performance with multiple cores while giving applications control over their use of parallelism.

Lightweight sockets:

Sockets are represented by file descriptors and hence inherit some unnecessary file-related overheads. MegaPipe instead introduces lwsocket, a lightweight socket abstraction that is not wrapped in file-related data structures and thus is free from system-wide synchronization.

System Call Batching:

MegaPipe amortizes system call overheads by batching asynchronous I/O requests and completion notifications within a channel.

(Source: HighScalability.com)