Linux 3.9 Release

Linus Torvalds has announced the release of Linux Kernel 3.9:

So the last week was much quieter than the preceding ones, which makes me suspect that one reason -rc7 was bigger than I liked was that people were gaming the system and had timed some of their pull requests for just before the release, explaining why -rc7 was big enough that I didn’t  actually want to do a final release last week. Please don’t do that.

Anyway. Whatever the reason, this week has been very quiet, which makes me much more comfortable doing the final 3.9 release, so I guess the last -rc8 ended up working. Because not only aren’t there very many commits here, even the ones that made it really are tiny and not pretty obscure and not very interesting.

Also, this obviously means that the merge window is open. I won’t be merging anything today, but if you start sending me your pull requests (Konrad already sent in his Xen pull request for the 3.10 merge window a week ago), tomorrow the flood gates start opening..

Linux 3.8 brought file systems enhancement for Btrfs, XFS and ext-4, introduce F2FS file system, memory management improvement, and the removal for i386 support.

Linux 3.9 brings the following key changes (Sources H-Online and Phoronix):

  • File-system improvements:
    • Btrfs has experimental RAID5/6 support and fsync performance improvements.
    • EXT-4 bug fixes for corruption issue, and JBD2 journaling layer issue which affected performance.
    • Various improvements for F2FS file-system.
  • Added ability to use SSDs as hard-disk cache.
  • Update the latest LZO compression implementation within the kernel. Decompression and compression performance has been massively improved and x86 and ARM targets.
  • Improved power management, including “lightweight suspend” (aka “suspend freeze”) mode.
  • Improved ARM SoC support
    • Added Nvidia Tegra 4 support including support for Dalmore and Pluto development boards.
    • Added Nvidia Tegra 3 Beaver Board support
    • Kernel-based Virtual Machine (KVM) support for ARMv7 (Cortex A15 required)
  • Mainlining of Google’s Goldfish virtual CPU.
  • Initial ARC Linux support. See commit.
  • Added support for Imagination Meta ATP (Meta 1) and HTP (Meta 2)
  • Graphics drivers updates – Nouveau, the open-source reverse-engineered NVIDIA Linux graphics driver, is faster for some Linux OpenGL games. There’s also some Intel OpenGL performance changes.
  • Support for Intel 7000 Wi-Fi components supporting 802.11ac.
  • CONFIG_EXPERIMENTAL kernel configuration option has been removed

More details about Linux 3.9 will be available on Kernelnewbies.org (which is currently down).

(via CNXSoft blog)

Three Ways to Web Server Concurrency

Multiprocessing, multithreading and evented I/O: the trade-offs in Web servers.

A Web server needs to support concurrency. The server should service clients in a timely, fair manner to ensure that no client starves because some other client causes the server to hang. Multiprocessing and multithreading, and hybrids of these, are traditional ways to achieve concurrency. Node.js represents another way, one based on system libraries for asynchronous I/O, such as epoll (Linux) and kqueue (FreeBSD). To highlight the trade-offs among the approaches, I have three echo servers written in close-to-the-metal C: a forking_server, a threading_server and a polling_server.

Shared Code

The Web servers use utils.c (Listing 1). The function error_msg prints messages and optionally terminates the server; announce_client dumps information about a connection; and generate_echo_response creates a syntactically correct HTTP response.

=

Listing 1. utils.c

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <fcntl.h>
#include “utils.h”

void error_msg(const char* msg, bool halt_flag) {
perror(msg);
if (halt_flag) exit(-1);
}

/* listening socket */
int create_server_socket(bool non_blocking) {
/* Modify as needed. */
const int port = 3000;

struct sockaddr_in server_addr;

/* create, bind, listen */
int sock = socket(AF_INET, /* family */
SOCK_STREAM, /* TCP */
0);
if (socket < 0) error_msg(“Problem with socket call”, true);

/* non-blocking? */
if (non_blocking) fcntl(sock, F_SETFL, O_NONBLOCK);

/* bind */
bzero(&server_addr, sizeof(server_addr));
server_addr.sin_family = AF_INET;
server_addr.sin_addr.s_addr = INADDR_ANY;
server_addr.sin_port = htons(port); /* host to network endian */
if (bind(sock, (struct sockaddr*) &server_addr,
↪sizeof(server_addr)) < 0)
error_msg(“Problem with bind call”, true);

/* listen */
fprintf(stderr, “Listening for requests on port %i…\n”, port);
if (listen(sock, BACKLOG) < 0)
error_msg(“Problem with listen call”, true);

return sock;
}

void announce_client(struct in_addr* addr) {
char buffer[BUFF_SIZE + 1];

inet_ntop(AF_INET, addr, buffer, sizeof(buffer));
fprintf(stderr, “Client connected from %s…\n”, buffer);
}

void generate_echo_response(char request[ ], char response[ ]) {
strcpy(response, “HTTP/1.1 200 OK\n”);
strcat(response, “Content-Type: text/*\n”);
strcat(response, “Accept-Ranges: bytes\n”);
strcat(response, “Connection: close\n\n”);
strcat(response, request);
}

The central function is create_server_socket, which creates a blocking or a nonblocking listening socket. This function invokes three system functions:

  • socket — create socket.
  • bind — set port.
  • listen — await connections.

The first call creates a TCP-based socket, and the bind call then specifies the port number at which the Web server awaits connections. The listen call starts the waiting for up to BACKLOG connections:

 if (listen(sock, BACKLOG) < 0) /* BACKLOG == 12 */
 error_msg("Problem with listen call", true); 

A Multiprocessing Server

The forking_server in Listing 2 supports concurrency through multiprocessing, an approach that early Web servers, such as Apache 1 used to launch Web applications written as, for example, C or Perl scripts. Separate processes handle separate connections. This approach is hardly outdated, although modern servers, such as Apache 2, typically combine multiprocessing and multithreading.

Listing 2. forking_server.c

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <netinet/in.h>
#include <signal.h>
#include “utils.h”

int main() {
/* Avoid zombies. */
signal(SIGCHLD, SIG_IGN);

char buffer[BUFF_SIZE + 1];

struct sockaddr_in client_addr;
socklen_t len = sizeof(struct sockaddr_in);

/* listening socket */
int sock = create_server_socket(false);

/* connections + requests */
while (true) {
int client = accept(sock,
(struct sockaddr*) &client_addr,
&len);
if (client < 0) error_msg(“Problem with accept call”, true);

announce_client(&client_addr.sin_addr);

/* fork child */
pid_t pid = fork();
if (pid < 0) error_msg(“Problem with fork call”, false);

/* 0 to child, child’s PID to parent */
if (0 == pid) { /** child **/
close(sock); /* child’s listening socket */

/* request */
bzero(buffer, sizeof(buffer));
int bytes_read = recv(client, buffer, sizeof(buffer), 0);
if (bytes_read < 0) error_msg(“Problem with
↪recv call”, false);

/* response */
char response[BUFF_SIZE * 2];
bzero(response, sizeof(response));
generate_echo_response(buffer, response);
int bytes_written = send(client, response,
↪strlen(response), 0);
if (bytes_written < 0) error_msg(“Problem with
↪send call”, false);

close(client);
exit(0); /* terminate */
}
else /** parent **/
close(client); /* parent’s read/write socket. */
}

return 0;
}

The forking_server divides the labor among a parent process and as many child processes as there are connected clients. A client is active until the connection closes, which ends the session.

The parent process executes main from the first instruction. The parent listens for connections and per connection:

  • Spawns a new process to handle the connection.
  • Resumes listening for other connections.

The following is the critical code segment:

 pid_t pid = fork(); /* spawn child */
 if (0 == pid) { /* child */
 close(sock); /* close inherited listening socket */
 /* handle request and terminate */
 ...
 } else /* parent */
 close(client); /* close client, resume listening */ 

The parent executes the call to fork. If the call succeeds, fork returns a non-negative integer: 0 to the forked child process and the child’s process identifier to the parent. The child inherits the parent’s open socket descriptors, which explains the if-else construct:

  • if clause: the child closes its copy of the listening socket because accepting clients is the parent’s job. The child handles the client’s request and then terminates with a call to exit.
  • else clause: the parent closes the client socket because a forked child handles the client. The parent resumes listening for connections.

Creating and destroying processes are expensive. Modules, such as FastCGI, remedy this inefficiency through pre-forking. At startup, FastCGI creates a pool of reusable client-handling processes.

An inefficiency remains, however. When one process preempts another, a context switch occurs with the resultant system working to ensure that the switched-in and switched-out process behaves properly. The kernel maintains per-process context information so that a preempted process can restart. The context’s three main structures are:

  • The page table: maps virtual addresses to physical ones.
  • The process table: stores vital information.
  • The file table: tracks the process’ open files.

The CPU cycles that the system spends on context switches cannot be spent on applications such as Web servers. Although measuring the latency of a context switch is nontrivial, 5ms–10ms per switch is a ballpark and even optimistic range. Pre-forking mitigates the inefficiency of process creation and destruction but does not eliminate context switching.

What good is multiprocessing? The process structure frees the programmer from synchronizing concurrent access to shared memory locations. Imagine, for example, a Web application that lets a user play a simple word game. The application displays scrambled letters, such as kcddoe, and a player tries to unscramble the letters to form a word—in this case docked. This is a single-player game, and the application must track the game’s state: the string to be unscrambled, the player’s movement of letters one at a time and on so. Suppose that there is a global variable:

 typedef struct { 
/* variables to track game's state */ 
} WordGame;
 WordGame game; /* single, global instance */ 

so that the application has, in the source code, exactly one WordGame instance visible across the application’s functions (for example, move_letter, submit_guess). Concurrent players need separate WordGames, so that one player cannot interfere with another’s game.

Now, consider two child processes C1 and C2, each handling a player. Under Linux, a forked child inherits its parent’s address space; hence, C1 and C2 begin with the same address space as each other and their parent. If none of the three processes thereafter executes a write operation, no harm results. The need for separate address spaces first arises with write operations. Linux thus enforces a copy-on-write (COW) policy with respect to memory pages. If C1 or C2 performs a write operation on an inherited page, this child gets its own copy of the page, and the child’s page table is updated. In effect, therefore, the parent and the forked children have separate address spaces, and each client effectively has its own copy of the writable WordGame.

A Multithreading Server

The multithreading_server shown in Listing 3 avoids the context-switch downside of the forking_server but faces challenges of its own. Each process has at least one thread of execution. A single multithreaded process has multiple threads. The threading_server is multithreaded.

Listing 3. threading_server.c

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <netinet/in.h>
#include <signal.h>
#include <pthread.h>
#include “utils.h”

/* thread routine */
void* handle_client(void* client_ptr) {
pthread_detach(pthread_self()); /* terminates on return */

/* read/write socket */
int client = *((int*) client_ptr);

/* request */
char buffer[BUFF_SIZE + 1];
bzero(buffer, sizeof(buffer));
int bytes_read = recv(client, buffer, sizeof(buffer), 0);
if (bytes_read < 0) error_msg(“Problem with recv call”, false);

/* response */
char response[BUFF_SIZE * 2];
bzero(response, sizeof(response));
generate_echo_response(buffer, response);
int bytes_written = send(client, response, strlen(response), 0);
if (bytes_written < 0) error_msg(“Problem with send call”, false);

close(client);

return NULL;
} /* detached thread terminates on return */

int main() {
char buffer[BUFF_SIZE + 1];

struct sockaddr_in client_addr;
socklen_t len = sizeof(struct sockaddr_in);

/* listening socket */
int sock = create_server_socket(false);

/* connections */
while (true) {
int client = accept(sock,
(struct sockaddr*) &client_addr,
&len);
if (client < 0) error_msg(“Problem accepting a
↪client request”, true);

announce_client(&client_addr.sin_addr);

/* client handler */
pthread_t tid;
int flag = pthread_create(&tid, /* id */
NULL, /* attributes */
handle_client, /* routine */
&client); /* routine’s arg */
if (flag < 0) error_msg(“Problem creating thread”, false);
}

return 0;
}

The threading_server mimics the division-of-labor strategy in the forking_server, but the client handlers are now threads within a single process instead of forked child processes. This difference is huge. Thanks to COW, separate processes effectively have separate address spaces, but separate threads within a process share one address space.

When a client connects, the threading_server delegates the handling to a new thread:

 pthread_create(&tid, /* id */
 NULL, /* attributes */
 handle_client, /* routine */
 &client); /* arg to routine */ 

The thread gets a unique identifier and executes a thread routine—in this case, handle_client. The threading_server passes the client socket to the thread routine, which reads from and writes to the client.

How could the WordGame be ported to the forking_server? This server must ensure one WordGame instance per client. The single WordGame:

 WordGame game; /* one instance */ 

could become an array of these:

 WordGame games[BACKLOG]; /* BACKLOG == max clients */ 

When a client connects, the threading_server could search for an available game instance and pass this to the client-handling thread:

 int game_index = get_open_game(); /* called in main so thread safe */ 

In the function main, the threading_server would invoke get_open_game, and each client-handling thread then would have access to its own WordGame instance:

 games[game_index].socket = client;
 pthread_create(&tid, /* id */
 NULL, /* attributes */
 handle_client, /* routine */
 &games[game_index]); /* WordGame arg */ 

A WordGame local to the thread_routine also would work:

 void* handle_client(void* client_ptr) {
 WordGame game; /* local so thread safe */
 /* ... */
 } 

Each thread gets its own copy of locals, which are thereby threadsafe. Of importance is that the programmer rather than the system ensures one WordGame per client.

The threading_server would be more efficient with a thread pool. The pre-forking strategy used in FastCGI for processes extends nicely to threads.

A Polling Server

Listing 4 is a polling_server, which resembles the forking_server in some respects and the threading_server in others.

Listing 4. polling_server.c


#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <netinet/in.h>
#include <signal.h>
#include <sys/epoll.h>
#include <fcntl.h>
#include <errno.h>
#include “utils.h”


#define MAX_BUFFERS (BACKLOG + 1) /* max clients + listener */


int main() {
char buffer[BUFF_SIZE + 1];


/* epoll structures */
struct epoll_event event, /* server2epoll */
event_buffers[MAX_BUFFERS]; /* epoll2server */

int epollfd = epoll_create(MAX_BUFFERS); /* arg just a hint */
if (epollfd &lt; 0) error_msg(“Problem with epoll_create”,
↪true);


struct sockaddr_in client_addr;
socklen_t len = sizeof(struct sockaddr_in);


int sock = create_server_socket(true); /* non-blocking */


/* polling */
event.events = EPOLLIN | EPOLLET; /* incoming, edge-triggered */
event.data.fd = sock; /* register listener */
if (epoll_ctl(epollfd, EPOLL_CTL_ADD, sock, &event) < 0)
error_msg(“Problem with epoll_ctl call”, true);

/* connections + requests */
while (true) {
/* event count */
int n = epoll_wait(epollfd, event_buffers, MAX_BUFFERS, -1);
if (n < 0) error_msg(“Problem with epoll_wait call”, true);


/*
– If connection, add to polling: may be none or more
– If request, read and echo
*/
int i;
for (i = 0; i < n; i++) {
/* listener? */
if (event_buffers[i].data.fd == sock) {
while (true) {
socklen_t len = sizeof(client_addr);
int client = accept(sock,
(struct sockaddr *) &client_addr,
&len);


/* no client? */
if (client < 0 && (EAGAIN == errno ||
↪EWOULDBLOCK == errno)) break;

/* client */
fcntl(client, F_SETFL, O_NONBLOCK); /* non-blocking */
event.events = EPOLLIN | EPOLLET; /* incoming,
edge-triggered */
event.data.fd = client;
if (epoll_ctl(epollfd, EPOLL_CTL_ADD, client, &event) < 0)
error_msg(“Problem with epoll_ctl ADD call”, false);

announce_client(&client_addr.sin_addr);
}
}
/* request */
else {
bzero(buffer, sizeof(buffer));
int bytes_read = recv(event_buffers[i].data.fd, buffer,
↪sizeof(buffer), 0);


/* echo request */
if (bytes_read < 0) {
char response[BUFF_SIZE * 2];
bzero(response, sizeof(response));
generate_echo_response(buffer, response);
int bytes_written =
send(event_buffers[i].data.fd, response,
↪strlen(response), 0);
if (bytes_written < 0) error_msg(“Problem with
↪send call”, false);

close(event_buffers[i].data.fd); /* epoll stops
polling fd */
}
}
}
}


return 0;
}


The polling_server is complicated:

 while (true) /* listening loop */
 for (...) /* event loop */
 if (...) /* accepting event? */
 while (true) /* accepting loop */
 else /* request event */ 

This server executes as one thread in one process and so must support concurrency by jumping quickly from one task (for example, accepting connections) to another (for example, reading requests). These nimble jumps are among nonblocking I/O operations, in particular calls to accept (connections) and recv (requests).

The polling_server’s call to accept returns immediately:

  • If there are no clients waiting to connect, the server moves on to check whether there are requests to read.
  • If there are waiting clients, the polling_server accepts them in a loop.

The polling_server uses the epoll system library, declaring a single epoll structure and an array of these:

 struct epoll_event
 event, /* from server to epoll */
 event_buffers[MAX_EVENTS]; /* from epoll to server */ 

The server uses the single structure to register interest in connections on the listening socket and in requests on the client sockets. The epoll library uses the array of epoll structures to record such events. The division of labor is:

  • The polling_server registers events of interest with epoll.
  • The epoll library records detected events in epoll_event structures.
  • The polling_server handles epoll-detected events.

The polling_server is interested in incoming (EPOLLIN) events and in edge-triggered (EPOLLET) rather than level-triggered events. The distinction comes from digital logic design but examples abound. A red traffic light is a level-triggered event signaling that a vehicle should remain stopped, whereas the transition from green to red is an edge-triggered event signaling that a vehicle should come to a stop. The polling_server is interested in connecting and requesting events when these first occur.

A for loop iterates through detected events. Above the loop, the statement:

 int n = epoll_wait(epollfd, event_buffers, MAX_EVENTS, -1); 

gets an event count, where the events are either connections or requests.

My polling_server takes a shortcut. When the server reads a request, it reads only the bytes then available. Yet the server might require several reads to get the full request; hence, the server should buffer the partials until the request is complete. I leave this fix as an exercise for the reader.

How could the WordGame be ported to the polling_server? This server, like the threading_server, must ensure one WordGame instance per client and must coordinate a client’s access to its WordGame. On the upside, the polling_server is single-threaded and thereby threadsafe. Unlike the forking_server, the polling_server does not incur the cost of context switches among forked children.

Conclusions

Which is the best way to client concurrency? A reasoned answer must consider traditional multiprocessing and multithreading, together with hybrids of these. Theevented I/O way that epoll exemplifies also deserves study. In the end, the selected method must meet the challenges of supporting concurrency across real-world Web applications under real-world conditions.

Resources

The three Web servers together with an iterative_server are available athttp://condor.depaul.edu/mkalin.

For more on Node.js, see: http://nodejs.org.

(via linuxjournal.com)

Linux 3.3 Release

Linux Torvalds announced the release of Linux Kernel 3.2 on the 18th of March:

So after the extra -rc release last weekend, now the final 3.3 is out there in the usual locations.

Things did indeed calm down during the last week, and the shortlog looks pretty boring. The diffstat from -rc7 is dominated by the arch/tile defconfig changes, the rest is pretty small, although there are changes spread out in various subsystems (drivers, filesystem, networking, perf tools).


And obviously, the 3.3 release means that the merge window for 3.4 is now open, although I may keep of pulling stuff for a day or so to encourage people to test the actual release.

Linux 3.2 brought ext-4 and btrfs file system improvements, support for Qualcomm Hexagon processor, an improved profiling tool (perf top), and better CPU and network bandwidth management.

Linux 3.3 brings the following key changes:

  • Android merge: As announced at the end of last year, the Android drivers are now part of the Linux 3.3 kernel.
  • Btrfs improvements:
    • Restriping between different RAID levels and improved balancing:  In btrfs, a “balance” operation consists in a complete rewrite of the filesystem data, which could take many hours, and it didn’t support a change of raid profile. The balancing implementation has been completely reworked. Btrfs can now pause and resume a balance operation, and give status updates. It is also possible to restripe between different raid levels. It also lets filter the balance based on metadata/data profiles, and lets balance only mostly empty block groups. The userspace utils are available in the “parser” branch of the btrfs-progs.
    • Improved debugging with a new utility “integrity check” for debugging btrfs.The tool consist in a extra integrity test that for every write request checks that the filesystem is not writing to the disk bogus references that could left the file system in an inconsistent state that would cause data loss.
  • Open vSwitch: Open vSwitch is a software implementation of a multilayer network switch which is now being merged in the main tree.
  • Better bonding of network interfaces: There is a new “teaming” network device, which is intended to be a fast, scalable, clean, userspace-driven replacement for the bonding driver. It allows to create virtual interfaces that teams together multiple ethernet devices. This is typically used to increase the maximum bandwidth and provide redundancy. The libteam userspace library with couple of demo apps is available at github.com/jpirko/libteam
  • Bufferbloat fighting: Byte queue limits:Bufferbloat” is a term used to describe the latency and throughput problems caused by excessive buffering trough the several elements of a network connection. Byte queue limits aim to solve this issue by provide limits to packet data that can be put in the transmission queue of a network device.
  • Network priority control group: The Network priority cgroup provides an interface to allow an administrator to dynamically set the priority of network traffic generated by various applications.
  • Better Ext4 online resizing: This release supports a new online resizing ioctl which lets kernel do all work. Benchmarks show it’s up to 100 times faster :o .
  • New architecture: TI C6X DSP: This architecture supports members of the Texas Instruments family of C64x single and multicore DSPs. The multicore DSPs do not support cache coherency (so are not suitable for SMP) and do not come with an MMU.  You can visit Linux C6X Wiki for details.
  • EFI boot support: An x86 bzImage can now be loaded and executed directly by EFI firmware. Both BIOS and EFI boot loaders can still load and run the same bzImage.

Further details on Linux 3.3 are available on Kernelnewbies.org.