Monkey is an open source project started in 2001 with the goal of learning C (the long story is here). Over the years the code has improved in many aspects, from nomenclature to heavy architecture changes, all for the better. Nowadays, thanks to the community of core developers and contributors around the project, Monkey is one of the top-performing web servers around, and I would claim it is the best option for Embedded Linux.
Understanding the basics of a human readable protocol: HTTP
The Hypertext Transfer Protocol is basically a language with a simple grammar for communication between two components: an HTTP client and an HTTP server. In the common case, the communication starts with a client sending a request to the server, and the server then replies with a result for that request. The result is either a status response plus some content, or simply an error.
Each HTTP request sent by the client is composed of a request method, a URI, a protocol version and, optionally, a set of headers. Given that, we can say that a server must take care of the following:
- Listen for new connections
- Accept connections
- Once the connection is accepted, start reading the HTTP request sent by the client
- Parse the HTTP request, understand what the client wants
- Depending on the request type, the server can serve some content, close the connection because of some exception, proxy the request to somebody else, etc.
- Close the connection or keep it open waiting for more requests. This depends on the protocol version and the client HTTP headers.
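To make the parsing step a bit more concrete, here is a minimal sketch of how the first line of a request (method, URI, protocol version) could be split. The names `request_line` and `parse_request_line` are made up for this post and are not Monkey's real parser, which has to deal with headers, partial reads and many edge cases that this sketch ignores.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical container for the three parts of a request line. */
struct request_line {
    char method[16];
    char uri[256];
    char version[16];
};

/*
 * Parse "METHOD URI VERSION". Returns 0 on success, -1 on malformed input.
 * Assumption: the whole request line is already in `buf`.
 */
int parse_request_line(const char *buf, struct request_line *out)
{
    assert(buf != NULL && out != NULL);

    /* Field widths are one less than the buffer sizes, leaving room for '\0'. */
    if (sscanf(buf, "%15s %255s %15s",
               out->method, out->uri, out->version) != 3)
        return -1;

    /* The third token must look like a protocol version. */
    if (strncmp(out->version, "HTTP/", 5) != 0)
        return -1;

    return 0;
}
```

Calling `parse_request_line("GET /index.html HTTP/1.1\r\n", &r)` fills `r.method`, `r.uri` and `r.version`; a line with fewer than three tokens is rejected.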
Depending on its target, a server can be implemented in many ways with different architecture strategies, so this post only describes what has worked best for us in terms of high performance and low resource usage.
Architecture design facts
- Monkey is a web server designed with a strong focus on Linux. It does not aim to be portable across other operating systems. Focusing on one widely used, mainstream operating system lets us put our energy and effort in one place, and of course get the most out of the Linux Kernel to achieve high performance.
- Event driven: also known as asynchronous. An event-driven web server uses non-blocking system calls to do its work, reducing the computing time spent in user space. For example, when sending file content to a client we do not block the whole process or thread while the data goes out. Instead we instruct the kernel through a system call to send N bytes from the file and to notify us when we are able to send more. In the meanwhile, we process other connections and send other pending data.
- Embedded friendly: our embedded context is Embedded Linux and we care a lot about resource consumption, which means that under heavy load Monkey does not use more than 2.5MB of memory. The Monkey binary is around 80KB and once loaded in memory it takes about 350KB; depending on the load, more resources can be needed.
- Small core, flexible API: Monkey implements a basic core to handle the HTTP protocol and exposes a flexible API through the plugin interface, where it is possible to hook plugins for the transport layer, security, request types and event handlers.
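The event-driven point above relies on descriptors being in non-blocking mode. As a rough sketch (the helper names `set_nonblocking` and `try_send` are hypothetical, not part of Monkey), this is how a descriptor can be switched to non-blocking mode and how a write that would block can be told apart from a real error:

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Switch a descriptor to non-blocking mode. Returns 0 on success, -1 on error. */
int set_nonblocking(int fd)
{
    assert(fd >= 0);

    int flags = fcntl(fd, F_GETFL, 0);
    if (flags == -1)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK) == -1 ? -1 : 0;
}

/*
 * Try to write without blocking.
 * Returns the number of bytes written, 0 if the kernel buffer is full
 * (the caller should wait for a "writable" event and retry), -1 on error.
 */
ssize_t try_send(int fd, const void *buf, size_t len)
{
    ssize_t n = write(fd, buf, len);

    if (n == -1 && (errno == EAGAIN || errno == EWOULDBLOCK))
        return 0;
    return n;
}
```

When `try_send` returns 0 the thread is free to serve other connections instead of sleeping inside write(2).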
In Monkey we have defined two contexts of work: the process context and the thread context. The process context represents the main process, which waits for incoming connections, and the scheduler, which balances new connections across the worker threads. The thread context belongs to each worker thread handling its active connections.
The number of workers is defined in the configuration, and it scales well on both single-core and multi-core CPUs. There is no need to set thread affinity through CPU masks: the Linux Kernel scheduler is smart enough to assign CPU time to each worker, and by default all workers can run on all CPUs.
From a system administrator's point of view it is possible to assign each worker to a different set of CPUs, but this approach is not suggested unless we are fully aware of what the Linux scheduler does in terms of interrupts, context switches and CPU time for kernel and user-space applications. Do it only if you can do it better than the running scheduler.
Before entering the server loop, the scheduler launches and initializes each worker, taking care of setting up the initial data structures and the interfaces for the interaction between the components mentioned. This stage involves the creation of one epoll(7) queue per worker. It is worth mentioning that each epoll(7) queue created through epoll_create(2) is managed through its own file descriptor.
Once the workers are up and running, the next job of the scheduler is to manage the incoming connections. For each new connection accepted, it determines which worker has the lowest load and assigns the connection to it. The chosen worker is the one with the fewest connections in its epoll(7) interface, so the scheduler walks the worker counters and picks one. At this point the scheduler holds two file descriptors: the connection file descriptor returned by accept(2) and the file descriptor that represents the epoll(7) queue of the chosen worker. So it basically registers the new file descriptor in the proper epoll(7) queue.
Each worker thread runs in an infinite loop around the epoll(7) interface, which is basically a Linux-specific polling mechanism to register, enqueue and report events on file descriptors registered by the scheduler (sockets in this case).
The worker stays in a loop waiting for events in the epoll_wait(2) system call. Every time the scheduler registers a new file descriptor, an event will be reported in the worker's epoll(7) interface, and the same happens for subsequent events such as “there is data available for read” (EPOLLIN), “now you can write to the socket” (EPOLLOUT), “connection closed” (EPOLLHUP), etc.
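As an illustration of one turn of that loop, the sketch below waits for a single event and maps the epoll flags to the three situations mentioned. The enum and function names are invented for this example; a real worker would fetch many events per epoll_wait(2) call and would read any pending data before treating a hang-up as final.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical event classes a worker reacts to. */
enum conn_event { CONN_NONE, CONN_READ, CONN_WRITE, CONN_CLOSE };

/* Map raw epoll flags to one event class; hang-ups and errors win. */
enum conn_event classify(uint32_t events)
{
    if (events & (EPOLLHUP | EPOLLERR))
        return CONN_CLOSE;
    if (events & EPOLLIN)
        return CONN_READ;
    if (events & EPOLLOUT)
        return CONN_WRITE;
    return CONN_NONE;
}

/*
 * Wait up to timeout_ms for one event on the worker's queue.
 * On success the ready descriptor is stored in *fd_out.
 * Returns CONN_NONE on timeout or error.
 */
enum conn_event worker_wait_one(int epoll_fd, int *fd_out, int timeout_ms)
{
    assert(fd_out != NULL);

    struct epoll_event ev;
    int n = epoll_wait(epoll_fd, &ev, 1, timeout_ms);
    if (n <= 0)
        return CONN_NONE;

    *fd_out = ev.data.fd;
    return classify(ev.events);
}

/* Stop watching a descriptor and close it. */
void worker_drop(int epoll_fd, int fd)
{
    epoll_ctl(epoll_fd, EPOLL_CTL_DEL, fd, NULL);
    close(fd);
}
```

A worker would call `worker_wait_one` inside `for (;;)` and switch on the result. EPOLLHUP and EPOLLERR are always reported by the kernel, even if they were not requested at registration time.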
For each event triggered, the worker keeps a status of the connection to determine whether it is a new connection, it is receiving the HTTP request, the HTTP request is complete, the request is being parsed, or a response is being sent out. Besides events, at a fixed interval in seconds set in the configuration, the worker checks for connections that timed out due to an incomplete request or another anomaly.
Monkey defines four categories of API where plugins can hook: Context, Events, Stages and Networking.
Context: defines callbacks that can be invoked when the server is starting up. It covers the process and thread contexts described earlier.
Events: for every type of event reported in a worker loop, a plugin can implement a hook to perform specific actions.
Stages: every new connection enters a stage status, so for each step of the HTTP cycle it passes through different phases, and each plugin can hook into a specific one.
Networking: Monkey is not aware of networking, hence it intentionally depends on a plugin that provides the transport layer. This approach makes it easy to switch from plain sockets to encrypted communication such as SSL. The networking plugin only needs to provide the API functions required for the communication.
Every time a connection has performed a successful request, it is stored in a global list in the worker scope (implemented through a pthread_key). For each event reported, the worker needs to look up the internal data associated with it, so the file descriptor or socket number acts as the primary key for the search. The data structure chosen for Monkey v1.2 is a red-black tree. It has proven to behave well and to scale when handling thousands of active connections per worker, maintaining a good balance between performance and cost.
The cost of each file descriptor lookup is critical for server performance. An O(n) solution works fine for a few connections, but under high concurrency an O(log(n)) solution ends up providing the highest performance.
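To give an idea of what a lookup keyed by file descriptor looks like, here is a sketch built on the `tsearch(3)` family from `<search.h>`. This is a stand-in chosen to keep the example short, not Monkey's own red-black tree code; glibc backs these functions with a red-black tree, but POSIX itself does not promise a balanced tree. The `struct conn` fields and the `conn_*` names are hypothetical.

```c
#include <assert.h>
#include <search.h>
#include <stdlib.h>

/* Hypothetical per-connection data, keyed by file descriptor. */
struct conn {
    int fd;
    int status;
};

/* Order two connections by descriptor number. */
static int cmp_fd(const void *a, const void *b)
{
    const struct conn *x = a, *y = b;
    return (x->fd > y->fd) - (x->fd < y->fd);
}

/* Track a descriptor. Returns the stored entry, or NULL if out of memory. */
struct conn *conn_add(void **root, int fd)
{
    assert(root != NULL);

    struct conn *c = malloc(sizeof(*c));
    if (c == NULL)
        return NULL;
    c->fd = fd;
    c->status = 0;

    struct conn **slot = tsearch(c, root, cmp_fd);
    if (slot == NULL) {
        free(c);
        return NULL;
    }
    if (*slot != c)     /* fd was already tracked: keep the old entry */
        free(c);
    return *slot;
}

/* Find the entry for a descriptor, or NULL if it is not tracked. */
struct conn *conn_find(void **root, int fd)
{
    struct conn key;
    key.fd = fd;
    key.status = 0;

    struct conn **slot = tfind(&key, root, cmp_fd);
    return slot ? *slot : NULL;
}

/* Forget a descriptor and release its entry. */
void conn_del(void **root, int fd)
{
    struct conn *c = conn_find(root, fd);
    if (c != NULL) {
        tdelete(c, root, cmp_fd);
        free(c);
    }
}
```

Each worker would own its own `void *root = NULL;` so that no locking is needed around the tree.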
One of the keys to reducing overhead in a server is to reduce, as much as possible, the memory allocation requests made to the system within the main loop. The current Monkey implementation performs only one memory allocation per new connection; if the incoming request posts too much data, it allocates more memory as needed. Other web servers implement caching mechanisms to reduce memory allocations even further, but since our focus is Embedded Linux we aim for speed at low resource usage, and implementing a caching mechanism would increase our costs. So we dropped that common approach in order not to abuse system memory; it is just a decision based on the target.
Linux Kernel system calls
The Linux Kernel exposes a useful set of non-portable system calls to achieve high performance in networking applications. The first one is epoll(7): as described earlier, this interface allows watching a set of file descriptors for certain defined events. Similar solutions like select(2) or poll(2) do not perform as well as epoll(7) does.
When sending a static file, the old-fashioned way is to open the file, get the file descriptor and perform multiple read(2)/write(2) calls to write out the file content. This requires the kernel to copy data back and forth between kernel and user space, which obviously generates overhead. As a solution, the Linux Kernel implements a zero-copy strategy through the sendfile(2) system call. This system call does not copy data to user space; instead it sends it directly to another file descriptor, achieving good performance by avoiding the latency of the old-fashioned way.
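A minimal sketch of the sendfile(2) approach follows. The helper name `send_static_file` is hypothetical, and the sketch assumes a blocking output descriptor to stay short; with a non-blocking socket the loop would stop on EAGAIN and resume from the saved offset once the socket is writable again.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/sendfile.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Send a whole file to out_fd without copying it through user space.
 * Returns the number of bytes sent, or -1 on error.
 * Assumption: out_fd is in blocking mode.
 */
ssize_t send_static_file(int out_fd, const char *path)
{
    assert(path != NULL);

    int in_fd = open(path, O_RDONLY);
    if (in_fd == -1)
        return -1;

    struct stat st;
    if (fstat(in_fd, &st) == -1) {
        close(in_fd);
        return -1;
    }

    off_t offset = 0;   /* the kernel advances this as data goes out */
    while (offset < st.st_size) {
        ssize_t n = sendfile(out_fd, in_fd, &offset,
                             (size_t)(st.st_size - offset));
        if (n == -1) {
            if (errno == EINTR)
                continue;           /* interrupted: just retry */
            close(in_fd);
            return -1;
        }
        if (n == 0)
            break;                  /* file shrank under us */
    }

    close(in_fd);
    return (ssize_t)offset;
}
```

Notice that no user-space buffer appears anywhere: the file content never enters the process.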
In our architecture, the Logger plugin needs to transfer data through a pipe(2) (a unidirectional data channel that can be used for interprocess communication). A common mechanism is to use read(2) and write(2) on each end, but in a similar way to sendfile(2), a newer system call exists for this kind of situation: splice(2). This system call moves data from one point to another without the data-copy overhead. The main difference between sendfile(2) and splice(2) is that splice(2) requires one end to be a pipe(2).
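As a rough illustration of the splice(2) idea, the sketch below moves bytes from the read end of a pipe into another descriptor, for example a log file. The function name `pipe_to_fd` is made up and this is not the Logger plugin's code. splice(2) is Linux-specific and needs `_GNU_SOURCE` defined before the first include.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Move up to `len` bytes from a pipe's read end into out_fd.
 * Returns the number of bytes moved, 0 if the pipe's write end is closed
 * and no data is left, or -1 on error.
 */
ssize_t pipe_to_fd(int pipe_rd, int out_fd, size_t len)
{
    assert(len > 0);

    ssize_t n;
    do {
        /* NULL offsets: use the current position of each descriptor. */
        n = splice(pipe_rd, NULL, out_fd, NULL, len, SPLICE_F_MOVE);
    } while (n == -1 && errno == EINTR);

    return n;
}
```

One detail worth knowing: splice(2) fails with EINVAL when the target file was opened with O_APPEND, so a log file used this way has to be opened without that flag.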
In my previous post I described how to use the new Linux Kernel feature called TCP_FASTOPEN. While very simple to implement, it requires the cooperation of both sides: the client and the server. If you have full control of your networking application (client and server), consider using TCP_FASTOPEN: it improves performance by saving a round trip on the TCP handshake.
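On the server side, enabling it comes down to one socket option on the listening socket. The sketch below assumes headers recent enough to define `TCP_FASTOPEN`; the helper name `enable_fastopen` and the queue length passed by the caller are illustrative choices, not values taken from Monkey.

```c
#include <assert.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/*
 * Enable TCP Fast Open on a listening TCP socket.
 * `queue_len` bounds the number of pending Fast Open requests.
 * Returns 0 on success, -1 on error (errno is set by setsockopt(2)).
 */
int enable_fastopen(int listen_fd, int queue_len)
{
    assert(queue_len > 0);

    return setsockopt(listen_fd, IPPROTO_TCP, TCP_FASTOPEN,
                      &queue_len, sizeof(queue_len));
}
```

The server side also has to be allowed system-wide through the `net.ipv4.tcp_fastopen` sysctl, and the client has to send its first data with `sendto(2)` and the `MSG_FASTOPEN` flag instead of calling connect(2) first.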
Based on the architecture and API described, the following plugins are distributed as part of the core:
Liana: basic sockets connectivity layer
PolarSSL: provides a transport layer based on SSL
Cheetah: plugin that provides a command line interface to query the internals of a running server through a unix socket
Mandril: security layer that restricts access by URI strings or subnetworks
Dirlisting: directory listing
Logger: log writer
CGI: old fashioned CGI interface
FastCGI: provides FastCGI support
Bonus track: Full HTTP Stack for web services implementation
Besides being a common web server for static or dynamic content, Monkey is a full stack for the development of web applications. To provide an easy API for web application or web service development, we have created Duda I/O, an event-driven C framework for rapid development based on the Monkey stack.
Duda implements a core API of pseudo-objects and provides extra features through a package system, everything in a friendly C API. The most relevant features supported at the moment are WebSocket, JSON, SQLite3, Redis, Base64 and SHA1.
Due to its high-performance nature and the open source ecosystem around it, it is being used in production from Embedded Linux products to Big Data solutions. The license of Duda allows creating closed-source services or applications and linking them to the Duda I/O stack at zero cost.
For more details please refer to Duda I/O main site.
The Monkey organization believes in Open Source and is fully committed to creating the best networking technology for different needs. If you are interested in participating as a contributor or in testing our stack, feel free to reach us on our mailing lists or on the IRC channel #monkey at irc.freenode.net.