Comparing the Performance of Various Web Frameworks

TechEmpower has been running benchmarks for the last year, attempting to measure and compare the performance of web frameworks. For these benchmarks, the term “framework” is used loosely to include platforms and micro-frameworks.

The results of the first benchmark were published in March 2013, and the tests were conducted on two hardware configurations: 2 Amazon EC2 m1.large instances and 2 Intel Sandy Bridge Core i7-2600K machines with 8GB of RAM, both setups using Gbit Ethernet. The frameworks had to reply with a simple JSON response: {"message" : "Hello, World!"}. Following is an excerpt of the results, showing 10 of the most common web frameworks out of the 24 benchmarked, along with their results measured in requests per second (RPS) and as a percentage of the best result.

[Image: Round 1 results excerpt: requests per second and percentage of the best result]
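To give a sense of how minimal the benchmarked workload is, the JSON test for a plain-PHP setup amounts to little more than the following sketch (based on the test description above, not TechEmpower’s actual test code):

<?php
// TechEmpower-style JSON test, plain-PHP flavor: serialize a fixed
// message and return it with the correct content type.
header('Content-Type: application/json');
echo json_encode(array('message' => 'Hello, World!'));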

Over the year, the benchmark evolved to cover 119 frameworks, to include multiple types of queries and responses, and to run on one additional hardware configuration: 2 x Dell R720xd dual-Xeon E5 v2 servers with 32GB of RAM each and 10 Gbit Ethernet.

Following are excerpts from the recently published 9th round of benchmarking, this time performed on three different hardware configurations:

[Image: Round 9 results excerpts for the three hardware configurations]

Several observations:

While high-end hardware certainly delivers more RPS than lower-end hardware, the hierarchy of the top frameworks changes considerably between the Xeon E5 and the Intel i7. This correlates with the fact that many frameworks score above 90% of the leader on the i7, which TechEmpower attributes to a bottleneck created by the 1 Gbit Ethernet used in that configuration:

Since the first round, we have known that the highest-performance frameworks and platforms have been network-limited by our gigabit Ethernet…. With ample network bandwidth, the highest-performing frameworks are further differentiated.

Google’s Go ranks first on the Intel i7 but drops significantly on the Xeon E5 in spite of using all cores, emphasizing the importance of performance tuning for certain hardware architectures, as noted by TechEmpower:

The Peak [Xeon E5] test environment uses servers with 40 HT cores across two Xeon E5 processors, and the results illustrate how well each framework scales to high parallelism on a single server node.

Of course, other variables are at play, such as significantly more HT processor cores–40 versus 8 for our i7 tests–and a different system architecture, NUMA.

The Amazon EC2 m1.large severely underperforms the other configurations, making one wonder whether the price/performance ratio of cloud computing instances is still attractive.

Most frameworks are Linux-based. No Windows framework was benchmarked on EC2 or the i7, and the best Windows framework (plain-windows) performed at 18.6% of the leader on the E5. Facebook’s HHVM does not appear on EC2 either; it is also likely that HHVM is tuned differently when it runs on Facebook’s own hardware.

Some well-known frameworks, such as Grails, Spring, Django and Rails, severely underperform on all hardware configurations used.

The recently released Dart 1.3 has doubled in performance and should now be on par with Node.js, but this does not show in the benchmark because this round used Dart 1.2. The same may apply to other frameworks that have been improved lately; those gains will show up in future rounds.

As with other performance benchmarks, the results are debatable, since they depend on the hardware, network and system configurations. TechEmpower invites those interested in bumping up the results for their favorite framework to fork the code on GitHub.

The benchmark includes detailed explanations on how tests are conducted, the hardware configurations, and the components of each framework: the language, web server, OS, etc.

(via InfoQ.com)

How To Scale A Web Application Using A Coffee Shop As An Example

This is a guest repost by Sriram Devadas, an engineer in Vistaprint’s Web platform group. A fun and well-written analogy of how to scale web applications, using a familiar coffee shop as an example. No coffee was harmed during the making of this post.

I own a small coffee shop.

My expense is proportional to resources
100 square feet of built up area with utilities, 1 barista, 1 cup coffee maker.

My shop’s capacity
Serves 1 customer at a time, takes 3 minutes to brew a cup of coffee, a total of 5 minutes to serve a customer.

Since my barista works without breaks and the German-made coffee maker never breaks down,
my shop’s maximum throughput = 12 customers per hour.

 

Web Server

Customers walk away during peak hours. We only serve one customer at a time. There is no space to wait.

I upgrade shop. My new setup is better!

Expenses
Same area and utilities, 3 baristas, 2 cup coffee maker, 2 chairs

Capacity
3 minutes to brew 2 cups of coffee, ~7 minutes to serve 3 concurrent customers, 2 additional customers can wait in queue on chairs.

Concurrent customers = 3, Customer capacity = 5

 

Scaling Vertically

Business is booming. Time to upgrade again. Bigger is better!

Expenses
200 square feet and utilities, 5 baristas, 4 cup coffee maker, 3 chairs

Capacity goes up proportionally. Things are good.

In summer, business drops. This is normal for coffee shops. I want to go back to a smaller setup for some time. But my landlord won’t let me scale that way.

Scaling vertically is expensive for my up and coming coffee shop. Bigger is not always better.

 

Scaling Horizontally With A Load Balancer

My landlord is willing to scale up and down in units of the smaller 3-barista setups. He can add a setup or take one away if I give advance notice.

If only I could manage multiple setups with the same storefront…

There is a special counter in the market which does just that!
It allows several customers to talk to a barista at once. Actually, the customer-facing employee need not be a barista, just someone who takes and delivers orders. And baristas making coffee do not have to deal directly with pesky customers.

Now I have an advantage. If I need to scale, I can lease another 3 barista setup (which happens to be a standard with coffee landlords), and hook it up to my special counter. And do the reverse to scale down.

While expenses go up, I can handle capacity better too. I can ramp up and down horizontally.

 

Resource Intensive Processing

My coffee maker is really a general purpose food maker. Many customers tell me they would love to buy freshly baked bread. I add it to the menu.

I have a problem. The 2 cup coffee maker I use can make only 1 pound of bread at a time. Moreover, it takes twice as long to make bread.

In terms of time,
1 pound of bread = 4 cups of coffee

Sometimes bread orders clog my system! Coffee ordering customers are not happy with the wait. Word spreads about my inefficiency.

I need a way of segmenting my customer orders by load, while using my resources optimally.

 

Asynchronous Queue Based Processing

I introduce a token based queue system.

Customers come in, place their order, get a token number and wait.
The order taker places the order on 2 input queues – bread and coffee.
Baristas look at the queues and all other units and decide if they should pick up the next coffee or the next bread order.
Once a coffee or bread is ready, it is placed on an output tray and the order taker shouts out the token number. The customer picks up the order.

- The input queues and output tray are new. Otherwise the resources are the same, only working in different ways.

- From a customer’s point of view, the interaction with the system is now quite different.

- The expense and capacity calculations are complicated. The system’s complexity went up too. If any problem crops up, debugging and fixing it could be problematic.

- If the customers accept this asynchronous system, and we can manage the complexity, it provides a way to scale capacity and product variety. I can intimidate my next-door competitor.

 

Scaling Beyond

We are reaching the limits of our web server driven, load balanced, queue based asynchronous system. What next?

My already weak coffee shop analogy has broken down completely at this point.

Search for DNS round robin and other scaling techniques if you are interested in knowing more.

If you are new to web application scaling, do the same for each of the topics mentioned on this page.

My coffee shop analogy was an over-simplification, meant to spark interest in the topic of scaling web applications.

If you really want to learn, tinker with these systems and discuss them with someone who has practical knowledge.

(via HighScalability.com)

 

Scale PHP on EC2 to 30,000 Concurrent Users / Server

RockThePost.com is a LAMP stack hosted on EC2. We’re preparing to be featured in an email which will be sent to ~1M investors… all at the same time. For our 2-person engineering department, that meant we had to do a quick sanity check to see just how many people we can support concurrently.

Our app uses PHP’s Zend Framework 2. Our web servers are two m1.medium EC2 machines with an ELB in front of them set up to split the load. On the backend, we’re running a master/slave MySQL database configuration. Very typical for most of the small data shops I’ve worked at.

Here are my opinions and thoughts in random order.

  • Use PHP’s APC feature. I don’t understand why this isn’t on by default, but having APC enabled is really a requirement for a website to have a chance at performing well.
  • Put everything that’s not a .php request on a CDN. No need to bog down your origin server with static file requests. Our deployer puts everything on S3 and we use absolute paths pointing at CloudFront. Disclaimer: CloudFront has had some issues lately, and we’ve recently set the site to serve all static materials directly from S3 until the CloudFront issues are resolved.
  • Don’t make connections to other servers in your PHP code, such as the database and memcache servers, unless it’s mandatory and there’s really NO other way to do what you’re trying to do. I’d guess that the majority of PHP web apps out there use a MySQL database for backend session management and a memcache pool for caching. Making connections to other servers during your execution flow is not efficient: it blocks, runs up the CPU, and tends to hold up the line, as it were. Instead, use the APC key/value store for storing data in PHP and Varnish for full page caching (see the sketch after this list).
  • I repeat… Use Varnish. In all likelihood, most of the pages on your site do not change, or hardly ever change. Varnish is the Memcache/ModRewrite of web server caching. Putting Varnish on my web server made the single biggest difference to my web app when I was load testing it.
  • Use a c1.xlarge. The m1.medium has only 1 CPU to handle all of the requests. I found that upgrading to a c1.xlarge, which has 8 CPUs, really paid off. During my load test, I did an apt-get install nmon and then ran nmon and monitored the CPU. You can literally watch the server use all 8 of its cores to churn out pages.
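To make the APC suggestion above concrete, here is a minimal sketch of APC as a key/value store (apc_fetch and apc_store are the standard APC functions; the key and the query helper are hypothetical):

<?php
// Serve a cached result from APC's shared memory when possible,
// falling back to the database only on a cache miss.
$result = apc_fetch('homepage:featured', $hit);
if (!$hit) {
    $result = fetch_featured_from_mysql();        // hypothetical DB helper
    apc_store('homepage:featured', $result, 300); // cache for 5 minutes
}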

Google Analytics shows us how many seconds an average user spends on an average page. Using this information, we ran some load tests with Siege and were able to extrapolate that with the above recommendations, we should be able to handle 30,000 users concurrently per web server on the “exterior” of the site. For “interior” pages which are PHP powered, we’re expecting to see something more along the lines of 1,000 concurrent sessions per server. In advance of the email blast, we’ll launch a bunch of c1.xlarges and then kill them off slowly as we get more comfortable with the actual load we’re seeing on blast day.
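For reference, a Siege run along these lines (the URL, concurrency and duration are placeholders, not RockThePost’s actual test parameters) looks like:

siege -c 250 -t 2M http://www.example.com/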

Finally, we will do more work to make our code itself more scalable… however, the code is only about 4 months old and this is the first time we’ve ever been challenged to actually make it scale.

(source: coderwall.com)

Speed Up Your Web Site with Varnish

Varnish is a program that can greatly speed up a Web site while reducing the load on the Web server. According to Varnish’s official site, Varnish is a “Web application accelerator also known as a caching HTTP reverse proxy”.

When you think about what a Web server does at a high level, it receives HTTP requests and returns HTTP responses. In a perfect world, the server would return a response immediately without having to do any real work. In the real world, however, the server may have to do quite a bit of work before returning a response to the client. Let’s first look at how a typical Web server handles this, and then see what Varnish does to improve the situation.

Although every server is different, a typical Web server will go through a potentially long sequence of steps to service each request it receives. It may start by spawning a new process to handle the request. Then, it may have to load script files from disk, launch an interpreter process to interpret and compile those files into bytecode and then execute that bytecode. Executing the code may result in additional work, such as performing expensive database queries and retrieving more files from disk. Multiply this by hundreds or thousands of requests, and you can see how the server quickly can become overloaded, draining system resources trying to fulfill requests. To make matters worse, many of the requests are repeats of recent requests, but the server may not have a way to remember the responses, so it’s sentenced to repeating the same painful process from the beginning for each request it encounters.

Things are a little different with Varnish in place. For starters, the request is received by Varnish instead of the Web server. Varnish then will look at what’s being requested and forward the request to the Web server (known as a back end to Varnish). The back-end server does its regular work and returns a response to Varnish, which in turn gives the response to the client that sent the original request.

If that’s all Varnish did, it wouldn’t be much help. What gives us the performance gains is that Varnish can store responses from the back end in its cache for future use. Varnish quickly can serve the next response directly from its cache without placing any needless load on the back-end server. The result is that the load on the back end is reduced significantly, response times improve, and more requests can be served per second. One of the things that makes Varnish so fast is that it keeps its cache completely in memory instead of on disk. This and other optimizations allow Varnish to process requests at blinding speeds. However, because memory typically is more limited than disk, you have to size your Varnish cache properly and take measures not to cache duplicate objects that would waste valuable space.

Let’s install Varnish. I’m going to explain how to install it from source, but you can install it using your distribution’s package manager. The latest version of Varnish is 3.0.3, and that’s the version I work with here. Be aware that the 2.x versions of Varnish have some subtle differences in the configuration syntax that could trip you up. Take a look at the Varnish upgrade page on the Web site for a full list of the changes between versions 2.x and 3.x.

Missing dependencies is one of the most common installation problems. Check the Varnish installation page for the full list of build dependencies.

Run the following commands as root to download and install the latest version of Varnish:


cd /var/tmp
wget http://repo.varnish-cache.org/source/varnish-3.0.3.tar.gz
tar xzf varnish-3.0.3.tar.gz
cd varnish-3.0.3
sh autogen.sh
sh configure
make
make test
make install

 

Varnish is now installed under the /usr/local directory. The full path to the main binary is /usr/local/sbin/varnishd, and the default configuration file is /usr/local/etc/varnish/default.vcl.

You can start Varnish by running the varnishd binary. Before you can do that though, you have to tell Varnish which back-end server it’s caching for. Let’s specify the back end in the default.vcl file. Edit the default.vcl file as shown below, substituting the values for those of your Web server:


backend default {
    .host = "127.0.0.1";
    .port = "80";
}

 

Now you can start Varnish with this command:


/usr/local/sbin/varnishd -f /usr/local/etc/varnish/default.vcl \
    -a :6081 -P /var/run/varnish.pid -s malloc,256m

 

This will run varnishd as a dæmon and return you to the command prompt. One thing worth pointing out is that varnishd will launch two processes. The first is the manager process, and the second is the child worker process. If the child process dies for whatever reason, the manager process will spawn a new process.

Varnishd Startup Options

The -f option tells Varnish where your configuration file lives.

The -a option is the address:port that Varnish will listen on for incoming HTTP requests from clients.

The -P option is the path to the PID file, which will make it easier to stop Varnish in a few moments.

The -s option configures where the cache is kept. In this case, we’re using a 256MB memory-resident cache.

If you installed Varnish from your package manager, it may be running already. In that case, you can stop it first, then use the command above to start it manually. Otherwise, the options it was started with may differ from those in this example. A quick way to see if Varnish is running and what options it was given is with the pgrep command:


/usr/bin/pgrep -lf varnish

 

Varnish now will relay any requests it receives to the back end you specified, possibly cache the response, and deliver the response back to the client. Let’s submit some simple GET requests and see what Varnish does. First, run these two commands on separate terminals:


/usr/local/bin/varnishlog
/usr/local/bin/varnishstat

 

The following GET command is part of the Perl www library (libwww-perl). I use it so you can see the response headers you get back from Varnish. If you don’t have libwww-perl, you could use Firefox with the Live HTTP Headers extension or another tool of your choice:


GET -Used http://localhost:6081/

 

Figure 1. Varnish Response Headers

The options given to the GET command aren’t important here. The important thing is that the URL points to the port on which varnishd is listening. There are three response headers that were added by Varnish. They are X-Varnish, Via and Age. These headers are useful once you know what they are. The X-Varnish header will be followed by either one or two numbers. The single-number version means the response was not in Varnish’s cache (miss), and the number shown is the ID Varnish assigned to the request. If two numbers are shown, it means Varnish found a response in its cache (hit). The first is the ID of the request, and the second is the ID of the request from which the cached response was populated. The Via header just shows that the request went through a proxy. The Age header tells you how long the response has been cached by Varnish, in seconds. The first response will have an Age of 0, and subsequent hits will have an incrementing Age value. If subsequent responses to the same page don’t increment the Age header, that means Varnish is not caching the response.
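For example, the headers for a cache hit might look like this (the IDs and age are illustrative):

X-Varnish: 1977011992 1977011991
Via: 1.1 varnish
Age: 17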

Now let’s look at the varnishstat command launched earlier. You should see something similar to Figure 2.

Figure 2. varnishstat Command

The important lines are cache_hit and cache_miss. cache_hit won’t be shown if you haven’t had any hits yet. As more requests come in, the counters are updated to reflect hits and misses.

Next, let’s look at the varnishlog command launched earlier (Figure 3).

Figure 3. varnishlog Command

This shows you fairly verbose details of the requests and responses that have gone through Varnish. The documentation on the Varnish Web site explains the log output as follows:

The first column is an arbitrary number; it defines the request. Lines with the same number are part of the same HTTP transaction. The second column is the tag of the log message. All log entries are tagged with a tag indicating what sort of activity is being logged. Tags starting with Rx indicate Varnish is receiving data, and Tx indicates sending data. The third column tells us whether the data is coming from or going to the client (c), or to/from the back end (b). The fourth column is the data being logged.

varnishlog has various filtering options to help you find what you’re looking for. I recommend playing around and getting comfortable with varnishlog, because it will really help you debug Varnish. Read the varnishlog(1) man page for all the details. Next are some simple examples of how to filter with varnishlog.

To view communication between Varnish and the client (omitting the back end):


/usr/local/bin/varnishlog -c

 

To view communication between Varnish and the back end (omitting the client):


/usr/local/bin/varnishlog -b

 

To view the headers received by Varnish (both the client’s request headers and the back end’s response headers):


/usr/local/bin/varnishlog -i RxHeader

 

Same thing, but limited to just the client’s request headers:


/usr/local/bin/varnishlog -c -i RxHeader

 

Same thing, but limited to just the back end’s response headers:


/usr/local/bin/varnishlog -b -i RxHeader

 

To write all log messages to the /var/log/varnish.log file and dæmonize:


/usr/local/bin/varnishlog -Dw /var/log/varnish.log

 

To read and display all log messages from the /var/log/varnish.log file:


/usr/local/bin/varnishlog -r /var/log/varnish.log

 

Varnish keeps a circular log in memory in order to stay fast, which means old log entries are lost unless saved to disk. The last two examples above demonstrate how to save all log messages to a file for later review.

If you wanted to stop Varnish, you could do so with this command:


kill `cat /var/run/varnish.pid`

 

This will send the TERM signal to the process whose PID is stored in the /var/run/varnish.pid file. Because this is the varnishd manager process, Varnish will shut down.

Now that you know how to start and stop Varnish, and examine cache hits and misses, the natural question to ask is what does Varnish cache, and for how long?

Varnish is conservative about what it will cache by default, but you can change most of these defaults. It will only consider caching GET and HEAD requests. It won’t cache a request with either a Cookie or Authorization header. It won’t cache a response with either a Set-Cookie or Vary header. One thing Varnish looks at is the Cache-Control header. This header is optional and may be present in the request or the response. It may contain a list of one or more comma-separated directives. This header is meant to apply caching restrictions; however, Varnish won’t alter its caching behavior based on the Cache-Control header, with the exception of the max-age directive. This directive looks like Cache-Control: max-age=n, where n is a number. If Varnish receives the max-age directive in the back end’s response, it will use that value to set the cached response’s expiration (TTL), in seconds. Otherwise, Varnish will set the cached response’s TTL to the value of its default_ttl parameter, which defaults to 120 seconds.
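For example, a back-end response carrying the following header (the value is illustrative) would be cached by Varnish for one hour instead of the 120-second default:

Cache-Control: max-age=3600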

Note:

Varnish has configuration parameters with sensible defaults. For example, the default_ttl parameter defaults to 120 seconds. Configuration parameters are fully explained in the varnishd(1) man page. You may want to change some of the default parameter values. One way to do that is to launch varnishd by using the -p option. This has the downside of having to stop and restart Varnish, which will flush the cache. A better way of changing parameters is by using what Varnish calls the management interface. The management interface is available only if varnishd was started with the -T option. It specifies on what port the management interface should listen. You can connect to the management interface with the varnishadm command. Once connected, you can query parameters and change their values without having to restart Varnish.
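For instance, assuming varnishd was started with -T localhost:6082, you could inspect and change the default_ttl parameter on a live cache like this (a sketch; see the varnishadm(1) man page for details):

/usr/local/bin/varnishadm -T localhost:6082 param.show default_ttl
/usr/local/bin/varnishadm -T localhost:6082 param.set default_ttl 300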

To learn more, read the man pages for varnishd, varnishadm and varnish-cli.

You’ll likely want to change what Varnish caches and how long it’s cached for—this is called your caching policy. You express your caching policy in the default.vcl file by writing VCL. VCL stands for Varnish Configuration Language, which is like a very simple scripting language specific to Varnish. VCL is fully explained in the vcl(7) man page, and I recommend reading it.

Before changing default.vcl, let’s think about the process Varnish goes through to fulfill an HTTP request. I call this the request/response cycle, and it all starts when Varnish receives a request. Varnish will parse the HTTP request and store the details in an object known to Varnish simply as req. Now Varnish has a decision to make based entirely on the req object—should it check its cache for a match or just forward the request to the back end without caching the response? If it decides to bypass its cache, the only thing left to do is forward the request to the back end and then forward the response back to the client. However, if it decides to check its cache, things get more interesting. This is called a cache lookup, and the result will either be a hit or a miss. A hit means that Varnish has a response in its cache for the client. A miss means that Varnish doesn’t have a cached response to send, so the only logical thing to do is send the request to the back end and then cache the response it gives before sending it back to the client.

Now that you have an idea of Varnish’s request/response cycle, let’s talk about how to implement your caching policy by changing the decisions Varnish makes in the process. Varnish has a set of subroutines that carry out the process described above. Each of these subroutines performs a different part of the process, and the return value from the subroutine is how you tell Varnish what to do next. In addition to setting the return values, you can inspect and make changes to various objects within the subroutines. These objects represent things like the request and the response. Each subroutine has a default behavior that can be seen in default.vcl. You can redefine these subroutines to get Varnish to behave how you want.

Varnish Subroutines

The Varnish subroutines have default definitions, which are shown in default.vcl. Just because you redefine one of these subroutines doesn’t mean the default definition will not execute. In particular, if you redefine one of the subroutines but don’t return a value, Varnish will proceed to execute the default subroutine. All the default Varnish subroutines return a value, so it makes sense that Varnish uses them as a fallback.

The first subroutine to look at is called vcl_recv. This gets executed after receiving the full client request, which is available in the req object. Here you can inspect and make changes to the original request via the req object. You can use the value of req to decide how to proceed. The return value is how you tell Varnish what to do. I’ll put the return values in parentheses as they are explained. Here you can tell Varnish to bypass the cache and send the back end’s response back to the client (pass). You also can tell Varnish to check its cache for a match (lookup).

Next is the vcl_pass subroutine. If you returned pass in vcl_recv, this is where you’ll be just before sending the request to the back end. You can tell Varnish to continue as planned (pass) or to restart the cycle at the vcl_recv subroutine (restart).

The vcl_miss and vcl_hit subroutines are executed depending on whether Varnish found a suitable response in the cache. From vcl_miss, your main options are to get a response from the back-end server and cache it (fetch) or to get a response from the back end and not cache it (pass). vcl_hit is where you’ll be if Varnish successfully finds a matching response in its cache. From vcl_hit, you have the cached response available to you in the obj object. You can tell Varnish to send the cached response to the client (deliver) or have Varnish ignore the cached response and return a fresh response from the back end (pass).

The vcl_fetch subroutine is where you’ll be after getting a fresh response from the back end. The response will be available to you in the beresp object. You either can tell Varnish to continue as planned (deliver) or to start over (restart).

From vcl_deliver, you can finish the request/response cycle by delivering the response to the client and possibly caching it as well (deliver), or you can start over (restart).

As previously stated, you express your caching policy within the subroutines in default.vcl. The return values tell Varnish what to do next. You can base your return values on many things, including the values held in the request (req) and response (resp) objects mentioned earlier. In addition to req and resp, there also is a client object representing the client, a server object and a beresp object representing the back end’s response. It’s important to realize that not all objects are available in all subroutines. It’s also important to return one of the allowed return values from subroutines. One of the hardest things to remember when starting out with Varnish is which objects are available in which subroutines, and what the legal return values are. To make it easier, I’ve created a couple of reference tables. They will help you get up to speed quickly by not having to memorize everything up front or dig through the documentation every time you make a change.

Table 1. This table shows which objects are available in each of the subroutines.

vcl_recv: client, server, req
vcl_pass: client, server, req, bereq
vcl_miss: client, server, req, bereq
vcl_hit: client, server, req, obj
vcl_fetch: client, server, req, bereq, beresp
vcl_deliver: client, server, req, resp

Table 2. This table shows valid return values for each of the subroutines.

vcl_recv: pass, lookup, error, pipe
vcl_pass: pass, error, restart
vcl_miss: pass, error, fetch
vcl_hit: pass, error, restart, deliver
vcl_fetch: error, restart, deliver, hit_for_pass
vcl_deliver: error, restart, deliver

Tip:

Be sure to read the full explanation of VCL, available subroutines, return values and objects in the vcl(7) man page.

Let’s put it all together by looking at some examples.

Normalizing the request’s Host header:


sub vcl_recv {
    if (req.http.host ~ "^www\.example\.com") {
        set req.http.host = "example.com";
    }
}

 

Notice you access the request’s host header by using req.http.host. You have full access to all of the request’s headers by putting the header name after req.http. The ~ operator is the match operator. That is followed by a regular expression. If you match, you then use the set keyword and the assignment operator (=) to normalize the hostname to simply “example.com”. A really good reason to normalize the hostname is to keep Varnish from caching duplicate responses. Varnish looks at the hostname and the URL to determine if there’s a match, so the hostnames should be normalized if possible.

Here’s a snippet from the default vcl_recv subroutine:


sub vcl_recv {
    if (req.request != "GET" && req.request != "HEAD") {
        return (pass);
    }
    return (lookup);
}

 

You can see that if it’s not a GET or HEAD request, Varnish returns pass and won’t cache the response. If it is a GET or HEAD request, it looks it up in the cache.

Removing the request’s cookies if the URL matches:


sub vcl_recv {
    if (req.url ~ "^/images") {
        unset req.http.cookie;
    }
}

 

That’s an example from the Varnish Web site. It removes cookies from the request if the URL starts with “/images”. This makes sense when you recall that Varnish won’t cache a request with a cookie. By removing the cookie, you allow Varnish to cache the response.

Removing response cookies for image files:


sub vcl_fetch {
    if (req.url ~ "\.(png|gif|jpg)$") {
        unset beresp.http.set-cookie;
        set beresp.ttl = 1h;
    }
}

 

That’s another example from Varnish’s Web site. Here you’re in the vcl_fetch subroutine, which happens after fetching a fresh response from the back end. Recall that the response is held in the beresp object. Notice that here you’re accessing both the request (req) and the response (beresp). If the request is for an image, you remove the Set-Cookie header set by the server and override the cached response’s TTL to one hour. Again, you do this because Varnish won’t cache responses with the Set-Cookie header.

Now, let’s say you want to add a header to the response called X-Hit. The value should be 1 for a cache hit and 0 for a miss. The easiest way to detect a hit is from within the vcl_hit subroutine. Recall that vcl_hit will be executed only when a cache hit occurs. Ideally, you’d set the response header from within vcl_hit, but looking at Table 1 in this article, you see that neither of the response objects (beresp and resp) are available within vcl_hit. One way around this is to set a temporary header in the request, then later set the response header. Let’s take a look at how to solve this.

Adding an X-Hit response header:


sub vcl_hit {
    set req.http.tempheader = "1";
}

sub vcl_miss {
    set req.http.tempheader = "0";
}

sub vcl_deliver {
    set resp.http.X-Hit = "0";
    if (req.http.tempheader) {
        set resp.http.X-Hit = req.http.tempheader;
        unset req.http.tempheader;
    }
}

 

The code in vcl_hit and vcl_miss is straightforward—set a value in a temporary request header to indicate a cache hit or miss. The interesting bit is in vcl_deliver. First, I set a default value for X-Hit to 0, indicating a miss. Next, I detect whether the request’s tempheader was set, and if so, set the response’s X-Hit header to match the temporary header set earlier. I then delete the tempheader to keep things tidy, and I’m all done. The reason I chose the vcl_deliver subroutine is because the response object that will be sent back to the client (resp) is available only within vcl_deliver.

Let’s explore a similar solution that doesn’t work as expected.

Adding an X-Hit response header—the wrong way:


sub vcl_hit {
    set req.http.tempheader = "1";
}

sub vcl_miss {
    set req.http.tempheader = "0";
}

sub vcl_fetch {
    set beresp.http.X-Hit = "0";
    if (req.http.tempheader) {
        set beresp.http.X-Hit = req.http.tempheader;
        unset req.http.tempheader;
    }
}

 

Notice that within vcl_fetch, I’m now altering the back end’s response (beresp), not the final response sent to the client. This code appears to work as expected, but it has a major bug. The first request is a miss and is fetched from the back end; that response has X-Hit set to “0”, and then it’s cached. Subsequent requests result in a cache hit and never enter the vcl_fetch subroutine. The result is that all cache hits continue having X-Hit set to “0”. These are the types of mistakes to look out for when working with Varnish.

The easiest way to avoid these mistakes is to keep those reference tables handy; remember when each subroutine is executed in Varnish’s workflow, and always test the results.

Let’s look at a simple way to tell Varnish to cache everything for one hour. This is shown only as an example and isn’t recommended for a real server.

Cache all responses for one hour:


sub vcl_recv {
    return (lookup);
}

sub vcl_fetch {
    set beresp.ttl = 1h;
    return (deliver);
}

 

Here, I’m overriding two default subroutines with my own. If I hadn’t returned “deliver” from vcl_fetch, Varnish still would have executed its default vcl_fetch subroutine looking for a return value, and this would not have worked as expected.

Once you get Varnish to implement your caching policy, you should run some benchmarks to see if there is any improvement. The benchmarking tool I use here is the Apache benchmark tool, known as ab. You can install this tool as part of the Apache Web server or as a separate package—depending on your system’s package manager. You can read about the various options available to ab in either the man page or at the Apache Web site.

In the benchmark examples below, I have a stock Apache 2.2 installation listening on port 80, and Varnish listening on port 6081. The page I’m testing is a very basic Perl CGI script I wrote that just outputs a one-liner HTML page. It’s important to benchmark the same URL against both the Web server and Varnish so you can make a direct comparison. I run the benchmark from the same machine that Apache and Varnish are running on in order to eliminate the network as a factor. The ab options I use are fairly straightforward. Feel free to experiment with different ab options and see what happens.

Let’s start with 1000 total requests (-n 1000) and a concurrency of 1 (-c 1).

Benchmarking Apache with ab:


ab -c 1 -n 1000 http://localhost/cgi-bin/test

 

Figure 4. Output from ab Command (Apache)

Benchmarking Varnish with ab:


ab -c 1 -n 1000 http://localhost:6081/cgi-bin/test

 

Figure 5. Output from ab Command (Varnish)

As you can see, the ab command provides a lot of useful output. The metrics I’m looking at here are “Time per request” and “Requests per second” (rps). You can see that Apache came in at just over 1ms per request (780 rps), while Varnish came in at 0.1ms (7336 rps)—nearly ten times faster than Apache. This shows that Varnish is faster, at least based on the current setup and isolated testing. It’s a good idea to run ab with various options to get a feel for performance—particularly by changing the concurrency values and seeing what impact that has on your system.

System Load and %iowait

System load is a measure of how much load is being placed on your CPU(s). As a general rule, you want the number to stay below 1.0 per CPU or core on your system. That means if you have a four-core system as in the machine I’m benchmarking here, you want your system’s load to stay below 4.0.

%iowait is the percentage of CPU time spent waiting on input/output. A high %iowait indicates your system is disk-bound, performing so many disk I/O operations that the system slows down. For example, if your server had to retrieve 100 or more files for each request, %iowait would likely climb very high, indicating that the disk is a bottleneck.

The goal is to not only improve response times, but also to do so with as little impact on system resources as possible. Let’s compare how a prolonged traffic surge affects system resources. Two good measures of system performance are the load average and the %iowait. The load average can be seen with the top utility, and the %iowait can be seen with the iostat command. You’re going to want to keep an eye on both top and iostat during the prolonged load test to see how the numbers change. Let’s fire up top and iostat, each on separate terminals.

Starting iostat with a two-second update interval:


iostat -c 2

 

Starting top:


/usr/bin/top

 

Now you’re ready to run the benchmark. You want ab to run long enough to see the impact on system performance. This typically means anywhere from one minute to ten minutes. Let’s re-run ab with a lot more total requests and a higher concurrency.

Load testing Apache with ab:


ab -c 50 -n 100000 http://localhost/cgi-bin/test

 

Figure 6. System Load Impact of Traffic Surge on Apache

Load testing Varnish with ab:


ab -c 50 -n 1000000 http://localhost:6081/cgi-bin/test

 

Figure 7. System Load Impact of Traffic Surge on Varnish

First let’s compare response times. Although you can’t see it in the screenshots, which were taken just before ab finished, Apache came in at 23ms per request (2097 rps), and Varnish clocked in at 4ms per request (12099 rps). The most drastic difference can be seen in the load averages in top. While Apache brought the system load all the way up to 12, Varnish kept the system load near 0 at 0.4. I did have to wait several minutes for the machine’s load averages to go back down after the Apache load test before load testing Varnish. It’s also best to run these tests on a non-production system that is mostly idle.

Although everyone’s servers and Web sites have different requirements and configurations, Varnish may be able to improve your site’s performance drastically while simultaneously reducing the load on the server.

(via Linuxjournal.com)

 

 

 

PHP Performance Crash Course, Part 1: The Basics

Why does performance matter?

The simple truth is that application performance has a direct impact on your bottom line.

Follow these simple best practices to start improving PHP performance:

Update PHP!

One of the easiest ways to improve performance and stability is to upgrade your version of PHP. PHP 5.3.x was released in 2009. If you haven’t migrated to PHP 5.4, now is the time! Not only do you benefit from bug fixes and new features, but you will also see faster response times immediately. See PHP.net to get started.

Once you’ve finished upgrading PHP, be sure to disable any unused extensions in production such as xdebug or xhprof.

Use an opcode cache

PHP is an interpreted language, which means that every time a PHP page is requested, the server will interpret the PHP file and compile it into something the machine can understand (opcode). Opcode caches preserve this generated code in a cache so that it only needs to be interpreted and compiled on the first request. If you aren’t using an opcode cache, you’re missing out on a very easy performance gain. Pick your flavor: APC, Zend Optimizer, XCache, or eAccelerator. I highly recommend APC, written by the creator of PHP, Rasmus Lerdorf.
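If you go with APC, enabling it is typically just a matter of loading the extension in php.ini; exact paths and sizes vary by system, so treat this as a sketch:

extension = apc.so
apc.enabled = 1
apc.shm_size = 64M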

Use autoloading

Many developers writing object-oriented applications create one PHP source file per class definition. One of the biggest annoyances in writing PHP is having to write a long list of needed includes at the beginning of each script (one for each class). PHP re-evaluates these require/include expressions each time a file containing them is loaded into the runtime. Using an autoloader will enable you to remove all of your require/include statements and benefit from a performance improvement. You can even cache the class map of your autoloader in APC for a small additional gain.
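A minimal autoloader looks something like the following sketch (the src/ directory and the class-to-path convention are assumptions for illustration):

<?php
// Register an autoloader that maps Foo_Bar or Foo\Bar to src/Foo/Bar.php,
// loading each class file only when it is first used.
spl_autoload_register(function ($class) {
    $file = __DIR__ . '/src/' . str_replace(array('\\', '_'), '/', $class) . '.php';
    if (is_file($file)) {
        require $file;
    }
});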

Optimize your sessions

While HTTP is stateless, most real-life web applications require a way to manage user data. In PHP, application state is managed via sessions. The default configuration for PHP is to persist session data to disk. This is extremely slow and not scalable beyond a single server. A better solution is to store your session data in a database and front it with an LRU (Least Recently Used) cache with Memcached or Redis. If you are super smart, you will realize you should limit your session data size (4096 bytes) and store all session data in a signed or encrypted cookie.
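If you go the Memcached route, pointing PHP’s session handler at a Memcached pool is a two-line php.ini change (assuming the pecl memcached extension is installed; the host and port are placeholders):

session.save_handler = memcached
session.save_path = "127.0.0.1:11211"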

Use a distributed data cache

Applications usually require data. Data is usually structured and organized in a database. Depending on the data set and how it is accessed, it can be expensive to query. An easy solution is to cache the result of the first query in a data cache like Memcached or Redis. If the data changes, you invalidate the cache and make another SQL query to get the updated result set from the database.
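A minimal sketch of this pattern with the Memcached extension (the key and the query helper are illustrative):

<?php
// Check the cache first; hit the database only on a miss,
// then cache the result for subsequent requests.
$cache = new Memcached();
$cache->addServer('127.0.0.1', 11211);

$products = $cache->get('top_products');
if ($products === false) {
    $products = query_top_products();            // hypothetical DB helper
    $cache->set('top_products', $products, 600); // cache for 10 minutes
}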

I highly recommend the Doctrine ORM for PHP, which has built-in caching support for Memcached or Redis.

There are many use cases for a distributed data cache from caching web service responses and app configurations to entire rendered pages.

Do blocking work in the background

Oftentimes web applications have to run tasks that can take a while to complete. In most cases there is no good reason to force the end user to wait for the job to finish. The solution is to queue blocking work to run in background jobs. Background jobs are jobs that are executed outside the main flow of your program, and usually handled by a queue or message system. The benefits come in terms of both end-user experience and scaling by writing and processing long-running jobs from a queue. I am a big fan of Resque for PHP, a simple toolkit for running tasks from queues. There are a variety of tools that provide queuing or messaging systems that work well with PHP.

I highly recommend Wan Qi Chen’s excellent blog post series about getting started with background jobs and Resque for PHP.

[Image: user location update workflow with background jobs]

Leverage HTTP caching

HTTP caching is one of the most misunderstood technologies on the Internet. Go read the HTTP caching specification. Don’t worry, I’ll wait. Seriously, go do it! They solved all of these caching design problems a few decades ago. It boils down to expiration or invalidation, and when used properly it can save your app servers a lot of load. Please read the excellent HTTP caching guide from Mark Nottingham. I highly recommend using Varnish as a reverse proxy cache to alleviate load on your app servers.
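As a minimal illustration of the expiration model (the max-age value is a placeholder), an application can mark its response cacheable so Varnish or any shared cache can serve it:

<?php
// Let shared caches serve this response for up to 5 minutes.
header('Cache-Control: public, max-age=300');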

Optimize your favorite framework


Deep diving into the specifics of optimizing each framework is outside of the scope of this post, but these principles apply to every framework:

  • Stay up-to-date with the latest stable version of your favorite framework
  • Disable features you are not using (I18N, Security, etc)
  • Enable caching features for view and result set caching

Learn how to profile PHP code for performance

Xdebug is a PHP extension for powerful debugging. It supports stack and function traces, profiling information and memory allocation and script execution analysis. It allows developers to easily profile PHP code.

WebGrind is an Xdebug profiling web frontend in PHP5. It implements a subset of the features of KCachegrind, installs in seconds, and works on all platforms. For quick-and-dirty optimizations it does the job. Here’s a screenshot showing the output from profiling:

[Image: WebGrind profiling output]

Check out Chris Abernethy’s guide to profiling PHP with XDebug and Webgrind.

XHProf is a function-level hierarchical profiler for PHP with a reporting and UI layer. XHProf is capable of reporting function-level inclusive and exclusive wall times, memory usage, CPU times and number of calls for each function. Additionally, it supports the ability to compare two runs (hierarchical DIFF reports) or aggregate results from multiple runs.

AppDynamics is application performance management software designed to help dev and ops troubleshoot problems in complex production apps.


Get started with AppDynamics today and get in-depth analysis of your application’s performance.

PHP application performance is only part of the battle

Now that you have optimized the server-side, you can spend time improving the client side! In modern web applications most of the end-user experience time is spent waiting on the client side to render. Google has dedicated many resources to helping developers improve client side performance.

(via DZone.com)

 

NGINX 1.4 News: SPDY and PageSpeed

NGINX 1.4 comes with experimental SPDY support, WebSocket proxying, and gunzip support, while Google’s PageSpeed module for NGINX enters beta.

Since the release of NGINX 1.3 in May of 2012, the community behind this growing web server has worked on a number of new features, the most notable including:

  • Experimental SPDY support. The module needs to be enabled with the --with-http_spdy_module configuration parameter. Server push is still not supported.
  • HTTP/1.1 connections can be turned into WebSocket connections using HTTP/1.1’s protocol switch feature (see the configuration sketch after this list).
  • gunzip support for decompressing gzipped data, useful when the client cannot do it.
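The WebSocket proxying mentioned above boils down to forwarding the upgrade headers to the backend; a configuration sketch (the location and backend address are placeholders):

location /chat/ {
    proxy_pass http://127.0.0.1:8000;
    proxy_http_version 1.1;
    proxy_set_header Upgrade $http_upgrade;
    proxy_set_header Connection "upgrade";
}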

The complete list of enhancements, changes and bug fixes can be found in the Changes 1.4 log.

On the heels of the NGINX 1.4 release, Google has announced the PageSpeed beta for NGINX, making over 40 optimization filters available to its users, such as image compression, JavaScript and CSS minification, and HTML rewriting.

According to Google, the ngx_pagespeed module has been used in production by some customers, including MaxCDN, a CDN provider, which reported a reduction of “1.57 seconds from our average page load, dropped our bounce rate by 1%, and our exit percentage by 2.5%”. ZippyKid, a WordPress hosting provider, saw a “75% reduction in page sizes and a 50% improvement in page rendering speeds” after using PageSpeed for NGINX.

NGINX is currently the #3 web server overall with a 16.2% market share, and #2 among the top 10,000 websites with 31.9%, according to W3Techs.

(via infoq.com)

Using NGINX, PHP-FPM+APC and Varnish to make WordPress Websites fly

WordPress is one of the most popular content platforms on the Internet. It powers a large share of freshly released websites and has a huge user community. Running a WordPress site enables editors to easily and regularly publish content which might very well end up on Hacker News – so let us make sure the web server is up to its job!

This setup will easily let you serve hundreds of thousands of users per day, including brief peaks of up to 100 concurrent users, on a ~$15/month machine with only 2 cores and 2GB of RAM.

Outrageous Assumptions

This guide assumes that you have a working knowledge of content management systems, web servers and their configurations. Additionally, you should be familiar with installing, starting and stopping services on a remote server via ssh. In short: if you know how to work a CLI, you’ll be fine.

Step 0/9: Use the Source, or: Enjoy the Repository

If you don’t need to adhere to a strict company security policy, use the Dotdeb repository for Debian to gain access to newer versions of NGINX and PHP. For Varnish, get the Varnish Debian repository. They are the easiest to use.

Step 1/9: The Little Engine that Could

Let’s start by firing up your favorite package management tool, potentially with Super Cow Powers, and install NGINX. The following configuration optimizes NGINX for WordPress. Put it into /etc/nginx/nginx.conf to make NGINX use both CPU cores, do proper Gzipping and more. Note: single-line configurations shown here may extend over multiple lines for readability – take care when copying.

user www-data;
worker_processes 2;
pid /var/run/nginx.pid;

events {
    worker_connections 768;
    multi_accept on;
    use epoll;
}

http {

    # Let NGINX get the real client IP for its access logs
    set_real_ip_from 127.0.0.1;
    real_ip_header X-Forwarded-For;

    # Basic Settings
    sendfile on;
    tcp_nopush on;
    tcp_nodelay on;
    keepalive_timeout 20;
    client_max_body_size 15m;
    client_body_timeout 60;
    client_header_timeout 60;
    client_body_buffer_size  1K;
    client_header_buffer_size 1k;
    large_client_header_buffers 4 8k;
    send_timeout 60;
    reset_timedout_connection on;
    types_hash_max_size 2048;
    server_tokens off;

    # server_names_hash_bucket_size 64;
    # server_name_in_redirect off;

    include /etc/nginx/mime.types;
    default_type application/octet-stream;

    # Logging Settings
    # access_log /var/log/nginx/access.log;
    error_log /var/log/nginx/error.log;

    # Log Format
    log_format main '$remote_addr - $remote_user [$time_local] '
    '"$request" $status $body_bytes_sent "$http_referer" '
    '"$http_user_agent" "$http_x_forwarded_for"';

    # Gzip Settings
    gzip on;
    gzip_static on;
    gzip_disable "msie6";
    gzip_vary on;
    gzip_proxied any;
    gzip_comp_level 6;
    gzip_min_length 512;
    gzip_buffers 16 8k;
    gzip_http_version 1.1;
    gzip_types text/css text/javascript text/xml text/plain text/x-component 
    application/javascript application/x-javascript application/json 
    application/xml  application/rss+xml font/truetype application/x-font-ttf 
    font/opentype application/vnd.ms-fontobject image/svg+xml;

    # Virtual Host Configs
    include /etc/nginx/conf.d/*.conf;
    include /etc/nginx/sites-enabled/*;
}

In case your site will make use of custom webfonts, NGINX may need some help to deliver them with the correct MIME type – otherwise it will default to application/octet-stream. Edit /etc/nginx/mime.types and check that the webfont types are set to the following:

    application/vnd.ms-fontobject           eot;
    application/x-font-ttf                  ttf;
    font/opentype                           otf;
    application/font-woff                   woff;

Step 2/9: Who Enters my Domain?

Now that you have done the basic setup for NGINX, let’s make it play nicely with your domain: remove the symlink to default from /etc/nginx/sites-enabled/, create a new configuration file at /etc/nginx/sites-available/ named yourdomain.tld and symlink to it at /etc/nginx/sites-enabled/yourdomain.tld. NGINX will use the new configuration file.
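In shell terms, that boils down to something like this (assuming the standard Debian layout; substitute your own domain name):

rm /etc/nginx/sites-enabled/default
touch /etc/nginx/sites-available/yourdomain.tld
ln -s /etc/nginx/sites-available/yourdomain.tld /etc/nginx/sites-enabled/yourdomain.tld

Now fill the new configuration file. Put the following into /etc/nginx/sites-available/yourdomain.tld: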

server {
    # Default server block blacklisting all unconfigured access
    listen [::]:8080 default_server;
    server_name _;
    return 444;
}

server {
    # Configure the domain that will run WordPress
    server_name yourdomain.tld;
    listen [::]:8080 deferred;
    port_in_redirect off;
    server_tokens off;
    autoindex off;

    client_max_body_size 15m;
    client_body_buffer_size 128k;

    # WordPress needs to be in the webroot of /var/www/ in this case
    root /var/www;
    index index.html index.htm index.php;
    try_files $uri $uri/ /index.php?q=$uri&$args;

    # Define default caching of 24h
    expires 86400s;
    add_header Pragma public;
    add_header Cache-Control "max-age=86400, public, must-revalidate, proxy-revalidate";

    # deliver a static 404
    error_page 404 /404.html;
    location  /404.html {
        internal;
    }

    # Deliver 404 instead of 403 "Forbidden"
    error_page 403 = 404;

    # Do not allow access to files giving away your WordPress version
    location ~ /(\.|wp-config.php|readme.html|licence.txt) {
        return 404;
    }

    # Add trailing slash to */wp-admin requests.
    rewrite /wp-admin$ $scheme://$host$uri/ permanent;

    # Don't log robots.txt requests
    location = /robots.txt {
        allow all;
        log_not_found off;
        access_log off;
    }

    # Rewrite for versioned CSS+JS via filemtime
    location ~* ^.+\.(css|js)$ {
        rewrite ^(.+)\.(\d+)\.(css|js)$ $1.$3 last;
        expires 31536000s;
        access_log off;
        log_not_found off;
        add_header Pragma public;
        add_header Cache-Control "max-age=31536000, public";
    }

    # Aggressive caching for static files
    # If you alter static files often, please use 
    # add_header Cache-Control "max-age=31536000, public, must-revalidate, proxy-revalidate";
    location ~* \.(asf|asx|wax|wmv|wmx|avi|bmp|class|divx|doc|docx|eot|exe|
    gif|gz|gzip|ico|jpg|jpeg|jpe|mdb|mid|midi|mov|qt|mp3|m4a|mp4|m4v|mpeg|
    mpg|mpe|mpp|odb|odc|odf|odg|odp|ods|odt|ogg|ogv|otf|pdf|png|pot|pps|
    ppt|pptx|ra|ram|svg|svgz|swf|tar|t?gz|tif|tiff|ttf|wav|webm|wma|woff|
    wri|xla|xls|xlsx|xlt|xlw|zip)$ {
        expires 31536000s;
        access_log off;
        log_not_found off;
        add_header Pragma public;
        add_header Cache-Control "max-age=31536000, public";
    }

    # pass PHP scripts to Fastcgi listening on Unix socket
    # Do not process them if inside WP uploads directory
    # If using Multisite or a custom uploads directory,
    # please set the */uploads/* directory in the regex below
    location ~* (^(?!(?:(?!(php|inc)).)*/uploads/).*?(php)) {
        try_files $uri = 404;
        fastcgi_split_path_info ^(.+.php)(.*)$;
        fastcgi_pass unix:/var/run/php-fpm.socket;
        fastcgi_index index.php;
        fastcgi_param SCRIPT_FILENAME $document_root$fastcgi_script_name;
        include fastcgi_params;
        fastcgi_intercept_errors on;
        fastcgi_ignore_client_abort off;
        fastcgi_connect_timeout 60;
        fastcgi_send_timeout 180;
        fastcgi_read_timeout 180;
        fastcgi_buffer_size 128k;
        fastcgi_buffers 4 256k;
        fastcgi_busy_buffers_size 256k;
        fastcgi_temp_file_write_size 256k;
    }

    # Deny access to hidden files
    location ~ /\. {
        deny all;
        access_log off;
        log_not_found off;
    }

}

# Redirect all www. queries to non-www
# Change in case your site is to be available at "www.yourdomain.tld"
server {
    listen [::]:8080;
    server_name www.yourdomain.tld;
    rewrite ^ $scheme://yourdomain.tld$request_uri? permanent;
}

This configuration is for a single-site WordPress install at the /var/www/ webroot with a media upload limit of 15MB. Read the comments within the configuration if you want to change things: nginx -t is your friend when altering configuration files.
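A typical check-then-reload cycle (Debian-style init script assumed) looks like:

nginx -t && /etc/init.d/nginx reload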

Step 3/9: Let’s get the Elephant into the Room

Next, we’ll get PHP-FPM on board. Install it any way you like, preferably PHP version 5.4, and make sure you get APC as well.
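
On Debian- or Ubuntu-style systems the stock packages will do; the package names here are an assumption and vary by distribution:

# PHP-FPM, the MySQL driver for WordPress, and the APC opcode cache
# (package names assume Debian/Ubuntu; adjust for your distribution)
sudo apt-get install php5-fpm php5-mysql php-apc

After a successful installation, we’ll need to edit several configuration files. Let’s start with the well-known /etc/php5/fpm/php.ini. It’s a huge file, so instead of reproducing it here, please go through the following directives one by one and set them to the respective values in your php.ini: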

short_open_tag = Off
ignore_user_abort = Off
post_max_size = 15M
upload_max_filesize = 15M
default_charset = "UTF-8"
allow_url_fopen = Off
default_socket_timeout = 30
mysql.allow_persistent = Off

At the very end of your php.ini, add the following block which will configure APC, the opcode cache:

[apc]
apc.stat = "0"
apc.max_file_size = "1M"
apc.localcache = "1"
apc.localcache.size = "256"
apc.shm_segments = "1"
apc.ttl = "3600"
apc.user_ttl = "7200"
apc.gc_ttl = "3600"
apc.cache_by_default = "1"
apc.filters = ""
apc.write_lock = "1"
apc.num_files_hint= "512"
apc.user_entries_hint="4096"
apc.shm_size = "256M"
apc.mmap_file_mask=/tmp/apc.XXXXXX
apc.include_once_override = "0"
apc.file_update_protection="2"
apc.canonicalize = "1"
apc.report_autofilter="0"
apc.stat_ctime="0"
;With apc.stat disabled, APC never checks files for changes, so you
;must clear the cache to recompile files that are already cached.
;While you are still developing, set this to 1.
apc.stat = "0"

In your /etc/php5/fpm/php-fpm.conf, please set the following lines to their respective values:

pid = /var/run/php5-fpm.pid
error_log = /var/log/php5-fpm.log
emergency_restart_threshold = 5
emergency_restart_interval = 2
events.mechanism = epoll

Finally, in /etc/php5/fpm/pool.d/www.conf, do the same procedure again:

user = www-data
group = www-data
listen = /var/run/php-fpm.socket
listen.owner = www-data
listen.group = www-data
listen.mode = 0666
listen.allowed_clients = 127.0.0.1
pm = dynamic
pm.max_children = 50
pm.start_servers = 15
pm.min_spare_servers = 5
pm.max_spare_servers = 25
pm.process_idle_timeout = 60s
request_terminate_timeout = 30
security.limit_extensions = .php

and add the following to the end of it:

php_flag[display_errors] = off
php_admin_value[error_reporting] = 0
php_admin_value[error_log] = /var/log/php5-fpm.log
php_admin_flag[log_errors] = on
php_admin_value[memory_limit] = 128M

Kudos! You have just configured PHP-FPM to listen on a Unix socket, spawn child processes to deal with requests, and cache compiled bytecode with APC so that there’s less load on the system.
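
A quick sanity check that the pool comes up on the configured socket, assuming the paths from above:

sudo service php5-fpm restart
ls -l /var/run/php-fpm.socket    # should exist and be owned by www-data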

Step 4/9: Give Varnish a Polish

At this point, you already have a lean, mean webserving machine – albeit on port 8080. Now imagine ~95% of all your incoming traffic hitting an additional layer above NGINX that serves content straight from RAM. PHP won’t even have to process anything most of the time, and your database will receive more load from editors adding new content than from queries caused by the frontend. That’s worth the extra mile, right? So let’s go install Varnish!

After installing, edit /etc/default/varnish so Varnish uses two thread pools (one per CPU core), keeps its cache in RAM and gets sensible timeouts:

DAEMON_OPTS="-a :80 \
    -T localhost:6082 \
    -f /etc/varnish/default.vcl \
    -u www-data -g www-data \
    -S /etc/varnish/secret \
    -p thread_pools=2 \
    -p thread_pool_min=25 \
    -p thread_pool_max=250 \
    -p thread_pool_add_delay=2 \
    -p session_linger=50 \
    -p sess_workspace=262144 \
    -p cli_timeout=40 \
    -s malloc,768m"

And for the very last step in this web server stack setup, put the following configuration into /etc/varnish/default.vcl – it’s well commented so you can see what each part does:

# We only have one backend to define: NGINX
backend default {
    .host = "127.0.0.1";
    .port = "8080";
}

# Only allow purging from specific IPs      
acl purge {
    "localhost";
    "127.0.0.1";
}

sub vcl_recv {
    # Handle compression correctly. Different browsers send different
    # "Accept-Encoding" headers, even though they mostly support the same
    # compression mechanisms. By consolidating compression headers into
    # a consistent format, we reduce the cache size and get more hits.
    # @see: http://varnish.projects.linpro.no/wiki/FAQ/Compression
    if (req.http.Accept-Encoding) {
        if (req.http.Accept-Encoding ~ "gzip") {
            # If the browser supports it, we'll use gzip.
            set req.http.Accept-Encoding = "gzip";
        }
        else if (req.http.Accept-Encoding ~ "deflate") {
            # Next, try deflate if it is supported.
            set req.http.Accept-Encoding = "deflate";
        }
        else {
            # Unknown algorithm. Remove it and send unencoded.
            unset req.http.Accept-Encoding;
        }
    }

    # Set client IP
    if (req.http.x-forwarded-for) {
        set req.http.X-Forwarded-For =
        req.http.X-Forwarded-For + ", " + client.ip;
    } else {
        set req.http.X-Forwarded-For = client.ip;
    }

    # Check if we may purge (only localhost)
    if (req.request == "PURGE") {
        if (!client.ip ~ purge) {
            error 405 "Not allowed.";
        }
        return(lookup);
    }

    if (req.request != "GET" &&
        req.request != "HEAD" &&
        req.request != "PUT" &&
        req.request != "POST" &&
        req.request != "TRACE" &&
        req.request != "OPTIONS" &&
        req.request != "DELETE") {
            # /* Non-RFC2616 or CONNECT which is weird. */
            return (pipe);
    }

    if (req.request != "GET" && req.request != "HEAD") {
        # /* We only deal with GET and HEAD by default */
        return (pass);
    }

    # admin users always miss the cache
    if( req.url ~ "^/wp-(login|admin)" || 
        req.http.Cookie ~ "wordpress_logged_in_" ){
            return (pass);
    }

    # Remove cookies set by Google Analytics (pattern: '__utmABC')
    if (req.http.Cookie) {
        set req.http.Cookie = regsuball(req.http.Cookie,
            "(^|; ) *__utm.=[^;]+;? *", "\1");
        if (req.http.Cookie == "") {
            remove req.http.Cookie;
        }
    }

    # always pass through POST requests and those with basic auth
    if (req.http.Authorization || req.request == "POST") {
        return (pass);
    }

    # Do not cache these paths
    if (req.url ~ "^/wp-cron\.php$" ||
        req.url ~ "^/xmlrpc\.php$" ||
        req.url ~ "^/wp-admin/.*$" ||
        req.url ~ "^/wp-includes/.*$" ||
        req.url ~ "\?s=") {
            return (pass);
    }

    # Define the default grace period to serve cached content
    set req.grace = 30s;

    # By dropping any remaining cookies, the request is now cacheable
    unset req.http.Cookie;
    return (lookup);
}

sub vcl_fetch {
    # remove some headers we never want to see
    unset beresp.http.Server;
    unset beresp.http.X-Powered-By;

    # only allow cookies to be set if we're in admin area
    if( beresp.http.Set-Cookie && req.url !~ "^/wp-(login|admin)" ){
        unset beresp.http.Set-Cookie;
    }

    # don't cache response to posted requests or those with basic auth
    if ( req.request == "POST" || req.http.Authorization ) {
        return (hit_for_pass);
    }

    # don't cache search results
    if( req.url ~ "\?s=" ){
        return (hit_for_pass);
    }

    # If the backend returns a 5xx status, extend the grace period set
    # in vcl_recv so that cached content keeps being served and the
    # unhealthy backend is not hammered by requests
    if (beresp.status == 500) {
        set beresp.grace = 60s;
        return (restart);
    }

    # only cache responses with status 200 OK
    if (beresp.status != 200) {
        return (hit_for_pass);
    }

    # GZip the cached content if possible
    if (beresp.http.content-type ~ "text") {
        set beresp.do_gzip = true;
    }

    # if nothing above matched, it is now ok to cache the response
    set beresp.ttl = 24h;
    return (deliver);
}

sub vcl_deliver {
    # remove some headers added by varnish
    unset resp.http.Via;
    unset resp.http.X-Varnish;
}

sub vcl_hit {
    # Set up invalidation of the cache so purging gets done properly
    if (req.request == "PURGE") {
        purge;
        error 200 "Purged.";
    }
    return (deliver);
}

sub vcl_miss {
    # Set up invalidation of the cache so purging gets done properly
    if (req.request == "PURGE") {
        purge;
        error 200 "Purged.";
    }
    return (fetch);
}

sub vcl_error {
    if (obj.status == 503) {
        # set obj.http.location = req.http.Location;
        set obj.status = 404;
        set obj.response = "Not Found";
        return (deliver);
    }
}
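
Before restarting Varnish, let it compile the VCL in a dry run; varnishd exits non-zero on errors, so chaining the restart is safe (Varnish 3 syntax, Debian-style service scripts assumed):

# Compile the VCL to C without starting the daemon, then restart on success
sudo varnishd -C -f /etc/varnish/default.vcl > /dev/null && sudo service varnish restart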

This configuration delivers high-speed, gzipped content and caches aggressively, while letting logged-in backend users view the live site uncached. It also shields the stack from unwelcome traffic while allowing cache purges from trusted sources.
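
To check that the cache is actually doing its job – the VCL above strips the X-Varnish header, so you can’t tell from the response – watch the hit and miss counters while browsing the site (Varnish 3 counter names):

varnishstat -1 | grep -E "cache_(hit|miss)"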

Step 5/9: Code Is Poetry

If you don’t already have WordPress installed, do it now. There’s plenty of good guidance on this already. Should you come across odd reloads of the same step during install, restart all three services – NGINX, PHP-FPM and Varnish – as shown below.
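
Assuming Debian-style service scripts, one line does the trick:

sudo service nginx restart && sudo service php5-fpm restart && sudo service varnish restart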

After successful installation of WordPress, we’ll need to make it talk to the high performance webserver stack we’ve just built for it. Let’s start by reading, understanding and installing the NGINX compatibility plugin for WordPress. Don’t forget to follow the instructions on the plugin site. Now, WordPress can deal with NGINX.

Next, let’s make WordPress work with PHP’s opcode cache APC: read, understand and then install the Object-Cache plugin for WordPress.

The final step is to empower WordPress so it can purge the Varnish cache. Again, read, understand and only then install Pål-Kristian Hamre’s Varnish plugin for WordPress. Afterwards, go to its configuration page within the WordPress admin-interface and set the following parameters:

Varnish Administration IP Address: 127.0.0.1
Varnish Administration Port: 80
Varnish Secret: (get it at /etc/varnish/secret)
Check: "Also purge all page navigation" and "Also purge all comment navigation"
Varnish Version: 3
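
You can also verify purging end to end from the server itself – the ACL above only allows purges from localhost. The URL path here is just an example:

# Send a PURGE for a single page through Varnish on port 80
curl -X PURGE -H "Host: yourdomain.tld" http://127.0.0.1/hello-world/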

And that’s all concerning the high performance webserver stack, folks! Your WordPress will now handle a great deal of traffic while keeping TTFB to a minimum. The times of high CPU load and disk I/O troubles are now past.

The Performance Golden Rule

Steve Souders’ Performance Golden Rule states that most performance gains can be achieved by optimizing the frontend rather than scaling the backend to deal with a huge number of HTTP requests. Or, as Patrick Meenan puts it, one should always question “the need to make the requests in the first place”. So let’s not stop at our high performance webserver, but continue with a high performance frontend!

Step 6/9: Bust that Cache Wide Open!

If you have read the configuration files for NGINX, you will have noticed some lines about a cache-buster for CSS and JS files. It alters the link to your theme’s unified stylesheet and JS to something like style.1352707822.css, where the number is the file’s modification time (filemtime); the NGINX rewrite shown earlier maps that versioned name back to the real file. So whenever you alter your files, their apparent filenames change and clients download the newest versions without you touching any paths manually. Simply place the following into your theme’s functions.php:

/**
 * Automated cache-buster via filemtime:
 * turns style.css into style.1352707822.css
 **/
function autoVer($url){
  $name = explode('.', $url);
  $lastext = array_pop($name);
  // insert the file's mtime right before the extension
  array_push(
    $name,
    filemtime($_SERVER['DOCUMENT_ROOT'] . parse_url($url, PHP_URL_PATH)),
    $lastext);
  return implode('.', $name);
}

Now use the autoVer function when including CSS and JS, e.g. in your functions.php (see wp_register_style) or header.php. Since autoVer returns the versioned URL, echo it where needed, and always pass a full URL so the filemtime lookup resolves below the document root. Examples:

# Add in your theme's functions.php:
wp_register_style(
  'style',
  autoVer(get_bloginfo('stylesheet_url')),
  false,
  NULL,
  'all');

# Add in your theme's header.php:
<?php echo autoVer(get_bloginfo('stylesheet_url')); ?>

Step 7/9: Avoid the wicked @import for WordPress child themes

WordPress child themes rely on @import in their CSS file to pull in the styles of their parent theme. Sadly, @import is a wicked load-blocker. Here’s a simple trick: if it’s a child theme, include both stylesheets via link in the website’s header, because stylesheets referenced via link can download in parallel. Remove the @import from your child theme’s CSS file and use the following PHP snippet in your header.php when linking your stylesheets:

<?php
    /*
     * Circumvent @import CSS for WordPress child themes
     * If we're in a child theme, build links for both parent and child CSS
     * This way, we can remove the @import from the child theme's style.css
     * CSS loaded via link can load simultaneously, while @import blocks loading
     * See: http://www.stevesouders.com/blog/2009/04/09/dont-use-import/
     */
    if (is_child_theme()) {
        // parent styles first, then the child theme's overrides
        echo '<link rel="stylesheet" href="'
            . autoVer(get_bloginfo('template_url') . '/style.css')
            . '" />' . "\n\t\t";
        echo '<link rel="stylesheet" href="'
            . autoVer(get_bloginfo('stylesheet_url'))
            . '" />' . "\n";
    } else {
        echo '<link rel="stylesheet" href="'
            . autoVer(get_bloginfo('stylesheet_url'))
            . '" />' . "\n";
    }
?>

Additionally, wrap print styles in @media print blocks within your unified stylesheet rather than linking a separate print stylesheet: as Stoyan Stefanov has pointed out, browsers will download all stylesheets no matter the medium.

Step 8/9: Unification Day

After combining your CSS into a unified stylesheet and making it load in a non-blocking manner, you should think about minifying it. There are plenty of decent minification tools around.

You can also load common JS libraries from Google’s and other public CDNs. No matter which libraries you use, unify and compress your local JS with e.g. the Google Closure Compiler, as shown below. If you’re using Google Analytics, check out this excellent GA snippet by Mathias Bynens, which you can easily integrate into your unified JS. If you’re not using jQuery on the frontend, deregister its default inclusion via your theme’s “functions.php”.
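
A typical invocation, assuming you have downloaded the Closure Compiler jar (the file names here are placeholders):

# Combine and compress two scripts into a single minified file
java -jar compiler.jar --js plugins.js --js main.js --js_output_file site.min.js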

The last thing to minify is the HTML source code output by WordPress. There are several minification plugins, either standalone or integrated into the WP Super Cache or W3 Total Cache plugins. I recommend “WP HTML Compression”, especially when used together with the Roots Theme.

Step 9/9: Famous Last Words

In case the imperfections of the code outputted by WordPress keep you up at night, your performance optimizations don’t have to end here. Check out the Roots Theme, which finally brings decent relative URL structures, clean navigation menus and much more to WordPress. It’s not as easy as installing a common theme or plugin, but worth it.

If you ever dreamed of using Nicole Sullivan’s excellent OOCSS high-performance CSS selectors with WordPress, check out PHP’s DOMDocument parser: you can filter the output of WordPress and set classes on all relevant DOM objects in your source so you can adhere to the OOCSS code standards. While parsing the entire page output WordPress generates may seem excessive, consider that it finally gives you total control over syntax and classes, while editors keep using the familiar WordPress interface to manage content. And with such a mighty & multi-layered caching system in place, the frontend won’t slow down at all.

So Long, and Thanks for All the Time

If you’ve reached the point where you’re worrying about the performance implications of CSS selectors and the execution time of PHP’s DOMDocument, it is time to thank you. You have very likely managed to reduce the loading time of your WordPress site by more than 3 seconds. With ~10,000 visitors per day, that saves them about 8 hours of time collectively, every day. That’s amazing and more than enough time to sit back and enjoy some of the season’s spirit – you’ve earned it!

(via calendar.perfplanet.com)