Hard disk reliability examined once more: HGST rules, Seagate is alarming


A year ago we got some insight into hard disk reliability when cloud backup provider Backblaze published its findings for the tens of thousands of disks that it operated. Backblaze uses regular consumer-grade disks in its storage because of the cheaper cost and good-enough reliability, but it also discovered that some kinds of disks fared extremely poorly when used 24/7.

A year later the company has collected even more data and drawn out even more differences between the different disks it uses.

For a second year, the standout reliability leader was HGST. Now a wholly owned subsidiary of Western Digital, HGST inherited the technology and designs from Hitachi (which itself bought IBM’s hard disk division). Across a range of models from 2 to 4 terabytes, the HGST models showed low failure rates; at worse, 2.3 percent failing a year. This includes some of the oldest disks among Backblaze’s collection; 2TB Desktop 7K2000 models are on average 3.9 years old, but still have a failure rate of just 1.1 percent.

At the opposite end of the spectrum are Seagate disks. Last year, the two 1.5TB Seagate models used by Backblaze had failure rates of 25.4 percent (for the Barracuda 7200.11) and 9.9 percent (for the Barracuda LP). Those units fared a little better this time around, with failure rates of 23.8 and 9.6 percent, even though they were the oldest disks in the test (average ages of 4.7 and 4.9 years, respectively). However, their poor performance was eclipsed by the 3TB Barracuda 7200.14 units, which had a whopping 43.1 percent failure rate, in spite of an average age of just 2.2 years.

Backblaze’s storage is largely split between Seagate and HGST disks. HGST’s parent company, Western Digital, is almost absent, not because its disks are bad, but because they came out as consistently more expensive than those from Seagate and HGST.


Newer Seagate disks also show more encouraging results. Although still young, at an average age of just 0.9 years, the 4TB HDD.15 models show a reasonably low 2.6 percent failure rate. Coupled with their low price—Backblaze says that they tend to undercut HGST’s disks—they’ve become the company’s preferred hard drive model.

As before, this doesn’t mean that anyone with a Seagate disk is at risk of an imminent hard disk failure (though you should always have backups!). Backblaze operates disks outside of the manufacturer’s specified parameters. Significantly, most consumer-grade disks aren’t intended to be heavily used 24/7; they’re meant to be operational for about 8 hours a day and replaced every 3 to 5 years. Most home usage environments are likely to be lower in vibration than Backblaze’s 45-disk storage pods, too. In more normal conditions, the Seagates are likely to fare much better.

(Via Arstechnica.com)

StackExchange’s Performance Dashboard

6148448748_ee8eedd346_mStackExchange created a very cool performance dashboard that looks to be updated from real system metrics. Wouldn’t it be fascinating if every site had a similar dashboard?

The dashboard contains information like there are 560 million page views per month, 260,000 sustained connections,  34 TB data transferred per month, 9 web servers with 48GB of RAM handling 185 req/s at 15% CPU usage. There are 4 SQL servers, 2 redis servers, 3 tag engine servers, 3 elasticsearch servers, and 2 HAProxy servers, along with stats on each.

There’s also an excellent discussion thread on reddit that goes into more interesting details, with questions being answered by folks from StackExchange.

StackExchange is still doing innovative work and is very much an example worth learning from. They’ve always danced to their own tune and it’s a catchy tune at that. More at StackOverflow Update: 560M Pageviews A Month, 25 Servers, And It’s All About Performance.

( via HighScalability.com )

StackOverflow Update: 560M Pageviews A Month, 25 Servers, And It’s All About Performance

16238755496_4a3014ebbb_mThe folks at Stack Overflow remain incredibly open about what they are doing and why. So it’s time for another update. What has Stack Overflow been up to?

The network of sites that make up StackExchange, which includes StackOverflow, is now ranked 54th for traffic in the world; they have 110 sites and are growing at a rate of 3 or 4 a month; 4 million users; 40 million answers; and 560 million pageviews a month.

This is with just 25 servers. For everything. That’s high availability, load balancing, caching, databases, searching, and utility functions. All with a relative handful of employees. Now that’s quality engineering.

This update is based on The architecture of StackOverflow (video) by Marco Cecconi and What it takes to run Stack Overflow (post) by Nick Craver. In addition, I’ve merged in comments from various sources. No doubt some of the details are out of date as I meant to write this article long ago, but it should still be representative.

Stack Overflow still uses Microsoft products. Microsoft infrastructure works and is cheap enough, so there’s no compelling reason to change. Yet SO is pragmatic. They use Linux where it makes sense. There’s no purity push to make everything Linux or keep everything Microsoft. That wouldn’t be efficient.

Stack Overflow still uses a scale-up strategy. No clouds in site. With their SQL Servers loaded with 384 GB of RAM and 2TB of SSD, AWS would cost a fortune. The cloud would also slow them down, making it harder to optimize and troubleshoot system issues. Plus, SO doesn’t need a horizontal scaling strategy. Large peak loads, where scaling out makes sense, hasn’t  been a problem because they’ve been quite successful at sizing their system correctly.

So it appears Jeff Atwood’s quote: “Hardware is Cheap, Programmers are Expensive”, still seems to be living lore at the company.

Marco Ceccon in his talk says when talking about architecture you need to answer this question first: what kind of problem is being solved?

First the easy part. What does StackExchange do? It takes topics, creates communities around them, and creates awesome question and answer sites.

The second part relates to scale. As we’ll see next StackExchange is growing quite fast and handles a lot of traffic. How does it do that? Let’s take a look and see….


  • StackExchange network has 110 sites growing at a rate of 3 or 4 a month.

  • 4 million users

  • 8 million questions

  • 40 million answers

  • As a network #54 site for traffic in the world

  • 100% year over year growth

  • 560 million pageviews a month

  • Peak is more like 2600-3000 requests/sec on most weekdays. Programming, being a profession, means weekdays are significantly busier than weekends.

  • 25 servers

  • 2 TB of SQL data all stored on SSDs

  • Each web server has 2x 320GB SSDs in a RAID 1.

  • Each ElasticSearch box has 300 GB also using SSDs.

  • Stack Overflow has a 40:60 read-write ratio.

  • DB servers average 10% CPU utilization

  • 11 web servers, using IIS

  • 2 load balancers, 1 active, using HAProxy

  • 4 active database nodes, using MS SQL

  • 3 application servers implementing the tag engine, anything searching by tag hits

  • 3 machines doing search with ElasticSearch

  • 2 machines for distributed cache and messaging using Redis

  • 2 Networks (each a Nexus 5596 + Fabric Extenders)

  • 2 Cisco 5525-X ASAs (think Firewall)

  • 2 Cisco 3945 Routers

  • 2 read-only SQL Servers for used mainly for the Stack Exchange API

  • VMs also perform functions like deployments, domain controllers, monitoring, ops database for sysadmin goodies, etc.


  • ElasticSearch

  • Redis

  • HAProxy

  • MS SQL

  • Opserver

  • TeamCity

  • Jil – Fast .NET JSON Serializer, built on Sigil

  • Dapper – a micro ORM.


  • The UI has message inbox that is sent a message when you get a new badge, receive a message, significant event, etc. Done using WebSockets and is powered by redis.

  • Search box is powered by ElasticSearch using a REST interface.

  • With so many questions on SO it was impossible to just show the newest questions, they would change too fast, a question every second. Developed an algorithm to look at your pattern of behaviour and show you which questions you would have the most interest in. It’s uses complicated queries based on tags, which is why a specialized Tag Engine was developed.

  • Server side templating is used to generate pages.


  • The 25 servers are not doing much, that is the CPU load is low. It’s calculated SO could run on only 5 servers.

  • The database server is at 10%, except when it bursts while performing a backups.

  • How so low? The databases servers have 384GB of RAM and the web servers are at 10%-15% CPU usage.

  • Scale-up is still working. Other scale-out sites with a similar number of pageviews tend to run on 100, 200, up to 300 servers.

  • Simple system. Built on .Net. Have only 9 projects, others systems have 100s. Reason to have so few projects is is so compilation is lightning fast, which requires planning at the beginning. Compilation takes 10 seconds on a single computer.

  • 110K lines of code. A small number given what it does.

  • This minimalist approach comes with some problems. One problem is not many tests. Tests aren’t needed because there’s a great community. Meta.stackoverflow is a discussion site for the community and where bugs are reported. Meta.stackoverflow is also a beta site for new software. If users find any problems with it they report the bugs that they’ve found, sometimes with solution/patches.

  • Windows 2012 is used in New York but are upgrading to 2012 R2  (Oregon is already on it). For Linux systems it’s Centos 6.4.

  • Load is really almost all over 9 servers, because 10 and 11 are only for meta.stackexchange.com, meta.stackoverflow.com, and the development tier. Those servers also run around 10-20% CPU which means we have quite a bit of headroom available.


  • Intel 330 as the default (web tier, etc.)

  • Intel 520 for mid tier writes like Elastic Search

  • Intel 710 & S3700 for the database tier. S3700 is simply the successor to the high endurance 710 series.

  • Exclusively RAID 1 or RAID 10 (10 being any arrays with 4+ drives). Failures have not been a problem, even with hundreds of intel 2.5″ SSDs in production, a single one hasn’t failed yet. One or more spare parts are kept for each model, but multiple drive failure hasn’t been a concern.

  • ElasticSearch performs much better on SSDs, given SO writes/re-indexes very frequently.

  • SSD changes the use of search. Lucene.net couldn’t handle SO’s concurrent workloads due to locking issues, so they moved to ElasticSearch. It turns out locks around the binary readers really aren’t necessary in an all SSD environment.

  • The only scale-up problems so far is SSD space on the SQL boxes due to the growth pattern of reliability vs. space in the non-consumer space, that isdrives that have capacitors for power loss and such.

High Availability

  • The main datacenter is in New York and the backup datacenter is in Oregon.

  • Redis has 2 slaves, SQL has 2 replicas, tag engine has 3 nodes, elastic has 3 nodes – any other service has high availability as well (and exists in both data centers).

  • Not everything is slaved between data centers (very temporary cache data that’s not needed to eat bandwidth by syncing, etc.) but the big items are, so there is still a shared cache in case of a hard down in the active data center. A start without a cache is possible, but it isn’t very graceful.

  • Nginx was used for SSL, but a transition has been made to using HAProxy to terminate SSL.

  • Total HTTP traffic sent is only about 77% of the total traffic sent. This is because replication is happening to the secondary data center in Oregon as well as other VPN traffic. The majority of this traffic is the data replication to SQL replicas and redis slaves in Oregon.


  • MS SQL Server.

  • Stack Exchange has one database per-site, so Stack Overflow gets one, Super User gets one, Server Fault gets one, and so on. The schema for these is the same. This approach of having different database is effectively a form of partitioning and horizontal scaling.

  • In the primary data center (New York) there is usually 1 master and 1 read-only replica in each cluster. There’s also 1 read-only replica (async) in the DR data center (Oregon). When running in Oregon then the primary is there and both of the New York replicas are read-only and async.

  • There are a few wrinkles. There is one “network wide” database which has things like login credentials, and aggregated data (mostly exposed through stackexchange.com user profiles, or APIs).

  • Careers Stack Overflow, stackexchange.com, and Area 51 all have their own unique database schema.

  • All the schema changes are applied to all site databases at the same time. They need to be backwards compatible so, for example, if you need to rename a column – a worst case scenario – it’s a multiple steps process: add a new column, add code which works with both columns, back fill the new column, change code so it works with the new column only, remove the old column.

  • Partitioning is not required. Indexing takes care of everything and the data just is not large enough. If something warrants a filtered indexes, why not make it way more efficient? Indexing only on DeletionDate = Null and such is a common pattern, others are specific FK types from enums.

  • Votes are in 1 table per item, for example 1 table for post votes, 1 table for comment votes. Most pages we render real-time, caching only for anonymous users. Given that, there’s no cache to update, it’s just a re-query.

  • Scores are denormalized, so querying is often needed. It’s all IDs and dates, the post votes table just has 56,454,478 rows currently. Most queries are just a few milliseconds due to indexing.

  • The Tag Engine is entirely self-contained, which means not having to depend on an external service for very, very core functionality. It’s a huge in-memory struct array structure that is optimized for SO use cases and precomputed results for heavily hit combinations. It’s a simple windows service running on a few boxes working in a redundant team. CPU is about 2-5% almost always. Three boxes are not needed for load, just redundancy. If all those do fail at once, the local web servers will load the tag engine in memory and keep on going.

  • On Dapper’s lack of a compiler checking queries compared to traditional ORM. The compiler is checking against what you told it the database looks like. This can help with lots of things, but still has the fundamental disconnect problem you’ll get at runtime. A huge problem with the tradeoff is the generated SQL is nasty, and finding the original code it came from is often non-trivial. Lack of ability to hint queries, control parameterization, etc. is also a big issue when trying to optimize queries. For example. literal replacement was added to Dapper to help with query parameterization which allows the use of things like filtered indexes. Dapper also intercepts the SQL calls to dapper and add add exactly where it came from. It saves so much time tracking things down.



  • The process:

    • Most programmers work remotely. Programmers code in their own batcave.

    • Compilation is very fast.

    • Then the few test that they have are run.

    • Once compiled, code is moved to a development staging server.

    • New features are hidden via feature switches.

    • Runs on same hardware as the rest of the sites.

    • It’s then moved to Meta.stackoverflow for testing. 1000 users per day use the site, so its a good test.

    • If it passes it goes live on the network and is tested by the larger community.

  • Heavy usage of static classes and methods, for simplicity and better performance.

  • Code is simple because the complicated bits are packaged in a library and open sourced and maintained. The number of .Net projects stays low because community shared parts of the code are used.

  • Developers get two or three monitors. Screens are important, they help you be productive.


  • Cache all the things.

  • 5 levels of caches.

  • 1st: is the network level cache: caching in the browser, CDN, and proxies.

  • 2nd: given for free by the .Net framework and is called the HttpRuntime.Cache. An in-memory, per server cache.

  • 3rd: Redis. Distributed in-memory key-value store. Share cache elements across different servers that serve the same site. If StackOverflow has 9 servers then all servers will be able to find the same cached items.

  • 4th: SQL Server Cache. The entire database is cached in-memory. The entire thing.

  • 5th: SSD. Usually only hit when the SQL server cache is warming up.

  • For example, every help page is cached. Code to access a page is very terse:

    • Static methods and static classes re used. Really bad from an OOP perspective, but really fast and really friendly towards terse code. All code is directly addressed.

    • Caching is handled by a library layer of Redis and Dapper, a micro ORM.

  • To get around garbage collection problems, only one copy of a class used in templates are created and kept in a cache. Everything is measured, including GC operation, from statistics it is known that layers of indirection increase GC pressure to the point of noticeable slowness.

  • CDN hits vary, since the  query string hash is based on file content, it’s only re-fetched on a build. It’s typically 30-50 million hits a day for 300 to 600 GB of bandwidth.

  • A CDN is not used for CPU or I/O load, but to help users find answers faster.


  • Want to deploy 5 times a day. Don’t build grand gigantic things and then put then live. Important because:

    • Can measure performance directly.

    • Forced to build the smallest thing that can possibly work.

  • TeamCity builds then copies to each web tier via a powershell script. The steps for each server are:

    • Tell HAProxy to take the server out of rotation via a POST

    • Delay to let IIS finish current requests (~5 sec)

    • Stop the website (via the same PSSession for all the following)

    • Robocopy files

    • Start the website

    • Re-enable in HAProxy via another POST

  • Almost everything is deployed via puppet or DSC, so upgrading usually consist of just nuking the RAID array and installing from a PXE boot. it’s very fast and you know it’s done right/repeatable.


  • Teams:

    • SRE (System Reliability Engineering): – 5 people

    • Core Dev (Q&A site) : ~6-7 people

    • Core Dev Mobile: 6 people

    • Careers team that does development solely for the SO Careers product: 7 people

  • Devops and developer teams are really close-knit.

  • There’s a lot of movement between teams.

  • Most employees work remotely.

  • Offices are mostly sales, Denver and London exclusively so.

  • All else equal, it is slightly prefered to have people in NYC, because the in-person time is a plus for the casual interaction that happens in between “getting things done”. But the set up makes it possible to do real work and official team collaboration works almost entirely online.

  • They’ve learned that the in-person benefit is more than outweighed by how much you get from being able to hire the best talent that loves the product anywhere, not just the ones willing to live in the city you happen to be in.

  • The most common reason for someone going remote is starting a family. New York’s great, but spacious it is not.

  • Offices are in Manhattan and a lot of talent is there. The data center needs to not be a crazy distance away since it is always being improved. There’s also a slightly faster connection to many backbones in the NYC location – though we’re talking only a few milliseconds (if that) of difference there.

  • Making an awesome team: Love geeks. Early Microsoft, for example, was full of geeks and they conquered the world.

  • Hire from Stack Overflow community. They looks for a passion for coding, a passion for helping others, and a passion for communicating.


  • Budgets are pretty much project based. Money is only spent as infrastructure is added for new projects. The web servers that have such low utilization are the same ones purchased 3 years ago when the data center was built.


  • Move fast and break things. Push it live.

  • Major changes are tested by pushing them. Development has an equally powerful SQL server and it runs on the same web tier, so performance testing isn’t so bad.

  • Very few tests. Stack Overflow doesn’t use many unit tests because of their active community and heavy usage of static code.

  • Infrastructure changes. There’s 2 of everything, so there’s a backup with the old configuration whenever possible, with a quick failback mechanism. For example, keepalived does failback quickly between load balancers.

  • Redundant systems fail over pretty often just to do regular maintenance. SQL backups are tested by having a dedicated server just for restoring them, constantly (that’s a free license – do it). Plan to start full data center failovers every 2 months or so – the secondary data center is read-only at all other times.

  • Unit tests, integration tests and UI tests run on every push. All the tests must succeed before a production build run is even possible. So there’s some mixed messages going on about testing.

  • The things that obviously should have tests have tests. That means most of the things that touch money on the Careers product, and easily unit-testable features on the Core end (things with known inputs, e.g. flagging, our new top bar, etc), for most other things we just do a functionality test by hand and push it to our incubating site (formerly meta.stackoverflow, now meta.stackexchange).

Monitoring / Logging

  • Now considering using http://logstash.net/ for log management. Currently a dedicated service inserts the syslog UDP traffic into a SQL database. Web pages add headers for the timings on the way out which are captured with HAProxy and are included in the syslog traffic.

  • Opserver and Realog. are how many metrics are surfaced. Realog is a logging display system built by Kyle Brandt and Matt Jibson in Go

  • Logging is from the HAProxy load balancer via syslog instead of via IIS. This is a lot more versatile than IIS logs.


  • Hardware is cheaper than developers and efficient code. You are only as fast as your slowest bottleneck and all the current cloud solutions have fundamental performance or capacity limits.
  • Could you build SO well if building for the cloud from day one? Mostl likely. Could you consistency render all your pages performing several up to date queries and cache fetches across that cloud network you don’t control and getting sub 50ms render times? That’s another matter. Unless you’re talking about substantially higher cost (at least 3-4x), the answer is no – it’s still more economical for SO to host in their own servers.

Performance As A Feature

  • StackOverflow puts a heavy emphasis on performance. The goal for the main page  is to load in less than 50ms, but can be as low as 28ms.

  • Programmers are fanatic about reducing page load times and improving the user experience.

  • Timings for every single request to the network are recorded. With these kind of metrics you can make decisions on where to improve your system.

  • The primary reason their servers run at such low utilization is efficient code. Web servers average between 5-15% CPU, 15.5 GB of RAM used and 20-40 Mb/s network traffic.  The SQL servers average around 5-10% CPU, 365 GB of RAM used, and 100-200 Mb/s of network traffic. This has three major benefits: general room to grow before and upgrade is necessary; headroom to stay online for when things go crazy (bad query, bad code, attacks, whatever it may be); and the ability to clock back on power if needed.

Lessons Learned

  • Why use Redis if you use MS products? gabeech: It’s not about OS evangelism. We run things on the platform they run best on. Period. C# runs best on a windows machine, we use IIS. Redis runs best on a *nix machine we use *nix.

  • Overkill as a strategy. Nick Craver on why their network is over provisioned: Is 20 Gb massive overkill? You bet your ass it is, the active SQL servers average around 100-200 Mb out of that 20 Gb pipe.  However, things like backups, rebuilds, etc. can completely saturate it due to how much memory and SSD storage is present, so it does serve a purpose.

  • SSDs Rock. The database nodes all use SSD and the average write time is 0 milliseconds.

  • Know your read/write workload.

  • Keeping things very efficient means new machines are not needed often. Only when a new project comes along that needs different hardware for some reason is new hardware added. Typically memory is added, but other than that efficient code and low utilization means it doesn’t need replacing. So typically talking about adding a) SSDs for more space, or b) new hardware for new projects.

  • Don’t be afraid to specialize. SO uses complicated queries based on tags, which is why a specialized Tag Engine was developed.

  • Do only what needs to be done. Tests weren’t necessary because an active community did the acceptance testing for them. Add projects only when required. Add a line of code only when necessary. You Aint Gone Need It really works.

  • Reinvention is OK. Typical advice is don’t reinvent the wheel, you’ll just make it worse, by making it square, for example. At SO they don’t worry about making a “Square Wheel”. If developers can write something more lightweight than an already developed alternative, then go for it.

  • Go down to the bare metal. Go into the IL (assembly language of .Net). Some coding is in IL, not C#. Look at SQL query plans. Take memory dumps of the web servers to see what is actually going on. Discovered, for example, a split call generated 2GB of garbage.

  • No bureaucracy. There’s always some tools your team needs. For example, an editor, the most recent version of Visual Studio, etc. Just make it happen without a lot of process getting in the way.

  • Garbage collection driven programming. SO goes to great lengths to reduce garbage collection costs, skipping practices like TDD, avoiding layers of abstraction, and using static methods. While extreme, the result is highly performing code. When you’re doing hundreds of millions of objects in a short window, you can actually measure pauses in the app domain while GC runs. These have a pretty decent impact on request performance.

  • The cost of inefficient code can be higher than you think.  Efficient code stretches hardware further, reduces power usage, makes code easier for programmers to understand.

( via HighScalability.com )

The Stunning Scale Of AWS And What It Means For The Future Of The Cloud

16238755496_4a3014ebbb_mJames Hamilton, VP and Distinguished Engineer at Amazon, and long time blogger of interesting stuff, gave an enthusiastic talk at AWS re:Invent 2014 on AWS Innovation at Scale. He’s clearly proud of the work they are doing and it shows.

James shared a few eye popping stats about AWS:

  • 1 million active customers
  • All 14 other cloud providers combined have 1/5th the aggregate capacity of AWS (estimate by Gartner)
  • 449 new services and major features released in 2014
  • Every day, AWS adds enough new server capacity to support all of Amazon’s global infrastructure when it was a $7B annual revenue enterprise (in 2004).
  • S3 has 132% year-over-year growth in data transfer
  • 102Tbps network capacity into a datacenter.

The major theme of the talk is the cloud is a different world. It’s a special environment that allows AWS to do great things at scale, things you can’t do, which is why the transition from on premise x86 servers to the public cloud is happening at a blistering pace. With so many scale driven benefits to the public cloud, it’s a transition that can’t be stopped. The cloud will keep getting more reliable, more functional, and cheaper at a rate that you can’t begin to match with your limited resources, generalist gear, bloated software stacks, slow supply chains, and outdated innovation paradigms.

That’s the PR message at least. But one thing you can say about Amazon is they are living it. They are making it real. So a healthy doubt is healthy, but extrapolating out the lines of fate would also be wise.

One of the fickle finger of fate advantages AWS has is resources. At one million customers they have the scale to keep the engine of expansion and improvement going. Profits aren’t being taken out, money is being reinvested. This is perhaps the most important advantage of scale.

But money without smarts is simply waste. Amazon wants you to know they have the smarts. We’ve heard how Google and Facebook build their own gear, Amazon does too. They build their own networking gear, networking software, racks, and they work with Intel to get faster processor versions of processors than are available on the market. The key is they know everything and control everything about their environment, so they can build simpler gear that does exactly what they want, which turns out to be cheaper and more reliable in the end.

Complete control allows quality metrics to be built into everything. Metrics drive a constant quality increase in all parts of the system, which is why against all odds AWS is getting more reliable as the pace of innovation quickens. Great pools of actionable data turned into knowledge is another huge advantage of scale.

Another thing AWS can do that you can’t is the Availability Zone architecture itself. Each AZ is its own datacenter and AZs within a region are located very close together. This reduces messaging latencies, which means state can be synchronously replicated between AZs, which greatly improves availability compared to the typical approach where redundant datacenters are very far apart.

It’s a talk rich with information and…well, spunk. The real meta-theme of the talk is how Amazon consciously uses scale to their competitive advantage. For Amazon scale isn’t just an expense to be dealt with, scale is a resource to exploit, if you know how.

Here’s my gloss of James Hamilton’s incredible talk…

Everything In The Talk Has A Foundation In Scale

  • Every day, AWS adds enough new server capacity to support all of Amazon’s global infrastructure when it was a $7B annual revenue enterprise (in 2004).

  • 365 days a year component manufacturers have to get gear to server and storage manufacturers, the server and storage manufacturers have to produce the gear and push it into the logistics channel, it has to get from the logistics channel over to the correct datacenter, it has to arrive at the loading dock, people have be there to wheel the racks to the right place in the DC, there has to be power, cooling, networking, the app stack has to be loaded up, it has to be tested, it has to be released to customers.

  • S3 usage: 132% year-over-year growth in data transfer; EC2 usage: 99% YoY usage growth; AWS overall business: over 1 million active customers.

  • All 14 other cloud providers combined have 1/5th the aggregate capacity of AWS (estimate by Gartner)

  • With over a million customers it means you are in a rich ecosystem. You have your pick of software vendors, if you have a problem someone has likely had it before, it’s easier to get your job done fast.

  • Such high growth means Amazon has the resources to keep reinvesting and innovating by increasing breadth and depth of services they offer.

  • Big Transitions generally occur when the economics are far superior, like mainframes to UNIX servers and then UNIX servers to x86 servers. These transitions usually take 10+ years. What’s different about the x86 on premise transition to the cloud is the speed at which it is happening.  The speed of the cloud transition is a function of great economic value along with the low friction for adoption. You don’t need software, you don’t hardware, you can just do it.

There Are Big Cost Problems In Networking

  • Networking is a red alert situation across the industry. It’s the perfect storm.

  • Problem #1: The cost of networking is escalating relative the cost of all other equipment. It’s anti-Moore’s law. All other gear is going down in cost, networking is getting relatively more expensive over time. Relative monthly costs: servers: 57%; networking equipment: 8%; power distribution and cooling: 18%; power: 13%; other: 4%.

  • Problem #2: At the same time networking is getting more expensive, the ratio of networking to compute is going up. That’s partly because Moore’s law is working (still) with servers and compute density is going up. Partly it’s because as the cost of compute falls the amount of advanced analytics performed goes up and analytics are network intensive. Solving big problems using a large number of servers requires a lot of networking. Network traffic has moved east-west rather than the traditional north-south direction.

  • Amazon’s solution 5 years ago was data driven and radical: they built to their own networking designs. Special routers were built. A team was hired to build the protocol stack all the way to the top. And they deployed all this themselves in their network. All services worldwide run on this gear.

    • This strategy turned out to be a lot cheaper. Just the support contract for networking gear was running 10s of millions of dollars.

    • Availability went up. The source of the improvement was simplicity. The problem AWS was trying to solve was simpler than the problem enterprise gear tries to solve. Enterprise gear must adhere to a lot of complicated specs that go unused and only make the system more fragile. By implementing just the functionality that was required meant a much simpler system which lead to higher availability. Any way to win is a good way to win.

    • A cornucopia of metrics. They measure everything. The rule is if a customer has a bad experience using their system their metrics must show it. This means metrics are getting better all the time because customer problems drive better metrics. Once you have metrics that accurately reflect customer experience then goals can be set on making the system better. Every week improvements are made to make things better. If the code didn’t start off better, it gets better over time.

    • Testability. Their gear was better because they tested it better. Enterprise gear is hard to test at scale. They created a $40 million test bed of 8000 servers (3 megawatts). But since this was the cloud they effectively rented the servers for a few months, so it was relatively cheap.

Networking Explained Layer By Layer, From The Very Top To The Network Interface Card

AWS Worldwide Network Backbone

  • 11 AWS regions worldwide. Choose which ones to use by nearness to customers or required jurisdictional boundaries.

  • Private fiber links interconnect most of the major regions. This avoids all the capacity planning problems (Amazon can do better capacity planning), peering issues, and buffering problems that occur on public links. So it’s faster to run their own network, it’s more reliable, cheaper, and lower latency.

Example AWS Region (US East ((Northern Virginia))

  • All regions have at least two availability zones. US East has five AZs.

  • Redundant paths run to transit centers.

  • Each region has redundant transit centers. A transit center connects private links to other AWS regions, private links to AWS Direct Connect customers, and to the internet through peering and paid transit.

  • If one AZ fails all the other AZs keep working.

  • Metro-area DWDM links between AZs

  • 82,864 fiber strands in a regions

  • AZs are less than 2ms apart and usually less than 1ms apart. From a latency perspective they are fairly close, within a few kilometers. Far enough apart for safety, close enough for good latency.

  • 25Tbps peak traffic between AZs

  • AWS offers AZs because:

    • With a single hardened datacenter the best reliability you’ll get is around 99.9% over a mix of applications over a large period of time. High reliability requires running in two datacenters. Traditionally datacenter diversity is from two datacenters that are very far apart because it’s not cost effective to keep datacenters close together. This means longer latencies. LA to NEW is 74ms round trip. Committing to an SSD is 1 to 2ms. You can’t wait 70+ milliseconds for a transaction to commit. Which means applications commit locally and then replicate to the second datacenter. This design in a failure case loses data during the failover. While a true failure is rare, like a building burning down, transient failures are more common, like a load balancer problem for example. So would you failover your connection was down for 3 minutes? No, because data would be lost and it would take a long time to recover that data from other sources. So you lose availability for common events.

    • AZs are milliseconds apart so you can commit to both at the same time. That means if you failover a customer won’t be able to tell because the data replication was synchronous. It’s invisible. It’s hard to write code for this model so you won’t do it for everything. And for some apps a concern for multi-AZ failures might also prevent you from using multiple AZs, but for the rest of apps this is a very powerful model. It’s more costly, but it gives AWS certain advantages.

Example AWS Availability Zone

  • An Availability Zone is always a datacenter in a completely independent building.

  • Amazon has 28+ datacenters. The plus means there are more datacenters in an AZ as a way of extending capacity for an AZ. More datacenters are added within an AZ to extend the capacity of an AZ. Otherwise you would be forced to split your app across AZs, even if you didn’t want to.

  • Some AZs have size fairly substantial datacenters.

  • DCs in an AZ are less than ¼ms apart.

Example Datacenter

  • AWS datacenters are purposely not gigantic. A single datacenter is 25 – 30 megawatts, with between 50,000 – 80,000 servers

  • The return on datacenter largeness diminishes. The advantage of datacenter scale as you build bigger and bigger goes down. Early advantages are huge. Later advantages are small. Going from 2000 to 2500 racks is a little better. A tiny datacenter is too expensive. A really large datacenter is only marginally more expensive per rack than a medium datacenter.

  • Risk increases with larger datacenters. The blast radius if something goes wrong and the datacenter is destroyed, the loss is too big.

  • 102Tbps network capacity into a datacenter.

Example Rack, Server & NIC

  • The only thing that matters for latency is the software latency at either end of a connection. Latency between two servers when sending a message:

    • Your app -> guest OS -> hypervisor -> NIC : the latency is milliseconds

    • through the NIC : the latency is microseconds

    • over the fiber : the latency is nanoseconds

  • SR-IOV (Single Root I/O Virtualization) allows a NIC to provide virtualize in hardware network cards. Each guest gets its own network card. The benefit is > 2x average latency reduction and > 10x latency jitter improvement. This means outliers a down to a 1/10th of what they were. SR-IOV is being deployed now on newer instances types and will eventually be everywhere. The hard part wasn’t adding SR-IOV, it was adding isolation, metering, DDoS protection, and the capacity limits that make SR-IOV useful in a cloud environment.

AWS Custom Server & Storage Designs

  • Cost of a negative situation is not high so expensive unneeded protection can be removed. Servers are designed for what they do, not a general population of users. Amazon knows exactly in what environment the server will run in, they’ll know exactly when something goes wrong, so the servers can be designed with less engineering headroom. The cost of server failure is not that big for them. They are already on site and are very good at replacing harddisks, etc. So a lot of the carefulness in enterprise equipment is not necessary.

  • Processors can be pushed harder. They know the cooling requirements, they influence the mechanical design, they just design good servers, so they can get more performance out of a server. Though a partnership with Intel Amazon has processors that run faster than can be bought on the open market.

  • An example is the design for their own storage rack. It weighs over a ton, 19” wide, and holds 864 disk drives. For some workloads this a wonderful game changing design that helps them get better prices in some areas.

Power Infrastructure

  • Amazon has designed and built their own power substations. It only saves a little money, but they can build them much faster. Utility companies are not used to dealing with the rate AWS is growing at, so they had to build their own.

  • 3 100% carbon neutral regions: US West (Oregon), AWS GovCloud (US), EU (Frankfurt)

Rapid Pace Of Innovation

  • 449 new services and major features released in 2014. 24 in 2008, 48 in 2009, 61 in 2010, 82 in 2011, 159 in 2012, 280 in 2013.

  • AWS is getting more reliable as the pace of innovation quickens. Their goal is to make available to customers the same tools that use to achieve this rate of innovation and high quality.

( via HighScalability.com )

What is a Monolith?

There is currently a strong trend for microservice based architectures and frequent discussions comparing them to monoliths. There is much advice about breaking-up monoliths into microservices and also some amusing fights between proponents of the two paradigms – see the great Microservices vs Monolithic Melee. The term ‘Monolith’ is increasingly being used as a generic insult in the same way that ‘Legacy’ is!

However, I believe that there is a great deal of misunderstanding about exactly what a ‘Monolith’ is and those discussing it are often talking about completely different things.

A monolith can be considered an architectural style or a software development pattern (or anti-pattern if you view it negatively). Styles and patterns usually fit into different Viewtypes (a viewtype is a set, or category, of views that can be easily reconciled with each other [Clements et al., 2010]) and some basic viewtypes we can discuss are:

  • Module – The code units and their relation to each other at compile time.
  • Allocation – The mapping of the software onto its environment.
  • Runtime – The static structure of the software elements and how they interact at runtime.

A monolith could refer to any of the basic viewtypes above.

Module Monolith

If you have a module monolith then all of the code for a system is in a single codebase that is compiled together and produces a single artifact. The code may still be well structured (classes and packages that are coherent and decoupled at a source level rather than a big-ball-of-mud) but it is not split into separate modules for compilation. Conversely a non-monolithic module design may have code split into multiple modules or libraries that can be compiled separately, stored in repositories and referenced when required. There are advantages and disadvantages to both but this tells you very little about how the code is used – it is primarily done for development management.




Allocation Monolith

For an allocation monolith, all of the code is shipped/deployed at the same time. In other words once the compiled code is ‘ready for release’ then a single version is shipped to all nodes. All running components have the same version of the software running at any point in time. This is independent of whether the module structure is a monolith. You may have compiled the entire codebase at once before deployment OR you may have created a set of deployment artifacts from multiple sources and versions. Either way this version for the system is deployed everywhere at once (often by stopping the entire system, rolling out the software and then restarting).

A non-monolithic allocation would involve deploying different versions to individual nodes at different times. This is again independent of the module structure as different versions of a module monolith could be deployed individually.



Runtime Monolith

A runtime monolith will have a single application or process performing the work for the system (although the system may have multiple, external dependencies). Many systems have traditionally been written like this (especially line-of-business systems such as Payroll, Accounts Payable, CMS etc).

Whether the runtime is a monolith is independent of whether the system code is a module monolith or not. A runtime monolith often implies an allocation monolith if there is only one main node/component to be deployed (although this is not the case if a new version of software is rolled out across regions, with separate users, over a period of time).


Note that my examples above are slightly forced for the viewtypes and it won’t be as hard-and-fast in the real world.


Be very carefully when arguing about ‘Microservices vs Monoliths’. A direct comparison is only possible when discussing the Runtime viewtype and properties. You should also not assume that moving away from a Module or Allocation monolith will magically enable a Microservice architecture (although it will probably help). If you are moving to a Microservice architecture then I’d advise you to consider all these viewtypes and align your boundaries across them i.e. don’t just code, build and distribute a monolith that exposes subsets of itself on different nodes.

(Via Codingthearchitecture.com)


Async Fragments: Rediscovering Progressive HTML Rendering with Marko

At eBay, we take site speed very seriously and are always looking for ways to allow developers to create faster-loading web apps. This involves fully understanding and controlling how web pages are delivered to web browsers. Progressive HTML rendering is a relatively old technique that can be used to improve the performance of websites, but it has been lost in a whole new class of web applications. The idea is simple: give the web browser a head start in downloading and rendering the page by flushing out early and multiple times. Browsers have always had the helpful feature of parsing and responding to the HTML as it is being streamed down from the server (even before the response is ended). This feature allows the HTML and external resources to be downloaded earlier, and for parts of the page to be rendered earlier. As a result, both the actual load time and the perceived load time improve.

In this blog post, we will take an in-depth look at a technique we call “Async Fragments” that takes advantage of progressive HTML rendering to improve site speed in ways that do not drastically complicate how web applications are built. For concrete examples we will be using Node.js,Express.js and the Marko templating engine (a JavaScript templating engine that supports streaming, flushing, and asynchronous rendering). Even if you are not using these technologies, this post can give you insight into how your stack of choice could be further optimized.

To see the techniques discussed in this post in action, please take a look at the accompanying sample application.


Progressive HTML rendering is discussed in the post The Lost Art of Progressive HTML Rendering by Jeff Atwood, which was published back in 2005. In addition, the “Flush the Buffer Early” rule is described by the Yahoo! Performance team in their Best Practices for Speeding Up Your Web Site guide. Stoyan Stefanov provides an in-depth look at progressive HTML rendering in his Progressive rendering via multiple flushes post. Facebook discussed how they use a technique they call “BigPipe” to improve page load times and perceived performance by dividing up a page into “pagelets.” Those articles and techniques inspired many of the ideas discussed in this post.

In the Node.js world, its most popular web framework, Express.js, unfortunately recommends a view rendering engine that does not allow streaming and thus prevents progressive HTML rendering. In a recent post, Bypassing Express View Rendering for Speed and Modularity, I described how streaming can be achieved with Express.js; this post is largely a follow-up to discuss how progressive HTML rendering can be achieved with Node.js (with or without Express.js).

Without progressive HTML rendering

A page that does not utilize progressive HTML rendering will have a slower load time because the bytes will not be flushed out until the complete HTML response is built. In addition, after the client finally receives the complete HTML it will then see that it needs to download additional external resources (such as CSS, JavaScript, fonts, and images), and downloading these external resources will require additional round trips. In addition, pages that do not utilize progressive HTML rendering will also have a slower perceived load time, since the screen will not update until the complete HTML is downloaded and the CSS and fonts referenced in the <head> section are downloaded. Without progressive HTML rendering, a server/client waterfall chart might be similar to the following:


The corresponding page controller might look something like this:

function controller(req, res) {
            function loadSearchResults(callback) {
            function loadFilters(callback) {
            function loadAds(callback) {
        function() {
            var viewModel = { ... };
            res.render('search', viewModel);

As you can see in the above code, the page HTML is not rendered until all of the data is asynchronously loaded.

Because the HTML is not flushed until all back-end services are completed, the user will be staring at a blank screen for a large portion of the time. This will result in a sub-par user experience (especially with a poor network connection or with slow back-end services). We can do much better if we flush part of the HTML earlier.

Flushing the head early

A simple trick to improve the responsiveness of a website is to flush the head section immediately. The head section will typically include the links to the external CSS resources (i.e. the <link>tags), as well as the page header and navigation. With this approach the external CSS will be downloaded sooner and the initial page will be painted much sooner as shown in the following waterfall chart:


As you can see in the chart above, flushing the head early reduces the time to render the initial page. This technique improves the responsiveness of the page, but it does not significantly reduce the total time it takes to make the page fully functional. With this approach, the server is still waiting for all back-end services to complete before flushing the final HTML. In addition, downloading of external JavaScript resources will be delayed since <script> tags are placed at the end of the page (assuming you are following best practices) and don’t get sent out until the second and final flush.

Multiple flushes

Instead of flushing only the head early, it is often beneficial to flush multiple times before ending the response. Typically, a page can be divided into multiple fragments where some of the fragments may depend on data asynchronously loaded from various back-end services while others may not depend on any asynchronously loaded data. The fragments that depend on asynchronously loaded data should be rendered asynchronously and flushed as soon as possible.

For now, we will assume that these fragments need to be flushed in the proper HTML order (versus the order that the data asynchronously loads), but we will also show how out-of-order flushing can be used to further improve both page load times and perceived performance. When using “in-order” flushing, fragments that complete out of order will need to be buffered until they are ready to be flushed in the proper order.

In-order flushing of async fragments

As an example, let’s assume we have divided a complex page into the following fragments:


Each fragment is assigned a number based on the order that it appears in the HTML document. In code, our output HTML for the page might look like the following:

    <title>Clothing Store</title>
    <!-- 1a) Head <link> tags -->
        <!-- 1b) Header -->
    <div class="body">
            <!-- 2) Search Results -->
        <section class="filters">
            <!-- 3) Search filters -->
        <section class="ads">
            <!-- 4) Ads -->
        <!-- 5a) Footer -->
    <!-- 5b) Body <script> tags -->

The Marko templating engine provides a way to declaratively bind template fragments to asynchronous data provider functions (or Promises). An asynchronous fragment is rendered when the asynchronous data provider function invokes the provided callback with the data. If the asynchronous fragment is ready to be flushed, then it is immediately flushed to the output stream. Otherwise, if the asynchronous fragment completed out of order then the rendered HTML is buffered in memory until it is ready to be flushed. The Marko templating engine ensures that fragments are flushed in the proper order.

Continuing with the previous example, our HTML page template with asynchronous fragments defined will be similar to the following:

    <title>Clothing Store</title>
    <!-- Head <link> tags -->
        <!-- Header -->
    <div class="body">
            <!-- Search Results -->
            <async-fragment data-provider="data.searchResultsProvider"
                <!-- Do something with the search results data... -->
                    <li for="item in searchResults.items">
        <section class="filters">
            <!-- Search filters -->
            <async-fragment data-provider="data.filtersProvider"
                <!-- Do something with the filters data... -->
        <section class="ads">
            <!-- Ads -->
            <async-fragment data-provider="data.adsProvider"
                <!-- Do something with the ads data... -->
        <!-- Footer -->
    <!-- Body <script> tags -->

The data provider functions should be passed to the template as part of the view model as shown in the following code for a sample page controller:

function controller(req, res) {
            searchResultsProvider: function(callback) {
                performSearch(req.params.category, callback);
            filtersProvider: function(callback) {
            adsProvider: function(callback) {
        res /* Render directly to the output HTTP response stream */);

In this particular example, the “search results” async fragment appears first in the HTML template, and it happens to take the longest time to complete. As a result, all of the subsequent fragments will need to be buffered on the server. The resulting waterfall with in-order flushing of async fragments is shown below:


While the performance of this approach might be fine, we can enable out-of-order flushing for further performance gains as described in the next section.

Out-of-order flushing of async fragments

Marko achieves out-of-order flushing of async fragments by doing the following:

Instead of waiting for an async fragment to finish, a placeholder HTML element with an assigned id is written to the output stream. Out-of-order async fragments are rendered before the ending <body> tag in the order that they complete. Each out-of-order async fragment is rendered into a hidden <div> element. Immediately after the out-of-order fragment, a <script> block is rendered to replace the placeholder DOM node with the DOM nodes of the corresponding out-of-order fragment. When all of the out-of-order async fragments complete, the remaining HTML (e.g. </body></html>) will be flushed and the response ended.

To clarify, here is what the output HTML might look like for a page with out-of-order flushing enabled:

    <title>Clothing Store</title>
    <!-- 1a) Head <link> tags -->
        <!-- 1b) Header -->
    <div class="body">
            <!-- 2) Search Results -->
            <span id="asyncFragment0Placeholder"></span>
        <section class="filters">
            <!-- 3) Search filters -->
            <span id="asyncFragment1Placeholder"></span>
        <section class="ads">
            <!-- 4) Ads -->
            <span id="asyncFragment2Placeholder"></span>
        <!-- 5a) Footer -->
    <!-- 5b) Body <script> tags -->
    // Small amount of code to support rearranging DOM nodes
    // Unminified:
    // https://github.com/raptorjs/marko-async/blob/master/client-reorder-runtime.js
    <div id="asyncFragment1" style="display:none">
        <!-- 4) Ads content -->
    <div id="asyncFragment2" style="display:none">
        <!-- 3) Search filters content -->
    <div id="asyncFragment0" style="display:none">
        <!-- 2) Search results content -->

One caveat with out-of-order flushing is that it requires JavaScript running on the client to move each out-of-order fragment into its proper place in the DOM. Thus, you would only want to enable out-of-order flushing if you know that the client’s web browser has JavaScript enabled. Also, moving DOM nodes may cause the page to be reflowed, which can be visually jarring to the user and result in more client-side CPU usage. If reflow is an issue then there are tricks that can be used to avoid a reflow (e.g., reserving space as part of the initial wireframe). Marko also allows alternative content to be shown while waiting for an out-of-order async fragment.

To enable out-of-order flushing with Marko, the client-reorder="true" attribute must be added to each <async-fragment> tag, and the <async-fragments> tag must be added to the end of the page to serve as the container for rendered out-of-order fragments. Here is the updated<async-fragment> tag for the search results fragment:

<async-fragment data-provider="data.searchResultsProvider"

The updated HTML page template with the new <async-fragments> tag is shown below:

    <title>Clothing Store</title>
    <!-- Head <link> tags -->
    <!-- Body <script> tags -->

In combination with out-of-order flushing, it may be beneficial to move <script> tags that link to external resources to the end of the first chunk (before all of the out-of-order chunks). While the server is busy preparing the rest of the page, the client can start downloading the external JavaScript required to make the page functional. As a result, the user will be able to start interacting with the page sooner.

Our final waterfall with out-of-order flushing will now be similar to the following:


The final waterfall shows that the strategy of out-of-order flushing of asynchronous fragments can significantly improve the load time and perceived load time of a page. The user will be met with a progressive loading of a page that is ready to be interacted with sooner.

Additional considerations

HTTP transport and HTML compression

To allow HTML to be served in parts, chunked transfer encoding should be used for the HTTP response. Chunked transfer encoding uses delimiters to break up the response, and each flush results in a new chunk. If gzip compression is enabled (and it should be) then flushing the pending data to the gzip stream will result in a gzip data frame being written to the response as part of each chunk. Flushing too often will negatively impact the effectiveness of the compression algorithm, but without flushing periodically then progressive HTML rendering will not be available. By default, Marko will flush at the beginning of an <async-fragment> block (in order to send everything that has already completed), as well as when an async fragment completes. This default strategy results in efficient progressive loading of an HTML page as long as there are not too many async fragments.

Binding behavior

For improved usability and responsiveness, there should not be a long delay between rendered HTML being displayed to the user in the web browser and behavior being attached to the associated DOM. At eBay, we use the marko-widgets module to bind behavior to DOM nodes. Marko Widgets supports binding behavior to rendered widgets immediately after each independent async fragment, as illustrated in the accompanying sample app. For immediate binding to work, the required JavaScript code must be included earlier in the page. For more details, please see the marko-widgets module documentation.

Error handling

It is important to note that as soon as a byte is flushed for the HTTP body, then the response is committed; no additional HTTP headers can be sent (e.g., no server-side redirects or cookie-setting), and the HTML that has been sent cannot be “unsent”. Therefore, if an asynchronous data provider errors or times out, then the app must be prepared to show alternative content for thatparticular async fragment. Please see the documentation for the marko-async module for additional information on how to show alternative content in case of an error.


The Async Fragments technique allows web developers to maximize the benefits of progressive HTML rendering to produce web pages that have improved actual and perceived load times. Developers at eBay have found the concept of binding HTML template fragments to asynchronous data providers easy to grasp and utilize. In addition, the flexibility to support both in-order and out-of-order flushing of async fragments makes this technique applicable for all web browsers and user agents.

The Marko templating engine is being used as part of eBay’s latest Node.js stack to improve performance while also simplifying how pages are constructed on both the server and the client. Marko is one of a few templating engines for Node.js and web browsers that support streaming, flushing, and asynchronous rendering. Marko has a simple HTML-based syntax, and the Marko compiler produces small and efficient JavaScript modules as output. We encourage you to try Marko online and in your next Node.js project. Because Marko is a key component of eBay’s internal Node.js stack, and given that it is heavily documented and tested, you can be confident that it will be well supported.

(Via Calendar.perfplanet.com)

HTTP 2.0 is coming!

Why we need another version of HTTP protocol?

HTTP has been in use by the World-Wide Web global information initiative since 1990. However, it is December 2014 and we don’t have anymore simple pages with cross linked HTML documents as it used to be. Instead, we have Web applications, some of them very heavy and requiring a lot of resources. And unfortunately, the version of the HTTP protocol currently used – 1.1, has issues.

HTTP is actually very simple – browser sends request the server, server provides the response and that is it. Very simple, but if you check the chart below you’ll see that there is not only one request and one response, but multiple requests and responses – about 80 – 100 and 1.8MB of data:


Data provided by HTTP Archive.

Now, imagine we have a server in Los Angeles and our client is in Berlin, Germany. All those 80-100 requests should travel from Berlin to L.A. and then get back. That is not fast – for example, the roundtrip time between London and New York is about 56 ms. From Berlin to Los Angeles it is even more. And as we know, first page load is latency bound; latency is the constraining factor for today’s applications.

In order to speed up downloading the needed resources, browsers currently open multiple connections to the server (typically 6 per domain). However, opening a connection is expensive – there is domain name resolution, socket connect, more roundtrips if TLS should be established and so on. From browser point of view, this also means more consumed memory, connections management, using heuristics when to open a new connection and when to reuse existing one and so on.

Web engineers also tried to make sites loading as fast as possible. They invented many different workarounds (aka “optimizations”) like: image sprites, domain sharding, resource inlining, concatenating files, combo services and so on. Inventing more and more tricks may work to some point but what if we were able to fix the issues on the protocol level and avoid these workarounds?

HTTP/2 in a nutshell

HTTP/2 is binary protocol, where browser and server exchange frames. Each frame belongs to a stream via identifier. The key point is that these streams are multiplexed, they have priorities, priorities are specified by the client (browser) and they can be changed runtime. A stream can depend on another stream.

In contrary to HTTP 1.1, in HTTP/2 the headers are compressed. A special algorithm was invented for that purpose and it is called HPACK.

Server push is a feature of HTTP/2 which deserves special attention. Web Developers actually implemented the same idea for years – the mentioned above inlining of resources in the page is an example of that. Since this is on protocol level, instead of embedding CSS and JavaScript files or images directly on the page, server can explicitly push these resources to the browser in relation with a previously made request.

How does an HTTP/2 connection look like?


Image by Ilya Grigorik

In this example we have three streams:

  1. Client initiated request of page.html
  2. Stream, which carries script.js – the server initiated this stream, since it knows already what is the content of page.html and that script.js will be requested from the browser as soon as it parses the HTML.
  3. Stream, which carries style.css – initiated by the server, since this CSS is considered as critical and it will be required from the browser as soon as it parses the HTML file.

HTTP/2 is huge step forward. It is currently on its Draft-16 and the final specification will be ready very soon.

Optimizing the Web stack for HTTP/2 era

Does the above mean that we should discard all optimizations we were doing for years to make our Web applications as fast as possible? Of course not! This just means we have to forget about some of them and start applying others. For example, we still should send as less data as possible from the server to the client, to deal with caching and store files offline. In general, we can split the needed optimizations in two parts:

Optimize the content being served to the browser:

  • Minimizing JavaScript, CSS and HTML files
  • Removing redundant data from images
  • Optimize Critical Path CSS
  • Removing the CSS which is not needed on the page using tools like UnCSS before to send the page to the server
  • Properly specifying ETag to the files and setting far future expires headers
  • Using HTML 5 offline to store already downloaded files and minimize traffic on the next page load

Optimize the server and TCP stack:

  • Check your server and be sure the value of TCP’s Initial Congestion Window (initial cwnd) is 10 segments (IW10). If you use GNU/Linux, just upgrade to 3.2+ to get this feature and another important update – Proportional Rate Reduction for TCP
  • Disable Slow-Start Restart after idle
  • Check and enable if needed Window Scaling
  • Consider to use TCP Fast Open (TFO)

(for more information check the wonderful book “High Performance Browser Networking” by Ilya Grigorik)

We may consider to remove the following “optimizations”:

  • Joining files – nowadays many companies are striking for continues deployment which makes this challenging – a single line of code change invalidates the whole bundle. Also, it forces browser to wait until the whole file arrives before to start processing it
  • Domain sharding – loading resources from different domains in order to avoid browser’s limit of connections per domain (usually 6) is the first “optimization” to be removed. It causes retransmissions and unnecessary latency
  • Resource inlining – prevents caching and inflates the document in which they are being stored. Instead, consider to leave CSS, JavaScript and images as external resources
  • Image sprites – the problem with cache invalidation is present here too. Apart from that, image sprites force browser to consume more CPU and memory during the process of decoding the the entire sprite
  • Using cookie free domains

The new modules API from ES6 and HTTP/2

For years we were splitting our JavaScript code in modules, stored in different files. Since JavaScript did not provide module system prior to version 6, the community invented two different main formats – AMD and CommonJS. Of course, custom formats, like those used by YUI Loader existed too. In ECMAScript 6 however we have a brand new way of creating modules. The API for loading them looks like this:

Declarative syntax:

import {foo, bar} from 'file.js';

Programmatic loader API:

System.import('my_module') .then(my_module => {
  // ...
.catch(error => {
  // ...

Imagine this module has 10 different dependencies, each of them stored in separate file.
If we don’t change anything on build time, then the browser will request the file, which contains the main module, and then it will make many additional requests in order to download all dependencies.
Since we have HTTP/2 now the browser won’t open multiple connections. However, in order to process the main module, browser still has to wait for all dependencies to be downloaded over the network. This means – download one file, parse it, then oh – it requires another module, download again the required file and so on, until all dependencies are being resolved.

One possible fix of the above issue could be to change this on build time. This means, you may concatenate all modules in one file and then overwrite the originally specified import statements to look for modules in that joined file. However, this has drawbacks and they are the same as if we were joining files for the HTTP 1.1 era.

Another fix, which may be considered is to leverage HTTP/2 push promises. In the example above this means you may try to push the dependencies when the main module is being requested. If the browser already has them then it may abort (reset) the stream by sending RST_STREAM frame.

Patrick Meenan however pointed me to a very interesting issue – on practice browser may not be able to abort the stream quickly enough. By the time the pushed resources hit the client and are validated against the cache, the entire resource will already be in buffers (on the network and in the server) and the whole file will be downloaded anyway. It will work if we can be sure that the resources aren’t in the browser cache, otherwise we will end up sending them anyway. This is an interesting point for further research.

HTTP/2 implementations

You may start playing with HTTP/2 today. There are many server implementations – grab one and start experimenting.

The main browser vendors support HTTP/2 already:

  • Internet Explorer supports HTTP/2 from IE 11 on Windows 10 Technical Preview,
  • Firefox has enabled HTTP/2 by default in version 34 and
  • current version of Chrome supports HTTP/2, but it is not enabled by default. It may be enabled by adding a command line flag --enable-spdy4 when Chrome is being launched or via chrome://flags.

Currently only HTTP/2 over TLS is implemented in all browsers.

Other interesting protocols to keep an eye on

QUIC is a another protocol, started by Google as natural extension of the research on protocols like SPDY and HTTP/2. Here I won’t give many details, it is enough to mention that it has all features of SPDY and HTTP/2, but it is built on top of UDP. The goal is to avoid head-of-line blocking like in SPDY or HTTP/2 and to establish a connection much faster than TCP+TLS is capable to do.

For more information about HTTP/2 and QUIC, please watch my JSConfEU talk.

(Via Calendar.perfplanet.com)