How OpenCL Could Open the Gates for FPGAs

In this special guest feature from Scientific Computing World, Robert Roe explains how OpenCL may make FPGAs an attractive option.

Over the past few years, high-performance computing (HPC) has become used to heterogeneous hardware, principally mixing GPUs and CPUs, but now, with both major FPGA manufacturers in conformance with the OpenCL standard, the door is effectively open for the wider use of FPGAs in high-performance computing.In January 2015, FPGAs took a step closer to the mainstream of high-performance computing with the announcement that Xilinx’s development environment for systems and software engineers, SDAccel, had been certified as conforming to the OpenCL standard for parallel programming of heterogeneous systems.

The changing landscape of HPC, with the move towards data-centric computing, could favour FPGAs with very high I/O throughput. However, it remains to be seen if FPGAs will be used as an accelerator or if supercomputers might be built using FPGA as the main processor technology.

One of the attractions of FPGAs is that they consume very little power but, as with GPUs initially, the barrier to adoption has been the difficulty of programming them. Manufacturers and vendors are now releasing compilers that will optimise code written in C and C++ to make use of the flexible nature of FPGA architecture.

Easier to program

Mike Strickland, director of the computer and storage business unit at Altera said: “The problem was that we did not have the ease of use, we did not have a software-friendly interface back in 2008. The huge enabler here has been OpenCL.”

Larry Getman, VP of strategic marketing and planning at Xilinx said: ‘When FPGAs first started they could do very basic things such as Boolean algebra and it was really used for glue logic. Over the years, FPGAs have really advanced and evolved with more hardened structures which are much more specialised.’

Getman continued: ‘Over the years FPGAs have gone from being glue logic to harder things like radio head systems, that do a lot of DSP processing; very high-performance vision applications; wireless radio; medical equipment; and radar systems. So they are used in high-performance computing, but for applications that use very specialised algorithms.’

Getman concluded: ‘The reason people use FPGAs for these applications is simple, they offer a much higher level of performance per Watt than trying to run the same application in pure software code.’

FPGAs are programmable semiconductor devices that are based around a matrix of configurable logic blocks (CLBs) connected through programmable interconnects. This is where the FPGA gets the term ‘field programmable’ as an FPGA can be programmed and optimised for a specific application. Efficient programming can take advantage of the inherent parallelism of the FPGA architecture delivering a higher level of performance than accelerators that have a less flexible architecture.

Millions of threads running at the same time

Devadas Varma, senior director of software Research and Development at Xilinx said: ‘A CPU, if it is single core CPU, executes one instruction at a time and if you have four cores, eight cores, that are multithreaded then you can do eight or sixteen sets of instructions, for example. If you compare this to an FPGA, which is a blank set of millions of components that you decide to interconnect, theoretically speaking you could have thousands or even millions of threads running at the same time.’

Reuven Weintraub, founder and chief technology officer at Gidel, highlighted the differences between FPGAs and the processors used in CPUs today. He said: ‘They are the same and they are different. They are the same from the perspective that both of them are programmable. The difference is coming from the fact that in the FPGA all the instructions would run in parallel. Actually the FPGA is not a processor; it is compiled to be a dedicated set of hardware components according to the requirements of the algorithm – that is what gives it the efficiency, power savings and so on.’

Traditionally this power efficiency, scalability, and flexible architecture came at the price of more complex programming: code needed to address the hardware and the flow of data across the various components, in addition to providing the basic instruction set to be computed in the logic blocks. However, major FPGA manufacturers Altera and Xilinx have both been working on their own OpenCL based solutions which have the potential to make FPGA acceleration a real possibility for more general HPC workloads.

Development toolkits

Xilinx has recently released SDAccel, a development environment that includes software tools including its own compiler, tools for code development, profiling, and debugging, and provides a GPU-like work environment. Getman said: ‘Our goal is to make an FPGA as easy to program as a GPU. SDAccel, which is OpenCL based, does allow people to program in OpenCL and C or C++ and they can now target the FPGA at a very high level.’

In addition, SDAccel provides functionality to swap multiple kernels in and out of the FPGA without disrupting the interface between the server CPU and the FPGA. This could be a key enabler of FPGAs in real-world data centres where turning off some of your resources while you re-optimise them for the next application is not an economically viable strategy at present.

Altera has been working closely with the Khronos group, which oversees a number of open computing standards including OpenCL, OpenGL, and WebGL. Altera released a development toolkit, Altera’s SDK for OpenCL, in May 2013. Strickland said: ‘In May 2013 we achieved a very important conformance test with the standards body – the Khronos group – that manages OpenCL. We had to pass 8,000 tests and that really strengthened the credibility of what we are doing with the FPGA.’

Strickland continued: ‘In the past, there were a lot of FPGA compiler tools that took care of the logic but not the data management. They could take lines of C and automatically generate lines of RTL but they did not take care of how that data would come from the CPU, the optimisation of external memory bandwidth off the FPGA, and that is a large amount of the work.’

Traditionally optimising algorithms to utilise fully the parallel architectures of FPGA technology involved significant experience using HDLs (hardware description languages) because they allowed programmers to write code that would address the FPGA at register-transfer level (RTL).

RTL enables programmers to describe the flow of data between hardware registers, and the logical operations performed on that data. This is typically what creates the difference in performance between more general processors and FPGAs, which can be optimised much more efficiently for a specific algorithm.

The difficulty is that that kind of coding requires expertise and can be very time consuming. Hand-coded RTL may go through several iterations as programmers test the most efficient ways to parallelise the instruction set to take advantage of the programmable hardware on the FPGA.

Strickland said: “With OpenCL or the OpenCL compiler, you still write something that is like C code that targets the FPGA. The big difference I would say is the instruction set. The big innovation has been the back end of our complier which can now take that C code and efficiently use the FPGA.”

Strickland noted that Altera’s compiler ‘does more than 200 optimisations when you write some C code. It is doing things like seeing the order in which you access memory so that it can group memory addresses together, improving the efficiency of that memory interface.’

Converting code from different languages into an RTL description has been possible for some time, but these developments in OpenCL make it much easier for programmers without extensive knowledge of HDLs, such as VHDL and Verilog, to make use of FPGAs.

However OpenCL is not the final piece of the puzzle for FPGA programming. Strickland said: ‘Over time you may want to have other high-level interfaces. There is a standard called SPIR (Standard Portable Intermediate Representation). The idea is that this allows you to kind of split up your compiler between the front end and the back end, enabling people to use different high-level language interfaces on the front end.’

Strickland continued: ‘In universities now there is research into domain-specific languages, so people are trying to accomplish a certain class of algorithms may benefit from having a higher level interface than even C. The idea behind exposing this intermediate compiler interface is you can now start working with the ecosystem to have front ends with higher-level interfaces.’

Over the past few years, there have been two ideas behind the best way to program FPGAs: high-level synthesis (HLS) or OpenCL. As OpenCL has matured, Xilinx decided to adopt the standard but to keep the work it had done developing HLS technology and integrate that into the development environment conforming to the OpenCL standard.

Getman said: “The main problem is that C is very much designed to go cycle to cycle, step by step. Unfortunately hardware doesn’t. Hardware has a lot of things running at the same time.” This aspect was what made HLS attractive as a compiler that can take OpenCL, C or C++ and architecturally optimise it for the FPGA hardware.

Xilinx acquired AutoESL and its HLS tool AutoPilot in 2011 and began integrating it into its own development tools for FPGAs. Getman said: ‘That was really the big switching point. For many years, people had been promising really great results with HLS but in reality the results were a lot bigger and a lot slower than what could have been done by hand.’

Getman continued: ‘We have integrated this technology into our tools and added a lot to it. This is really one of the big differentiators from our competition, even though we both have OpenCL support. This technology allows our users the opportunity to create their own libraries in real-time using C, C++ or OpenCL, rather than have to wait for the vendor to create specific libraries or specific algorithms for them.

Varma said: “The silver bullet in HLS is the ability to take a sequential description that has been written in C and then find this parallelism, the concurrencies, without the user having to think. That was a necessary technology before we could do anything. It has been adopted by thousands of users already as a standalone technology, but what we do is embed that technology inside OpenCL compilers so that now it can be utilised in full software mode and it is fully compatible with OpenCL.”

Getman said: “We consciously made a switch over the last few years to expand our customer base by both continuing technology development for our traditional users as well as expand our tool flow to cater to software coders.”

A key facet of this technology is that Xilinx is letting programmers take the work they have done in C and port it over to OpenCL using the technology from HLS that is now integrated into its compilers. Varma said: ‘One thing that changes when you go from software to hardware programming is that C programmers, OpenCL programmers, are used to dealing with a lot of libraries. They do not have to write matrix multiplications or filters or those kinds of things, because they are always available as library elements. Now hardware languages often have libraries, but they are very specific implementations that you cannot just change for your use.’

Varma concluded: “By writing in C, our HLS technology can re-compile that very efficiently and immediately. This gives you a tremendous capability.”

Coprocessor or something bigger?

FPGA manufacturers like Altera and Xilinx have been focusing their attention on using FPGAs in HPC as coprocessors or accelerators that would be used in much the same way as GPUs.

Getman said: “The biggest use model is really processor plus FPGA. The reason for that is there are still things that you want to run on a processor. You really want a processor to do what it is good at. Typically an FPGA will be used through something like a PCIE slot and it will be used as an acceleration engine for the things that are really difficult for the processor.”

This view was shared by Devadas Varma who highlighted some of the functionality in an earlier release of OpenCL that increased the potential for CPU/GPU/FPGA synergy.

Varma said: ‘The tool we have developed supports OpenCL 1.2 and importantly it can co-exist with CPUs and GPUs. In fact in our upcoming release we will support partitioning workloads into GPUs, we already support this feature regarding CPUs. That is definitely where we are heading.’

However this was not a view shared by Reuven Weintraub, at Gidel, who felt that to regard an FPGA simply as a coprocessor was to miss much of the point and many of the advantages that FPGAs could offer to computing. Weintraub said: “For me a coprocessor is like the H87 was, you make certain lines of code in the processor and then you say “there’s a line of code for you” and it returns and this goes back and forth. The big advantage of running with the FPGA is that the FPGA can have a lot of pipelining inside of it, solve a lot of things and have a lot of memory.”

He explained that an FPGA contains a ‘huge array of registers that are immediately available’ by taking advantage of the on-board memory and high-throughput that FPGAs can handle, meaning that ‘you do not necessarily have to use the cache because the data is being moved in and out in the correct order.’

Weintraub concluded: “Therefore it is better to give a task to the FPGA rather than giving just a few up codes and the going back and forth. It is more task oriented. Computing is a balance between the processing, memory access, networking and storage, but everything has to be balanced. If you want to utilize a good FPGA then you need to give it a task that makes use of its internal memory so that it can move things from one job to another.”

Gidel has considerable experience in this field. Gidel provided the FPGAs for the Novo-G supercomputer, housed at the University of Florida, the largest re-configurable supercomputer available for research.

The university is a lead partner in the ‘Center for High-Performance Reconfigurable Computing’ (CHREC), a US national research centre funded by the National Science Foundation.

In development at the UF site since 2009, Novo-G features 192, 40nm FPGAs (Altera Stratix-IV E530) and 192, 65nm FPGAs (Stratix-III E260).

These 384 FPGAs are housed in 96 quad-FPGA boards (Gidel ProcStar-IV and ProcStar-III) and supported by quad-core Nehalem Xeon processors, GTX-480 GPUs, 20Gb/s non-blocking InfiniBand, GigE, and approximately 3TB of total RAM, most of it directly attached to the FPGAs. An upgrade is underway to add 32 top-end, 28nm FPGAs (Stratix-V GSD8) to the system.

According to the article ‘Novo-G: At the Forefront of Scalable Reconfigurable Supercomputing’ written by Alan George, Herman Lam, and Greg Stitt, three researchers from the university, Novo-G achieved speeds rivaling the largest conventional supercomputers in existence – yet at a fraction of their size, energy, and cost.

But although processing speed and energy efficiency were important, they concluded that the principal impact of a reconfigurable supercomputer like Novo-G was the freedom that its innovative design can give to scientists to conduct more types of analysis, and examine larger datasets.

The potential is there.

(via InsideHPC.com)

Hard disk reliability examined once more: HGST rules, Seagate is alarming

hard-disk-platter

A year ago we got some insight into hard disk reliability when cloud backup provider Backblaze published its findings for the tens of thousands of disks that it operated. Backblaze uses regular consumer-grade disks in its storage because of the cheaper cost and good-enough reliability, but it also discovered that some kinds of disks fared extremely poorly when used 24/7.

A year later the company has collected even more data and drawn out even more differences between the different disks it uses.

For a second year, the standout reliability leader was HGST. Now a wholly owned subsidiary of Western Digital, HGST inherited the technology and designs from Hitachi (which itself bought IBM’s hard disk division). Across a range of models from 2 to 4 terabytes, the HGST models showed low failure rates; at worse, 2.3 percent failing a year. This includes some of the oldest disks among Backblaze’s collection; 2TB Desktop 7K2000 models are on average 3.9 years old, but still have a failure rate of just 1.1 percent.

At the opposite end of the spectrum are Seagate disks. Last year, the two 1.5TB Seagate models used by Backblaze had failure rates of 25.4 percent (for the Barracuda 7200.11) and 9.9 percent (for the Barracuda LP). Those units fared a little better this time around, with failure rates of 23.8 and 9.6 percent, even though they were the oldest disks in the test (average ages of 4.7 and 4.9 years, respectively). However, their poor performance was eclipsed by the 3TB Barracuda 7200.14 units, which had a whopping 43.1 percent failure rate, in spite of an average age of just 2.2 years.

Backblaze’s storage is largely split between Seagate and HGST disks. HGST’s parent company, Western Digital, is almost absent, not because its disks are bad, but because they came out as consistently more expensive than those from Seagate and HGST.

blog-drive-failure-by-manufacturer

Newer Seagate disks also show more encouraging results. Although still young, at an average age of just 0.9 years, the 4TB HDD.15 models show a reasonably low 2.6 percent failure rate. Coupled with their low price—Backblaze says that they tend to undercut HGST’s disks—they’ve become the company’s preferred hard drive model.

As before, this doesn’t mean that anyone with a Seagate disk is at risk of an imminent hard disk failure (though you should always have backups!). Backblaze operates disks outside of the manufacturer’s specified parameters. Significantly, most consumer-grade disks aren’t intended to be heavily used 24/7; they’re meant to be operational for about 8 hours a day and replaced every 3 to 5 years. Most home usage environments are likely to be lower in vibration than Backblaze’s 45-disk storage pods, too. In more normal conditions, the Seagates are likely to fare much better.

(Via Arstechnica.com)

StackExchange’s Performance Dashboard

6148448748_ee8eedd346_mStackExchange created a very cool performance dashboard that looks to be updated from real system metrics. Wouldn’t it be fascinating if every site had a similar dashboard?

The dashboard contains information like there are 560 million page views per month, 260,000 sustained connections,  34 TB data transferred per month, 9 web servers with 48GB of RAM handling 185 req/s at 15% CPU usage. There are 4 SQL servers, 2 redis servers, 3 tag engine servers, 3 elasticsearch servers, and 2 HAProxy servers, along with stats on each.

There’s also an excellent discussion thread on reddit that goes into more interesting details, with questions being answered by folks from StackExchange.

StackExchange is still doing innovative work and is very much an example worth learning from. They’ve always danced to their own tune and it’s a catchy tune at that. More at StackOverflow Update: 560M Pageviews A Month, 25 Servers, And It’s All About Performance.

( via HighScalability.com )

StackOverflow Update: 560M Pageviews A Month, 25 Servers, And It’s All About Performance

16238755496_4a3014ebbb_mThe folks at Stack Overflow remain incredibly open about what they are doing and why. So it’s time for another update. What has Stack Overflow been up to?

The network of sites that make up StackExchange, which includes StackOverflow, is now ranked 54th for traffic in the world; they have 110 sites and are growing at a rate of 3 or 4 a month; 4 million users; 40 million answers; and 560 million pageviews a month.

This is with just 25 servers. For everything. That’s high availability, load balancing, caching, databases, searching, and utility functions. All with a relative handful of employees. Now that’s quality engineering.

This update is based on The architecture of StackOverflow (video) by Marco Cecconi and What it takes to run Stack Overflow (post) by Nick Craver. In addition, I’ve merged in comments from various sources. No doubt some of the details are out of date as I meant to write this article long ago, but it should still be representative.

Stack Overflow still uses Microsoft products. Microsoft infrastructure works and is cheap enough, so there’s no compelling reason to change. Yet SO is pragmatic. They use Linux where it makes sense. There’s no purity push to make everything Linux or keep everything Microsoft. That wouldn’t be efficient.

Stack Overflow still uses a scale-up strategy. No clouds in site. With their SQL Servers loaded with 384 GB of RAM and 2TB of SSD, AWS would cost a fortune. The cloud would also slow them down, making it harder to optimize and troubleshoot system issues. Plus, SO doesn’t need a horizontal scaling strategy. Large peak loads, where scaling out makes sense, hasn’t  been a problem because they’ve been quite successful at sizing their system correctly.

So it appears Jeff Atwood’s quote: “Hardware is Cheap, Programmers are Expensive”, still seems to be living lore at the company.

Marco Ceccon in his talk says when talking about architecture you need to answer this question first: what kind of problem is being solved?

First the easy part. What does StackExchange do? It takes topics, creates communities around them, and creates awesome question and answer sites.

The second part relates to scale. As we’ll see next StackExchange is growing quite fast and handles a lot of traffic. How does it do that? Let’s take a look and see….

Stats

  • StackExchange network has 110 sites growing at a rate of 3 or 4 a month.

  • 4 million users

  • 8 million questions

  • 40 million answers

  • As a network #54 site for traffic in the world

  • 100% year over year growth

  • 560 million pageviews a month

  • Peak is more like 2600-3000 requests/sec on most weekdays. Programming, being a profession, means weekdays are significantly busier than weekends.

  • 25 servers

  • 2 TB of SQL data all stored on SSDs

  • Each web server has 2x 320GB SSDs in a RAID 1.

  • Each ElasticSearch box has 300 GB also using SSDs.

  • Stack Overflow has a 40:60 read-write ratio.

  • DB servers average 10% CPU utilization

  • 11 web servers, using IIS

  • 2 load balancers, 1 active, using HAProxy

  • 4 active database nodes, using MS SQL

  • 3 application servers implementing the tag engine, anything searching by tag hits

  • 3 machines doing search with ElasticSearch

  • 2 machines for distributed cache and messaging using Redis

  • 2 Networks (each a Nexus 5596 + Fabric Extenders)

  • 2 Cisco 5525-X ASAs (think Firewall)

  • 2 Cisco 3945 Routers

  • 2 read-only SQL Servers for used mainly for the Stack Exchange API

  • VMs also perform functions like deployments, domain controllers, monitoring, ops database for sysadmin goodies, etc.

Platform

  • ElasticSearch

  • Redis

  • HAProxy

  • MS SQL

  • Opserver

  • TeamCity

  • Jil – Fast .NET JSON Serializer, built on Sigil

  • Dapper – a micro ORM.

UI

  • The UI has message inbox that is sent a message when you get a new badge, receive a message, significant event, etc. Done using WebSockets and is powered by redis.

  • Search box is powered by ElasticSearch using a REST interface.

  • With so many questions on SO it was impossible to just show the newest questions, they would change too fast, a question every second. Developed an algorithm to look at your pattern of behaviour and show you which questions you would have the most interest in. It’s uses complicated queries based on tags, which is why a specialized Tag Engine was developed.

  • Server side templating is used to generate pages.

Servers

  • The 25 servers are not doing much, that is the CPU load is low. It’s calculated SO could run on only 5 servers.

  • The database server is at 10%, except when it bursts while performing a backups.

  • How so low? The databases servers have 384GB of RAM and the web servers are at 10%-15% CPU usage.

  • Scale-up is still working. Other scale-out sites with a similar number of pageviews tend to run on 100, 200, up to 300 servers.

  • Simple system. Built on .Net. Have only 9 projects, others systems have 100s. Reason to have so few projects is is so compilation is lightning fast, which requires planning at the beginning. Compilation takes 10 seconds on a single computer.

  • 110K lines of code. A small number given what it does.

  • This minimalist approach comes with some problems. One problem is not many tests. Tests aren’t needed because there’s a great community. Meta.stackoverflow is a discussion site for the community and where bugs are reported. Meta.stackoverflow is also a beta site for new software. If users find any problems with it they report the bugs that they’ve found, sometimes with solution/patches.

  • Windows 2012 is used in New York but are upgrading to 2012 R2  (Oregon is already on it). For Linux systems it’s Centos 6.4.

  • Load is really almost all over 9 servers, because 10 and 11 are only for meta.stackexchange.com, meta.stackoverflow.com, and the development tier. Those servers also run around 10-20% CPU which means we have quite a bit of headroom available.

SSDs

  • Intel 330 as the default (web tier, etc.)

  • Intel 520 for mid tier writes like Elastic Search

  • Intel 710 & S3700 for the database tier. S3700 is simply the successor to the high endurance 710 series.

  • Exclusively RAID 1 or RAID 10 (10 being any arrays with 4+ drives). Failures have not been a problem, even with hundreds of intel 2.5″ SSDs in production, a single one hasn’t failed yet. One or more spare parts are kept for each model, but multiple drive failure hasn’t been a concern.

  • ElasticSearch performs much better on SSDs, given SO writes/re-indexes very frequently.

  • SSD changes the use of search. Lucene.net couldn’t handle SO’s concurrent workloads due to locking issues, so they moved to ElasticSearch. It turns out locks around the binary readers really aren’t necessary in an all SSD environment.

  • The only scale-up problems so far is SSD space on the SQL boxes due to the growth pattern of reliability vs. space in the non-consumer space, that isdrives that have capacitors for power loss and such.

High Availability

  • The main datacenter is in New York and the backup datacenter is in Oregon.

  • Redis has 2 slaves, SQL has 2 replicas, tag engine has 3 nodes, elastic has 3 nodes – any other service has high availability as well (and exists in both data centers).

  • Not everything is slaved between data centers (very temporary cache data that’s not needed to eat bandwidth by syncing, etc.) but the big items are, so there is still a shared cache in case of a hard down in the active data center. A start without a cache is possible, but it isn’t very graceful.

  • Nginx was used for SSL, but a transition has been made to using HAProxy to terminate SSL.

  • Total HTTP traffic sent is only about 77% of the total traffic sent. This is because replication is happening to the secondary data center in Oregon as well as other VPN traffic. The majority of this traffic is the data replication to SQL replicas and redis slaves in Oregon.

Databasing

  • MS SQL Server.

  • Stack Exchange has one database per-site, so Stack Overflow gets one, Super User gets one, Server Fault gets one, and so on. The schema for these is the same. This approach of having different database is effectively a form of partitioning and horizontal scaling.

  • In the primary data center (New York) there is usually 1 master and 1 read-only replica in each cluster. There’s also 1 read-only replica (async) in the DR data center (Oregon). When running in Oregon then the primary is there and both of the New York replicas are read-only and async.

  • There are a few wrinkles. There is one “network wide” database which has things like login credentials, and aggregated data (mostly exposed through stackexchange.com user profiles, or APIs).

  • Careers Stack Overflow, stackexchange.com, and Area 51 all have their own unique database schema.

  • All the schema changes are applied to all site databases at the same time. They need to be backwards compatible so, for example, if you need to rename a column – a worst case scenario – it’s a multiple steps process: add a new column, add code which works with both columns, back fill the new column, change code so it works with the new column only, remove the old column.

  • Partitioning is not required. Indexing takes care of everything and the data just is not large enough. If something warrants a filtered indexes, why not make it way more efficient? Indexing only on DeletionDate = Null and such is a common pattern, others are specific FK types from enums.

  • Votes are in 1 table per item, for example 1 table for post votes, 1 table for comment votes. Most pages we render real-time, caching only for anonymous users. Given that, there’s no cache to update, it’s just a re-query.

  • Scores are denormalized, so querying is often needed. It’s all IDs and dates, the post votes table just has 56,454,478 rows currently. Most queries are just a few milliseconds due to indexing.

  • The Tag Engine is entirely self-contained, which means not having to depend on an external service for very, very core functionality. It’s a huge in-memory struct array structure that is optimized for SO use cases and precomputed results for heavily hit combinations. It’s a simple windows service running on a few boxes working in a redundant team. CPU is about 2-5% almost always. Three boxes are not needed for load, just redundancy. If all those do fail at once, the local web servers will load the tag engine in memory and keep on going.

  • On Dapper’s lack of a compiler checking queries compared to traditional ORM. The compiler is checking against what you told it the database looks like. This can help with lots of things, but still has the fundamental disconnect problem you’ll get at runtime. A huge problem with the tradeoff is the generated SQL is nasty, and finding the original code it came from is often non-trivial. Lack of ability to hint queries, control parameterization, etc. is also a big issue when trying to optimize queries. For example. literal replacement was added to Dapper to help with query parameterization which allows the use of things like filtered indexes. Dapper also intercepts the SQL calls to dapper and add add exactly where it came from. It saves so much time tracking things down.

     

Coding

  • The process:

    • Most programmers work remotely. Programmers code in their own batcave.

    • Compilation is very fast.

    • Then the few test that they have are run.

    • Once compiled, code is moved to a development staging server.

    • New features are hidden via feature switches.

    • Runs on same hardware as the rest of the sites.

    • It’s then moved to Meta.stackoverflow for testing. 1000 users per day use the site, so its a good test.

    • If it passes it goes live on the network and is tested by the larger community.

  • Heavy usage of static classes and methods, for simplicity and better performance.

  • Code is simple because the complicated bits are packaged in a library and open sourced and maintained. The number of .Net projects stays low because community shared parts of the code are used.

  • Developers get two or three monitors. Screens are important, they help you be productive.

Caching

  • Cache all the things.

  • 5 levels of caches.

  • 1st: is the network level cache: caching in the browser, CDN, and proxies.

  • 2nd: given for free by the .Net framework and is called the HttpRuntime.Cache. An in-memory, per server cache.

  • 3rd: Redis. Distributed in-memory key-value store. Share cache elements across different servers that serve the same site. If StackOverflow has 9 servers then all servers will be able to find the same cached items.

  • 4th: SQL Server Cache. The entire database is cached in-memory. The entire thing.

  • 5th: SSD. Usually only hit when the SQL server cache is warming up.

  • For example, every help page is cached. Code to access a page is very terse:

    • Static methods and static classes re used. Really bad from an OOP perspective, but really fast and really friendly towards terse code. All code is directly addressed.

    • Caching is handled by a library layer of Redis and Dapper, a micro ORM.

  • To get around garbage collection problems, only one copy of a class used in templates are created and kept in a cache. Everything is measured, including GC operation, from statistics it is known that layers of indirection increase GC pressure to the point of noticeable slowness.

  • CDN hits vary, since the  query string hash is based on file content, it’s only re-fetched on a build. It’s typically 30-50 million hits a day for 300 to 600 GB of bandwidth.

  • A CDN is not used for CPU or I/O load, but to help users find answers faster.

Deploying

  • Want to deploy 5 times a day. Don’t build grand gigantic things and then put then live. Important because:

    • Can measure performance directly.

    • Forced to build the smallest thing that can possibly work.

  • TeamCity builds then copies to each web tier via a powershell script. The steps for each server are:

    • Tell HAProxy to take the server out of rotation via a POST

    • Delay to let IIS finish current requests (~5 sec)

    • Stop the website (via the same PSSession for all the following)

    • Robocopy files

    • Start the website

    • Re-enable in HAProxy via another POST

  • Almost everything is deployed via puppet or DSC, so upgrading usually consist of just nuking the RAID array and installing from a PXE boot. it’s very fast and you know it’s done right/repeatable.

Teaming

  • Teams:

    • SRE (System Reliability Engineering): – 5 people

    • Core Dev (Q&A site) : ~6-7 people

    • Core Dev Mobile: 6 people

    • Careers team that does development solely for the SO Careers product: 7 people

  • Devops and developer teams are really close-knit.

  • There’s a lot of movement between teams.

  • Most employees work remotely.

  • Offices are mostly sales, Denver and London exclusively so.

  • All else equal, it is slightly prefered to have people in NYC, because the in-person time is a plus for the casual interaction that happens in between “getting things done”. But the set up makes it possible to do real work and official team collaboration works almost entirely online.

  • They’ve learned that the in-person benefit is more than outweighed by how much you get from being able to hire the best talent that loves the product anywhere, not just the ones willing to live in the city you happen to be in.

  • The most common reason for someone going remote is starting a family. New York’s great, but spacious it is not.

  • Offices are in Manhattan and a lot of talent is there. The data center needs to not be a crazy distance away since it is always being improved. There’s also a slightly faster connection to many backbones in the NYC location – though we’re talking only a few milliseconds (if that) of difference there.

  • Making an awesome team: Love geeks. Early Microsoft, for example, was full of geeks and they conquered the world.

  • Hire from Stack Overflow community. They looks for a passion for coding, a passion for helping others, and a passion for communicating.

Budgeting

  • Budgets are pretty much project based. Money is only spent as infrastructure is added for new projects. The web servers that have such low utilization are the same ones purchased 3 years ago when the data center was built.

Testing

  • Move fast and break things. Push it live.

  • Major changes are tested by pushing them. Development has an equally powerful SQL server and it runs on the same web tier, so performance testing isn’t so bad.

  • Very few tests. Stack Overflow doesn’t use many unit tests because of their active community and heavy usage of static code.

  • Infrastructure changes. There’s 2 of everything, so there’s a backup with the old configuration whenever possible, with a quick failback mechanism. For example, keepalived does failback quickly between load balancers.

  • Redundant systems fail over pretty often just to do regular maintenance. SQL backups are tested by having a dedicated server just for restoring them, constantly (that’s a free license – do it). Plan to start full data center failovers every 2 months or so – the secondary data center is read-only at all other times.

  • Unit tests, integration tests and UI tests run on every push. All the tests must succeed before a production build run is even possible. So there’s some mixed messages going on about testing.

  • The things that obviously should have tests have tests. That means most of the things that touch money on the Careers product, and easily unit-testable features on the Core end (things with known inputs, e.g. flagging, our new top bar, etc), for most other things we just do a functionality test by hand and push it to our incubating site (formerly meta.stackoverflow, now meta.stackexchange).

Monitoring / Logging

  • Now considering using http://logstash.net/ for log management. Currently a dedicated service inserts the syslog UDP traffic into a SQL database. Web pages add headers for the timings on the way out which are captured with HAProxy and are included in the syslog traffic.

  • Opserver and Realog. are how many metrics are surfaced. Realog is a logging display system built by Kyle Brandt and Matt Jibson in Go

  • Logging is from the HAProxy load balancer via syslog instead of via IIS. This is a lot more versatile than IIS logs.

Clouding

  • Hardware is cheaper than developers and efficient code. You are only as fast as your slowest bottleneck and all the current cloud solutions have fundamental performance or capacity limits.
  • Could you build SO well if building for the cloud from day one? Mostl likely. Could you consistency render all your pages performing several up to date queries and cache fetches across that cloud network you don’t control and getting sub 50ms render times? That’s another matter. Unless you’re talking about substantially higher cost (at least 3-4x), the answer is no – it’s still more economical for SO to host in their own servers.

Performance As A Feature

  • StackOverflow puts a heavy emphasis on performance. The goal for the main page  is to load in less than 50ms, but can be as low as 28ms.

  • Programmers are fanatic about reducing page load times and improving the user experience.

  • Timings for every single request to the network are recorded. With these kind of metrics you can make decisions on where to improve your system.

  • The primary reason their servers run at such low utilization is efficient code. Web servers average between 5-15% CPU, 15.5 GB of RAM used and 20-40 Mb/s network traffic.  The SQL servers average around 5-10% CPU, 365 GB of RAM used, and 100-200 Mb/s of network traffic. This has three major benefits: general room to grow before and upgrade is necessary; headroom to stay online for when things go crazy (bad query, bad code, attacks, whatever it may be); and the ability to clock back on power if needed.

Lessons Learned

  • Why use Redis if you use MS products? gabeech: It’s not about OS evangelism. We run things on the platform they run best on. Period. C# runs best on a windows machine, we use IIS. Redis runs best on a *nix machine we use *nix.

  • Overkill as a strategy. Nick Craver on why their network is over provisioned: Is 20 Gb massive overkill? You bet your ass it is, the active SQL servers average around 100-200 Mb out of that 20 Gb pipe.  However, things like backups, rebuilds, etc. can completely saturate it due to how much memory and SSD storage is present, so it does serve a purpose.

  • SSDs Rock. The database nodes all use SSD and the average write time is 0 milliseconds.

  • Know your read/write workload.

  • Keeping things very efficient means new machines are not needed often. Only when a new project comes along that needs different hardware for some reason is new hardware added. Typically memory is added, but other than that efficient code and low utilization means it doesn’t need replacing. So typically talking about adding a) SSDs for more space, or b) new hardware for new projects.

  • Don’t be afraid to specialize. SO uses complicated queries based on tags, which is why a specialized Tag Engine was developed.

  • Do only what needs to be done. Tests weren’t necessary because an active community did the acceptance testing for them. Add projects only when required. Add a line of code only when necessary. You Aint Gone Need It really works.

  • Reinvention is OK. Typical advice is don’t reinvent the wheel, you’ll just make it worse, by making it square, for example. At SO they don’t worry about making a “Square Wheel”. If developers can write something more lightweight than an already developed alternative, then go for it.

  • Go down to the bare metal. Go into the IL (assembly language of .Net). Some coding is in IL, not C#. Look at SQL query plans. Take memory dumps of the web servers to see what is actually going on. Discovered, for example, a split call generated 2GB of garbage.

  • No bureaucracy. There’s always some tools your team needs. For example, an editor, the most recent version of Visual Studio, etc. Just make it happen without a lot of process getting in the way.

  • Garbage collection driven programming. SO goes to great lengths to reduce garbage collection costs, skipping practices like TDD, avoiding layers of abstraction, and using static methods. While extreme, the result is highly performing code. When you’re doing hundreds of millions of objects in a short window, you can actually measure pauses in the app domain while GC runs. These have a pretty decent impact on request performance.

  • The cost of inefficient code can be higher than you think.  Efficient code stretches hardware further, reduces power usage, makes code easier for programmers to understand.

( via HighScalability.com )

The Stunning Scale Of AWS And What It Means For The Future Of The Cloud

16238755496_4a3014ebbb_mJames Hamilton, VP and Distinguished Engineer at Amazon, and long time blogger of interesting stuff, gave an enthusiastic talk at AWS re:Invent 2014 on AWS Innovation at Scale. He’s clearly proud of the work they are doing and it shows.

James shared a few eye popping stats about AWS:

  • 1 million active customers
  • All 14 other cloud providers combined have 1/5th the aggregate capacity of AWS (estimate by Gartner)
  • 449 new services and major features released in 2014
  • Every day, AWS adds enough new server capacity to support all of Amazon’s global infrastructure when it was a $7B annual revenue enterprise (in 2004).
  • S3 has 132% year-over-year growth in data transfer
  • 102Tbps network capacity into a datacenter.

The major theme of the talk is the cloud is a different world. It’s a special environment that allows AWS to do great things at scale, things you can’t do, which is why the transition from on premise x86 servers to the public cloud is happening at a blistering pace. With so many scale driven benefits to the public cloud, it’s a transition that can’t be stopped. The cloud will keep getting more reliable, more functional, and cheaper at a rate that you can’t begin to match with your limited resources, generalist gear, bloated software stacks, slow supply chains, and outdated innovation paradigms.

That’s the PR message at least. But one thing you can say about Amazon is they are living it. They are making it real. So a healthy doubt is healthy, but extrapolating out the lines of fate would also be wise.

One of the fickle finger of fate advantages AWS has is resources. At one million customers they have the scale to keep the engine of expansion and improvement going. Profits aren’t being taken out, money is being reinvested. This is perhaps the most important advantage of scale.

But money without smarts is simply waste. Amazon wants you to know they have the smarts. We’ve heard how Google and Facebook build their own gear, Amazon does too. They build their own networking gear, networking software, racks, and they work with Intel to get faster processor versions of processors than are available on the market. The key is they know everything and control everything about their environment, so they can build simpler gear that does exactly what they want, which turns out to be cheaper and more reliable in the end.

Complete control allows quality metrics to be built into everything. Metrics drive a constant quality increase in all parts of the system, which is why against all odds AWS is getting more reliable as the pace of innovation quickens. Great pools of actionable data turned into knowledge is another huge advantage of scale.

Another thing AWS can do that you can’t is the Availability Zone architecture itself. Each AZ is its own datacenter and AZs within a region are located very close together. This reduces messaging latencies, which means state can be synchronously replicated between AZs, which greatly improves availability compared to the typical approach where redundant datacenters are very far apart.

It’s a talk rich with information and…well, spunk. The real meta-theme of the talk is how Amazon consciously uses scale to their competitive advantage. For Amazon scale isn’t just an expense to be dealt with, scale is a resource to exploit, if you know how.

Here’s my gloss of James Hamilton’s incredible talk…

Everything In The Talk Has A Foundation In Scale

  • Every day, AWS adds enough new server capacity to support all of Amazon’s global infrastructure when it was a $7B annual revenue enterprise (in 2004).

  • 365 days a year component manufacturers have to get gear to server and storage manufacturers, the server and storage manufacturers have to produce the gear and push it into the logistics channel, it has to get from the logistics channel over to the correct datacenter, it has to arrive at the loading dock, people have be there to wheel the racks to the right place in the DC, there has to be power, cooling, networking, the app stack has to be loaded up, it has to be tested, it has to be released to customers.

  • S3 usage: 132% year-over-year growth in data transfer; EC2 usage: 99% YoY usage growth; AWS overall business: over 1 million active customers.

  • All 14 other cloud providers combined have 1/5th the aggregate capacity of AWS (estimate by Gartner)

  • With over a million customers it means you are in a rich ecosystem. You have your pick of software vendors, if you have a problem someone has likely had it before, it’s easier to get your job done fast.

  • Such high growth means Amazon has the resources to keep reinvesting and innovating by increasing breadth and depth of services they offer.

  • Big Transitions generally occur when the economics are far superior, like mainframes to UNIX servers and then UNIX servers to x86 servers. These transitions usually take 10+ years. What’s different about the x86 on premise transition to the cloud is the speed at which it is happening.  The speed of the cloud transition is a function of great economic value along with the low friction for adoption. You don’t need software, you don’t hardware, you can just do it.

There Are Big Cost Problems In Networking

  • Networking is a red alert situation across the industry. It’s the perfect storm.

  • Problem #1: The cost of networking is escalating relative the cost of all other equipment. It’s anti-Moore’s law. All other gear is going down in cost, networking is getting relatively more expensive over time. Relative monthly costs: servers: 57%; networking equipment: 8%; power distribution and cooling: 18%; power: 13%; other: 4%.

  • Problem #2: At the same time networking is getting more expensive, the ratio of networking to compute is going up. That’s partly because Moore’s law is working (still) with servers and compute density is going up. Partly it’s because as the cost of compute falls the amount of advanced analytics performed goes up and analytics are network intensive. Solving big problems using a large number of servers requires a lot of networking. Network traffic has moved east-west rather than the traditional north-south direction.

  • Amazon’s solution 5 years ago was data driven and radical: they built to their own networking designs. Special routers were built. A team was hired to build the protocol stack all the way to the top. And they deployed all this themselves in their network. All services worldwide run on this gear.

    • This strategy turned out to be a lot cheaper. Just the support contract for networking gear was running 10s of millions of dollars.

    • Availability went up. The source of the improvement was simplicity. The problem AWS was trying to solve was simpler than the problem enterprise gear tries to solve. Enterprise gear must adhere to a lot of complicated specs that go unused and only make the system more fragile. By implementing just the functionality that was required meant a much simpler system which lead to higher availability. Any way to win is a good way to win.

    • A cornucopia of metrics. They measure everything. The rule is if a customer has a bad experience using their system their metrics must show it. This means metrics are getting better all the time because customer problems drive better metrics. Once you have metrics that accurately reflect customer experience then goals can be set on making the system better. Every week improvements are made to make things better. If the code didn’t start off better, it gets better over time.

    • Testability. Their gear was better because they tested it better. Enterprise gear is hard to test at scale. They created a $40 million test bed of 8000 servers (3 megawatts). But since this was the cloud they effectively rented the servers for a few months, so it was relatively cheap.

Networking Explained Layer By Layer, From The Very Top To The Network Interface Card

AWS Worldwide Network Backbone

  • 11 AWS regions worldwide. Choose which ones to use by nearness to customers or required jurisdictional boundaries.

  • Private fiber links interconnect most of the major regions. This avoids all the capacity planning problems (Amazon can do better capacity planning), peering issues, and buffering problems that occur on public links. So it’s faster to run their own network, it’s more reliable, cheaper, and lower latency.

Example AWS Region (US East ((Northern Virginia))

  • All regions have at least two availability zones. US East has five AZs.

  • Redundant paths run to transit centers.

  • Each region has redundant transit centers. A transit center connects private links to other AWS regions, private links to AWS Direct Connect customers, and to the internet through peering and paid transit.

  • If one AZ fails all the other AZs keep working.

  • Metro-area DWDM links between AZs

  • 82,864 fiber strands in a regions

  • AZs are less than 2ms apart and usually less than 1ms apart. From a latency perspective they are fairly close, within a few kilometers. Far enough apart for safety, close enough for good latency.

  • 25Tbps peak traffic between AZs

  • AWS offers AZs because:

    • With a single hardened datacenter the best reliability you’ll get is around 99.9% over a mix of applications over a large period of time. High reliability requires running in two datacenters. Traditionally datacenter diversity is from two datacenters that are very far apart because it’s not cost effective to keep datacenters close together. This means longer latencies. LA to NEW is 74ms round trip. Committing to an SSD is 1 to 2ms. You can’t wait 70+ milliseconds for a transaction to commit. Which means applications commit locally and then replicate to the second datacenter. This design in a failure case loses data during the failover. While a true failure is rare, like a building burning down, transient failures are more common, like a load balancer problem for example. So would you failover your connection was down for 3 minutes? No, because data would be lost and it would take a long time to recover that data from other sources. So you lose availability for common events.

    • AZs are milliseconds apart so you can commit to both at the same time. That means if you failover a customer won’t be able to tell because the data replication was synchronous. It’s invisible. It’s hard to write code for this model so you won’t do it for everything. And for some apps a concern for multi-AZ failures might also prevent you from using multiple AZs, but for the rest of apps this is a very powerful model. It’s more costly, but it gives AWS certain advantages.

Example AWS Availability Zone

  • An Availability Zone is always a datacenter in a completely independent building.

  • Amazon has 28+ datacenters. The plus means there are more datacenters in an AZ as a way of extending capacity for an AZ. More datacenters are added within an AZ to extend the capacity of an AZ. Otherwise you would be forced to split your app across AZs, even if you didn’t want to.

  • Some AZs have size fairly substantial datacenters.

  • DCs in an AZ are less than ¼ms apart.

Example Datacenter

  • AWS datacenters are purposely not gigantic. A single datacenter is 25 – 30 megawatts, with between 50,000 – 80,000 servers

  • The return on datacenter largeness diminishes. The advantage of datacenter scale as you build bigger and bigger goes down. Early advantages are huge. Later advantages are small. Going from 2000 to 2500 racks is a little better. A tiny datacenter is too expensive. A really large datacenter is only marginally more expensive per rack than a medium datacenter.

  • Risk increases with larger datacenters. The blast radius if something goes wrong and the datacenter is destroyed, the loss is too big.

  • 102Tbps network capacity into a datacenter.

Example Rack, Server & NIC

  • The only thing that matters for latency is the software latency at either end of a connection. Latency between two servers when sending a message:

    • Your app -> guest OS -> hypervisor -> NIC : the latency is milliseconds

    • through the NIC : the latency is microseconds

    • over the fiber : the latency is nanoseconds

  • SR-IOV (Single Root I/O Virtualization) allows a NIC to provide virtualize in hardware network cards. Each guest gets its own network card. The benefit is > 2x average latency reduction and > 10x latency jitter improvement. This means outliers a down to a 1/10th of what they were. SR-IOV is being deployed now on newer instances types and will eventually be everywhere. The hard part wasn’t adding SR-IOV, it was adding isolation, metering, DDoS protection, and the capacity limits that make SR-IOV useful in a cloud environment.

AWS Custom Server & Storage Designs

  • Cost of a negative situation is not high so expensive unneeded protection can be removed. Servers are designed for what they do, not a general population of users. Amazon knows exactly in what environment the server will run in, they’ll know exactly when something goes wrong, so the servers can be designed with less engineering headroom. The cost of server failure is not that big for them. They are already on site and are very good at replacing harddisks, etc. So a lot of the carefulness in enterprise equipment is not necessary.

  • Processors can be pushed harder. They know the cooling requirements, they influence the mechanical design, they just design good servers, so they can get more performance out of a server. Though a partnership with Intel Amazon has processors that run faster than can be bought on the open market.

  • An example is the design for their own storage rack. It weighs over a ton, 19” wide, and holds 864 disk drives. For some workloads this a wonderful game changing design that helps them get better prices in some areas.

Power Infrastructure

  • Amazon has designed and built their own power substations. It only saves a little money, but they can build them much faster. Utility companies are not used to dealing with the rate AWS is growing at, so they had to build their own.

  • 3 100% carbon neutral regions: US West (Oregon), AWS GovCloud (US), EU (Frankfurt)

Rapid Pace Of Innovation

  • 449 new services and major features released in 2014. 24 in 2008, 48 in 2009, 61 in 2010, 82 in 2011, 159 in 2012, 280 in 2013.

  • AWS is getting more reliable as the pace of innovation quickens. Their goal is to make available to customers the same tools that use to achieve this rate of innovation and high quality.

( via HighScalability.com )

What is a Monolith?

There is currently a strong trend for microservice based architectures and frequent discussions comparing them to monoliths. There is much advice about breaking-up monoliths into microservices and also some amusing fights between proponents of the two paradigms – see the great Microservices vs Monolithic Melee. The term ‘Monolith’ is increasingly being used as a generic insult in the same way that ‘Legacy’ is!

However, I believe that there is a great deal of misunderstanding about exactly what a ‘Monolith’ is and those discussing it are often talking about completely different things.

A monolith can be considered an architectural style or a software development pattern (or anti-pattern if you view it negatively). Styles and patterns usually fit into different Viewtypes (a viewtype is a set, or category, of views that can be easily reconciled with each other [Clements et al., 2010]) and some basic viewtypes we can discuss are:

  • Module – The code units and their relation to each other at compile time.
  • Allocation – The mapping of the software onto its environment.
  • Runtime – The static structure of the software elements and how they interact at runtime.

A monolith could refer to any of the basic viewtypes above.

Module Monolith

If you have a module monolith then all of the code for a system is in a single codebase that is compiled together and produces a single artifact. The code may still be well structured (classes and packages that are coherent and decoupled at a source level rather than a big-ball-of-mud) but it is not split into separate modules for compilation. Conversely a non-monolithic module design may have code split into multiple modules or libraries that can be compiled separately, stored in repositories and referenced when required. There are advantages and disadvantages to both but this tells you very little about how the code is used – it is primarily done for development management.

module

 

 

Allocation Monolith

For an allocation monolith, all of the code is shipped/deployed at the same time. In other words once the compiled code is ‘ready for release’ then a single version is shipped to all nodes. All running components have the same version of the software running at any point in time. This is independent of whether the module structure is a monolith. You may have compiled the entire codebase at once before deployment OR you may have created a set of deployment artifacts from multiple sources and versions. Either way this version for the system is deployed everywhere at once (often by stopping the entire system, rolling out the software and then restarting).

A non-monolithic allocation would involve deploying different versions to individual nodes at different times. This is again independent of the module structure as different versions of a module monolith could be deployed individually.

allocation

 

Runtime Monolith

A runtime monolith will have a single application or process performing the work for the system (although the system may have multiple, external dependencies). Many systems have traditionally been written like this (especially line-of-business systems such as Payroll, Accounts Payable, CMS etc).

Whether the runtime is a monolith is independent of whether the system code is a module monolith or not. A runtime monolith often implies an allocation monolith if there is only one main node/component to be deployed (although this is not the case if a new version of software is rolled out across regions, with separate users, over a period of time).

runtime

Note that my examples above are slightly forced for the viewtypes and it won’t be as hard-and-fast in the real world.

Conclusion

Be very carefully when arguing about ‘Microservices vs Monoliths’. A direct comparison is only possible when discussing the Runtime viewtype and properties. You should also not assume that moving away from a Module or Allocation monolith will magically enable a Microservice architecture (although it will probably help). If you are moving to a Microservice architecture then I’d advise you to consider all these viewtypes and align your boundaries across them i.e. don’t just code, build and distribute a monolith that exposes subsets of itself on different nodes.

(Via Codingthearchitecture.com)

 

Async Fragments: Rediscovering Progressive HTML Rendering with Marko

At eBay, we take site speed very seriously and are always looking for ways to allow developers to create faster-loading web apps. This involves fully understanding and controlling how web pages are delivered to web browsers. Progressive HTML rendering is a relatively old technique that can be used to improve the performance of websites, but it has been lost in a whole new class of web applications. The idea is simple: give the web browser a head start in downloading and rendering the page by flushing out early and multiple times. Browsers have always had the helpful feature of parsing and responding to the HTML as it is being streamed down from the server (even before the response is ended). This feature allows the HTML and external resources to be downloaded earlier, and for parts of the page to be rendered earlier. As a result, both the actual load time and the perceived load time improve.

In this blog post, we will take an in-depth look at a technique we call “Async Fragments” that takes advantage of progressive HTML rendering to improve site speed in ways that do not drastically complicate how web applications are built. For concrete examples we will be using Node.js,Express.js and the Marko templating engine (a JavaScript templating engine that supports streaming, flushing, and asynchronous rendering). Even if you are not using these technologies, this post can give you insight into how your stack of choice could be further optimized.

To see the techniques discussed in this post in action, please take a look at the accompanying sample application.

Background

Progressive HTML rendering is discussed in the post The Lost Art of Progressive HTML Rendering by Jeff Atwood, which was published back in 2005. In addition, the “Flush the Buffer Early” rule is described by the Yahoo! Performance team in their Best Practices for Speeding Up Your Web Site guide. Stoyan Stefanov provides an in-depth look at progressive HTML rendering in his Progressive rendering via multiple flushes post. Facebook discussed how they use a technique they call “BigPipe” to improve page load times and perceived performance by dividing up a page into “pagelets.” Those articles and techniques inspired many of the ideas discussed in this post.

In the Node.js world, its most popular web framework, Express.js, unfortunately recommends a view rendering engine that does not allow streaming and thus prevents progressive HTML rendering. In a recent post, Bypassing Express View Rendering for Speed and Modularity, I described how streaming can be achieved with Express.js; this post is largely a follow-up to discuss how progressive HTML rendering can be achieved with Node.js (with or without Express.js).

Without progressive HTML rendering

A page that does not utilize progressive HTML rendering will have a slower load time because the bytes will not be flushed out until the complete HTML response is built. In addition, after the client finally receives the complete HTML it will then see that it needs to download additional external resources (such as CSS, JavaScript, fonts, and images), and downloading these external resources will require additional round trips. In addition, pages that do not utilize progressive HTML rendering will also have a slower perceived load time, since the screen will not update until the complete HTML is downloaded and the CSS and fonts referenced in the <head> section are downloaded. Without progressive HTML rendering, a server/client waterfall chart might be similar to the following:

ps1

The corresponding page controller might look something like this:

function controller(req, res) {
    async.parallel([
            function loadSearchResults(callback) {
                ...
            },
            function loadFilters(callback) {
                ...
            },
            function loadAds(callback) {
                ...
            }
        ],
        function() {
            ...
            var viewModel = { ... };
            res.render('search', viewModel);
        })
}

As you can see in the above code, the page HTML is not rendered until all of the data is asynchronously loaded.

Because the HTML is not flushed until all back-end services are completed, the user will be staring at a blank screen for a large portion of the time. This will result in a sub-par user experience (especially with a poor network connection or with slow back-end services). We can do much better if we flush part of the HTML earlier.

Flushing the head early

A simple trick to improve the responsiveness of a website is to flush the head section immediately. The head section will typically include the links to the external CSS resources (i.e. the <link>tags), as well as the page header and navigation. With this approach the external CSS will be downloaded sooner and the initial page will be painted much sooner as shown in the following waterfall chart:

ps2

As you can see in the chart above, flushing the head early reduces the time to render the initial page. This technique improves the responsiveness of the page, but it does not significantly reduce the total time it takes to make the page fully functional. With this approach, the server is still waiting for all back-end services to complete before flushing the final HTML. In addition, downloading of external JavaScript resources will be delayed since <script> tags are placed at the end of the page (assuming you are following best practices) and don’t get sent out until the second and final flush.

Multiple flushes

Instead of flushing only the head early, it is often beneficial to flush multiple times before ending the response. Typically, a page can be divided into multiple fragments where some of the fragments may depend on data asynchronously loaded from various back-end services while others may not depend on any asynchronously loaded data. The fragments that depend on asynchronously loaded data should be rendered asynchronously and flushed as soon as possible.

For now, we will assume that these fragments need to be flushed in the proper HTML order (versus the order that the data asynchronously loads), but we will also show how out-of-order flushing can be used to further improve both page load times and perceived performance. When using “in-order” flushing, fragments that complete out of order will need to be buffered until they are ready to be flushed in the proper order.

In-order flushing of async fragments

As an example, let’s assume we have divided a complex page into the following fragments:

ps3

Each fragment is assigned a number based on the order that it appears in the HTML document. In code, our output HTML for the page might look like the following:

<html>
<head>
    <title>Clothing Store</title>
    <!-- 1a) Head <link> tags -->
</head>
<body>
    <header>
        <!-- 1b) Header -->
    </header>
    <div class="body">
        <main>
            <!-- 2) Search Results -->
        </main>
        <section class="filters">
            <!-- 3) Search filters -->
        </section>
        <section class="ads">
            <!-- 4) Ads -->
        </section>
    </div>
    <footer>
        <!-- 5a) Footer -->
    </footer>
    <!-- 5b) Body <script> tags -->
</body>
</html>

The Marko templating engine provides a way to declaratively bind template fragments to asynchronous data provider functions (or Promises). An asynchronous fragment is rendered when the asynchronous data provider function invokes the provided callback with the data. If the asynchronous fragment is ready to be flushed, then it is immediately flushed to the output stream. Otherwise, if the asynchronous fragment completed out of order then the rendered HTML is buffered in memory until it is ready to be flushed. The Marko templating engine ensures that fragments are flushed in the proper order.

Continuing with the previous example, our HTML page template with asynchronous fragments defined will be similar to the following:

<html>
<head>
    <title>Clothing Store</title>
    <!-- Head <link> tags -->
</head>
<body>
    <header>
        <!-- Header -->
    </header>
    <div class="body">
        <main>
            <!-- Search Results -->
            <async-fragment data-provider="data.searchResultsProvider"
                var="searchResults">
 
                <!-- Do something with the search results data... -->
                <ul>
                    <li for="item in searchResults.items">
                        $item.title
                    </li>
                </ul>
 
            </async-fragment>
        </main>
        <section class="filters">
 
            <!-- Search filters -->
            <async-fragment data-provider="data.filtersProvider"
                var="filters">
                <!-- Do something with the filters data... -->
            </async-fragment>
 
        </section>
        <section class="ads">
 
            <!-- Ads -->
            <async-fragment data-provider="data.adsProvider"
                var="ads">
                <!-- Do something with the ads data... -->
            </async-fragment>
 
        </section>
    </div>
    <footer>
        <!-- Footer -->
    </footer>
    <!-- Body <script> tags -->
</body>
</html>

The data provider functions should be passed to the template as part of the view model as shown in the following code for a sample page controller:

function controller(req, res) {
    template.render({
            searchResultsProvider: function(callback) {
                performSearch(req.params.category, callback);
            },
 
            filtersProvider: function(callback) {
                ...
            },
 
            adsProvider: function(callback) {
                ...
            }
        },
        res /* Render directly to the output HTTP response stream */);
}

In this particular example, the “search results” async fragment appears first in the HTML template, and it happens to take the longest time to complete. As a result, all of the subsequent fragments will need to be buffered on the server. The resulting waterfall with in-order flushing of async fragments is shown below:

ps4

While the performance of this approach might be fine, we can enable out-of-order flushing for further performance gains as described in the next section.

Out-of-order flushing of async fragments

Marko achieves out-of-order flushing of async fragments by doing the following:

Instead of waiting for an async fragment to finish, a placeholder HTML element with an assigned id is written to the output stream. Out-of-order async fragments are rendered before the ending <body> tag in the order that they complete. Each out-of-order async fragment is rendered into a hidden <div> element. Immediately after the out-of-order fragment, a <script> block is rendered to replace the placeholder DOM node with the DOM nodes of the corresponding out-of-order fragment. When all of the out-of-order async fragments complete, the remaining HTML (e.g. </body></html>) will be flushed and the response ended.

To clarify, here is what the output HTML might look like for a page with out-of-order flushing enabled:

<html>
<head>
    <title>Clothing Store</title>
    <!-- 1a) Head <link> tags -->
</head>
<body>
    <header>
        <!-- 1b) Header -->
    </header>
    <div class="body">
        <main>
            <!-- 2) Search Results -->
            <span id="asyncFragment0Placeholder"></span>
        </main>
        <section class="filters">
            <!-- 3) Search filters -->
            <span id="asyncFragment1Placeholder"></span>
        </section>
        <section class="ads">
            <!-- 4) Ads -->
            <span id="asyncFragment2Placeholder"></span>
        </section>
    </div>
    <footer>
        <!-- 5a) Footer -->
    </footer>
 
    <!-- 5b) Body <script> tags -->
 
    <script>
    window.$af=function(){
    // Small amount of code to support rearranging DOM nodes
    // Unminified:
    // https://github.com/raptorjs/marko-async/blob/master/client-reorder-runtime.js
    };
    </script>
 
    <div id="asyncFragment1" style="display:none">
        <!-- 4) Ads content -->
    </div>
    <script>$af(1)</script>
 
    <div id="asyncFragment2" style="display:none">
        <!-- 3) Search filters content -->
    </div>
    <script>$af(2)</script>
 
    <div id="asyncFragment0" style="display:none">
        <!-- 2) Search results content -->
    </div>
    <script>$af(0)</script>
 
</body>
</html>

One caveat with out-of-order flushing is that it requires JavaScript running on the client to move each out-of-order fragment into its proper place in the DOM. Thus, you would only want to enable out-of-order flushing if you know that the client’s web browser has JavaScript enabled. Also, moving DOM nodes may cause the page to be reflowed, which can be visually jarring to the user and result in more client-side CPU usage. If reflow is an issue then there are tricks that can be used to avoid a reflow (e.g., reserving space as part of the initial wireframe). Marko also allows alternative content to be shown while waiting for an out-of-order async fragment.

To enable out-of-order flushing with Marko, the client-reorder="true" attribute must be added to each <async-fragment> tag, and the <async-fragments> tag must be added to the end of the page to serve as the container for rendered out-of-order fragments. Here is the updated<async-fragment> tag for the search results fragment:

<async-fragment data-provider="data.searchResultsProvider"
    var="searchResults"
    client-reorder="true">
    ...
</async-fragment>

The updated HTML page template with the new <async-fragments> tag is shown below:

<html>
<head>
    <title>Clothing Store</title>
    <!-- Head <link> tags -->
</head>
<body>
    ...
 
    <!-- Body <script> tags -->
 
    <async-fragments/>
</body>
</html>

In combination with out-of-order flushing, it may be beneficial to move <script> tags that link to external resources to the end of the first chunk (before all of the out-of-order chunks). While the server is busy preparing the rest of the page, the client can start downloading the external JavaScript required to make the page functional. As a result, the user will be able to start interacting with the page sooner.

Our final waterfall with out-of-order flushing will now be similar to the following:

ps5

The final waterfall shows that the strategy of out-of-order flushing of asynchronous fragments can significantly improve the load time and perceived load time of a page. The user will be met with a progressive loading of a page that is ready to be interacted with sooner.

Additional considerations

HTTP transport and HTML compression

To allow HTML to be served in parts, chunked transfer encoding should be used for the HTTP response. Chunked transfer encoding uses delimiters to break up the response, and each flush results in a new chunk. If gzip compression is enabled (and it should be) then flushing the pending data to the gzip stream will result in a gzip data frame being written to the response as part of each chunk. Flushing too often will negatively impact the effectiveness of the compression algorithm, but without flushing periodically then progressive HTML rendering will not be available. By default, Marko will flush at the beginning of an <async-fragment> block (in order to send everything that has already completed), as well as when an async fragment completes. This default strategy results in efficient progressive loading of an HTML page as long as there are not too many async fragments.

Binding behavior

For improved usability and responsiveness, there should not be a long delay between rendered HTML being displayed to the user in the web browser and behavior being attached to the associated DOM. At eBay, we use the marko-widgets module to bind behavior to DOM nodes. Marko Widgets supports binding behavior to rendered widgets immediately after each independent async fragment, as illustrated in the accompanying sample app. For immediate binding to work, the required JavaScript code must be included earlier in the page. For more details, please see the marko-widgets module documentation.

Error handling

It is important to note that as soon as a byte is flushed for the HTTP body, then the response is committed; no additional HTTP headers can be sent (e.g., no server-side redirects or cookie-setting), and the HTML that has been sent cannot be “unsent”. Therefore, if an asynchronous data provider errors or times out, then the app must be prepared to show alternative content for thatparticular async fragment. Please see the documentation for the marko-async module for additional information on how to show alternative content in case of an error.

Summary

The Async Fragments technique allows web developers to maximize the benefits of progressive HTML rendering to produce web pages that have improved actual and perceived load times. Developers at eBay have found the concept of binding HTML template fragments to asynchronous data providers easy to grasp and utilize. In addition, the flexibility to support both in-order and out-of-order flushing of async fragments makes this technique applicable for all web browsers and user agents.

The Marko templating engine is being used as part of eBay’s latest Node.js stack to improve performance while also simplifying how pages are constructed on both the server and the client. Marko is one of a few templating engines for Node.js and web browsers that support streaming, flushing, and asynchronous rendering. Marko has a simple HTML-based syntax, and the Marko compiler produces small and efficient JavaScript modules as output. We encourage you to try Marko online and in your next Node.js project. Because Marko is a key component of eBay’s internal Node.js stack, and given that it is heavily documented and tested, you can be confident that it will be well supported.

(Via Calendar.perfplanet.com)