The Architecture Of Algolia’s Distributed Search Network

Algolia started in 2012 as an offline search engine SDK for mobile. At the time, we had no idea that within two years we would have built a worldwide distributed search network.

Today Algolia serves more than 2 billion user-generated queries per month from 12 regions worldwide. Our average server response time is 6.7ms, and 90% of queries are answered in less than 15ms. Our unavailability rate on search is below 10⁻⁶, which represents less than 3 seconds per month.

The challenges we faced with the offline mobile SDK were technical limitations imposed by the nature of mobile. These challenges forced us to think differently when developing our algorithms because classic server-side approaches would not work.

Our product has evolved greatly since then. We would like to share our experiences with building and scaling our REST API built on top of those algorithms.

We will explain how we use a distributed consensus for high availability and synchronization of data in different regions around the world, and how we route queries to the closest locations via anycast DNS.

The data size misconception

Before designing the architecture, we first had to identify the major use cases we needed to support. This was especially true when considering our scaling needs. We had to know if our customers would need to index Gigabytes, Terabytes, or Petabytes of data. The architecture would be different depending on how many of those use cases we needed to handle.

When people think about search, most think about very big use cases like Google’s web page indexing or Facebook’s indexing of trillions of posts. If you stop and think about the search boxes you see every day, the majority of them do not search massively big datasets. Netflix searches approximately 10,000 titles and Amazon’s database in the US contains around 200,000,000 products. The data from both of these cases can be stored on a single machine! We are not saying that having a single machine is a good setup, but keeping in mind all that data can fit on one machine is really important since cross-machine synchronization is a big source of complexity and performance loss.

The road to high-availability

When building a SaaS API, high availability is a big concern as removing all single points of failure (SPOF) is extremely challenging. We spent weeks brainstorming the ideal search architecture for our service while keeping in mind our product would be geared towards user facing search.

Master-Slave Vs. Master-Master

By temporarily restricting the problem to each index being stored on a single machine, we simplified our high availability setup to several machines hosted in different data centers. With this setup, the first solution we thought of was to have a master-slave setup with one master machine receiving all indexing operations and then replicating them to one or more slave machines. With this approach, we could easily load balance search queries across all the machines.

The problem with this master-slave approach is that our high availability only works for search queries. All indexing operations need to go to the master. This architecture is too risky for a service company. All it takes is for the master to be down, which will happen, and clients will start having indexing errors.

We must implement a master-master architecture! The key element to enabling a master-master setup is to have a way of agreeing on a single result among a group of machines. We need to have shared knowledge between all machines which stays consistent under all circumstances, even when there is a network split between machines.

Introducing The Distributed Coherency

For a search engine, one of the best ways to introduce this shared knowledge is to treat the write operations as a unique stream of operations that must be applied in a certain order. When we have several operations coming at the exact same time, we need to assign them a sequence ID. This ID can then be used to ensure the sequence is applied exactly the same way on all replicas.

In order to assign a sequence ID (a number incremented by one after each job), we need a shared global state between machines on the next sequence ID. The open-source ZooKeeper software is the de facto solution for distributed knowledge in a cluster, and we initially used ZooKeeper with the following sequence (sketched in code after the list):

  1. When a machine receives a job, it copies the job to all replicas using a temporary name.

  2. That machine then takes the distributed lock.

  3. That machine reads the last sequence ID in ZooKeeper and sends an order to copy the temporary file as sequence ID + 1 on all machines. This is equivalent to a two-phase commit.

  4. If we have a majority of positive answers from the machines (quorum), we save sequence ID + 1 in ZooKeeper.

  5. The distributed lock is then released.

  6. Finally, the client sending the job is informed of the result. The result is a success if a majority of the machines committed.
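To make the flow above concrete, here is a minimal sketch in Python using the kazoo ZooKeeper client. The replica objects and their copy_temp/promote_temp helpers are hypothetical stand-ins for our internals; only the lock/read/quorum/set shape of the sequence matters.

```python
# Sketch of the six-step ZooKeeper sequence (hypothetical replica helpers).
from kazoo.client import KazooClient

zk = KazooClient(hosts="zk1:2181,zk2:2181,zk3:2181")
zk.start()

def commit_job(job, replicas):
    # Step 1: copy the job to all replicas under a temporary name.
    for r in replicas:
        r.copy_temp(job)
    # Step 2: take the distributed lock.
    with zk.Lock("/consensus-lock"):
        # Step 3: read the last sequence ID and order the copy as seq + 1.
        data, _ = zk.get("/last-seq-id")
        next_id = int(data) + 1
        acks = sum(1 for r in replicas if r.promote_temp(job, next_id))
        # Step 4: on a quorum of positive answers, save the new sequence ID.
        # NOTE: a crash right here leaves some machines committed with no
        # recorded ID; exactly the flaw discussed below.
        committed = acks > len(replicas) // 2
        if committed:
            zk.set("/last-seq-id", str(next_id).encode())
    # Step 5: the lock is released on exiting the with-block.
    # Step 6: inform the client of the result.
    return committed
```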

Unfortunately this sequence is not right. If the machine that acquires the lock crashes or restarts between steps 3 and 4, we can end up in a state where the job is committed on only some machines; a more complex sequence is needed.

Moreover, the packaging of ZooKeeper as an external service reached over a TCP connection makes it really difficult to get this right, and requires using a big timeout (the default is set to 4 seconds, representing two ticks of two seconds each).

As a consequence, every failure event, whether from hardware or software, would freeze our entire system for the duration of this timeout. That might seem acceptable, but in our case we wanted to test failures very often in production (like the monkey testing approach of Netflix).

The Raft Consensus Algorithm

Around the time we were running into these problems, the RAFT consensus algorithm was published. It was clear right away that this algorithm fit our use case perfectly. The state machine of RAFT is our index, and the log is the list of index jobs to be executed. I already knew about the PAXOS protocol, but did not have a strong enough understanding of it and all its variants to be confident I could implement it myself. RAFT, on the other hand, was much clearer. It was a perfect match for what we needed, and even without stable open source implementations at that time, I was confident enough in my understanding to implement it as the basis of our architecture.

The hardest part of implementing consensus algorithms is making sure there are no bugs in the system. To handle that, I opted for a monkey testing approach, randomly killing processes with a sleep before restarting them. To test it even further, I simulated network drops and degradations via the firewall. This type of testing helped us find many bugs. Once the system had been operating for several days without any problems, I was very confident the implementation was done correctly.
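For illustration, the monkey-testing loop can be as small as the following sketch; the process name and restart command are placeholders, not our actual tooling.

```python
# Randomly kill and later restart a target process to exercise failover.
import random
import subprocess
import time

while True:
    time.sleep(random.uniform(5, 300))                  # run normally for a while
    subprocess.run(["pkill", "-9", "builder"])          # hard-kill the process
    time.sleep(random.uniform(0, 30))                   # leave it dead briefly
    subprocess.run(["systemctl", "start", "builder"])   # bring it back up
```

Simulating network drops is the same idea applied at the firewall (inserting and removing packet-drop rules) rather than at the process level.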

Replicate At Application Or Filesystem Level?

We have chosen to distribute the write operations to all machines and execute them locally, rather than replicating the final results at the filesystem level. We made this choice for two reasons:

  • It is faster. Indexing is done in parallel on all machines, and that is faster than replicating the resulting binary files, which can be big.

  • It is compatible with multiple regions. If we replicated the files after indexing, we would need a process to rewrite the whole index, which means huge amounts of data to transfer. Transferring that much data to different geographic regions around the world (e.g., New York to Singapore) would be very inefficient.

Each machine receives all write operation jobs in the correct order and processes them as soon as possible, independently of the other machines. This means all machines are assured to reach the same state, but not necessarily at the same time, because the changes may not be committed on all machines at exactly the same moment.

The Compromise On Consistency

In distributed computing, the CAP Theorem states that it is impossible for a distributed computing system to simultaneously provide all three of the following:

  • Consistency: all nodes see the same data at the same time.

  • Availability: a guarantee that every request receives a response about whether it succeeded or failed.

  • Partition tolerance: the system continues to operate despite arbitrary message loss or failure of part of the system.

According to this theorem, we compromised on Consistency. We don’t guarantee that all nodes see exactly the same data at the same time, but they will all receive the updates. In other words, there can be brief windows where the machines are not synchronized. In reality, this is not a problem, because when a customer performs a write operation we apply that job on all hosts, and less than one second passes between its application on the first and last machine, so it is normally not visible to end users. The only possible inconsistency is whether the last update received has been applied yet, which is compatible with the use cases of our clients.

General Architecture

Definition Of A Cluster

Having a distributed consensus between machines is mandatory for a high-availability infrastructure, but there is unfortunately a big drawback. The consensus requires several round trips between the machines, so the number of possible consensus operations per second is directly related to the latency between them. The machines need to be close together to sustain a high number of consensus operations per second. To support several regions without sacrificing the number of possible write operations, we need several clusters, each containing three machines that act as perfect replicas.

Having one cluster per region is the minimum needed for consensus, but is still far from perfect:

  • We cannot make all customers fit on one machine.

  • The more customers we have, the fewer write operations per second each individual customer can perform. This is because the maximum number of consensus operations per second is fixed.

In order to work around this problem, we decided to apply the same concept at the region level: each region has several clusters of three machines. One cluster can host from one to several customers, depending on the size of their data. This concept is close to what virtualization does on a physical machine. We are able to put several customers on a cluster, except that any one customer can grow and change their usage dynamically. To handle this, we needed to develop and automate the following processes:

  • Migrate one customer to another cluster if their cluster has too much data or too many write operations.

  • Add a new machine to the cluster if the volume of queries is too big.

  • Change the number of shards or split one customer across several clusters if their volume of data is too big.

If we have these processes in place, a customer won’t be assigned to a cluster permanently. Assignment will change depending on their own usage as well as the cluster’s usage. This means we need a way to assign a customer to a cluster.

Assigning A Customer To A Cluster

The standard way to manage this assignment is to have one unique DNS entry per customer. This is similar to how Amazon CloudFront works. Each customer is assigned a unique DNS entry of the form customerID.cloudfront.net that can then target a different set of machines depending on the customer.

We chose to go with the same approach. Each customer is assigned a unique application ID, which is linked to a DNS record of the form APPID.algolia.io. This DNS record targets a specific cluster, with all machines in the cluster being part of the DNS record, so load balancing is done via DNS. We also use health check mechanisms to detect machine failures and remove them from the DNS resolution.

The health check mechanism is still not sufficient to provide a good SLA, even with a very low TTL on the DNS records (TTL is the time the client is allowed to keep the DNS answer cached). The problem is that a host may go down while a user still has it in cache, so the user will continue to send queries to it until the cache expires. It gets even worse because TTL is not an exact science: there are cases where systems do not respect the TTL. We have seen DNS records with a TTL of one minute transformed into a TTL of 30 minutes by some DNS servers.

In order to further improve high availability and avoid a machine failure impacting users, we generate another set of DNS records for each customer of the form APPID-1.algolia.io, APPID-2.algolia.io, and APPID-3.algolia.io. The idea behind these DNS records is to allow our API clients to retry other records when a TCP connect timeout is reached (usually set to one second). Our standard implementation is to shuffle the list of DNS records and try them in sequential order.
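The retry logic in our API clients amounts to a few lines. This Python sketch uses the requests library with an illustrative endpoint path; treat it as the shape of the logic, not our actual client code.

```python
# Shuffle the per-customer DNS records, then try each with a short timeout.
import random
import requests

HOSTS = ["APPID-1.algolia.io", "APPID-2.algolia.io", "APPID-3.algolia.io"]

def search(params):
    hosts = HOSTS[:]
    random.shuffle(hosts)                    # spread load across the records
    for host in hosts:
        try:
            return requests.get("https://" + host + "/1/indexes/products/query",
                                params=params, timeout=1)  # ~1s timeout
        except requests.exceptions.RequestException:
            continue                         # dead host: fall through to the next
    raise RuntimeError("all hosts unreachable")
```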

Combined with carefully controlled retry and timeout logic in our API clients, this proved to be a better and cheaper solution than using a specialized load balancer.

Later, we discovered that the trendy .IO TLD was not a good choice for performance. There are fewer DNS servers in .IO’s anycast network than in .NET’s, and the ones it had were saturated. This resulted in a lot of timeouts that slowed down name resolution. We have since solved these performance problems by switching to algolia.net domains, while keeping backwards compatibility by continuing to support algolia.io.

What about Scalability of a cluster?

Our choice of using several clusters allows us to add more customers without too much risk of impacting existing customers because of the isolation between clusters. But we still had concerns about the scalability of one cluster that needed to be addressed.

The first limiting factor in the scalability of a cluster is the number of write operations per second due to the consensus. In order to mitigate this factor, we introduced a batch method in our API that encapsulates a set of write operations in one operation from the consensus point of view. The problem is that some customers still perform write operations without batching which can have a negative impact on indexing speed for other customers of the cluster.

In order to reduce this performance impact, we made two changes to our architecture (the first is sketched in code after this list):

  • We added a batching strategy when there is contention on the consensus by automatically aggregating all write operations of each customer inside a unique operation from the consensus point of view. In practice, this means that we are reordering the sequence of jobs but without an impact on the semantics of the operations. For example, if there are 1,000 jobs pending for consensus and 990 are from one customer, we will merge 990 write operations into one even if there are jobs of other customers interlaced with them.

  • We added a consensus scheduler that controls the number of write operations per second entering the consensus for each application ID. This avoids one customer being able to use all the bandwidth of the consensus.
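As a toy illustration of the first change, assume pending jobs arrive as (application ID, operation) pairs; the aggregation preserves each customer's own ordering while collapsing their jobs into one consensus entry.

```python
# Collapse all pending operations of each customer into one consensus entry.
from collections import OrderedDict

def aggregate_pending(jobs):
    """jobs: list of (app_id, operation) tuples in arrival order."""
    batches = OrderedDict()                  # keeps first-seen customer order
    for app_id, op in jobs:
        batches.setdefault(app_id, []).append(op)
    # 990 interlaced jobs from one customer become a single batched entry,
    # so the consensus pays one round per customer instead of one per job.
    return list(batches.items())
```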

Before we implemented these improvements, we tried a rate limit strategy by returning a 429 HTTP status code. It was apparent very quickly that this was too painful for our customers to have to watch for this response and implement a retry strategy. Today, our biggest customer performs more than one billion write operations per day on a single cluster of three machines which is an average of 11,500 operations per second with bursts of more than 150,000.

The second problem was to find the best hardware setup and avoid any potential bottlenecks such as CPU or I/O that could compromise the scalability of a cluster. Since the beginning we made the choice to use our own bare metal servers in order to fully control the performance of our service and avoid wasting any resources. Selecting the correct hardware proved to be a challenging task.

At the end of 2012, we started with a small setup consisting of an Intel Xeon E3 1245v2, 2x Intel SSD 320 series 120GB in RAID 0, and 32GB of RAM. This hardware was reasonable in terms of price, more powerful than cloud platforms, and allowed us to start the service in Europe and US-East.

This setup allowed us to tune the kernel for I/O scheduling and virtual memory which was critical for us to take advantage of all available physical resources. Even so, we soon discovered our limits were the amount of RAM and I/O. We were using around 10GB of RAM for indexing which left only 20GB of RAM for caching of files used for performing search queries. Our goal had always been to have customer indices in memory in order to have a service optimized for millisecond response times. The current hardware setup was designed for 20GB of index data which was too small.

After this first setup, we tried different hardware machines with single and dual socket CPUs, 128GB and 256GB of RAM, and different models/sizes of SSD.

We finally found an optimal setup with a machine containing an Intel Xeon E5 1650v2, 128GB of RAM, and 2x400GB Intel S3700 SSD. The model of the SSD was very important for durability. We burned a lot of SSDs before finding the correct model that can operate in production for years.

In the end, the final architecture we built allowed us to scale well in all areas with only one condition: we needed to have free resources available at any moment. It might seem crazy in 2015 to deal with the pain of having to manage bare metal servers, but the gain we have in terms of quality of service and price for our customers is well worth it. We are able to offer a fully packaged search engine with replication to three different locations, in memory indices, and with excellent performance in more locations than AWS!

Is it complex to operate?

Limit The Number Of Processes

Each machine contains only three processes. The first is an nginx server with all our query interpretation code embedded inside as a module. To answer a query, we memory-map the index files and execute the query directly inside the nginx worker without communicating with another process or machine. The only exception is when the customer’s data does not fit on one machine, which is rare.
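The access pattern inside the worker is plain memory mapping. The real code is an nginx module; this Python sketch, with an invented file path and a byte-level lookup standing in for actual query execution, only shows the idea.

```python
# Map an index file once, then read it directly inside the request handler.
import mmap

with open("/indices/APPID/main.idx", "rb") as f:
    index = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)

def lookup(term: bytes) -> int:
    # The OS page cache keeps hot index pages in RAM, so repeated reads
    # are effectively memory-speed with no inter-process communication.
    return index.find(term)   # -1 if absent, else the byte offset
```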

The second process is a redis key/value store that we use to check rates and limits as well as storing real time logs and counters for each application ID. These counters are used to build our real time dashboard which can be viewed when you connect to your account. This is useful for visualizing your last API calls and for debugging.

The last process is the builder. This is the process responsible for handling all write operations. When the nginx process receives a write operation, it forwards the operation to the builder to perform the consensus. It is also responsible for building the indices and contains a lot of monitoring code that checks for errors in our service such as crashes, slow indexing, indexing errors, etc. Depending on the severity of the problem, some are reported by SMS via Twilio’s API while others are reported directly to PagerDuty. Each time a new problem is detected in production and not reported we make sure to add a new probe to watch for this type of error in the future.

Ease Of Deployment

The simplicity of this stack makes deployments easy. Before we deploy any code, we run a suite of unit tests and non-regression tests. Once all those tests pass, we gradually deploy to clusters.

Our deployments should never impact production nor be visible to end users. At the same time, we also want to generate a host failure in consensus in order to check everything is working as expected. In order to achieve both goals, we deploy each machine of a cluster independently and apply the following procedures:

  1. Fetch new nginx and builder binaries.

  2. Gracefully restart the nginx web server, relaunching it with the new binary without losing any user queries.

  3. Kill the builder and launch it using the new binary. This triggers a RAFT failure on the deployment of each machine, which allows us to make sure our failover is working as expected.

The simplicity of operating our system was an important goal in our architecture. We neither wanted nor believed deployment should be constrained by the architecture.

Achieving A Good Worldwide Coverage

Services are becoming more and more global. Serving search queries from only one worldwide region is far from optimal. For example, search hosted in US-East will differ greatly in usability depending on where users are searching from. Latency ranges from a few milliseconds for users in US-East to several hundred milliseconds for users in Asia, without even counting the bandwidth limitations of saturated overseas fibers.

We have seen some companies use a CDN on top of a search engine to address these issues. For us, this causes more problems than it solves: invalidating the cache is a nightmare, and it only improves the speed of the small percentage of queries that are made frequently. It was clear to us that in order to solve this problem we would need to replicate indices to different regions and have them loaded in memory in order to answer user queries efficiently.

What we need is inter-region replication on top of our existing cluster replication. A replica can be stored on one machine, since it will only be used for search queries. All write operations still go to the customer’s original cluster.

Each customer can select the set of data centers they want as replicas, so a replica machine in a specific region can receive data from several clusters, and a cluster can send data to several replicas.

The implementation of this architecture is modeled on our consensus-based stream of operations. Each cluster transforms its own stream of write operations, after consensus, into a version for each replica, making sure to replace jobs that are not relevant for that replica with no-op jobs. The stream of operations is then sent to each replica as a batch of operations, since sending jobs one by one would result in too many round trips with the replicas.

On the cluster, write operations are kept on the machines until they are acknowledged by all replicas.
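A sketch of that per-replica transformation; the job tuples and subscription sets are invented for the example, but the no-op substitution is the essential trick: every replica consumes the same sequence IDs, so its position in the log stays comparable to the cluster's.

```python
# Rewrite the consensus stream for one replica, then ship it as a batch.
def stream_for_replica(ops, subscribed_indices):
    batch = []
    for seq_id, index_name, op in ops:
        if index_name in subscribed_indices:
            batch.append((seq_id, index_name, op))
        else:
            batch.append((seq_id, index_name, "no-op"))  # keep numbering intact
    return batch   # one batch per replica avoids per-job round trips
```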

The last part of the DSN is to redirect the end user directly to the closest location. In order to do that, we added another DNS record of the form APPID-dsn.algolia.net that takes care of the resolution to the closest data center. We first used Amazon’s Route 53 DNS service, but rapidly hit its limits:

  • The latency-based routing is limited to the AWS regions and we have locations not covered by AWS like India, Hong Kong, Canada and Russia.

  • The geo-based routing is horrible. You need to indicate, for each country, what the DNS resolution will be. This is a classic approach that a lot of hosted DNS providers take, but in our case it would be a nightmare to support and would not provide enough relevance. For example, we have several data centers in the US.

After a lot of benchmarking and discussion, we decided upon using NSOne for several reasons:

  • Their anycast network is very good and better balanced than AWS’s for us. For example, they have POPs in India and Africa.

  • Their filter logic is really good. For each customer we can specify the list of machines associated with them (including replicas) and use a geo filter to sort them by distance (sketched in code after this list). We are then able to keep the best one.

  • They support EDNS client subnets. This is important for relevance: we use the IP of the final user, instead of the IP of their DNS server, for the resolution.
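As a rough illustration of the geo-filter idea, with invented hostnames and coordinates (NSOne evaluates its filters server-side; this just shows the sort-by-distance step):

```python
# Sort a customer's machines by great-circle distance and keep the closest.
from math import asin, cos, radians, sin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 \
        + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 6371 * 2 * asin(sqrt(a))          # Earth radius of ~6371 km

MACHINES = [("us-east-1.example", 40.7, -74.0),   # New York
            ("eu-1.example", 48.9, 2.4),          # Paris
            ("sg-1.example", 1.35, 103.8)]        # Singapore

def closest(client_lat, client_lon):
    return min(MACHINES,
               key=lambda m: haversine_km(client_lat, client_lon, m[1], m[2]))
```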

In terms of performance, we have been able to reach global worldwide synchronization at the second level. You can try it out on Product Hunt’s search (hosted in US-East, US-West, India, Australia, and Europe) or on Hacker News’ search (hosted in US-East, US-West, India, and Europe).

Conclusion

We spent a lot of time building our distributed and scalable architecture and have faced a lot of different problems. I hope this article gives you a better understanding about how we resolved those problems and provides a useful guide on how to design your own services.

I’m seeing more and more services that are facing problems similar to ours: a worldwide audience with multi-region infrastructure, but with some worldwide-consistent information like logins or content. Having a multi-region infrastructure today is mandatory to achieve an excellent user experience. This approach can be used, for example, to distribute read-only replicas of a database that will be consistent worldwide!
(via HighScalability.com)

MongoDB 3.0 with a new storage engine

A lot has happened in MongoDB technology over the past year. For starters:

  • The big news in MongoDB 3.0* is the WiredTiger storage engine. The top-level claims for that are that one should “typically” expect (individual cases can of course vary greatly):
    • 7-10X improvement in write performance.
    • No change in read performance (which however was boosted in MongoDB 2.6).
    • ~70% reduction in data size due to compression (disk only).
    • ~50% reduction in index size due to compression (disk and memory both).
  • MongoDB has been adding administration modules.
    • A remote/cloud version came out with, if I understand correctly, MongoDB 2.6.
    • An on-premise version came out with 3.0.
    • They have similar features, but are expected to grow apart from each other over time. They have different names.

*Newly-released MongoDB 3.0 is what was previously going to be MongoDB 2.8. My clients at MongoDB finally decided to give a “bigger” release a new first-digit version number.

To forestall confusion, let me quickly add:

  • MongoDB acquired the WiredTiger product and company, and continues to sell the product on a standalone basis, as well as bundling a version into MongoDB. This could cause confusion because …
  • … the standalone version of WiredTiger has numerous capabilities that are not in the bundled MongoDB storage engine.
  • There’s some ambiguity as to when MongoDB first “ships” a feature, in that …
  • … code goes to open source with an earlier version number than it goes into the packaged product.

I should also clarify that the addition of WiredTiger is really two different events:

  • MongoDB added the ability to have multiple plug-compatible storage engines. Depending on how one counts, MongoDB now ships two or three engines:
    • Its legacy engine, now called MMAP v1 (for “Memory Map”). MMAP continues to be enhanced.
    • The WiredTiger engine.
    • A “please don’t put this immature thing into production yet” memory-only engine.
  • WiredTiger is now the particular storage engine MongoDB recommends for most use cases (a sketch for checking which engine a server runs follows this list).
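A quick way to see which engine a given server is actually running: in 3.0, serverStatus reports the active storage engine. A pymongo sketch (the connection string is illustrative):

```python
# Ask a mongod which storage engine it is running (MongoDB 3.0+).
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")
status = client.admin.command("serverStatus")
print(status.get("storageEngine", {}).get("name"))  # "wiredTiger" or "mmapv1"
```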

I’m not aware of any other storage engines using this architecture at this time. In particular, last I heard TokuMX was not an example. (Edit: Actually, see Tim Callaghan’s comment below.)

Most of the issues in MongoDB write performance have revolved around locking, the story on which is approximately:

  • Until MongoDB 2.2, locks were held at the process level. (One MongoDB process can control multiple databases.)
  • As of MongoDB 2.2, locks were held at the database level, and some sanity was added as to how long they would last.
  • As of MongoDB 3.0, MMAP locks are held at the collection level.
  • WiredTiger locks are held at the document level. Thus MongoDB 3.0 with WiredTiger breaks what was previously a huge write performance bottleneck.

In understanding that, I found it helpful to do a partial review of what “documents” and so on in MongoDB really are.

  • A MongoDB document is somewhat like a record, except that it can be more like what in a relational database would be all the records that define a business object, across dozens or hundreds of tables.*
  • A MongoDB collection is somewhat like a table, although the documents that comprise it do not need to each have the same structure.
  • MongoDB documents want to be capped at 16 MB in size. If you need one bigger, there’s a special capability called GridFS to break it into lots of little pieces (default = 1KB) while treating it as a single document logically. (A GridFS sketch follows below.)

*One consequence — MongoDB’s single-document ACID guarantees aren’t quite as lame as single-record ACID guarantees would be in an RDBMS.
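For illustration, a minimal GridFS round trip with pymongo; the chunking into pieces happens transparently behind put and get.

```python
# Store a payload larger than the 16 MB document cap, read it back as one file.
import gridfs
from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017/").test
fs = gridfs.GridFS(db)

file_id = fs.put(b"x" * (32 * 1024 * 1024), filename="big.bin")  # 32 MB payload
assert fs.get(file_id).read(1) == b"x"    # retrieved as one logical stream
```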

By the way:

  • Row-level locking was a hugely important feature in RDBMS about 20 years ago. Sybase’s lack of it is a big part of what doomed them to second-tier status.
  • Going forward, MongoDB has made the unsurprising marketing decision to talk about “locks” as little as possible, relying instead on alternate terms such as “concurrency control”.

Since its replication mechanism is transparent to the storage engine, MongoDB allows one to use different storage engines for different replicas of data. Reasons one might want to do this include:

  • Fastest persistent writes (WiredTiger engine).
  • Fastest reads (wholly in-memory engine).
  • Migration from one engine to another.
  • Integration with some other data store. (Imagine, for example, a future storage engine that works over HDFS. It probably wouldn’t have top performance, but it might make Hadoop integration easier.)

In theory one can even do a bit of information lifecycle management (ILM), using different storage engines for different subsets of the database, by:

  • Pinning specific shards of data to specific servers.
  • Using different storage engines on those different servers.

That said, similar stories have long been told about MySQL, and I’m not aware of many users who run multiple storage engines side by side.

The MongoDB WiredTiger option is shipping with a couple of options for block-level compression (plus prefix compression that is being used for indexes only). The full WiredTiger product also has some forms of columnar compression for data.

One other feature in MongoDB 3.0 is the ability to have 50 replicas of data (the previous figure was 12). MongoDB can’t think of a great reason to have more than 3 replicas per data center or more than 2 replicas per metropolitan area, but some customers want to replicate data to numerous locations around the world.
(via dbms2.com)

How OpenCL Could Open the Gates for FPGAs

In this special guest feature from Scientific Computing World, Robert Roe explains how OpenCL may make FPGAs an attractive option.

Over the past few years, high-performance computing (HPC) has become used to heterogeneous hardware, principally mixing GPUs and CPUs, but now, with both major FPGA manufacturers in conformance with the OpenCL standard, the door is effectively open for the wider use of FPGAs in high-performance computing.

In January 2015, FPGAs took a step closer to the mainstream of high-performance computing with the announcement that Xilinx’s development environment for systems and software engineers, SDAccel, had been certified as conforming to the OpenCL standard for parallel programming of heterogeneous systems.

The changing landscape of HPC, with the move towards data-centric computing, could favour FPGAs with very high I/O throughput. However, it remains to be seen if FPGAs will be used as an accelerator or if supercomputers might be built using FPGA as the main processor technology.

One of the attractions of FPGAs is that they consume very little power but, as with GPUs initially, the barrier to adoption has been the difficulty of programming them. Manufacturers and vendors are now releasing compilers that will optimise code written in C and C++ to make use of the flexible nature of FPGA architecture.

Easier to program

Mike Strickland, director of the computer and storage business unit at Altera said: “The problem was that we did not have the ease of use, we did not have a software-friendly interface back in 2008. The huge enabler here has been OpenCL.”

Larry Getman, VP of strategic marketing and planning at Xilinx said: ‘When FPGAs first started they could do very basic things such as Boolean algebra, and they were really used for glue logic. Over the years, FPGAs have really advanced and evolved with more hardened structures which are much more specialised.’

Getman continued: ‘Over the years FPGAs have gone from being glue logic to harder things like radio head systems, that do a lot of DSP processing; very high-performance vision applications; wireless radio; medical equipment; and radar systems. So they are used in high-performance computing, but for applications that use very specialised algorithms.’

Getman concluded: ‘The reason people use FPGAs for these applications is simple, they offer a much higher level of performance per Watt than trying to run the same application in pure software code.’

FPGAs are programmable semiconductor devices that are based around a matrix of configurable logic blocks (CLBs) connected through programmable interconnects. This is where the FPGA gets the term ‘field programmable’ as an FPGA can be programmed and optimised for a specific application. Efficient programming can take advantage of the inherent parallelism of the FPGA architecture delivering a higher level of performance than accelerators that have a less flexible architecture.

Millions of threads running at the same time

Devadas Varma, senior director of software Research and Development at Xilinx said: ‘A CPU, if it is single core CPU, executes one instruction at a time and if you have four cores, eight cores, that are multithreaded then you can do eight or sixteen sets of instructions, for example. If you compare this to an FPGA, which is a blank set of millions of components that you decide to interconnect, theoretically speaking you could have thousands or even millions of threads running at the same time.’

Reuven Weintraub, founder and chief technology officer at Gidel, highlighted the differences between FPGAs and the processors used in CPUs today. He said: ‘They are the same and they are different. They are the same from the perspective that both of them are programmable. The difference is coming from the fact that in the FPGA all the instructions would run in parallel. Actually the FPGA is not a processor; it is compiled to be a dedicated set of hardware components according to the requirements of the algorithm – that is what gives it the efficiency, power savings and so on.’

Traditionally this power efficiency, scalability, and flexible architecture came at the price of more complex programming: code needed to address the hardware and the flow of data across the various components, in addition to providing the basic instruction set to be computed in the logic blocks. However, major FPGA manufacturers Altera and Xilinx have both been working on their own OpenCL based solutions which have the potential to make FPGA acceleration a real possibility for more general HPC workloads.

Development toolkits

Xilinx has recently released SDAccel, a development environment that includes its own compiler, tools for code development, profiling, and debugging, and provides a GPU-like work environment. Getman said: ‘Our goal is to make an FPGA as easy to program as a GPU. SDAccel, which is OpenCL based, does allow people to program in OpenCL and C or C++ and they can now target the FPGA at a very high level.’
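For readers who have not worked with OpenCL: a kernel is ordinary C-like code, and the host program stages buffers and launches it. The sketch below uses the pyopencl bindings purely for brevity; a vendor toolchain such as SDAccel consumes the same kind of kernel source but builds it for an FPGA target instead of a CPU or GPU.

```python
# Minimal OpenCL vector-add: host code in Python, kernel in OpenCL C.
import numpy as np
import pyopencl as cl

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)

a = np.random.rand(1024).astype(np.float32)
b = np.random.rand(1024).astype(np.float32)
mf = cl.mem_flags
a_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=a)
b_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=b)
out_buf = cl.Buffer(ctx, mf.WRITE_ONLY, a.nbytes)

prg = cl.Program(ctx, """
__kernel void vadd(__global const float *a,
                   __global const float *b,
                   __global float *out) {
    int i = get_global_id(0);      // one work-item per element
    out[i] = a[i] + b[i];
}
""").build()

prg.vadd(queue, a.shape, None, a_buf, b_buf, out_buf)
result = np.empty_like(a)
cl.enqueue_copy(queue, result, out_buf)
assert np.allclose(result, a + b)
```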

In addition, SDAccel provides functionality to swap multiple kernels in and out of the FPGA without disrupting the interface between the server CPU and the FPGA. This could be a key enabler of FPGAs in real-world data centres where turning off some of your resources while you re-optimise them for the next application is not an economically viable strategy at present.

Altera has been working closely with the Khronos group, which oversees a number of open computing standards including OpenCL, OpenGL, and WebGL. Altera released a development toolkit, Altera’s SDK for OpenCL, in May 2013. Strickland said: ‘In May 2013 we achieved a very important conformance test with the standards body – the Khronos group – that manages OpenCL. We had to pass 8,000 tests and that really strengthened the credibility of what we are doing with the FPGA.’

Strickland continued: ‘In the past, there were a lot of FPGA compiler tools that took care of the logic but not the data management. They could take lines of C and automatically generate lines of RTL but they did not take care of how that data would come from the CPU, the optimisation of external memory bandwidth off the FPGA, and that is a large amount of the work.’

Traditionally optimising algorithms to utilise fully the parallel architectures of FPGA technology involved significant experience using HDLs (hardware description languages) because they allowed programmers to write code that would address the FPGA at register-transfer level (RTL).

RTL enables programmers to describe the flow of data between hardware registers, and the logical operations performed on that data. This is typically what creates the difference in performance between more general processors and FPGAs, which can be optimised much more efficiently for a specific algorithm.

The difficulty is that that kind of coding requires expertise and can be very time consuming. Hand-coded RTL may go through several iterations as programmers test the most efficient ways to parallelise the instruction set to take advantage of the programmable hardware on the FPGA.

Strickland said: “With OpenCL or the OpenCL compiler, you still write something that is like C code that targets the FPGA. The big difference I would say is the instruction set. The big innovation has been the back end of our complier which can now take that C code and efficiently use the FPGA.”

Strickland noted that Altera’s compiler ‘does more than 200 optimisations when you write some C code. It is doing things like seeing the order in which you access memory so that it can group memory addresses together, improving the efficiency of that memory interface.’

Converting code from different languages into an RTL description has been possible for some time, but these developments in OpenCL make it much easier for programmers without extensive knowledge of HDLs, such as VHDL and Verilog, to make use of FPGAs.

However OpenCL is not the final piece of the puzzle for FPGA programming. Strickland said: ‘Over time you may want to have other high-level interfaces. There is a standard called SPIR (Standard Portable Intermediate Representation). The idea is that this allows you to kind of split up your compiler between the front end and the back end, enabling people to use different high-level language interfaces on the front end.’

Strickland continued: ‘In universities now there is research into domain-specific languages, where people trying to accomplish a certain class of algorithms may benefit from having a higher-level interface than even C. The idea behind exposing this intermediate compiler interface is you can now start working with the ecosystem to have front ends with higher-level interfaces.’

Over the past few years, there have been two ideas behind the best way to program FPGAs: high-level synthesis (HLS) or OpenCL. As OpenCL has matured, Xilinx decided to adopt the standard but to keep the work it had done developing HLS technology and integrate that into the development environment conforming to the OpenCL standard.

Getman said: “The main problem is that C is very much designed to go cycle to cycle, step by step. Unfortunately hardware doesn’t. Hardware has a lot of things running at the same time.” This aspect was what made HLS attractive as a compiler that can take OpenCL, C or C++ and architecturally optimise it for the FPGA hardware.

Xilinx acquired AutoESL and its HLS tool AutoPilot in 2011 and began integrating it into its own development tools for FPGAs. Getman said: ‘That was really the big switching point. For many years, people had been promising really great results with HLS but in reality the results were a lot bigger and a lot slower than what could have been done by hand.’

Getman continued: ‘We have integrated this technology into our tools and added a lot to it. This is really one of the big differentiators from our competition, even though we both have OpenCL support. This technology allows our users the opportunity to create their own libraries in real-time using C, C++ or OpenCL, rather than have to wait for the vendor to create specific libraries or specific algorithms for them.’

Varma said: “The silver bullet in HLS is the ability to take a sequential description that has been written in C and then find this parallelism, the concurrencies, without the user having to think. That was a necessary technology before we could do anything. It has been adopted by thousands of users already as a standalone technology, but what we do is embed that technology inside OpenCL compilers so that now it can be utilised in full software mode and it is fully compatible with OpenCL.”

Getman said: “We consciously made a switch over the last few years to expand our customer base by both continuing technology development for our traditional users as well as expand our tool flow to cater to software coders.”

A key facet of this technology is that Xilinx is letting programmers take the work they have done in C and port it over to OpenCL using the technology from HLS that is now integrated into its compilers. Varma said: ‘One thing that changes when you go from software to hardware programming is that C programmers, OpenCL programmers, are used to dealing with a lot of libraries. They do not have to write matrix multiplications or filters or those kinds of things, because they are always available as library elements. Now hardware languages often have libraries, but they are very specific implementations that you cannot just change for your use.’

Varma concluded: “By writing in C, our HLS technology can re-compile that very efficiently and immediately. This gives you a tremendous capability.”

Coprocessor or something bigger?

FPGA manufacturers like Altera and Xilinx have been focusing their attention on using FPGAs in HPC as coprocessors or accelerators that would be used in much the same way as GPUs.

Getman said: “The biggest use model is really processor plus FPGA. The reason for that is there are still things that you want to run on a processor. You really want a processor to do what it is good at. Typically an FPGA will be used through something like a PCIE slot and it will be used as an acceleration engine for the things that are really difficult for the processor.”

This view was shared by Devadas Varma who highlighted some of the functionality in an earlier release of OpenCL that increased the potential for CPU/GPU/FPGA synergy.

Varma said: ‘The tool we have developed supports OpenCL 1.2 and importantly it can co-exist with CPUs and GPUs. In fact in our upcoming release we will support partitioning workloads into GPUs, we already support this feature regarding CPUs. That is definitely where we are heading.’

However, this was not a view shared by Reuven Weintraub, at Gidel, who felt that to regard an FPGA simply as a coprocessor was to miss much of the point and many of the advantages that FPGAs could offer to computing. Weintraub said: “For me a coprocessor is like the 8087 was, you make certain lines of code in the processor and then you say “there’s a line of code for you” and it returns, and this goes back and forth. The big advantage of running with the FPGA is that the FPGA can have a lot of pipelining inside of it, solve a lot of things and have a lot of memory.”

He explained that an FPGA contains a ‘huge array of registers that are immediately available’ by taking advantage of the on-board memory and high-throughput that FPGAs can handle, meaning that ‘you do not necessarily have to use the cache because the data is being moved in and out in the correct order.’

Weintraub concluded: “Therefore it is better to give a task to the FPGA rather than giving it just a few opcodes and going back and forth. It is more task oriented. Computing is a balance between the processing, memory access, networking and storage, but everything has to be balanced. If you want to utilize a good FPGA then you need to give it a task that makes use of its internal memory so that it can move things from one job to another.”

Gidel has considerable experience in this field. Gidel provided the FPGAs for the Novo-G supercomputer, housed at the University of Florida, the largest re-configurable supercomputer available for research.

The university is a lead partner in the ‘Center for High-Performance Reconfigurable Computing’ (CHREC), a US national research centre funded by the National Science Foundation.

In development at the UF site since 2009, Novo-G features 192 40nm FPGAs (Altera Stratix-IV E530) and 192 65nm FPGAs (Stratix-III E260).

These 384 FPGAs are housed in 96 quad-FPGA boards (Gidel ProcStar-IV and ProcStar-III) and supported by quad-core Nehalem Xeon processors, GTX-480 GPUs, 20Gb/s non-blocking InfiniBand, GigE, and approximately 3TB of total RAM, most of it directly attached to the FPGAs. An upgrade is underway to add 32 top-end, 28nm FPGAs (Stratix-V GSD8) to the system.

According to the article ‘Novo-G: At the Forefront of Scalable Reconfigurable Supercomputing’ written by Alan George, Herman Lam, and Greg Stitt, three researchers from the university, Novo-G achieved speeds rivaling the largest conventional supercomputers in existence – yet at a fraction of their size, energy, and cost.

But although processing speed and energy efficiency were important, they concluded that the principal impact of a reconfigurable supercomputer like Novo-G was the freedom that its innovative design can give to scientists to conduct more types of analysis, and examine larger datasets.

The potential is there.

(via InsideHPC.com)

Hard disk reliability examined once more: HGST rules, Seagate is alarming


A year ago we got some insight into hard disk reliability when cloud backup provider Backblaze published its findings for the tens of thousands of disks that it operated. Backblaze uses regular consumer-grade disks in its storage because of the cheaper cost and good-enough reliability, but it also discovered that some kinds of disks fared extremely poorly when used 24/7.

A year later the company has collected even more data and drawn out even more differences between the different disks it uses.

For a second year, the standout reliability leader was HGST. Now a wholly owned subsidiary of Western Digital, HGST inherited the technology and designs from Hitachi (which itself bought IBM’s hard disk division). Across a range of models from 2 to 4 terabytes, the HGST models showed low failure rates; at worst, 2.3 percent failing a year. This includes some of the oldest disks in Backblaze’s collection; 2TB Desktop 7K2000 models are on average 3.9 years old, but still have a failure rate of just 1.1 percent.

At the opposite end of the spectrum are Seagate disks. Last year, the two 1.5TB Seagate models used by Backblaze had failure rates of 25.4 percent (for the Barracuda 7200.11) and 9.9 percent (for the Barracuda LP). Those units fared a little better this time around, with failure rates of 23.8 and 9.6 percent, even though they were the oldest disks in the test (average ages of 4.7 and 4.9 years, respectively). However, their poor performance was eclipsed by the 3TB Barracuda 7200.14 units, which had a whopping 43.1 percent failure rate, in spite of an average age of just 2.2 years.
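Backblaze derives these percentages as annualized failure rates from accumulated drive-days of service; a sketch of the arithmetic, with made-up counts rather than Backblaze's raw data:

```python
# Annualized failure rate: failures per drive-year of service.
def annualized_failure_rate(failures, drive_days):
    return failures / (drive_days / 365.0)

# e.g. 120 failures across 1,000 drive-years of service -> 12.0% per year
print(f"{annualized_failure_rate(120, 365 * 1000):.1%}")
```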

Backblaze’s storage is largely split between Seagate and HGST disks. HGST’s parent company, Western Digital, is almost absent, not because its disks are bad, but because they came out as consistently more expensive than those from Seagate and HGST.

[Chart: drive failure rates by manufacturer]

Newer Seagate disks also show more encouraging results. Although still young, at an average age of just 0.9 years, the 4TB HDD.15 models show a reasonably low 2.6 percent failure rate. Coupled with their low price—Backblaze says that they tend to undercut HGST’s disks—they’ve become the company’s preferred hard drive model.

As before, this doesn’t mean that anyone with a Seagate disk is at risk of an imminent hard disk failure (though you should always have backups!). Backblaze operates disks outside of the manufacturer’s specified parameters. Significantly, most consumer-grade disks aren’t intended to be heavily used 24/7; they’re meant to be operational for about 8 hours a day and replaced every 3 to 5 years. Most home usage environments are likely to be lower in vibration than Backblaze’s 45-disk storage pods, too. In more normal conditions, the Seagates are likely to fare much better.

(via Arstechnica.com)

StackExchange’s Performance Dashboard

StackExchange created a very cool performance dashboard that looks to be updated from real system metrics. Wouldn’t it be fascinating if every site had a similar dashboard?

The dashboard shows figures like 560 million page views per month, 260,000 sustained connections, 34 TB of data transferred per month, and 9 web servers with 48GB of RAM handling 185 req/s at 15% CPU usage. There are 4 SQL servers, 2 redis servers, 3 tag engine servers, 3 elasticsearch servers, and 2 HAProxy servers, along with stats on each.

There’s also an excellent discussion thread on reddit that goes into more interesting details, with questions being answered by folks from StackExchange.

StackExchange is still doing innovative work and is very much an example worth learning from. They’ve always danced to their own tune and it’s a catchy tune at that. More at StackOverflow Update: 560M Pageviews A Month, 25 Servers, And It’s All About Performance.

(via HighScalability.com)

StackOverflow Update: 560M Pageviews A Month, 25 Servers, And It’s All About Performance

The folks at Stack Overflow remain incredibly open about what they are doing and why. So it’s time for another update. What has Stack Overflow been up to?

The network of sites that make up StackExchange, which includes StackOverflow, is now ranked 54th for traffic in the world; they have 110 sites and are growing at a rate of 3 or 4 a month; 4 million users; 40 million answers; and 560 million pageviews a month.

This is with just 25 servers. For everything. That’s high availability, load balancing, caching, databases, searching, and utility functions. All with a relative handful of employees. Now that’s quality engineering.

This update is based on The architecture of StackOverflow (video) by Marco Cecconi and What it takes to run Stack Overflow (post) by Nick Craver. In addition, I’ve merged in comments from various sources. No doubt some of the details are out of date as I meant to write this article long ago, but it should still be representative.

Stack Overflow still uses Microsoft products. Microsoft infrastructure works and is cheap enough, so there’s no compelling reason to change. Yet SO is pragmatic. They use Linux where it makes sense. There’s no purity push to make everything Linux or keep everything Microsoft. That wouldn’t be efficient.

Stack Overflow still uses a scale-up strategy. No clouds in sight. With their SQL Servers loaded with 384 GB of RAM and 2TB of SSD, AWS would cost a fortune. The cloud would also slow them down, making it harder to optimize and troubleshoot system issues. Plus, SO doesn’t need a horizontal scaling strategy. Large peak loads, where scaling out makes sense, haven’t been a problem because they’ve been quite successful at sizing their system correctly.

So Jeff Atwood’s quote, “Hardware is Cheap, Programmers are Expensive”, still seems to be living lore at the company.

Marco Cecconi says in his talk that when discussing architecture you need to answer this question first: what kind of problem is being solved?

First the easy part. What does StackExchange do? It takes topics, creates communities around them, and creates awesome question and answer sites.

The second part relates to scale. As we’ll see next StackExchange is growing quite fast and handles a lot of traffic. How does it do that? Let’s take a look and see….

Stats

  • StackExchange network has 110 sites growing at a rate of 3 or 4 a month.

  • 4 million users

  • 8 million questions

  • 40 million answers

  • As a network #54 site for traffic in the world

  • 100% year over year growth

  • 560 million pageviews a month

  • Peak is more like 2600-3000 requests/sec on most weekdays. Programming, being a profession, means weekdays are significantly busier than weekends.

  • 25 servers

  • 2 TB of SQL data all stored on SSDs

  • Each web server has 2x 320GB SSDs in a RAID 1.

  • Each ElasticSearch box has 300 GB also using SSDs.

  • Stack Overflow has a 40:60 read-write ratio.

  • DB servers average 10% CPU utilization

  • 11 web servers, using IIS

  • 2 load balancers, 1 active, using HAProxy

  • 4 active database nodes, using MS SQL

  • 3 application servers implementing the tag engine, anything searching by tag hits

  • 3 machines doing search with ElasticSearch

  • 2 machines for distributed cache and messaging using Redis

  • 2 Networks (each a Nexus 5596 + Fabric Extenders)

  • 2 Cisco 5525-X ASAs (think Firewall)

  • 2 Cisco 3945 Routers

  • 2 read-only SQL Servers used mainly for the Stack Exchange API

  • VMs also perform functions like deployments, domain controllers, monitoring, ops database for sysadmin goodies, etc.

Platform

  • ElasticSearch

  • Redis

  • HAProxy

  • MS SQL

  • Opserver

  • TeamCity

  • Jil – Fast .NET JSON Serializer, built on Sigil

  • Dapper – a micro ORM.

UI

  • The UI has a message inbox that is sent a message when you get a new badge, receive a message, have a significant event, etc. Done using WebSockets and powered by redis (a pub/sub sketch follows this list).

  • Search box is powered by ElasticSearch using a REST interface.

  • With so many questions on SO it was impossible to just show the newest questions; they would change too fast, with a question arriving every second. They developed an algorithm to look at your pattern of behaviour and show you which questions you would have the most interest in. It uses complicated queries based on tags, which is why a specialized Tag Engine was developed.

  • Server side templating is used to generate pages.
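A sketch of how a redis-backed inbox push can work. Stack Exchange's stack is .NET; the redis-py client and channel naming below are assumptions purely for illustration.

```python
# Web tier publishes inbox events; a WebSocket handler relays them.
import redis

r = redis.Redis()

def notify(user_id, payload):
    r.publish("inbox:%d" % user_id, payload)   # fired on new badge/message/etc.

def listen(user_id):
    p = r.pubsub()
    p.subscribe("inbox:%d" % user_id)
    for message in p.listen():                 # blocking loop per connection
        if message["type"] == "message":
            yield message["data"]              # hand the event to the WebSocket
```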

Servers

  • The 25 servers are not doing much, that is, the CPU load is low. It’s calculated that SO could run on only 5 servers.

  • The database server is at 10% utilization, except when it bursts while performing backups.

  • How so low? The database servers have 384GB of RAM and the web servers are at 10%-15% CPU usage.

  • Scale-up is still working. Other scale-out sites with a similar number of pageviews tend to run on 100, 200, up to 300 servers.

  • Simple system. Built on .Net. There are only 9 projects, where other systems have 100s. The reason to have so few projects is so that compilation is lightning fast, which requires planning at the beginning. Compilation takes 10 seconds on a single computer.

  • 110K lines of code. A small number given what it does.

  • This minimalist approach comes with some problems. One problem is not many tests. Tests aren’t needed because there’s a great community. Meta.stackoverflow is a discussion site for the community and where bugs are reported. Meta.stackoverflow is also a beta site for new software. If users find any problems with it they report the bugs that they’ve found, sometimes with solution/patches.

  • Windows 2012 is used in New York, with an upgrade to 2012 R2 underway (Oregon is already on it). For Linux systems it’s CentOS 6.4.

  • Load is really spread over just 9 servers, because servers 10 and 11 are only for meta.stackexchange.com, meta.stackoverflow.com, and the development tier. Those servers also run at around 10-20% CPU, which means there is quite a bit of headroom available.

SSDs

  • Intel 330 as the default (web tier, etc.)

  • Intel 520 for mid tier writes like Elastic Search

  • Intel 710 & S3700 for the database tier. S3700 is simply the successor to the high endurance 710 series.

  • Exclusively RAID 1 or RAID 10 (10 being any array with 4+ drives). Failures have not been a problem: even with hundreds of Intel 2.5″ SSDs in production, not a single one has failed yet. One or more spares are kept for each model, but multiple drive failure hasn’t been a concern.

  • ElasticSearch performs much better on SSDs, given SO writes/re-indexes very frequently.

  • SSD changes the use of search. Lucene.net couldn’t handle SO’s concurrent workloads due to locking issues, so they moved to ElasticSearch. It turns out locks around the binary readers really aren’t necessary in an all SSD environment.

  • The only scale-up problem so far is SSD space on the SQL boxes, due to the growth pattern of reliability vs. space in the non-consumer segment, that is, drives that have capacitors for power-loss protection and such.

High Availability

  • The main datacenter is in New York and the backup datacenter is in Oregon.

  • Redis has 2 slaves, SQL has 2 replicas, the tag engine has 3 nodes, elastic has 3 nodes – every other service has high availability as well (and exists in both data centers).

  • Not everything is replicated between data centers (very temporary cache data isn't worth the bandwidth to sync, etc.), but the big items are, so there is still a shared cache in case of a hard outage in the active data center. Starting without a cache is possible, but it isn't very graceful.

  • Nginx was used for SSL, but a transition has been made to using HAProxy to terminate SSL.

  • Total HTTP traffic sent is only about 77% of the total traffic sent. This is because replication is happening to the secondary data center in Oregon as well as other VPN traffic. The majority of this traffic is the data replication to SQL replicas and redis slaves in Oregon.

Databasing

  • MS SQL Server.

  • Stack Exchange has one database per site, so Stack Overflow gets one, Super User gets one, Server Fault gets one, and so on. The schema for these is the same. This approach of having separate databases is effectively a form of partitioning and horizontal scaling.

  • In the primary data center (New York) there is usually 1 master and 1 read-only replica in each cluster. There's also 1 read-only replica (async) in the DR data center (Oregon). When running in Oregon, the primary is there and both of the New York replicas are read-only and async.

  • There are a few wrinkles. There is one “network wide” database which has things like login credentials and aggregated data (mostly exposed through stackexchange.com user profiles or APIs).

  • Careers Stack Overflow, stackexchange.com, and Area 51 all have their own unique database schema.

  • All schema changes are applied to all site databases at the same time. They need to be backwards compatible, so, for example, if you need to rename a column – a worst-case scenario – it's a multi-step process: add a new column, add code which works with both columns, backfill the new column, change the code so it works with the new column only, remove the old column.

  • Partitioning is not required. Indexing takes care of everything and the data just is not large enough. If something warrants a filtered index, why not make it way more efficient? Indexing only on DeletionDate = NULL and such is a common pattern; others are filters on specific FK type values from enums.

  • Votes are in 1 table per item, for example 1 table for post votes and 1 table for comment votes. Most pages are rendered in real time, with caching only for anonymous users. Given that, there's no cache to update; it's just a re-query.

  • Scores are denormalized, so querying is often needed. It's all IDs and dates; the post votes table has just 56,454,478 rows currently. Most queries are just a few milliseconds thanks to indexing.

  • The Tag Engine is entirely self-contained, which means not having to depend on an external service for very, very core functionality. It's a huge in-memory struct array optimized for SO use cases, with precomputed results for heavily hit combinations. It's a simple Windows service running on a few boxes working as a redundant team. CPU is almost always around 2-5%. Three boxes are not needed for load, just for redundancy. If they all fail at once, the local web servers will load the tag engine into memory and keep on going.

  • On Dapper's lack of compiler-checked queries compared to a traditional ORM: the compiler is only checking against what you told it the database looks like, which can help with lots of things but still has the fundamental disconnect problem you'll hit at runtime. A huge problem with that tradeoff is that the generated SQL is nasty, and finding the original code it came from is often non-trivial. The lack of the ability to hint queries, control parameterization, etc. is also a big issue when trying to optimize queries. For example, literal replacement was added to Dapper to help with query parameterization, which allows the use of things like filtered indexes. The SQL calls to Dapper are also intercepted to add a comment recording exactly where each query came from, which saves so much time tracking things down (see the sketch below).
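
As an illustration of those two points, here is a hedged sketch of call-site tagging plus a filtered-index-friendly query with Dapper; the Post type, table, and Tag helper are illustrative, not SO's actual library code:

```csharp
using System.Collections.Generic;
using System.Data.SqlClient;
using System.Runtime.CompilerServices;
using Dapper;

public class Post { public int Id { get; set; } public string Title { get; set; } }

public static class PostStore
{
    // Prepending a comment with the calling member makes a query trivial to
    // trace from a SQL profiler back to the original code.
    static string Tag(string sql, [CallerMemberName] string caller = null) =>
        $"/* {caller} */ {sql}";

    public static List<Post> GetRecentPosts(SqlConnection conn) =>
        // "DeletionDate IS NULL" matches the filtered-index pattern above, so
        // SQL Server can use the narrow filtered index. (Dapper's literal
        // replacement similarly inlines constants so such indexes stay usable.)
        conn.Query<Post>(Tag(
            "SELECT TOP 50 Id, Title FROM Posts " +
            "WHERE DeletionDate IS NULL ORDER BY Id DESC"
        )).AsList();
}
```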


Coding

  • The process:

    • Most programmers work remotely. Programmers code in their own batcave.

    • Compilation is very fast.

    • Then the few tests that they have are run.

    • Once compiled, code is moved to a development staging server.

    • New features are hidden via feature switches (a minimal sketch appears at the end of this section).

    • Runs on same hardware as the rest of the sites.

    • It's then moved to Meta.stackoverflow for testing. 1000 users per day use the site, so it's a good test.

    • If it passes it goes live on the network and is tested by the larger community.

  • Heavy usage of static classes and methods, for simplicity and better performance.

  • Code is simple because the complicated bits are packaged into libraries that are open sourced and maintained. The number of .Net projects stays low because community-shared parts of the code are used.

  • Developers get two or three monitors. Screens are important, they help you be productive.
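
The feature-switch idea from the process list above, as a minimal sketch; the in-memory flag store here is purely illustrative (SO's actual mechanism isn't described beyond the fact that switches exist):

```csharp
using System.Collections.Concurrent;

public static class Features
{
    // A thread-safe flag store; a real one might load from a settings table.
    static readonly ConcurrentDictionary<string, bool> Flags =
        new ConcurrentDictionary<string, bool>();

    public static bool Enabled(string name) =>
        Flags.TryGetValue(name, out var on) && on;

    public static void Set(string name, bool on) => Flags[name] = on;
}

// Usage: ship the code dark, flip the flag when ready.
// if (Features.Enabled("new-top-bar")) RenderNewTopBar(); else RenderOldTopBar();
```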

Caching

  • Cache all the things.

  • 5 levels of caches.

  • 1st: the network-level cache: caching in the browser, CDN, and proxies.

  • 2nd: HttpRuntime.Cache, given for free by the .Net framework. An in-memory, per-server cache.

  • 3rd: Redis. A distributed in-memory key-value store. Cache elements are shared across the different servers that serve the same site. If StackOverflow has 9 servers, all of them will be able to find the same cached items.

  • 4th: SQL Server Cache. The entire database is cached in-memory. The entire thing.

  • 5th: SSD. Usually only hit when the SQL server cache is warming up.

  • For example, every help page is cached. The code to access a page is very terse (a read-through sketch follows this list):

    • Static methods and static classes are used. Really bad from an OOP perspective, but really fast and really friendly towards terse code. All code is directly addressed.

    • Caching is handled by a library layer on top of Redis and Dapper, a micro ORM.

  • To get around garbage collection problems, only one copy of a class used in templates is created and kept in a cache. Everything is measured, including GC operations; from the statistics it is known that layers of indirection increase GC pressure to the point of noticeable slowness.

  • CDN hits vary; since the query string hash is based on file content, assets are only re-fetched on a build. It's typically 30-50 million hits a day for 300 to 600 GB of bandwidth.

  • A CDN is not used for CPU or I/O load, but to help users find answers faster.
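
A sketch of a read-through across cache levels 2-4, using HttpRuntime.Cache, StackExchange.Redis, and Dapper. The HelpPages table, key names, and expirations are assumptions for illustration, not SO's actual library:

```csharp
using System;
using System.Data.SqlClient;
using System.Web;
using System.Web.Caching;
using Dapper;
using StackExchange.Redis;

public static class HelpPageCache
{
    static readonly ConnectionMultiplexer Redis =
        ConnectionMultiplexer.Connect("localhost:6379");

    public static string GetHelpPage(string slug, SqlConnection conn)
    {
        string key = "help:" + slug;

        // Level 2: in-memory, per-server cache (free with the .Net framework).
        if (HttpRuntime.Cache[key] is string local) return local;

        // Level 3: Redis, shared by every server behind the same site.
        IDatabase redis = Redis.GetDatabase();
        string page = redis.StringGet(key);

        if (page == null)
        {
            // Levels 4/5: SQL Server, whose pages live in RAM, falling back to SSD.
            page = conn.QueryFirstOrDefault<string>(
                "SELECT Body FROM HelpPages WHERE Slug = @slug", new { slug });
            if (page != null)
                redis.StringSet(key, page, TimeSpan.FromMinutes(10));
        }

        if (page != null)
            HttpRuntime.Cache.Insert(key, page, null,
                DateTime.UtcNow.AddMinutes(1), Cache.NoSlidingExpiration);
        return page;
    }
}
```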

Deploying

  • Want to deploy 5 times a day. Don't build grand, gigantic things and then put them live. Important because:

    • Can measure performance directly.

    • Forced to build the smallest thing that can possibly work.

  • TeamCity builds, then copies to each web tier server via a PowerShell script (a sketch of this loop follows the list). The steps for each server are:

    • Tell HAProxy to take the server out of rotation via a POST

    • Delay to let IIS finish current requests (~5 sec)

    • Stop the website (via the same PSSession for all the following)

    • Robocopy files

    • Start the website

    • Re-enable in HAProxy via another POST

  • Almost everything is deployed via puppet or DSC, so upgrading usually consists of just nuking the RAID array and installing from a PXE boot. It's very fast and you know it's done right/repeatable.
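
A sketch of that per-server loop. SO drives it from PowerShell, so the C# below is only a stand-in that mirrors the sequence of steps; the HAProxy admin URL, form fields, and share paths are hypothetical:

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Net.Http;
using System.Threading.Tasks;

public static class RollingDeploy
{
    static readonly HttpClient Http = new HttpClient();

    public static async Task DeployAsync(string[] webServers, string buildDir)
    {
        foreach (var server in webServers)
        {
            // 1. POST to HAProxy to take the server out of rotation.
            await SetBackendState(server, "disable");

            // 2. Give IIS ~5 seconds to finish in-flight requests.
            await Task.Delay(TimeSpan.FromSeconds(5));

            // 3-5. Stop the site, copy the new build, start the site.
            //      (SO runs stop/start over one PSSession; only robocopy shown.)
            Run("robocopy", $"\"{buildDir}\" \\\\{server}\\websites\\so /MIR");

            // 6. POST again to put the server back in rotation.
            await SetBackendState(server, "enable");
        }
    }

    static Task SetBackendState(string server, string action) =>
        Http.PostAsync("http://haproxy.internal/admin",   // hypothetical endpoint
            new FormUrlEncodedContent(new Dictionary<string, string>
            {
                ["s"] = server,
                ["action"] = action,
            }));

    static void Run(string exe, string args)
    {
        using (var p = Process.Start(new ProcessStartInfo(exe, args)
                   { UseShellExecute = false }))
            p.WaitForExit();
    }
}
```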

Teaming

  • Teams:

    • SRE (System Reliability Engineering): 5 people

    • Core Dev (Q&A site): ~6-7 people

    • Core Dev Mobile: 6 people

    • Careers team that does development solely for the SO Careers product: 7 people

  • Devops and developer teams are really close-knit.

  • There’s a lot of movement between teams.

  • Most employees work remotely.

  • Offices are mostly sales, Denver and London exclusively so.

  • All else equal, it is slightly preferred to have people in NYC, because the in-person time is a plus for the casual interaction that happens in between “getting things done”. But the setup makes it possible to do real work remotely, and official team collaboration works almost entirely online.

  • They’ve learned that the in-person benefit is more than outweighed by how much you get from being able to hire the best talent that loves the product anywhere, not just the ones willing to live in the city you happen to be in.

  • The most common reason for someone going remote is starting a family. New York’s great, but spacious it is not.

  • Offices are in Manhattan and a lot of talent is there. The data center needs to not be a crazy distance away since it is always being improved. There’s also a slightly faster connection to many backbones in the NYC location – though we’re talking only a few milliseconds (if that) of difference there.

  • Making an awesome team: Love geeks. Early Microsoft, for example, was full of geeks and they conquered the world.

  • They hire from the Stack Overflow community. They look for a passion for coding, a passion for helping others, and a passion for communicating.

Budgeting

  • Budgets are pretty much project based. Money is only spent as infrastructure is added for new projects. The web servers that have such low utilization are the same ones purchased 3 years ago when the data center was built.

Testing

  • Move fast and break things. Push it live.

  • Major changes are tested by pushing them. Development has an equally powerful SQL server and it runs on the same web tier, so performance testing isn’t so bad.

  • Very few tests. Stack Overflow doesn’t use many unit tests because of their active community and heavy usage of static code.

  • Infrastructure changes. There are 2 of everything, so there's a backup with the old configuration whenever possible, with a quick failback mechanism. For example, keepalived does failback quickly between load balancers.

  • Redundant systems fail over pretty often just to do regular maintenance. SQL backups are tested by having a dedicated server just for restoring them, constantly (that’s a free license – do it). Plan to start full data center failovers every 2 months or so – the secondary data center is read-only at all other times.

  • Unit tests, integration tests and UI tests run on every push. All the tests must succeed before a production build run is even possible. So there are some mixed messages going on about testing.

  • The things that obviously should have tests have tests. That means most of the things that touch money on the Careers product, and easily unit-testable features on the Core end (things with known inputs, e.g. flagging, our new top bar, etc), for most other things we just do a functionality test by hand and push it to our incubating site (formerly meta.stackoverflow, now meta.stackexchange).

Monitoring / Logging

  • Now considering using http://logstash.net/ for log management. Currently a dedicated service inserts the syslog UDP traffic into a SQL database (a minimal ingest sketch follows this list). Web pages add headers with the timings on the way out, which are captured by HAProxy and included in the syslog traffic.

  • Opserver and Realog are how many of the metrics are surfaced. Realog is a logging display system built by Kyle Brandt and Matt Jibson in Go.

  • Logging is from the HAProxy load balancer via syslog instead of via IIS. This is a lot more versatile than IIS logs.
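
A minimal sketch of that pipeline, assuming table and column names; a real service would batch inserts and parse the HAProxy timing fields rather than storing raw lines:

```csharp
using System.Data.SqlClient;
using System.Net;
using System.Net.Sockets;
using System.Text;
using Dapper;

public static class SyslogIngest
{
    public static void Run(string connectionString)
    {
        using (var udp = new UdpClient(514))                // standard syslog port
        using (var conn = new SqlConnection(connectionString))
        {
            conn.Open();
            var remote = new IPEndPoint(IPAddress.Any, 0);
            while (true)
            {
                byte[] datagram = udp.Receive(ref remote);  // blocks for next message
                string line = Encoding.UTF8.GetString(datagram);
                conn.Execute(
                    "INSERT INTO HAProxyLogs (Source, RawLine) VALUES (@src, @line)",
                    new { src = remote.Address.ToString(), line });
            }
        }
    }
}
```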

Clouding

  • Hardware is cheaper than developers and efficient code. You are only as fast as your slowest bottleneck, and all the current cloud solutions have fundamental performance or capacity limits.

  • Could you build SO well if building for the cloud from day one? Most likely. Could you consistently render all your pages, performing several up-to-date queries and cache fetches across a cloud network you don't control, with sub-50ms render times? That's another matter. Unless you're talking about substantially higher cost (at least 3-4x), the answer is no – it's still more economical for SO to host on their own servers.

Performance As A Feature

  • StackOverflow puts a heavy emphasis on performance. The goal for the main page is to load in less than 50ms, but it can be as low as 28ms.

  • Programmers are fanatic about reducing page load times and improving the user experience.

  • Timings for every single request to the network are recorded (a timing sketch follows this list). With these kinds of metrics you can make decisions on where to improve your system.

  • The primary reason their servers run at such low utilization is efficient code. Web servers average 5-15% CPU, 15.5 GB of RAM used, and 20-40 Mb/s of network traffic. The SQL servers average around 5-10% CPU, 365 GB of RAM used, and 100-200 Mb/s of network traffic. This has three major benefits: general room to grow before an upgrade is necessary; headroom to stay online when things go crazy (bad query, bad code, attacks, whatever it may be); and the ability to clock back on power if needed.
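
A sketch of recording a timing for every request, as an ASP.NET IHttpModule that stamps the elapsed server time into a response header for HAProxy to capture into its syslog stream; the header name is an assumption:

```csharp
using System.Diagnostics;
using System.Web;

public class RequestTimingModule : IHttpModule
{
    public void Init(HttpApplication app)
    {
        // Start a stopwatch as each request enters the pipeline.
        app.BeginRequest += (s, e) =>
            app.Context.Items["sw"] = Stopwatch.StartNew();

        // Emit the elapsed time just before headers go out the door.
        app.PreSendRequestHeaders += (s, e) =>
        {
            if (app.Context.Items["sw"] is Stopwatch sw)
                app.Context.Response.AddHeader(
                    "X-Server-Ms", sw.ElapsedMilliseconds.ToString());
        };
    }

    public void Dispose() { }
}
```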

Lessons Learned

  • Why use Redis if you use MS products? gabeech: It's not about OS evangelism. We run things on the platform they run best on. Period. C# runs best on a Windows machine, so we use IIS. Redis runs best on a *nix machine, so we use *nix.

  • Overkill as a strategy. Nick Craver on why their network is over-provisioned: Is 20 Gb massive overkill? You bet your ass it is, the active SQL servers average around 100-200 Mb out of that 20 Gb pipe. However, things like backups, rebuilds, etc. can completely saturate it due to how much memory and SSD storage is present, so it does serve a purpose.

  • SSDs Rock. The database nodes all use SSD and the average write time is 0 milliseconds.

  • Know your read/write workload.

  • Keeping things very efficient means new machines are not needed often. Only when a new project comes along that needs different hardware for some reason is new hardware added. Typically memory is added, but other than that efficient code and low utilization means it doesn’t need replacing. So typically talking about adding a) SSDs for more space, or b) new hardware for new projects.

  • Don’t be afraid to specialize. SO uses complicated queries based on tags, which is why a specialized Tag Engine was developed.

  • Do only what needs to be done. Tests weren't necessary because an active community did the acceptance testing for them. Add projects only when required. Add a line of code only when necessary. You Ain't Gonna Need It really works.

  • Reinvention is OK. Typical advice is don’t reinvent the wheel, you’ll just make it worse, by making it square, for example. At SO they don’t worry about making a “Square Wheel”. If developers can write something more lightweight than an already developed alternative, then go for it.

  • Go down to the bare metal. Go into the IL (the assembly language of .Net). Some coding is in IL, not C#. Look at SQL query plans. Take memory dumps of the web servers to see what is actually going on. For example, it was discovered that a split call generated 2GB of garbage (an allocation sketch follows this list).

  • No bureaucracy. There’s always some tools your team needs. For example, an editor, the most recent version of Visual Studio, etc. Just make it happen without a lot of process getting in the way.

  • Garbage collection driven programming. SO goes to great lengths to reduce garbage collection costs, skipping practices like TDD, avoiding layers of abstraction, and using static methods. While extreme, the result is highly performing code. When you’re doing hundreds of millions of objects in a short window, you can actually measure pauses in the app domain while GC runs. These have a pretty decent impact on request performance.

  • The cost of inefficient code can be higher than you think. Efficient code stretches hardware further, reduces power usage, and makes code easier for programmers to understand.
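
To illustrate the "split call generated 2GB of garbage" discovery: string.Split allocates an array plus one string per token on every call, while scanning the characters allocates nothing. A small modern-.NET sketch, not SO's code (spans postdate this era; SO's fix would have been a hand-rolled scan in the same spirit):

```csharp
using System;

public static class TagParser
{
    // Allocating version: one string[] plus one string per tag, every call.
    // At hundreds of millions of calls this becomes measurable GC pressure.
    public static int CountTagsWithSplit(string tags) =>
        tags.Split('|').Length;

    // Allocation-free version: walk the characters and count separators.
    public static int CountTagsWithSpan(string tags)
    {
        ReadOnlySpan<char> span = tags.AsSpan();
        if (span.IsEmpty) return 0;
        int count = 1;
        foreach (char c in span)
            if (c == '|') count++;
        return count;
    }
}
```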

( via HighScalability.com )

The Stunning Scale Of AWS And What It Means For The Future Of The Cloud

James Hamilton, VP and Distinguished Engineer at Amazon, and long time blogger of interesting stuff, gave an enthusiastic talk at AWS re:Invent 2014 on AWS Innovation at Scale. He’s clearly proud of the work they are doing and it shows.

James shared a few eye popping stats about AWS:

  • 1 million active customers
  • All 14 other cloud providers combined have 1/5th the aggregate capacity of AWS (estimate by Gartner)
  • 449 new services and major features released in 2014
  • Every day, AWS adds enough new server capacity to support all of Amazon’s global infrastructure when it was a $7B annual revenue enterprise (in 2004).
  • S3 has 132% year-over-year growth in data transfer
  • 102Tbps network capacity into a datacenter.

The major theme of the talk is the cloud is a different world. It’s a special environment that allows AWS to do great things at scale, things you can’t do, which is why the transition from on premise x86 servers to the public cloud is happening at a blistering pace. With so many scale driven benefits to the public cloud, it’s a transition that can’t be stopped. The cloud will keep getting more reliable, more functional, and cheaper at a rate that you can’t begin to match with your limited resources, generalist gear, bloated software stacks, slow supply chains, and outdated innovation paradigms.

That’s the PR message at least. But one thing you can say about Amazon is they are living it. They are making it real. So a healthy dose of doubt is appropriate, but extrapolating out the lines of fate would also be wise.

One of the fickle finger of fate advantages AWS has is resources. At one million customers they have the scale to keep the engine of expansion and improvement going. Profits aren’t being taken out, money is being reinvested. This is perhaps the most important advantage of scale.

But money without smarts is simply waste. Amazon wants you to know they have the smarts. We’ve heard how Google and Facebook build their own gear; Amazon does too. They build their own networking gear, networking software, and racks, and they work with Intel to get faster versions of processors than are available on the market. The key is they know everything and control everything about their environment, so they can build simpler gear that does exactly what they want, which turns out to be cheaper and more reliable in the end.

Complete control allows quality metrics to be built into everything. Metrics drive a constant quality increase in all parts of the system, which is why against all odds AWS is getting more reliable as the pace of innovation quickens. Great pools of actionable data turned into knowledge is another huge advantage of scale.

Another thing AWS can do that you can’t is the Availability Zone architecture itself. Each AZ is its own datacenter and AZs within a region are located very close together. This reduces messaging latencies, which means state can be synchronously replicated between AZs, which greatly improves availability compared to the typical approach where redundant datacenters are very far apart.

It’s a talk rich with information and…well, spunk. The real meta-theme of the talk is how Amazon consciously uses scale to their competitive advantage. For Amazon scale isn’t just an expense to be dealt with, scale is a resource to exploit, if you know how.

Here’s my gloss of James Hamilton’s incredible talk…

Everything In The Talk Has A Foundation In Scale

  • Every day, AWS adds enough new server capacity to support all of Amazon’s global infrastructure when it was a $7B annual revenue enterprise (in 2004).

  • 365 days a year component manufacturers have to get gear to the server and storage manufacturers, the server and storage manufacturers have to produce the gear and push it into the logistics channel, it has to get from the logistics channel to the correct datacenter, it has to arrive at the loading dock, people have to be there to wheel the racks to the right place in the DC, there has to be power, cooling, and networking, the app stack has to be loaded up, it has to be tested, and it has to be released to customers.

  • S3 usage: 132% year-over-year growth in data transfer; EC2 usage: 99% YoY usage growth; AWS overall business: over 1 million active customers.

  • All 14 other cloud providers combined have 1/5th the aggregate capacity of AWS (estimate by Gartner)

  • With over a million customers, you are in a rich ecosystem: you have your pick of software vendors, if you have a problem someone has likely had it before, and it's easier to get your job done fast.

  • Such high growth means Amazon has the resources to keep reinvesting and innovating by increasing breadth and depth of services they offer.

  • Big transitions generally occur when the economics are far superior, like mainframes to UNIX servers and then UNIX servers to x86 servers. These transitions usually take 10+ years. What's different about the on-premise x86 transition to the cloud is the speed at which it is happening. The speed of the cloud transition is a function of great economic value along with low friction for adoption. You don't need to buy software or hardware; you can just do it.

There Are Big Cost Problems In Networking

  • Networking is a red alert situation across the industry. It’s the perfect storm.

  • Problem #1: The cost of networking is escalating relative to the cost of all other equipment. It's anti-Moore's law. All other gear is going down in cost; networking is getting relatively more expensive over time. Relative monthly costs: servers: 57%; networking equipment: 8%; power distribution and cooling: 18%; power: 13%; other: 4%.

  • Problem #2: At the same time networking is getting more expensive, the ratio of networking to compute is going up. That’s partly because Moore’s law is working (still) with servers and compute density is going up. Partly it’s because as the cost of compute falls the amount of advanced analytics performed goes up and analytics are network intensive. Solving big problems using a large number of servers requires a lot of networking. Network traffic has moved east-west rather than the traditional north-south direction.

  • Amazon's solution 5 years ago was data driven and radical: they built their own networking designs. Special routers were built. A team was hired to build the protocol stack all the way to the top. And they deployed all of this themselves in their network. All services worldwide run on this gear.

    • This strategy turned out to be a lot cheaper. Just the support contract for networking gear was running 10s of millions of dollars.

    • Availability went up. The source of the improvement was simplicity. The problem AWS was trying to solve was simpler than the problem enterprise gear tries to solve. Enterprise gear must adhere to a lot of complicated specs that go unused and only make the system more fragile. Implementing just the functionality that was required meant a much simpler system, which led to higher availability. Any way to win is a good way to win.

    • A cornucopia of metrics. They measure everything. The rule is if a customer has a bad experience using their system their metrics must show it. This means metrics are getting better all the time because customer problems drive better metrics. Once you have metrics that accurately reflect customer experience then goals can be set on making the system better. Every week improvements are made to make things better. If the code didn’t start off better, it gets better over time.

    • Testability. Their gear was better because they tested it better. Enterprise gear is hard to test at scale. They created a $40 million test bed of 8000 servers (3 megawatts). But since this was the cloud they effectively rented the servers for a few months, so it was relatively cheap.

Networking Explained Layer By Layer, From The Very Top To The Network Interface Card

AWS Worldwide Network Backbone

  • 11 AWS regions worldwide. Choose which ones to use by nearness to customers or required jurisdictional boundaries.

  • Private fiber links interconnect most of the major regions. This avoids all the capacity planning problems (Amazon can do better capacity planning), peering issues, and buffering problems that occur on public links. So it’s faster to run their own network, it’s more reliable, cheaper, and lower latency.

Example AWS Region (US East (Northern Virginia))

  • All regions have at least two availability zones. US East has five AZs.

  • Redundant paths run to transit centers.

  • Each region has redundant transit centers. A transit center connects private links to other AWS regions, private links to AWS Direct Connect customers, and to the internet through peering and paid transit.

  • If one AZ fails all the other AZs keep working.

  • Metro-area DWDM links between AZs

  • 82,864 fiber strands in a region

  • AZs are less than 2ms apart and usually less than 1ms apart. From a latency perspective they are fairly close, within a few kilometers. Far enough apart for safety, close enough for good latency.

  • 25Tbps peak traffic between AZs

  • AWS offers AZs because:

    • With a single hardened datacenter the best reliability you'll get is around 99.9% over a mix of applications over a large period of time. High reliability requires running in two datacenters. Traditionally, datacenter diversity means two datacenters that are very far apart, because it's not cost effective to keep datacenters close together. That means longer latencies: LA to New York is 74ms round trip, while committing to an SSD is 1 to 2ms. You can't wait 70+ milliseconds for a transaction to commit, which means applications commit locally and then replicate to the second datacenter. In a failure case this design loses data during the failover. While a true failure is rare, like a building burning down, transient failures are more common, like a load balancer problem. So would you fail over if your connection was down for 3 minutes? No, because data would be lost and it would take a long time to recover that data from other sources. So you lose availability for common events.

    • AZs are milliseconds apart, so you can commit to both at the same time (a sketch follows below). That means if you fail over, a customer won't be able to tell, because the data replication was synchronous. It's invisible. It's hard to write code for this model, so you won't do it for everything, and for some apps a concern about multi-AZ failures might also keep you from using multiple AZs; but for the rest, this is a very powerful model. It's more costly, but it gives AWS certain advantages.
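
A sketch of the difference this makes, with a hypothetical IDatacenter interface: at ~1ms between AZs a write can wait for both buildings to acknowledge before returning, which is exactly what a 74ms-apart pair cannot afford per transaction:

```csharp
using System.Threading.Tasks;

public interface IDatacenter
{
    Task CommitAsync(byte[] record);   // ~1-2ms SSD commit + ~1ms inter-AZ hop
}

public class SynchronousReplicator
{
    readonly IDatacenter _primaryAz, _secondaryAz;

    public SynchronousReplicator(IDatacenter primary, IDatacenter secondary)
    {
        _primaryAz = primary;
        _secondaryAz = secondary;
    }

    // The write is durable in two buildings before the caller sees success,
    // so a failover loses nothing. Over a 74ms LA-NY link you could not wait
    // like this per transaction, which is why far-apart DCs replicate
    // asynchronously and lose data on failover.
    public Task CommitAsync(byte[] record) =>
        Task.WhenAll(_primaryAz.CommitAsync(record),
                     _secondaryAz.CommitAsync(record));
}
```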

Example AWS Availability Zone

  • An Availability Zone is always a datacenter in a completely independent building.

  • Amazon has 28+ datacenters. The plus means additional datacenters are added within an AZ as a way of extending its capacity. Otherwise you would be forced to split your app across AZs, even if you didn't want to.

  • Some AZs have fairly substantial datacenters.

  • DCs in an AZ are less than ¼ms apart.

Example Datacenter

  • AWS datacenters are purposely not gigantic. A single datacenter is 25 – 30 megawatts, with between 50,000 – 80,000 servers

  • The return on datacenter largeness diminishes. The advantage of datacenter scale as you build bigger and bigger goes down. Early advantages are huge. Later advantages are small. Going from 2000 to 2500 racks is a little better. A tiny datacenter is too expensive. A really large datacenter is only marginally more expensive per rack than a medium datacenter.

  • Risk increases with larger datacenters. The blast radius if something goes wrong grows; if the datacenter is destroyed, the loss is too big.

  • 102Tbps network capacity into a datacenter.

Example Rack, Server & NIC

  • The only thing that matters for latency is the software latency at either end of a connection. Latency between two servers when sending a message:

    • Your app -> guest OS -> hypervisor -> NIC : the latency is milliseconds

    • through the NIC : the latency is microseconds

    • over the fiber : the latency is nanoseconds

  • SR-IOV (Single Root I/O Virtualization) allows a NIC to provide virtualized network cards in hardware. Each guest gets its own network card. The benefit is a > 2x average latency reduction and a > 10x latency jitter improvement, meaning outliers are down to 1/10th of what they were. SR-IOV is being deployed now on newer instance types and will eventually be everywhere. The hard part wasn't adding SR-IOV; it was adding the isolation, metering, DDoS protection, and capacity limits that make SR-IOV useful in a cloud environment.

AWS Custom Server & Storage Designs

  • The cost of a negative situation is not high, so expensive unneeded protection can be removed. Servers are designed for what they do, not for a general population of users. Amazon knows exactly what environment the server will run in and will know exactly when something goes wrong, so the servers can be designed with less engineering headroom. The cost of server failure is not that big for them. They are already on site and are very good at replacing hard disks, etc. So a lot of the carefulness in enterprise equipment is not necessary.

  • Processors can be pushed harder. They know the cooling requirements, they influence the mechanical design, and they just design good servers, so they can get more performance out of a server. Through a partnership with Intel, Amazon has processors that run faster than can be bought on the open market.

  • An example is the design of their own storage rack. It weighs over a ton, is 19” wide, and holds 864 disk drives. For some workloads this is a wonderful, game-changing design that helps them get better prices in some areas.

Power Infrastructure

  • Amazon has designed and built their own power substations. It only saves a little money, but they can build them much faster. Utility companies are not used to dealing with the rate AWS is growing at, so they had to build their own.

  • 3 100% carbon neutral regions: US West (Oregon), AWS GovCloud (US), EU (Frankfurt)

Rapid Pace Of Innovation

  • 449 new services and major features released in 2014. 24 in 2008, 48 in 2009, 61 in 2010, 82 in 2011, 159 in 2012, 280 in 2013.

  • AWS is getting more reliable as the pace of innovation quickens. Their goal is to make available to customers the same tools they use to achieve this rate of innovation and high quality.

( via HighScalability.com )