StackExchange’s Performance Dashboard

6148448748_ee8eedd346_mStackExchange created a very cool performance dashboard that looks to be updated from real system metrics. Wouldn’t it be fascinating if every site had a similar dashboard?

The dashboard contains information like there are 560 million page views per month, 260,000 sustained connections,  34 TB data transferred per month, 9 web servers with 48GB of RAM handling 185 req/s at 15% CPU usage. There are 4 SQL servers, 2 redis servers, 3 tag engine servers, 3 elasticsearch servers, and 2 HAProxy servers, along with stats on each.

There’s also an excellent discussion thread on reddit that goes into more interesting details, with questions being answered by folks from StackExchange.

StackExchange is still doing innovative work and is very much an example worth learning from. They’ve always danced to their own tune and it’s a catchy tune at that. More at StackOverflow Update: 560M Pageviews A Month, 25 Servers, And It’s All About Performance.

( via )

StackOverflow Update: 560M Pageviews A Month, 25 Servers, And It’s All About Performance

16238755496_4a3014ebbb_mThe folks at Stack Overflow remain incredibly open about what they are doing and why. So it’s time for another update. What has Stack Overflow been up to?

The network of sites that make up StackExchange, which includes StackOverflow, is now ranked 54th for traffic in the world; they have 110 sites and are growing at a rate of 3 or 4 a month; 4 million users; 40 million answers; and 560 million pageviews a month.

This is with just 25 servers. For everything. That’s high availability, load balancing, caching, databases, searching, and utility functions. All with a relative handful of employees. Now that’s quality engineering.

This update is based on The architecture of StackOverflow (video) by Marco Cecconi and What it takes to run Stack Overflow (post) by Nick Craver. In addition, I’ve merged in comments from various sources. No doubt some of the details are out of date as I meant to write this article long ago, but it should still be representative.

Stack Overflow still uses Microsoft products. Microsoft infrastructure works and is cheap enough, so there’s no compelling reason to change. Yet SO is pragmatic. They use Linux where it makes sense. There’s no purity push to make everything Linux or keep everything Microsoft. That wouldn’t be efficient.

Stack Overflow still uses a scale-up strategy. No clouds in site. With their SQL Servers loaded with 384 GB of RAM and 2TB of SSD, AWS would cost a fortune. The cloud would also slow them down, making it harder to optimize and troubleshoot system issues. Plus, SO doesn’t need a horizontal scaling strategy. Large peak loads, where scaling out makes sense, hasn’t  been a problem because they’ve been quite successful at sizing their system correctly.

So it appears Jeff Atwood’s quote: “Hardware is Cheap, Programmers are Expensive”, still seems to be living lore at the company.

Marco Ceccon in his talk says when talking about architecture you need to answer this question first: what kind of problem is being solved?

First the easy part. What does StackExchange do? It takes topics, creates communities around them, and creates awesome question and answer sites.

The second part relates to scale. As we’ll see next StackExchange is growing quite fast and handles a lot of traffic. How does it do that? Let’s take a look and see….


  • StackExchange network has 110 sites growing at a rate of 3 or 4 a month.

  • 4 million users

  • 8 million questions

  • 40 million answers

  • As a network #54 site for traffic in the world

  • 100% year over year growth

  • 560 million pageviews a month

  • Peak is more like 2600-3000 requests/sec on most weekdays. Programming, being a profession, means weekdays are significantly busier than weekends.

  • 25 servers

  • 2 TB of SQL data all stored on SSDs

  • Each web server has 2x 320GB SSDs in a RAID 1.

  • Each ElasticSearch box has 300 GB also using SSDs.

  • Stack Overflow has a 40:60 read-write ratio.

  • DB servers average 10% CPU utilization

  • 11 web servers, using IIS

  • 2 load balancers, 1 active, using HAProxy

  • 4 active database nodes, using MS SQL

  • 3 application servers implementing the tag engine, anything searching by tag hits

  • 3 machines doing search with ElasticSearch

  • 2 machines for distributed cache and messaging using Redis

  • 2 Networks (each a Nexus 5596 + Fabric Extenders)

  • 2 Cisco 5525-X ASAs (think Firewall)

  • 2 Cisco 3945 Routers

  • 2 read-only SQL Servers for used mainly for the Stack Exchange API

  • VMs also perform functions like deployments, domain controllers, monitoring, ops database for sysadmin goodies, etc.


  • ElasticSearch

  • Redis

  • HAProxy

  • MS SQL

  • Opserver

  • TeamCity

  • Jil – Fast .NET JSON Serializer, built on Sigil

  • Dapper – a micro ORM.


  • The UI has message inbox that is sent a message when you get a new badge, receive a message, significant event, etc. Done using WebSockets and is powered by redis.

  • Search box is powered by ElasticSearch using a REST interface.

  • With so many questions on SO it was impossible to just show the newest questions, they would change too fast, a question every second. Developed an algorithm to look at your pattern of behaviour and show you which questions you would have the most interest in. It’s uses complicated queries based on tags, which is why a specialized Tag Engine was developed.

  • Server side templating is used to generate pages.


  • The 25 servers are not doing much, that is the CPU load is low. It’s calculated SO could run on only 5 servers.

  • The database server is at 10%, except when it bursts while performing a backups.

  • How so low? The databases servers have 384GB of RAM and the web servers are at 10%-15% CPU usage.

  • Scale-up is still working. Other scale-out sites with a similar number of pageviews tend to run on 100, 200, up to 300 servers.

  • Simple system. Built on .Net. Have only 9 projects, others systems have 100s. Reason to have so few projects is is so compilation is lightning fast, which requires planning at the beginning. Compilation takes 10 seconds on a single computer.

  • 110K lines of code. A small number given what it does.

  • This minimalist approach comes with some problems. One problem is not many tests. Tests aren’t needed because there’s a great community. Meta.stackoverflow is a discussion site for the community and where bugs are reported. Meta.stackoverflow is also a beta site for new software. If users find any problems with it they report the bugs that they’ve found, sometimes with solution/patches.

  • Windows 2012 is used in New York but are upgrading to 2012 R2  (Oregon is already on it). For Linux systems it’s Centos 6.4.

  • Load is really almost all over 9 servers, because 10 and 11 are only for,, and the development tier. Those servers also run around 10-20% CPU which means we have quite a bit of headroom available.


  • Intel 330 as the default (web tier, etc.)

  • Intel 520 for mid tier writes like Elastic Search

  • Intel 710 & S3700 for the database tier. S3700 is simply the successor to the high endurance 710 series.

  • Exclusively RAID 1 or RAID 10 (10 being any arrays with 4+ drives). Failures have not been a problem, even with hundreds of intel 2.5″ SSDs in production, a single one hasn’t failed yet. One or more spare parts are kept for each model, but multiple drive failure hasn’t been a concern.

  • ElasticSearch performs much better on SSDs, given SO writes/re-indexes very frequently.

  • SSD changes the use of search. couldn’t handle SO’s concurrent workloads due to locking issues, so they moved to ElasticSearch. It turns out locks around the binary readers really aren’t necessary in an all SSD environment.

  • The only scale-up problems so far is SSD space on the SQL boxes due to the growth pattern of reliability vs. space in the non-consumer space, that isdrives that have capacitors for power loss and such.

High Availability

  • The main datacenter is in New York and the backup datacenter is in Oregon.

  • Redis has 2 slaves, SQL has 2 replicas, tag engine has 3 nodes, elastic has 3 nodes – any other service has high availability as well (and exists in both data centers).

  • Not everything is slaved between data centers (very temporary cache data that’s not needed to eat bandwidth by syncing, etc.) but the big items are, so there is still a shared cache in case of a hard down in the active data center. A start without a cache is possible, but it isn’t very graceful.

  • Nginx was used for SSL, but a transition has been made to using HAProxy to terminate SSL.

  • Total HTTP traffic sent is only about 77% of the total traffic sent. This is because replication is happening to the secondary data center in Oregon as well as other VPN traffic. The majority of this traffic is the data replication to SQL replicas and redis slaves in Oregon.


  • MS SQL Server.

  • Stack Exchange has one database per-site, so Stack Overflow gets one, Super User gets one, Server Fault gets one, and so on. The schema for these is the same. This approach of having different database is effectively a form of partitioning and horizontal scaling.

  • In the primary data center (New York) there is usually 1 master and 1 read-only replica in each cluster. There’s also 1 read-only replica (async) in the DR data center (Oregon). When running in Oregon then the primary is there and both of the New York replicas are read-only and async.

  • There are a few wrinkles. There is one “network wide” database which has things like login credentials, and aggregated data (mostly exposed through user profiles, or APIs).

  • Careers Stack Overflow,, and Area 51 all have their own unique database schema.

  • All the schema changes are applied to all site databases at the same time. They need to be backwards compatible so, for example, if you need to rename a column – a worst case scenario – it’s a multiple steps process: add a new column, add code which works with both columns, back fill the new column, change code so it works with the new column only, remove the old column.

  • Partitioning is not required. Indexing takes care of everything and the data just is not large enough. If something warrants a filtered indexes, why not make it way more efficient? Indexing only on DeletionDate = Null and such is a common pattern, others are specific FK types from enums.

  • Votes are in 1 table per item, for example 1 table for post votes, 1 table for comment votes. Most pages we render real-time, caching only for anonymous users. Given that, there’s no cache to update, it’s just a re-query.

  • Scores are denormalized, so querying is often needed. It’s all IDs and dates, the post votes table just has 56,454,478 rows currently. Most queries are just a few milliseconds due to indexing.

  • The Tag Engine is entirely self-contained, which means not having to depend on an external service for very, very core functionality. It’s a huge in-memory struct array structure that is optimized for SO use cases and precomputed results for heavily hit combinations. It’s a simple windows service running on a few boxes working in a redundant team. CPU is about 2-5% almost always. Three boxes are not needed for load, just redundancy. If all those do fail at once, the local web servers will load the tag engine in memory and keep on going.

  • On Dapper’s lack of a compiler checking queries compared to traditional ORM. The compiler is checking against what you told it the database looks like. This can help with lots of things, but still has the fundamental disconnect problem you’ll get at runtime. A huge problem with the tradeoff is the generated SQL is nasty, and finding the original code it came from is often non-trivial. Lack of ability to hint queries, control parameterization, etc. is also a big issue when trying to optimize queries. For example. literal replacement was added to Dapper to help with query parameterization which allows the use of things like filtered indexes. Dapper also intercepts the SQL calls to dapper and add add exactly where it came from. It saves so much time tracking things down.



  • The process:

    • Most programmers work remotely. Programmers code in their own batcave.

    • Compilation is very fast.

    • Then the few test that they have are run.

    • Once compiled, code is moved to a development staging server.

    • New features are hidden via feature switches.

    • Runs on same hardware as the rest of the sites.

    • It’s then moved to Meta.stackoverflow for testing. 1000 users per day use the site, so its a good test.

    • If it passes it goes live on the network and is tested by the larger community.

  • Heavy usage of static classes and methods, for simplicity and better performance.

  • Code is simple because the complicated bits are packaged in a library and open sourced and maintained. The number of .Net projects stays low because community shared parts of the code are used.

  • Developers get two or three monitors. Screens are important, they help you be productive.


  • Cache all the things.

  • 5 levels of caches.

  • 1st: is the network level cache: caching in the browser, CDN, and proxies.

  • 2nd: given for free by the .Net framework and is called the HttpRuntime.Cache. An in-memory, per server cache.

  • 3rd: Redis. Distributed in-memory key-value store. Share cache elements across different servers that serve the same site. If StackOverflow has 9 servers then all servers will be able to find the same cached items.

  • 4th: SQL Server Cache. The entire database is cached in-memory. The entire thing.

  • 5th: SSD. Usually only hit when the SQL server cache is warming up.

  • For example, every help page is cached. Code to access a page is very terse:

    • Static methods and static classes re used. Really bad from an OOP perspective, but really fast and really friendly towards terse code. All code is directly addressed.

    • Caching is handled by a library layer of Redis and Dapper, a micro ORM.

  • To get around garbage collection problems, only one copy of a class used in templates are created and kept in a cache. Everything is measured, including GC operation, from statistics it is known that layers of indirection increase GC pressure to the point of noticeable slowness.

  • CDN hits vary, since the  query string hash is based on file content, it’s only re-fetched on a build. It’s typically 30-50 million hits a day for 300 to 600 GB of bandwidth.

  • A CDN is not used for CPU or I/O load, but to help users find answers faster.


  • Want to deploy 5 times a day. Don’t build grand gigantic things and then put then live. Important because:

    • Can measure performance directly.

    • Forced to build the smallest thing that can possibly work.

  • TeamCity builds then copies to each web tier via a powershell script. The steps for each server are:

    • Tell HAProxy to take the server out of rotation via a POST

    • Delay to let IIS finish current requests (~5 sec)

    • Stop the website (via the same PSSession for all the following)

    • Robocopy files

    • Start the website

    • Re-enable in HAProxy via another POST

  • Almost everything is deployed via puppet or DSC, so upgrading usually consist of just nuking the RAID array and installing from a PXE boot. it’s very fast and you know it’s done right/repeatable.


  • Teams:

    • SRE (System Reliability Engineering): – 5 people

    • Core Dev (Q&A site) : ~6-7 people

    • Core Dev Mobile: 6 people

    • Careers team that does development solely for the SO Careers product: 7 people

  • Devops and developer teams are really close-knit.

  • There’s a lot of movement between teams.

  • Most employees work remotely.

  • Offices are mostly sales, Denver and London exclusively so.

  • All else equal, it is slightly prefered to have people in NYC, because the in-person time is a plus for the casual interaction that happens in between “getting things done”. But the set up makes it possible to do real work and official team collaboration works almost entirely online.

  • They’ve learned that the in-person benefit is more than outweighed by how much you get from being able to hire the best talent that loves the product anywhere, not just the ones willing to live in the city you happen to be in.

  • The most common reason for someone going remote is starting a family. New York’s great, but spacious it is not.

  • Offices are in Manhattan and a lot of talent is there. The data center needs to not be a crazy distance away since it is always being improved. There’s also a slightly faster connection to many backbones in the NYC location – though we’re talking only a few milliseconds (if that) of difference there.

  • Making an awesome team: Love geeks. Early Microsoft, for example, was full of geeks and they conquered the world.

  • Hire from Stack Overflow community. They looks for a passion for coding, a passion for helping others, and a passion for communicating.


  • Budgets are pretty much project based. Money is only spent as infrastructure is added for new projects. The web servers that have such low utilization are the same ones purchased 3 years ago when the data center was built.


  • Move fast and break things. Push it live.

  • Major changes are tested by pushing them. Development has an equally powerful SQL server and it runs on the same web tier, so performance testing isn’t so bad.

  • Very few tests. Stack Overflow doesn’t use many unit tests because of their active community and heavy usage of static code.

  • Infrastructure changes. There’s 2 of everything, so there’s a backup with the old configuration whenever possible, with a quick failback mechanism. For example, keepalived does failback quickly between load balancers.

  • Redundant systems fail over pretty often just to do regular maintenance. SQL backups are tested by having a dedicated server just for restoring them, constantly (that’s a free license – do it). Plan to start full data center failovers every 2 months or so – the secondary data center is read-only at all other times.

  • Unit tests, integration tests and UI tests run on every push. All the tests must succeed before a production build run is even possible. So there’s some mixed messages going on about testing.

  • The things that obviously should have tests have tests. That means most of the things that touch money on the Careers product, and easily unit-testable features on the Core end (things with known inputs, e.g. flagging, our new top bar, etc), for most other things we just do a functionality test by hand and push it to our incubating site (formerly meta.stackoverflow, now meta.stackexchange).

Monitoring / Logging

  • Now considering using for log management. Currently a dedicated service inserts the syslog UDP traffic into a SQL database. Web pages add headers for the timings on the way out which are captured with HAProxy and are included in the syslog traffic.

  • Opserver and Realog. are how many metrics are surfaced. Realog is a logging display system built by Kyle Brandt and Matt Jibson in Go

  • Logging is from the HAProxy load balancer via syslog instead of via IIS. This is a lot more versatile than IIS logs.


  • Hardware is cheaper than developers and efficient code. You are only as fast as your slowest bottleneck and all the current cloud solutions have fundamental performance or capacity limits.
  • Could you build SO well if building for the cloud from day one? Mostl likely. Could you consistency render all your pages performing several up to date queries and cache fetches across that cloud network you don’t control and getting sub 50ms render times? That’s another matter. Unless you’re talking about substantially higher cost (at least 3-4x), the answer is no – it’s still more economical for SO to host in their own servers.

Performance As A Feature

  • StackOverflow puts a heavy emphasis on performance. The goal for the main page  is to load in less than 50ms, but can be as low as 28ms.

  • Programmers are fanatic about reducing page load times and improving the user experience.

  • Timings for every single request to the network are recorded. With these kind of metrics you can make decisions on where to improve your system.

  • The primary reason their servers run at such low utilization is efficient code. Web servers average between 5-15% CPU, 15.5 GB of RAM used and 20-40 Mb/s network traffic.  The SQL servers average around 5-10% CPU, 365 GB of RAM used, and 100-200 Mb/s of network traffic. This has three major benefits: general room to grow before and upgrade is necessary; headroom to stay online for when things go crazy (bad query, bad code, attacks, whatever it may be); and the ability to clock back on power if needed.

Lessons Learned

  • Why use Redis if you use MS products? gabeech: It’s not about OS evangelism. We run things on the platform they run best on. Period. C# runs best on a windows machine, we use IIS. Redis runs best on a *nix machine we use *nix.

  • Overkill as a strategy. Nick Craver on why their network is over provisioned: Is 20 Gb massive overkill? You bet your ass it is, the active SQL servers average around 100-200 Mb out of that 20 Gb pipe.  However, things like backups, rebuilds, etc. can completely saturate it due to how much memory and SSD storage is present, so it does serve a purpose.

  • SSDs Rock. The database nodes all use SSD and the average write time is 0 milliseconds.

  • Know your read/write workload.

  • Keeping things very efficient means new machines are not needed often. Only when a new project comes along that needs different hardware for some reason is new hardware added. Typically memory is added, but other than that efficient code and low utilization means it doesn’t need replacing. So typically talking about adding a) SSDs for more space, or b) new hardware for new projects.

  • Don’t be afraid to specialize. SO uses complicated queries based on tags, which is why a specialized Tag Engine was developed.

  • Do only what needs to be done. Tests weren’t necessary because an active community did the acceptance testing for them. Add projects only when required. Add a line of code only when necessary. You Aint Gone Need It really works.

  • Reinvention is OK. Typical advice is don’t reinvent the wheel, you’ll just make it worse, by making it square, for example. At SO they don’t worry about making a “Square Wheel”. If developers can write something more lightweight than an already developed alternative, then go for it.

  • Go down to the bare metal. Go into the IL (assembly language of .Net). Some coding is in IL, not C#. Look at SQL query plans. Take memory dumps of the web servers to see what is actually going on. Discovered, for example, a split call generated 2GB of garbage.

  • No bureaucracy. There’s always some tools your team needs. For example, an editor, the most recent version of Visual Studio, etc. Just make it happen without a lot of process getting in the way.

  • Garbage collection driven programming. SO goes to great lengths to reduce garbage collection costs, skipping practices like TDD, avoiding layers of abstraction, and using static methods. While extreme, the result is highly performing code. When you’re doing hundreds of millions of objects in a short window, you can actually measure pauses in the app domain while GC runs. These have a pretty decent impact on request performance.

  • The cost of inefficient code can be higher than you think.  Efficient code stretches hardware further, reduces power usage, makes code easier for programmers to understand.

( via )

What is a Monolith?

There is currently a strong trend for microservice based architectures and frequent discussions comparing them to monoliths. There is much advice about breaking-up monoliths into microservices and also some amusing fights between proponents of the two paradigms – see the great Microservices vs Monolithic Melee. The term ‘Monolith’ is increasingly being used as a generic insult in the same way that ‘Legacy’ is!

However, I believe that there is a great deal of misunderstanding about exactly what a ‘Monolith’ is and those discussing it are often talking about completely different things.

A monolith can be considered an architectural style or a software development pattern (or anti-pattern if you view it negatively). Styles and patterns usually fit into different Viewtypes (a viewtype is a set, or category, of views that can be easily reconciled with each other [Clements et al., 2010]) and some basic viewtypes we can discuss are:

  • Module – The code units and their relation to each other at compile time.
  • Allocation – The mapping of the software onto its environment.
  • Runtime – The static structure of the software elements and how they interact at runtime.

A monolith could refer to any of the basic viewtypes above.

Module Monolith

If you have a module monolith then all of the code for a system is in a single codebase that is compiled together and produces a single artifact. The code may still be well structured (classes and packages that are coherent and decoupled at a source level rather than a big-ball-of-mud) but it is not split into separate modules for compilation. Conversely a non-monolithic module design may have code split into multiple modules or libraries that can be compiled separately, stored in repositories and referenced when required. There are advantages and disadvantages to both but this tells you very little about how the code is used – it is primarily done for development management.




Allocation Monolith

For an allocation monolith, all of the code is shipped/deployed at the same time. In other words once the compiled code is ‘ready for release’ then a single version is shipped to all nodes. All running components have the same version of the software running at any point in time. This is independent of whether the module structure is a monolith. You may have compiled the entire codebase at once before deployment OR you may have created a set of deployment artifacts from multiple sources and versions. Either way this version for the system is deployed everywhere at once (often by stopping the entire system, rolling out the software and then restarting).

A non-monolithic allocation would involve deploying different versions to individual nodes at different times. This is again independent of the module structure as different versions of a module monolith could be deployed individually.



Runtime Monolith

A runtime monolith will have a single application or process performing the work for the system (although the system may have multiple, external dependencies). Many systems have traditionally been written like this (especially line-of-business systems such as Payroll, Accounts Payable, CMS etc).

Whether the runtime is a monolith is independent of whether the system code is a module monolith or not. A runtime monolith often implies an allocation monolith if there is only one main node/component to be deployed (although this is not the case if a new version of software is rolled out across regions, with separate users, over a period of time).


Note that my examples above are slightly forced for the viewtypes and it won’t be as hard-and-fast in the real world.


Be very carefully when arguing about ‘Microservices vs Monoliths’. A direct comparison is only possible when discussing the Runtime viewtype and properties. You should also not assume that moving away from a Module or Allocation monolith will magically enable a Microservice architecture (although it will probably help). If you are moving to a Microservice architecture then I’d advise you to consider all these viewtypes and align your boundaries across them i.e. don’t just code, build and distribute a monolith that exposes subsets of itself on different nodes.



Auth0 Architecture – Running In Multiple Cloud Providers And Regions


Auth0 provides authentication, authorization and single sign on services for apps of any type: mobile, web, native; on any stack.

Authentication is critical for the vast majority of apps. We designed Auth0 from the beginning with multiple levels of redundancy. One of this levels is hosting. Auth0 can run anywhere: our cloud, your cloud, or even your own servers. And when we run Auth0 we run it on multiple-cloud providers and in multiple regions simultaneously.

This article is a brief introduction of the infrastructure behind and the strategies we use to keep it up and running with high availability.

Core Service Architecture

The core service is relatively simple:

  • Front-end servers: these consist of several x-large VMs, running Ubuntu on Microsoft Azure.

  • Store: mongodb, running on dedicated memory optimized X-large VMs.

  • Intra-node service routing: nginx

All components of Auth0 (e.g. Dashboard, transaction server, docs) run on all nodes. All identical.

Multi-Cloud / High Availability


Multi Cloud Architecture

Last week, Azure suffered a global outage that lasted for hours. During that time our HA plan activated and we switched over to AWS

  • The services runs primarily on Microsoft Azure (IaaS). Secondary nodes on stand-by always ready on AWS.

  • We use Route53 with a failover routing policy. TTL at 60 secs. The Route53 health check detects using a probe against primary DC, if it fails (3 times, 10 seconds interval) it changes the DNS entry to point to secondary DC. So max downtime in case of primary failure is ~2 minutes.

  • We use puppet to deploy on every “push to master”. Using puppet allows us to be cloud independent on the configuration/deployment process. Puppet Master runs on our build server (TeamCity currently).

  • MongoDB is replicated often to secondary DC and secondary DC is configured as read-only.

  • While running on the secondary DC, only runtime logins are allowed and the dashboard is set to “read-only mode”.

  • We replicate all the configuration needed for a login to succeed (application info, secrets, connections, users, etc). We don’t replicate transactional data (tokens, logs).

  • In case of failover, there might might some logging records that are lost. We are planning to improve that by having a real-time replica across Azure and AWS.

  • We use our own version of chaos monkey to test the resiliency of our infrastructure


Automated Testing

  • We have 1000+ unit and integration tests.

  • We use saucelabs to run cross-browser (desktop/mobile) integration tests for Lock, our JavaScript login widget.

  • We use phantomjs/casper for integration tests. We test, for instance, that a full flow login with Google and other providers works fine.

  • All these run before every push to production.


Our use case is simple, we need to serve our JS library and its configuration (which providers are enabled, etc.). Assets and configuration data is uploaded to S3. It has to support TLS on our own custom domain ( We ended up building our own CDN.

  • We tried 3 reputable CDN providers, but run into a whole variety of issues: The first one we tried when we didn’t have our own domain for cdn. At some point we decided we needed our own domain over SSL/TLS. This cdn was too expensive if you want SSL and customer domain at that point (600/mo). We also had issues configuring it to work with gzip and S3. Since S3 cannot serve both version (zipped and not) of the same file and this CDN doesn’t have content negotiation, some browsers (cough IE) don’t play well with this. So we moved to another CDN which was much cheaper.

  • The second CDN, we had a handful of issues and we couldn’t understand the root cause of them. Their support was on chat and it took time to get answers. Sometimes it seemed to be S3 issues, sometimes they had issues on routing, etc.

  • We decided to spend more money and we moved to a third CDN. Given that this CDN is being used by high load services like GitHub we thought it was going to be fine. However, our requirements were different from GitHub. If the CDN doesn’t work for GitHub, you won’t see an image on the In our case, our customers depends on the CDN to serve the Login Widget, which means that if it doesn’t work, then their customers can’t login.

  • We ended up building our own CDN using nginx, varnish and S3. It’s hosted on every region on AWS and so far it has been working great (no downtime). We use Route53 latency based routing.

Sandbox (Used To Run Untrusted Code)

One of the features we provide is the ability to run custom code as part of the login transaction. Customers can write these rules and we have a public repository for commonly used rules.

  • The sandbox is built on CoreOS, Docker and etcd.

  • There is a pool of Docker instances that gets assigned to a tenant on-demand.

  • Each tenant gets its own docker instance and there is a recycling policy based on idle time.

  • There is a controller doing the recycling policy and a proxy that routes the request to the right container.



More information about the sandbox is in this JSConf presentation Nov 2014: and slides:


Initially we used pingdom (we still use it), but we decided to develop our own health check system that can run arbitrary health checks based on node.js scripts. These run from all AWS regions.

  • It uses the same sandbox we developed for our service. We call the sandbox via an http API and send the node.js script to run as an HTTP POST.

  • We monitor all the components and we also do synthetic transactions against the service (e.g. a login transaction).



If a health check fails we get notified through Slack. We have two Slack channels #p1 and #p2. If the failure happens 1 time, it gets posted to #p2. If it happens 2 times in a row it gets posted to #p1 and all members of devops get an SMS (via Twilio).

For detailed performance counters and response times we use statsd and we send all the metrics to Librato. This is an example of a chart you can create.


We also setup alerts based on derivative metrics (i.e. how much something grows or shrinks in a time period). For instance, we have one based on logins: if Derivate(logins) > X => Send an alert to Slack.


Finally, we have alerts coming from NewRelic for infrastructure components.


For logging we use ElasticSearch, Logstash and Kibana. We are storing logs from nginx and mongodb at this point. We are also parsing mongo logs using logstash in order to identify slow queries (anything with a high number of collscans).



  • All related web properties: the site, our blog, etc. run completely separate from the app and runtime, on their own Ubuntu + Docker VMs.


This is where we are going:

  • We are moving to CoreOS and Docker. We want to move to a model where we manage clusters as a whole instead of doing configuration management over individual nodes. Docker helps also by removing some moving parts by doing image-based deployment (and be able to rollback at that level as well).

  • MongoDB will be geo-replicated across DCs between AWS and Azure. We are testing latency.

  • For all the search related features we are moving to ElasticSearch to provide search based on any criteria. MongoDB didn’t work out well in this scenario (given our multi-tenancy).


Scaling Docker with Kubernetes

Kubernetes is an open source project to manage a cluster of Linux containers as a single system, managing and running Docker containers across multiple hosts, offering co-location of containers, service discovery and replication control. It was started by Google and now it is supported by Microsoft, RedHat, IBM and Docker amongst others.

Google has been using container technology for over ten years, starting over 2 billion containers per week. With Kubernetes it shares its container expertise creating an open platform to run containers at scale.

The project serves two purposes. Once you are using Docker containers the next question is how to scale and start containers across multiple Docker hosts, balancing the containers across them. It also adds a higher level API to define how containers are logically grouped, allowing to define pools of containers, load balancing and affinity.

Kubernetes is still at a very early stage, which translates to lots of changes going into the project, some fragile examples, and some cases for new features that need to be fleshed out, but the pace of development, and the support by other big companies in the space, is highly promising.

Kubernetes concepts

The Kubernetes architecture is defined by a master server and multiple minions. The command line tools connect to the API endpoint in the master, which manages and orchestrates all the minions, Docker hosts that receive the instructions from the master and run the containers.

  • Master: Server with the Kubernetes API service. Multi master configuration is on the roadmap.
  • Minion: Each of the multiple Docker hosts with the Kubelet service that receive orders from the master, and manages the host running containers.
  • Pod: Defines a collection of containers tied together that are deployed in the same minion, for example a database and a web server container.
  • Replication controller: Defines how many pods or containers need to be running. The containers are scheduled across multiple minions.
  • Service: A definition that allows discovery of services/ports published by containers, and external proxy communications. A service maps the ports of the containers running on pods across multiple minions to externally accesible ports.
  • kubecfg: The command line client that connects to the master to administer Kubernetes.


Kubernetes is defined by states, not processes. When you define a pod, Kubernetes tries to ensure that it is always running. If a container is killed, it will try to start a new one. If a replication controller is defined with 3 replicas, Kubernetes will try to always run that number, starting and stopping containers as necessary.

The example app used in this article is the Jenkins CI server, in a typical master-slaves setup to distribute the jobs. Jenkins is configured with the Jenkins swarm plugin to run a Jenkins master and multiple Jenkins slaves, all of them running as Docker containers across multiple hosts. The swarm slaves connect to the Jenkins master on startup and become available to run Jenkins jobs. The configuration files used in the example are available in GitHub, and the Docker images are available as csanchez/jenkins-swarm, for the master Jenkins, extending the official Jenkins image with the swarm plugin, and csanchez/jenkins-swarm-slave, for each of the slaves, just running the slave service on a JVM container.

Creating a Kubernetes cluster

Kubernetes provides scripts to create a cluster with several operating systems and cloud/virtual providers: Vagrant (useful for local testing), Google Compute Engine, Azure, Rackspace, etc.

The examples will use a local cluster running on Vagrant, using Fedora as OS, as detailed in the getting started instructions, and have been tested on Kubernetes 0.5.4. Instead of the default three minions (Docker hosts) we are going to run just two, which is enough to show the Kubernetes capabilities without requiring a more powerful machine.

Once you have downloaded Kubernetes and extracted it, the examples can be run from that directory. In order to create the cluster from scratch the only command needed is ./cluster/

$ export KUBERNETES_PROVIDER=vagrant
$ ./cluster/

Get the example configuration files:

$ git clone

The cluster creation will take a while depending on machine power and internet bandwidth, but should eventually finish without errors and it only needs to be ran once.

Command line tool

The command line tool to interact with Kubernetes is called kubecfg, with a convenience script in cluster/

In order to check that our cluster is up and running with two minions, just run the kubecfg list minions command and it should display the two virtual machines in the Vagrant configuration.

$ ./cluster/ list minions

Minion identifier


The Jenkins master server is defined as a pod in Kubernetes terminology. Multiple containers can be specified in a pod, that would be deployed in the same Docker host, with the advantage that containers in a pod can share resources, such as storage volumes, and use the same network namespace and IP. Volumes are by default empty directories, type emptyDir, that live for the lifespan of the pod, not the specific container, so if the container fails the persistent storage will live on. Other volume type is hostDir, that will mount a directory from the host server in the container.

In this Jenkins specific example we could have a pod with two containers, the Jenkins server and, for instance, a MySQL container to use as database, although we will only focus on a standalone Jenkins master container.

In order to create a Jenkins pod we run kubecfg with the Jenkins container pod definition, using Docker image csanchez/jenkins-swarm, ports 8080 and 50000 mapped to the container in order to have access to the Jenkins web UI and the slave API, and a volume mounted in /var/jenkins_home. You can find the example code in GitHub as well.

The Jenkins web UI pod (pod.json) is defined as follows:

  "id": "jenkins",
  "kind": "Pod",
  "apiVersion": "v1beta1",
  "desiredState": {
    "manifest": {
      "version": "v1beta1",
      "id": "jenkins",
      "containers": [
          "name": "jenkins",
          "image": "csanchez/jenkins-swarm:1.565.3.3",
          "ports": [
              "containerPort": 8080,
              "hostPort": 8080
              "containerPort": 50000,
              "hostPort": 50000
          "volumeMounts": [
              "name": "jenkins-data",
              "mountPath": "/var/jenkins_home"
      "volumes": [
          "name": "jenkins-data",
          "source": {
            "emptyDir": {}
  "labels": {
    "name": "jenkins"

And create it with:

$ ./cluster/ -c kubernetes-jenkins/pod.json create pods

Name                Image(s)                           Host                Labels              Status
----------          ----------                         ----------          ----------          ----------
jenkins             csanchez/jenkins-swarm:1.565.3.3   <unassigned>        name=jenkins        Pending

After some time, depending on your internet connection, as it has to download the Docker image to the minion, we can check its status and in which minion is started.

$ ./cluster/ list pods
Name                Image(s)                           Host                    Labels              Status
----------          ----------                         ----------              ----------          ----------
jenkins             csanchez/jenkins-swarm:1.565.3.3   name=jenkins        Running

If we ssh into the minion that the pod was assigned to, minion-1 or minion-2, we can see how Docker started the container defined, amongst other containers used by Kubernetes for internal management (kubernetes/pause and google/cadvisor).

$ vagrant ssh minion-2 -c "docker ps"

CONTAINER ID        IMAGE                              COMMAND                CREATED             STATUS              PORTS                                              NAMES
7f6825a80c8a        google/cadvisor:0.6.2              "/usr/bin/cadvisor"    3 minutes ago       Up 3 minutes                                                           k8s_cadvisor.b0dae998_cadvisormanifes12uqn2ohido76855gdecd9roadm7l0.default.file_cadvisormanifes12uqn2ohido76855gdecd9roadm7l0_28df406a
5c02249c0b3c        csanchez/jenkins-swarm:1.565.3.3   "/usr/local/bin/jenk   3 minutes ago       Up 3 minutes                                                           k8s_jenkins.f87be3b0_jenkins.default.etcd_901e8027-759b-11e4-bfd0-0800279696e1_bf8db75a
ce51fda15f55        kubernetes/pause:go                "/pause"               10 minutes ago      Up 10 minutes                                                          k8s_net.dbcb7509_0d38f5b2-759c-11e4-bfd0-0800279696e1.default.etcd_0d38fa52-759c-11e4-bfd0-0800279696e1_e4e3a40f
e6f00165d7d3        kubernetes/pause:go                "/pause"               13 minutes ago      Up 13 minutes>8080/tcp,>50000/tcp   k8s_net.9eb4a781_jenkins.default.etcd_901e8027-759b-11e4-bfd0-0800279696e1_7bd4d24e
7129fa5dccab        kubernetes/pause:go                "/pause"               13 minutes ago      Up 13 minutes>8080/tcp                             k8s_net.a0f18f6e_cadvisormanifes12uqn2ohido76855gdecd9roadm7l0.default.file_cadvisormanifes12uqn2ohido76855gdecd9roadm7l0_659a7a52

And, once we know the container id, we can check the container logs with vagrant ssh minion-1 -c “docker logs cec3eab3f4d3″

We should also see the Jenkins web UI at or, depending on what minion it was started in.

Service discovery

Kubernetes allows defining services, a way for containers to use discovery and proxy requests to the appropriate minion. With this definition in service-http.json we are creating a service with id jenkins pointing to the pod with the label name=jenkins, as declared in the pod definition, and forwarding the port 8888 to the container’s 8080.

  "id": "jenkins",
  "kind": "Service",
  "apiVersion": "v1beta1",
  "port": 8888,
  "containerPort": 8080,
  "selector": {
    "name": "jenkins"

Creating the service with kubecfg:

$ ./cluster/ -c kubernetes-jenkins/service-http.json create services

Name                Labels              Selector            IP                  Port
----------          ----------          ----------          ----------          ----------
jenkins                                 name=jenkins         8888

Each service is assigned a unique IP address tied to the lifespan of the Service. If we had multiple pods matching the service definition the service would load balance the traffic across all of them.

Another feature of services is that a number of environment variables are available for any subsequent containers ran by Kubernetes, providing the ability to connect to the service container, in a similar way as running linked Docker containers. This will provide useful for finding the master Jenkins server from any of the slaves.


Another tweak we need to do is to open port 50000, needed by the Jenkins swarm plugin. It can be achieved creating another service service-slave.json so Kubernetes forwards traffic to that port to the Jenkins server container.

  "id": "jenkins-slave",
  "kind": "Service",
  "apiVersion": "v1beta1",
  "port": 50000,
  "containerPort": 50000,
  "selector": {
    "name": "jenkins"

The service is created with kubecfg again.

$ ./cluster/ -c kubernetes-jenkins/service-slave.json create services

Name                Labels              Selector            IP                  Port
----------          ----------          ----------          ----------          ----------
jenkins-slave                           name=jenkins          50000

An all the defined services are available now, including some Kubernetes internal ones:

$ ./cluster/ list services

Name                Labels              Selector                                  IP                  Port
----------          ----------          ----------                                ----------          ----------
kubernetes-ro                           component=apiserver,provider=kubernetes         80
kubernetes                              component=apiserver,provider=kubernetes          443
jenkins                                 name=jenkins                             8888
jenkins-slave                           name=jenkins                              50000

Replication controllers

Replication controllers allow running multiple pods in multiple minions. Jenkins slaves can be run this way to ensure there is always a pool of slaves ready to run Jenkins jobs.

In a replication.json definition:

  "id": "jenkins-slave",
  "apiVersion": "v1beta1",
  "kind": "ReplicationController",
  "desiredState": {
    "replicas": 1,
    "replicaSelector": {
      "name": "jenkins-slave"
    "podTemplate": {
      "desiredState": {
        "manifest": {
          "version": "v1beta1",
          "id": "jenkins-slave",
          "containers": [
              "name": "jenkins-slave",
              "image": "csanchez/jenkins-swarm-slave:1.21",
              "command": [
                "sh", "-c", "/usr/local/bin/ -master http://$JENKINS_SERVICE_HOST:$JENKINS_SERVICE_PORT -tunnel $JENKINS_SLAVE_SERVICE_HOST:$JENKINS_SLAVE_SERVICE_PORT -username jenkins -password jenkins -executors 1"
      "labels": {
        "name": "jenkins-slave"
  "labels": {
    "name": "jenkins-slave"

The podTemplate section allows the same configuration options as a pod definition. In this case we want to make the Jenkins slave connect automatically to our Jenkins master, instead of relying on Jenkins multicast discovery. To do so we execute the command with -master parameter to point the slave to the Jenkins master running in Kubernetes. Note that we use the Kubernetes provided environment variables for the Jenkins service definition (JENKINS_SERVICE_HOST and JENKINS_SERVICE_PORT). The image command is overridden to configure the container this way, useful to reuse existing images while taking advantage of the service environment variables. It can be done in pod definitions too.

Create the replicas with kubecfg:

$ ./cluster/ -c kubernetes-jenkins/replication.json create replicationControllers

Name                Image(s)                            Selector             Replicas
----------          ----------                          ----------           ----------
jenkins-slave       csanchez/jenkins-swarm-slave:1.21   name=jenkins-slave   1

Listing the pods now would show new ones being created, up to the number of replicas defined in the replication controller.

$ ./cluster/ list pods

Name                                   Image(s)                            Host                    Labels               Status
----------                             ----------                          ----------              ----------           ----------
jenkins                                csanchez/jenkins-swarm:1.565.3.3   name=jenkins         Running
07651754-4f88-11e4-b01e-0800279696e1   csanchez/jenkins-swarm-slave:1.21   name=jenkins-slave   Pending

The first time running jenkins-swarm-slave image the minion has to download it from the Docker repository, but after a while, depending on your internet connection, the slaves should automatically connect to the Jenkins server. Going into the server where the slave is started, docker ps has to show the container running and docker logs is useful to debug any problems on container startup.

$ vagrant ssh minion-1 -c "docker ps"

CONTAINER ID        IMAGE                               COMMAND                CREATED              STATUS              PORTS                    NAMES
870665d50f68        csanchez/jenkins-swarm-slave:1.21   "/usr/local/bin/jenk   About a minute ago   Up About a minute                            k8s_jenkins-slave.74f1dda1_07651754-4f88-11e4-b01e-0800279696e1.default.etcd_11cac207-759f-11e4-bfd0-0800279696e1_9495d10e
cc44aa8743f0        kubernetes/pause:go                 "/pause"               About a minute ago   Up About a minute                            k8s_net.dbcb7509_07651754-4f88-11e4-b01e-0800279696e1.default.etcd_11cac207-759f-11e4-bfd0-0800279696e1_4bf086ee
edff0e535a84        google/cadvisor:0.6.2               "/usr/bin/cadvisor"    27 minutes ago       Up 27 minutes                                k8s_cadvisor.b0dae998_cadvisormanifes12uqn2ohido76855gdecd9roadm7l0.default.file_cadvisormanifes12uqn2ohido76855gdecd9roadm7l0_588941b0
b7e23a7b68d0        kubernetes/pause:go                 "/pause"               27 minutes ago       Up 27 minutes>8080/tcp   k8s_net.a0f18f6e_cadvisormanifes12uqn2ohido76855gdecd9roadm7l0.default.file_cadvisormanifes12uqn2ohido76855gdecd9roadm7l0_57a2b4de

The replication controller can automatically be resized to any number of desired replicas:

$ ./cluster/ resize jenkins-slave 2

And again the pods are updated to show where each replica is running.

$ ./cluster/ list pods
Name                                   Image(s)                            Host                    Labels               Status
----------                             ----------                          ----------              ----------           ----------
07651754-4f88-11e4-b01e-0800279696e1   csanchez/jenkins-swarm-slave:1.21   name=jenkins-slave   Running
a22e0d59-4f88-11e4-b01e-0800279696e1   csanchez/jenkins-swarm-slave:1.21   name=jenkins-slave   Pending
jenkins                                csanchez/jenkins-swarm:1.565.3.3   name=jenkins         Running

220140917 kubernetes-jenkins


Right now the default scheduler is random, but resource based scheduling will be implemented soon. At the time of writing there are several issues opened to add scheduling based on memory and CPU usage. There is also work in progress in an Apache Mesos based scheduler. Apache Mesos is a framework for distributed systems providing APIs for resource management and scheduling across entire datacenter and cloud environments.

Self healing

One of the benefits of using Kubernetes is the automated management and recovery of containers.

If the container running the Jenkins server dies for any reason, for instance because the process being ran crashes, Kubernetes will notice and will create a new container after a few seconds.

$ vagrant ssh minion-2 -c 'docker kill `docker ps | grep csanchez/jenkins-swarm: | sed -e "s/ .*//"`'

$ ./cluster/ list pods
Name                                   Image(s)                            Host                    Labels               Status
----------                             ----------                          ----------              ----------           ----------
jenkins                                csanchez/jenkins-swarm:1.565.3.3   name=jenkins         Failed
07651754-4f88-11e4-b01e-0800279696e1   csanchez/jenkins-swarm-slave:1.21   name=jenkins-slave   Running
a22e0d59-4f88-11e4-b01e-0800279696e1   csanchez/jenkins-swarm-slave:1.21   name=jenkins-slave   Running

And some time later, typically no more than a minute…

Name                                   Image(s)                            Host                    Labels               Status
----------                             ----------                          ----------              ----------           ----------
jenkins                                csanchez/jenkins-swarm:1.565.3.3   name=jenkins         Running
07651754-4f88-11e4-b01e-0800279696e1   csanchez/jenkins-swarm-slave:1.21   name=jenkins-slave   Running
a22e0d59-4f88-11e4-b01e-0800279696e1   csanchez/jenkins-swarm-slave:1.21   name=jenkins-slave   Running

Running the Jenkins data dir in a volume we guarantee that the data is kept even after the container dies, so we do not lose any Jenkins jobs or data created. And because Kubernetes is proxying the services in each minion the slaves will reconnect to the new Jenkins server automagically no matter where they run! And exactly the same will happen if any of the slave containers dies, the system will automatically create a new container and thanks to the service discovery it will automatically join the Jenkins server pool.

If something more drastic happens, like a minion dying, Kubernetes does not offer yet the ability to reschedule the containers in the other existing minions, it would just show the pods as Failed.

$ vagrant halt minion-2
==> minion-2: Attempting graceful shutdown of VM...
$ ./cluster/ list pods
Name                                   Image(s)                            Host                    Labels               Status
----------                             ----------                          ----------              ----------           ----------
jenkins                                csanchez/jenkins-swarm:1.565.3.3   name=jenkins         Failed
07651754-4f88-11e4-b01e-0800279696e1   csanchez/jenkins-swarm-slave:1.21   name=jenkins-slave   Running
a22e0d59-4f88-11e4-b01e-0800279696e1   csanchez/jenkins-swarm-slave:1.21   name=jenkins-slave   Failed

Tearing down

kubecfg offers several commands to stop and delete the replication controllers, pods and services definitions.

To stop the replication controller, setting the number of replicas to 0, and causing the termination of all the Jenkins slaves containers:

$ ./cluster/ stop jenkins-slave

To delete it:

$ ./cluster/ rm jenkins-slave

To delete the jenkins server pod, causing the termination of the Jenkins master container:

$ ./cluster/ delete pods/jenkins

To delete the services:

$ ./cluster/ delete services/jenkins
$ ./cluster/ delete services/jenkins-slave


Kubernetes is still a very young project, but highly promising to manage Docker deployments across multiple servers and simplify the execution of long running and distributed Docker containers. By abstracting infrastructure concepts and working on states instead of processes, it provides easy definition of clusters, including self healing capabilities out of the box. In short, Kubernetes makes management of Docker fleets easier.

About the Author

Carlos Sanchez has been working on automation and quality of software development, QA and operations processes for over 10 years, from build tools and continuous integration to deployment automation, DevOps best practices and continuous delivery. He has delivered solutions to Fortune 500 companies, working at several US based startups, most recently MaestroDev, a company he cofounded. Carlos has been a speaker at several conferences around the world, including JavaOne, EclipseCON, ApacheCON, JavaZone, Fosdem or PuppetConf. Very involved in open source, he is a member of the Apache Software Foundation amongst other open source groups, contributing to several projects, such as Apache Maven, Fog or Puppet.


Data Warehouse and Analytics Infrastructure at Viki

At Viki, we use data to power product, marketing and business decisions. We use an in-house analytics dashboard to expose all the data we collect to various teams through simple table and chart based reports. This allows them to monitor all our high level KPIs and metrics regularly.

Data also powers more heavy duty stuff – like our data-driven content recommendation system, or predictive models that help us forecast the value of content we’re looking to license. We’re also constantly looking at data to determine the success of new product features, tweak and improve existing features and even kill stuff that doesn’t work. All of this makes data an integral part of the decision making process at Viki.

To support all these functions, we need a robust infrastructure below it, and that’s our data warehouse and analytics infrastructure.

This post is the first of a series about our data warehouse and analytics infrastructure. In this post we’ll cover the high-level pipeline of the system and goes into details about how we collect and batch-process our data. Do expect a lot of detailed-level discussions.

About Viki: We’re a online TV site, with fan-powered translations in 150+ different languages. To understand more what Viki is, watch this short video (2 minutes).

Part 0: Overview (or TL;DR)

Our analytics infrastructure, following most common sense approach, is broken down into 3 steps:

  • Collect and Store Data
  • Process Data (batch + real-time)
  • Present Data


Collect and Store Data

  1. Logs (events) are sent by different clients to a central log collector
  2. Log collector forward events to a Hydration service, here the events get enriched with more time-sensitive information; the results are stored to S3

Batch Processing Data

  1. There is an hourly job that runs to take data from S3, apply further transformations (read: cleaning up bad data) and store the results to our cloud-based Hadoop cluster
  2. We run multiple MapReduce (Hive) jobs to aggregate data from Hadoop and write them to our central analytics database (Postgres)
  3. Another job takes a snapshot of our production databases and restore into our analytics database

Presenting Data

  1. All analytics/reporting-related activities are then done at our master analytics database. The results (meant for report presentation) are then sent to our Reporting DB
  2. We run an internal reporting dashboard app on top of our Reporting DB; this is where end-users log in to see their reports

Real-time Data Processing

  1. The data from Hydration service is also multiplexed and sent to our real-time processing pipeline (using Apache Storm)
  2. At Storm, we write custom job that does real-time aggregation of important metrics. We also write a real-time alerting system to inform ourselves when traffic goes bad

Part 1: Collecting, Pre-processing and Storing Data


We use fluentd to receive event logs from different platforms (web, Android, iOS, etc) (through HTTP). We set up a cluster of 2 dedicated servers running multiple fluentd instances inside Docker, load-balanced through HAproxy. When the message hits our endpoint, fluentd then buffers the messages and batch-forward them to our Hydration System. At the moment, our cluster is doing 100M messages a day.

A word about fluentd: It’s a robust open-source log collecting software that has a very healthy and helpful community around it. The core is written in C so it’s fast and scalable, with plugins written in Ruby, making it easy to extend.

What data do we collect? We collect everything that we think is useful for the business: a click event, an ad impression, ad request, a video play, etc. We call each record an event, and it’s stored as a JSON string like this:

  "time":1380452846, "event": "video_play",
  "video_id":"1008912v", "user_id":"5298933u",
  "app_id":"100004a", "app_ver":"",
  "device":"iPad", "stream_quality":"720p",
  "ip":"", "country":"ca",
  "city_name":"Toronto", "region_name":"ON"

Pre-processing Data – Hydration Service


Data Hydration ServiceThe data after collected is sent to a Hydration Service for pre-processing. Here, the message is enriched with time-sensitive information.

For example: When a user watches a video (thus a video_play event sent), we want to know if it’s a free user or a paid user. Since the user could be a free user today and upgrade to paid tomorrow, the only way to correctly attribute the play event to free/paid bucket is to inject that status right right into the message when it’s received. In short, the service translates this:

{ "event":"video_play", "user_id":"1234" }

into this:

{ "event":"video_play", "user_id":"1234", "user_status":"free" }

For non time-sensitive operations (fixing typo, getting country from IP, etc), there is different process for that (discussed below)

Storing to S3

From the Hydration Service, the message is buffered and then stored to S3 – our source of truth. The data is gzip-compressed and stored into hour bucket, making it easy and fast to retrieve them per hour time-period.

Part 2: Batch-processing Data

The processing layer has 2 components, the batch-processing and the real-time processing component.

This section focuses mainly on our batch-processing layer – our main process of transforming data for reporting/presentation. We’ll cover our real-time processing layer in another post.


Batch-processing Data Layer

Cleaning Data Before Importing Into Hadoop

Those who have worked with data before know this: Cleaning data takes a lot of time. In fact: Cleaning and preparing data will take most of your time. Not the actual analysis.

What is unclean data (or bad data)? Data that is logged incorrectly. It comes in a lot of different forms. E.g

  • Typo mistake, send ‘clickk’ event instead of ‘click’
  • Clients send event twice, or forgot to send event

When bad data enters your system, it stays there forever, unless you purposely find a way to clean it/take it out

So how do we clean up bad data?

Previously when we receive a record, we write directly to our Hadoop cluster and make a backup to S3. This makes it difficult to correct the bad data, due to the append-only nature of Hadoop.

Now all the data is first stored into S3. And we have hourly process that takes data from S3, apply cleanup/transformations and load them into Hadoop (insert-overwrite).


Storing Data, Before and AfterThe process is similar in nature to the hydration process, but this time we look at 1 hour block at a time, rather than per record. This approach has many great benefits:

  • The pipeline is more linear, thus prevents from the threat of data discrepancy (between S3 and Hadoop).
  • The data is not tied down to being stored in Hadoop. If we want to load our data into other data storage, we’d just write another process that transform S3 data and dump somewhere else.
  • When a bad logging happens causing unclean data, we can modify the transformation code and rerun the data from the point of bad logging. Because the process isidempotent, we can perform the reprocessing as many times as we want without double-logging the data.


Our S3 to Hadoop Transform and Load ProcessIf you’ve studied this article The Log from the Data Engineering folks at LinkedIn, you’d notice that the approach is very similar (replacing Kafka with S3, and per-message processing with per hour processing). Indeed our paradigm is inspired by The Log architecture. However due to our needs, our system chose S3 because:

  1. When it comes to batch (re)processing, we do it in a time-period manner (eg. process 1 hour of data). Kafka use natural number to order message, thus if we use Kafka we’ll have to build another service to translate [beg_timestamp, end_timestamp) into [beg_index, end_index).
  2. Kafka can only retain up to X days due to the disk-space limitation. As we want the ability to reprocess data further back, employing Kafka means we need another strategy to cater to these cases.

Aggregating Data from Hadoop into Postgres

Once the data got into Hadoop, we’d have these daily aggregation jobs to aggregate data into fewer dimensions and port them into our analytics master database (PostgreSQL)

For example, to aggregate a table of video starts data together with some video and user information, we run this Hive query (MapReduce job):

-- The hadoop's events table would contain 2 fields: time (int), v (json)
  SUBSTR(FROM_UNIXTIME(time), 0, 10) AS date_d,
  v['platform'] AS platform,
  v['country'] AS country,
  v['video_id'] AS video_id,
  v['user_id'] AS user_id
  COUNT(1) AS cnt
FROM events
WHERE time >= BEG_TS
  AND time <= END_TS
  AND v['event'] = 'video_start'
GROUP BY 1,2,3,4,5

and load the results into an aggregated.video_starts table in Postgres:

       Table "aggregated.video_starts"
   Column    |          Type          | Modifiers 
 date_d      | date                   | not null
 platform    | character varying(255) | 
 country     | character(3)           | 
 video_id    | character varying(255) | 
 user_id     | character varying(255) | 
 cnt         | bigint                 | not null

Further querying and reporting of video_starts will be done out of this table. If we need more dimensions, we either rebuild this table with more dimensions, or build a new table from Hadoop.

If it’s a one-time ad-hoc analysis, we’d just run the queries directly against Hadoop.

Table Partitioning:



Also, we’re making use of Postgres’ Table Inheritance feature to partition our data into multiple monthly tables with a parent table on top of all. Your query just needs to hit the parent table and the engine will know which underlying monthly tables to hit to get your data.

This makes our data very easy to maintain, with small indexes and better rebuild process. We’d have fast (SSD) drives that host the recent tables, and move the older ones to slower (but bigger) drives for semi-archiving purpose.

Centralizing All Data



We dump our production databases into our master analytics database on a daily basis.

Also, we use a lot of 3rd party vendors (Adroll, DoubleClick, Flurry, GA, etc). For each of these services, we write a process to ping their API and import the data into our master analytics database.

These data, together with the aggregated data from Hadoop, allow us to produce meaningful analysis combining data from multiple sources.

For example, to break down our video starts by different genre, we would write some query that joins data from prod.videos table with aggregated.video_starts table:

-- Video Starts by genre
SELECT V.genre, SUM(cnt) FROM aggregated.video_starts VS
LEFT JOIN prod.videos V ON VS.video_id =
WHERE VS.date_d = '2014-06-01'

The above is made possible because we have both sources of data (event tracking data + production data) in 1 place.

Centralizing data is a very important concept in our pipeline because it makes it simple and pain-free for us to connect, report and corroborate our numbers across many different data sources.

Managing Job Dependencies

We started with simple crontab to schedule our hourly/daily jobs. When the jobs grew complicated, we’d end up with very long crontab:


Crontab also doesn’t support graph-based job flow (e.g run A and B at the sametime, when both finishes, run C)


So we looked around for a solution. We considered Chronos (by Airbnb), but their use-case is more complicated than what we needed, plus the need to setup ZooKeeper and all that.

We ended up using Azkaban by LinkedIn, it has everything we need: Crontab with graph-based job flow, it also tell you the runtime history of your job. And when a job flow fails, you can restart them, running only tasks that failed/haven’t run.

It’s pretty awesome.

Making Sure Your Numbers Tie

One of the things I see being less discussed in analytics infrastructure talk/blog is making sure your data don’t drop half-way during transportation, resulting in data inconsistency in different storages.

We have a process that runs after every data transportation, it counts number of records in both source and destination storage and prints errors when they don’t match. These check-total processes sound a little tedious to do, but it proved crucial to our system; it gives us the confidence in the accurary of the numbers we report to management.

Case in point, we had a process that dumps data from Postgres to CSV, then compresses and uploads to S3 and loads them into Amazon Redshift (using COPY command). So technically we have the exact same table in both Postgres and Redshift. One day our analyst pointed out that the data in Redshift is significantly less than in Postgres. Upon investigation, there was bug that cause CSV file to be truncated and thus not fully loaded into Redshift tables. It was because for this particular process we didn’t have the check-totals in place.

Using Ruby as the Scripting Language

When we started we are primarily a Ruby shop, so going ahead with Ruby was a natural choice. “Isn’t it slow?”, you might say. But we use Ruby not to process data (i.e it rarely holds any large amount of data), but to facilitate and coordinate the process.

We have written an entire library to support doing data pipeline in Ruby. For example, we extended pg gem to make it more object-oriented to Postgres. It allows us to do table creation, table hot-swapping, upsert, insert-overwriting, copying tables between databases, etc, all without having to touch SQL code. It has become a nice, productive abstraction on top of SQL and Postgres. Think ORM for data-warehousing purpose.

Example: The below code will create a data.videos_by_genre table holding the result of a simple aggregation query. The process works on a temporary table and eventually it’ll perform a table hot-swap with the main one; this is to avoid any data disruption being made if we would have done it on the main table from the beginning.

columns = [
  {name: 'genre', data_type: 'varchar'},
  {name: 'video_count', data_type: 'int'},
indexes = [{columns: ['genre']}]
table = 'data.videos_by_genre', columns, indexes

table.with_temp_table do |temp_t|
  temp_t.drop(check_exists: true)
  connection.exec <<-SQL.strip_heredoc
    INSERT INTO #{temp_t.qualified_name}
    SELECT genre, COUNT(1) FROM prod.videos
    GROUP BY 1


(the above example could also be done using a MATERIALIZED VIEW btw)

Having this set of libraries has proven very critical to our data pipeline process, since it allows us to write extensible and maintainable code that perform all sort of data transformations.


We rely mostly on free and open-source technologies. Our stack is:

  • fluentd (open-source) for collecting logs
  • Cloud-based Hadoop + Hive (TreasureData – 3rd party vendor)
  • PostgreSQL + Amazon Redshift as central analytics database
  • Ruby as scripting language
  • NodeJS (worker process) with Redis (caching)
  • Azkaban (job flow management)
  • Kestrel (message queue)
  • Apache Storm (real-time stream processing)
  • Docker for automated deployment
  • HAproxy (load balancing)
  • and lots of SQL (huge thanks to Postgres, one of the best relational databases ever made)


The above post went through the overall architecture of our analytics system. It also went into details the Collecting layer and Batch-processing layer. In later blog posts we’ll cover the remaining, specifically:

  • Our Data Presentation layer. And how Stuti, our analyst, built our funnel analysis, fan-in and fan-out tools, all with SQL. And it updates automatically (very funnel. wow!)
  • Our Real-time traffic alert/monitoring system (using Apache Storm)
  • turing: our feature roll-out, A/B testing framework

( Via Engineering. )

Nifty Architecture Tricks From Wix – Building A Publishing Platform At Scale


Wix operates websites in the long tale. As a HTML5 based WYSIWYG web publishing platform, they have created over 54 million websites, most of which receive under 100 page views per day. So traditional caching strategies don’t apply, yet it only takes four web servers to handle all the traffic. That takes some smart work.

Aviran Mordo, Head of Back-End Engineering at Wix, has described their solution in an excellent talk: Wix Architecture at Scale. What they’ve developed is in the best tradition of scaling is specialization. They’ve carefully analyzed their system and figured out how to meet their aggressive high availability and high performance goals in some most interesting ways.

Wix uses multiple datacenters and clouds. Something I haven’t seen before is that they replicate data to multiple datacenters, to Google Compute Engine, and to Amazon. And they have fallback strategies between them in case of failure.

Wix doesn’t use transactions. Instead, all data is immutable and they use a simple eventual consistency strategy that perfectly matches their use case.

Wix doesn’t cache (as in a big caching layer). Instead, they pay great attention to optimizing the rendering path so that every page displays in under 100ms.

Wix started small, with a monolithic architecture, and has consciously moved to a service architecture using a very deliberate process for identifying services that can help anyone thinking about the same move.

This is not your traditional LAMP stack or native cloud anything. Wix is a little different and there’s something here you can learn from. Let’s see how they do it…


  • 54+ million websites, 1 million new websites per month.

  • 800+ terabytes of static data, 1.5 terabytes of new files per day

  • 3 data centers + 2 clouds (Google, Amazon)

  • 300 servers

  • 700 million HTTP requests per day

  • 600 people total, 200 people in R&D

  • About 50 services.

  • 4 public servers are needed to serve 45 million websites


  • MySQL

  • Google and Amazon clouds

  • CDN

  • Chef


  • Simple initial monolithic architecture. Started with one app server. That’s the simplest way to get started. Make quick changes and deploy. It gets you to a particular point.

    • Tomcat, Hibernate, custom web framework

    • Used stateful logins.

    • Disregarded any notion of performance and scaling.

  • Fast forward two years.

    • Still one monolithic server that did everything.

    • At a certain scale of developers and customers it held them back.

    • Problems with dependencies between features. Changes in one place caused deployment of the whole system. Failure in unrelated areas caused system wide downtime.

  • Time to break the system apart.

    • Went with a services approach, but it’s not that easy. How are you going to break functionality apart and into services?

    • Looked at what users are doing in the system and identified three main parts: edit websites, view sites created by Wix, serving media.

    • Editing web sites includes data validation of data from the server, security and authentication, data consistency, and lots of data modification requests.

    • Once finished with the web site users will view it. There are 10x more viewers than editors. So the concerns are now:

      • high availability. HA is the most important feature because it’s the user’s business.

      • high performance

      • high traffic volume

      • the long tail. There are a lot of websites, but they are very small. Every site gets maybe 10 or 100 page views a day. The long tail make caching not the go to scalability strategy. Caching becomes very inefficient.

    • Media serving is the next big service. Includes HTML, javascript, css, images. Needed a way to serve files the 800TB of data under a high volume of requests. The win is static content is highly cacheable.

    • The new system looks like a networking layer that sits below three segment services: editor segment (anything that edits data), media segment (handles static files, read-only), public segment (first place a file is viewed, read-only).

Guidelines For How To Build Services

  • Each service has its own database and only one service can write to a database.

  • Access to a database is only through service APIs. This supports a separation of concerns and hiding the data model from other services.

  • For performance reasons read-only access is granted to other services, but only one service can write. (yes, this contradicts what was said before)

  • Services are stateless. This makes horizontal scaling easy. Just add more servers.

  • No transactions. With the exception of billing/financial transactions, all other services do not use transactions. The idea is to increase database performance by removing transaction overhead. This makes you think about how the data is modeled to have logical transactions, avoiding inconsistent states, without using database transactions.

  • When designing a new service caching is not part of the architecture. First, make a service as performant as possible, then deploy to production, see how it performs, only then, if there are performance issues, and you can’t optimize the code (or other layers), only then add caching.

Editor Segment

  • Editor server must handle lots of files.

  • Data stored as immutable JSON pages (~2.5 million per day) in MySQL.

  • MySQL is a great key-value store. Key is based on a hash function of the file so the key is immutable. Accessing MySQL by primary key is very fast and efficient.

  • Scalability is about tradeoffs. What tradeoffs are we going to make? Didn’t want to use NoSQL because they sacrifice consistency and most developers do not know how to deal with that. So stick with MySQL.

  • Active database. Found after a site has been built only 6% were still being updated. Given this then these active sites can be stored in one database that is really fast and relatively small in terms of storage (2TB).

  • Archive database. All the stale site data, for sites that are infrequently accessed, is moved over into another database that is relatively slow, but has huge amounts of storage. After three months data is pushed to this database is accesses are low. (one could argue this is an implicit caching strategy).

  • Gives a lot of breathing room to grow. The large archive database is slow, but it doesn’t matter because the data isn’t used that often. On first access the data comes from the archive database, but then it is moved to the active database so later accesses are fast.

High Availability For Editor Segment

  • With a lot of data it’s hard to provide high availability for everything. So look at the critical path, which for a website is the content of the website. If a widget has problems most of the website will still work. Invested a lot in protecting the critical path.

  • Protect against database crashes. Want to recover quickly. Replicate databases and failover to the secondary database.

  • Protect against data corruption and data poisoning.  Doesn’t have to be malicious, a bug is enough to spoil the barrel. All data is immutable. Revisions are stored for everything. Worst case  if corruption can’t be fixed is to revert to version where the data was fine.

  • Protect against unavailability. A website has to work all the time. This drove an investment in replicating data across different geographical locations and multiple clouds. This makes the system very resilient.

    • Clicking save on a website editing session sends a JSON file to the editor server.

    • The server sends the page to the active MySQL server which is replicated to another datacenter.

    • After the page is saved to locally, an asynchronous process is kicked upload the data to a static grid, which is the Media Segment.

    • After data is uploaded to the static grid, a notification is sent to a archive service running on the Google Compute Engine. The archive goes to the grid, downloads a page, and stores a copy on the Google cloud.

    • Then a notification is sent back to the editor saying the page was saved to GCE.

    • Another copy is saved to Amazon from GCE.

    • One the final notification is received it means there are three copies of the current revision of data: one in the database, the static grid, and on GCE.

    • For the current revision there are three copies. For old revision there two revisions (static grid, GCE).

    • The process is self-healing. If there’s a failure the next time a user updates their website everything that wasn’t uploaded will be uploaded again.

    • Orphan files are garbage collected.

Modeling Data With No Database Transactions

  • Don’t want a situation where a user edit two pages and only one page is saved in the database, which is an inconsistent state.

  • Take all the JSON files and stick them in the database one after the other. When all the files are saved another save command is issued which contains a manifest of all the IDs (which is hash of the content which is the file name on the static server) of the saved pages that were uploaded to the static servers.

Media Segment

  • Stores lots of files. 800TB of user media files, 3M files uploaded daily, and 500M metadata records.

  • Images are modified. They are resized for different devices and sharpened. Watermarks can be inserted and there’s also audio format conversion.

  • Built an eventually consistent distributed file system that is multi datacenter aware with automatic fallback across DCs. This is before Amazon.

  • A pain to run. 32 servers, doubling the number every 9 months.

  • Plan to push stuff to the cloud to help scale.

  • Vendor lock-in is a myth. It’s all APIs. Just change the implementation and you can move to different clouds in weeks.

  • What really locks you down is data. Moving 800TB of data to a different cloud is really hard.

  • They broke Google Compute Engine when they moved all their data into GCE. They reached the limits of the Google cloud. After some changes by Google it now works.

  • Files are immutable so the are highly cacheable.

  • Image requests first go to a CDN. If the image isn’t in the CDN the request goes to their primary datacenter in Austin. If the image isn’t in Austin the request then goes to Google Cloud. If it’s not in Google cloud it goes to a datacenter in Tampa.

Public Segment

  • Resolve URLs (45 million of them), dispatch to the appropriate renderer, and then render into HTML, sitemap XML, or robots TXT, etc.

  • Public SLA is that response time is < 100ms at peak traffic. Websites have to be available, but also fast. Remember, no caching.

  • When a user clicks publish after editing a page, the manifest, which contains references to pages, are pushed to Public. The routing table is also published.

  • Minimize out-of-service hops. Requires 1 database call to resolve the route. 1 RPC call to dispatch the request to the renderer. 1 database call to get the site manifest.

  • Lookup tables are cached in memory and are updated every 5 minutes.

  • Data is not stored in the same format as it is for the editor. It is stored in a denormalized format, optimized for read by primary key. Everything that is needed is returned in a single request.

  • Minimize business logic. The data is denormalized and precalculated. When you handle large scale every operation, every millisecond you add, it’s times 45 million, so every operation that happens on the public server has to be justified.

  • Page rendering.

    • The html returned by the public server is bootstrap html. It’s a shell with JavaScript imports and JSON data with references to site manifest and dynamic data.

    • Rendering is offloaded to the client. Laptops and mobile devices are very fast and can handle the rendering.

    • JSON was chosen because it’s easy to parse and compressible.

    • It’s easier to fix bugs on the client. Just redeploy new client code. When rendering is done on the server the html will be cached, so fixing a bug requires re-rendering millions of websites again.

High Availability For Public Segment

  • Goal is to be always available, but stuff happens.

  • On a good day: a browser makes a request, the request goes to a datacenter, through a load balancer, goes to a public server, resolves the route, goes to the renderer, the html goes back to the browser, and the browser runs the javascript. The javascript fetches all media files and the JSON data and renders a very beautiful web site. The browser then make a request to the Archive service. The Archive service replays the request in the same way the browser does and stores the data in a cache.

  • On a bad day a datacenter is lost, which did happen. All the UPSs died and the datacenter was down. The DNS was changed and then all the requests went to the secondary datacenter.

  • On a bad day Public is lost. This happened once when a load balancer got half of a configuration so all the Public servers were gone. Or a bad version can be deployed that starts returning errors. Custom code in the load balancer handles this problem by routing to the Archive service to fetch the cached if the Public servers are not available. This approach meant customers were not affected when Public went down, even though the system was reverberating with alarms at the time.

  • On a bad day the Internet sucks. The browser makes a request, goes to the datacenter, goes to the load balancer, gets the html back. Now the JavaScript code has to fetch all the pages and JSON data. It goes to the CDN, it goes to the static grid and fetches all the JSON files to render the site. In these processes Internet problems can prevent files from being returned. Code in JavaScript says if you can’t get to the primary location, try and get it from the archive service, if that fails try the editor database.

Lessons Learned

  • Identify your critical path and concerns. Think through how your product works. Develop usage scenarios. Focus your efforts on these as they give the biggest bang for the buck.

  • Go multi-datacenter and multi-cloud. Build redundancy on the critical path (for availability).

  • De-normalize data and Minimize out-of-process hops (for performance). Precaluclate and do everything possible to minimize network chatter.

  • Take advantage of client’s CPU power. It saves on your server count and it’s also easier to fix bugs in the client.

  • Start small, get it done, then figure out where to go next. Wix did what they needed to do to get their product working. Then they methodically moved to a sophisticated services architecture.

  • The long tail requires a different approach. Rather than cache everything Wix chose to optimize the heck out of the render path and keep data in both an active and archive databases.

  • Go immutable. Immutability has far reaching consequences for an architecture. It affects everything from the client through the back-end. It’s an elegant solution to a lot of problems.

  • Vendor lock-in is a myth. It’s all APIs. Just change the implementation and you can move to different clouds in weeks.

  • What really locks you down is data. Moving lots of data to a different cloud is really hard.

( Via )