NYTimes Architecture: No Head, No Master, No Single Point Of Failure

Michael Laing, a Systems Architect at NYTimes, gave this great description of their use of RabbitMQ and their overall architecture on the RabbitMQ mailing list. The closing sentiment marks this as definitely an architecture to learn from:

Although it may seem complex, nytimes architecture has simple components and is mostly principles and plumbing. The key point to grasp is that there is no head, no master, no single point of failure. As I write this I can see components failing (not RabbitMQ), and we are fixing them so they are more reliable. But the system doesn’t fail, users can connect, and messages are delivered, regardless – all within design parameters.

Since it’s short, to the point, and I couldn’t say it better, I’ll just reproduce two of the email list posts here:

Just a quick note and thank you to the RabbitMQ team for a great product.

Our premier online offering http://www.nytimes.com has a new look and new underpinnings, now including a messaging architecture implemented using RabbitMQ.

This architecture – nytimes architecture – has dozens of RabbitMQ instances spread across 6 AWS zones in Oregon and Dublin. The instances are organized into “wholesale” and “retail” layers. Connection to clients is via websockets/sockjs.

Upon launch today, the system autoscaled to ~500,000 users. Connection times remained flat at ~200ms.

nytimes architecture provides subscription services for breaking news, video feeds, etc. and will add more event based services. It also supports individual messaging related to subscription status for registered users.

This system would not have been possible without RabbitMQ. It was the one component, used everywhere, that never faltered or failed.

We are using: a single Amazon Linux AMI, RabbitMQ, Cassandra 2, python 2. We use pika with tornado and libev for the nytimes architecture wholesale and retail pieces; our internal clients use Java and PHP.

We use shovels – lots of shovels – to interconnect.

In production we have a RabbitMQ client 3-cluster and a core 3-cluster in each region on c1-xlarges. A proxy cluster of c1-mediums in Virginia connects clients to the client clusters. All services are parallelized so we can add more cores and clients.

The retail layer autoscales and uses c1-mediums with a single rabbit shovel-connected to one of the core rabbits. Each python websocket/sockjs gateway supports up to 100K clients.

We autodeploy into subnets within Virtual Private Clouds in AWS. Clients are routed via least latency to the fastest healthy region.
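(An aside of my own, not part of Michael's post: the post does not say how the least-latency routing is done, so the following is only a minimal sketch of the idea. The region names come from the post; the latency figures, the health flags and the `pick_region` helper are assumptions made up for illustration.)

```python
def pick_region(regions):
    """Return the name of the healthy region with the lowest latency.

    `regions` is a list of dicts with hypothetical keys:
    name, latency_ms, healthy. Returns None if nothing is healthy.
    """
    # Unhealthy regions are never candidates, however fast they look.
    healthy = [r for r in regions if r["healthy"]]
    if not healthy:
        return None
    # Among the healthy ones, the lowest measured latency wins.
    return min(healthy, key=lambda r: r["latency_ms"])["name"]


# Illustrative measurements only; not real numbers from the post.
regions = [
    {"name": "oregon", "latency_ms": 180.0, "healthy": True},
    {"name": "dublin", "latency_ms": 95.0, "healthy": False},
]
choice = pick_region(regions)
```

Here Dublin is faster but marked unhealthy, so the client ends up in Oregon. Health comes first and latency only breaks the tie among healthy regions.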

Of the technical components, the gateway is the most complex. We will be moving it into open source in pieces and the first piece is likely to be the python websocket/sockjs libraries which, frankly, beat the crap out of most other stuff out there and fully conform with the relevant standards. It can be loosely thought of as a C co-process managed by python, and as such, may be possible to reuse in other languages/environments.

We have a 12-node Cassandra cluster across the 2 regions / 6 zones. It is used for persistence of messages and as cache. We do not use persistence in RabbitMQ. Our services are idempotent and important messages may be replicated multiple times creating intentional race conditions in which the fastest wins.
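(Another aside of my own, not part of Michael's post: a rough sketch of why idempotent services make those intentional races safe. It assumes every message carries a unique id, which the post does not state, and it uses an in-memory set where a real system would need a shared store. The `IdempotentHandler` class and the path names are hypothetical.)

```python
class IdempotentHandler:
    """Applies each message at most once, keyed by message id."""

    def __init__(self):
        self.seen = set()      # ids already handled (stand-in for a shared store)
        self.delivered = []    # (payload, path) pairs that were actually applied

    def handle(self, message_id, payload, path):
        """Return True if this copy won the race, False if it was a duplicate."""
        if message_id in self.seen:
            return False       # a faster copy already arrived; drop this one
        self.seen.add(message_id)
        self.delivered.append((payload, path))
        return True


handler = IdempotentHandler()

# The same message replicated over three hypothetical paths.
# Arrival order is the race: whichever copy shows up first wins.
arrivals = [
    ("msg-1", "Breaking news", "path-b"),
    ("msg-1", "Breaking news", "path-a"),
    ("msg-2", "Video feed update", "path-a"),
    ("msg-1", "Breaking news", "path-c"),
]
results = [handler.handle(*arrival) for arrival in arrivals]
```

Only the first copy of `msg-1` is applied and the two slower copies are dropped. Sending duplicates therefore costs some bandwidth but never changes the outcome.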

Although it may seem complex, nytimes architecture has simple components and is mostly principles and plumbing. The key point to grasp is that there is no head, no master, no single point of failure. As I write this I can see components failing (not RabbitMQ), and we are fixing them so they are more reliable. But the system doesn’t fail, users can connect, and messages are delivered, regardless – all within design parameters.

(Via HighScalability.com)
