Microway Rolls out Octoputer Servers with up to 8 GPUs

Today Microway announced a new line of servers designed for GPU and storage density. As part of the announcement, the company’s new OctoPuter GPU servers pack 34 TFLOPS of computing power when paired with up to up to eight NVIDIA Tesla K40 GPU accelerators.

NVIDIA GPU accelerators offer the fastest parallel processing power available, but this requires high-speed access to the data. Microway’s newest GPU computing solutions ensure that large amounts of source data are retained in the same server as a high-density of Tesla GPUs. The result is faster application performance by avoiding the bottleneck of data retrieval from network storage,” said Stephen Fried, CTO of Microway.

Microway also introduced an additional NumberSmasher 1U GPU server housing up to three NVIDIA Tesla K40 GPU accelerators. With nearly 13 TFLOPS of computing power, the NumberSmasher includes up to 512GB of memory, 24 x86 compute cores, hardware RAID, and optional InfiniBand.

Octoputer_Tesla1000px-434x400

(Via InsideHPC.com)

The WhatsApp Architecture

WhatsApp stats: What has hundreds of nodes, thousands of cores, hundreds of terabytes of RAM, and hopes to serve the billions of smartphones that will soon be a reality around the globe? The Erlang/FreeBSD-based server infrastructure at WhatsApp. We’ve faced many challenges in meeting the ever-growing demand for our messaging services, but as we continue to push the envelope on size (>8000 cores) and speed (>70M Erlang messages per second) of our serving system.

A warning here, we don’t know a lot about the WhatsApp over all architecture. Just bits and pieces gathered from various sources. Rick Reed’s main talk is about the optimization process used to get to 2 million connections a server while using Erlang, which is interesting, but it’s not a complete architecture talk.

Stats

These stats are generally for the current system, not the system we have a talk on. The talk on the current system will include more on hacks for data storage, messaging, meta-clustering, and more BEAM/OTP patches.

  • 450 million active users, and reached that number faster than any other company in history.

  • 32 engineers, one developer supports 14 million active users

  • 50 billion messages every day across seven platforms (inbound + outbound)

  • 1+ million people sign up every day

  • $0 invested in advertising

  • $60 million investment from Sequoia Capital; $3.4 billion is the amount Sequoia will make

  • 35% is how much of Facebook’s cash is being used for the deal
  • Hundreds of nodes

  • >8000 cores

  • Hundreds of terabytes of RAM

  • >70M Erlang messages per second

  • In 2011 WhatsApp achieved 1 million established tcp sessions on a single machine with memory and cpu to spare. In 2012 that was pushed to over 2 million tcp connections. In 2013 WhatsApp tweeted out: On Dec 31st we had a new record day: 7B msgs inbound, 11B msgs outbound = 18 billion total messages processed in one day! Happy 2013!!!

Platform

Backend

  • Erlang

  • FreeBSD

  • Yaws, lighttpd

  • PHP

  • Custom patches to BEAM (BEAM is like Java’s JVM, but for Erlang)

  • Custom XMPP

  • Hosting may be in Softlayer

Frontend

  • Seven client platforms: iPhone, Android, Blackberry, Nokia Symbian S60, Nokia S40, Windows Phone, ?

  • SQLite

Hardware

  • Standard user facing server:

    • Dual Westmere Hex-core (24 logical CPUs);

    • 100GB RAM, SSD;

    • Dual NIC (public user-facing network, private back-end/distribution);

Product

  • Focus is on messaging. Connecting people all over the world, regardless of where they are in the world, without having to pay a lot of money. Founder Jan Koum remembers how difficult it was in 1992 to connect to family all over the world.

  • Privacy. Shaped by Jan Koum’s experiences growing up in the Ukraine, where nothing was private. Messages are not stored on servers; chat history is not stored; goal is to know as little about users as possible; your name and your gender are not known; chat history is only on your phone.

General

  • WhatsApp server is almost completely implemented in Erlang.

    • Server systems that do the backend message routing are done in Erlang.

    • Great achievement is that the number of active users is managed with a really small server footprint. Team consensus is that it is largely because of Erlang.

    • Interesting to note Facebook Chat was written in Erlang in 2009, but they went away from it because it was hard to find qualified programmers.

  • WhatsApp server has started from ejabberd

    • Ejabberd is a famous open source Jabber server written in Erlang.

    • Originally chosen because its open, had great reviews by developers, ease of start and the promise of Erlang’s long term suitability for large communication system.

    • The next few years were spent re-writing and modifying quite a few parts of ejabberd, including switching from XMPP to internally developed protocol, restructuring the code base and redesigning some core components, and making lots of important modifications to Erlang VM to optimize server performance.

  • To handle 50 billion messages a day the focus is on making a reliable system that works. Monetization is something to look at later, it’s far far down the road.

  • A primary gauge of system health is message queue length. The message queue length of all the processes on a node is constantly monitored and an alert is sent out if they accumulate backlog beyond a preset threshold. If one or more processes falls behind that is alerted on, which gives a pointer to the next bottleneck to attack.

  • Multimedia messages are sent by uploading the image, audio or video to be sent to an HTTP server and then sending a link to the content along with its Base64 encoded thumbnail (if applicable).

  • Some code is usually pushed every day. Often, it’s multiple times a day, though in general peak traffic times are avoided. Erlang helps being aggressive in getting fixes and features into production. Hot-loading means updates can be pushed without restarts or traffic shifting. Mistakes can usually be undone very quickly, again by hot-loading. Systems tend to be much more loosely-coupled which makes it very easy to roll changes out incrementally.

  • What protocol is used in Whatsapp app? SSL socket to the WhatsApp server pools. All messages are queued on the server until the client reconnects to retrieve the messages. The successful retrieval of a message is sent back to the whatsapp server which forwards this status back to the original sender (which will see that as a “checkmark” icon next to the message). Messages are wiped from the server memory as soon as the client has accepted the message

  • How does the registration process work internally in Whatsapp? WhatsApp used to create a username/password based on the phone IMEI number. This was changed recently. WhatsApp now uses a general request from the app to send a unique 5 digit PIN. WhatsApp will then send a SMS to the indicated phone number (this means the WhatsApp client no longer needs to run on the same phone). Based on the pin number the app then request a unique key from WhatsApp. This key is used as “password” for all future calls. (this “permanent” key is stored on the device). This also means that registering a new device will invalidate the key on the old device.

  • Google’s push service is used on Android.

  • More users on Android. Android is more enjoyable to work with. Developers are able to prototype a feature and push it out to hundreds of millions of users overnight, if there’s an issue it can be fixed quickly. iOS, not so much.

The Quest For 2+ Million Connections Per Server

  • Experienced lots of user growth, which is a good problem to have, but it also means having to spend money buying more hardware and increased operational complexity of managing all those machines.

  • Need to plan for bumps in traffic. Examples are soccer games and earthquakes in Spain or Mexico. These happen near peak traffic loads, so there needs to be enough spare capacity to handle peaks + bumps. A recent soccer match generated a 35% spike in outbound message rate right at the daily peak.

  • Initial server loading was 200 simultaneous connections per server.

    • Extrapolated out would mean a lot of servers with the hoped for growth pattern.

    • Servers were brittle in the face of burst loads. Network glitches and other problems would occur. Needed to decouple components so things weren’t so brittle at high capacity.

    • Goal was a million connections per server. An ambitious goal given at the time they were running at 200K connections. Running servers with headroom to allow for world events, hardware failures, and other types of glitches would require enough resilience to handle the high usage levels and failures.

Tools And Techniques Used To Increase Scalability

  • Wrote system activity reporter tool (wsar):

    • Record system stats across the system, including OS stats, hardware stats, BEAM stats. It was build so it was easy to plugin metrics from other systems, like virtual memory. Track CPU utilization, overall utilization, user time, system time, interrupt time, context switches, system calls, traps, packets sent/received, total count of messages in queues across all processes, busy port events, traffic rate, bytes in/out, scheduling stats, garbage collection stats, words collected, etc.

    • Initially ran once a minute. As the systems were driven harder one second polling resolution was required because events that happened in the space if a minute were invisible. Really fine grained stats to see how everything is performing.

  • Hardware performance counters in CPU (pmcstat):

    • See where the CPU is at as a percentage of time. Can tell how much time is being spent executing the emulator loop. In their case it is 16% which tells them that only 16% is executing emulated code so even if you were able to remove all the execution time of all the Erlang code it would only save 16% out of the total runtime. This implies you should focus in other areas to improve efficiency of the system.

  • dtrace, kernel lock-counting, fprof

    • Dtrace was mostly for debugging, not performance.

    • Patched BEAM on FreeBSD to include CPU time stamp.

    • Wrote scripts to create an aggregated view of across all processes to see where routines are spending all the  time.

    • Biggest win was compiling the emulator with lock counting turned on.

  • Some Issues:

    • Earlier on saw more time spent in the garbage collections routines, that was brought down.

    • Saw some issues with the networking stack that was tuned away.

    • Most issues were with lock contention in the emulator which shows strongly in the output of the lock counting.

  • Measurement:

    • Synthetic workloads, which means generating traffic from your own test scripts, is of little value for tuning user facing systems at extreme scale.

      • Worked well for simple interfaces like a user table, generating inserts and reads as quickly as possible.

      • If supporting a million connections on a server it would take 30 hosts to open enough IP ports to generate enough connections to test just one server. For two million servers that would take 60 hosts. Just difficult to generate that kind of scale.

      • The type of traffic that is seen during production is difficult to generate. Can guess at a normal workload, but in actuality see networking events, world events, since multi-platform see varying behaviour between clients, and varying by country.

    • Tee’d workload:

      • Take normal production traffic and pipe it off to a separate system.

      • Very useful for systems for which side effects could be constrained. Don’t want to tee traffic and do things that would affect the permanent state of a user or result in multiple messages going to users.

      • Erlang supports hot loading, so could be under a full production load, have an idea, compile, load the change as the program is running and instantly see if that change is better or worse.

      • Added knobs to change production load dynamically and see how it would affect performance. Would be tailing the sar output looking at things like CPU usage, VM utilization, listen queue overflows, and turn knobs to see how the system reacted.

    • True production loads:

      • Ultimate test. Doing both input work and output work.

      • Put server in DNS a couple of times so it would get double or triple the normal traffic. Creates issues with TTLs because clients don’t respect DNS TTLs and there’s a delay, so can’t quickly react to getting more traffic than can be dealt with.

      • IPFW. Forward traffic from one server to another so could give a host exactly the number of desired client connections. A bug caused a kernel panic so that didn’t work very well.

  • Results:

    • Started at 200K simultaneous connections per server.

    • First bottleneck showed up at 425K. System ran into a lot of contention. Work stopped. Instrumented the scheduler to measure how much useful work is being done, or sleeping, or spinning. Under load it started to hit sleeping locks so 35-45% CPU was being used across the system but the schedulers are at 95% utilization.

    • The first round of fixes got to over a million connections.

      • VM usage is at 76%. CPU is at 73%. BEAM emulator running at 45% utilization, which matches closely to user percentage, which is good because the emulator runs as user.

      • Ordinarily CPU utilization isn’t a good measure of how busy a system is because the scheduler uses CPU.

    • A month later tackling bottlenecks 2 million connections per server was achieved.

      • BEAM utilization at 80%, close to where FreeBSD might start paging. CPU is about the same, with double the connections. Scheduler is hitting contention, but running pretty well.

    • Seemed like a good place to stop so started profiling Erlang code.

      • Originally had two Erlang processes per connection. Cut that to one.

      • Did some things with timers.

    • Peaked at 2.8M connections per server

      • 571k pkts/sec, >200k dist msgs/sec

      • Made some memory optimizations so VM load was down to 70%.

    • Tried 3 million connections, but failed.

      • See long message queues when the system is in trouble. Either a single message queue or a sum of message queues.

      • Added to BEAM instrumentation on message queue stats per process. How many messages are being sent/received, how fast.

      • Sampling every 10 seconds, could see a process had 600K messages in its message queue with a dequeue rate of 40K with a delay of 15 seconds. Projected drain time was 41 seconds.

  • Findings:

    • Erlang + BEAM + their fixes – has awesome SMP scalability. Nearly linear scalability. Remarkable. On a 24-way box can run the system with 85% CPU utilization and it’s keeping up running a production load. It can run like this all day.

      • Testament to Erlang’s program model.

      • The longer a server has been up it will accumulate long running connections that are mostly idle so it can handle more connections because these connections are not as busy per connection.

    • Contention was biggest issue.

      • Some fixes were in their Erlang code to reduce BEAM’s contention issues.

      • Some patched to BEAM.

      • Partitioning workload so work didn’t have to cross processors a lot.

      • Time-of-day lock. Every time a message is delivered from a port it looks to update the time-of-day which is a single lock across all schedulers which means all CPUs are hitting one lock.

      • Optimized use of timer wheels. Removed bif timer

      • Check IO time table grows arithmetically. Created VM thrashing has the hash table would be reallocated at various points. Improved to use geometric allocation of the table.

      • Added write file that takes a port that you already have open to reduce port contention.

      • Mseg allocation is single point of contention across all allocators. Make per scheduler.

      • Lots of port transactions when accepting a connection. Set option to reduce expensive port interactions.

      • When message queue backlogs became large garbage collection would destabilize the system. So pause GC until the queues shrunk.

    • Avoiding some common things that come at a price.

      • Backported a TSE time counter from FreeBSD 9 to 8. It’s a cheaper to read timer. Fast to get time of day, less expensive than going to a chip.

      • Backported igp network driver from FreeBSD 9 because having issue with multiple queue on NICs locking up.

      • Increase number of files and sockets.

      • Pmcstat showed a lot of time was spent looking up PCBs in the network stack. So bumped up the size of the hash table to make lookups faster.

    • BEAM Patches

      • Previously mentioned instrumentation patches. Instrument scheduler to get utilization information, statistics for message queues, number of sleeps, send rates, message counts, etc. Can be done in Erlang code with procinfo, but with a million connections it’s very slow.

      • Stats collection is very efficient to gather so they can be run in production.

      • Stats kept at 3 different decay intervals: 1, 10, 100 second intervals. Allows seeing issues over time.

      • Make lock counting work for larger async thread counts.

      • Added debug options to debug lock counters.

    • Tuning

      • Set the scheduler wake up threshold to low because schedulers would go to sleep and would never wake up.

      • Prefer mseg allocators over malloc.

      • Have an allocator per instance per scheduler.

      • Configure carrier sizes start out big and get bigger. Causes FreeBSD to use super pages. Reduced TLB thrash rate and improves throughput for the same CPU.

      • Run BEAM at real-time priority so that other things like cron jobs don’t interrupt schedule. Prevents glitches that would cause backlogs of important user traffic.

      • Patch to dial down spin counts so the scheduler wouldn’t spin.

    • Mnesia

      • Prefer os:timestamp to erlang:now.

      • Using no transactions, but with remote replication ran into a backlog. Parallelized replication for each table to increase throughput.

    • There are actually lots more changes that were made.

Lessons

  • Optimization is dark grungy work suitable only for trolls and engineers. When Rick is going through all the changes that he made to get to 2 million connections a server it was mind numbing. Notice the immense amount of work that went into writing tools, running tests, backporting code, adding gobs of instrumentation to nearly every level of the stack, tuning the system, looking at traces, mucking with very low level details and just trying to understand everything. That’s what it takes to remove the bottlenecks in order to increase performance and scalability to extreme levels.

  • Get the data you needWrite tools. Patch tools. Add knobs. Ken was relentless in extending the system to get the data they needed, constantly writing tools and scripts to the data they needed to manage and optimize the system. Do whatever it takes.

  • Measure. Remove Bottlenecks. Test. Repeat. That’s how you do it.

  • Erlang rocks! Erlang continues to prove its capability as a versatile, reliable, high-performance platform. Though personally all the tuning and patching that was required casts some doubt on this claim.

  • Crack the virality code and profit. Virality is an allusive quality, but as WhatsApp shows, if you do figure out, man, it’s worth a lot of money.

  • Value and employee count are now officially divorced. There are a lot of force-multipliers out in the world today. An advanced global telecom infrastructure makes apps like WhatsApp possible. If WhatsApp had to make the network or a phone etc it would never happen. Powerful cheap hardware and Open Source software availability is of course another multiplier. As is being in the right place at the right time with the right product in front of the right buyer.

  • There’s something to this brutal focus on the user idea. WhatsApp is focussed on being a simple messaging app, not being a gaming network, or an advertising network, or a disappearing photos network. That worked for them. It guided their no ads stance, their ability to keep the app simple while adding features, and the overall no brainer it just works philosohpy on any phone.

  • Limits in the cause of simplicity are OK. Your identity is tied to the phone number, so if you change your phone number your identity is gone. This is very un-computer like. But it does make the entire system much simpler in design.

  • Age ain’t no thing. If it was age discrimination that prevented WhatsApp co-founder Brian Acton from getting a job at both Twitter and Facebook in 2009, then shame, shame, shame.

  • Start simply and then customize. When chat was launched initially the server side was based on ejabberd. It’s since been completely rewritten, but that was the initial step in the Erlang direction. The experience with the scalability, reliability, and operability of Erlang in that initial use case led to broader and broader use.

  • Keep server count low. Constantly work to keep server counts as low as possible while leaving enough headroom for events that create short-term spikes in usage. Analyze and optimize until the point of diminishing returns is hit on those efforts and then deploy more hardware.

  • Purposely overprovision hardware. This ensures that users have uninterrupted service during their festivities and employees are able to enjoy the holidays without spending the whole time fixing overload issues.

  • Growth stalls when you charge money. Growth was super fast when WhatsApp was free, 10,000 downloads a day in the early days. Then when switching over to paid that declined to 1,000 a day. At the end of the year, after adding picture messaging, they settled on charging a one-time download fee, later modified to an annual payment.

  • Inspiration comes from the strangest places. Experience with forgetting the username and passwords on Skype accounts drove the passion for making the app “just work.”

(via HighScalability.com)

How To Scale A Web Application Using A Coffee Shop As An Example

This is a guest repost by Sriram Devadas, Engineer at Vistaprint, Web platform group. A fun and well written analogy of how to scale web applications using a familiar coffee shop as an example. No coffee was harmed during the making of this post.

I own a small coffee shop.

My expense is proportional to resources
100 square feet of built up area with utilities, 1 barista, 1 cup coffee maker.

My shop’s capacity
Serves 1 customer at a time, takes 3 minutes to brew a cup of coffee, a total of 5 minutes to serve a customer.

Since my barista works without breaks and the German made coffee maker never breaks down,
my shop’s maximum throughput = 12 customers per hour.

 

Web Server

Customers walk away during peak hours. We only serve one customer at a time. There is no space to wait.

I upgrade shop. My new setup is better!

Expenses
Same area and utilities, 3 baristas, 2 cup coffee maker, 2 chairs

Capacity
3 minutes to brew 2 cups of coffee, ~7 minutes to serve 3 concurrent customers, 2 additional customers can wait in queue on chairs.

Concurrent customers = 3, Customer capacity = 5

 

Scaling Vertically

Business is booming. Time to upgrade again. Bigger is better!

Expenses
200 square feet and utilities, 5 baristas, 4 cup coffee maker, 3 chairs

Capacity goes up proportionally. Things are good.

In summer, business drops. This is normal for coffee shops. I want to go back to a smaller setup for sometime. But my landlord wont let me scale that way.

Scaling vertically is expensive for my up and coming coffee shop. Bigger is not always better.

 

Scaling Horizontally With A Load Balancer

Landlord is willing to scale up and down in terms of smaller 3 barista setups. He can add a setup or take one away if I give advance notice.

If only I could manage multiple setups with the same storefront…

There is a special counter in the market which does just that!
It allows several customers to talk to a barista at once. Actually the customer facing employee need not be a barista, just someone who takes and delivers orders. And baristas making coffee do not have to deal directly with pesky customers.

Now I have an advantage. If I need to scale, I can lease another 3 barista setup (which happens to be a standard with coffee landlords), and hook it up to my special counter. And do the reverse to scale down.

While expenses go up, I can handle capacity better too. I can ramp up and down horizontally.

 

Resource Intensive Processing

My coffee maker is really a general purpose food maker. Many customers tell me they would love to buy freshly baked bread. I add it to the menu.

I have a problem. The 2 cup coffee maker I use, can make only 1 pound of bread at a time. Moreover it takes twice as long to make bread.

In terms of time,
1 pound of bread = 4 cups of coffee

Sometimes bread orders clog my system! Coffee ordering customers are not happy with the wait. Word spreads about my inefficiency.

I need a way of segmenting my customer orders by load, while using my resources optimally.

 

Asynchronous Queue Based Processing

I introduce a token based queue system.

Customers come in, place their order, get a token number and wait.
The order taker places the order on 2 input queues – bread and coffee.
Baristas look at the queues and all other units and decide if they should pick up the next coffee or the next bread order.
Once a coffee or bread is ready, it is placed on an output tray and the order taker shouts out the token number. The customer picks up the order.

- The inputs queues and output tray are new. Otherwise the resources are the same, only working in different ways.

- From a customers point of view, the interaction with the system is now quite different.

- The expense and capacity calculations are complicated. The system’s complexity went up too. If any problem crops up, debugging and fixing it could be problematic.

- If the customers accept this asynchronous system, and we can manage the complexity, it provides a way to scale capacity and product variety. I can intimidate my next door competitor.

 

Scaling Beyond

We are reaching the limits of our web server driven, load balanced, queue based asynchronous system. What next?

My already weak coffee shop analogy has broken down completely at this point.

Search for DNS round robin and other techniques of scaling if you are interested to know more.

If you are new to web application scaling, do the same for each of the topics mentioned in this page.

My coffee shop analogy was an over-simplification, to spark interest on the topic of scaling web applications.

If you really want to learn, tinker with these systems and discuss them with someone who has practical knowledge.

(via HighScalability.com)