Microway Rolls Out OctoPuter Servers with up to 8 GPUs

Today Microway announced a new line of servers designed for GPU and storage density. As part of the announcement, the company’s new OctoPuter GPU servers pack 34 TFLOPS of computing power when paired with up to eight NVIDIA Tesla K40 GPU accelerators.

“NVIDIA GPU accelerators offer the fastest parallel processing power available, but this requires high-speed access to the data. Microway’s newest GPU computing solutions ensure that large amounts of source data are retained in the same server as a high density of Tesla GPUs. The result is faster application performance by avoiding the bottleneck of data retrieval from network storage,” said Stephen Fried, CTO of Microway.

Microway also introduced an additional NumberSmasher 1U GPU server housing up to three NVIDIA Tesla K40 GPU accelerators. With nearly 13 TFLOPS of computing power, the NumberSmasher includes up to 512GB of memory, 24 x86 compute cores, hardware RAID, and optional InfiniBand.


(Via InsideHPC.com)

The WhatsApp Architecture

WhatsApp stats: What has hundreds of nodes, thousands of cores, hundreds of terabytes of RAM, and hopes to serve the billions of smartphones that will soon be a reality around the globe? The Erlang/FreeBSD-based server infrastructure at WhatsApp. We’ve faced many challenges in meeting the ever-growing demand for our messaging services, but we continue to push the envelope on the size (>8000 cores) and speed (>70M Erlang messages per second) of our serving system.

A warning here: we don’t know a lot about WhatsApp’s overall architecture, just bits and pieces gathered from various sources. Rick Reed’s main talk is about the optimization process used to get to 2 million connections per server while using Erlang, which is interesting, but it’s not a complete architecture talk.

Stats

These stats are generally for the current system, not the system covered in the talk. A talk on the current system would include more on hacks for data storage, messaging, meta-clustering, and more BEAM/OTP patches.

  • 450 million active users, and reached that number faster than any other company in history.

  • 32 engineers, one developer supports 14 million active users

  • 50 billion messages every day across seven platforms (inbound + outbound)

  • 1+ million people sign up every day

  • $0 invested in advertising

  • $60 million investment from Sequoia Capital; $3.4 billion is the amount Sequoia will make

  • 35% is how much of Facebook’s cash is being used for the deal

  • Hundreds of nodes

  • >8000 cores

  • Hundreds of terabytes of RAM

  • >70M Erlang messages per second

  • In 2011 WhatsApp achieved 1 million established TCP sessions on a single machine with memory and CPU to spare. In 2012 that was pushed to over 2 million TCP connections. In 2013 WhatsApp tweeted out: “On Dec 31st we had a new record day: 7B msgs inbound, 11B msgs outbound = 18 billion total messages processed in one day! Happy 2013!!!”

Platform

Backend

  • Erlang

  • FreeBSD

  • Yaws, lighttpd

  • PHP

  • Custom patches to BEAM (BEAM is like Java’s JVM, but for Erlang)

  • Custom XMPP

  • Hosting may be in SoftLayer

Frontend

  • Seven client platforms: iPhone, Android, Blackberry, Nokia Symbian S60, Nokia S40, Windows Phone, ?

  • SQLite

Hardware

  • Standard user facing server:

    • Dual Westmere Hex-core (24 logical CPUs);

    • 100GB RAM, SSD;

    • Dual NIC (public user-facing network, private back-end/distribution);

Product

  • Focus is on messaging. Connecting people all over the world, regardless of where they are in the world, without having to pay a lot of money. Founder Jan Koum remembers how difficult it was in 1992 to connect to family all over the world.

  • Privacy. Shaped by Jan Koum’s experiences growing up in Ukraine, where nothing was private. Messages are not stored on servers; chat history is not stored; goal is to know as little about users as possible; your name and your gender are not known; chat history is only on your phone.

General

  • WhatsApp server is almost completely implemented in Erlang.

    • Server systems that do the backend message routing are done in Erlang.

    • Great achievement is that the number of active users is managed with a really small server footprint. Team consensus is that it is largely because of Erlang.

    • Interesting to note Facebook Chat was written in Erlang in 2009, but they went away from it because it was hard to find qualified programmers.

  • The WhatsApp server started from ejabberd

    • Ejabberd is a famous open source Jabber server written in Erlang.

    • Originally chosen because it’s open source, had great reviews from developers, was easy to get started with, and because of the promise of Erlang’s long-term suitability for a large communication system.

    • The next few years were spent re-writing and modifying quite a few parts of ejabberd, including switching from XMPP to an internally developed protocol, restructuring the code base and redesigning some core components, and making lots of important modifications to the Erlang VM to optimize server performance.

  • To handle 50 billion messages a day the focus is on making a reliable system that works. Monetization is something to look at later; it’s far, far down the road.

  • A primary gauge of system health is message queue length. The message queue lengths of all the processes on a node are constantly monitored, and an alert is sent out if they accumulate backlog beyond a preset threshold. If one or more processes fall behind, an alert goes out, which gives a pointer to the next bottleneck to attack (a minimal monitoring sketch appears after this list).

  • Multimedia messages are sent by uploading the image, audio, or video to an HTTP server and then sending a link to the content along with its Base64-encoded thumbnail (if applicable).

  • Some code is usually pushed every day. Often, it’s multiple times a day, though in general peak traffic times are avoided. Erlang helps being aggressive in getting fixes and features into production. Hot-loading means updates can be pushed without restarts or traffic shifting. Mistakes can usually be undone very quickly, again by hot-loading. Systems tend to be much more loosely-coupled which makes it very easy to roll changes out incrementally.

  • What protocol is used in the WhatsApp app? An SSL socket to the WhatsApp server pools. All messages are queued on the server until the client reconnects to retrieve them. The successful retrieval of a message is sent back to the WhatsApp server, which forwards this status back to the original sender (who will see it as a “checkmark” icon next to the message). Messages are wiped from the server memory as soon as the client has accepted the message.

  • How does the registration process work internally in WhatsApp? WhatsApp used to create a username/password based on the phone’s IMEI number. This was changed recently. WhatsApp now uses a general request from the app to send a unique 5-digit PIN. WhatsApp will then send an SMS to the indicated phone number (this means the WhatsApp client no longer needs to run on the same phone). Based on the PIN, the app then requests a unique key from WhatsApp. This key is used as a “password” for all future calls. (This “permanent” key is stored on the device.) This also means that registering a new device will invalidate the key on the old device.

  • Google’s push service is used on Android.

  • More users on Android. Android is more enjoyable to work with. Developers are able to prototype a feature and push it out to hundreds of millions of users overnight; if there’s an issue it can be fixed quickly. iOS, not so much.
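
To make the message-queue-length health gauge above more concrete, here is a minimal Python sketch (not WhatsApp’s code, which lives inside the Erlang VM); the collector function, threshold, and polling interval are assumptions for illustration:

```python
# Minimal sketch of the health gauge described above (not WhatsApp's code):
# poll per-process message queue lengths and alert on excessive backlog.
import time

BACKLOG_THRESHOLD = 100_000        # assumed threshold, purely illustrative
POLL_INTERVAL_SECONDS = 10

def get_queue_lengths():
    """Hypothetical collector; in the real system this data comes from
    instrumentation inside BEAM (per-process message queue stats)."""
    return {}                      # {process_name: queue_length}

def alert(process_name, backlog):
    print(f"ALERT: {process_name} has a backlog of {backlog} messages")

while True:
    for name, backlog in get_queue_lengths().items():
        if backlog > BACKLOG_THRESHOLD:
            alert(name, backlog)   # points at the next bottleneck to attack
    time.sleep(POLL_INTERVAL_SECONDS)
```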

The Quest For 2+ Million Connections Per Server

  • Experienced lots of user growth, which is a good problem to have, but it also means having to spend money on more hardware and dealing with the increased operational complexity of managing all those machines.

  • Need to plan for bumps in traffic. Examples are soccer games and earthquakes in Spain or Mexico. These happen near peak traffic loads, so there needs to be enough spare capacity to handle peaks + bumps. A recent soccer match generated a 35% spike in outbound message rate right at the daily peak.

  • Initial server loading was 200 simultaneous connections per server.

    • Extrapolated out, that would mean a lot of servers given the hoped-for growth pattern.

    • Servers were brittle in the face of burst loads. Network glitches and other problems would occur. Needed to decouple components so things weren’t so brittle at high capacity.

    • Goal was a million connections per server. An ambitious goal, given that at the time they were running at 200K connections. Running servers with enough headroom to allow for world events, hardware failures, and other types of glitches requires enough resilience to handle the high usage levels and failures.

Tools And Techniques Used To Increase Scalability

  • Wrote system activity reporter tool (wsar):

    • Record system stats across the system, including OS stats, hardware stats, BEAM stats. It was built so it was easy to plug in metrics from other systems, like virtual memory. Track CPU utilization, overall utilization, user time, system time, interrupt time, context switches, system calls, traps, packets sent/received, total count of messages in queues across all processes, busy port events, traffic rate, bytes in/out, scheduling stats, garbage collection stats, words collected, etc.

    • Initially ran once a minute. As the systems were driven harder, one-second polling resolution was required because events that happened in the space of a minute were invisible. Really fine-grained stats to see how everything is performing.

  • Hardware performance counters in CPU (pmcstat):

    • See where the CPU is spending its time as a percentage. Can tell how much time is being spent executing the emulator loop. In their case it is 16%, which tells them that only 16% is executing emulated code, so even if you were able to remove all the execution time of all the Erlang code it would only save 16% of the total runtime. This implies you should focus on other areas to improve the efficiency of the system.

  • dtrace, kernel lock-counting, fprof

    • Dtrace was mostly for debugging, not performance.

    • Patched BEAM on FreeBSD to include CPU time stamp.

    • Wrote scripts to create an aggregated view across all processes to see where routines are spending all the time.

    • Biggest win was compiling the emulator with lock counting turned on.

  • Some Issues:

    • Earlier on saw more time spent in the garbage collection routines; that was brought down.

    • Saw some issues with the networking stack that were tuned away.

    • Most issues were with lock contention in the emulator, which shows up strongly in the output of the lock counting.

  • Measurement:

    • Synthetic workloads, which means generating traffic from your own test scripts, are of little value for tuning user-facing systems at extreme scale.

      • Worked well for simple interfaces like a user table, generating inserts and reads as quickly as possible.

      • If supporting a million connections on a server, it would take 30 hosts to open enough IP ports to generate enough connections to test just one server. For two million connections that would take 60 hosts. It’s just difficult to generate that kind of scale.

      • The type of traffic that is seen during production is difficult to generate. You can guess at a normal workload, but in actuality you see networking events, world events, and, since it’s multi-platform, varying behaviour between clients and between countries.

    • Tee’d workload:

      • Take normal production traffic and pipe it off to a separate system.

      • Very useful for systems for which side effects could be constrained. Don’t want to tee traffic and do things that would affect the permanent state of a user or result in multiple messages going to users.

      • Erlang supports hot loading, so could be under a full production load, have an idea, compile, load the change as the program is running and instantly see if that change is better or worse.

      • Added knobs to change production load dynamically and see how it would affect performance. Would be tailing the sar output looking at things like CPU usage, VM utilization, listen queue overflows, and turn knobs to see how the system reacted.

    • True production loads:

      • Ultimate test. Doing both input work and output work.

      • Put server in DNS a couple of times so it would get double or triple the normal traffic. Creates issues with TTLs because clients don’t respect DNS TTLs and there’s a delay, so can’t quickly react to getting more traffic than can be dealt with.

      • IPFW. Forward traffic from one server to another so a host could be given exactly the desired number of client connections. A bug caused a kernel panic, so that didn’t work very well.

  • Results:

    • Started at 200K simultaneous connections per server.

    • First bottleneck showed up at 425K. The system ran into a lot of contention. Work stopped. Instrumented the scheduler to measure how much useful work was being done versus sleeping or spinning. Under load it started to hit sleeping locks, so 35-45% CPU was being used across the system but the schedulers were at 95% utilization.

    • The first round of fixes got to over a million connections.

      • VM usage is at 76%. CPU is at 73%. BEAM emulator running at 45% utilization, which matches closely to user percentage, which is good because the emulator runs as user.

      • Ordinarily CPU utilization isn’t a good measure of how busy a system is because the scheduler uses CPU.

    • A month later, after tackling bottlenecks, 2 million connections per server was achieved.

      • BEAM utilization at 80%, close to where FreeBSD might start paging. CPU is about the same, with double the connections. Scheduler is hitting contention, but running pretty well.

    • Seemed like a good place to stop so started profiling Erlang code.

      • Originally had two Erlang processes per connection. Cut that to one.

      • Did some things with timers.

    • Peaked at 2.8M connections per server

      • 571k pkts/sec, >200k dist msgs/sec

      • Made some memory optimizations so VM load was down to 70%.

    • Tried 3 million connections, but failed.

      • See long message queues when the system is in trouble. Either a single message queue or a sum of message queues.

      • Added instrumentation to BEAM for message queue stats per process. How many messages are being sent/received, and how fast.

      • Sampling every 10 seconds, could see a process had 600K messages in its message queue with a dequeue rate of 40K with a delay of 15 seconds. Projected drain time was 41 seconds.

  • Findings:

    • Erlang + BEAM + their fixes has awesome SMP scalability. Nearly linear scalability. Remarkable. On a 24-way box they can run the system with 85% CPU utilization and it keeps up with a production load. It can run like this all day.

      • Testament to Erlang’s programming model.

      • The longer a server has been up, the more long-running, mostly idle connections it accumulates, so it can handle more connections because each connection is less busy.

    • Contention was biggest issue.

      • Some fixes were in their Erlang code to reduce BEAM’s contention issues.

      • Some were patches to BEAM.

      • Partitioning workload so work didn’t have to cross processors a lot.

      • Time-of-day lock. Every time a message is delivered from a port it looks to update the time of day, which is a single lock across all schedulers, meaning all CPUs are hitting one lock.

      • Optimized use of timer wheels. Removed the BIF timer.

      • The check-IO time table grows arithmetically. That created VM thrashing, as the hash table would be reallocated at various points. Improved it to use geometric allocation of the table.

      • Added a write-file operation that takes a port you already have open, to reduce port contention.

      • Mseg allocation is a single point of contention across all allocators. Made it per-scheduler.

      • Lots of port transactions when accepting a connection. Set option to reduce expensive port interactions.

      • When message queue backlogs became large, garbage collection would destabilize the system, so GC is paused until the queues shrink.

    • Avoiding some common things that come at a price.

      • Backported a TSE time counter from FreeBSD 9 to 8. It’s a cheaper-to-read timer. Fast to get time of day, less expensive than going to a chip.

      • Backported the igb network driver from FreeBSD 9 because they were having issues with multi-queue NICs locking up.

      • Increase number of files and sockets.

      • Pmcstat showed a lot of time was spent looking up PCBs in the network stack. So bumped up the size of the hash table to make lookups faster.

    • BEAM Patches

      • Previously mentioned instrumentation patches. Instrument scheduler to get utilization information, statistics for message queues, number of sleeps, send rates, message counts, etc. Can be done in Erlang code with procinfo, but with a million connections it’s very slow.

      • Stats are very efficient to gather, so they can be run in production.

      • Stats kept at 3 different decay intervals: 1, 10, 100 second intervals. Allows seeing issues over time.

      • Make lock counting work for larger async thread counts.

      • Added debug options to debug lock counters.

    • Tuning

      • Set the scheduler wake up threshold to low because schedulers would go to sleep and would never wake up.

      • Prefer mseg allocators over malloc.

      • Have an allocator per instance per scheduler.

      • Configured carrier sizes to start out big and get bigger. This causes FreeBSD to use superpages, which reduced the TLB thrash rate and improves throughput for the same CPU.

      • Run BEAM at real-time priority so that other things like cron jobs don’t interrupt scheduling. Prevents glitches that would cause backlogs of important user traffic.

      • Patch to dial down spin counts so the scheduler wouldn’t spin.

    • Mnesia

      • Prefer os:timestamp to erlang:now.

      • Using no transactions, but with remote replication ran into a backlog. Parallelized replication for each table to increase throughput.

    • There are actually lots more changes that were made.

Lessons

  • Optimization is dark, grungy work suitable only for trolls and engineers. When Rick goes through all the changes that he made to get to 2 million connections per server it is mind-numbing. Notice the immense amount of work that went into writing tools, running tests, backporting code, adding gobs of instrumentation to nearly every level of the stack, tuning the system, looking at traces, mucking with very low-level details and just trying to understand everything. That’s what it takes to remove the bottlenecks in order to increase performance and scalability to extreme levels.

  • Get the data you need. Write tools. Patch tools. Add knobs. Rick was relentless in extending the system to get the data they needed, constantly writing tools and scripts to gather the data they needed to manage and optimize the system. Do whatever it takes.

  • Measure. Remove Bottlenecks. Test. Repeat. That’s how you do it.

  • Erlang rocks! Erlang continues to prove its capability as a versatile, reliable, high-performance platform. Though personally all the tuning and patching that was required casts some doubt on this claim.

  • Crack the virality code and profit. Virality is an elusive quality, but as WhatsApp shows, if you do figure it out, man, it’s worth a lot of money.

  • Value and employee count are now officially divorced. There are a lot of force-multipliers out in the world today. An advanced global telecom infrastructure makes apps like WhatsApp possible. If WhatsApp had to make the network or a phone etc it would never happen. Powerful cheap hardware and Open Source software availability is of course another multiplier. As is being in the right place at the right time with the right product in front of the right buyer.

  • There’s something to this idea of a brutal focus on the user. WhatsApp is focussed on being a simple messaging app, not being a gaming network, or an advertising network, or a disappearing photos network. That worked for them. It guided their no-ads stance, their ability to keep the app simple while adding features, and the overall no-brainer “it just works” philosophy on any phone.

  • Limits in the cause of simplicity are OK. Your identity is tied to the phone number, so if you change your phone number your identity is gone. This is very un-computer like. But it does make the entire system much simpler in design.

  • Age ain’t no thing. If it was age discrimination that prevented WhatsApp co-founder Brian Acton from getting a job at both Twitter and Facebook in 2009, then shame, shame, shame.

  • Start simply and then customize. When chat was launched initially the server side was based on ejabberd. It’s since been completely rewritten, but that was the initial step in the Erlang direction. The experience with the scalability, reliability, and operability of Erlang in that initial use case led to broader and broader use.

  • Keep server count low. Constantly work to keep server counts as low as possible while leaving enough headroom for events that create short-term spikes in usage. Analyze and optimize until the point of diminishing returns is hit on those efforts and then deploy more hardware.

  • Purposely overprovision hardware. This ensures that users have uninterrupted service during their festivities and employees are able to enjoy the holidays without spending the whole time fixing overload issues.

  • Growth stalls when you charge money. Growth was super fast when WhatsApp was free, 10,000 downloads a day in the early days. Then when switching over to paid that declined to 1,000 a day. At the end of the year, after adding picture messaging, they settled on charging a one-time download fee, later modified to an annual payment.

  • Inspiration comes from the strangest places. Experience with forgetting usernames and passwords on Skype accounts drove the passion for making the app “just work.”

(via HighScalability.com)

How To Scale A Web Application Using A Coffee Shop As An Example

This is a guest repost by Sriram Devadas, Engineer at Vistaprint, Web platform group. A fun and well written analogy of how to scale web applications using a familiar coffee shop as an example. No coffee was harmed during the making of this post.

I own a small coffee shop.

My expense is proportional to resources
100 square feet of built up area with utilities, 1 barista, 1 cup coffee maker.

My shop’s capacity
Serves 1 customer at a time, takes 3 minutes to brew a cup of coffee, a total of 5 minutes to serve a customer.

Since my barista works without breaks and the German made coffee maker never breaks down,
my shop’s maximum throughput = 12 customers per hour.
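
To spell out the capacity arithmetic, here is a tiny Python sketch (numbers taken from the description above; this is illustrative, not from the original post):

```python
# The capacity arithmetic above, spelled out (numbers from the post).
SERVE_TIME_MINUTES = 5          # 3 minutes brewing + 2 minutes of everything else
CONCURRENT_CUSTOMERS = 1        # one barista, a one-cup machine, nowhere to wait

throughput_per_hour = (60 / SERVE_TIME_MINUTES) * CONCURRENT_CUSTOMERS
print(throughput_per_hour)      # 12.0 customers per hour
```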

 

Web Server

Customers walk away during peak hours. We only serve one customer at a time. There is no space to wait.

I upgrade my shop. My new setup is better!

Expenses
Same area and utilities, 3 baristas, 2 cup coffee maker, 2 chairs

Capacity
3 minutes to brew 2 cups of coffee, ~7 minutes to serve 3 concurrent customers, 2 additional customers can wait in queue on chairs.

Concurrent customers = 3, Customer capacity = 5

 

Scaling Vertically

Business is booming. Time to upgrade again. Bigger is better!

Expenses
200 square feet and utilities, 5 baristas, 4 cup coffee maker, 3 chairs

Capacity goes up proportionally. Things are good.

In summer, business drops. This is normal for coffee shops. I want to go back to a smaller setup for some time. But my landlord won’t let me scale that way.

Scaling vertically is expensive for my up and coming coffee shop. Bigger is not always better.

 

Scaling Horizontally With A Load Balancer

Landlord is willing to scale up and down in terms of smaller 3 barista setups. He can add a setup or take one away if I give advance notice.

If only I could manage multiple setups with the same storefront…

There is a special counter in the market which does just that!
It allows several customers to talk to a barista at once. Actually the customer facing employee need not be a barista, just someone who takes and delivers orders. And baristas making coffee do not have to deal directly with pesky customers.

Now I have an advantage. If I need to scale, I can lease another 3 barista setup (which happens to be a standard with coffee landlords), and hook it up to my special counter. And do the reverse to scale down.

While expenses go up, I can handle capacity better too. I can ramp up and down horizontally.
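
As a rough sketch of the “special counter” idea in code (purely illustrative; the setup names are made up), a round-robin dispatcher spreading orders across interchangeable 3-barista setups might look like this in Python:

```python
# Rough sketch of the "special counter" (a load balancer) in front of
# interchangeable 3-barista setups; setup names are made up for illustration.
from itertools import cycle

class SpecialCounter:
    def __init__(self, setups):
        self._setups = list(setups)
        self._rotation = cycle(self._setups)

    def lease_setup(self, setup):
        """Scale out: hook another 3-barista setup up to the counter."""
        self._setups.append(setup)
        self._rotation = cycle(self._setups)     # restart the rotation with the new setup

    def take_order(self, order):
        setup = next(self._rotation)             # round-robin across setups
        return f"{order} -> {setup}"

counter = SpecialCounter(["setup-A", "setup-B"])
print(counter.take_order("latte"))               # latte -> setup-A
print(counter.take_order("espresso"))            # espresso -> setup-B
counter.lease_setup("setup-C")                   # scale up for the morning rush
```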

 

Resource Intensive Processing

My coffee maker is really a general purpose food maker. Many customers tell me they would love to buy freshly baked bread. I add it to the menu.

I have a problem. The 2-cup coffee maker I use can make only 1 pound of bread at a time. Moreover, it takes twice as long to make bread.

In terms of time,
1 pound of bread = 4 cups of coffee

Sometimes bread orders clog my system! Coffee ordering customers are not happy with the wait. Word spreads about my inefficiency.

I need a way of segmenting my customer orders by load, while using my resources optimally.

 

Asynchronous Queue Based Processing

I introduce a token based queue system.

Customers come in, place their order, get a token number and wait.
The order taker places the order on 2 input queues – bread and coffee.
Baristas look at the queues and all other units and decide if they should pick up the next coffee or the next bread order.
Once a coffee or bread is ready, it is placed on an output tray and the order taker shouts out the token number. The customer picks up the order.
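
Here is a minimal Python sketch of this token-based, queue-driven setup (timings, counts, and the coffee-first policy are illustrative only, scaled down so the example runs in about a second):

```python
# A toy version of the token-based queue system above: two input queues
# (coffee and bread), three baristas pulling from them, and an output tray.
import queue
import threading
import time

coffee_orders = queue.Queue()
bread_orders = queue.Queue()
output_tray = queue.Queue()

def barista(name):
    while True:
        try:
            token = coffee_orders.get(timeout=0.1)   # coffee is the fast path
            time.sleep(0.3)                          # "3 minutes" to brew
        except queue.Empty:
            try:
                token = bread_orders.get(timeout=0.1)
                time.sleep(0.6)                      # bread takes twice as long
            except queue.Empty:
                return                               # both queues drained
        output_tray.put((token, name))               # order taker shouts the token

# The order taker places five orders: every third one is bread.
for token in range(1, 6):
    (bread_orders if token % 3 == 0 else coffee_orders).put(token)

baristas = [threading.Thread(target=barista, args=(f"barista-{i}",)) for i in range(3)]
for b in baristas:
    b.start()
for b in baristas:
    b.join()

while not output_tray.empty():
    print(output_tray.get())                         # (token, barista) pairs
```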

- The input queues and output tray are new. Otherwise the resources are the same, only working in different ways.

- From a customer’s point of view, the interaction with the system is now quite different.

- The expense and capacity calculations are complicated. The system’s complexity went up too. If any problem crops up, debugging and fixing it could be problematic.

- If the customers accept this asynchronous system, and we can manage the complexity, it provides a way to scale capacity and product variety. I can intimidate my next door competitor.

 

Scaling Beyond

We are reaching the limits of our web-server-driven, load-balanced, queue-based asynchronous system. What next?

My already weak coffee shop analogy has broken down completely at this point.

Search for DNS round robin and other techniques of scaling if you are interested to know more.

If you are new to web application scaling, do the same for each of the topics mentioned in this page.

My coffee shop analogy was an over-simplification, to spark interest on the topic of scaling web applications.

If you really want to learn, tinker with these systems and discuss them with someone who has practical knowledge.

(via HighScalability.com)

 

Docker – a Linux Container Engine

(Today I read an article about Docker. It’s a good thing and I want to share it with you.)

About Docker

Docker is an open-source engine that automates the deployment of any application as a lightweight, portable, self-sufficient container that will run virtually anywhere.

Docker containers can encapsulate any payload, and will run consistently on and between virtually any server. The same container that a developer builds and tests on a laptop will run at scale, in production, on VMs, bare-metal servers, OpenStack clusters, public instances, or combinations of the above.

Common use cases for Docker include:

  • Automating the packaging and deployment of applications
  • Creation of lightweight, private PAAS environments
  • Automated testing and continuous integration/deployment
  • Deploying and scaling web apps, databases and backend services


Background

Fifteen years ago, virtually all applications were written using well-defined stacks of services and deployed on a single monolithic, proprietary server. Today, developers build and assemble applications using a multiplicity of the best available services, and must be prepared for those applications to be deployed across a multiplicity of different hardware environments, including public, private, and virtualized servers.

Figure 1: The Evolution of IT

This sets up the possibility for:

  • Adverse interactions between different services and “dependency hell”
  • Challenges in rapidly migrating and scaling across different hardware
  • The impossibility of managing a matrix of multiple different services deployed across multiple different types of hardware

Figure 2: The Challenge of Multiple Stacks and Multiple Hardware Environments

Or, viewed as a matrix, we can see that there is a huge number of combinations and permutations of applications/services and hardware environments that need to be considered every time an application is written or rewritten. This creates a difficult situation for both the developers who are writing applications and the folks in operations who are trying to create a scalable, secure, and highly performant operations environment.

Figure 3: Dynamic Stacks and Dynamic Hardware Environments Create an NxN Matrix

How to solve this problem? A useful analogy can be drawn from the world of shipping. Before 1960, most cargo was shipped break bulk. Shippers and carriers alike needed to worry about bad interactions between different types of cargo (e.g. if a shipment of anvils fell on a sack of bananas). Similarly, transitions between different modes of transport were painful. Up to half the time to ship something could be taken up as ships were unloaded and reloaded in ports, and in waiting for the same shipment to get reloaded onto trains, trucks, etc. Along the way, losses due to damage and theft were large. And, there was an n X n matrix between a multiplicity of different goods and a multiplicity of different transport mechanisms.

Figure 4: Analogy: Shipping Pre-1960

Fortunately, an answer was found in the form of a standard shipping container. Any type of goods, from pistachios to Porsches, can be packaged inside a standard shipping container. The container can then be sealed, and not re-opened until it reaches its final destination. In between, the containers can be loaded and unloaded, stacked, transported, and efficiently moved over long distances. The transfer from ship to gantry crane to train to truck can be automated, without requiring a modification of the container. Many authors credit the shipping container with revolutionizing both transportation and world trade in general. Today, 18 million standard containers carry 90% of world trade.

Figure 5: Solution to Shipping Challenge Was a Standard Container

To some extent, Docker can be thought of as an intermodal shipping container system for code.

Figure 6: The Solution to Software Shipping is Also a Standard Container System

Docker enables any application and its dependencies to be packaged up as a lightweight, portable, self-sufficient container. Containers have standard operations, thus enabling automation. And, they are designed to run on virtually any Linux server. The same container that a developer builds and tests on a laptop will run at scale, in production, on VMs, bare-metal servers, OpenStack clusters, public instances, or combinations of the above.

In other words, developers can build their application once, and then know that it can run consistently anywhere. Operators can configure their servers once, and then know that they can run any application.

Why Should I Care (For Developers)

Build once…run anywhere

“Docker interests me because it allows simple environment isolation and repeatability. I can create a run-time environment once, package it up, then run it again on any other machine. Furthermore, everything that runs in that environment is isolated from the underlying host (much like a virtual machine). And best of all, everything is fast and simple.”

Why Should I Care (For Devops)

Configure once…run anything

  • Make the entire lifecycle more efficient, consistent, and repeatable
  • Increase the quality of code produced by developers
  • Eliminate inconsistencies between development, test, production, and customer environments
  • Support segregation of duties
  • Significantly improve the speed and reliability of continuous deployment and continuous integration systems
  • Because the containers are so lightweight, address significant performance, costs, deployment, and portability issues normally associated with VMs

What are the Main Features of Docker

It is useful to compare the main features of Docker to those of shipping containers. (See the analogy above).

  • Content agnostic: a physical container can hold almost any kind of cargo; Docker can encapsulate any payload and its dependencies.
  • Hardware agnostic: a standard shape and interface allow the same container to move from ship to train to semi-truck to warehouse to crane without being modified or opened; using operating system primitives (e.g. LXC), Docker can run consistently on virtually any hardware – VMs, bare metal, OpenStack, public IaaS, etc. – without modification.
  • Content isolation and interaction: no worry about anvils crushing bananas, and containers can be stacked and shipped together; Docker provides resource, network, and content isolation and avoids dependency hell.
  • Automation: standard interfaces make it easy to automate loading, unloading, moving, etc.; Docker has standard operations to run, start, stop, commit, search, etc. Perfect for devops: CI, CD, autoscaling, hybrid clouds.
  • Highly efficient: no opening or modification, quick to move between waypoints; Docker is lightweight, has virtually no performance or start-up penalty, and is quick to move and manipulate.
  • Separation of duties: the shipper worries about the inside of the box and the carrier worries about the outside; the developer worries about code and Ops worries about infrastructure.

Figure 7: Main Docker Features

For a more technical view of features, please see the following:

  • Filesystem isolation: each process container runs in a completely separate root filesystem.
  • Resource isolation: system resources like cpu and memory can be allocated differently to each process container, using cgroups.
  • Network isolation: each process container runs in its own network namespace, with a virtual interface and IP address of its own.
  • Copy-on-write: root filesystems are created using copy-on-write, which makes deployment extremely fast, memory-cheap and disk-cheap.
  • Logging: the standard streams (stdout/stderr/stdin) of each process container are collected and logged for real-time or batch retrieval.
  • Change management: changes to a container’s filesystem can be committed into a new image and re-used to create more containers. No templating or manual configuration required.
  • Interactive shell: docker can allocate a pseudo-tty and attach to the standard input of any container, for example to run a throwaway interactive shell.

What are the Basic Docker Functions

Docker makes it easy to build, modify, publish, search, and run containers. The diagram below should give you a good sense of the Docker basics. With Docker, a container comprises both an application and all of its dependencies. Containers can either be created manually or, if a source code repository contains a Dockerfile, automatically. Subsequent modifications to a baseline Docker image can be committed to a new container using the Docker Commit Function and then Pushed to a Central Registry.

Containers can be found in a Docker Registry (either public or private), using Docker Search. Containers can be pulled from the registry using Docker Pull and can be run, started, stopped, etc. using Docker Run commands. Notably, the target of a run command can be your own servers, public instances, or a combination.
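
As a hedged sketch of that workflow (not from the original article), the functions named above can be driven from Python through the docker CLI; the image, tag, and container names below are placeholders, not real ones:

```python
# Hedged sketch: driving the basic Docker functions through the docker CLI.
# Image, tag, and container names are placeholders for illustration.
import subprocess

def docker(*args):
    subprocess.run(["docker", *args], check=True)

docker("build", "-t", "myorg/myapp", ".")         # build an image from a Dockerfile
docker("run", "myorg/myapp")                      # run a container from that image
docker("pull", "ubuntu")                          # pull an image from a registry
docker("search", "redis")                         # search the public registry
# After modifying a running container, commit the result as a new image
# and push it to a registry (the container ID below is hypothetical):
docker("commit", "1a2b3c4d5e6f", "myorg/myapp:v2")
docker("push", "myorg/myapp:v2")
```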

Figure 8: Basic Docker Functions

For a full list of functions, please go to: http://docs.docker.io/en/latest/commandline/

Docker runs three ways:

  • as a daemon to manage LXC containers on your Linux host (sudo docker -d)
  • as a CLI which talks to the daemon’s REST API (docker run …)
  • as a client of repositories that let you share what you’ve built (docker pull, docker commit)

How Do Containers Work? (And How are they Different From VMs)

A container comprises an application and its dependencies. Containers isolate processes, which run in userspace on the host’s operating system.

This differs significantly from traditional VMs. Traditional hardware virtualization (e.g. VMware, KVM, Xen, EC2) aims to create an entire virtual machine. Each virtualized application contains not only the application (which may be only tens of MB) and the binaries and libraries needed to run it, but also an entire guest operating system (which may measure in the tens of GB).

The picture below captures the difference

Figure 9: Containers vs. Traditional VMs

Since all of the containers share the same operating system (and, where appropriate, binaries and libraries), they are significantly smaller than VMs, making it possible to store hundreds of containers on a physical host (versus a strictly limited number of VMs). In addition, since they utilize the host operating system, restarting a container does not mean restarting or rebooting the operating system. Thus, containers are much more portable and much more efficient for many use cases.

With Docker Containers, the efficiencies are even greater. With a traditional VM, each application, each copy of an application, and each slight modification of an application requires creating an entirely new VM.

As shown above, a new application on a host need only have the application and its binaries/libraries. There is no need for a new guest operating system.

If you want to run several copies of the same application on a host, you do not even need to copy the shared binaries.

Finally, if you make a modification of the application, you need only copy the differences.

Figure 10: Mechanism to Make Docker Containers Lightweight

This not only makes it efficient to store and run containers, it also makes it extremely easy to update applications. As shown in the next figure, updating a container only requires applying the differences.

Figure 11: Modifying and Updating Containers

What is the Relationship between Docker and dotCloud?

Docker is an open-source implementation of the deployment engine which powers dotCloud, a popular Platform-as-a-Service. It benefits directly from the experience accumulated over several years of large-scale operation and support of hundreds of thousands of applications and databases. dotCloud is the chief sponsor of the Docker project, and dotCloud’s CTO is the original architect and current overall maintainer. While several dotCloud employees work on Docker full-time, Docker is a true community project, with hundreds of outside contributors and a complete open design philosophy. All pulls, pushes, forks, bugs, issues, and roadmaps are available for viewing, editing, and commenting on GitHub.

What Are Some Cool Use Cases For Docker?

Docker is a powerful tool for many different use cases. Here are some great early use cases for Docker, as described by members of our community.

  • Build your own PaaS: Dokku – a Docker-powered mini-Heroku, the smallest PaaS implementation you’ve ever seen – http://bit.ly/191Tgsx
  • Web-based environment for instruction: JiffyLab – a web-based environment for the instruction, or lightweight use of, Python and the UNIX shell – http://bit.ly/12oaj2K
  • Easy application deployment: Deploy Java Apps With Docker = Awesome (http://bit.ly/11BCvvu); Running Drupal on Docker (http://bit.ly/15MJS6B); Installing Redis on Docker (http://bit.ly/16EWOKh)
  • Create secure sandboxes: Docker makes creating secure sandboxes easier than ever – http://bit.ly/13mZGJH
  • Create your own SaaS: Memcached as a Service – http://bit.ly/11nL8vh
  • Automated application deployment: Push-button Deployment with Docker – http://bit.ly/1bTKZTo
  • Continuous integration and deployment: Next Generation Continuous Integration & Deployment with dotCloud’s Docker and Strider – http://bit.ly/ZwTfoy
  • Lightweight desktop virtualization: Docker Desktop: Your Desktop Over SSH Running Inside Of A Docker Container – http://bit.ly/14RYL6x

More things you would like to know:

Getting started with Docker

Full getting-started instructions, code, and documentation are available on the Docker site. We’ve also prepared an interactive tutorial to help you get started.

Getting a copy of the source code

The Docker project is hosted on GitHub, where you can browse the repository.

Contribute to the docker community

Head on over to our community page

(via Docker.io)

 

An analysis of Facebook photo caching

As of December 2012, people upload more than 350 million photos to Facebook every day and view many more in their News Feeds and on their friends’ Timelines. Facebook stores these photos on Haystack machines that are optimized to store photos. But there is also a deep and distributed photo-serving stack with many layers of caches that delivers photos to people so they can view them.

We recently published an academic study of this stack in SOSP. In this post, we describe the stack and then cover four of our many findings:

    1. How effective each layer of the stack is.
    2. How the popularity distribution of photos changes across layers.
    3. The effect of increasing cache sizes.
    4. The effect of more advanced caching algorithms.

Facebook’s photo-serving stack

The photo-serving stack has four layers: browser caches, edge caches, origin cache, and Haystack backend. The first layer of the stack is the browser cache on peoples’ machines. If someone has recently seen or downloaded a photo, that photo will likely be in their browser cache and they can retrieve it locally. Otherwise, their browser will send an HTTPS request out to the Internet.

That request will be routed either to our CDN partners or to one of Facebook’s many edge caches, which are located at Internet points of presence (PoPs). (In the paper and this blog post, we focus on what happens in the Facebook-controlled stack.) The particular edge cache that a request is routed to is encoded in the request’s URL. If the requested photo is present at the edge cache, then it returns the photo to the user’s browser, which will store it in its local cache. If the photo is not present at the edge cache, it is requested from the origin cache.

The origin cache is a single cache that is distributed across multiple data centers (like Prineville, Forest City, and Luleå). Requests from the edge caches are routed to hosts in the origin cache based on the requested photo’s ID. This means that for a single picture, all requests to the origin cache from all edge caches will be directed to the same server. If the requested photo is present in the origin cache, it is returned to the person via the edge cache, which now will store the photo. If the requested photo is not present, it is fetched from the Haystack backend.

The Haystack backend stores all photos, so all requests can be satisfied at this layer. Requests from the origin cache are typically filled by Haystack machines in the same data center. If for some reason the local Haystack machine is unavailable, the origin cache will request the photo from a Haystack machine in another data center. In either case, the photo flows back up the caching stack being stored at each layer as it goes: origin cache, edge cache, and then browser cache.
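
The ID-based routing from edge caches to origin-cache hosts can be pictured with a small sketch (not Facebook’s implementation; the host names and choice of hash are assumptions): every request for a given photo ID deterministically maps to the same origin host.

```python
# Minimal sketch (not Facebook's implementation): hash the photo ID so that
# every edge cache maps a given photo to the same origin-cache host.
import hashlib

ORIGIN_HOSTS = ["origin-01", "origin-02", "origin-03", "origin-04"]  # made up

def origin_host_for(photo_id: str) -> str:
    digest = hashlib.md5(photo_id.encode()).hexdigest()
    return ORIGIN_HOSTS[int(digest, 16) % len(ORIGIN_HOSTS)]

# All edge caches compute the same mapping, so requests for one photo
# always land on one origin-cache server, wherever they originate.
print(origin_host_for("photo-123456"))
```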

Trace collection

In our study, we collected a month-long trace of requests from non-mobile users for a deterministic subset of photos that was served entirely by the Facebook-controlled stack. The trace captures hits and misses at all levels of the stack, including client browser caches. All analysis and simulation is based on this trace.

From our trace we determined the hit ratio at each of the layers and the ratio of total requests that travel past each layer. Each layer significantly reduces the volume of traffic so that the Haystack backend only needs to serve 9.9% of requested photos.

This is a significant reduction in requests, but we can do even better.

Shifting popularity distributions

It is well known that the popularity of objects on the web tends to follow a power law distribution. Our study confirmed this for the Facebook photo workload and showed how this distribution shifts as we move down layers in the stack. We counted the number of requests for each photo at every layer, sorted them by popularity, and then graphed them on a log-log scale. The power law relationship shows up as linear on this scale.

As we move down the stack, the alpha parameter for the power law distribution decreases and flattens out. The distribution also changes to one that more closely resembles a stretched exponential distribution.
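
A minimal Python sketch of this kind of popularity analysis (illustrative, not the paper’s code): count requests per photo, rank them, and estimate the power-law exponent alpha from the slope of log(count) versus log(rank).

```python
# Estimate a Zipf/power-law exponent from a request trace (illustrative only).
from collections import Counter
import math

def estimate_alpha(photo_requests):
    """photo_requests: list of requested photo IDs (needs >= 2 distinct photos)."""
    counts = sorted(Counter(photo_requests).values(), reverse=True)
    xs = [math.log(rank) for rank in range(1, len(counts) + 1)]
    ys = [math.log(count) for count in counts]
    # Least-squares slope of log(count) vs. log(rank); a power law is linear here.
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return -num / den          # larger alpha means popularity is more skewed

print(estimate_alpha(["a", "a", "a", "a", "b", "b", "c"]))
```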

Serving popular photos

Caches derive most of their utility from serving the most popular content many times. To quantify to what degree this was true for the Facebook stack, we split photos up into popularity groups. Group “high” has the 1,000 most popular photos, group “medium” the next 9,000 most popular photos, group “low” the next 90,000 most popular photos, and group “lowest” the least popular 2.5 million photos in the subset of photos we traced. Each of these groups represents about 25% of user requests. This graph shows the percent of requests served by each layer for these four groups.

As expected, the browser and edge caches are very effective for the most popular photos in our trace and progressively less effective for less popular groups. The origin cache’s largest effect is seen for popularity group low, which also agrees with our intuition. More popular photos will be effectively cached earlier in the stack at the browser and edge layers. Less popular photos are requested too infrequently to be effectively cached.

Larger caches improve hit ratios

Unless your entire workload fits in the current cache, a larger cache will improve hit ratios — but by how much? To find out we took our trace to each layer and replayed it using a variety of different cache sizes. We share the results for one of the edge caches and the origin cache here.

Size X in the graph is the current size of that edge cache. Doubling the size of the cache would significantly increase the hit ratio, from 59% to 65%. Tripling the size of the cache would further increase the hit ratio to 68.5%.

The infinite line in the graphs shows the performance of a cache of infinite size, i.e., one that can hold the entire workload. It is an upper bound on the possible hit ratio and it shows us there is still much room for improvement.

The origin cache shows even more room for improvement from larger caches. Doubling the size of the origin cache improves hit ratio from 33% to 42.5%. Tripling increases that to 48.5%. The results for a cache of infinite size again demonstrate that there is still much room for improvement.
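
As a sketch of what such a replay looks like (illustrative only; the paper’s simulator replays the real trace, and here an LRU policy stands in for whichever policy the layer uses), the hit ratio can be measured by feeding the request stream through a fixed-size cache:

```python
# Illustrative trace replay (not the paper's simulator): feed a request
# stream through a fixed-size LRU cache and report the hit ratio.
from collections import OrderedDict

def replay(trace, cache_size):
    cache, hits = OrderedDict(), 0
    for photo_id in trace:
        if photo_id in cache:
            hits += 1
            cache.move_to_end(photo_id)          # refresh recency on a hit
        else:
            cache[photo_id] = True
            if len(cache) > cache_size:
                cache.popitem(last=False)        # evict the least recently used
    return hits / len(trace)

trace = ["a", "b", "a", "c", "a", "b", "d", "a"]   # stand-in for the real trace
for size in (1, 2, 4):                             # e.g. 1x, 2x, 4x a baseline size
    print(size, replay(trace, size))
```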

Advanced caching algorithms improve hit ratios

Larger caches are not the only way to improve hit ratios — being smarter about what items you keep in the cache can also have an effect. We used our trace to test the effects of using a variety of different caching algorithms. We published results for the following algorithms:

    • FIFO: A first-in-first-out queue is used for cache eviction. This is the algorithm Facebook currently uses.
    • LRU: A priority queue ordered by last-access time is used for cache eviction.
    • LFU: A priority queue ordered first by number of hits and then by last-access time is used for cache eviction.
    • S4LRU: Quadruple-segmented LRU. An intermediate algorithm between LRU and LFU (see the paper for details; a minimal sketch follows this list).
    • Clairvoyant: A priority queue ordered by next-access time is used for cache eviction. (Requires knowledge of the future and is impossible to implement in practice, but gives a more accurate upper bound on the performance of an ideal caching algorithm.)
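
For concreteness, here is a hedged Python sketch of S4LRU as described above (an illustrative reading of the algorithm, not Facebook’s implementation): the cache is split into four LRU segments, a miss inserts into the lowest segment, a hit promotes the item one segment up, and overflow cascades downward until items fall out of the bottom segment.

```python
# Hedged sketch of S4LRU (quad-segmented LRU), illustrative only.
from collections import OrderedDict

class S4LRU:
    def __init__(self, capacity, segments=4):
        self.seg_cap = capacity // segments
        self.segments = [OrderedDict() for _ in range(segments)]   # 0 = lowest

    def _find(self, key):
        for level, seg in enumerate(self.segments):
            if key in seg:
                return level
        return None

    def _insert(self, level, key):
        seg = self.segments[level]
        seg[key] = True                            # most recently used at the end
        if len(seg) > self.seg_cap:
            victim, _ = seg.popitem(last=False)    # evict this segment's LRU item
            if level > 0:
                self._insert(level - 1, victim)    # demote into the segment below
            # at level 0 the victim simply falls out of the cache

    def access(self, key):
        level = self._find(key)
        if level is None:
            self._insert(0, key)                   # miss: enter the lowest segment
            return False
        del self.segments[level][key]
        self._insert(min(level + 1, len(self.segments) - 1), key)   # hit: promote
        return True

cache = S4LRU(capacity=8)
hits = sum(cache.access(key) for key in "abacabdaeafaga")
print(hits, "hits")
```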

First, how do the different algorithms stack up at an edge cache?

While both LRU and LFU give slightly higher hit ratios than the FIFO caching policy, the S4LRU caching algorithm gives much higher hit ratios. Based on the data from our trace, at the current size the S4LRU algorithm would increase the edge cache hit ratio from 59.2% to 67.7%. At twice the current size, the trace data predicts it would increase cache hit ratios to 72.0%.

We see even more pronounced effects at the origin cache.

Based on our trace data, the S4LRU algorithm again gives the best results. (We tested quite a few different caching algorithms that are less common than LRU and LFU but did not publish results for them because S4LRU outperformed all of them in our tests.) It increases hit ratio from 33.0% to 46.9% based on our trace data. At twice the current size, our data predicts that S4LRU can provide a 54.4% hit ratio, a 21.4% improvement over the current FIFO cache.

More information

Qi Huang gave a talk on this work at SOSP. The slides and a video of the talk are on the SOSP conference page. There are also many more details in our paper.

(via Facebook.com)

 

A3Cube develop Extreme Parallel Storage Fabric, 7x Infiniband

News from EETimes points towards a startup that claims to offer an extreme performance advantage over Infiniband.  A3Cube Inc. has developed a variation of PCI Express on a network interface card to offer lower latency.  The company is promoting their Ronniee Express technology via a PCIe 2.0 driven FPGA to offer sub-microsecond latency across a 128-server cluster.

In the Sockperf benchmark, numbers from A3Cube put performance at around 7x that of Infiniband and PCIe 3.0 x8, and thus claim that the approach beats the top alternatives.  The PCIe support of the device at the physical layer enables quality-of-service features, and A3Cube claim the fabric enables a cluster of 10000 nodes to be represented in a single image without congestion.

The aim for A3Cube will be primarily in HFT, genomics, oil/gas exploration and real-time data analytics.  Prototypes for merchants are currently being worked on, and it is expected that two versions of network cards and a 1U switch based on the technology will be available before July.

The new IP from A3Cube is kept hidden away, but the logic points towards device enumeration and the extension of the PCIe root complex of a cluster of systems.  This is based on the quote regarding PCIe 3.0 incompatibility based on the different device enumeration in that specification.  The plan is to build a solid platform on PCIe 4.0, which puts the technology several years away in terms of non-specialized deployment.

As with many startups, the next step for A3Cube is to secure venture funding.  The approach of Ronniee Express is different from that of PLX, which is developing a direct PCIe interconnect for computer racks.

A3Cube’s webpage on the technology states the fabric uses a combination of hardware and software, while remaining application transparent.  The product combines multiple 20 or 40 Gbit/s channels, with the aim at petabyte-scale Big Data and HPC storage systems.

Information from Willem Ter Harmsel puts the Ronniee NIC system as a global shared memory container, with an in-memory network between nodes.  CPU/memory/IO are directly connected, with 800-900 nanosecond latencies, and the ‘memory windows’ facilitate low-latency traffic.

Using A3Cube’s storage OS, byOS, 40 terabytes of SSDs, and the Ronniee Express fabric, five storage nodes were connected together via 4 links per NIC, allowing for 810ns latency in any direction.  A3Cube claim 4 million IOPS with this setup.

Further, an interview by Willem with Antonella Rubicco shows that “Ronniee is designed to build massively parallel storage and analytics machines; not to be used as an ‘interconnection’ as Infiniband or Ethernet.  It is designed to accelerate applications and create parallel storage and analytics architecture.”

(via AnandTech.com)

 

ScaleDB: A NewSQL Alternative to Amazon RDS and Oracle RAC with 10x Performance Gain

In the world of relational databases, MySQL and Oracle RAC (Real Application Cluster) own a significant segment of the market. Oracle RAC owns the majority of the enterprise market, while MySQL gained significant popularity amongst many of the Web 2.0 sites.

Today, both of those databases are owned by Oracle (MySQL was acquired by Sun which was later acquired by Oracle).

The following diagrams show the enterprise database market share covered by Gartner and the cloud database market share covered by Jelastic – a Java PaaS provider.

The acquisition of MySQL by Oracle raised concern over the future of the project due to the inherent conflict of interest between its two database products. Oracle RAC is the company’s main “cash cow” product, while MySQL competes for the same audience.

Shortly after Oracle’s acquisition of MySQL, the open source database was forked by one of its original founders into a new project named MariaDB. MariaDB was established to provide an alternative development and support option to MySQL and is now becoming the default database of choice of RedHat.

MySQL vs Oracle RAC Clustering Models

The two databases take a fairly different approach to scalability.

Oracle RAC is based on a shared storage model. With Oracle, the data is striped across multiple devices, and multiple database servers operate concurrently and in sync over the (shared) data.

MySQL, on the other hand, does not use a shared data model. With MySQL, a given data set is managed by a single server (a model called shared nothing). With MySQL, scaling and High Availability (HA) is achieved by managing copies of the data. As only a single machine can update the data, this mode can only scale-up by adding more capacity to the machine that owns the database. As machines have limits to capacity yet must keep up with large amounts of data or many users, the database needs to be broken into several independent databases (a process called sharding). However, sharding is a complex process, is not dynamic and requires assumptions on data distribution, data growth and the queries. For these reasons, sharding is not possible with many applications. In particular, a cloud deployment is expected to be elastic to dynamically support changing needs and user behaviors.

For High Availability, the MySQL shared-nothing approach uses a primary/backup model with a single master and a backup node (called the slave) that manages a copy of the data. Each update to the master requires that the same update be executed on the slave. The primary and slave nodes require some handshake protocol to determine who is the master and to sync the changes to the data. The master node performs updates/writes to the persistent file-system, and the level of High Availability is set by the DBA, who decides if the slave can lag behind or needs to confirm the duplicate updates within each transaction.

For scaling, this model can use the slave as a read replica (or make additional copies of the data), a method called horizontal scaling in which read requests are spread across the different copies of the data. (However, all the writes need to go to the master and then be reflected on the replicas/slaves.)
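
A minimal sketch of this read-scaling pattern (not part of MySQL or ScaleDB; the host names and the simple routing rule are assumptions for illustration): writes are routed to the single master while reads are spread round-robin across the replicas.

```python
# Minimal sketch of read/write splitting across a master and its replicas.
# Host names and the routing rule are illustrative only.
from itertools import cycle

class ReadWriteSplitter:
    def __init__(self, primary, replicas):
        self.primary = primary
        self.replicas = cycle(replicas)

    def route(self, sql: str) -> str:
        if sql.lstrip().upper().startswith(("INSERT", "UPDATE", "DELETE")):
            return self.primary                  # every write must hit the master
        return next(self.replicas)               # reads scale horizontally

router = ReadWriteSplitter("mysql-primary", ["replica-1", "replica-2"])
print(router.route("SELECT * FROM users"))       # replica-1
print(router.route("UPDATE users SET name='x'")) # mysql-primary
```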

Relational Databases on the Cloud

The high-end position of Oracle RAC, the low-cost and open source nature of MySQL, along with the adoption of the cloud as the infrastructure platform led to a vastly different method of deployment of databases in the cloud.

Oracle RAC took an approach similar to the old mainframe: to produce a pre-engineered appliance (called Exadata) that comes with the software and hardware integrated. That approach was specifically aimed at existing customers of Oracle RAC who needed a quick resolution to their scalability needs without redesigning their applications. Plugging a more optimized stack helped to push the scalability bar without changing the existing applications that rely on the database.

Amazon launched RDS, which is basically an on-demand version of the MySQL database. This approach fits nicely with the existing users of MySQL who are looking for a more affordable way to run their database in the cloud.

The main limitation of the Amazon RDS approach is that it inherits the limitations of MySQL and thus provides a limited and relatively complex read-scalability and doesn’t provide a good solution for write-scalability other than the scale-up approach.

ScaleDB – a NewSQL Alternative to Amazon RDS and Oracle RAC

ScaleDB is a NewSQL database that takes an approach similar to the read/write scalability model of Oracle RAC and implements it as a platform that transparently supports the MySQL (or MariaDB) database. As a result, existing MySQL/MariaDB applications can leverage the platform without any changes – they use MySQL or MariaDB as the front end engine which connects to the ScaleDB platform that provides a new and more concurrent database and storage services. This approach makes it possible to run multiple MySQL or MariaDB server instances against a distributed storage in a shared data model (similar to Oracle RAC).

The diagram below shows in high level how ScaleDB works.

Each Database Node runs a standard MariaDB database instance with ScaleDB plugged as the database engine and as an alternative storage and index device.

Scaling of the data and the index is done by processing requests concurrently from multiple database instances and leveraging the ScaleDB distributed storage tier (at the storage/file-system level) where data and index elements are spread evenly across multiple storage nodes of the cluster. Read more on how ScaleDB works.

ScaleDB vs Other NewSQL Databases

Most of the NewSQL databases are based on MySQL as the engine behind their implementation. The main difference between many NewSQL databases and ScaleDB is that those databases bring the NoSQL approach into the SQL world, whereas ScaleDB takes the Oracle RAC shared-storage approach to scale the database.

ScaleDB can deliver write/read scale while offering close to 100% compatibility, whereas in many of the alternative NewSQL approaches, scaling would often involve significant changes to the data model and query semantics.

Benchmark – ScaleDB vs Amazon RDS

To demonstrate the difference between a standard cloud deployment of MySQL and a ScaleDB deployment, and in order to find out whether the ScaleDB approach can live up to its promise, we conducted a benchmark comparing Amazon RDS scalability for write/read workloads with that of ScaleDB. We tested a dataset that does not fit in the memory (RAM) of a single machine and used the largest machines offered by Amazon. We required that scaling would be dynamic and that all types of queries would be supported. These requirements made sharding a no-go option.

The benchmark is based on the popular YCSB (Yahoo! Cloud Serving Benchmark) as the benchmarking framework.

The results of that benchmark are shown in the graphs below.

Both illustrate a relatively flat scalability with Amazon RDS and a close to linear scalability on the ScaleDB front.

Benchmark environment:

  • Benchmark Framework – YCSB (Yahoo! Cloud Serving Benchmark)
  • Cloud environment: Amazon
  • Machine Type: Extra Large
  • Data Element Size (per row) – 1k
  • Data Capacity: 50GB
  • Zones – Single zone.
  • RDS was set with 1,000 provisioned IOPS
  • ScaleDB cluster setup – 2 database nodes, 2 data volumes (4 machines – data is striped over 2 machines and each machine had a mirror).
  • ScaleDB MySQL engine – MariaDB

Running ScaleDB on OpenStack and Other Clouds with Cloudify

The thing that got me excited about the project is that it serves as a perfect fit for many of our enterprise customers who are using Cloudify for moving their existing applications to their private cloud without code change. Those customers are looking for a simple way to scale their applications and many of them run today on Oracle RAC for that purpose.

The move of enterprises to a cloud-based environment also drives the demand for a more affordable approach to handle their database scaling, which is provided by ScaleDB.

On the other hand, setting up a database cluster can be a fairly complex task.

By creating a Cloudify recipe for ScaleDB, we could remove a large part of that complexity and set up an entire ScaleDB database and storage environment through a single click.

In this way we can run ScaleDB on demand as we would with RDS and scale on both read and write as with Oracle Exadata, only in a more affordable fashion.

References

1. Cloudify Recipe for running ScaleDB

2. Yahoo Benchmark

3. ScaleDB Architecture white paper

4. ScaleDB user manual

(Via Nati Shalom’s Blog)