DDN Pushes the Envelope for Parallel Storage I/O

Today at Supercomputing 2014, DataDirect Networks lifted the veil a bit more on Infinite Memory Engine (IME), its new software that will employ Flash storage and a set of smart algorithms to create a buffer between HPC compute and parallel file system resources, with the goal of improving file I/O by up to 100x. The company also announced the latest release of EXAScaler, its Lustre-based storage appliance lineup.

The data patterns have been changing at HPC sites in a way that is creating bottlenecks in the I/O. While many HPC shops may think they’re primarily working with large and sequential files, the reality is that most data is relatively small and random, and that fragmented I/O creates problems when moving the data across the interconnect, says Jeff Sisilli, Sr. Director Product Marketing at DataDirect Networks.

“Parallel file systems were really built for large files,” Sisilli tells HPCwire. “What we’re finding is 90 percent of typical I/O in HPC data centers utilizes small files, those less than 32KB. What happens is, when you inject those into a parallel file system, it starts to really bring down performance.”

DDN says it overcame the restrictions in how parallel file systems were created with IME, which creates a storage tier above the file system and provides a “fast data” layer between the compute nodes in an HPC cluster and the backend file system. The software, which resides on the I/O nodes in the cluster, utilizes any available Flash solid state drives (SSDs) or other non-volatile memory (NVM) storage resources available, creating a “burst buffer” to absorb peak loads and eliminate I/O contention.

[Figure: IME diagram]
IME works in two ways. First, it removes limitations of the POSIX layer, such as file locks, that can slow down communication. Second, its algorithms bundle the small, random I/O operations into larger, better-aligned writes that can be moved into the file system more efficiently.
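DDN has not published IME's internals, but the second idea — absorbing small, random writes in fast storage and draining them to the backend as large, aligned writes — can be illustrated with a toy write-coalescing sketch. The stripe size, flush rule, and backend_write callback below are illustrative assumptions, not DDN's implementation.

# Toy sketch of a burst-buffer style write coalescer (not DDN's IME code).
# Small, random writes are absorbed immediately; data is drained to the
# backing (parallel) file system only in large, stripe-aligned chunks.

STRIPE_SIZE = 1 << 20  # assume a 1 MiB full stripe for illustration


class WriteCoalescer:
    def __init__(self, backend_write, stripe_size=STRIPE_SIZE):
        self.backend_write = backend_write  # e.g. a write to Lustre/GPFS
        self.stripe_size = stripe_size
        self.buffer = bytearray()

    def write(self, payload: bytes):
        """Absorb a small write at 'burst buffer' speed."""
        self.buffer.extend(payload)
        # Drain only full stripes, so the backend sees large aligned writes.
        while len(self.buffer) >= self.stripe_size:
            stripe = bytes(self.buffer[: self.stripe_size])
            del self.buffer[: self.stripe_size]
            self.backend_write(stripe)

    def flush(self):
        """Push any remainder (e.g. at job end)."""
        if self.buffer:
            self.backend_write(bytes(self.buffer))
            self.buffer.clear()


if __name__ == "__main__":
    written = []
    wc = WriteCoalescer(backend_write=lambda chunk: written.append(len(chunk)))
    for _ in range(5000):
        wc.write(b"x" * 512)   # 5000 tiny 512-byte writes...
    wc.flush()
    print(written)             # ...reach the backend as a few ~1 MiB writes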

In lab tests at a customer site, DDN ran IME against the S3D turbulent flow modeling software. The software was really designed for larger sequential files, but is often used in the real world with smaller and random files. In the customer’s case, these “mal-aligned and fragmented” files were causing I/O throughput across the InfiniBand interconnect to drop to 25 MB per second.

After introducing IME, the customer was able to ingest data from the compute cluster onto IME’s SSDs at line rate. “This customer was using InfiniBand, and we were able to fill up InfiniBand all the way to line rate, and absorb at 50 GB per second,” Sisilli says.

The data wasn’t written back into the file system quite that quickly. But because the algorithms were able to align all those small files and convert fragments into full-stripe writes, it was still a large speedup over 25 MB per second. “We were able to drain out the buffer and write to the parallel file system at 4GB per second, which is two orders of magnitude faster than before,” Sisilli says.

The “net net” of IME, Sisilli says, is it frees up HPC compute cluster resources. “From the parallel file system side, we’re able to shield the parallel file system and underlying storage arrays from fragmented I/O, and have those be able to ingest optimized data and utilize much less hardware to be able to get to the performance folks need up above,” he says.

IME will work with any Lustre- or GPFS-based parallel file system. That includes DDN’s own EXAScaler line of Lustre-based storage appliances, or the storage appliances of any other vendor. No application modifications are required to use IME, which also features erasure coding capabilities typically found in object stores. The only requirements are that the application is POSIX compliant or uses the MPI job scheduler. DDN also provides an API that customers can use if they want to modify their apps to work with IME; the company plans to create an ecosystem of compatible tools around this API.

There are other vendors developing similar Flash-based storage buffer offerings. But DDN says its open, software-based approach gives customers an advantage over vendors that require customers to purchase specialized hardware, or that work with only certain types of interconnects.

[Figure: IME burst buffer]

IME isn’t available yet; it’s still in technology preview mode. But when it becomes available, scalability won’t be an issue. The software will be able to corral and make available petabytes worth of Flash or NVM storage resources living across thousands of nodes, Sisilli says. “What we’re recommending is [to have in IME] anywhere between two to three times the amount of your compute cluster memory to have a great working space within IME to accelerate your applications and do I/O,” he says. “That can be all the way down to terabytes, and for supercomputers, it’s multi-petabytes.”

IME is still undergoing tests, and is expected to become generally available in the second quarter of 2015. DDN will offer it as an appliance or as software.

DDN also today unveiled a new release of EXAScaler. With Version 2.1, DDN has improved read and write I/O performance by 25 percent. That will give DDN a comfortable advantage over competing file systems for some time, says Roger Goff, Sr. Product Manager for DDN.

“We know what folks are about to announce because they pre-announce those things,” Goff says. “Our solution is tremendously faster than what you will see [from other vendors], particularly on a per-rack performance basis.”

Other new features in version 2.1 include support for self-encrypting drives; improved rebuild times; InfiniBand optimizations; and better integration with DDN’s Storage Fusion Xcelerator (SFX Flash Caching) software.

DDN has also standardized on the Lustre file system from Intel, called Intel Enterprise Edition for Lustre version 2.5. That brings it several new capabilities, including a new MapReduce connector for running Hadoop workloads.

“So instead of having data replicated across multiple nodes in the cluster, which is the native mode for HDFS, with this adapter, you can run those Hadoop applications and take advantages of the single-copy nature of a parallel file system, yet have the same capability of a parallel file system to scale to thousands and thousands of clients accessing that same data,” Goff says.

EXAScaler version 2.1 is available now across all three EXAScaler products, including the entry-level SFA7700, the midrange ES12k/SFA12k-20, and the high-end SFA12KX/SFA12k-40.

( Via HPCwire.com )

Amazon Announces EC2 Container Service For Managing Docker Containers On AWS

Earlier this year I wrote about container computing and enumerated some of the benefits that you get when you use it as the basis for a distributed application platform: consistency & fidelity, development efficiency, and operational efficiency. Because containers are lighter in weight and have less memory and computational overhead than virtual machines, they make it easy to support applications that consist of hundreds or thousands of small, isolated “moving parts.” A properly containerized application is easy to scale and maintain, and makes efficient use of available system resources.

Introducing Amazon EC2 Container Service

In order to help you to realize these benefits, we are announcing a preview of our new container management service, EC2 Container Service (or ECS for short). This service will make it easy for you to run any number of Docker containers across a managed cluster of Amazon Elastic Compute Cloud (EC2) instances using powerful APIs and other tools. You do not have to install cluster management software, purchase and maintain the cluster hardware, or match your hardware inventory to your software needs when you use ECS. You simply launch some instances in a cluster, define some tasks, and start them. ECS is built around a scalable, fault-tolerant, multi-tenant base that takes care of all of the details of cluster management on your behalf.

By the way, don’t let the word “cluster” scare you off! A cluster is simply a pool of compute, storage, and networking resources that serves as a host for one or more containerized applications. In fact, your cluster can even consist of a single t2.micro instance. In general, a single mid-sized EC2 instance has sufficient resources to be used productively as a starter cluster.

EC2 Container Service Benefits
Here’s how this service will help you to build, run, and scale Docker-based applications:

  • Easy Cluster Management – ECS sets up and manages clusters made up of Docker containers. It launches and terminates the containers and maintains complete information about the state of your cluster. It can scale to clusters that encompass tens of thousands of containers across multiple Availability Zones.
  • High Performance – You can use the containers as application building blocks. You can start, stop, and manage thousands of containers in seconds.
  • Flexible Scheduling – ECS includes a built-in scheduler that strives to spread your containers out across your cluster to balance availability and utilization. Because ECS provides you with access to complete state information, you can also build your own scheduler or adapt an existing open source scheduler to use the service’s APIs.
  • Extensible & Portable – ECS runs the same Docker daemon that you would run on-premises. You can easily move your on-premises workloads to the AWS cloud, and back.
  • Resource Efficiency – A containerized application can make very efficient use of resources. You can choose to run multiple, unrelated containers on the same EC2 instance in order to make good use of all available resources. You could, for example, decide to run a mix of short-term image processing jobs and long-running web services on the same instance.
  • AWS Integration – Your applications can make use of AWS features such as Elastic IP addresses, resource tags, and Virtual Private Cloud (VPC). The containers are, in effect, a new base-level building block in the same vein as EC2 and S3.
  • Secure – Your tasks run on EC2 instances within an Amazon Virtual Private Cloud. The tasks can take advantage of IAM roles, security groups, and other AWS security features. Containers run in a multi-tenant environment and can communicate with each other only across defined interfaces. The containers are launched on EC2 instances that you own and control.

Using EC2 Container Service
ECS was designed to be easy to set up and to use!

You can launch an ECS-enabled AMI and your instances will be automatically checked into your default cluster. If you want to launch into a different cluster you can specify it by modifying the configuration file in the image, or passing in User Data on launch. To ECS-enable a Linux AMI, you simply install the ECS Agent and the Docker daemon.

ECS will add the newly launched instance to its capacity pool and run containers on it as directed by the scheduler. In other words, you can add capacity to any of your clusters by simply launching additional EC2 instances in them!

The ECS Agent will be available in open source form under an Apache license. You can install it on any of your existing Linux AMIs and call registerContainerInstances to add them to your cluster.


Here are a few vocabulary items to help you to get familiar with the terminology used by ECS:
  • Cluster – A cluster is a pool of EC2 instances in a particular AWS Region, all managed by ECS. One cluster can contain multiple instance types and sizes, and can reside within one or more Availability Zones.
  • Scheduler – A scheduler is associated with each cluster. The scheduler is responsible for making good use of the resources in the cluster by assigning containers to instances in a way that respects any placement constraints and simultaneously drives as much parallelism as possible, while also aiming for high availability.
  • Container – A container is a packaged (or “Dockerized,” as the cool kids like to say) application component. Each EC2 instance in a cluster can serve as a host to one or more containers.
  • Task Definition – A JSON file that defines a Task as a set of containers. Fields in the file define the image for each container, convey memory and CPU requirements, and also specify the port mappings that are needed for the containers in the task to communicate with each other.
  • Task – A task is an instantiation of a Task Definition consisting of one or more containers, defined by the work that they do and their relationship to each other.
  • ECS-Enabled AMI – An Amazon Machine Image (AMI) that runs the ECS Agent and dockerd. We plan to ECS-enable the Amazon Linux AMI and are working with our partners to similarly enable their AMIs.

EC2 Container Service includes a set of APIs that are both simple and powerful. You can create, describe, and destroy clusters and you can register EC2 instances therein. You can create task definitions and initiate and manage tasks.
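As a rough illustration of that API surface, here is a minimal sketch using the boto3 SDK (which postdates the preview described in this post); the cluster name, container image, and resource sizes are made up for the example.

# Minimal ECS walkthrough with boto3 (illustrative names and sizes).
import boto3

ecs = boto3.client("ecs", region_name="us-east-1")

# 1. Create (or reuse) a cluster.
ecs.create_cluster(clusterName="demo-cluster")

# 2. Register a task definition: one nginx container with a port mapping.
ecs.register_task_definition(
    family="demo-web",
    containerDefinitions=[
        {
            "name": "web",
            "image": "nginx:latest",
            "cpu": 256,
            "memory": 128,
            "portMappings": [{"containerPort": 80, "hostPort": 80}],
        }
    ],
)

# 3. Start two copies of the task on the cluster's registered instances.
ecs.run_task(cluster="demo-cluster", taskDefinition="demo-web", count=2)

# 4. Inspect what is running.
print(ecs.list_tasks(cluster="demo-cluster")["taskArns"])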

Here is the basic set of steps that you will follow in order to run your application on ECS. I am making the assumption that you have already Dockerized your application by breaking it down into fine-grained components, each described by a Dockerfile and each running nicely on your existing infrastructure. There are plenty of good resources online to help you with this process. Many popular applications and application components have already been Dockerized and can be found on Docker Hub. You can use ECS with any public or private Docker repository that you can access. Ok, so here are the steps:

  1. Create a cluster, or decide to use the default one for your account in the target Region.
  2. Create your task definitions and register them with the cluster.
  3. Launch some EC2 instances and register them with the cluster.
  4. Start the desired number of copies of each task.
  5. Monitor the overall utilization of the cluster and the overall throughput of your application, and make adjustments as desired. For example, you can launch and then register additional EC2 instances in order to expand the cluster’s pool of available resources.

EC2 Container Service Pricing and Availability
The service launches today in preview form. If you are interested in signing up, click here to join the waiting list.

There is no extra charge for ECS. As usual, you pay only for the resources that you use.

( Via Amazon blog)

Data Warehouse and Analytics Infrastructure at Viki

At Viki, we use data to power product, marketing and business decisions. We use an in-house analytics dashboard to expose all the data we collect to various teams through simple table and chart based reports. This allows them to monitor all our high level KPIs and metrics regularly.

Data also powers more heavy duty stuff – like our data-driven content recommendation system, or predictive models that help us forecast the value of content we’re looking to license. We’re also constantly looking at data to determine the success of new product features, tweak and improve existing features and even kill stuff that doesn’t work. All of this makes data an integral part of the decision making process at Viki.

To support all these functions, we need a robust infrastructure below it, and that’s our data warehouse and analytics infrastructure.

This post is the first of a series about our data warehouse and analytics infrastructure. In this post we’ll cover the high-level pipeline of the system and go into detail about how we collect and batch-process our data. Do expect a lot of detail-level discussion.

About Viki: We’re an online TV site, with fan-powered translations in 150+ different languages. To understand more about what Viki is, watch this short video (2 minutes).

Part 0: Overview (or TL;DR)

Our analytics infrastructure, following the most common-sense approach, is broken down into 3 steps:

  • Collect and Store Data
  • Process Data (batch + real-time)
  • Present Data


Collect and Store Data

  1. Logs (events) are sent by different clients to a central log collector
  2. The log collector forwards events to a Hydration service, where the events get enriched with more time-sensitive information; the results are stored to S3

Batch Processing Data

  1. An hourly job takes data from S3, applies further transformations (read: cleaning up bad data) and stores the results to our cloud-based Hadoop cluster
  2. We run multiple MapReduce (Hive) jobs to aggregate data from Hadoop and write them to our central analytics database (Postgres)
  3. Another job takes a snapshot of our production databases and restores it into our analytics database

Presenting Data

  1. All analytics/reporting-related activities are then done on our master analytics database. The results (meant for report presentation) are then sent to our Reporting DB
  2. We run an internal reporting dashboard app on top of our Reporting DB; this is where end-users log in to see their reports

Real-time Data Processing

  1. The data from the Hydration service is also multiplexed and sent to our real-time processing pipeline (using Apache Storm)
  2. In Storm, we write custom jobs that do real-time aggregation of important metrics. We also wrote a real-time alerting system to inform us when traffic goes bad

Part 1: Collecting, Pre-processing and Storing Data


We use fluentd to receive event logs from different platforms (web, Android, iOS, etc.) over HTTP. We set up a cluster of 2 dedicated servers running multiple fluentd instances inside Docker, load-balanced through HAproxy. When a message hits our endpoint, fluentd buffers the messages and batch-forwards them to our Hydration System. At the moment, our cluster is doing 100M messages a day.

A word about fluentd: It’s a robust open-source log collecting software that has a very healthy and helpful community around it. The core is written in C so it’s fast and scalable, with plugins written in Ruby, making it easy to extend.

What data do we collect? We collect everything that we think is useful for the business: a click event, an ad impression, ad request, a video play, etc. We call each record an event, and it’s stored as a JSON string like this:

{
  "time":1380452846, "event": "video_play",
  "video_id":"1008912v", "user_id":"5298933u",
  "uuid":"80833c5a760597bf1c8339819636df04",
  "app_id":"100004a", "app_ver":"2.9.3.151",
  "device":"iPad", "stream_quality":"720p",
  "ip":"99.232.169.246", "country":"ca",
  "city_name":"Toronto", "region_name":"ON"
}

Pre-processing Data – Hydration Service


Data Hydration Service

After collection, the data is sent to a Hydration Service for pre-processing. Here, the message is enriched with time-sensitive information.

For example: When a user watches a video (thus a video_play event is sent), we want to know whether it’s a free user or a paid user. Since the user could be a free user today and upgrade to paid tomorrow, the only way to correctly attribute the play event to the free/paid bucket is to inject that status right into the message when it’s received. In short, the service translates this:

{ "event":"video_play", "user_id":"1234" }

into this:

{ "event":"video_play", "user_id":"1234", "user_status":"free" }

For non-time-sensitive operations (fixing typos, deriving country from IP, etc.), there is a different process (discussed below).
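As a rough sketch (not our actual service), the time-sensitive part of hydration boils down to a lookup against current state at ingest time; the in-memory user_status table below stands in for whatever store the real service consults.

# Toy hydration step: enrich an event with the user's status *as of now*.
import json

# Stand-in for a live lookup (cache, database, etc.) of current user state.
user_status = {"1234": "free", "5298933u": "paid"}


def hydrate(raw_event: str) -> str:
    event = json.loads(raw_event)
    uid = event.get("user_id")
    # Inject the time-sensitive attribute at receive time, because the same
    # user may be "paid" tomorrow and the event must keep today's truth.
    event["user_status"] = user_status.get(uid, "unknown")
    return json.dumps(event)


print(hydrate('{ "event":"video_play", "user_id":"1234" }'))
# {"event": "video_play", "user_id": "1234", "user_status": "free"}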

Storing to S3

From the Hydration Service, the message is buffered and then stored to S3 – our source of truth. The data is gzip-compressed and stored in hourly buckets, making it easy and fast to retrieve data for any one-hour period.

Part 2: Batch-processing Data

The processing layer has 2 components: batch-processing and real-time processing.

This section focuses mainly on our batch-processing layer – our main process of transforming data for reporting/presentation. We’ll cover our real-time processing layer in another post.


Batch-processing Data Layer

Cleaning Data Before Importing Into Hadoop

Those who have worked with data before know this: Cleaning data takes a lot of time. In fact: Cleaning and preparing data will take most of your time. Not the actual analysis.

What is unclean data (or bad data)? Data that is logged incorrectly. It comes in a lot of different forms, e.g.:

  • A typo: sending a ‘clickk’ event instead of ‘click’
  • Clients sending an event twice, or forgetting to send it at all

When bad data enters your system, it stays there forever, unless you purposely find a way to clean it up or take it out.

So how do we clean up bad data?

Previously, when we received a record, we wrote it directly to our Hadoop cluster and made a backup to S3. This made it difficult to correct bad data, due to the append-only nature of Hadoop.

Now all the data is first stored in S3, and an hourly process takes data from S3, applies cleanup/transformations and loads it into Hadoop (insert-overwrite).
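A minimal sketch of such an hourly, idempotent pass is below; the bucket names, key layout, and clean() rules are made-up placeholders, and our real pipeline writes into Hadoop with insert-overwrite rather than back to S3.

# Hourly, idempotent reprocessing sketch: read one hour of raw events from S3,
# clean them, and overwrite that hour's output partition. Re-running the same
# hour produces the same output, so reprocessing never double-counts data.
import gzip
import json

import boto3

s3 = boto3.client("s3")
RAW_BUCKET, CLEAN_BUCKET = "raw-events", "clean-events"   # assumed names

EVENT_FIXES = {"clickk": "click"}  # example typo-level cleanup rule


def clean(event: dict) -> dict:
    event["event"] = EVENT_FIXES.get(event["event"], event["event"])
    return event


def process_hour(hour_prefix: str) -> None:
    """hour_prefix, e.g. '2014/06/01/13/' — one hour bucket of gzipped JSON lines."""
    cleaned = []
    resp = s3.list_objects_v2(Bucket=RAW_BUCKET, Prefix=hour_prefix)
    for obj in resp.get("Contents", []):
        body = s3.get_object(Bucket=RAW_BUCKET, Key=obj["Key"])["Body"].read()
        for line in gzip.decompress(body).decode("utf-8").splitlines():
            cleaned.append(json.dumps(clean(json.loads(line))))
    # Overwrite the whole hour in one go (the insert-overwrite idea).
    s3.put_object(
        Bucket=CLEAN_BUCKET,
        Key=hour_prefix + "part-000.json.gz",
        Body=gzip.compress("\n".join(cleaned).encode("utf-8")),
    )


process_hour("2014/06/01/13/")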


Storing Data, Before and After

The process is similar in nature to the hydration process, but this time we look at one-hour blocks at a time, rather than individual records. This approach has many great benefits:

  • The pipeline is more linear, which prevents data discrepancies (between S3 and Hadoop).
  • The data is not tied down to being stored in Hadoop. If we want to load our data into another data store, we’d just write another process that transforms S3 data and dumps it somewhere else.
  • When bad logging happens and causes unclean data, we can modify the transformation code and rerun the data from the point of the bad logging. Because the process is idempotent, we can reprocess as many times as we want without double-logging the data.


Our S3 to Hadoop Transform and Load Process

If you’ve studied The Log, the article from the Data Engineering folks at LinkedIn, you’ll notice that our approach is very similar (replacing Kafka with S3, and per-message processing with per-hour processing). Indeed, our paradigm is inspired by The Log architecture. However, given our needs, we chose S3 because:

  1. When it comes to batch (re)processing, we do it per time period (e.g. process 1 hour of data). Kafka uses natural numbers (offsets) to order messages, so if we used Kafka we would have to build another service to translate [beg_timestamp, end_timestamp) into [beg_index, end_index).
  2. Kafka can only retain data for up to X days due to disk-space limitations. Since we want the ability to reprocess data from further back, using Kafka would mean we need another strategy for those cases.

Aggregating Data from Hadoop into Postgres

Once the data gets into Hadoop, daily aggregation jobs aggregate it into fewer dimensions and port it into our master analytics database (PostgreSQL).

For example, to aggregate a table of video starts data together with some video and user information, we run this Hive query (MapReduce job):

-- The hadoop's events table would contain 2 fields: time (int), v (json)
SELECT
  SUBSTR(FROM_UNIXTIME(time), 0, 10) AS date_d,
  v['platform'] AS platform,
  v['country'] AS country,
  v['video_id'] AS video_id,
  v['user_id'] AS user_id,
  COUNT(1) AS cnt
FROM events
WHERE time >= BEG_TS
  AND time <= END_TS
  AND v['event'] = 'video_start'
GROUP BY 1,2,3,4,5

and load the results into an aggregated.video_starts table in Postgres:

       Table "aggregated.video_starts"
   Column    |          Type          | Modifiers 
-------------+------------------------+-----------
 date_d      | date                   | not null
 platform    | character varying(255) | 
 country     | character(3)           | 
 video_id    | character varying(255) | 
 user_id     | character varying(255) | 
 cnt         | bigint                 | not null

Further querying and reporting of video_starts will be done out of this table. If we need more dimensions, we either rebuild this table with more dimensions, or build a new table from Hadoop.

If it’s a one-time ad-hoc analysis, we’d just run the queries directly against Hadoop.

Table Partitioning:


Also, we’re making use of Postgres’ Table Inheritance feature to partition our data into multiple monthly tables, with a parent table on top of them all. Your query just needs to hit the parent table and the engine will know which underlying monthly tables to hit to get your data.

This makes our data very easy to maintain, with small indexes and a faster rebuild process. We keep the recent tables on fast (SSD) drives, and move older ones to slower (but bigger) drives for semi-archival purposes.
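For readers unfamiliar with inheritance-based partitioning (the approach available before PostgreSQL's native declarative partitioning), here is a rough sketch of what a monthly child of aggregated.video_starts could look like; the table name, index, and connection string are illustrative, not our actual schema.

# Sketch of inheritance-based monthly partitioning in PostgreSQL. Child tables
# carry a CHECK constraint so constraint exclusion can skip months that a
# query cannot match, while queries keep hitting the parent table.
import psycopg2

ddl = """
CREATE TABLE IF NOT EXISTS aggregated.video_starts_y2014m06 (
    CHECK (date_d >= DATE '2014-06-01' AND date_d < DATE '2014-07-01')
) INHERITS (aggregated.video_starts);

CREATE INDEX IF NOT EXISTS video_starts_y2014m06_date_idx
    ON aggregated.video_starts_y2014m06 (date_d);
"""

conn = psycopg2.connect("dbname=analytics")  # illustrative DSN
with conn, conn.cursor() as cur:
    cur.execute(ddl)
    # Queries still target the parent; the planner prunes irrelevant children.
    cur.execute(
        "SELECT platform, SUM(cnt) FROM aggregated.video_starts "
        "WHERE date_d = %s GROUP BY 1",
        ("2014-06-01",),
    )
    print(cur.fetchall())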

Centralizing All Data


We dump our production databases into our master analytics database on a daily basis.

Also, we use a lot of 3rd party vendors (Adroll, DoubleClick, Flurry, GA, etc). For each of these services, we write a process to ping their API and import the data into our master analytics database.

These data, together with the aggregated data from Hadoop, allow us to produce meaningful analysis combining data from multiple sources.

For example, to break down our video starts by different genre, we would write some query that joins data from prod.videos table with aggregated.video_starts table:

-- Video Starts by genre
SELECT V.genre, SUM(cnt) FROM aggregated.video_starts VS
LEFT JOIN prod.videos V ON VS.video_id = V.id
WHERE VS.date_d = '2014-06-01'
GROUP BY 1
ORDER BY 1

The above is made possible because we have both sources of data (event tracking data + production data) in 1 place.

Centralizing data is a very important concept in our pipeline because it makes it simple and pain-free for us to connect, report and corroborate our numbers across many different data sources.

Managing Job Dependencies

We started with a simple crontab to schedule our hourly/daily jobs. As the jobs grew more complicated, we ended up with a very long crontab.


Crontab also doesn’t support graph-based job flows (e.g. run A and B at the same time; when both finish, run C).


So we looked around for a solution. We considered Chronos (by Airbnb), but its use case is more complicated than what we needed, plus there’s the need to set up ZooKeeper and all that.

We ended up using Azkaban by LinkedIn. It has everything we need: cron-style scheduling with graph-based job flows, plus the runtime history of your jobs. And when a job flow fails, you can restart it, running only the tasks that failed or haven’t run.

It’s pretty awesome.

Making Sure Your Numbers Tie

One of the things I see discussed less often in analytics infrastructure talks and blog posts is making sure your data doesn’t get dropped half-way during transport, resulting in inconsistencies between different storage systems.

We have a process that runs after every data transport; it counts the number of records in both the source and destination storage and prints errors when they don’t match. These check-total processes sound a little tedious, but they proved crucial to our system; they give us confidence in the accuracy of the numbers we report to management.

Case in point, we had a process that dumps data from Postgres to CSV, then compresses and uploads it to S3 and loads it into Amazon Redshift (using the COPY command). So technically we have the exact same table in both Postgres and Redshift. One day our analyst pointed out that the data in Redshift was significantly less than in Postgres. Upon investigation, there was a bug that caused the CSV files to be truncated and thus not fully loaded into the Redshift tables. It slipped through because, for this particular process, we didn’t have the check-totals in place.
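A minimal version of such a check-total step might look like the sketch below, assuming both ends can be reached with psycopg2 (which also speaks to Redshift); the DSNs and table name are placeholders, not our production configuration.

# Post-transport sanity check: compare row counts between source and
# destination for a given day and shout if they disagree.
import sys

import psycopg2


def count_rows(dsn: str, table: str, date_d: str) -> int:
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        # table is an internal, trusted name in this sketch
        cur.execute(f"SELECT COUNT(*) FROM {table} WHERE date_d = %s", (date_d,))
        return cur.fetchone()[0]


def check_totals(src_dsn: str, dst_dsn: str, table: str, date_d: str) -> None:
    src = count_rows(src_dsn, table, date_d)
    dst = count_rows(dst_dsn, table, date_d)
    if src != dst:
        print(f"CHECK-TOTAL MISMATCH {table} {date_d}: source={src} dest={dst}")
        sys.exit(1)
    print(f"check-total ok: {table} {date_d} ({src} rows)")


check_totals(
    "dbname=analytics",                        # placeholder source DSN
    "dbname=warehouse host=redshift.example",  # placeholder destination DSN
    "aggregated.video_starts",
    "2014-06-01",
)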

Using Ruby as the Scripting Language

When we started we were primarily a Ruby shop, so going ahead with Ruby was a natural choice. “Isn’t it slow?”, you might say. But we use Ruby not to process data (i.e. it rarely holds any large amount of data), but to facilitate and coordinate the process.

We have written an entire library to support building data pipelines in Ruby. For example, we extended the pg gem to provide a more object-oriented interface to Postgres. It allows us to do table creation, table hot-swapping, upserts, insert-overwriting, copying tables between databases, etc., all without having to touch SQL code. It has become a nice, productive abstraction on top of SQL and Postgres. Think of it as an ORM for data-warehousing purposes.

Example: The code below will create a data.videos_by_genre table holding the result of a simple aggregation query. The process works on a temporary table and eventually performs a table hot-swap with the main one; this avoids any data disruption that would occur if we worked on the main table directly.

columns = [
  {name: 'genre', data_type: 'varchar'},
  {name: 'video_count', data_type: 'int'},
]
indexes = [{columns: ['genre']}]
table = PGx::Table.new 'data.videos_by_genre', columns, indexes

table.with_temp_table do |temp_t|
  temp_t.drop(check_exists: true)
  temp_t.create
  connection.exec <<-SQL.strip_heredoc
    INSERT INTO #{temp_t.qualified_name}
    SELECT genre, COUNT(1) FROM prod.videos
    GROUP BY 1
  SQL

  temp_t.create_indexes
  table.hotswap
end

(the above example could also be done using a MATERIALIZED VIEW btw)

Having this set of libraries has proven critical to our data pipeline process, since it allows us to write extensible and maintainable code that performs all sorts of data transformations.

Technology

We rely mostly on free and open-source technologies. Our stack is:

  • fluentd (open-source) for collecting logs
  • Cloud-based Hadoop + Hive (TreasureData – 3rd party vendor)
  • PostgreSQL + Amazon Redshift as central analytics database
  • Ruby as scripting language
  • NodeJS (worker process) with Redis (caching)
  • Azkaban (job flow management)
  • Kestrel (message queue)
  • Apache Storm (real-time stream processing)
  • Docker for automated deployment
  • HAproxy (load balancing)
  • and lots of SQL (huge thanks to Postgres, one of the best relational databases ever made)

Conclusion

The above post went through the overall architecture of our analytics system. It also went into the details of the collecting layer and the batch-processing layer. In later blog posts we’ll cover the rest, specifically:

  • Our Data Presentation layer. And how Stuti, our analyst, built our funnel analysis, fan-in and fan-out tools, all with SQL. And it updates automatically (very funnel. wow!)
  • Our Real-time traffic alert/monitoring system (using Apache Storm)
  • turing: our feature roll-out, A/B testing framework

( Via Engineering. Viki.com )

Nifty Architecture Tricks From Wix – Building A Publishing Platform At Scale


Wix operates websites in the long tail. As an HTML5-based WYSIWYG web publishing platform, they have created over 54 million websites, most of which receive under 100 page views per day. So traditional caching strategies don’t apply, yet it only takes four web servers to handle all the traffic. That takes some smart work.

Aviran Mordo, Head of Back-End Engineering at Wix, has described their solution in an excellent talk: Wix Architecture at Scale. What they’ve developed is in the best tradition of scaling: specialization. They’ve carefully analyzed their system and figured out how to meet their aggressive high availability and high performance goals in some very interesting ways.

Wix uses multiple datacenters and clouds. Something I haven’t seen before is that they replicate data to multiple datacenters, to Google Compute Engine, and to Amazon. And they have fallback strategies between them in case of failure.

Wix doesn’t use transactions. Instead, all data is immutable and they use a simple eventual consistency strategy that perfectly matches their use case.

Wix doesn’t cache (as in a big caching layer). Instead, they pay great attention to optimizing the rendering path so that every page displays in under 100ms.

Wix started small, with a monolithic architecture, and has consciously moved to a service architecture using a very deliberate process for identifying services that can help anyone thinking about the same move.

This is not your traditional LAMP stack or native cloud anything. Wix is a little different and there’s something here you can learn from. Let’s see how they do it…

Stats

  • 54+ million websites, 1 million new websites per month.

  • 800+ terabytes of static data, 1.5 terabytes of new files per day

  • 3 data centers + 2 clouds (Google, Amazon)

  • 300 servers

  • 700 million HTTP requests per day

  • 600 people total, 200 people in R&D

  • About 50 services.

  • 4 public servers are needed to serve 45 million websites

Platform

  • MySQL

  • Google and Amazon clouds

  • CDN

  • Chef

Evolution

  • Simple initial monolithic architecture. Started with one app server. That’s the simplest way to get started. Make quick changes and deploy. It gets you to a particular point.

    • Tomcat, Hibernate, custom web framework

    • Used stateful logins.

    • Disregarded any notion of performance and scaling.

  • Fast forward two years.

    • Still one monolithic server that did everything.

    • At a certain scale of developers and customers it held them back.

    • Problems with dependencies between features. Changes in one place caused deployment of the whole system. Failure in unrelated areas caused system wide downtime.

  • Time to break the system apart.

    • Went with a services approach, but it’s not that easy. How are you going to break functionality apart and into services?

    • Looked at what users are doing in the system and identified three main parts: editing websites, viewing sites created by Wix, and serving media.

    • Editing web sites includes data validation of data from the server, security and authentication, data consistency, and lots of data modification requests.

    • Once finished with the web site users will view it. There are 10x more viewers than editors. So the concerns are now:

      • high availability. HA is the most important feature because it’s the user’s business.

      • high performance

      • high traffic volume

      • the long tail. There are a lot of websites, but they are very small. Every site gets maybe 10 or 100 page views a day. The long tail makes caching not the go-to scalability strategy. Caching becomes very inefficient.

    • Media serving is the next big service. Includes HTML, javascript, css, images. Needed a way to serve the 800TB of data under a high volume of requests. The win is that static content is highly cacheable.

    • The new system looks like a networking layer that sits below three segment services: editor segment (anything that edits data), media segment (handles static files, read-only), public segment (first place a file is viewed, read-only).

Guidelines For How To Build Services

  • Each service has its own database and only one service can write to a database.

  • Access to a database is only through service APIs. This supports a separation of concerns and hiding the data model from other services.

  • For performance reasons read-only access is granted to other services, but only one service can write. (yes, this contradicts what was said before)

  • Services are stateless. This makes horizontal scaling easy. Just add more servers.

  • No transactions. With the exception of billing/financial transactions, all other services do not use transactions. The idea is to increase database performance by removing transaction overhead. This makes you think about how the data is modeled to have logical transactions, avoiding inconsistent states, without using database transactions.

  • When designing a new service, caching is not part of the architecture. First, make the service as performant as possible, then deploy to production and see how it performs; only if there are performance issues that you can’t optimize away in the code (or other layers) do you add caching.

Editor Segment

  • Editor server must handle lots of files.

  • Data stored as immutable JSON pages (~2.5 million per day) in MySQL.

  • MySQL is a great key-value store. Key is based on a hash function of the file so the key is immutable. Accessing MySQL by primary key is very fast and efficient.

  • Scalability is about tradeoffs. What tradeoffs are we going to make? Didn’t want to use NoSQL because they sacrifice consistency and most developers do not know how to deal with that. So stick with MySQL.

  • Active database. Found that after a site has been built, only 6% are still being updated. Given this, the active sites can be stored in one database that is really fast and relatively small in terms of storage (2TB).

  • Archive database. All the stale site data, for sites that are infrequently accessed, is moved over into another database that is relatively slow, but has huge amounts of storage. After three months, data is pushed to this database if accesses are low. (One could argue this is an implicit caching strategy.)

  • Gives a lot of breathing room to grow. The large archive database is slow, but it doesn’t matter because the data isn’t used that often. On first access the data comes from the archive database, but then it is moved to the active database so later accesses are fast.

High Availability For Editor Segment

  • With a lot of data it’s hard to provide high availability for everything. So look at the critical path, which for a website is the content of the website. If a widget has problems most of the website will still work. Invested a lot in protecting the critical path.

  • Protect against database crashes. Want to recover quickly. Replicate databases and failover to the secondary database.

  • Protect against data corruption and data poisoning. It doesn’t have to be malicious, a bug is enough to spoil the barrel. All data is immutable. Revisions are stored for everything. Worst case, if corruption can’t be fixed, is to revert to a version where the data was fine.

  • Protect against unavailability. A website has to work all the time. This drove an investment in replicating data across different geographical locations and multiple clouds. This makes the system very resilient.

    • Clicking save on a website editing session sends a JSON file to the editor server.

    • The server sends the page to the active MySQL server which is replicated to another datacenter.

    • After the page is saved locally, an asynchronous process is kicked off to upload the data to a static grid, which is the Media Segment.

    • After data is uploaded to the static grid, a notification is sent to an archive service running on Google Compute Engine. The archive service goes to the grid, downloads the page, and stores a copy on the Google cloud.

    • Then a notification is sent back to the editor saying the page was saved to GCE.

    • Another copy is saved to Amazon from GCE.

    • Once the final notification is received, there are three copies of the current revision of the data: one in the database, one on the static grid, and one on GCE.

    • For the current revision there are three copies. For old revisions there are two copies (static grid, GCE).

    • The process is self-healing. If there’s a failure the next time a user updates their website everything that wasn’t uploaded will be uploaded again.

    • Orphan files are garbage collected.

Modeling Data With No Database Transactions

  • Don’t want a situation where a user edits two pages and only one page is saved in the database, which is an inconsistent state.

  • Take all the JSON files and save them in the database one after the other. When all the files are saved, another save command is issued containing a manifest of all the IDs of the saved pages that were uploaded to the static servers (each ID is a hash of the content, which is also the file name on the static server). A rough sketch of this pattern follows below.
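The sketch below illustrates the idea with plain dictionaries standing in for MySQL and the static grid; the hash choice and helper names are assumptions for illustration, not Wix's actual code.

# Sketch of transaction-free saves via immutability plus a manifest.
# Each page is stored under a key derived from its content hash; the save
# only "counts" once a manifest listing every page ID is written last.
import hashlib
import json

page_store = {}      # stands in for MySQL key-value rows / the static grid
manifest_store = {}  # stands in for the site's current manifest record


def save_site(site_id, pages):
    """Save a list of page dicts, then commit them via a single manifest write."""
    ids = []
    for page in pages:
        blob = json.dumps(page, sort_keys=True).encode("utf-8")
        page_id = hashlib.sha256(blob).hexdigest()  # content-addressed key
        page_store[page_id] = blob                  # immutable: never rewritten
        ids.append(page_id)
    # Final step: one small write that atomically points the site at a
    # consistent set of pages. If we crash before this line, the old
    # manifest still references only fully-saved pages.
    manifest_store[site_id] = {"pages": ids}
    return json.dumps(manifest_store[site_id])


print(save_site("site-1", [{"page": "home", "rev": 1}, {"page": "about", "rev": 1}]))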

Media Segment

  • Stores lots of files. 800TB of user media files, 3M files uploaded daily, and 500M metadata records.

  • Images are modified. They are resized for different devices and sharpened. Watermarks can be inserted and there’s also audio format conversion.

  • Built an eventually consistent distributed file system that is multi datacenter aware with automatic fallback across DCs. This is before Amazon.

  • A pain to run. 32 servers, doubling the number every 9 months.

  • Plan to push stuff to the cloud to help scale.

  • Vendor lock-in is a myth. It’s all APIs. Just change the implementation and you can move to different clouds in weeks.

  • What really locks you down is data. Moving 800TB of data to a different cloud is really hard.

  • They broke Google Compute Engine when they moved all their data into GCE. They reached the limits of the Google cloud. After some changes by Google it now works.

  • Files are immutable, so they are highly cacheable.

  • Image requests first go to a CDN. If the image isn’t in the CDN, the request goes to their primary datacenter in Austin. If the image isn’t in Austin, the request then goes to Google Cloud. If it’s not in Google cloud, it goes to a datacenter in Tampa. (A rough sketch of this fallback chain follows below.)
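A rough sketch of that origin-fallback chain, with made-up endpoint URLs standing in for the CDN and the three datacenters:

# Fallback chain for media fetches: try each origin in order and return the
# first successful response. URLs are placeholders, not Wix's real endpoints.
import requests

ORIGINS = [
    "https://cdn.example.com",        # CDN first
    "https://austin.example.com",     # primary datacenter
    "https://gce.example.com",        # Google Cloud copy
    "https://tampa.example.com",      # last-resort datacenter
]


def fetch_media(path: str, timeout: float = 2.0) -> bytes:
    last_error = None
    for origin in ORIGINS:
        try:
            resp = requests.get(f"{origin}/{path}", timeout=timeout)
            if resp.status_code == 200:
                return resp.content
            last_error = RuntimeError(f"{origin} returned {resp.status_code}")
        except requests.RequestException as exc:   # origin unreachable
            last_error = exc
    raise last_error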

Public Segment

  • Resolve URLs (45 million of them), dispatch to the appropriate renderer, and then render into HTML, sitemap XML, or robots TXT, etc.

  • Public SLA is that response time is < 100ms at peak traffic. Websites have to be available, but also fast. Remember, no caching.

  • When a user clicks publish after editing a page, the manifest, which contains references to pages, is pushed to Public. The routing table is also published.

  • Minimize out-of-process hops. Requires 1 database call to resolve the route. 1 RPC call to dispatch the request to the renderer. 1 database call to get the site manifest.

  • Lookup tables are cached in memory and are updated every 5 minutes.

  • Data is not stored in the same format as it is for the editor. It is stored in a denormalized format, optimized for read by primary key. Everything that is needed is returned in a single request.

  • Minimize business logic. The data is denormalized and precalculated. When you handle large scale, every operation and every millisecond you add is multiplied by 45 million, so every operation that happens on the public server has to be justified.

  • Page rendering.

    • The html returned by the public server is bootstrap html. It’s a shell with JavaScript imports and JSON data with references to site manifest and dynamic data.

    • Rendering is offloaded to the client. Laptops and mobile devices are very fast and can handle the rendering.

    • JSON was chosen because it’s easy to parse and compressible.

    • It’s easier to fix bugs on the client. Just redeploy new client code. When rendering is done on the server the html will be cached, so fixing a bug requires re-rendering millions of websites again.

High Availability For Public Segment

  • Goal is to be always available, but stuff happens.

  • On a good day: a browser makes a request, the request goes to a datacenter, through a load balancer, goes to a public server, resolves the route, goes to the renderer, the html goes back to the browser, and the browser runs the javascript. The javascript fetches all media files and the JSON data and renders a very beautiful web site. The browser then makes a request to the Archive service. The Archive service replays the request in the same way the browser does and stores the data in a cache.

  • On a bad day a datacenter is lost, which did happen. All the UPSs died and the datacenter was down. The DNS was changed and then all the requests went to the secondary datacenter.

  • On a bad day Public is lost. This happened once when a load balancer got half of a configuration so all the Public servers were gone. Or a bad version can be deployed that starts returning errors. Custom code in the load balancer handles this problem by routing to the Archive service to fetch the cached version if the Public servers are not available. This approach meant customers were not affected when Public went down, even though the system was reverberating with alarms at the time.

  • On a bad day the Internet sucks. The browser makes a request, goes to the datacenter, goes to the load balancer, gets the html back. Now the JavaScript code has to fetch all the pages and JSON data. It goes to the CDN, it goes to the static grid and fetches all the JSON files to render the site. In this process, Internet problems can prevent files from being returned. Code in the JavaScript says: if you can’t get to the primary location, try to get it from the Archive service; if that fails, try the editor database.

Lessons Learned

  • Identify your critical path and concerns. Think through how your product works. Develop usage scenarios. Focus your efforts on these as they give the biggest bang for the buck.

  • Go multi-datacenter and multi-cloud. Build redundancy on the critical path (for availability).

  • Denormalize data and minimize out-of-process hops (for performance). Precalculate and do everything possible to minimize network chatter.

  • Take advantage of the client’s CPU power. It saves on your server count and it’s also easier to fix bugs in the client.

  • Start small, get it done, then figure out where to go next. Wix did what they needed to do to get their product working. Then they methodically moved to a sophisticated services architecture.

  • The long tail requires a different approach. Rather than cache everything, Wix chose to optimize the heck out of the render path and keep data in both active and archive databases.

  • Go immutable. Immutability has far reaching consequences for an architecture. It affects everything from the client through the back-end. It’s an elegant solution to a lot of problems.

  • Vendor lock-in is a myth. It’s all APIs. Just change the implementation and you can move to different clouds in weeks.

  • What really locks you down is data. Moving lots of data to a different cloud is really hard.

( Via HighScalability.com )

For Understanding Microservices

“Microservices” – yet another new term on the crowded streets of software architecture. Although our natural inclination is to pass such things by with a contemptuous glance, this bit of terminology describes a style of software systems that we are finding more and more appealing. We’ve seen many projects use this style in the last few years, and results so far have been positive, so much so that for many of our colleagues this is becoming the default style for building enterprise applications. Sadly, however, there’s not much information that outlines what the microservice style is and how to do it.

In short, the microservice architectural style [1] is an approach to developing a single application as a suite of small services, each running in its own process and communicating with lightweight mechanisms, often an HTTP resource API. These services are built around business capabilities and independently deployable by fully automated deployment machinery. There is a bare minimum of centralized management of these services, which may be written in different programming languages and use different data storage technologies.

To start explaining the microservice style it’s useful to compare it to the monolithic style: a monolithic application built as a single unit. Enterprise Applications are often built in three main parts: a client-side user interface (consisting of HTML pages and javascript running in a browser on the user’s machine), a database (consisting of many tables inserted into a common, and usually relational, database management system), and a server-side application. The server-side application will handle HTTP requests, execute domain logic, retrieve and update data from the database, and select and populate HTML views to be sent to the browser. This server-side application is a monolith – a single logical executable [2]. Any changes to the system involve building and deploying a new version of the server-side application.

Such a monolithic server is a natural way to approach building such a system. All your logic for handling a request runs in a single process, allowing you to use the basic features of your language to divide up the application into classes, functions, and namespaces. With some care, you can run and test the application on a developer’s laptop, and use a deployment pipeline to ensure that changes are properly tested and deployed into production. You can horizontally scale the monolith by running many instances behind a load-balancer.

Monolithic applications can be successful, but increasingly people are feeling frustrations with them – especially as more applications are being deployed to the cloud. Change cycles are tied together – a change made to a small part of the application requires the entire monolith to be rebuilt and deployed. Over time it’s often hard to keep a good modular structure, making it harder to keep changes that ought to only affect one module within that module. Scaling requires scaling the entire application rather than just the parts of it that require greater resources.


Figure 1: Monoliths and Microservices

These frustrations have led to the microservice architectural style: building applications as suites of services. As well as the fact that services are independently deployable and scalable, each service also provides a firm module boundary, even allowing for different services to be written in different programming languages. They can also be managed by different teams.

We do not claim that the microservice style is novel or innovative, its roots go back at least to the design principles of Unix. But we do think that not enough people consider a microservice architecture and that many software developments would be better off if they used it.


Characteristics of a Microservice Architecture

We cannot say there is a formal definition of the microservices architectural style, but we can attempt to describe what we see as common characteristics for architectures that fit the label. As with any definition that outlines common characteristics, not all microservice architectures have all the characteristics, but we do expect that most microservice architectures exhibit most characteristics. While we authors have been active members of this rather loose community, our intention is to attempt a description of what we see in our own work and in similar efforts by teams we know of. In particular we are not laying down some definition to conform to.

Componentization via Services

For as long as we’ve been involved in the software industry, there’s been a desire to build systems by plugging together components, much in the way we see things are made in the physical world. During the last couple of decades we’ve seen considerable progress with large compendiums of common libraries that are part of most language platforms.

When talking about components we run into the difficult definition of what makes a component. Our definition is that a component is a unit of software that is independently replaceable and upgradeable.

Microservice architectures will use libraries, but their primary way of componentizing their own software is by breaking down into services. We define libraries as components that are linked into a program and called using in-memory function calls, while services are out-of-process components that communicate via a mechanism such as a web service request or remote procedure call. (This is a different concept to that of a service object in many OO programs [3].)

One main reason for using services as components (rather than libraries) is that services are independently deployable. If you have an application [4] that consists of multiple libraries in a single process, a change to any single component results in having to redeploy the entire application. But if that application is decomposed into multiple services, you can expect many single-service changes to only require that service to be redeployed. That’s not an absolute: some changes will alter service interfaces, resulting in some coordination, but the aim of a good microservice architecture is to minimize these through cohesive service boundaries and evolution mechanisms in the service contracts.

Another consequence of using services as components is a more explicit component interface. Most languages do not have a good mechanism for defining an explicit Published Interface. Often it’s only documentation and discipline that prevents clients breaking a component’s encapsulation, leading to overly-tight coupling between components. Services make it easier to avoid this by using explicit remote call mechanisms.

Using services like this does have downsides. Remote calls are more expensive than in-process calls, and thus remote APIs need to be coarser-grained, which is often more awkward to use. If you need to change the allocation of responsibilities between components, such movements of behavior are harder to do when you’re crossing process boundaries.

At a first approximation, we can observe that services map to runtime processes, but that is only a first approximation. A service may consist of multiple processes that will always be developed and deployed together, such as an application process and a database that’s only used by that service.

Organized around Business Capabilities

When looking to split a large application into parts, often management focuses on the technology layer, leading to UI teams, server-side logic teams, and database teams. When teams are separated along these lines, even simple changes can lead to a cross-team project taking time and budgetary approval. A smart team will optimise around this and plump for the lesser of two evils – just force the logic into whichever application they have access to. Logic everywhere, in other words. This is an example of Conway’s Law [5] in action.

Any organization that designs a system (defined broadly) will produce a design whose structure is a copy of the organization’s communication structure.

– Melvyn Conway, 1967


Figure 2: Conway’s Law in action

The microservice approach to division is different, splitting up into services organized around business capability. Such services take a broad-stack implementation of software for that business area, including user-interface, persistent storage, and any external collaborations. Consequently the teams are cross-functional, including the full range of skills required for the development: user-experience, database, and project management.


Figure 3: Service boundaries reinforced by team boundaries

One company organised in this way is www.comparethemarket.com. Cross-functional teams are responsible for building and operating each product, and each product is split out into a number of individual services communicating via a message bus.

Large monolithic applications can always be modularized around business capabilities too, although that’s not the common case. Certainly we would urge a large team building a monolithic application to divide itself along business lines. The main issue we have seen here is that they tend to be organised around too many contexts. If the monolith spans many of these modular boundaries it can be difficult for individual members of a team to fit them into their short-term memory. Additionally we see that the modular lines require a great deal of discipline to enforce. The necessarily more explicit separation required by service components makes it easier to keep the team boundaries clear.

Products not Projects

Most application development efforts that we see use a project model: where the aim is to deliver some piece of software which is then considered to be completed. On completion the software is handed over to a maintenance organization and the project team that built it is disbanded.

Microservice proponents tend to avoid this model, preferring instead the notion that a team should own a product over its full lifetime. A common inspiration for this is Amazon’s notion of “you build, you run it” where a development team takes full responsibility for the software in production. This brings developers into day-to-day contact with how their software behaves in production and increases contact with their users, as they have to take on at least some of the support burden.

The product mentality ties in with the linkage to business capabilities. Rather than looking at the software as a set of functionality to be completed, there is an on-going relationship where the question is how the software can assist its users to enhance the business capability.

There’s no reason why this same approach can’t be taken with monolithic applications, but the smaller granularity of services can make it easier to create the personal relationships between service developers and their users.

Smart endpoints and dumb pipes

When building communication structures between different processes, we’ve seen many products and approaches that stress putting significant smarts into the communication mechanism itself. A good example of this is the Enterprise Service Bus (ESB), where ESB products often include sophisticated facilities for message routing, choreography, transformation, and applying business rules.

The microservice community favours an alternative approach: smart endpoints and dumb pipes. Applications built from microservices aim to be as decoupled and as cohesive as possible – they own their own domain logic and act more as filters in the classical Unix sense – receiving a request, applying logic as appropriate and producing a response. These are choreographed using simple RESTish protocols rather than complex protocols such as WS-Choreography or BPEL or orchestration by a central tool.

The two protocols used most commonly are HTTP request-response with resource APIs and lightweight messaging[6]. The best expression of the first is

Be of the web, not behind the web

– Ian Robinson

Microservice teams use the principles and protocols that the world wide web (and to a large extent, Unix) is built on. Often-used resources can be cached with very little effort on the part of developers or operations folk.

The second approach in common use is messaging over a lightweight message bus. The infrastructure chosen is typically dumb (dumb in that it acts as a message router only) – simple implementations such as RabbitMQ or ZeroMQ don’t do much more than provide a reliable asynchronous fabric – the smarts still live in the endpoints that are producing and consuming messages; in the services.
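
To make the “dumb pipe” idea concrete, here is a minimal C++ sketch using the plain libzmq C API (the article names ZeroMQ as one option but prescribes no particular API, and the service names here are invented): the transport merely moves bytes between two endpoints, and all of the domain logic sits in the consuming service.

    // Sketch of a "dumb pipe" between two services, using the raw libzmq C API.
    // The pipe only routes bytes; the smarts stay in the endpoints.
    #include <zmq.h>
    #include <cstdio>
    #include <string>

    int main() {
        void* ctx = zmq_ctx_new();

        // Producer endpoint: an (invented) order service pushes an event onto the pipe.
        void* push = zmq_socket(ctx, ZMQ_PUSH);
        zmq_bind(push, "inproc://orders");   // in-process transport keeps the sketch self-contained

        // Consumer endpoint: an (invented) billing service pulls events and applies its own logic.
        void* pull = zmq_socket(ctx, ZMQ_PULL);
        zmq_connect(pull, "inproc://orders");

        const std::string event = "order-created:42";
        zmq_send(push, event.data(), event.size(), 0);

        char buf[256];
        int n = zmq_recv(pull, buf, sizeof(buf) - 1, 0);
        if (n >= 0) {
            buf[n] = '\0';
            std::printf("billing service handling event: %s\n", buf);  // the logic lives here, not in the pipe
        }

        zmq_close(pull);
        zmq_close(push);
        zmq_ctx_destroy(ctx);
        return 0;
    }

Swapping the in-process transport for TCP, or the raw API for a broker such as RabbitMQ, changes the plumbing but not the principle: the endpoints own the behaviour.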

In a monolith, the components are executing in-process and communication between them is via either method invocation or function call. The biggest issue in changing a monolith into microservices lies in changing the communication pattern. A naive conversion from in-memory method calls to RPC leads to chatty communications which don’t perform well. Instead you need to replace the fine-grained communication with a coarser-grained approach.
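
As a small, purely illustrative C++ sketch (all names invented), the difference looks like this: the chatty interface turns one screen render into several remote round trips, while the coarse-grained one exchanges a whole document in a single call.

    #include <iostream>
    #include <string>

    // Fine-grained interface: natural for in-process calls, but once the customer
    // component moves to its own service, each getter becomes a network round trip.
    class ChattyCustomerClient {
    public:
        std::string getName(int)            { return "Ada";              }  // 1st round trip
        std::string getEmail(int)           { return "ada@example.com";  }  // 2nd
        std::string getShippingAddress(int) { return "1 Analytical Way"; }  // 3rd
    };

    // Coarse-grained interface: one request returns everything the caller needs.
    struct CustomerSummary {
        std::string name, email, shippingAddress;
    };

    class CoarseCustomerClient {
    public:
        CustomerSummary getCustomerSummary(int) {
            return {"Ada", "ada@example.com", "1 Analytical Way"};  // a single round trip
        }
    };

    int main() {
        CoarseCustomerClient client;
        CustomerSummary s = client.getCustomerSummary(42);
        std::cout << s.name << " <" << s.email << ">\n";
    }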

Decentralized Governance

One of the consequences of centralised governance is the tendency to standardise on single technology platforms. Experience shows that this approach is constricting – not every problem is a nail and not every solution a hammer. We prefer using the right tool for the job and while monolithic applications can take advantage of different languages to a certain extent, it isn’t that common.

Splitting the monolith’s components out into services, we have a choice when building each of them. You want to use Node.js to stand up a simple reports page? Go for it. C++ for a particularly gnarly near-real-time component? Fine. You want to swap in a different flavour of database that better suits the read behaviour of one component? We have the technology to rebuild him.

Of course, just because you can do something doesn’t mean you should – but partitioning your system in this way means you have the option.

Teams building microservices prefer a different approach to standards too. Rather than use a set of defined standards written down somewhere on paper, they prefer the idea of producing useful tools that other developers can use to solve similar problems to the ones they are facing. These tools are usually harvested from implementations and shared with a wider group, sometimes, but not exclusively, using an internal open source model. Now that Git and GitHub have become the de facto version control system of choice, open source practices are becoming more and more common in-house.

Netflix is a good example of an organisation that follows this philosophy. Sharing useful and, above all, battle-tested code as libraries encourages other developers to solve similar problems in similar ways yet leaves the door open to picking a different approach if required. Shared libraries tend to be focused on common problems of data storage, inter-process communication and as we discuss further below, infrastructure automation.

For the microservice community, the overhead of centrally managed standards is particularly unattractive. That isn’t to say that the community doesn’t value service contracts – quite the opposite, since there tend to be many more of them. It’s just that they are looking at different ways of managing those contracts. Patterns like Tolerant Reader and Consumer-Driven Contracts are often applied to microservices. These aid service contracts in evolving independently. Executing consumer-driven contracts as part of your build increases confidence and provides fast feedback on whether your services are functioning. Indeed we know of a team in Australia who drive the build of new services with consumer-driven contracts. They use simple tools that allow them to define the contract for a service. This becomes part of the automated build before code for the new service is even written. The service is then built out only to the point where it satisfies the contract – an elegant approach to avoid the ‘YAGNI’[9] dilemma when building new software. These techniques, and the tooling growing up around them, limit the need for central contract management by decreasing the temporal coupling between services.
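
The article does not name the Australian team’s tools, so the following is only a library-free C++ sketch of the idea: the consumer declares the handful of fields it depends on, and a build step fails when the provider’s response no longer satisfies that contract.

    // Sketch of a consumer-driven contract check, with no real contract-testing
    // library: the consumer lists the JSON fields it relies on, and the build
    // fails if the provider's (here hard-coded) response stops including them.
    #include <iostream>
    #include <string>
    #include <vector>

    bool satisfiesContract(const std::string& providerResponse,
                           const std::vector<std::string>& requiredFields) {
        for (const auto& field : requiredFields) {
            // Crude containment check stands in for real JSON parsing.
            if (providerResponse.find("\"" + field + "\"") == std::string::npos) {
                std::cerr << "contract violated: missing field '" << field << "'\n";
                return false;
            }
        }
        return true;
    }

    int main() {
        // In a real pipeline this payload would come from the provider's test instance.
        const std::string response = R"({"id": 7, "name": "Ada", "email": "ada@example.com"})";

        // The consumer's expectations: only the fields it actually uses.
        const std::vector<std::string> contract = {"id", "name"};

        return satisfiesContract(response, contract) ? 0 : 1;  // non-zero fails the build
    }

Run against a real provider endpoint in the pipeline, a check like this gives the fast feedback described above without any central registry of contracts.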

Perhaps the apogee of decentralised governance is the build it / run it ethos popularised by Amazon. Teams are responsible for all aspects of the software they build including operating the software 24/7. Devolution of this level of responsibility is definitely not the norm but we do see more and more companies pushing responsibility to the development teams. Netflix is another organisation that has adopted this ethos[11]. Being woken up at 3am every night by your pager is certainly a powerful incentive to focus on quality when writing your code. These ideas are about as far away from the traditional centralized governance model as it is possible to be.

Decentralized Data Management

Decentralization of data management presents in a number of different ways. At the most abstract level, it means that the conceptual model of the world will differ between systems. This is a common issue when integrating across a large enterprise: the sales view of a customer will differ from the support view. Some things that are called customers in the sales view may not appear at all in the support view. Those that do may have different attributes and (worse) common attributes with subtly different semantics.

This issue is common between applications, but can also occur within applications, particularly when that application is divided into separate components. A useful way of thinking about this is the Domain-Driven Design notion of Bounded Context. DDD divides a complex domain up into multiple bounded contexts and maps out the relationships between them. This process is useful for both monolithic and microservice architectures, but there is a natural correlation between service and context boundaries that helps clarify, and as we describe in the section on business capabilities, reinforce the separations.

As well as decentralizing decisions about conceptual models, microservices also decentralize data storage decisions. While monolithic applications prefer a single logical database for persistent data, enterprises often prefer a single database across a range of applications – many of these decisions driven by vendors’ commercial models around licensing. Microservices prefer letting each service manage its own database, either different instances of the same database technology or entirely different database systems – an approach called Polyglot Persistence. You can use polyglot persistence in a monolith, but it appears more frequently with microservices.

Figure 4: Decentralised data

Decentralizing responsibility for data across microservices has implications for managing updates. The common approach to dealing with updates has been to use transactions to guarantee consistency when updating multiple resources. This approach is often used within monoliths.

Using transactions like this helps with consistency, but imposes significant temporal coupling, which is problematic across multiple services. Distributed transactions are notoriously difficult to implement, and as a consequence microservice architectures emphasize transactionless coordination between services, with explicit recognition that consistency may only be eventual consistency and that problems are dealt with by compensating operations.
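
A minimal sketch of that style of coordination, with invented services and the remote calls simulated locally: if the second step fails, the first is undone by a compensating operation rather than wrapped in a distributed transaction.

    // Sketch: coordinating two services without a distributed transaction.
    // The service calls are simulated; in reality they would be remote requests.
    #include <iostream>

    bool reserveStock(int orderId)  { std::cout << "stock reserved for " << orderId << "\n"; return true; }
    bool chargePayment(int orderId) { std::cout << "payment failed for " << orderId << "\n"; return false; }
    void releaseStock(int orderId)  { std::cout << "compensating: stock released for " << orderId << "\n"; }

    bool placeOrder(int orderId) {
        if (!reserveStock(orderId)) return false;

        if (!chargePayment(orderId)) {
            // No rollback from a transaction manager: we issue a compensating
            // operation and accept that, briefly, the system was inconsistent.
            releaseStock(orderId);
            return false;
        }
        return true;
    }

    int main() {
        std::cout << (placeOrder(42) ? "order placed\n" : "order rejected\n");
    }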

Choosing to manage inconsistencies in this way is a new challenge for many development teams, but it is one that often matches business practice. Often businesses handle a degree of inconsistency in order to respond quickly to demand, while having some kind of reversal process to deal with mistakes. The trade-off is worth it as long as the cost of fixing mistakes is less than the cost of lost business under greater consistency.

Infrastructure Automation

Infrastructure automation techniques have evolved enormously over the last few years – the evolution of the cloud and AWS in particular has reduced the operational complexity of building, deploying and operating microservices.

Many of the products or systems being built with microservices are being built by teams with extensive experience of Continuous Delivery and its precursor, Continuous Integration. Teams building software this way make extensive use of infrastructure automation techniques. This is illustrated in the build pipeline shown below.

Figure 5: Basic build pipeline

Since this isn’t an article on Continuous Delivery, we will call attention to just a couple of key features here. We want as much confidence as possible that our software is working, so we run lots of automated tests. Promotion of working software ‘up’ the pipeline means we automate deployment to each new environment.

A monolithic application will be built, tested and pushed through these environments quite happily. It turns out that once you have invested in automating the path to production for a monolith, then deploying more applications doesn’t seem so scary any more. Remember, one of the aims of CD is to make deployment boring, so whether it’s one application or three, as long as deployment is still boring it doesn’t matter[12].

Another area where we see teams using extensive infrastructure automation is when managing microservices in production. In contrast to our assertion above that as long as deployment is boring there isn’t that much difference between monoliths and microservices, the operational landscape for each can be strikingly different.

Figure 6: Module deployment often differs

Design for failure

A consequence of using services as components is that applications need to be designed so that they can tolerate the failure of services. Any service call could fail due to unavailability of the supplier, so the client has to respond to this as gracefully as possible. This is a disadvantage compared to a monolithic design, as it introduces additional complexity to handle. The consequence is that microservice teams constantly reflect on how service failures affect the user experience. Netflix’s Simian Army induces failures of services and even datacenters during the working day to test both the application’s resilience and monitoring.

This kind of automated testing in production would be enough to give most operation groups the kind of shivers usually preceding a week off work. This isn’t to say that monolithic architectural styles aren’t capable of sophisticated monitoring setups – it’s just less common in our experience.

Since services can fail at any time, it’s important to be able to detect the failures quickly and, if possible, automatically restore service. Microservice applications put a lot of emphasis on real-time monitoring of the application, checking both architectural elements (how many requests per second is the database getting) and business relevant metrics (such as how many orders per minute are received). Semantic monitoring can provide an early warning system of something going wrong that triggers development teams to follow up and investigate.

This is particularly important to a microservices architecture because the microservice preference towards choreography and event collaboration leads to emergent behavior. While many pundits praise the value of serendipitous emergence, the truth is that emergent behavior can sometimes be a bad thing. Monitoring is vital to spot bad emergent behavior quickly so it can be fixed.

Monoliths can be built to be as transparent as a microservice – in fact, they should be. The difference is that you absolutely need to know when services running in different processes are disconnected. With libraries within the same process this kind of transparency is less likely to be useful.

Microservice teams would expect to see sophisticated monitoring and logging setups for each individual service, such as dashboards showing up/down status and a variety of operational and business relevant metrics. Details on circuit breaker status, current throughput and latency are other examples we often encounter in the wild.
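
The circuit breaker mentioned above is a good example of the kind of state such dashboards surface. Here is a deliberately minimal C++ sketch of the pattern (not taken from any particular library; timeouts and half-open probing are omitted): after a run of failures the breaker opens and callers fail fast to a fallback instead of waiting on a dead service.

    #include <functional>
    #include <iostream>
    #include <stdexcept>
    #include <string>

    class CircuitBreaker {
    public:
        explicit CircuitBreaker(int failureThreshold) : threshold_(failureThreshold) {}

        std::string call(const std::function<std::string()>& remoteCall,
                         const std::string& fallback) {
            if (open_) return fallback;              // fail fast while the breaker is open
            try {
                std::string result = remoteCall();
                failures_ = 0;                       // a success resets the failure count
                return result;
            } catch (const std::exception&) {
                if (++failures_ >= threshold_) open_ = true;
                return fallback;
            }
        }

        bool isOpen() const { return open_; }        // the sort of state a dashboard would show

    private:
        int threshold_;
        int failures_ = 0;
        bool open_ = false;
    };

    int main() {
        CircuitBreaker breaker(3);
        auto flakyService = []() -> std::string {
            throw std::runtime_error("service unavailable");
        };

        for (int i = 0; i < 5; ++i) {
            std::cout << breaker.call(flakyService, "cached fallback") << "\n";
        }
        std::cout << "breaker open: " << std::boolalpha << breaker.isOpen() << "\n";
    }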

Evolutionary Design

Microservice practitioners usually have come from an evolutionary design background and see service decomposition as a further tool to enable application developers to control changes in their application without slowing down change. Change control doesn’t necessarily mean change reduction – with the right attitudes and tools you can make frequent, fast, and well-controlled changes to software.

Whenever you try to break a software system into components, you’re faced with the decision of how to divide up the pieces – what are the principles on which we decide to slice up our application? The key property of a component is the notion of independent replacement and upgradeability[13] – which implies we look for points where we can imagine rewriting a component without affecting its collaborators. Indeed many microservice groups take this further by explicitly expecting many services to be scrapped rather than evolved in the longer term.

The Guardian website is a good example of an application that was designed and built as a monolith, but has been evolving in a microservice direction. The monolith is still the core of the website, but they prefer to add new features by building microservices that use the monolith’s API. This approach is particularly handy for features that are inherently temporary, such as specialized pages to handle a sporting event. Such a part of the website can quickly be put together using rapid development languages, and removed once the event is over. We’ve seen similar approaches at a financial institution where new services are added for a market opportunity and discarded after a few months or even weeks.

This emphasis on replaceability is a special case of a more general principle of modular design, which is to drive modularity through the pattern of change [14]. You want to keep things that change at the same time in the same module. Parts of a system that change rarely should be in different services to those that are currently undergoing lots of churn. If you find yourself repeatedly changing two services together, that’s a sign that they should be merged.

Putting components into services adds an opportunity for more granular release planning. With a monolith any changes require a full build and deployment of the entire application. With microservices, however, you only need to redeploy the service(s) you modified. This can simplify and speed up the release process. The downside is that you have to worry about changes to one service breaking its consumers. The traditional integration approach is to try to deal with this problem using versioning, but the preference in the microservice world is to only use versioning as a last resort. We can avoid a lot of versioning by designing services to be as tolerant as possible to changes in their suppliers.
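
A Tolerant Reader in miniature (the payload and fields are invented for the example): the consumer extracts only the fields it actually uses and ignores anything else, so the supplier can add fields without forcing a new version.

    // Tolerant Reader sketch: pull out only the fields this consumer cares about
    // and ignore everything else, so new fields from the supplier don't break us.
    #include <iostream>
    #include <string>

    // Naive extraction of a string value for `key` from a flat JSON object;
    // a real service would use a proper JSON parser but keep the same attitude.
    std::string extractField(const std::string& json, const std::string& key) {
        const std::string needle = "\"" + key + "\":\"";
        auto start = json.find(needle);
        if (start == std::string::npos) return "";
        start += needle.size();
        auto end = json.find('"', start);
        return json.substr(start, end - start);
    }

    int main() {
        // The supplier has added "loyaltyTier" since we wrote this consumer; we don't care.
        const std::string payload =
            R"({"name":"Ada","email":"ada@example.com","loyaltyTier":"gold"})";

        // We only ever read the two fields we depend on; the rest is ignored.
        std::cout << extractField(payload, "name")  << "\n"
                  << extractField(payload, "email") << "\n";
    }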


Are Microservices the Future?

Our main aim in writing this article is to explain the major ideas and principles of microservices. By taking the time to do this we clearly think that the microservices architectural style is an important idea – one worth serious consideration for enterprise applications. We have recently built several systems using the style and know of others who have used and favor this approach.

Those we know about who are in some way pioneering the architectural style include Amazon, Netflix, The Guardian, the UK Government Digital Service, realestate.com.au, Forward and comparethemarket.com. The conference circuit in 2013 was full of examples of companies that are moving to something that would class as microservices – including Travis CI. In addition there are plenty of organizations that have long been doing what we would class as microservices, but without ever using the name. (Often this is labelled as SOA – although, as we’ve said, SOA comes in many contradictory forms. [15])

Despite these positive experiences, however, we aren’t arguing that we are certain that microservices are the future direction for software architectures. While our experiences so far are positive compared to monolithic applications, we’re conscious of the fact that not enough time has passed for us to make a full judgement.

Often the true consequences of your architectural decisions are only evident several years after you made them. We have seen projects where a good team, with a strong desire for modularity, has built a monolithic architecture that has decayed over the years. Many people believe that such decay is less likely with microservices, since the service boundaries are explicit and hard to patch around. Yet until we see enough systems with enough age, we can’t truly assess how microservice architectures mature.

There are certainly reasons why one might expect microservices to mature poorly. In any effort at componentization, success depends on how well the software fits into components. It’s hard to figure out exactly where the component boundaries should lie. Evolutionary design recognizes the difficulties of getting boundaries right and thus the importance of it being easy to refactor them. But when your components are services with remote communications, then refactoring is much harder than with in-process libraries. Moving code across service boundaries is difficult, any interface changes need to be coordinated between participants, layers of backwards compatibility need to be added, and testing is made more complicated.

Another issue is that if the components do not compose cleanly, then all you are doing is shifting complexity from inside a component to the connections between components. Not only does this move complexity around, it moves it to a place that’s less explicit and harder to control. It’s easy to think things are better when you are looking at the inside of a small, simple component, while missing messy connections between services.

Finally, there is the factor of team skill. New techniques tend to be adopted by more skillful teams. But a technique that is more effective for a more skillful team isn’t necessarily going to work for less skillful teams. We’ve seen plenty of cases of less skillful teams building messy monolithic architectures, but it takes time to see what happens when this kind of mess occurs with microservices. A poor team will always create a poor system – it’s very hard to tell if microservices reduce the mess in this case or make it worse.

One reasonable argument we’ve heard is that you shouldn’t start with a microservices architecture. Instead begin with a monolith, keep it modular, and split it into microservices once the monolith becomes a problem. (Although this advice isn’t ideal, since a good in-process interface is usually not a good service interface.)

So we write this with cautious optimism. So far, we’ve seen enough about the microservice style to feel that it can be a worthwhile road to tread. We can’t say for sure where we’ll end up, but one of the challenges of software development is that you can only make decisions based on the imperfect information that you currently have to hand.

(Via martinfowler.com)

Introducing Proxygen, Facebook’s C++ HTTP framework

We are excited to announce the release of Proxygen, a collection of C++ HTTP libraries, including an easy-to-use HTTP server. In addition to HTTP/1.1, Proxygen (rhymes with “oxygen”) supports SPDY/3 and SPDY/3.1. We are also iterating and developing support for HTTP/2.

Proxygen is not designed to replace Apache or nginx — those projects focus on building extremely flexible HTTP servers written in C that offer good performance but almost overwhelming amounts of configurability. Instead, we focused on building a high performance C++ HTTP framework with sensible defaults that includes both server and client code and that’s easy to integrate into existing applications. We want to help more people build and deploy high performance C++ HTTP services, and we believe that Proxygen is a great framework to do so. You can read the documentation and send pull requests on our GitHub site.

Background

Proxygen began as a project to write a customizable, high-performance HTTP(S) reverse-proxy load balancer nearly four years ago. We initially planned for Proxygen to be a software library for generating proxies, hence the name. But Proxygen has evolved considerably since the early days of the project. While there were a variety of software stacks that provided similar functionality to Proxygen at the time (Apache, nginx, HAProxy, Varnish, etc.), we opted to go in a different direction.

Why build our own HTTP stack?

  • Integration. The ability to deeply integrate with existing Facebook infrastructure and tools was a driving factor. For instance, being able to administer our HTTP infrastructure with tools such as Thrift simplifies integration with existing systems. Being able to easily track and measure Proxygen with systems like ODS, our internal monitoring tool, makes rapid data correlation possible and reduces operational overhead. Building an in-house HTTP stack also meant that we could leverage improvements made to these Facebook technologies.
  • Code reuse. We wanted a platform for building and delivering event-driven networking libraries to other projects. Currently, more than a dozen internal systems are built on top of the Proxygen core code, including parts of systems like Haystack, HHVM, our HTTP load balancers, and some of our mobile infrastructure. Proxygen provides a platform where we can develop support for a protocol like HTTP/2 and have it available anywhere that leverages Proxygen’s core code.
  • Scale. We tried to make some of the existing software stacks work, and some of them did for a long time. Operating a huge HTTP infrastructure built on top of lots of workarounds eventually stopped scaling well for us.
  • Features. There were a number of features that didn’t exist in other software (some still don’t) that seemed quite difficult to integrate in existing projects but would be immediately useful for Facebook. Examples of features that we were interested in included SPDY, WebSockets, HTTP/1.1 (keep-alive), TLS false start, and Facebook-specific load-scheduling algorithms. Building an in-house HTTP stack gave us the flexibility to iterate quickly on these features.

Initially kicked off in 2011 by a few engineers who were passionate about seeing HTTP usage at Facebook evolve, Proxygen has been supported since then by a team of three to four people. Outside of the core development team, dozens of internal contributors have also added to the project. Since the project started, the major milestones have included:

  • 2011 – Proxygen development began and started taking production traffic
  • 2012 – Added SPDY/2 support and started initial public testing
  • 2013 – SPDY/3 development/testing/rollout, SPDY/3.1 development started
  • 2014 – Completed SPDY/3.1 rollout, began HTTP/2 development

There are a number of other milestones that we are excited about, but we think the code tells a better story than we can.

We now have a few years of experience managing a large Proxygen deployment. The library has been battle-tested with many, many trillions of HTTP(S) and SPDY requests. Now we’ve reached a point where we are ready to share this code more widely.

Architecture

The central abstractions to understand in proxygen/lib are the session, codec, transaction, and handler. These are the lowest level abstractions, and we don’t generally recommend building off of these directly.

When bytes are read off the wire, the HTTPCodec stored inside HTTPSession parses these into higher level objects and associates with it a transaction identifier. The codec then calls into HTTPSession, which is responsible for maintaining the mapping between transaction identifier and HTTPTransaction objects. Each HTTP request/response pair has a separate HTTPTransaction object. Finally, HTTPTransaction forwards the call to a handler object which implements HTTPTransaction::Handler. The handler is responsible for implementing business logic for the request or response.

The handler then calls back into the transaction to generate egress (whether the egress is a request or response). The call flows from the transaction back to the session, which uses the codec to convert the higher level semantics of the particular call into the appropriate bytes to send on the wire.

The same handler and transaction interfaces are used to both create requests and handle responses. The API is generic enough to allow both. HTTPSession is specialized slightly differently depending on whether you are using the connection to issue or respond to HTTP requests.

Figure: Core Proxygen architecture

Moving into higher levels of abstraction, proxygen/httpserver has a simpler set of APIs and is the recommended way to interface with proxygen when acting as a server if you don’t need the full control of the lower level abstractions.

The basic components here are HTTPServer, RequestHandlerFactory, and RequestHandler. An HTTPServer takes some configuration and is given a RequestHandlerFactory. Once the server is started, the installed RequestHandlerFactory spawns a RequestHandler for each HTTP request. RequestHandler is a simple interface users of the library implement. Subclasses of RequestHandler should use the inherited protected member ResponseHandler* downstream_ to send the response.
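
To show the shape of those APIs, here is a trimmed-down handler and factory in the spirit of the echo server example mentioned below; the specific method and helper names (onRequest, onEOM, ResponseBuilder, and so on) are taken from the open source tree rather than from this post, so treat it as a sketch of the interface rather than a reference.

    // Sketch of a minimal Proxygen request handler, modelled on the echo server
    // example shipped with the library.
    #include <proxygen/httpserver/RequestHandler.h>
    #include <proxygen/httpserver/RequestHandlerFactory.h>
    #include <proxygen/httpserver/ResponseBuilder.h>
    #include <folly/io/IOBuf.h>
    #include <memory>

    class EchoHandler : public proxygen::RequestHandler {
     public:
      void onRequest(std::unique_ptr<proxygen::HTTPMessage>) noexcept override {}

      void onBody(std::unique_ptr<folly::IOBuf> body) noexcept override {
        // Accumulate the request body so we can echo it back.
        if (body_) {
          body_->prependChain(std::move(body));
        } else {
          body_ = std::move(body);
        }
      }

      void onEOM() noexcept override {
        // downstream_ is the inherited ResponseHandler* used to send the response.
        proxygen::ResponseBuilder(downstream_)
            .status(200, "OK")
            .body(std::move(body_))
            .sendWithEOM();
      }

      void onUpgrade(proxygen::UpgradeProtocol) noexcept override {}
      void requestComplete() noexcept override { delete this; }
      void onError(proxygen::ProxygenError) noexcept override { delete this; }

     private:
      std::unique_ptr<folly::IOBuf> body_;
    };

    // The factory spawns one handler per HTTP request, as described above.
    class EchoHandlerFactory : public proxygen::RequestHandlerFactory {
     public:
      void onServerStart(folly::EventBase*) noexcept override {}
      void onServerStop() noexcept override {}
      proxygen::RequestHandler* onRequest(proxygen::RequestHandler*,
                                          proxygen::HTTPMessage*) noexcept override {
        return new EchoHandler();
      }
    };

Wiring the factory into HTTPServer is then a matter of passing it in through the server’s options, as in the bundled sample.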

HTTP Server

The server framework we’ve included with the release is a great choice if you want to get set up quickly with a simple, fast-out-of-the-box event-driven server. With just a few options, you are ready to go. Check out the echo server example to get a sense of what it is like to work with. For fun, we benchmarked the echo server on a 32 logical core Intel(R) Xeon(R) CPU E5-2670 @ 2.60GHz with 16 GiB of RAM, varying the number of worker threads from one to eight. We ran the client on the same machine to eliminate network effects, and achieved the following performance numbers.

    # Load test client parameters:
    # Twice as many worker threads as server
    # 400 connections open simultaneously
    # 100 requests per connection
    # 60 second test
    # Results reported are the average of 10 runs for each test
    # simple GETs, 245 bytes of request headers, 600 byte response (in memory)
    # SPDY/3.1 allows 10 concurrent requests per connection

While the echo server is quite simple compared to a real web server, this benchmark is an indicator of the parsing efficiency of binary protocols like SPDY and HTTP/2. The HTTP server framework included is easy to work with and gives good performance, but focuses more on usability than the absolute best possible performance. For instance, the filter model in the HTTP server allows you to easily compose common behaviors defined in small chunks of code that are easily unit testable. On the other hand, the allocations associated with these filters may not be ideal for the highest-performance applications.

Impact

Proxygen has allowed us to rapidly build out new features, get them into production, and see the results very quickly. As an example, we were interested in evaluating the compression format HPACK; however, we didn’t have any HTTP/2 clients deployed yet and the HTTP/2 spec itself was still under heavy development. Proxygen allowed us to implement HPACK, use it on top of SPDY, and deploy it to mobile clients and our servers simultaneously. The ability to rapidly experiment with a real HPACK deployment enabled us to quickly understand performance and data efficiency characteristics in a production environment. As another example, the internal configuration systems at Facebook continue to evolve and get better. As these systems are built out, Proxygen is able to quickly take advantage of them as soon as they’re available. Proxygen has made a significant positive impact on our ability to rapidly iterate on our HTTP infrastructure.

Open Source

The Proxygen codebase is under active development and will continue to evolve. If you are passionate about HTTP, high performance networking code, and modern C++, we would be excited to work with you! Please send us pull requests on GitHub.

We’re committed to open source and are always looking for new opportunities to share our learnings and software. The traffic team has now open sourced Thrift as well as Proxygen, two important components of the network software infrastructure at Facebook. We hope that these software components will be building blocks for other systems that can be open sourced in the future.

(Via Facebook)

Google Launches Managed Service For Running Docker-Based Applications On Its Cloud Platform


Google today announced the alpha launch of Google Container Engine, a new managed service for building and running Docker container-based applications on its cloud platform.

Docker is probably the hottest technology in developer circles these days — it’s almost impossible to have a discussion with a developer without it coming up — and Google’s Cloud Platform team has decided to go all in on this technology that makes it easier for developers to run distributed applications.

In essence, this new service is a “Cluster-as-a-Service” platform based on Google’s open source Kubernetes project. Kubernetes, which helps developers manage their container clusters, is based on Google’s own work with containers in its massive data centers. In this new service, Kubernetes dynamically manages the different Docker containers that make up an application for the user.

Google says the combination of “fast booting, efficient VM hosts and seamless virtualized network integration” will make its cloud computing service “the best place to run container-based applications.” The company’s competitors would likely argue with that, but none of them offer a similar service at this point.

Google initially launched support for Docker images in May as part of its new Managed VM service. These managed VMs are coming out of Google’s limited alpha — with the addition of auto-scaling support — and with that, developers can now use Docker containers on Google’s platform without having to jump through any hoops. Managed VMs will remain in beta for now, though, which in Google’s new language means there is no access control and that charges may be waived, but there is no SLA or technical support obligation either.

And remember, because this new service is officially in alpha, it isn’t feature-complete and the whole infrastructure could melt down at any minute.

(Via TechCrunch.com)