Scaling Pinterest – From 0 To 10s Of Billions Of Page Views A Month In Two Years

6886606039_bc5c70dbf9_m

Pinterest has been riding an exponential growth curve, doubling every month and half. They’ve gone from 0 to 10s of billions of page views a month in two years, from 2 founders and one engineer to over 40 engineers, from one little MySQL server to 180 Web Engines, 240 API Engines, 88 MySQL DBs (cc2.8xlarge) + 1 slave each, 110 Redis Instances, and 200 Memcache Instances.

Stunning growth. So what’s Pinterest’s story? To tell their story we have our bards, Pinterest’s Yashwanth Nelapati and Marty Weiner, who tell the dramatic story of Pinterest’s architecture evolution in a talk titled Scaling Pinterest. This is the talk they would have liked to hear a year and half ago when they were scaling fast and there were a lot of options to choose from. And they made a lot of incorrect choices.

This is a great talk. It’s full of amazing details. It’s also very practical, down to earth, and it contains strategies adoptable by nearly anyone. Highly recommended.

Two of my favorite lessons from the talk:

  1. Architecture is doing the right thing when growth can be handled by adding more of the same stuff. You want to be able to scale by throwing money at a problem which means throwing more boxes at a problem as you need them. If your architecture can do that, then you’re golden.
  2. When you push something to the limit all technologies fail in their own special way. This lead them to evaluate tool choices with a preference for tools that are: mature; really good and simple; well known and liked; well supported; consistently good performers; failure free as possible; free. Using these criteria they selected: MySQL, Solr, Memcache, and Redis. Cassandra and Mongo were dropped.

These two lessons are interrelated. Tools following the principles in (2) can scale by adding more boxes. And as load increases mature products should have fewer problems. When you do hit problems you’ll at least have a community to help fix them.  It’s when your tools are too tricky and too finicky that you hit walls so high you can’t climb over.

It’s in what I think is the best part of the entire talk, the discussion of why sharding is better than clustering, that you see the themes of growing by adding resources, few failure modes, mature, simple, and good support, come into full fruition. Notice all the tools they chose grow by adding shards, not through clustering. The discussion of why they prefer sharding and how they shard is truly interesting and will probably cover ground you’ve never considered before.

Now, let’s see how Pinterest scales:

Basics

  • Pins are an image with a collection of other bits of information, a description of what makes it important to the user, and link back to where they found it.

  • Pinterest is a social network. You can follow people and boards.

  • Database: They have users who have boards and boards have pins; follow and repin relationships; authentication information.

Launched In March 2010 – The Age Of Finding Yourself

At this point you don’t even know what product you are going to build. You have ideas, so you are iterating and changing things quickly. So you end up with a lot of strange little MySQL queries you would never do in real life.

The numbers at this early date:

  • 2 founders

  • 1 engineer

  • Rackspace

  • 1 small web engine

  • 1 small MySQL DB

January 2011

Still in stealth mode and the product is evolving from user feedback. The numbers:

  • Amazon EC2 + S3 + CloudFront

  • 1 NGinX, 4 Web Engines (for redundancy, not really for load)

  • 1 MySQL DB + 1 Read Slave (in case master goes down)

  • 1 Task Queue + 2 Task Processors

  • 1 MongoDB (for counters)

  • 2 Engineers

Through Sept 2011 – The Age Of Experimentation

Went on a crazy run where they were doubling every month and half. Insane growth.

  • When you are growing that fast everything breaks every night and every week.

  • At this point you read a lot of white papers that say just ad a box and you’re done. So they start adding a lot of technology. They all break.

  • As a result you end up with a very complicated picture:

    • Amazon EC2 + S3 + CloudFront

    • 2NGinX, 16 Web Engines + 2 API Engines

    • 5 Functionally sharded MySQL DB + 9 read slaves

    • 4 Cassandra Nodes

    • 15 Membase Nodes (3 separate clusters)

    • 8 Memcache Nodes

    • 10 Redis Nodes

    • 3 Task Routers + 4 Task Processors

    • 4 Elastic Search Nodes

    • 3 Mongo Clusters

    • 3 Engineers

  • 5 major database technologies for just their data alone.

  • Growing so fast that MySQL was hot and all the other technologies were being pushed to the limits.

  • When you push something to the limit all these technologies fail in their own special way.

  • Started dropping technologies and asked themselves what they really wanted to be. Did a massive rearchitecture of everything.

January 2012 – The Age Of Maturity

  • After everything was rearchitected the system now looks like:

    • Amazon EC2 + S3 + Akamai, ELB

    • 90 Web Engines + 50 API Engines

    • 66 MySQL DBs (m1.xlarge) + 1 slave each

    • 59 Redis Instances

    • 51 Memcache Instances

    • 1 Redis Task Manager + 25 Task Processors

    • Sharded Solr

    • 6 Engineers

  • Now on sharded MySQL, Redis, Memcache, and Solr. That’s it. The advantage is it’s really simple and mature technologies

  • Web traffic keeps going up at the same velocity and iPhone traffic starts ramping up.

October 12 2012 – The Age Of Return

About 4x where they were in January.

  • The numbers now looks like:
    • Amazon EC2 + S3 + Edge Cast,Akamai, Level 3

    • 180 Web Engines + 240 API Engines

    • 88 MySQL DBs (cc2.8xlarge) + 1 slave each

    • 110 Redis Instances

    • 200 Memcache Instances

    • 4 Redis Task Manager + 80 Task Processors

    • Sharded Solr

    • 40 Engineers (and growing)

  • Notice that the architecture is doing the right thing. Growth is by adding more of the same stuff. You want to be able to scale by throwing money at the problem. You want to just be able to throw more boxes at the problem as you need them.

  • Now moving to SSDs

Why Amazon EC2/S3?

  • Pretty good reliability. Datacenters go down too. Multitenancy adds some risk, but it’s not bad.

  • Good reporting and support. They have really good architects and they know the problems

  • Nice peripherals, especially when you are growing. You can spin up in App Engine and talk to their managed cache, load balancing, map reduce, managed databases, and all the other parts that you don’t have to write yourself. You can bootstrap on Amazon’s services and then evolve them when you have the engineers.

  • New instances available in seconds. The power of the cloud. Especially with two engineers you don’t have to worry about capacity planning or wait two weeks to get your memcache. 10 memcaches can be added in a matter of minutes.

  • Cons: limited choice. Until recently you could get SSDs and you can’t get large RAM configurations.

  • Pros: limited choice. You don’t end up with a lot of boxes with different configurations.

Why MySQL?

  • Really mature.

  • Very solid. It’s never gone down for them and it’s never lost data.

  • You can hire for it. A lot of engineers know MySQL.

  • Response time to request rate increases linearly. Some technologies don’t respond as well when the request rate spikes.

  • Very good software support – XtraBackup, Innotop, Maatkit

  • Great community. Getting questions answered is easy.

  • Very good support from companies like Percona.

  • Free – important when you don’t have any funding initially.

Why Memcache?

  • Very mature.

  • Really simple. It’s a hashtable with a socket jammed in.

  • Consistently good performance

  • Well known and liked.

  • Consistently good performance.

  • Never crashes.

  • Free

Why Redis?

  • Not mature, but it’s really good and pretty simple.

  • Provides a variety of data structures.

  • Has persistence and replication, with selectability on how you want them implemented. If you want MySQL style persistence you can have it. If you want no persistence you can have it. Or if you want 3 hour persistence you can have it.

    • Your home feed is on Redis and is saved every 3 hours. There’s no 3 hour replication. They just backup every 3 hours.

    • If the box your data is stored on dies, then they just backup a few hours. It’s not perfectly reliable, but it is simpler. You don’t need complicated persistence and replication. It’s a lot simpler and cheaper architecture.

  • Well known and liked.

  • Consistently good performance.

  • Few failure modes. Has a few subtle failure modes that you need to learn about. This is where maturity comes in. Just learn them and get past it.

  • Free

Solr

  • A great get up and go type product. Install it and a few minutes later you are indexing.

  • Doesn’t scale past one box. (not so with latest release)

  • Tried Elastic Search, but at their scale it had trouble with lots of tiny documents and lots of queries.

  • Now using Websolr. But they have a search team and will roll their own.

Clustering Vs Sharding

  • During rapid growth they realized they were going to need to spread their data evenly to handle the ever increasing load.

  • Defined a spectrum to talk about the problem and they came up with a range of options between Clustering and Sharding.

Clustering – Everything Is Automatic:

  • Examples: Cassandra, MemBase, HBase

  • Verdict: too scary, maybe in the future, but for now it’s too complicated and has way too many failure modes.

  • Properties:

    • Data distributed automatically

    • Data can move

    • Rebalance to distribute capacity

    • Nodes communicate with each other. A lot of crosstalk, gossiping and negotiation.

  • Pros:

    • Automatically scale your datastore. That’s what the white papers say at least.

    • Easy to setup.

    • Spatially distribute and colocate your data. You can have datacenter in different regions and the database takes care of it.

    • High availability

    • Load balancing

    • No single point of failure

  • Cons (from first hand experience):

    • Still fairly young.

    • Fundamentally complicated. A whole bunch nodes have to symmetrical agreement, which is a hard problem to solve in production.

    • Less community support. There’s a split in the community along different product lines so there’s less support in each camp.

    • Fewer engineers with working knowledge. Probably most engineers have not used Cassandra.

    • Difficult and scary upgrade mechanisms. Could be related to they all use an API and talk amongst themselves so you don’t want them to be confused. This leads to complicated upgrade paths.

    • Cluster Management Algorithm is a SPOF. If there’s a bug it impacts every node. This took them down 4 times.

    • Cluster Managers are complex code replicated over all nodes that have the following failure modes:

      • Data rebalance breaks. Bring a new box and data starts replicating and then it gets stuck. What do you do? There aren’t tools to figure out what’s going on. There’s no community to help, so they were stuck. They reverted back to MySQL.

      • Data corruption across all nodes. What if there’s a bug that sprays badness into the write log across all of them and compaction or some other mechanism stops? Your read latencies increase. All your data is screwed and the data is gone.

      • Improper balancing that cannot be easily fixed. Very common. You have 10 nodes and you notice all the node is on one node. There’s a manual process, but it redistributes the load back to one node.

      • Data authority failure. Clustering schemes are very smart. In one case they bring in a new secondary. At about 80% the secondary says it’s primary and the primary goes to secondary and you’ve lost 20% of the data. Losing 20% of the data is worse than losing all of it because you don’t know what you’ve lost.

Sharding – Everything Is Manual:

  • Verdict: It’s the winner. Note, I think their sharding architecture has a lot in common with Flickr Architecture.

  • Properties:

    • Get rid of all the properties of clustering that you don’t like and you get sharding.

    • Data distributed manually

    • Data does not move. They don’t ever move data, though some people do, which puts them higher on the spectrum.

    • Split data to distribute load.

    • Nodes are not aware of each other. Some master node controls everything.

  • Pros:

    • Can split your database to add more capacity.

    • Spatially distribute and collocate your data

    • High availability

    • Load balancing

    • Algorithm for placing data is very simple. The main reason. Has a SPOF, but it’s half a page of code rather than a very complicated Cluster Manager. After the first day it will work or won’t work.

    • ID generation is simplistic.

  • Cons:

    • Can’t perform most joins.

    • Lost all transaction capabilities. A write to one database may fail when a write to another succeeds.

    • Many constraints must be moved to the application layer.

    • Schema changes require more planning.

    • Reports require running queries on all shards and then perform all the aggregation yourself.

    • Joins are performed in the application layer.

    • Your application must be tolerant to all these issues.

When To Shard?

  • If your project will have a few TBs of data then you should shard as soon as possible.

  • When the Pin table went to a billion rows the indexes ran out of memory and they were swapping to disk.

  • They picked the largest table and put it in its own database.

  • Then they ran out of space on the single database.

  • Then they had to shard.

Transition To Sharding

  • Started the transition process with a feature freeze.

  • Then they decided how they wanted to shard. Want to perform the least amount of queries and go to least number of databases to render a single page.

  • Removed all MySQL joins. Since the tables could be loaded into separate partitions joins would not work.

  • Added a ton of cache. Basically every query has to be cached.

  • The steps looked like:

    • 1 DB + Foreign Keys + Joins

    • 1 DB + Denormalized + Cache

    • 1 DB + Read Slaves + Cache

    • Several functionally sharded DBs + Read slaves + Cache

    • ID sharded DBs + Backup slaves + cache

  • Earlier read slaves became a problem because of slave lag. A read would go to the slave and the master hadn’t replicated the record yet, so it looked like a record was missing. Getting around that require cache.

  • They have background scripts that duplicate what the database used to do. Check for integrity constraints, references.

  • User table is unsharded. They just use a big database and have a uniqueness constraint on user name and email. Inserting a User will fail if it isn’t unique. Then they do a lot of writes in their sharded database.

How To Shard?

  • Looked at Cassandra’s ring model. Looked at Membase. And looked at Twitter’s Gizzard.

  • Determined: the least data you move across your nodes the more stable your architecture.

  • Cassandra has a data balancing and authority problems because the nodes weren’t sure of who owned which part of the data. They decided the application should decide where the data should go so there is never an issue.

  • Projected their growth out for the next five years and presharded with that capacity plan in mind.

  • Initially created a lot of virtual shards. 8 physical servers, each with 512 DBs. All the databases have all the tables.

  • For high availability they always run in multi-master replication mode. Each master is assigned to a different availability zone. On failure the switch to the other master and bring in a new replacement node.

  • When load increasing on a database:

    • Look at code commits to see if a new feature, caching issue, or other problem occurred.

    • If it’s just a load increase they split the database and tell the applications to find the data on a new host.

    • Before splitting the database they start slaves for those masters. Then they swap application code with the new database assignments. In the few minutes during the transition writes are still write to old nodes and be replicated to the new nodes. Then the pipe is cut between the nodes.

ID Structure

  • 64 bits:

    • shard ID: 16 bits

    • type : 10 bits – Pin, Board, User, or any other object type

    • local ID – rest of the bits for the ID within the table. Uses MySQL auto increment.

  • Twitter uses a mapping table to map IDs to a physical host. Which requires a lookup. Since Pinterest is on AWS and MySQL queries took about 3ms, they decided this extra level of indirection would not work. They build the location into the ID.

  • New users are randomly distributed across shards.

  • All data (pins, boards, etc) for a user is collocated on the same shard. Huge advantage. Rendering a user profile, for example, does not take multiple cross shard queries. It’s fast.

  • Every board is collocated with the user so boards can be rendered from one database.

  • Enough shards IDs for 65536 shards, but they only opened 4096 at first, they’ll expand horizontally. When the user database gets full they’ll open up more shards and allow new users to go to the new shards.

Lookups

  • If they have 50 lookups, for example, they split the IDs and run all the queries in parallel so the latency is the longest wait.

  • Every application has a configuration file that maps a shard range to a physical host.

    • “sharddb001a”: : (1, 512)

    • “sharddb001b”: : (513, 1024) – backup hot master

  • If you want to look up a User whose ID falls into sharddb003a:

    • Decompose the ID into its parts

    • Perform the lookup in the shard map

    • Connect to the shard, go to the database for the type, and use the local ID to find the right user and return the serialized data.

Objects And Mappings

  • All data is either an object (pin, board, user, comment) or a mapping (user has boards, pins has likes).

  • For objects a Local ID maps to a MySQL blob. The blob format started with JSON but is moving to serialized thrift.

  • For mappings there’s a mapping table.  You can ask for all the boards for a user. The IDs contain a timestamp so you can see the order of events.

    • There’s a reverse mapping, many to many table, to answer queries of the type give me all the users who like this pin.

    • Schema naming scheme is noun_verb_noun: user_likes_pins, pins_like_user.

  • Queries are primary key or index lookups (no joins).

  • Data doesn’t move across database as it does with clustering. Once a user lands on shard 20, for example, and all the user data is collocated, it will never move. The 64 bit ID has contains the shard ID so it can’t be moved. You can move the physical data to another database, but it’s still associated with the same shard.

  • All tables exist on all shards. No special shards, not counting the huge user table that is used to detect user name conflicts.

  • No schema changes required and a new index requires a new table.

    • Since the value for a key is a blob, you can add fields without destroying the schema. There’s versioning on each blob so applications will detect the version number and change the record to the new format and write it back. All the data doesn’t need to change at once, it will be upgraded on reads.

    • Huge win because altering a table takes a lock for hours or days.  If you want a new index you just create a new table and start populating it. When you don’t want it anymore just drop it. (no mention of how these updates are transaction safe).

Rendering A User Profile Page

  • Get the user name from the URL. Go to the single huge database to find the user ID.

  • Take the user ID and split it into its component parts.

  • Select the shard and go to that shard.

  • SELECT body from users WHERE id = <local_user_id>

  • SELECT board_id FROM user_has_boards WHERE user_id=<user_id>

  • SELECT body FROM boards WHERE id IN (<boards_ids>)

  • SELECT pin_id FROM board_has_pins WHERE board_id=<board_id>

  • SELECT body FROM pins WHERE id IN (pin_ids)

  • Most of the calls are served from cache (memcache or redis), so it’s not a lot of database queries in practice.

Scripting

  • When moving to a sharded architecture you have two different infrastructures, one old, the unsharded system, and one new, the sharded system. Scripting is all the code to transfer the old stuff to the new stuff.

  • They moved 500 million pins and 1.6 billion follower rows, etc

  • Underestimated this portion of the project. They thought it would take 2 months to put data in the shards, it actually took 4-5 months. And remember, this was during a feature freeze.

  • Applications must always write to both old and new infrastructures.

  • Once confident that all your data is in the new infrastructure then point your reads to the new infrastructure and slowly increase the load and test your backends.

  • Built a scripting farm. Spin up more workers to complete the task faster. They would do tasks like move these tables over to the new infrastructure.

  • Hacked up a copy of Pyres, a Python interface to Github’s Resque queue, a queue on built on top of redis. Supports priorities and retries. It was so good they replaced Celery and RabbitMQ with Pyres.

  • A lot of mistakes in the process. Users found errors like missing boards. Had to run the process several times to make sure no transitory errors prevented data from being moved correctly.

Development

  • Initially tried to give developers a slice of the system. Each having their own MySQL server, etc, but things changed so fast this didn’t work.

  • Went to Facebook’s model where everyone has access to everything. So you have to be really careful.

Future Directions

  • Service based architecture.

    • As they are starting to see a lot of database load they start to spawn a lot of app servers and a bunch of other servers. All these servers connect to MySQL and memcache. This means there are 30K connections on memcache which takes up a couple of gig of RAM and causes the memcache daemons to swap.

    • As a fix these are moving to a service architecture. There’s a follower service, for example, that will only answer follower queries. This isolates the number of machines to 30 that access the database and cache, thus reducing the number of connections.

    • Helps isolate functionality. Helps organize teams around around services and support for those services. Helps with security as developer can’t access other services.

Lessons Learned

  • It will fail. Keep it simple.

  • Keep it fun. There’s a lot of new people joining the team. If you just give them a huge complex system it won’t be fun. Keeping the architecture simple has been a big win. New engineers have been contributing code from week one.

  • When you push something to the limit all these technologies fail in their own special way.

  • Architecture is doing the right thing when growth can be handled by adding more of the same stuff. You want to be able to scale by throwing money at the problem by throwing more boxes at the problem as you need them. If your architecture can do that, then you’re golden.

  • Cluster Management Algorithm is a SPOF. If there’s a bug it impacts every node. This took them down 4 times.

  • To handle rapid growth you need to spread data out evenly to handle the ever increasing load.

  • The least data you move across your nodes the more stable your architecture. This is why they went with sharding over clusters.

  • A service oriented architecture rules. It isolates functionality, helps reduce connections, organize teams, organize support, and  improves security.

  • Asked yourself what your really want to be. Drop technologies that match that vision, even if you have to rearchitecture everything.

  • Don’t freak out about losing a little data. They keep user data in memory and write it out periodically. A loss means only a few hours of data are lost, but the resulting system is much simpler and more robust, and that’s what matters.

(via Highscalability.com)

NoSQL Benchmark Compares Aerospike, Cassandra, Couchbase and MongoDB

A recent set of benchmarks compares Aerospike, Cassandra, Couchbase and MongoDB to see how they fare when it comes to insert throughput, maximum throughput, latency and behavior during a failover.

Thumbtack Technology has released two benchmark whitepapers with results from the comparison of several key-value stores: Ultra-High Performance NoSQL Benchmarking: Analyzing Durability and Performance Tradeoffs (PDF) and NoSQL Failover Characteristics: Aerospike, Cassandra, Couchbase, MongoDB (PDF). Both benchmarks attempted to test “consumer-facing applications which require extremely high throughput and low latency, and whose information can be represented using a key-value schema.”

Thumbtack used an improved version of Yahoo! Cloud Serving Benchmark (YCSB) one that is supposed to overcome some limitations reached when using very high volumes and multiple clients. The YCSB changes have been documented in the first whitepaper and committed back to the community.

The NoSQL databases tested were AerospikeCassandraCouchbase (1.8 and 2.0), andMongoDB. The first is a commercial product, and the last is a document data store not a key-value store, but it was included because “in our experience clients often consider it for similar kinds of applications.” All databases were optimized using recommendations from the vendors backing them. The test systems used SSD storage rather than rotating disks. The whitepapers contain detailed information regarding the methodology used, the client and workload configuration, the hardware configuration, etc.

Thumbtack acknowledged having “strategic and/or commercial relationships with Aerospike, Couchbase, and 10gen” and the hardware used was rented from Aerospike.

We are including some of the results of the benchmarks.

Insert Throughput

The databases were loaded with initial working sets using YCSB’s loading routing which performs a number of inserts. Couchbase had good results when the working sets were loaded into memory, but it had problems loading on SSD when Couchbase 1.8 did not finish the operation and a smaller set and asynchronous mode had to be used for Couchbase 2.0. That explains the gradient blue used for it. Aerospike came second.

image

Maximum Throughput

This test used a “a strong durability model, using a dataset that, when replicated, would be significantly larger than the server’s RAM. This test is intended to model usage for transactional data that requires strong durability guarantees.”

Couchbase does not appear on the graphic because it could not complete the test using synchronous replication.

image

When asynchronous replication was applied, the result in memory was:

image

Latency/Throughput

The benchmarks also measured read and update latency at different levels of traffic. The next graphic includes a full view and a zoomed one for each of those:

image

Failover

Thumbtack attempted to see what happens when a node goes down, a hardware failure being simulated:

image

The downtime was also measured, i.e. the time needed for the cluster the become responsive again after a failure, all databases showing reasonable values:

image

The Thumbtack benchmarks contains more results for different cases but were not included here.

Another NoSQL benchmark was published in October 2012 when Cassandra, HBase, MongoDB, and Riak were compared. MySQL was also included in those tests for a reference against SQL technologies.Another NoSQL benchmark was published in October 2012 when Cassandra, HBase, MongoDB, and Riak were compared. MySQL was also included in those tests for a reference against SQL technologies.

(via infoq.com)

NGINX 1.4 News: SPDY and PageSpeed

NGINX 1.4 comes with experimental support for SPDY, WebSocket proxying, and gunzip support, while Google enters beta with PageSpeed for NGINX.

Since the release of NGINX 1.3 in May of 2012, the community behind this growing web serverhas worked on a number of new features, the most notable including:

  • SPDY experimental support. The module needs to be enabled with the --with-http_spdy_module configuration parameter. Server push is still not supported.
  • HTTP/1.1 connections can be turned into WebSocket ones using HTTP/1.1’s protocol switch feature.
  • gunzip support for decompressing gzipped data useful when the client cannot do it.

The complete list of enhancements, changes and bug fixes can be found in the Changes 1.4 log.

On the heels of NGINX 1.4 release, Google has announced PageSpeed Beta for NGINX, making available over 40 optimization filters for its users, such as image compression, JavaScript and CSS minifying, HTML rewriting, etc.

According to Google, the ngx_pagespeed module has been used in production by some customers including MaxCDN, a CDN provider, which reported a reduction of “1.57 seconds from our average page load, dropped our bounce rate by 1%, and our exit percentage by 2.5%”.ZippyKid, a WordPress hosting provider, saw “75% reduction in page sizes and a 50% improvement in page rendering speeds” after using PageSpeed for NGINX.

NGINX is currently #3 web server overall with 16.2% market share, and #2 with 31.9% of top 10,000 websites, according to W3Techs.

(via infoq.com)

Linux 3.9 Release

Linus Torvalds has announced the release of Linux Kernel 3.9:

So the last week was much quieter than the preceding ones, which makes me suspect that one reason -rc7 was bigger than I liked was that people were gaming the system and had timed some of their pull requests for just before the release, explaining why -rc7 was big enough that I didn’t  actually want to do a final release last week. Please don’t do that.

Anyway. Whatever the reason, this week has been very quiet, which makes me much more comfortable doing the final 3.9 release, so I guess the last -rc8 ended up working. Because not only aren’t there very many commits here, even the ones that made it really are tiny and not pretty obscure and not very interesting.

Also, this obviously means that the merge window is open. I won’t be merging anything today, but if you start sending me your pull requests (Konrad already sent in his Xen pull request for the 3.10 merge window a week ago), tomorrow the flood gates start opening..

Linux 3.8 brought file systems enhancement for Btrfs, XFS and ext-4, introduce F2FS file system, memory management improvement, and the removal for i386 support.

Linux 3.9 brings the following key changes (Sources H-Online and Phoronix):

  • File-system improvements:
    • Btrfs has experimental RAID5/6 support and fsync performance improvements.
    • EXT-4 bug fixes for corruption issue, and JBD2 journaling layer issue which affected performance.
    • Various improvements for F2FS file-system.
  • Added ability to use SSDs as hard-disk cache.
  • Update the latest LZO compression implementation within the kernel. Decompression and compression performance has been massively improved and x86 and ARM targets.
  • Improved power management, including “lightweight suspend” (aka “suspend freeze”) mode.
  • Improved ARM SoC support
    • Added Nvidia Tegra 4 support including support for Dalmore and Pluto development boards.
    • Added Nvidia Tegra 3 Beaver Board support
    • Kernel-based Virtual Machine (KVM) support for ARMv7 (Cortex A15 required)
  • Mainlining of Google’s Goldfish virtual CPU.
  • Initial ARC Linux support. See commit.
  • Added support for Imagination Meta ATP (Meta 1) and HTP (Meta 2)
  • Graphics drivers updates – Nouveau, the open-source reverse-engineered NVIDIA Linux graphics driver, is faster for some Linux OpenGL games. There’s also some Intel OpenGL performance changes.
  • Support for Intel 7000 Wi-Fi components supporting 802.11ac.
  • CONFIG_EXPERIMENTAL kernel configuration option has been removed

More details about Linux 3.9 will be available on Kernelnewbies.org (which is currently down).

(via CNXSoft blog)

Amazon S3 – Two Trillion Objects, 1.1 Million Requests / Second

Last June I blogged about the first trillion objects stored in Amazon S3. On the first day of re:Invent I updated that number to 1.3 trillion.

It is time for another update!

I’m pleased to announce that there are now more than 2 trillion (2 x 1012) objects stored in Amazon S3 and that the service is regularly peaking at over 1.1 million requests per second.

It took us six years to grow to one trillion stored objects, and less than a year to double that number.

What Does That Mean?
It is always fun to try and put these numbers into real-world terms:

I spoke at a cloud computing conference in China last week. With a population of 1.35 billion, there are 1,481 Amazon S3 objects per Chinese citizen!

Our galaxy is estimated to contain about 400 billion stars. That works out to five objects for every star in the galaxy.

The field of Paleodemography estimates that 100 billion people have been born on planet Earth. Each of them can have 20 S3 objects.

And one more — our universe is about 13.6 billion years old. If you added one S3 object every 60 hours starting at the Big Bang, you’d have accumulated almost two trillion of them by now.

(via Amazon Web Services)

Welcome to Berkeley: Where Hadoop isn’t nearly fast enough

amplab

photo: AMPLab
Hadoop not fast enough for you? Then you might want to get to know AMPLab, a University of California, Berkeley team developing faster versions of many core Hadoop components.

Tucked within the computer science deparment at the University of California, Berkeley, there’s an institution called AMPLab that’s making a name for itself by — among other things — essentially rebuilding the Hadoop platform, only faster.

Results for linear regression test

Results for linear regression test

AMPLab’s most well-known product in the big data space, called Spark, is an in-memory parallel processing framework that’s comparable to Hadoop MapReduce except, its creators claim, it is up to 100 times faster. Because it runs in-memory, Spark might be comparable with something like Druid or SAP’s HANA system, too. Spark is the processing engine that powers ClearStory’s next-generation analytics and visualization service.

Like Hive as a data warehouse for Hadoop? Then you’ll love Shark, which is short for “Hive on Spark.”

Even as Hadoop gets more flexible thanks to new features such as YARN, which would technically allow it to run an alternative framework like Spark, AMPLab has its own cluster-management project called Mesos. Twitter is a big fan of Mesos, which is also an Apache Incubator project.

AMPLab’s latest target is the Hadoop Distributed File System, or HDFS. HDFS has long been criticized for availability and speed, so AMPLab created an alternative called Tachyon (hat tip to High Scalability for calling my attention to it). According to the Tachyon homepage, “it offers up to 300 times higher throughput than HDFS, by leveraging lineage information and using memory aggressively. Tachyon caches working set files in memory, and enables different jobs/queries and frameworks to access cached files at memory speed.”

AMPLab isn’t the first to question the cult of HDFS, though. There are numerous commercial options available, and Quantcast built its own open source file systemthat it claims is faster and more efficient when running at massive scale.

But it’s probably unfair to call AMPLab’s projects competitors to Hadoop. They’re certainly alternatives, but they’re also complementary, as Twitter’s heavy use of Hadoop and Mesos demonstrates. And Spark, Shark, Mesos and Tachyon are all compatible with their peer projects from the Apache Hadoop project.

Really, AMPLab is doing what any research institution does by pushing the limits of the current commercially available software. If it happens to disrupt the status quo, then so be it. For users, though, it’s just providing some new options to play around with as they try to find the right tool for their particular jobs. Its partners and sponsors, including Google, Facebook, Microsoft and Amazon Web Services, certainly have an interest in finding the best-possible technology, or creating it if necessary.

The MLBase architecture.

The MLBase architecture.

Other related AMPLab projects include PIQL, a SQL-like query language that sits atop a key-value store; MLBase, a system for doing machine learning on distributed systems; Akaros, an operating system for manycore and large SMP systems; and Sparrow, a cluster-scheduling system designed for low-latency computing.

(via gigaom.com)

Does WebKit face a troubled future now that Google is gone?

chromium-webkit-hands

Now that Google is going its own way and developing its rendering engine independently of the WebKit project, both sides of the split are starting the work of removing all the things they don’t actually need.

This is already causing some tensions among WebKit users and Web developers, as it could lead to the removal of technology that they use or technology that is in the process of being standardized. This is leading some to question whether Apple is willing or able to fill in the gaps that Google has left.

Since Google first released Chrome in 2008, WebCore, the part of WebKit that does the actual CSS and HTML processing, has had to serve two masters. The major contributors to the project, and the contributors with the most widely used browsers, were Apple and Google.

While both used WebCore, the two companies did a lot of things very differently. They used different JavaScript engines (JavaScriptCore [JSC] for Apple, V8 for Google). They adopted different approaches to handling multiple processes and sandboxing. They used different options when compiling the software, too, so their browsers actually had different HTML and CSS features.

The WebCore codebase had to accommodate all this complexity. JavaScript, for example, couldn’t be integrated too tightly with the code that handles the DOM (the standard API for manipulating HTML pages from JavaScript), because there was an intermediary layer to ensure that JSC and V8 could be swapped in and out.

Google said that the decision to fork was driven by engineering concerns and that forking would enable faster development by both sides. That work is already under way, and both teams are now preparing to rip all these unnecessary bits out.

Right now, it looks like Google has it easier. So far, only Google and Opera are planning to use Blink, and Opera intends to track Chromium (the open source project that contains the bulk of Chrome’s code) and Blink anyway, so it won’t diverge too substantially from either. This means that Google has a fairly free hand to turn features that were optional in WebCore into ones that are permanent in Blink if Chrome uses them, or eliminate them entirely if it doesn’t.

Apple’s position is much trickier, because many other projects use WebKit, and no one person knows which features are demanded by which projects. Apple also wants to remove the JavaScript layers and just bake in the use of JSC, but some WebKit-derived projects may depend on them.

Samsung, for example, is using WebKit with V8. But with Google’s fork decision, there’s now nobody maintaining the code that glues V8 to WebCore. The route that Apple wants to take is to purge this stuff and leave it up to third-party projects to maintain their variants themselves. This task is likely to become harder as Cupertino increases the integration between JSC and WebCore.

Oracle is working on a similar project: a version of WebKit with its own JavaScript engine, “Nashorn,” that’s based on the Java virtual machine. This takes advantage of the current JavaScript abstractions, so it’s likely to be made more complicated as Apple removes them.

One plausible outcome for this is further consolidation among the WebKit variants. For those dead set on using V8, switching to Blink may be the best option. If sticking with WebKit is most important, reverting to JSC may be the only practical long-term solution.

Google was an important part of the WebKit project, and it was responsible for a significant part of the codebase’s maintenance. The company’s departure has left various parts of WebKit without any developers to look after them. Some of these, such as some parts of the integrated developer tools, are probably too important for Apple to abandon—even Safari uses them.

Others, however, may be culled—even if they’re on track to become Web standards. For example, Google developed code to provide preliminary support for CSS Custom Properties (formerly known as CSS Variables). It was integrated into WebKit but only enabled in Chromium. That code now has nobody to maintain it, so Apple wants to remove it.

This move was immediately criticized by Web developer Jon Rimmer, who pointed out that the standard was being actively developed by the World Wide Web Consortium (W3C), was being implemented by Mozilla, and was fundamentally useful. The developer suggested that Apple had two options for dealing with Google’s departure from the project: either by “cutting out [Google-developed] features and continuing at a reduced pace, or by stepping up yourselves to fill the gap.”

Discussion of Apple’s ability to fill the Google-sized gap in WebKit was swiftly shut down, but Rimmer’s concern remains the elephant in the room. Removing the JavaScript layer is one thing; this was a piece of code that existed only to support Google’s use of V8, and with JavaScriptCore now the sole WebKit engine, streamlining the code makes sense. Google, after all, is doing the same thing in Blink. But removing features headed toward standardization is another thing entirely.

If Apple doesn’t address Rimmer’s concerns, and if Blink appears to have stronger corporate backing and more development investment, one could see a future in which more projects switch to using Blink rather than WebKit. Similarly, Web developers could switch to Blink—with a substantial share of desktop usage and a growing share of mobile usage—and leave WebKit as second-best.

(via arstechnica.com)