Netflix is open sourcing a tool called Suro that the company uses to direct data from its source to its destination in real time. More than just serving a key role in the Netflix data pipeline, though, it’s also a great example of the impressive, if sometimes redundant, ecosystem of open source data-analysis tools making their way out of large web properties.
Netflix’s various applications generate tens of billions of events per day, and Suro collects them all before sending them on their way. Most head to Hadoop (via Amazon S3) for batch processing, while others head to Druid and ElasticSearch (via Apache Kafka) for real-time analysis. According to the Netflix blog post explaining Suro (which goes into much more depth), the company is also looking at how it might use real-time processing engines such as Storm or Samza to perform machine learning on event data.
An example Suro workflow. Source: Netflix
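To make the routing idea concrete, here is a minimal sketch of a collector that fans events out to different destinations. It is not Suro's actual API: the `EventRouter` class, the sink names `s3_batch` and `kafka_realtime`, and the `realtime` flag on events are all hypothetical, and the rule that every event goes to the batch sink while only flagged events also go to the real-time sink is an assumption made for illustration.

```python
from collections import defaultdict


class EventRouter:
    """Toy event router (hypothetical, not Suro's API).

    Each route pairs a sink name with a predicate; an event is delivered
    to every sink whose predicate returns True.
    """

    def __init__(self):
        self.routes = []                # list of (predicate, sink_name)
        self.sinks = defaultdict(list)  # sink_name -> delivered events

    def add_route(self, sink_name, predicate=lambda event: True):
        """Register a sink; by default it accepts every event."""
        self.routes.append((predicate, sink_name))

    def route(self, event):
        """Deliver one event and return the names of the sinks that got it."""
        delivered = []
        for predicate, sink_name in self.routes:
            if predicate(event):
                self.sinks[sink_name].append(event)
                delivered.append(sink_name)
        return delivered


router = EventRouter()
# Assumption: all events are archived for batch processing.
router.add_route("s3_batch")
# Assumption: only events flagged as realtime go to the streaming sink.
router.add_route("kafka_realtime", lambda e: e.get("realtime", False))

events = [
    {"type": "playback_start", "realtime": True},
    {"type": "page_view"},
]
results = [router.route(e) for e in events]
```

In a real deployment the in-memory lists would be replaced by writers for S3, Kafka and so on; the point of the sketch is only that routing decisions live in one place, between the applications and the systems that consume their events.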
To anyone familiar with the big data space, the names of the various technologies at play here are probably associated to some degree with the companies that developed them. Netflix created Suro, LinkedIn created Kafka and Samza, Twitter (actually, BackType, which Twitter acquired) created Storm, and Metamarkets (see disclosure) created Druid. Suro, the blog post’s authors acknowledged, is based on the Apache Chukwa project and is similar to Apache Flume (created by Hadoop vendor Cloudera) and Facebook’s Scribe. Hadoop, of course, was all but created at Yahoo and has since seen notable ecosystem contributions from all sorts of web companies.
I sometimes wonder whether all these companies really need to be creating their own technologies all the time, or if they could often get by using the stuff their peers have already created. Like most things in life, though, that question is probably best answered on a case-by-case basis. Storm, for example, is becoming a very popular tool for stream processing, but LinkedIn felt it needed something different and thus built Samza. Netflix decided it needed Suro rather than just using pre-existing technologies (largely because of its cloud-heavy infrastructure, most of which runs in Amazon Web Services), but it also clearly uses lots of tools built elsewhere (including the Apache Cassandra database).
A diagram of LinkedIn’s data architecture as of February 2013, including everything from Kafka to Teradata.
Hopefully, the big winners in all this innovation will be mainstream technology users who don’t have the in-house talent (or, necessarily, the need) to develop these advanced systems themselves but could benefit from their capabilities. We’re already seeing Hadoop vendors, for example, trying to make projects such as Storm and the Spark processing framework usable by their enterprise customers, and it seems unlikely they’ll be the last. There are a whole lot of AWS users, after all, and they might want the capabilities Suro can provide without having to rely on Amazon to deliver them. (We’ll likely hear a lot more about the future of Hadoop as a data platform at our Structure Data conference, which takes place in March.)