Apache has released Kafka 0.8, the first major release of Kafka since becoming an Apache Software Foundation top level project. Apache Kafka is publish-subscribe messaging implemented as a distributed commit log, suitable for both offline and online message consumption. It is a messaging system initially developed at LinkedIn for collecting and delivering high volumes of event and log data with low latency. The latest release includes intra-cluster replication and multiple data directories support. Request processing is also now asynchronous, implemented via a secondary pool of request handling threads. Log files can be rotated by age, and log levels can be set dynamically via JMX. A performance test tool has been added, to help fix existing performance concerns and look for potential performance enhancements.
Kafka is a distributed, partitioned, replicated commit log service. Producers publish messages to Kafka topics, and consumers subscribe to these topics and consume the messages. A server in a Kafka cluster is called a broker. For each topic, the Kafka cluster maintains a partition for scaling, parallelism and fault-tolerance. Each partition is an ordered, immutable sequence of messages that is continually appended to a commit log. The messages in the partitions are each assigned a sequential id number called the offset.
The offset is controlled by the consumer. The typical consumer will process the next message in the list, although it can consume messages in any order, as the Kafka cluster retains all published messages for a configurable period of time. This makes consumers very cheap, as they can come and go without much impact on the cluster, and allows for offline consumers like Hadoop clusters. Producers are able to choose which topic, and which partition within the topic, to publish the message to. Consumers assign themselves a consumer group name, and each message is delivered to one consumer within each subscribing consumer group. If all the consumers have different consumer groups, then messages are broadcasted to each consumer.
Kafka can be used like a traditional message broker. It has high throughput, built-in partitioning, replication, and fault-tolerance, which makes it a good solution for large scale message processing applications. Kafka can also be used for high volume website activity tracking. Site activity can be published, and can be processed real-time, or loaded into Hadoop or offline data warehousing systems. Kafka can also be used as a log aggregation solution. Instead of working with log files, logs can be treated a stream of messages.
Kafka is used at LinkedIn and it handles over 10 billion message writes per day with a sustained load that averages 172,000 messages per second. There is heavy use of multi-subscriber support, both from internal and external applications that make use of the data. There is a ratio of roughly 5.5 messages consumed for each message produced, which results in a daily total in excess of 55 billion messages delivered to real-time consumers. There are 367 topics that cover both user activity topics and operational data, the largest of which adds an average of 92GB per day of batch-compressed messages. Messages are kept for 7 days, and these average at about 9.5 TB of compressed messages across all topics. In addition to live consumers, there are numerous large Hadoop clusters which consume infrequent, high-throughput, parallel bursts as part of the offline data load.
To get started, visit the official Apache Kafka documentation page from where you can learn more and download Kafka. There is also a paper from LinkedIn titled Building LinkedIn’s Real-time Activity Data Pipeline, which talks about why Kafka was built and the factors that contributed to Kafka’s design.