World’s First 1,000-Processor Chip Said to Show Promise Across Multiple Workloads


A blindingly fast microchip, the first to contain 1,000 independent processors and said to show promise for digital signal processing, video processing, encryption and datacenter/cloud workloads, has been announced by a team at the University of California, Davis. The “KiloCore” chip has a maximum computation rate of 1.78 trillion instructions per second and contains 621 million transistors, according to the development team, which presented the microchip at the 2016 Symposium on VLSI Technology at Circuits this month in Honolulu.

By way of comparison, if the KiloCore’s area were the same as a 32 nm Intel Core i7 processor, it would contain approximately 2300-3700 processors and have a peak execution rate of 4.1 to 6.6 trillion independent instructions per second, according to the design team.

Although Bevan Baas, professor of electrical and computer engineering at UC/Davis who led the chip architectural design team, told HPCwire’s sister publication, EnterpriseTech, that there are no current plans to commercialize the processor, he said it has important commercial implications, with several applications already developed for the chip.


“It has been shown to excel spectacularly with many digital signal processing, wireless coding/decoding, multimedia and embedded workloads, and recent projects have shown that it can also excel at computing kernels for some datacenter/cloud and scientific workloads,” he said. He said the KiloCore innovates in a number of areas covering architectures, application development and mapping, circuits, and VLSI design.

“We hope multiple aspects of KiloCore will influence the design of future computing systems,” he said. “For workloads that can be mapped to its architecture, it could very well have a place in exascale-class computing.”

The design team claims for KiloCore the highest clock-rate processor ever designed in a university. And while other multiple-processor chips have been created, none exceed about 300 processors, according to the team.

The KiloCore chip was fabricated by IBM using their 32 nm CMOS technology.

Beyond throughput performance, Baas said KiloCore also is the most energy-efficient many-core processor ever reported, Baas said. For example, the 1,000 processors can execute 115 billion instructions per second while dissipating only 0.7 Watts, low enough to be powered by a single AA battery. The KiloCore chip executes instructions more than 100 times more efficiently than a modern laptop processor.

Each processor core can run its own small program independently of the others, Baas explained, which he said is a fundamentally more flexible approach than Single-Instruction-Multiple-Data approaches utilized by processors such as GPUs. The idea is to break an application up into many small pieces, each of which can run in parallel on different processors, enabling high throughput with lower energy use.

The KiloCore architecture is an example of a “fine-grain many-core” processor array, Baas said. Processors are kept as simple as possible so they occupy a small chip area, with numerous cores per chip. “Short low-capacitance wires result in high efficiency,” he said, and “operate at high clock frequencies (high performance in terms of high throughput and low latency).” The cores dissipate low power when both active and idle – in fact, he said, they dissipate perfect zero active power when there is no work to do. Energy efficiency also is achieved by operation at low supply voltages and a relatively-simple architecture consisting of a single-issue 7-stage pipeline with a small amount of memory per core and a message-passing-based inter-processor interconnect rather than a cache-based shared-memory model.

Baas said the team has completed a compiler and automatic program mapping tools for use in programming the chip



Design Of A Modern Cache

Caching is a common approach for improving performance, yet most implementations use strictly classical techniques. In this article we will explore the modern methods used by Caffeine, an open-source Java caching library, that yield high hit rates and excellent concurrency. These ideas can be translated to your favorite language and hopefully some readers will be inspired to do just that.

Eviction Policy

A cache’s eviction policy tries to predict which entries are most likely to be used again in the near future, thereby maximizing the hit ratio. The Least Recently Used (LRU) policy is perhaps the most popular due to its simplicity, good runtime performance, and a decent hit rate in common workloads. Its ability to predict the future is limited to the history of the entries residing in the cache, preferring to give the last access the highest priority by guessing that it is the most likely to be reused again soon.

Modern caches extend the usage history to include the recent past and give preference to entries based on recency and frequency. One approach for retaining history is to use a popularity sketch (a compact, probabilistic data structure) to identify the “heavy hitters” in a large stream of events. Take for example CountMin Sketch, which uses a matrix of counters and multiple hash functions. The addition of an entry increments a counter in each row and the frequency is estimated by taking the minimum value observed. This approach lets us tradeoff between space, efficiency, and the error rate due to collisions by adjusting the matrix’s width and depth.


Window TinyLFU (W-TinyLFU) uses the sketch as a filter, admitting a new entry if it has a higher frequency than the entry that would have to be evicted to make room for it. Instead of filtering immediately, an admission window gives an entry a chance to build up its popularity. This avoids consecutive misses, especially in cases like sparse bursts where an entry may not be deemed suitable for long-term retention. To keep the history fresh an aging process is performed periodically or incrementally to halve all of the counters.



W-TinyLFU uses the Segmented LRU (SLRU) policy for long term retention. An entry starts in the probationary segment and on a subsequent access it is promoted to the protected segment (capped at 80% capacity). When the protected segment is full it evicts into the probationary segment, which may trigger a probationary entry to be discarded. This ensures that entries with a small reuse interval (the hottest) are retained and those that are less often reused (the coldest) become eligible for eviction.



As the database and search traces show, there is a lot of opportunity to improve upon LRU by taking into account recency and frequency. More advanced policies such as ARC, LIRS, and W-TinyLFU narrow the gap to provide a near optimal hit rate. For additional workloads see the research papers and try our simulator if you have your own traces to experiment with.

Expiration Policy

Expiration is often implemented as variable per entry and expired entries are evicted lazily due to a capacity constraint. This pollutes the cache with dead items, so sometimes a scavenger thread is used to periodically sweep the cache and reclaim free space. This strategy tends to work better than ordering entries by their expiration time on a O(lg n) priority queue due to hiding the cost from the user instead of incurring a penalty on every read or write operation.

Caffeine takes a different approach by observing that most often a fixed duration is preferred. This constraint allows for organizing entries on O(1) time ordered queues. A time to live duration is a write order queue and a time to idle duration is an access order queue. The cache can reuse the eviction policy’s queues and the concurrency mechanism described below, so that expired entries are discarded during the cache’s maintenance phase.


Concurrent access to a cache is viewed as a difficult problem because in most policies every access is a write to some shared state. The traditional solution is to guard the cache with a single lock. This might then be improved through lock striping by splitting the cache into many smaller independent regions. Unfortunately that tends to have a limited benefit due to hot entries causing some locks to be more contented than others. When contention becomes a bottleneck the next classic step has been to update only per entry metadata and use either a random sampling or a FIFO-based eviction policy. Those techniques can have great read performance, poor write performance, and difficulty in choosing a good victim.

An alternative is to borrow an idea from database theory where writes are scaled by using a commit log. Instead of mutating the data structures immediately, the updates are written to a log and replayed in asynchronous batches. This same idea can be applied to a cache by performing the hash table operation, recording the operation to a buffer, and scheduling the replay activity against the policy when deemed necessary. The policy is still guarded by a lock, or a try lock to be more precise, but shifts contention onto appending to the log buffers instead.

In Caffeine separate buffers are used for cache reads and writes. An access is recorded into a striped ring buffer where the stripe is chosen by a thread specific hash and the number of stripes grows when contention is detected. When a ring buffer is full an asynchronous drain is scheduled and subsequent additions to that buffer are discarded until space becomes available. When the access is not recorded due to a full buffer the cached value is still returned to the caller. The loss of policy information does not have a meaningful impact because W-TinyLFU is able to identify the hot entries that we wish to retain. By using a thread-specific hash instead of the key’s hash the cache avoids popular entries from causing contention by more evenly spreading out the load.


In the case of a write a more traditional concurrent queue is used and every change schedules an immediate drain. While data loss is unacceptable, there are still ways to optimize the write buffer. Both types of buffers are written to by multiple threads but only consumed by a single one at a given time. This multiple producer / single consumer behavior allows for simpler, more efficient algorithms to be employed.

The buffers and fine grained writes introduce a race condition where operations for an entry may be recorded out of order. An insertion, read, update, and removal can be replayed in any order and if improperly handled the policy could retain dangling references. The solution to this is a state machine defining the lifecycle of an entry.


In benchmarks the cost of the buffers is relatively cheap and scales with the underlying hash table. Reads scale linearly with the number of CPUs at about 33% of the hash table’s throughput. Writes have a 10% penalty, but only because contention when updating the hash table is the dominant cost.



There are many pragmatic topics that have not been covered. This could include tricks to minimize the memory overhead, testing techniques to retain quality as complexity grows, and ways to analyze performance to determine whether a optimization is worthwhile. These are areas that practitioners must keep an eye on, because once neglected it can be difficult to restore confidence in one’s own ability to manage the ensuing complexity.

The design and implementation of Caffeine is the result of numerous insights and the hard work of many contributors. Its evolution over the years wouldn’t have been possible without the help from the following people: Charles Fry, Adam Zell, Gil Einziger, Roy Friedman, Kevin Bourrillion, Bob Lee, Doug Lea, Josh Bloch, Bob Lane, Nitsan Wakart, Thomas Müeller, Dominic Tootell, Louis Wasserman, and Vladimir Blagojevic.


ejabberd Massive Scalability: 1 Node — 2+ Million Concurrent Users

How Far Can You Push ejabberd?

From our experience, we all get the idea that ejabberd is massively scalable. However, we wanted to provide benchmark results and hard figures to demonstrate our outstanding performance level and give a baseline about what to expect in simple cases.

That’s how we ended up with the challenge of fitting a very large number of concurrent users on a single ejabberd node.

It turns out you can get very far with ejabberd.

Scenario and Platforms

Here is our benchmark scenario: Target was to reach 2,000,000 concurrent users, each with 18 contacts on the roster and a session lasting around 1h. The scenario involves 2.2M registered users, so almost all contacts are online at the peak load. It means that presence packets were broadcast for those users, so there was some traffic as an addition to packets handling users connections and managing sessions. In that situation, the scenario produced 550 connections/second and thus 550 logins per second.

Database for authentication and roster storage was MySQL, running on the same node as ejabberd.

For the benchmark itself, we used Tsung, a tool dedicated to generating large loads to test servers performance. We used a single large instance to generate the load.

Both ejabberd and the test platform were running on Amazon EC2 instances. ejabberd was running on a single node of instance type m4.10xlarge (40 vCPU, 160 GiB). Tsung instance was identical.

Regarding ejabberd software itself, the test was made with ejabberd Community Server version 16.01. This is the standard open source version that is widely available and widely used across the world.

The connections were not using TLS to make sure we were focusing on testing ejabberd itself and not openSSL performance.

Code snippets and comments regarding the Tsung scenario are available for download:

Overall Benchmark Results


We managed to surpass the target and we support more than2 million concurrent users on a single ejabberd.

For XMPP servers, the main limitation to handle a massive number of online users is usually memory consumption. With proper tuning, we managed to handle the traffic with a memory footprint of 28KB per online user.

The 40 CPUs were almost evenly used, with the exception of the first core that was handling all the network interruptions. It was more loaded by the Operating System and thus less loaded by the Erlang VM.

In the process, we also optimized our XML parser, released now as Fast XML, a high-performance, memory efficient Expat-based Erlang and Elixir XML parser.

Detailed Results

ejabberd Performance


Benchmark shows that we reached 2 million concurrent users after one hour. We were logging in about 33k users per minute, producing session traffic of a bit more than 210k XMPP packets per minute (this includes the stanzas to do the SASL authentication, binding, roster retrieval, etc). Maximum number of concurrent users is reached shortly after the 2 million concurrent users mark, by design in the scenario. At this point, we still connect new users but, as the first users start disconnecting, the number of concurrent users gets stable.

As we try to reproduce common client behavior we setup Tsung to send “keepalive pings” on the connections. Since each session sends one of such whitespace pings each minute, the number of such requests grows proportionately with the number of connected users. And while idle connections consume few resources on the server, it is important to note that in this scale they start to be noticeable. Once you have 2M users, you will be handling 33K/sec of such pings just from idle connections. They are not represented on the graphs, but are an important part of the real life traffic we were generating.

ejabberd Health


At all time, ejabberd health was fine. Typically, when ejabberd is overloaded, TCP connection establishment time and authentication tend to grow to an unacceptable level. In our case, both of those operations performed very fast during all bench, in under 10 milliseconds. There was almost no errors (the rare occurrences are artefacts of the benchmark process).

Platform Behavior


Good health and performance are confirmed by the state of the platform. CPU and memory consumption were totally under control, as shown in the graph. CPU consumption stays far from system limits. Memory grows proportionally to the number of concurrent users.

We also need to mention that values for CPUs are slightly overestimated as seen by the OS, as Erlang schedulers stay a bit of busy waiting when running out of work.

Challenge: The Hardest Part

The hardest part is definitely tuning the Linux system for ejabberd and for the benchmark tool, to overcome the default limitations. By default, Linux servers are not configured to allow you to handle, nor even generate, 2 million TCP sockets. It required quite a bit of network setup not to have problems with exhausted ports on the Tsung side.

On a similar topic, we worked with the Amazon server team, as we have been pushing the limits of their infrastructure like no one before. For example, we had to use a second Ethernet adapter with multiple IP addresses (2 x 15 IP, spread across 2 NICs). It also helped a lot to use latest Enhanced Networking drivers from Intel.

All in all, it was a very interesting process that helped make progress on Amazon Web Services by testing and tuning the platform itself.

What’s Next?

This benchmark was intended to demonstrate that ejabberd can scale to a large volume and serve as a baseline reference for more complex and full-featured platforms.

Next step is to keep on going with our benchmark and optimization iteration work. Our next target is to benchmark Multi-User Chat performance and Pubsub performance. The goal is to find the limits, optimize and demonstrate that massive internet scale can be done with these ejabberd components as well.


Google Launches Cloud CDN Alpha

Earlier this month, Google announced an Alpha Cloud Content Delivery Network (CDN) offering. The service aims to provide static content closer to end users by caching content in Google’s globally distributed edge caches.  Google provides many more edge caches than it does data centers and as a result content can be provided quicker than making full round trip requests to a Google data center. In total, Google provides over 70 Edge points of presence which will help address customer CDN needs.

In order to use Cloud CDN, you must use Google’s Compute Engine HTTP(s) load balancers on your instances. Enabling an HTTP(S) load balancer is achieved through a simple command.

In a recent post, Google explains the mechanics of the service in the following way: “When a user requests content from your site, that request passes through network locations at the edges of Google’s network, usually far closer to the user than your actual instances. The first time that content is requested, the edge cache sees that it can’t fulfill the request and forwards the request on to your instances. Your instances respond back to the edge cache, and the cache immediately forwards the content to the user while also storing it for future requests. For subsequent requests for the same content that pass through the same edge cache, the cache responds directly to the user, shortening the round trip time and saving your instances the overhead of processing the request.”

The following image illustrates how Google leverages Edge point of presence caches to improve responsiveness.


Image Source:

Once the CDN service has been enabled, caching will automatically occur for all cacheable content. Cacheable content is typically defined by requests made through an HTTP GET request.  The service will respect explicit Cache-Control headers taking into account for expiration or make age headers.  Some responses will not be cached including ones that include Set-Cookie headers, message bodies that exceed 4 mb in size or where caching has been explicitly disabled through no-cache directives.  A complete list of cached rules can be found in Google’s documentation.

Google has traditionally partnered with 3rd parties in order to speed up the delivery of content to consumers.  These partnerships include Akamai, Cloudflare, Fastly, Level 3 Communications and Highwinds.

Other cloud providers also have CDN offerings including Amazon’s CloudFront and Microsoft’s Azure CDN.  Google will also see competition from Akamai, one of the aforementioned partners, who has approximately 16.3% CDN market share of the Alexa top 1 million sites.

How Wistia Handles Millions Of Requests Per Hour And Processes Rich Video Analytics

This is a guest repost from Christophe Limpalair of his interview with Max Schnur, Web Developer at  Wistia.

Wistia is video hosting for business. They offer video analytics like heatmaps, and they give you the ability to add calls to action, for example. I was really interested in learning how all the different components work and how they’re able to stream so much video content, so that’s what this episode focuses on.

What Does Wistia’s Stack Look Like?

As you will see, Wistia is made up of different parts. Here are some of the technologies powering these different parts:

What Scale Are You Running At?

Wistia has three main parts to it:

  1. The Wistia app (The ‘Hub’ where users log in and interact with the app)
  2. The Distillery (stats processing)
  3. The Bakery (transcoding and serving)

Here are some of their stats:

  • 1.5 million player loads per hour (loading a page with a player on it counts as one. Two players counts as two, etc…)
  • 18.8 million peak requests per hours to their Fastly CDN
  • 740,000 app requests per hour
  • 12,500 videos transcoded per hour
  • 150,000 plays per hour
  • 8 million stats pings per hour

They are running on Rackspace.

What Challenges Come From Receiving Videos, Processing Them, Then Serving Them?

  1. They want to balance quality and deliverability, which has two sides to it:
    1. Encoding derivatives of the original video
    2. Knowing when to play which derivative


    Derivatives, in this context, are different versions of a video. Having different quality versions is important, because it allows you to have smaller file sizes. These smaller video files can be served to users with less bandwidth. Otherwise, they would have to constantly buffer. Have enough bandwidth? Great! You can enjoy higher quality (larger) files.

    Having these different versions and knowing when to play which version is crucial for a good user experience.

  2. Lots of I/O. When you have users uploading videos at the same time, you end up having to move many heavy files across clusters. A lot of things can go wrong here.
  3. Volatility. There is volatility in both the number of requests and the number of videos to process. They need to be able to sustain these changes.
  4. Of course, serving videos is also a major challenge. Thankfully, CDNs have different kinds of configurations to help with this.
  5. On the client side, there are a ton of different browsers, device sizes, etc… If you’ve ever had to deal with making websites responsive or work on older IE versions, you know exactly what we’re talking about here.

How Do You Handle Big Changes In The Number Of Video Uploads?

They have boxes which they call Primes. These Prime boxes are in charge of receiving files from user uploads.

If uploads start eating up all the available disk space, they can just spin up new Primes using Chef recipes. It’s a manual job right now, but they don’t usually get anywhere close to the limit. They have lots of space.


What Transcoding System Do You Use?

The transcoding part is what they call the Bakery.

The Bakery is comprised of Primes we just looked at, which receive and serve files. They also have a cluster of workers. Workers process tasks and create derivative files from uploaded videos.

Beefy systems are required for this part, because it’s very resource intensive. How beefy?

We’re talking about several hundred workers. Each worker usually performs two tasks at one time. They all have 8 GB of RAM and a pretty powerful CPU.

What kinds of tasks and processing goes on with workers? They encode video primarily in x264 which is a fast encoding scheme. Videos can usually be encoded in about half or a third of the video length.

Videos must also be resized and they need different bitrate versions.

In addition, there are different encoding profiles for different devices–like HLS for iPhones. These encodings need to be doubled for Flash derivatives that are also x264 encoded.

What Does The Whole Process Of Uploading A Video And Transcoding It Look Like?

Once a user uploads a video, it is queued and sent to workers for transcoding, and then slowly transferred over to S3.

Instead of sending a video to S3 right away, it is pushed to Primes in the clusters so that customers can serve videos right away. Then, over a matter of a few hours, the file is pushed to S3 for permanent storage and cleared from Prime to make space.

How Do You Serve The Video To A Customer After It Is Processed?

When you’re hitting a video hosted on Wistia, you make a request to, which is actually hitting a CDN (they use a few different ones. I just tried it and mine went through Akamai). That CDN goes back to Origin, which is their Primes we just talked about.

Primes run HAProxy with a special balancer called Breadroute. Breadroute is a routing layer that sits in front of Primes and balances traffic.

Wistia might have the file stored locally in the cluster (which would be faster to serve), and Breadroute is smart enough to know that. If that is the case, then Nginx will serve the file directly from the file system. Otherwise, they proxy S3.

How Can You Tell Which Video Quality To Load?

That’s primarily decided on the client side.

Wistia will receive data about the user’s bandwidth immediately after they hit play. Before the user even hits play, Wistia receives data about the device, the size of the video embed, and other heuristics to determine the best asset for that particular request.

The decision of whether to go HD or not is only debated when the user goes into full screen. This gives Wistia a chance to not interrupt playback when a user first clicks play.

How Do You Know When Videos Don’t Load? How Do You Monitor That?

They have a service they call pipedream, which they use within their app and within embed codes to constantly send data back.

If a user clicks play, they get information about the size of their window, where it was, and if it buffers (if it’s in a playing state but the time hasn’t changed after a few seconds).

A known problem with videos is slow load times. Wistia wanted to know when a video was slow to load, so they tried tracking metrics for that. Unfortunately, they only knew if it was slow if the user actually waited around for the load to finish. If users have really slow connections, they might just leave before that happens.

The answer to this problem? Heartbeats. If they receive heartbeats and they never receive a play, then the user probably bailed.

What Other Analytics Do You Collect?

They also collect analytics for customers.

Analytics like play, pause, seeks, conversions (entering an email, or clicking a call to action). Some of these are shown in heatmaps. They also aggregate those heatmaps into graphs that show engagement, like the areas that people watched and rewatched.


How Do You Get Such Detailed Data?

When the video is playing, they use a video tracker which is an object bound to plays, pauses, and seeks. This tracker collects events into a data structure. Once every 10-20 seconds, it pings back to the Distillery which in turn figures out how to turn data into heatmaps.

Why doesn’t everyone do this? I asked him that, because I don’t even get this kind of data from YouTube. The reason, Max says, is because it’s pretty intensive. The Distillery processes a ton of stuff and has a huge database.

What Is Scaling Ease Of Use?

Wistia has about 60 employees. Once you start scaling up customer base, it can be really hard to make sure you keep having good customer support.

Their highest touch points are playback and embeds. These have a lot of factors that are out of their control, so they must make a choice:

  1. Don’t help customers
  2. Help customers, but it takes a lot of time

Those options weren’t good enough. Instead, they went after the source that caused the most customer support issues–embeds.

Embeds have gone through multiple iterations. Why? Because they often broke. Whether it was WordPress doing something weird to an embed, or Markdown breaking the embed, Wistia has had a lot of issues with this over time. To solve this problem, they ended up dramatically simplifying embed codes.

These simpler embed codes run into less problems on the customer’s end. While it does add more complexity behind the scenes, it results in fewer support tickets.

This is what Max means by scaling ease of use. Make it easier for customers, so that they don’t have to contact customer support as often. Even if that means more engineering challenges, it’s worth it to them.

Another example of scaling ease of use is with playbacks. Max is really interested in implementing a kind of client-side CDN load balancer that determines the lowest latency server to serve content from (similar to what Netflix does).

What Projects Are You Working On Right Now?

Something Max is planning on starting soon is what they’re calling the upload server. This upload server is going to give them the ability to do a lot of cool stuff around uploads.

As we talked about, and with most upload services, you have to wait for the file to get to a server before you can start transcoding and doing things with it.

The upload server will make it possible to transcode while uploading. This means customers could start playing their video before it’s even completely uploaded to their system. They could also get the still capture and embed almost immediately


I have to try and fit as much information in this summary as possible without making it too long. This means I cut out some information that could really help you out one day. I highly recommend you view the full interview if you found this interesting, as there are more hidden gems.

You can watch it, read it, or even listen to it. There’s also an option to download the MP3 so you can listen on your way to work. Of course, the show is also on iTunes and Stitcher.

Thanks for reading, and please let me know if you have any feedback!
– Christophe Limpalair (@ScaleYourCode)


Intel will ship Xeon Phi-equipped workstations starting in 2016

Intel has announced that its second-generation Xeon Phi hardware (codenamed Knights Landing) is now shipping to early customers. Knights Landing is built on 14nm process technology, with up to 72 Silvermont-derived CPU cores. While the design is derived from Atom, there are some obvious differences between these cores and the chips Intel uses in consumer hardware. Traditional Atom doesn’t support Hyper-Threading on the consumer side, while Knights Landing supports four threads per core. Knights Landing also supports AVX-512 extensions.


The new Xeon Phi runs at roughly 1.3GHz and is split into tiles, as Anandtech reports. There are two cores (eight threads) per tile along with two VPUs (Vector Proc essing Units, aka AVX-512 units). Each tile shares 1MB of L2 cache (36MB cache total). Unlike first-gen Xeon Phi, Knights Landing can actually run the OS natively and out-of-order performance is supposedly much improved compared to the P54C-derived chips that powered Knights Ferry. The chip includes ten memory controllers — two for DDR4 (six channels total) and eight MCDRAM controllers for a total of 16GB of on-chip memory and six channels of DDR4-2400 (up to 386GB total, according to Anandtech.).


Memory accesses can be mapped in different ways, depending on which model best suits the target workload. The 72 CPU cores can treat the entire MCDRAM space as a giant cache, but if they do, accessing main memory incurs a greater penalty in the event of a cache miss. Alternately, data can be flat mapped to both the DDR4 and MCDRAM and accessed that way. Finally, some MCDRAM can be mapped as a cache (with a higher-latency DDR4 fallback) while other MCDRAM is mapped as main memory with less overall latency. The card connects to the rest of the system via 36 PCIe 3.0 lanes. That’s 36GB/s of memory bandwidth in each direction (72GB/s of bandwidth in total) assuming that all 36 lanes can be dedicated to a single co-processor.


The overall image Intel is painting is that of a serious computing powerhouse, with far more horsepower than the previous generation. According to the company, at least some Xeon Phi workstations are going to ship next year. Intel will target researchers who want to work on Xeon Phi but don’t have access to a supercomputer for testing their software. With 3TFLOPS of double precision floating point performance, Xeon Phi can lay fair claim to the title of “Supercomputer on a PCB.” 3TFLOPs might not seem like much compared to the modern TOP500, but it’s more than enough to evaluate test cases and optimizations.

Intel has no plans to offer Xeon Phi in wide release (at least not right now), but if this program proves successful, we could see a limited run of smaller Xeon Phi coprocessors for application acceleration in other contexts. In theory, any well-parallelized workload that can run on x86 should perform well on Xeon Phi, and while we don’t see Intel making a return to the graphics market, it would be interesting to see the chip deployed as a rendering accelerator.

As far as comparisons to Nvidia are concerned, the only Nvidia Tesla that comes close to 3TFLOPS is the dual-GPU K80 GPU compute module. It’s not clear if that solution can match a single Xeon Phi, given that the Nvidia Tesla is scaling across two discrete chips. Future Nvidia products based on Pascal are expected to pack up to 32GB of on-board memory and should substantially improve the relative performance between the two, but we don’t know when that hardware will hit the market.


Why unikernels might kill containers in five years

Sinclair Schuller is the CEO and cofounder of Apprenda, a leader in enterprise Platform as a Service.

Container technologies have received explosive attention in the past year – and rightfully so. Projects like Docker and CoreOS have done a fantastic job at popularizing operating system features that have existed for years by making those features more accessible.

Containers make it easy to package and distribute applications, which has become especially important in cloud-based infrastructure models. Being slimmer than their virtual machine predecessors, containers also offer faster start times and maintain reasonable isolation, ensuring that one application shares infrastructure with another application safely. Containers are also optimized for running many applications on single operating system instances in a safe and compatible way.

So what’s the problem?
Traditional operating systems are monolithic and bulky, even when slimmed down. If you look at the size of a container instance – hundreds of megabytes, if not gigabytes, in size – it becomes obvious there is much more in the instance than just the application being hosted. Having a copy of the OS means that all of that OS’ services and subsystems, whether they are necessary or not, come along for the ride. This massive bulk conflicts with trends in broader cloud market, namely the trend toward microservices, the need for improved security, and the requirement that everything operate as fast as possible.

Containers’ dependence on traditional OSes could be their demise, leading to the rise of unikernels. Rather than needing an OS to host an application, the unikernel approach allows developers to select just the OS services from a set of libraries that their application needs in order to function. Those libraries are then compiled directly into the application, and the result is the unikernel itself.

The unikernel model removes the need for an OS altogether, allowing the application to run directly on a hypervisor or server hardware. It’s a model where there is no software stack at all. Just the app.

There are a number of extremely important advantages for unikernels:

  1. Size – Unlike virtual machines or containers, a unikernel carries with it only what it needs to run that single application. While containers are smaller than VMs, they’re still sizeable, especially if one doesn’t take care of the underlying OS image. Applications that may have had an 800MB image size could easily come in under 50MB. This means moving application payloads across networks becomes very practical. In an era where clouds charge for data ingress and egress, this could not only save time, but also real money.
  2. Speed – Unikernels boot fast. Recent implementations have unikernel instances booting in under 20 milliseconds, meaning a unikernel instance can be started inline to a network request and serve the request immediately. MirageOS, a project led byAnil Madhavapeddy, is working on a new tool named Jitsuthat allows clouds to quickly spin unikernels up and down.
  3. Security – A big factor in system security is reducing surface area and complexity, ensuring there aren’t too many ways to attack and compromise the system. Given that unikernels compile only which is necessary into the applications, the surface area is very small. Additionally, unikernels tend to be “immutable,” meaning that once built, the only way to change it is to rebuild it. No patches or untrackable changes.
  4. Compatibility – Although most unikernel designs have been focused on new applications or code written for specific stacks that are capable of compiling to this model, technology such as Rump Kernels offer the ability to run existing applications as a unikernel. Rump kernels work by componentizing various subsystems and drivers of an OS, and allowing them to be compiled into the app itself.

These four qualities align nicely with the development trend toward microservices, making discrete, portable application instances with breakneck performance a reality. Technologies like Docker and CoreOS have done fantastic work to modernize how we consume infrastructure so microservices can become a reality. However, these services will need to change and evolve to survive the rise of unikernels.

The power and simplicity of unikernels will have a profound impact during the next five years, which at a minimum will complement what we currently call a container, and at a maximum, replace containers altogether. I hope the container industry is ready.