World’s First 1,000-Processor Chip Said to Show Promise Across Multiple Workloads


A blindingly fast microchip, the first to contain 1,000 independent processors and said to show promise for digital signal processing, video processing, encryption and datacenter/cloud workloads, has been announced by a team at the University of California, Davis. The “KiloCore” chip has a maximum computation rate of 1.78 trillion instructions per second and contains 621 million transistors, according to the development team, which presented the microchip at the 2016 Symposium on VLSI Technology and Circuits this month in Honolulu.

By way of comparison, if the KiloCore’s area were the same as a 32 nm Intel Core i7 processor, it would contain approximately 2300-3700 processors and have a peak execution rate of 4.1 to 6.6 trillion independent instructions per second, according to the design team.
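The scaled figures follow directly from the per-processor rate implied by the headline numbers. As a rough check, and assuming (an assumption made here, not a statement from the design team) that every processor contributes equally to the peak rate:

```python
# Figures quoted in the article.
PEAK_INSTR_PER_SEC = 1.78e12   # maximum computation rate of the 1,000-processor chip
PROCESSORS = 1_000

# Assumption: the peak rate is spread evenly over all processors.
PER_PROCESSOR_RATE = PEAK_INSTR_PER_SEC / PROCESSORS


def scaled_peak(processors: int, per_processor_rate: float = PER_PROCESSOR_RATE) -> float:
    """Peak instructions per second for a hypothetical chip with `processors` cores."""
    return processors * per_processor_rate


low = scaled_peak(2_300)    # lower bound of the Core i7-sized estimate
high = scaled_peak(3_700)   # upper bound of the Core i7-sized estimate

print(f"per processor: {PER_PROCESSOR_RATE / 1e9:.2f} billion instr/s")
print(f"scaled chip:   {low / 1e12:.1f} to {high / 1e12:.1f} trillion instr/s")
```

Rounded to one decimal place, the two bounds match the 4.1 and 6.6 trillion instructions per second quoted by the team.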

Bevan Baas, the professor of electrical and computer engineering at UC Davis who led the chip architectural design team, told HPCwire’s sister publication, EnterpriseTech, that there are no current plans to commercialize the processor. He said, however, that it has important commercial implications, with several applications already developed for the chip.


“It has been shown to excel spectacularly with many digital signal processing, wireless coding/decoding, multimedia and embedded workloads, and recent projects have shown that it can also excel at computing kernels for some datacenter/cloud and scientific workloads,” he said. He said the KiloCore innovates in a number of areas covering architectures, application development and mapping, circuits, and VLSI design.

“We hope multiple aspects of KiloCore will influence the design of future computing systems,” he said. “For workloads that can be mapped to its architecture, it could very well have a place in exascale-class computing.”

The design team also claims that KiloCore has the highest clock rate of any processor designed in a university. And while other multiple-processor chips have been created, none exceeds about 300 processors, according to the team.

The KiloCore chip was fabricated by IBM using their 32 nm CMOS technology.

Beyond throughput performance, Baas said KiloCore is also the most energy-efficient many-core processor ever reported. For example, the 1,000 processors can execute 115 billion instructions per second while dissipating only 0.7 Watts, low enough to be powered by a single AA battery. The KiloCore chip executes instructions more than 100 times more efficiently than a modern laptop processor.
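The efficiency claim can be restated as energy per instruction. The sketch below uses only the two figures quoted above; the function name is illustrative, and the result is an average over the whole chip rather than a measurement of any single core.

```python
# Figures quoted in the article for the low-power operating point.
INSTR_PER_SEC = 115e9   # instructions per second across all 1,000 processors
POWER_WATTS = 0.7       # total dissipation at that rate


def energy_per_instruction(watts: float, instr_per_sec: float) -> float:
    """Average energy in joules spent per executed instruction."""
    return watts / instr_per_sec


joules = energy_per_instruction(POWER_WATTS, INSTR_PER_SEC)
print(f"{joules * 1e12:.1f} picojoules per instruction")
```

That works out to roughly 6 picojoules per instruction at this operating point.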

Each processor core can run its own small program independently of the others, Baas explained, which he said is a fundamentally more flexible approach than Single-Instruction-Multiple-Data approaches utilized by processors such as GPUs. The idea is to break an application up into many small pieces, each of which can run in parallel on different processors, enabling high throughput with lower energy use.

The KiloCore architecture is an example of a “fine-grain many-core” processor array, Baas said. Processors are kept as simple as possible so they occupy a small chip area, with numerous cores per chip. “Short low-capacitance wires result in high efficiency,” he said, and “operate at high clock frequencies (high performance in terms of high throughput and low latency).” The cores dissipate little power whether active or idle; in fact, he said, they dissipate exactly zero active power when there is no work to do. Energy efficiency is also achieved by operating at low supply voltages and by a relatively simple architecture: a single-issue 7-stage pipeline with a small amount of memory per core, and a message-passing inter-processor interconnect rather than a cache-based shared-memory model.
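The programming model Baas describes, many small independent programs that exchange messages rather than share memory, can be illustrated in ordinary software. The following is a minimal sketch only: threads and queues stand in for cores and interconnect links, and every name in it (`core`, `run_pipeline`, `STOP`) is hypothetical rather than part of any KiloCore toolchain.

```python
import queue
import threading

STOP = object()  # sentinel that marks the end of the message stream


def core(program, inbox, outbox):
    """One simulated core: apply its own small program to each message and forward it."""
    while True:
        item = inbox.get()
        if item is STOP:
            outbox.put(STOP)  # pass the shutdown signal down the chain
            return
        outbox.put(program(item))


def run_pipeline(programs, data):
    """Chain one simulated core per program, linked only by message queues."""
    links = [queue.Queue() for _ in range(len(programs) + 1)]
    workers = [
        threading.Thread(target=core, args=(prog, links[i], links[i + 1]))
        for i, prog in enumerate(programs)
    ]
    for worker in workers:
        worker.start()
    for item in data:
        links[0].put(item)
    links[0].put(STOP)

    results = []
    while True:
        item = links[-1].get()
        if item is STOP:
            break
        results.append(item)
    for worker in workers:
        worker.join()
    return results


# Three different tiny programs, each running on its own simulated core.
stages = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3]
print(run_pipeline(stages, range(5)))
```

Each stage runs a different program, which is the contrast with SIMD hardware where every lane executes the same instruction. No stage touches another stage's state; the queues are the only coupling.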

Baas said the team has completed a compiler and automatic program mapping tools for use in programming the chip.


Intel will ship Xeon Phi-equipped workstations starting in 2016

Intel has announced that its second-generation Xeon Phi hardware (codenamed Knights Landing) is now shipping to early customers. Knights Landing is built on 14nm process technology, with up to 72 Silvermont-derived CPU cores. While the design is derived from Atom, there are some obvious differences between these cores and the chips Intel uses in consumer hardware. Traditional Atom doesn’t support Hyper-Threading on the consumer side, while Knights Landing supports four threads per core. Knights Landing also supports AVX-512 extensions.


The new Xeon Phi runs at roughly 1.3GHz and is split into tiles, as Anandtech reports. There are two cores (eight threads) per tile along with two VPUs (Vector Processing Units, aka AVX-512 units). Each tile shares 1MB of L2 cache (36MB cache total). Unlike first-gen Xeon Phi, Knights Landing can actually run the OS natively, and out-of-order performance is supposedly much improved compared to the P54C-derived chips that powered Knights Ferry. The chip includes ten memory controllers: two for DDR4 (six channels of DDR4-2400 in total, for up to 384GB according to Anandtech) and eight MCDRAM controllers for a total of 16GB of on-chip memory.


Memory accesses can be mapped in different ways, depending on which model best suits the target workload. The 72 CPU cores can treat the entire MCDRAM space as a giant cache, but if they do, accessing main memory incurs a greater penalty in the event of a cache miss. Alternatively, data can be flat mapped to both the DDR4 and MCDRAM and accessed that way. Finally, some MCDRAM can be mapped as a cache (with a higher-latency DDR4 fallback) while other MCDRAM is mapped as main memory with less overall latency. The card connects to the rest of the system via 36 PCIe 3.0 lanes. That’s roughly 36GB/s of bandwidth in each direction (72GB/s of bandwidth in total), assuming that all 36 lanes can be dedicated to a single co-processor.
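The 36GB/s figure is a round number of about 1GB/s per lane. A slightly more careful estimate, sketched below, uses the PCIe 3.0 signalling rate of 8 GT/s per lane and its 128b/130b line encoding; it ignores packet and protocol overhead, so real payload throughput would be lower still. The function names are illustrative.

```python
PCIE3_TRANSFERS_PER_SEC = 8e9   # PCIe 3.0 signalling rate per lane (8 GT/s)
PCIE3_ENCODING = 128 / 130      # 128b/130b encoding: 128 payload bits per 130 sent
LANES = 36                      # lane count quoted in the article


def lane_bytes_per_sec() -> float:
    """Raw one-direction bandwidth of a single PCIe 3.0 lane, in bytes per second."""
    return PCIE3_TRANSFERS_PER_SEC * PCIE3_ENCODING / 8


def link_bytes_per_sec(lanes: int) -> float:
    """Raw one-direction bandwidth of a link with `lanes` lanes."""
    return lanes * lane_bytes_per_sec()


one_way = link_bytes_per_sec(LANES)
print(f"per lane:        {lane_bytes_per_sec() / 1e9:.3f} GB/s")
print(f"each direction:  {one_way / 1e9:.1f} GB/s")
print(f"both directions: {2 * one_way / 1e9:.1f} GB/s")
```

Under these assumptions the link carries a little under 35.5GB/s each way, which the article rounds up to 36GB/s.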


The overall image Intel is painting is that of a serious computing powerhouse, with far more horsepower than the previous generation. According to the company, at least some Xeon Phi workstations are going to ship next year. Intel will target researchers who want to work on Xeon Phi but don’t have access to a supercomputer for testing their software. With 3TFLOPS of double precision floating point performance, Xeon Phi can lay fair claim to the title of “Supercomputer on a PCB.” 3TFLOPS might not seem like much compared to the modern TOP500, but it’s more than enough to evaluate test cases and optimizations.
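The 3TFLOPS figure can be reproduced from the core count and clock speed reported for the chip. The sketch below rests on two assumptions that are not spelled out in the article: that each core has two AVX-512 vector units, and that each unit retires one fused multiply-add (counted as two floating-point operations) per lane per cycle. A 512-bit register holds eight 64-bit doubles.

```python
CORES = 72                     # core count from the article
CLOCK_HZ = 1.3e9               # "roughly 1.3GHz" from the article
VPUS_PER_CORE = 2              # assumption: two AVX-512 units per core
DP_LANES_PER_VPU = 512 // 64   # eight double-precision values per 512-bit vector
FLOPS_PER_FMA = 2              # assumption: one multiply plus one add per lane per cycle


def peak_dp_flops(cores: int = CORES, clock_hz: float = CLOCK_HZ) -> float:
    """Theoretical peak double-precision FLOPS under the assumptions above."""
    return cores * VPUS_PER_CORE * DP_LANES_PER_VPU * FLOPS_PER_FMA * clock_hz


peak = peak_dp_flops()
print(f"{peak / 1e12:.2f} TFLOPS double precision")
```

This is a theoretical ceiling; sustained performance on real code depends on how well the workload vectorizes and how it uses the memory hierarchy.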

Intel has no plans to offer Xeon Phi in wide release (at least not right now), but if this program proves successful, we could see a limited run of smaller Xeon Phi coprocessors for application acceleration in other contexts. In theory, any well-parallelized workload that can run on x86 should perform well on Xeon Phi, and while we don’t see Intel making a return to the graphics market, it would be interesting to see the chip deployed as a rendering accelerator.

As far as comparisons to Nvidia are concerned, the only Nvidia Tesla that comes close to 3TFLOPS is the dual-GPU K80 GPU compute module. It’s not clear if that solution can match a single Xeon Phi, given that the Nvidia Tesla is scaling across two discrete chips. Future Nvidia products based on Pascal are expected to pack up to 32GB of on-board memory and should substantially improve the relative performance between the two, but we don’t know when that hardware will hit the market.


POWER8 packs more than twice the big data punch

Last week, in London, the top minds in financial technology came face to face at the STAC Summit to discuss the technology challenges facing the financial industry. The impact big data is having on the financial industry was a hot topic, and many discussions revolved around having the best technology on hand for handling data faster to gain a competitive advantage.

It was in this setting that IBM shared recent benchmark results revealing that an IBM POWER8-based server can deliver more than twice the performance of the best x86 server when running standard financial industry workloads.


In fact, IBM’s POWER8-based systems set four new performance records for financial workloads. According to the most recent published and certified STAC-A2 benchmarks, IBM POWER8-based systems deliver:

  • 3x performance over the best 2-socket solution using x86 CPUs
  • 1x performance for path scaling over the best 4-socket solution using x86 CPUs (NEW PUBLIC RECORD)
  • 7x performance over the best 2 x86 CPU and one Xeon Phi co-processor
  • 16 percent increase for asset capacity over the best 4-socket solution (NEW PUBLIC RECORD)

Why the financial industry created the STAC benchmark

STAC-A2 is a set of standard benchmarks that help estimate the relative performance of full systems running complete financial applications. This enables clients in the financial industry to evaluate how IBM POWER8-based systems will perform on real applications.

STAC-A2 gives a much more accurate view of expected performance than micro-benchmarks or simple code loops. STAC recently performed STAC-A2 Benchmark tests on a stack consisting of the STAC-A2 Pack for Linux on an IBM Power System S824 server using two IBM POWER8 processor cards at 3.52 GHz and 1TB of DRAM, with Red Hat Enterprise Linux version 7.

And, as reported above, according to audited STAC results, the IBM Power System S824 delivered more than twice the performance of the best x86 based server measured. Those are the kind of results that matter—real results for real client challenges.

POWER8 processors are based on high-performance, multi-threaded cores, with each core of the Power System S824 server running up to eight simultaneous threads at 3.5 GHz. The Power System S824 also has a very high-bandwidth memory interface that runs at 192 GB/s per socket, which is almost three times the speed of a typical x86 processor. These factors, along with a balanced system structure that includes a large internal L3 cache of 8MB per core, are the primary reasons why financial computing workloads run significantly faster on POWER8-based systems than on alternatives.

The STAC-A2 financial industry benchmarks add to the performance data that Cabot Partners published recently. Cabot compared POWER8-based systems with x86-based systems, evaluating functionality, performance and price/performance across several industries, including life sciences, financial services, oil and gas, and analytics, and referencing standard benchmarks as well as application-oriented benchmark data.

The findings in the STAC-A2 benchmarking report position POWER8 as the ideal platform for the financial industry. This data, combined with the recently published Cabot Partners report, represents overwhelming proof that IBM POWER8-based systems take the performance lead in the financial services space (and beyond)—clearly packing a stronger punch when compared to the competition.



Hard disk reliability examined once more: HGST rules, Seagate is alarming


A year ago we got some insight into hard disk reliability when cloud backup provider Backblaze published its findings for the tens of thousands of disks that it operated. Backblaze uses regular consumer-grade disks in its storage because of the cheaper cost and good-enough reliability, but it also discovered that some kinds of disks fared extremely poorly when used 24/7.

A year later the company has collected even more data and drawn out even more differences between the different disks it uses.

For a second year, the standout reliability leader was HGST. Now a wholly owned subsidiary of Western Digital, HGST inherited the technology and designs from Hitachi (which itself bought IBM’s hard disk division). Across a range of models from 2 to 4 terabytes, the HGST models showed low failure rates: at worst, 2.3 percent failing a year. This includes some of the oldest disks in Backblaze’s collection; 2TB Deskstar 7K2000 models are on average 3.9 years old, but still have a failure rate of just 1.1 percent.

At the opposite end of the spectrum are Seagate disks. Last year, the two 1.5TB Seagate models used by Backblaze had failure rates of 25.4 percent (for the Barracuda 7200.11) and 9.9 percent (for the Barracuda LP). Those units fared a little better this time around, with failure rates of 23.8 and 9.6 percent, even though they were the oldest disks in the test (average ages of 4.7 and 4.9 years, respectively). However, their poor performance was eclipsed by the 3TB Barracuda 7200.14 units, which had a whopping 43.1 percent failure rate, in spite of an average age of just 2.2 years.
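Failure rates like these are annualized: they relate the number of failures to the total time the drives spent in service, so that models deployed for different lengths of time can be compared. A minimal sketch of that calculation follows, assuming the common definition of failures divided by drive-years; the input numbers in the example are invented for illustration and are not Backblaze data.

```python
def annual_failure_rate(failures: int, drive_days: float) -> float:
    """Failures per drive-year of operation, as a fraction (0.026 means 2.6 percent)."""
    if drive_days <= 0:
        raise ValueError("drive_days must be positive")
    drive_years = drive_days / 365
    return failures / drive_years


# Hypothetical fleet: 1,000 drives that each ran for a full year, with 26 failures.
rate = annual_failure_rate(failures=26, drive_days=1_000 * 365)
print(f"{rate:.1%} annual failure rate")
```

Because the denominator is time in service rather than drive count, a model that was only deployed for a few months is not unfairly flattered by having had less time to fail.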

Backblaze’s storage is largely split between Seagate and HGST disks. HGST’s parent company, Western Digital, is almost absent, not because its disks are bad, but because they came out as consistently more expensive than those from Seagate and HGST.


Newer Seagate disks also show more encouraging results. Although still young, at an average age of just 0.9 years, the 4TB HDD.15 models show a reasonably low 2.6 percent failure rate. Coupled with their low price—Backblaze says that they tend to undercut HGST’s disks—they’ve become the company’s preferred hard drive model.

As before, this doesn’t mean that anyone with a Seagate disk is at risk of an imminent hard disk failure (though you should always have backups!). Backblaze operates disks outside of the manufacturer’s specified parameters. Significantly, most consumer-grade disks aren’t intended to be heavily used 24/7; they’re meant to be operational for about 8 hours a day and replaced every 3 to 5 years. Most home usage environments are likely to be lower in vibration than Backblaze’s 45-disk storage pods, too. In more normal conditions, the Seagates are likely to fare much better.


DDN Pushes the Envelope for Parallel Storage I/O

Today at Supercomputing 2014, DataDirect Networks lifted the veil a bit more on Infinite Memory Engine (IME), its new software that will employ Flash storage and a bunch of smart algorithms to create a buffer between HPC compute and parallel file system resources, with the goal of improving file I/O by up to 100x. The company also announced the latest release of EXAScaler, its Lustre-based storage appliance lineup.

The data patterns have been changing at HPC sites in a way that is creating bottlenecks in the I/O. While many HPC shops may think they’re primarily working with large and sequential files, the reality is that most data is relatively small and random, and that fragmented I/O creates problems when moving the data across the interconnect, says Jeff Sisilli, Sr. Director Product Marketing at DataDirect Networks.

“Parallel file systems were really built for large files,” Sisilli tells HPCwire. “What we’re finding is 90 percent of typical I/O in HPC data centers utilizes small files, those less than 32KB. What happens is, when you inject those into a parallel file system, it starts to really bring down performance.”

DDN says it overcame the restrictions in how parallel file systems were created with IME, which creates a storage tier above the file system and provides a “fast data” layer between the compute nodes in an HPC cluster and the backend file system. The software, which resides on the I/O nodes in the cluster, utilizes any available Flash solid state drives (SSDs) or other non-volatile memory (NVM) storage resources available, creating a “burst buffer” to absorb peak loads and eliminate I/O contention.

IME works in two ways. First, it removes limitations of the POSIX layer, such as file locks, that can slow down communication. Second, its algorithms bundle up the small and random I/O operations into larger files that can be more efficiently read into the file system.
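The second idea, gathering many small writes and handing the file system a few large ones, is a general buffering technique. The sketch below illustrates it in its simplest form and is not DDN's algorithm: the class name, the `sink` callback and the stripe size are all hypothetical, and it treats the incoming writes as an append-only stream, whereas a real burst buffer would also have to track which file and offset each fragment belongs to.

```python
class CoalescingBuffer:
    """Accumulate small writes and emit them to `sink` in full stripe-sized chunks."""

    def __init__(self, stripe_size: int, sink):
        if stripe_size <= 0:
            raise ValueError("stripe_size must be positive")
        self.stripe_size = stripe_size
        self.sink = sink                # callable that receives each large write
        self._pending = bytearray()     # small writes waiting to fill a stripe

    def write(self, data: bytes) -> None:
        """Absorb a small write; forward only complete stripes."""
        self._pending += data
        while len(self._pending) >= self.stripe_size:
            self.sink(bytes(self._pending[: self.stripe_size]))
            del self._pending[: self.stripe_size]

    def flush(self) -> None:
        """Drain whatever is left as one final, possibly short, write."""
        if self._pending:
            self.sink(bytes(self._pending))
            self._pending.clear()


# 100 small writes of 100 bytes each reach the backend as 10 large writes.
backend_writes = []
buf = CoalescingBuffer(stripe_size=1024, sink=backend_writes.append)
for _ in range(100):
    buf.write(b"x" * 100)
buf.flush()
print(len(backend_writes), "backend writes instead of 100")
```

The backend sees nine full 1,024-byte stripes and one short tail, so the cost of each operation is amortized over far more data.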

In lab tests at a customer site, DDN ran IME against the S3D turbulent flow modeling software. The software was really designed for larger sequential files, but is often used in the real world with smaller and random files. In the customer’s case, these “mal-aligned and fragmented” files were causing I/O throughput across the InfiniBand interconnect to drop to 25 MB per second.

After introducing IME, the customer was able to ingest data from the compute cluster onto IME’s SSDs at line rate. “This customer was using InfiniBand, and we were able to fill up InfiniBand all the way to line rate, and absorb at 50 GB per second,” Sisilli says.

The data wasn’t written back into the file system quite that quickly. But because the algorithms were able to align all those small files and convert fragments into full stripe writes, it did provide a speed up compared to 25MB per second. “We were able to drain out the buffer and write to the parallel file system at 4GB per second, which is two orders of magnitude faster than before,” Sisilli says.
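The "two orders of magnitude" claim can be checked against the quoted rates. The short calculation below assumes decimal units (1 GB = 1,000 MB), which the article does not specify.

```python
import math

MB = 10**6   # assumption: decimal megabytes
GB = 10**9   # assumption: decimal gigabytes

before = 25 * MB    # fragmented writes straight to the file system
drain = 4 * GB      # buffer draining to the parallel file system
ingest = 50 * GB    # compute cluster writing into the buffer

drain_speedup = drain / before
ingest_speedup = ingest / before

print(f"drain:  {drain_speedup:.0f}x ({math.log10(drain_speedup):.1f} orders of magnitude)")
print(f"ingest: {ingest_speedup:.0f}x ({math.log10(ingest_speedup):.1f} orders of magnitude)")
```

The drain rate is 160 times the original throughput, which is indeed a little over two orders of magnitude, while the ingest rate seen by the compute nodes is 2,000 times higher.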

The “net net” of IME, Sisilli says, is it frees up HPC compute cluster resources. “From the parallel file system side, we’re able to shield the parallel file system and underlying storage arrays from fragmented I/O, and have those be able to ingest optimized data and utilize much less hardware to be able to get to the performance folks need up above,” he says.

IME will work with any Lustre- or GPFS-based parallel file system. That includes DDN’s own EXAScaler line of Lustre-based storage appliances, or the storage appliances of any other vendor. There are no application modifications required to use IME, which also features data erasure encoding capabilities typically found in object file stores. The only requirements are that the application is POSIX compliant or uses the MPI job scheduler. DDN also provides an API that customers can use if they want to modify their apps to work with IME; the company has plans to create an ecosystem of compatible tools using this API.

There are other vendors developing similar Flash-based storage buffer offerings. But DDN says the fact that it’s taking an open, software-based approach gives customers an advantage over those vendors that require customers to purchase specialized hardware, or those that work with only certain types of interconnects.


IME isn’t available yet; it’s still in technology preview mode. But when it becomes available, scalability won’t be an issue. The software will be able to corral and make available petabytes worth of Flash or NVM storage resources living across thousands of nodes, Sisilli says. “What we’re recommending is [to have in IME] anywhere between two to three [times the] amount of your compute cluster memory to have a great working space within IME to accelerate your applications and do I/O,” he says. “That can be all the way down to terabytes, and for supercomputers, it’s multi petabytes.”

IME is still undergoing tests, and is expected to become generally available in the second quarter of 2015. DDN will offer it as an appliance or as software.

DDN also today unveiled a new release of EXAScaler. With Version 2.1, DDN has improved read and write I/O performance by 25 percent. That will give DDN a comfortable advantage over competing file systems for some time, says Roger Goff, Sr. Product Manager for DDN.

“We know what folks are about to announce because they pre-announce those things,” Goff says. “Our solution is tremendously faster than what you will see [from other vendors], particularly on a per-rack performance basis.”

Other new features in version 2.1 include support for self-encrypting drives; improved rebuild times; InfiniBand optimizations; and better integration with DDN’s Storage Fusion Xcelerator (SFX Flash Caching) software.

DDN has also standardized on the Lustre file system from Intel, called Intel Enterprise Edition for Lustre version 2.5. That brings it several new capabilities, including a new MapReduce connector for running Hadoop workloads.

“So instead of having data replicated across multiple nodes in the cluster, which is the native mode for HDFS, with this adapter, you can run those Hadoop applications and take advantages of the single-copy nature of a parallel file system, yet have the same capability of a parallel file system to scale to thousands and thousands of clients accessing that same data,” Goff says.

EXAScaler version 2.1 is available now across all three EXAScaler products, including the entry-level SFA7700, the midrange ES12k/SFA12k-20, and the high-end SFA12KX/SFA212k-40.


Facebook has built its own switch. And it looks a lot like a server


SUMMARY: Facebook has built its own networking switch and developed a Linux-based operating system to run it. The goal is to create networking infrastructure that mimics a server in terms of how it’s managed and configured.

Not content to remake the server, Facebook’s engineers have taken on the humble switch, building their own version of the networking box and the software to go with it. The resulting switch, dubbed Wedge, and the software called FBOSS will be provided to the Open Compute Foundation as an open source design for others to emulate. Facebook is already testing it with production traffic in its data centers.

Jay Parikh, the VP of infrastructure engineering at Facebook, shared the news of the switch onstage at the Gigaom Structure event Wednesday, explaining that Facebook’s goal in creating this project was to eliminate the network engineer and run its networking operations in the same easily swapped out and dynamic fashion as its servers. In many ways Facebook’s efforts with designing its own infrastructure have stemmed from the need to build hardware that was as flexible as the software running on top of it. It makes no sense to be innovating all the time with your code if you can’t adjust the infrastructure to run that code efficiently.


And networking has long been a frustrating aspect of IT infrastructure because it has been a black box that both delivered packets and did the computing to figure out the path those packets should take. But as networks scaled out, that combination (and the domination of the market by giants Cisco and Juniper) was becoming untenable. Thus the physical delivery of packets and the routing of those packets were split into two jobs, allowing networks to become software-defined and allowing other companies to start innovating.

The creation of a custom-designed switch that allows Facebook to control its networking like it currently manages its servers has been a long time coming. It began the Open Compute effort with a redesigned server in 2011 and focused on servers and a bit of storage for the next two years. In May 2013 it called for vendors to submit designs for an open source switch, and at last year’s Structure event Parikh detailed Facebook’s new networking fabric that allowed the social networking giant to move large amounts of traffic more efficiently.

But the combination of the modular hardware approach to the Wedge switch and the Linux-based FBOSS operating system blows the switch apart in the same way Facebook blew the server apart. The switch will use the Group Hug microprocessor boards so any type of chip could slot into the box to control configuration and run the OS. The switch will still rely on a networking processor for routing and delivery of the packets and has a throughput of 640 Gbps, but eventually Facebook could separate the transport and decision-making process.

The whole goal here is to turn the monolithic switch into something that is modular and controlled by the FBOSS software that can be updated as needed without having to learn proprietary networking languages required by other providers’ gear. The question with Facebook’s efforts here is how it will affect the larger market for networking products.

Facebook’s infrastructure is relatively unique in that it wholly controls it and has the engineering talent to build software and new hardware to meet its computing needs. Google is another company that has built its own networking switch, but it didn’t open source those designs and keeps them close. But many enterprise customers don’t have the technical expertise of a web giant, so the improvements that others contribute to the gear and the software through the Open Compute Foundation will likely influence adoption.


Microway Rolls out Octoputer Servers with up to 8 GPUs

Today Microway announced a new line of servers designed for GPU and storage density. As part of the announcement, the company’s new OctoPuter GPU servers pack 34 TFLOPS of computing power when paired with up to eight NVIDIA Tesla K40 GPU accelerators.

“NVIDIA GPU accelerators offer the fastest parallel processing power available, but this requires high-speed access to the data. Microway’s newest GPU computing solutions ensure that large amounts of source data are retained in the same server as a high density of Tesla GPUs. The result is faster application performance by avoiding the bottleneck of data retrieval from network storage,” said Stephen Fried, CTO of Microway.

Microway also introduced an additional NumberSmasher 1U GPU server housing up to three NVIDIA Tesla K40 GPU accelerators. With nearly 13 TFLOPS of computing power, the NumberSmasher includes up to 512GB of memory, 24 x86 compute cores, hardware RAID, and optional InfiniBand.
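Both headline figures are consistent with simply adding up the accelerators. The sketch below assumes a per-GPU peak of 4.29 TFLOPS, NVIDIA's published single-precision figure for the Tesla K40 at base clock; that number is not stated in the announcement, and it implies the server totals are single-precision peaks rather than double-precision ones.

```python
# Assumption: NVIDIA's published single-precision peak for one Tesla K40 (base clock).
K40_SP_TFLOPS = 4.29


def aggregate_tflops(gpus: int, per_gpu_tflops: float = K40_SP_TFLOPS) -> float:
    """Sum of theoretical peak throughput across identical accelerators."""
    return gpus * per_gpu_tflops


octoputer = aggregate_tflops(8)       # eight-GPU OctoPuter configuration
numbersmasher = aggregate_tflops(3)   # three-GPU NumberSmasher 1U configuration

print(f"OctoPuter:     {octoputer:.2f} TFLOPS")
print(f"NumberSmasher: {numbersmasher:.2f} TFLOPS")
```

Eight GPUs give about 34 TFLOPS and three give just under 13, matching the quoted totals. These are aggregate peaks, so real applications will see less, particularly if they cannot keep every GPU busy.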