Facebook has built its own switch. And it looks a lot like a server


SUMMARY: Facebook has built its own networking switch and developed a Linux-based operating system to run it. The goal is to create networking infrastructure that mimics a server in terms of how it's managed and configured.

Not content to remake the server, Facebook’s engineers have taken on the humble switch, building their own version of the networking box and the software to go with it. The resulting switch, dubbed Wedge, and the software called FBOSS will be provided to the Open Compute Foundation as an open source design for others to emulate. Facebook is already testing it with production traffic in its data centers.

Jay Parikh, the VP of infrastructure engineering at Facebook, shared the news onstage at the Gigaom Structure event Wednesday, explaining that Facebook’s goal in creating this project was to eliminate the network engineer and run its networking operations in the same easily swapped-out and dynamic fashion as its servers. In many ways Facebook’s efforts in designing its own infrastructure have stemmed from the need to build hardware that was as flexible as the software running on top of it. It makes no sense to be innovating all the time with your code if you can’t adjust the infrastructure to run that code efficiently.


And networking has long been a frustrating aspect of IT infrastructure because it has been a black box that both delivered packets and also did the computing to figure out the path those packets should take. But as networks scaled out, that combination, along with the domination of the market by giants Cisco and Juniper, was becoming untenable. So the physical delivery of packets and the decision-making about where those packets should go were split into two jobs, allowing networks to become software-defined and allowing other companies to start innovating.

The creation of a custom-designed switch that allows Facebook to control its networking like it currently manages its servers has been a long time coming. It began the Open Compute effort with a redesigned server in 2011 and focused on servers and a bit of storage for the next two years. In May 2013 it called for vendors to submit designs for an open source switch, and at last year’s Structure event Parikh detailed Facebook’s new networking fabric that allowed the social networking giant to move large amounts of traffic more efficiently.

But the combination of the modular hardware approach to the Wedge switch and the Linux-based FBOSS operating system blows the switch apart in the same way Facebook blew the server apart. The switch will use the Group Hug microprocessor boards, so any type of chip can slot into the box to control configuration and run the OS. The switch will still rely on a networking processor for routing and delivery of the packets and has a throughput of 640 Gbps, but eventually Facebook could separate the transport and the decision-making process.

The whole goal here is to turn the monolithic switch into something modular, controlled by the FBOSS software, which can be updated as needed without operators having to learn the proprietary networking languages required by other providers’ gear. The question with Facebook’s efforts here is how they will affect the larger market for networking products.

Facebook’s infrastructure is unusual in that the company wholly controls it and has the engineering talent to build software and new hardware to meet its computing needs. Google is another company that has built its own networking switch, but it didn’t open source those designs and keeps them close. Many enterprise customers don’t have the technical expertise of a web giant, however, so the tweaks that others contribute to the Open Compute Foundation to improve the gear and the software will likely influence adoption.

(Via GigaOm.com)

Microway Rolls out Octoputer Servers with up to 8 GPUs

Today Microway announced a new line of servers designed for GPU and storage density. As part of the announcement, the company’s new OctoPuter GPU servers pack 34 TFLOPS of computing power when paired with up to eight NVIDIA Tesla K40 GPU accelerators.

“NVIDIA GPU accelerators offer the fastest parallel processing power available, but this requires high-speed access to the data. Microway’s newest GPU computing solutions ensure that large amounts of source data are retained in the same server as a high density of Tesla GPUs. The result is faster application performance by avoiding the bottleneck of data retrieval from network storage,” said Stephen Fried, CTO of Microway.

Microway also introduced an additional NumberSmasher 1U GPU server housing up to three NVIDIA Tesla K40 GPU accelerators. With nearly 13 TFLOPS of computing power, the NumberSmasher includes up to 512GB of memory, 24 x86 compute cores, hardware RAID, and optional InfiniBand.
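
These headline numbers line up with NVIDIA’s quoted single-precision peak of roughly 4.29 teraflops per Tesla K40 (a figure from NVIDIA’s spec sheet rather than from the Microway announcement, so treat this as a back-of-the-envelope check):

\[ 8 \times 4.29\ \text{TFLOPS} \approx 34.3\ \text{TFLOPS}, \qquad 3 \times 4.29\ \text{TFLOPS} \approx 12.9\ \text{TFLOPS} \]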


(Via InsideHPC.com)

A3Cube develops Extreme Parallel Storage Fabric, 7x InfiniBand

News from EETimes points towards a startup that claims to offer an extreme performance advantage over InfiniBand.  A3Cube Inc. has developed a variation of PCI Express on a network interface card to offer lower latency.  The company is promoting its Ronniee Express technology via a PCIe 2.0-driven FPGA, offering sub-microsecond latency across a 128-server cluster.

In the Sockperf benchmark, A3Cube’s numbers put performance at around 7x that of InfiniBand and PCIe 3.0 x8, and the company thus claims that its approach beats the top alternatives.  The device’s PCIe support at the physical layer enables quality-of-service features, and A3Cube claims the fabric enables a cluster of 10,000 nodes to be represented in a single image without congestion.

A3Cube is aiming primarily at HFT, genomics, oil/gas exploration, and real-time data analytics.  Prototypes for merchants are currently being worked on, and two versions of network cards and a 1U switch based on the technology are expected to be available before July.

The new IP from A3Cube is kept hidden away, but the logic points towards device enumeration and the extension of the PCIe root complex across a cluster of systems.  This is suggested by the company’s comment that PCIe 3.0 is incompatible because of the different device enumeration in that specification.  The plan is to build a solid platform on PCIe 4.0, which puts the technology several years away in terms of non-specialized deployment.

As with many startups, the next step for A3Cube is to secure venture funding.  The Ronniee Express approach differs from that of PLX, which is developing a direct PCIe interconnect for computer racks.

A3Cube’s webpage on the technology states that the fabric uses a combination of hardware and software while remaining application-transparent.  The product combines multiple 20 or 40 Gbit/s channels, and is aimed at petabyte-scale Big Data and HPC storage systems.

Information from Willem Ter Harmsel describes the Ronniee NIC system as a global shared-memory container, with an in-memory network between nodes.  CPU, memory, and IO are directly connected, with 800-900 nanosecond latencies, and the ‘memory windows’ facilitate low-latency traffic.

Using A3Cube’s storage OS, byOS, 40 terabytes of SSDs, and the Ronniee Express fabric, five storage nodes were connected together via four links per NIC, allowing for 810ns latency in any direction.  A3Cube claims 4 million IOPS with this setup.

Further, an interview between Willem and Antonella Rubicco notes that “Ronniee is designed to build massively parallel storage and analytics machines; not to be used as an “interconnection” as Infiniband or Ethernet.  It is designed to accelerate applications and create parallel storage and analytics architecture.”

(via AnandTech.com)

 

Virident vCache vs. FlashCache

Ease of basic installation. The setup process was simply a matter of installing two RPMs and running a couple of commands to enable vCache on the PCIe flash card (a Virident FlashMAX II) and set up the cache device with the command-line utilities supplied with one of the RPMs. Moreover, the vCache software is built in to the Virident driver, so there is no additional module to install. FlashCache, on the other hand, requires building a separate kernel module in addition to whatever flash memory driver you’ve already had to install, and then further configuration requires modification to assorted sysctls. I would also argue that the vCache documentation is superior. Winner: vCache.

Ease of post-setup modification / advanced installation. Many of the FlashCache device parameters can be easily modified by echoing the desired value to the appropriate sysctl setting; with vCache, there is a command-line binary which can modify many of the same parameters, but doing so requires a cache flush, detach, and reattach. Winner: FlashCache.

Operational Flexibility: Both solutions share many features here; both of them allow whitelisting and blacklisting of PIDs or simply running in a “cache everything” mode. Both of them have support for not caching sequential IO, adjusting the dirty page threshold, flushing the cache on demand, or having a time-based cache flushing mechanism, but some of these features operate differently with vCache than with FlashCache. For example, a manual cache flush with vCache is a blocking operation. With FlashCache, echoing “1” to the do_sync sysctl of the cache device triggers a cache flush, but it happens in the background, and while countdown messages are written to syslog as the operation proceeds, the device never reports that it’s actually finished. I think both kinds of flushing are useful in different situations, and I’d like to see a non-blocking background flush in vCache, but if I had to choose one or the other, I’ll take blocking and modal over fire-and-forget any day. FlashCache does have the nice ability to switch between FIFO and LRU for its flushing algorithm; vCache does not. This is something that could prove useful in certain situations. Winner: FlashCache.
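
To make the contrast concrete, here is a rough sketch of the FlashCache side of that workflow from the shell. This is a hedged illustration: the sysctl names follow the pattern used in the FlashCache documentation, and <disk+cachedev> is a placeholder for whatever device pair the cache was created with, so check your own system before copying anything.

# Runtime tuning is done by writing values to the cache device's sysctls
# (placeholder device name shown; substitute your actual disk/cache pair):
sysctl -w dev.flashcache.<disk+cachedev>.fallow_delay=0
sysctl -w dev.flashcache.<disk+cachedev>.dirty_thresh_pct=50

# Trigger a cache flush; the command returns immediately and the flush
# proceeds in the background, logging countdown messages to syslog:
sysctl -w dev.flashcache.<disk+cachedev>.do_sync=1

The equivalent parameter changes on the vCache side go through its command-line utility and, as noted above, require a cache flush, detach, and reattach.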

Operational Monitoring: Both solutions offer plenty of statistics; the main difference is that FlashCache stats can be pulled from /proc but vCache stats have to be retrieved by running the vgc-vcache-monitor command. Personally, I prefer “cat /proc/something” but I’m not sure that’s sufficient to award this category to FlashCache. Winner: None.
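
In practice the two monitoring paths look roughly like this (again a sketch: the /proc path follows FlashCache's usual layout for a cache named at creation time, and the vCache tool's options are omitted, so consult each product's documentation for the exact invocation):

# FlashCache counters are exposed under /proc:
cat /proc/flashcache/<cachedev>/flashcache_stats

# vCache statistics come from the bundled monitoring utility instead:
vgc-vcache-monitor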

Time-based Flushing: This wouldn’t seem like it should be a separate category, but because the behavior seems to be so different between the two cache solutions, I’m listing it here. The vCache manual indicates that “flush period” specifies the time after which dirty blocks will be written to the backing store, whereas FlashCache has a setting called “fallow_delay”, defined in the documentation as the time period before “idle” dirty blocks are cleaned from the cache device. It is not entirely clear whether or not these mechanisms operate in the same fashion, but based on the documentation, it appears that they do not. I find the vCache implementation more useful than the one present in FlashCache. Winner: vCache.

Although nobody likes a tie, if you add up the scores, usability is a 2-2-1 draw between vCache and FlashCache. There are things that I really liked better with FlashCache, and there are other things that I thought vCache did a much better job with. If I absolutely must pick a winner in terms of usability, then I’d give a slight edge to FlashCache due to configuration flexibility, but if the GA release of vCache added some of FlashCache’s additional configuration options and exposed statistics via /proc, I’d vote in the other direction.

Disclosure: The research and testing conducted for this post were sponsored by Virident.

First, some background information. All tests were conducted on Percona’s Cisco UCS C250 test machine, and both the vCache and FlashCache tests used the same 2.2TB Virident FlashMAX II as the cache storage device. EXT4 is the filesystem, and CentOS 6.4 the operating system, although the pre-release modules I received from Virident required the use of the CentOS 6.2 kernel, 2.6.32-220, so that was the kernel in use for all of the benchmarks on both systems. The benchmark tool used was sysbench 0.5 and the version of MySQL used was Percona Server 5.5.30-rel30.1-465. Each test was allowed to run for 7200 seconds, and the first 3600 seconds were discarded as warmup time; the remaining 3600 seconds were averaged into 10-second intervals. All tests were conducted with approximately 78GiB of data (32 tables, 10M rows each) and a 4GiB buffer pool. The cache devices were flushed to disk immediately prior to and immediately following each test run.

With that out of the way, let’s look at some numbers.

vCache vs. vCache – MySQL parameter testing

The first test was designed to look solely at vCache performance under some different sets of MySQL configuration parameters. For example, given that the front-end device is a very fast PCIe SSD, would it make more sense to configure MySQL as if it were using SSD storage or to just use an optimized HDD storage configuration? After creating a vCache device with the default configuration, I started with a baseline HDD configuration for MySQL (configuration A, listed at the bottom of this post) and then tried three additional sets of experiments. First, the baseline configuration plus:

innodb_read_io_threads = 16
innodb_write_io_threads = 16

We call this configuration B. The next one contained four SSD-specific optimizations based partially on some earlier work that I’d done with this Virident card (configuration C):

innodb_io_capacity = 30000
innodb_adaptive_flushing_method = keep_average
innodb_flush_neighbor_pages=none
innodb_max_dirty_pages_pct = 60

And then finally, a fourth test (configuration D) which combined the parameter changes from tests B and C. The graph below shows the sysbench throughput (tps) for these four configurations:
[Graph: sysbench throughput (tps) for configurations A through D]
As we can see, all of the configuration options produce numbers that, in the absence of outliers, are roughly identical, but it’s configuration C (shown in the graph as the blue line – SSD config) which shows the most consistent performance. The others all have assorted performance drops scattered throughout the graph. We see the exact same pattern when looking at transaction latency; the baseline numbers are roughly identical for all four configurations, but configuration C avoids the spikes and produces a very constant and predictable result.
[Graph: transaction response time for configurations A through D]

vCache vs. FlashCache – the basics

Once I’d determined that configuration C appeared to produce the most optimal results, I moved on to reviewing FlashCache performance versus that of vCache, and I also included a “no cache” test run as well using the base HDD MySQL configuration for purposes of comparison. Given the apparent differences in time-based flushing in vCache and FlashCache, both cache devices were set up so that time-based flushing was disabled. Also, both devices were set up such that all IO would be cached (i.e., no special treatment of sequential writes) and with a 50% dirty page threshold. Again, for comparison purposes, I also include the numbers from the vCache test where the time-based flushing is enabled.
[Graph: sysbench throughput (tps) for vCache, FlashCache, and no cache]
As we’d expect, the HDD-only solution barely registered on the graph. With a buffer pool that’s much smaller than the working set, the no-cache approach is fairly crippled and ineffectual. FlashCache does substantially better, coming in at an average of around 600 tps, but vCache is about 3x better. The interesting item here is that vCache with time-based flushing enabled actually produces better and more consistent performance than vCache without time-based flushing, but even at its worst, the vCache test without time-based flushing still outperforms FlashCache by over 2x, on average.

Looking just at sysbench reads, vCache with time-based flushing consistently hit about 27000 per second, whereas without time-based flushing it averaged about 12500. FlashCache came in around 7500 or so. Sysbench writes came in just under 8000 for vCache + time-based flushing, around 6000 for vCache without time-based flushing, and somewhere around 2500 for FlashCache.
[Graph: sysbench reads and writes per second for vCache and FlashCache]

We can take a look at some vmstat data to see what’s actually happening on the system during all these various tests. Clockwise from the top left in the next graph, we have “no cache”, “FlashCache”, “vCache with no time-based flushing”, and “vCache with time-based flushing.” As the images demonstrate, the no-cache system is being crushed by IO wait. FlashCache and vCache both show improvements, but it’s not until we get to vCache with the time-based flushing that we see some nice, predictable, constant performance.
[Graph: vmstat CPU usage for the four test configurations]

So why is it the case that vCache with time-based flushing appears to outperform all the rest? My hypothesis here is that time-based flushing allows the backing store to be written to at a more constant and potentially submaximal rate compared to dirty-page-threshold flushing, which kicks in at a given level and then attempts to flush as quickly as possible to bring the dirty pages back within acceptable bounds. This is, however, only a hypothesis.

vCache vs. FlashCache – dirty page threshold

Finally, we examine the impact of a couple of different dirty-page ratios on device performance, since this is the only parameter which can be reliably varied between the two in the same way. The following graph shows sysbench OLTP performance for FlashCache vs. vCache with a 10% dirty threshold versus the same metrics at a 50% dirty threshold. Time-based flushing has been disabled. In this case, both systems produce better performance when the dirty-page threshold is set to 50%, but once again, vCache at 10% outperforms FlashCache at 10%.

[Graph: sysbench throughput (tps) at 10% and 50% dirty-page thresholds]

The one interesting item here is that vCache actually appears to get *better* over time; I’m not entirely sure why that’s the case or at what point the performance is going to level off since these tests were all run for 2 hours anyway, but I think the overall results still speak for themselves, and even with a vCache volume where the dirty ratio is only 10%, such as might be the case where a deployment has a massive data set size in relation to both the working set and the cache device size, the numbers are encouraging.

Conclusion

Overall, I think the graphs speak for themselves. When the working set outstrips the available buffer pool memory but still fits into the cache device, vCache shines. Compared to a deployment with no SSD cache whatsoever, FlashCache still does quite well, massively outperforming the HDD-only setup, but it doesn’t even really come close to the numbers obtained with vCache. There may be ways to adjust the FlashCache configuration to produce better or more consistent results, or results that are more in line with the numbers put up by vCache, but when we consider that overall usability was one of the evaluation points and combine that with the fact that the best vCache performance results were obtained with the default vCache configuration, I think vCache can be declared the clear winner.

Base MySQL & Benchmark Configuration

All benchmarks were conducted with the following:

sysbench --num-threads=32 --test=tests/db/oltp.lua --oltp_tables_count=32 \
--oltp-table-size=10000000 --rand-init=on --report-interval=1 --rand-type=pareto \
--forced-shutdown=1 --max-time=7200 --max-requests=0 --percentile=95 \
--mysql-user=root --mysql-socket=/tmp/mysql.sock --mysql-table-engine=innodb \
--oltp-read-only=off run

The base MySQL configuration (configuration A) appears below:

#####fixed innodb options
innodb_file_format = barracuda
innodb_buffer_pool_size = 4G
innodb_file_per_table = true
innodb_data_file_path = ibdata1:100M
innodb_flush_method = O_DIRECT
innodb_log_buffer_size = 128M
innodb_flush_log_at_trx_commit = 1
innodb_log_file_size = 1G
innodb_log_files_in_group = 2
innodb_purge_threads = 1
innodb_fast_shutdown = 1
#not innodb options (fixed)
back_log = 50
wait_timeout = 120
max_connections = 5000
max_prepared_stmt_count=500000
max_connect_errors = 10
table_open_cache = 10240
max_allowed_packet = 16M
binlog_cache_size = 16M
max_heap_table_size = 64M
sort_buffer_size = 4M
join_buffer_size = 4M
thread_cache_size = 1000
query_cache_size = 0
query_cache_type = 0
ft_min_word_len = 4
thread_stack = 192K
tmp_table_size = 64M
server_id = 101
key_buffer_size = 8M
read_buffer_size = 1M
read_rnd_buffer_size = 4M
bulk_insert_buffer_size = 8M
myisam_sort_buffer_size = 8M
myisam_max_sort_file_size = 10G
myisam_repair_threads = 1
myisam_recover 

(Source: ssdperformanceblog.com)

Cray brings top supercomputer tech to businesses for a mere $500,000

A Cray XC30-AC server rack.

Cray, the company that built the world’s fastest supercomputer, is bringing its next generation of supercomputer technology to regular ol’ business customers with systems starting at just $500,000.

The new XC30-AC systems announced today range in price from $500,000 to roughly $3 million, providing speeds of 22 to 176 teraflops. That’s just a fraction of the speed of the aforementioned world’s fastest supercomputer, the $60 million Titan, which clocks in at 17.59 petaflops. (A teraflop represents a thousand billion floating point operations per second, while a petaflop is a million billion operations per second.)
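
To put that gap in perspective, even the largest XC30-AC configuration delivers roughly one percent of Titan’s Linpack number:

\[ \frac{176\ \text{teraflops}}{17.59\ \text{petaflops}} = \frac{176}{17{,}590} \approx 1\% \]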

But in fact, the processors and interconnect used in the XC30-AC are a step up from those used to build Titan. The technology Cray is selling to smaller customers today could someday be used to build supercomputers even faster than Titan.

Titan uses a mix of AMD Opteron and Nvidia processors for a total of 560,640 cores, and uses Cray’s proprietary Gemini interconnect.

XC30-AC systems ship with Intel Xeon 5400 Series processors (CORRECTION: Cray had told us this comes with Xeon 5400, but the product documents say it actually comes with the Xeon E5-2600). It’s the first Intel-based supercomputer Cray is selling into smaller businesses, what it calls the “technical enterprise” market. (Cray’s previous systems for this market used AMD processors.) Perhaps more importantly, XC30-AC uses Aries, an even faster interconnect than the one found in Titan.

In short, it’s “the latest Intel architectures with the latest Cray interconnect” that will be installed in “future #1 and top 10 class systems,” Cray VP of Marketing Barry Bolding told Ars. “It’s like buying a car model where you’re getting exactly same engine you’re getting in a top-of-the-line BMW. The only thing that’s changing are some of the peripherals that make the system easier to fit into a data center and make it more affordable.” Compared to systems like Titan, Cray says XC30-AC has “physically smaller compute cabinets with 16 vertical blades per cabinet.”

Oil and gas firms or electronics companies performing complex simulations are among the potential customers for an XC30-AC supercomputer.

XC30-AC is a followup to the XC30 systems which are meant for larger customers and typically cost tens of millions of dollars. The “AC” refers to the fact that the smaller systems are air-cooled instead of water-cooled. The power requirements aren’t as immense, and using air cooling makes it easier to install in a wider range of data centers.

XC30 systems scale up to 482 cabinets and 185,000 sockets with more than a million processor cores. The XC30-AC goes from one to eight cabinets, with each holding 16 blades of eight sockets each for 128 sockets in each cabinet. With an Intel 8-core Xeon processor in each socket, that adds up to 1,024 sockets and as many as 8,192 processor cores in an eight-cabinet system. A single cabinet with about 30TB of usable storage and 128 sockets would cost about $500,000, while eight-cabinet systems with 140TB of usable storage and 1,024 sockets hit the $3 million range.
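
The cabinet arithmetic works out as follows:

\[ 8\ \text{cabinets} \times 16\ \text{blades} \times 8\ \text{sockets} = 1{,}024\ \text{sockets}, \qquad 1{,}024\ \text{sockets} \times 8\ \text{cores} = 8{,}192\ \text{cores} \]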

To begin with, the XC30-AC supports only Intel Xeon processors with Sandy Bridge architecture. Those will be updated to server-class Ivy Bridge chips later on. Nvidia GPUs and Intel Xeon Phi chips will become available as co-processors by the end of the year, Cray said.

While XC30-AC systems will be smaller than traditional supercomputers, the Cray Aries interconnect makes it incredibly fast, Bolding said. He noted that Ethernet interconnects generally aren’t fast enough for the world’s fastest supercomputers. InfiniBand has really taken off, being used in about half of the top 100 systems and two of the top 10. But the top five systems in the world all use custom, proprietary interconnects such as Cray’s or IBM’s.

A look at the XC30 architecture. (Image: Cray)

Aries supports injection bandwidth of 10Gbps and 120 million get and put operations per second. “Injection bandwidth” is less than the full system’s bandwidth. As a paper on the interconnect’s architecture notes, the “global bandwidth of a full network exceeds the injection bandwidth—all of the traffic injected by a node can be routed to other groups.”

Latency is another key factor, with Aries providing point to point latency of less than a microsecond, Bolding said. Moreover, latency remains strong when a cluster is going full blast. “When a system is very busy and sending messages from one end of the machine to another across a fully loaded network where everything’s working at once, Cray’s latencies are literally almost as good as they are in point to point. They go up to around two or three microseconds,” Bolding said.

The speed allows memory to be shared across processors. “No matter how many nodes you have you can actually treat it as if it’s a shared memory machine, every node can talk to every other node, directly into the memory of that compute node,” Bolding said. “That’s something that is very powerful for certain types of applications and programming models.”

Aries also features more sophisticated network congestion algorithms than the previous generation, preventing messages from getting backed up during times of high usage.

As for software, XC30-AC comes with the SUSE-based Cray Linux Environment also used in Titan, allowing customers to run almost any Linux application, Bolding said. While some of Cray’s other systems are designed to run any form of Linux a customer wants, the XC30-AC comes with software optimized for the system. This allows it to be ready to go shortly after it comes out of the box, instead of requiring a week of setup.

Who will buy an entry-level supercomputer?

Cray isn’t the financial success it once was, with its latest earnings announcement showing a year-over-year drop in quarterly revenue from $112.3 million to $79.5 million. The company also experienced a net loss of $7.6 million. Cray fared better in fiscal 2012, with full-year revenue of $421.1 million and net income of $161.2 million.

High-performance computing revenue is on the rise, with supercomputing products ($500,000 and up) leading the way, according to IDC. HPC and supercomputing revenue is growing faster than the server market as a whole.

Cray is hoping to take its share of that revenue by selling both the smallest and largest supercomputer-class systems. While the XC30-AC was just announced today, it’s been shipping for a few weeks. Early customers include an unnamed “Fortune 100” commercial electronics firm whose R&D department needs a powerful machine for simulations.

The oil and gas industry has a need for such machines to model oil fields. Biotechnology, engineering, and various manufacturing industries may provide interested customers as well, Cray says.

We’ve written about the trend of Amazon and other cloud services being used for supercomputing, with one-off jobs costing up to several thousand dollars an hour. Those are generally for customers that have only occasional need for a supercomputer, however. Many businesses would use a supercomputer often enough that owning one is more cost-efficient. Cray is betting a lot of Fortune 500 companies and universities that can’t afford giant clusters costing tens of millions of dollars will be interested in systems like the XC30-AC.

“The complexity of problems that mid-range customers, technical enterprise customers face today are becoming so complex that they do need a tightly integrated supercomputer,” Bolding said. “They can’t always get away with a more conventional Ethernet cluster.”

(via arstechnica.com)

 

Titan Knocks Off Sequoia as Top Supercomputer

In the battle of the DOE labs, Oak Ridge Lab’s Titan supercomputer has taken the title from the former TOP500 champ, Lawrence Livermore’s Sequoia. The GPU-charged Titan, using the new NVIDIA K20X-equipped XK7 blades from Cray, delivered 17.6 petaflops to Sequoia’s 16.3 petaflops on Linpack, the sole metric for TOP500 rankings.

Titan looks like it will also take the energy-efficiency title from Sequoia and the Blue Gene/Q platform. The Oak Ridge super delivers 2,120 megaflops/watt, besting Sequoia’s current mark of 2,100 megaflops/watt. The results, however, won’t be official until the Green500 list is announced later this week.
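
Dividing each machine’s Linpack result by its efficiency rating gives a rough sense of the power draw involved (a back-of-the-envelope estimate derived from the figures above, not an official number from either lab):

\[ \frac{17.6\ \text{petaflops}}{2{,}120\ \text{megaflops/watt}} \approx 8.3\ \text{MW}, \qquad \frac{16.3\ \text{petaflops}}{2{,}100\ \text{megaflops/watt}} \approx 7.8\ \text{MW} \]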

Despite being knocked out of the top spot, IBM machines still claim 6 of the top 10 systems:

  1. 17.6 petaflops, Titan (Cray), United States
  2. 16.3 petaflops, Sequoia (IBM), United States
  3. 10.5 petaflops, K computer (Fujitsu), Japan
  4. 8.2 petaflops, Mira (IBM), United States
  5. 4.1 petaflops, JUQUEEN (IBM), Germany
  6. 2.9 petaflops, SuperMUC (IBM), Germany
  7. 2.7 petaflops, Stampede (Dell), United States
  8. 2.6 petaflops, Tianhe-1A (NUDT), China
  9. 1.7 petaflops, Fermi (IBM), Italy
  10. 1.5 petaflops, DARPA Trial Subset (IBM), United States

Although turnover was minimal, the aggregate performance at the top is growing rapidly. These systems now represent more than 68 petaflops; a year ago those top 10 machines encompassed just over 22 petaflops.
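
As a quick check, the ten Linpack figures listed above do indeed sum to just over 68 petaflops:

\[ 17.6+16.3+10.5+8.2+4.1+2.9+2.7+2.6+1.7+1.5 = 68.1\ \text{petaflops} \]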

A nice chunk of that is thanks to Titan, of course, but the ORNL super also brings GPU-accelerated supercomputing back to the head of the list. The last time such a machine held that title was November 2010, when China’s Tianhe-1A system was the number one machine. Despite the ascendance of Titan, HPC accelerators still constitute a relatively small portion of the list: currently 62 systems.

But that’s four more than just six months ago, and with the launch of the teraflop accelerators this week from Intel (Knights Corner), NVIDIA (Kepler K20 GPUs), and AMD (FirePro S10000), those numbers will almost certainly grow. When you can buy a teraflop on a PCIe card for a few thousand dollars, it becomes a lot easier to string together a petaflop machine. While CPU-only supercomputers still have a lot of life in them, the smart money is on these vector-heavy coprocessors to expand the number of petaflop systems in the world.

Besides Titan, new to the top 10 are Dell’s Stampede and IBM’s DARPA Trial Subset machine. The Stampede machine, installed at the Texas Advanced Computing Center (TACC), debuts Intel’s Knights Corner manycore accelerator, while IBM’s DARPA Trial Subset is an implementation of the Power7-based PERCS architecture, developed in conjunction with the High Productivity Computing Systems (HPCS) program. JUQUEEN is not new to the top 10, but it tripled its capacity since June, moving from number 8 to number 4.

Stampede could also make its way further up the list by next June. The TACC super is slated to reach 10 peak petaflops when the system is fully deployed in 2013, which should get the Linpack mark to about 6.7 petaflops. By then though, there is likely to be even more competition in the multi-petaflops realm.

On the interconnect front, InfiniBand-based supercomputers continue to steal share from Ethernet. Over the last six months, 15 InfiniBand systems were added, for a total of 226, while Ethernet lost 19 machines, reducing its share to 188. At the top of the list, though, custom interconnects rule. Of the top 10, there is but one that uses InfiniBand (Stampede); the rest employ custom interconnects of various stripes from Cray, IBM, Fujitsu, and China’s NUDT.

The one TOP500 element that remained fairly constant this time around was the geographical distribution of Linpack FLOPS. The US is still the dominant nation with 251 systems (down one from last June). China is in second place with 72 systems (down two from June). The European superpowers (UK, France, and Germany) have reached parity, more or less, with 24, 21, and 20 systems, respectively.

Perhaps the most significant development on this latest list is the growth of petascale supercomputers, which currently constitute the top 23 systems. That’s up from the top 10 just a year ago. It’s projected that by 2015, all 500 machines will be a petaflop or greater.

(via HPCwire.com)

Nvidia Puts Tesla K20 GPU Coprocessor Through its Paces

Back in May, when Nvidia divulged some of the new features of its top-end GK110 graphics chip, the company said that two new features of the GPU, Hyper-Q and Dynamic Parallelism, would help GPUs run more efficiently and without the CPU butting in all the time. Nvidia is now dribbling out some benchmark test results ahead of the GK110’s shipment later this year inside the Tesla K20 GPU coprocessor for servers.

The GK110 GPU chip, sometimes called the Kepler2, is an absolute beast, with over 7.1 billion transistors etched on a die by foundry Taiwan Semiconductor Manufacturing Corp using its much-sought 28 nanometer processes. It sports 15 SMX (streaming multiprocessor extreme) processing units, each with 192 single-precision CUDA cores and 64 double-precision floating point units tacked on to every triplet of CUDA cores. That gives you 960 DP floating point units across a maximum of 2,880 CUDA cores on the GK110 chip.

Nvidia has been vague about absolute performance, but El Reg expects the GK110 to deliver just under 2 teraflops of raw DP floating point performance at 1GHz clock speeds on the cores and maybe 3.5 teraflops at single precision. That’s around three times the oomph – and three times the performance per watt if the thermals are about the same – of the existing Fermi GF110 GPUs used in the Tesla M20 series of GPU coprocessors.
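
That estimate is straightforward to reconstruct from the unit counts above, assuming each double-precision unit retires one fused multiply-add (two flops) per clock, and taking NVIDIA’s published 665 gigaflops double-precision peak for the Fermi-based M2090 as the comparison point (both of these are assumptions layered on top of the article’s own numbers):

\[ 15 \times 64 = 960\ \text{DP units}, \qquad 960 \times 2\ \text{flops} \times 1\ \text{GHz} \approx 1.92\ \text{teraflops} \]
\[ 3 \times 0.665\ \text{teraflops} \approx 2.0\ \text{teraflops} \]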

Just having more cores is not going to boost performance. You have to use those cores more efficiently, and that’s what the Hyper-Q and Dynamic Parallelism features are all about.

Interestingly, these two features are not available on the GK104 GPU chips, which are used in the Tesla K10 coprocessors that Nvidia is already shipping to customers who need single-precision flops. The Tesla K10 GPU coprocessor puts two GK104 chips on a PCI-Express card and delivers 4.58 teraflops of SP number-crunching in a 225 watt thermal envelope – a staggering 3.5X the performance of the Fermi M2090 coprocessor.

A lot of supercomputer applications run the message passing interface (MPI) protocol to dispatch work on parallel machines, and Hyper-Q allows the GPU to work in a more cooperative fashion with the CPU when handling MPI dispatches. With the Fermi cards, the GPU could only have one MPI task dispatched from the CPU and offloaded to the GPU at a time. This is an obvious bottleneck.

Nvidia’s Hyper-Q feature for Kepler GPUs

With Hyper-Q, Nvidia is adding a queue to the GPU itself, and now the processor can dispatch up to 32 different MPI tasks to the GPU at the same time. Not one line of MPI code has to be changed to take advantage of Hyper-Q; it just happens automagically as the CPU is talking to the GPU.

To show how well Hyper-Q works (and that those thousands of CUDA cores won’t be sitting around scratching themselves with boredom), Peter Messmer, a senior development engineer at Nvidia, grabbed some molecular simulation code called CP2K, which he said in a blog was “traditionally difficult code for GPUs” and tested how well it ran on a Tesla K20 coprocessor with Hyper-Q turned off and then on.

As Messmer explains, MPI applications “experienced reduced performance gains” when the MPI processes were limited by the CPU to small amounts of work. The CPU got hammered and the GPUs were inactive a lot of the time. And so the GPU speedup in a hybrid system was not what it could be, as you can see in this benchmark test that puts the Tesla K20 coprocessor inside of the future Cray XK7 supercomputer node with a sixteen-core Opteron 6200 processor.

Hyper-Q boosts for nodes running CP2K molecular simulations by 2.5X

With this particular data set, which simulates 864 water molecules, adding node pairs of CPUs and GPUs didn’t really boost the performance that much. With sixteen nodes without Hyper-Q enabled, you get twelve times the performance (for some reason, Nvidia has the Y axis as relative speedup compared to two CPU+GPU nodes). But on the same system with sixteen CPU+GPU nodes with Hyper-Q turned on, the performance is 2.5 times as high. Nvidia is not promising that all code will see a similar speedup with Hyper-Q, mind you.

El Reg asked Sumit Gupta, senior product manager of the Tesla line at Nvidia, why the CP2K tests didn’t pit the Fermi and Kepler GPUs against each other, and he quipped that Nvidia had to save something for the SC12 supercomputing conference out in Salt Lake City in November.

With Dynamic Parallelism, another feature of the GK110 GPUs, the GPU is given the ability to dispatch work inside the GPU as needed and forced by the calculations that are dispatched to it by the CPU. With the Fermi GPUs, the CPU in the system dispatched work to one or more CUDA cores, and the answer was sent back to the CPU. If further calculations were necessary, the CPU dispatched this data and the algorithms to the GPU, which sent replies back to the CPU, and so on until the calculation finished.

Dynamic Parallelism: schedule your own work, GPU

There can be a lot of back-and-forth with the current Fermi GPUs. Dynamic Parallelism lets the GPU spawn its own work. But more importantly, it also allows for the granularity of the simulations to change dynamically, getting finer-grained where interesting things are going on and doing mostly nothing in the parts of the simulation space where nothing much is going on.

By tuning the granularity of the simulation with the granularity of the data across space and time, you will get better results and do less work (in less time) than you might otherwise with fine-grained simulation in all regions and timeslices.

Performance gains from dynamic parallelism for GK110 GPUs

The most important thing about Dynamic Parallelism is that the GPU automatically makes the decision about the coarseness of calculations, reacting to data and launching new threads as needed.

To show off early test results for Dynamic Parallelism, Nvidia did not do a fluid-mechanics simulation or anything neat like that, but rather, in another blog post, Nvidia engineer Steven Jones ran a Quicksort benchmark on the K20 GPU coprocessor with Dynamic Parallelism turned off and then on.

If you’ve forgotten your CompSci 101, Jones included the Quicksort code he used in the test in the blog post. Interestingly, it takes half as much code to write the Quicksort routine on the GPU with Dynamic Parallelism turned on and used because you don’t have to control the bouncing back and forth between CPU and GPU.

As you can see in the chart above, if you do all of the launching for each segment of the data to be sorted from the CPU, which you do with Fermi GPUs, then it takes longer to do the sort. On the K20 GPU, Dynamic Parallelism boosts performance of Quicksort by a factor of two and scales pretty much with the size of the data set. It will be interesting to see how much better that K20 is at doing Quicksort compared to the actual Fermi GPUs, and how other workloads and simulations do with this GPU autonomy.

Gupta tells El Reg that the Tesla K20 coprocessors are on track for initial deliveries in the fourth quarter. ®

(via insideHPC.com)