DDN Pushes the Envelope for Parallel Storage I/O

Today at Supercomputing 2014, DataDirect Networks lifted the veil a bit more on Infinite Memory Engine (IME), its new software that will employ Flash storage and a set of smart algorithms to create a buffer between HPC compute and parallel file system resources, with the goal of improving file I/O by up to 100x. The company also announced the latest release of EXAScaler, its Lustre-based storage appliance lineup.

Data patterns have been changing at HPC sites in a way that creates I/O bottlenecks. While many HPC shops may think they are primarily working with large, sequential files, the reality is that most data is relatively small and random, and that fragmented I/O creates problems when moving data across the interconnect, says Jeff Sisilli, Sr. Director of Product Marketing at DataDirect Networks.

“Parallel file systems were really built for large files,” Sisilli tells HPCwire. “What we’re finding is 90 percent of typical I/O in HPC data centers utilizes small files, those less than 32KB. What happens is, when you inject those into a parallel file system, it starts to really bring down performance.”

DDN says IME overcomes the restrictions inherent in how parallel file systems were designed by creating a storage tier above the file system: a "fast data" layer between the compute nodes in an HPC cluster and the backend file system. The software, which resides on the I/O nodes in the cluster, uses whatever Flash solid state drives (SSDs) or other non-volatile memory (NVM) resources are available, creating a "burst buffer" to absorb peak loads and eliminate I/O contention.

[Figure: IME diagram]
IME works in two ways. First, it removes limitations of the POSIX layer, such as file locks, that can slow down communication. Second, its algorithms bundle the small, random I/O operations into larger, better-aligned writes that the file system can handle much more efficiently.
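
DDN has not published IME's internals, but the general pattern it describes, absorbing small, fragmented writes into a fast tier and draining them to the backing store as larger, ordered writes, is easy to sketch. The Python fragment below is a toy illustration of that idea only; the WriteCoalescer class, the 1 MiB stripe size, and the single backing file are all invented for the example and are not DDN code.

import os

STRIPE_SIZE = 1 << 20  # pretend the backing file system prefers 1 MiB full-stripe writes

class WriteCoalescer:
    """Toy stand-in for a burst buffer: absorb small random writes, drain large ordered ones."""

    def __init__(self, backing_path):
        self.backing_path = backing_path
        self.pending = []        # buffered (offset, bytes) fragments
        self.pending_bytes = 0

    def absorb(self, offset, data):
        # Small, random writes land in the fast tier immediately.
        self.pending.append((offset, data))
        self.pending_bytes += len(data)
        if self.pending_bytes >= STRIPE_SIZE:
            self.drain()

    def drain(self):
        # Sort fragments by offset so the backing store sees large, ordered I/O
        # instead of many scattered small writes.
        self.pending.sort(key=lambda frag: frag[0])
        fd = os.open(self.backing_path, os.O_RDWR | os.O_CREAT, 0o644)
        try:
            for offset, data in self.pending:
                os.pwrite(fd, data, offset)   # data must be bytes
            os.fsync(fd)
        finally:
            os.close(fd)
        self.pending.clear()
        self.pending_bytes = 0

A real burst buffer also has to handle durability, erasure coding, metadata, and many concurrent writers, all of which this sketch ignores.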

In lab tests at a customer site, DDN ran IME against the S3D turbulent flow modeling software. S3D was designed for larger, sequential files, but in the real world it is often used with smaller, random ones. In this customer's case, those "mal-aligned and fragmented" files were causing I/O throughput across the InfiniBand interconnect to drop to 25 MB per second.

After introducing IME, the customer was able to ingest data from the compute cluster onto IME’s SSDs at line rate. “This customer was using InfiniBand, and we were able to fill up InfiniBand all the way to line rate, and absorb at 50 GB per second,” Sisilli says.

The data wasn't written back to the file system quite that quickly. But because the algorithms were able to align all those small files and convert fragments into full-stripe writes, it was still a dramatic improvement over 25 MB per second. "We were able to drain out the buffer and write to the parallel file system at 4 GB per second, which is two orders of magnitude faster than before," Sisilli says. (4 GB per second is roughly 160 times the original 25 MB per second.)

The "net net" of IME, Sisilli says, is that it frees up HPC compute cluster resources. "From the parallel file system side, we're able to shield the parallel file system and underlying storage arrays from fragmented I/O, and have those be able to ingest optimized data and utilize much less hardware to be able to get to the performance folks need up above," he says.

IME will work with any Lustre- or GPFS-based parallel file system, including DDN's own EXAScaler line of Lustre-based storage appliances or the storage appliances of other vendors. No application modifications are required to use IME, which also features erasure coding capabilities typically found in object stores. The only requirements are that the application be POSIX compliant or use MPI. DDN also provides an API that customers can use if they want to modify their apps to work with IME; the company plans to build an ecosystem of compatible tools around this API.

Other vendors are developing similar Flash-based storage buffer offerings. But DDN says its open, software-based approach gives customers an advantage over vendors that require the purchase of specialized hardware or that work only with certain interconnects.

[Figure: IME burst buffer]

IME isn't available yet; it's still in technology preview mode. But when it becomes available, scalability won't be an issue: the software will be able to corral and make available petabytes' worth of Flash or NVM storage spread across thousands of nodes, Sisilli says. "What we're recommending is [to have in IME] anywhere between two to three [times the] amount of your compute cluster memory to have a great working space within IME to accelerate your applications and do I/O," he says. "That can be all the way down to terabytes, and for supercomputers, it's multi petabytes."

IME is still undergoing tests, and is expected to become generally available in the second quarter of 2015. DDN will offer it as an appliance or as software.

DDN also today unveiled a new release of EXAScaler. With Version 2.1, DDN has improved read and write I/O performance by 25 percent. That will give DDN a comfortable advantage over competing file systems for some time, says Roger Goff, Sr. Product Manager for DDN.

“We know what folks are about to announce because they pre-announce those things,” Goff says. “Our solution is tremendously faster than what you will see [from other vendors], particularly on a per-rack performance basis.”

Other new features in version 2.1 include support for self-encrypting drives; improved rebuild times; InfiniBand optimizations; and better integration with DDN’s Storage Fusion Xcelerator (SFX Flash Caching) software.

DDN has also standardized on Intel's distribution of the Lustre file system, Intel Enterprise Edition for Lustre version 2.5. That brings several new capabilities, including a new MapReduce connector for running Hadoop workloads.

“So instead of having data replicated across multiple nodes in the cluster, which is the native mode for HDFS, with this adapter, you can run those Hadoop applications and take advantages of the single-copy nature of a parallel file system, yet have the same capability of a parallel file system to scale to thousands and thousands of clients accessing that same data,” Goff says.

EXAScaler version 2.1 is available now across all three EXAScaler products: the entry-level SFA7700, the midrange ES12K/SFA12K-20, and the high-end SFA12KX/SFA12K-40.

(Via HPCwire.com)

Facebook has built its own switch. And it looks a lot like a server


SUMMARY: Facebook has built its own networking switch and developed a Linux-based operating system to run it. The goal is to create networking infrastructure that mimics a server in terms of how it's managed and configured.

Not content to remake the server, Facebook's engineers have taken on the humble switch, building their own version of the networking box and the software to go with it. The resulting switch, dubbed Wedge, and its software, called FBOSS, will be provided to the Open Compute Foundation as an open source design for others to emulate. Facebook is already testing it with production traffic in its data centers.

Jay Parikh, the VP of infrastructure engineering at Facebook, shared the news of the switch onstage at the Gigaom Structure event Wednesday, explaining that Facebook's goal in creating this project was to eliminate the network engineer and run its networking operations in the same easily swapped-out and dynamic fashion as its servers. In many ways Facebook's efforts to design its own infrastructure have stemmed from the need to build hardware that is as flexible as the software running on top of it. It makes no sense to be innovating all the time with your code if you can't adjust the infrastructure to run that code efficiently.


Networking has long been a frustrating aspect of IT infrastructure because the switch has been a black box that both delivered packets and did the computing to figure out the paths those packets should take. But as networks scaled out, that combination, together with the domination of the market by giants Cisco and Juniper, became untenable. So the physical delivery of packets and the decisions about where those packets should go were split into two jobs, allowing networks to become software-defined and allowing other companies to start innovating.

The creation of a custom-designed switch that lets Facebook control its networking the way it currently manages its servers has been a long time coming. Facebook began the Open Compute effort with a redesigned server in 2011 and focused on servers and a bit of storage for the next two years. In May 2013 it called for vendors to submit designs for an open source switch, and at last year's Structure event Parikh detailed Facebook's new networking fabric that allowed the social networking giant to move large amounts of traffic more efficiently.

But the combination of the modular hardware approach in the Wedge switch and the Linux-based FBOSS operating system blows the switch apart in the same way Facebook blew the server apart. The switch will use the Group Hug microprocessor boards, so any type of chip could slot into the box to control configuration and run the OS. The switch will still rely on a networking processor for routing and delivery of the packets and has a throughput of 640 Gbps, but eventually Facebook could separate the transport and decision-making processes.

The whole goal here is to turn the monolithic switch into something modular, controlled by FBOSS software that can be updated as needed without having to learn the proprietary networking languages required by other providers' gear. The open question is how Facebook's effort will affect the larger market for networking products.

Facebook's infrastructure is unusual in that the company wholly controls it and has the engineering talent to build software and new hardware to meet its computing needs. Google is another company that has built its own networking switch, but it didn't open source those designs and keeps them close. Many enterprise customers don't have the technical expertise of a web giant, so the tweaks that others contribute to the Open Compute Foundation to make the gear and the software easier to use will likely influence adoption.

(Via GigaOm.com)

Microway Rolls out Octoputer Servers with up to 8 GPUs

Today Microway announced a new line of servers designed for GPU and storage density. As part of the announcement, the company's new OctoPuter GPU servers pack 34 TFLOPS of computing power when paired with up to eight NVIDIA Tesla K40 GPU accelerators.

"NVIDIA GPU accelerators offer the fastest parallel processing power available, but this requires high-speed access to the data. Microway's newest GPU computing solutions ensure that large amounts of source data are retained in the same server as a high density of Tesla GPUs. The result is faster application performance by avoiding the bottleneck of data retrieval from network storage," said Stephen Fried, CTO of Microway.

Microway also introduced an additional NumberSmasher 1U GPU server housing up to three NVIDIA Tesla K40 GPU accelerators. With nearly 13 TFLOPS of computing power, the NumberSmasher includes up to 512GB of memory, 24 x86 compute cores, hardware RAID, and optional InfiniBand.

[Figure: OctoPuter server with NVIDIA Tesla GPUs]

(Via InsideHPC.com)

A3Cube Develops Extreme Parallel Storage Fabric, Claims 7x InfiniBand

News from EETimes points towards a startup that claims an extreme performance advantage over InfiniBand.  A3Cube Inc. has developed a variation of PCI Express on a network interface card to offer lower latency.  The company is promoting its Ronniee Express technology via a PCIe 2.0-driven FPGA that offers sub-microsecond latency across a 128-server cluster.

In the Sockperf benchmark, numbers from A3Cube put performance at around 7x that of InfiniBand and PCIe 3.0 x8, and the company thus claims the approach beats the top alternatives.  The device's PCIe support at the physical layer enables quality-of-service features, and A3Cube claims the fabric allows a cluster of 10,000 nodes to be presented as a single image without congestion.

The aim for A3Cube will be primarily high-frequency trading (HFT), genomics, oil/gas exploration, and real-time data analytics.  Prototypes for merchants are currently being worked on, and it is expected that two versions of network cards and a 1U switch based on the technology will be available before July.

A3Cube is keeping the new IP hidden away, but the logic points towards device enumeration and extending the PCIe root complex across a cluster of systems; this is suggested by the company's comment that PCIe 3.0 is incompatible because of the different device enumeration in that specification.  The plan is to build a solid platform on PCIe 4.0, which puts the technology several years away in terms of non-specialized deployment.

Like many startups, A3Cube's next step is to secure venture funding.  The Ronniee Express approach differs from that of PLX, which is developing a direct PCIe interconnect for computer racks.

A3Cube's webpage on the technology states that the fabric uses a combination of hardware and software while remaining application transparent.  The product combines multiple 20 or 40 Gbit/s channels, aimed at petabyte-scale Big Data and HPC storage systems.

Information from Willem Ter Harmsel describes the Ronniee NIC system as a global shared memory container, with an in-memory network between nodes.  CPU, memory, and I/O are directly connected, with 800-900 nanosecond latencies, and the 'memory windows' facilitate low-latency traffic.

Using A3Cube's storage OS, byOS, 40 terabytes of SSDs, and the Ronniee Express fabric, five storage nodes were connected via four links per NIC, allowing for 810 ns latency in any direction.  A3Cube claims 4 million IOPS with this setup.

Further, an interview between Willem and Antonella Rubicco notes that "Ronniee is designed to build massively parallel storage and analytics machines; not to be used as an 'interconnection' as Infiniband or Ethernet.  It is designed to accelerate applications and create parallel storage and analytics architecture."

(via AnandTech.com)

 

Virident vCache vs. FlashCache

Ease of basic installation. The setup process was simply a matter of installing two RPMs and running a couple of commands to enable vCache on the PCIe flash card (a Virident FlashMAX II) and set up the cache device with the command-line utilities supplied with one of the RPMs. Moreover, the vCache software is built in to the Virident driver, so there is no additional module to install. FlashCache, on the other hand, requires building a separate kernel module in addition to whatever flash memory driver you’ve already had to install, and then further configuration requires modification to assorted sysctls. I would also argue that the vCache documentation is superior. Winner: vCache.

Ease of post-setup modification / advanced installation. Many of the FlashCache device parameters can be easily modified by echoing the desired value to the appropriate sysctl setting; with vCache, there is a command-line binary which can modify many of the same parameters, but doing so requires a cache flush, detach, and reattach. Winner: FlashCache.

Operational Flexibility: Both solutions share many features here; both of them allow whitelisting and blacklisting of PIDs or simply running in a "cache everything" mode. Both of them have support for not caching sequential IO, adjusting the dirty page threshold, flushing the cache on demand, or having a time-based cache flushing mechanism, but some of these features operate differently with vCache than with FlashCache. For example, a manual cache flush with vCache is a blocking operation. With FlashCache, echoing "1" to the do_sync sysctl of the cache device triggers a cache flush, but it happens in the background, and while countdown messages are written to syslog as the operation proceeds, the device never reports that it's actually finished. I think both kinds of flushing are useful in different situations, and I'd like to see a non-blocking background flush in vCache, but if I had to choose one or the other, I'll take blocking and modal over fire-and-forget any day. FlashCache does have the nice ability to switch between FIFO and LRU for its flushing algorithm; vCache does not. This is something that could prove useful in certain situations. Winner: FlashCache.

Operational Monitoring: Both solutions offer plenty of statistics; the main difference is that FlashCache stats can be pulled from /proc but vCache stats have to be retrieved by running the vgc-vcache-monitor command. Personally, I prefer “cat /proc/something” but I’m not sure that’s sufficient to award this category to FlashCache. Winner: None.

Time-based Flushing: This wouldn’t seem like it should be a separate category, but because the behavior seems to be so different between the two cache solutions, I’m listing it here. The vCache manual indicates that “flush period” specifies the time after which dirty blocks will be written to the backing store, whereas FlashCache has a setting called “fallow_delay”, defined in the documentation as the time period before “idle” dirty blocks are cleaned from the cache device. It is not entirely clear whether or not these mechanisms operate in the same fashion, but based on the documentation, it appears that they do not. I find the vCache implementation more useful than the one present in FlashCache. Winner: vCache.
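
To make the knobs discussed above concrete, here is a minimal sketch of the "echo a value into a sysctl" style of FlashCache tuning, written in Python. It assumes the FlashCache sysctls are exposed under /proc/sys/dev/flashcache/<cachedev>/ (the exact path depends on the cache device name and FlashCache version), uses only parameters mentioned in this post (fallow_delay, do_sync), requires root, and is not part of either vendor's tooling.

import os

# Placeholder cache device name; on a real system this is derived from the
# backing disk and cache device names.
CACHE_SYSCTL_DIR = "/proc/sys/dev/flashcache/cachedev"

def set_flashcache_param(name, value):
    """Write a value to a FlashCache sysctl, i.e. the 'echo N > /proc/...' pattern."""
    path = os.path.join(CACHE_SYSCTL_DIR, name)
    with open(path, "w") as f:
        f.write(str(value))

set_flashcache_param("fallow_delay", 0)  # disable idle (time-based) cleaning
set_flashcache_param("do_sync", 1)       # kick off a background flush of dirty blocks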

Although nobody likes a tie, if you add up the scores, usability is a 2-2-1 draw between vCache and FlashCache. There are things that I really liked better with FlashCache, and there are other things that I thought vCache did a much better job with. If I absolutely must pick a winner in terms of usability, then I’d give a slight edge to FlashCache due to configuration flexibility, but if the GA release of vCache added some of FlashCache’s additional configuration options and exposed statistics via /proc, I’d vote in the other direction.

Disclosure: The research and testing conducted for this post were sponsored by Virident.

First, some background information. All tests were conducted on Percona's Cisco UCS C250 test machine, and both the vCache and FlashCache tests used the same 2.2TB Virident FlashMAX II as the cache storage device. EXT4 was the filesystem, and CentOS 6.4 the operating system, although the pre-release modules I received from Virident required the use of the CentOS 6.2 kernel, 2.6.32-220, so that was the kernel in use for all of the benchmarks on both systems. The benchmark tool used was sysbench 0.5 and the version of MySQL used was Percona Server 5.5.30-rel30.1-465. Each test was allowed to run for 7200 seconds, and the first 3600 seconds were discarded as warmup time; the remaining 3600 seconds were averaged into 10-second intervals. All tests were conducted with approximately 78GiB of data (32 tables, 10M rows each) and a 4GiB buffer pool. The cache devices were flushed to disk immediately prior to and immediately following each test run.
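
As an aside, the "discard the first hour, then average into 10-second intervals" step is easy to reproduce. The helper below is my own sketch, assuming the per-second tps values printed by sysbench (--report-interval=1) have already been parsed into a plain list; it is not part of the original benchmark harness.

def bucket_tps(per_second_tps, warmup=3600, bucket=10):
    """Drop the warmup samples, then average the rest into fixed-size buckets."""
    steady = per_second_tps[warmup:]
    usable = len(steady) - len(steady) % bucket   # ignore a trailing partial bucket
    return [
        sum(steady[i:i + bucket]) / bucket
        for i in range(0, usable, bucket)
    ]

# Example: 7200 one-second samples -> 360 ten-second averages
samples = [1800.0] * 7200
print(len(bucket_tps(samples)))   # 360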

With that out of the way, let’s look at some numbers.

vCache vs. vCache – MySQL parameter testing

The first test was designed to look solely at vCache performance under some different sets of MySQL configuration parameters. For example, given that the front-end device is a very fast PCIe SSD, would it make more sense to configure MySQL as if it were using SSD storage or to just use an optimized HDD storage configuration? After creating a vCache device with the default configuration, I started with a baseline HDD configuration for MySQL (configuration A, listed at the bottom of this post) and then tried three additional sets of experiments. First, the baseline configuration plus:

innodb_read_io_threads = 16
innodb_write_io_threads = 16

We call this configuration B. The next one contained four SSD-specific optimizations based partially on some earlier work that I’d done with this Virident card (configuration C):

innodb_io_capacity = 30000
innodb_adaptive_flushing_method = keep_average
innodb_flush_neighbor_pages=none
innodb_max_dirty_pages_pct = 60

And then finally, a fourth test (configuration D) which combined the parameter changes from tests B and C. The graph below shows the sysbench throughput (tps) for these four configurations:
[Figure: vcache_trx_params]
As we can see, all of the configuration options produce numbers that, in the absence of outliers, are roughly identical, but it’s configuration C (shown in the graph as the blue line – SSD config) which shows the most consistent performance. The others all have assorted performance drops scattered throughout the graph. We see the exact same pattern when looking at transaction latency; the baseline numbers are roughly identical for all four configurations, but configuration C avoids the spikes and produces a very constant and predictable result.
[Figure: vcache_response_params]

vCache vs. FlashCache – the basics

Once I'd determined that configuration C appeared to produce the best results, I moved on to comparing FlashCache performance with that of vCache, and I also included a "no cache" test run using the base HDD MySQL configuration for purposes of comparison. Given the apparent differences in time-based flushing between vCache and FlashCache, both cache devices were set up with time-based flushing disabled. Also, both devices were set up such that all IO would be cached (i.e., no special treatment of sequential writes) and with a 50% dirty page threshold. For reference, I also include the numbers from the vCache test where time-based flushing is enabled.
[Figure: vcache_fcache_trx_params]
As we’d expect, the HDD-only solution barely registered on the graph. With a buffer pool that’s much smaller than the working set, the no-cache approach is fairly crippled and ineffectual. FlashCache does substantially better, coming in at an average of around 600 tps, but vCache is about 3x better. The interesting item here is that vCache with time-based flushing enabled actually produces better and more consistent performance than vCache without time-based flushing, but even at its worst, the vCache test without time-based flushing still outperforms FlashCache by over 2x, on average.

Looking just at sysbench reads, vCache with time-based flushing consistently hit about 27000 per second, whereas without time-based flushing it averaged about 12500. FlashCache came in around 7500 or so. Sysbench writes came in just under 8000 for vCache + time-based flushing, around 6000 for vCache without time-based flushing, and somewhere around 2500 for FlashCache.
[Figure: vcache_fcache_read_write]

We can take a look at some vmstat data to see what’s actually happening on the system during all these various tests. Clockwise from the top left in the next graph, we have “no cache”, “FlashCache”, “vCache with no time-based flushing”, and “vCache with time-based flushing.” As the images demonstrate, the no-cache system is being crushed by IO wait. FlashCache and vCache both show improvements, but it’s not until we get to vCache with the time-based flushing that we see some nice, predictable, constant performance.
[Figure: cpu-usage-all]

So why is it the case that vCache with time-based flushing appears to outperform all the rest? My hypothesis here is that time-based flushing allows the backing store to be written to at a more constant and, potentially, submaximal, rate compared to dirty-page-threshold flushing, which kicks in at a given level and then attempts to flush as quickly as possible to bring the dirty pages back within acceptable bounds. This is, however, only a hypothesis.
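
That hypothesis is hard to confirm without more instrumentation, but a toy model makes the intuition easy to see. The snippet below is my own illustration, with made-up numbers rather than anything measured: a "periodic" flusher drains a bounded amount every tick, while a "threshold" flusher waits until dirty pages cross a limit and then writes everything at once.

def simulate(ticks=30, dirty_per_tick=100, mode="periodic",
             drain_rate=120, threshold=500):
    """Return pages written to the backing store at each tick of a tiny model."""
    dirty, writes = 0, []
    for _ in range(ticks):
        dirty += dirty_per_tick
        if mode == "periodic":
            flushed = min(dirty, drain_rate)      # constant, bounded drain every tick
        else:
            flushed = dirty if dirty >= threshold else 0   # wait, then dump everything
        dirty -= flushed
        writes.append(flushed)
    return writes

print(simulate(mode="periodic"))    # a steady ~100 pages per tick
print(simulate(mode="threshold"))   # idle ticks punctuated by 500-page bursts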

vCache vs. FlashCache – dirty page threshold

Finally, we examine the impact of a couple of different dirty-page ratios on device performance, since this is the only parameter which can be reliably varied between the two in the same way. The following graph shows sysbench OLTP performance for FlashCache vs. vCache with a 10% dirty threshold versus the same metrics at a 50% dirty threshold. Time-based flushing has been disabled. In this case, both systems produce better performance when the dirty-page threshold is set to 50%, but once again, vCache at 10% outperforms FlashCache at 10%.

[Figure: vcache-dirty_trx_params]

The one interesting item here is that vCache actually appears to get *better* over time; I'm not entirely sure why that's the case, or at what point performance would level off, since these tests only ran for two hours. But I think the overall results still speak for themselves: even with a vCache volume where the dirty ratio is only 10%, such as might be the case where a deployment has a massive data set relative to both the working set and the cache device size, the numbers are encouraging.

Conclusion

Overall, I think the graphs speak for themselves. When the working set outstrips the available buffer pool memory but still fits into the cache device, vCache shines. Compared to a deployment with no SSD cache whatsoever, FlashCache still does quite well, massively outperforming the HDD-only setup, but it doesn't really come close to the numbers obtained with vCache. There may be ways to adjust the FlashCache configuration to produce better or more consistent results, or results more in line with the numbers put up by vCache, but when we consider that overall usability was one of the evaluation points, and that the best vCache performance came with the default vCache configuration, I think vCache can be declared the clear winner.

Base MySQL & Benchmark Configuration

All benchmarks were conducted with the following:

sysbench --num-threads=32 --test=tests/db/oltp.lua --oltp_tables_count=32 \
--oltp-table-size=10000000 --rand-init=on --report-interval=1 --rand-type=pareto \
--forced-shutdown=1 --max-time=7200 --max-requests=0 --percentile=95 \
--mysql-user=root --mysql-socket=/tmp/mysql.sock --mysql-table-engine=innodb \
--oltp-read-only=off run

The base MySQL configuration (configuration A) appears below:

#####fixed innodb options
innodb_file_format = barracuda
innodb_buffer_pool_size = 4G
innodb_file_per_table = true
innodb_data_file_path = ibdata1:100M
innodb_flush_method = O_DIRECT
innodb_log_buffer_size = 128M
innodb_flush_log_at_trx_commit = 1
innodb_log_file_size = 1G
innodb_log_files_in_group = 2
innodb_purge_threads = 1
innodb_fast_shutdown = 1
#not innodb options (fixed)
back_log = 50
wait_timeout = 120
max_connections = 5000
max_prepared_stmt_count=500000
max_connect_errors = 10
table_open_cache = 10240
max_allowed_packet = 16M
binlog_cache_size = 16M
max_heap_table_size = 64M
sort_buffer_size = 4M
join_buffer_size = 4M
thread_cache_size = 1000
query_cache_size = 0
query_cache_type = 0
ft_min_word_len = 4
thread_stack = 192K
tmp_table_size = 64M
server-id = 101
key_buffer_size = 8M
read_buffer_size = 1M
read_rnd_buffer_size = 4M
bulk_insert_buffer_size = 8M
myisam_sort_buffer_size = 8M
myisam_max_sort_file_size = 10G
myisam_repair_threads = 1
myisam_recover 

(Source: ssdperformanceblog.com)

Cray brings top supercomputer tech to businesses for a mere $500,000

A Cray XC30-AC server rack.

Cray, the company that built the world’s fastest supercomputer, is bringing its next generation of supercomputer technology to regular ol’ business customers with systems starting at just $500,000.

The new XC30-AC systems announced today range in price from $500,000 to roughly $3 million, providing speeds of 22 to 176 teraflops. That’s just a fraction of the speed of the aforementioned world’s fastest supercomputer, the $60 million Titan, which clocks in at 17.59 petaflops. (A teraflop represents a thousand billion floating point operations per second, while a petaflop is a million billion operations per second.)

But in fact, the processors and interconnect used in the XC30-AC are a step up from those used to build Titan. The technology Cray is selling to smaller customers today could someday be used to build supercomputers even faster than Titan.

Titan uses a mix of AMD Opteron and Nvidia processors for a total of 560,640 cores, and uses Cray’s proprietary Gemini interconnect.

XC30-AC systems ship with Intel Xeon 5400 Series processors (CORRECTION: Cray had told us this comes with Xeon 5400, but the product documents say it actually comes with the Xeon E5-2600). It’s the first Intel-based supercomputer Cray is selling into smaller businesses, what it calls the “technical enterprise” market. (Cray’s previous systems for this market used AMD processors.) Perhaps more importantly, XC30-AC uses Aries, an even faster interconnect than the one found in Titan.

In short, it’s “the latest Intel architectures with the latest Cray interconnect” that will be installed in “future #1 and top 10 class systems,” Cray VP of Marketing Barry Bolding told Ars. “It’s like buying a car model where you’re getting exactly same engine you’re getting in a top-of-the-line BMW. The only thing that’s changing are some of the peripherals that make the system easier to fit into a data center and make it more affordable.” Compared to systems like Titan, Cray says XC30-AC has “physically smaller compute cabinets with 16 vertical blades per cabinet.”

Oil and gas firms or electronics companies performing complex simulations are among the potential customers for an XC30-AC supercomputer.

The XC30-AC is a follow-up to the XC30 systems, which are meant for larger customers and typically cost tens of millions of dollars. The "AC" refers to the fact that the smaller systems are air-cooled instead of water-cooled. The power requirements aren't as immense, and using air cooling makes it easier to install in a wider range of data centers.

XC30 systems scale up to 482 cabinets and 185,000 sockets with more than a million processor cores. The XC30-AC goes from one to eight cabinets, with each holding 16 blades of eight sockets each for 128 sockets in each cabinet. With an Intel 8-core Xeon processor in each socket, that adds up to 1,024 sockets and as many as 8,192 processor cores in an eight-cabinet system. A single cabinet with about 30TB of usable storage and 128 sockets would cost about $500,000, while an eight-cabinet system with 140TB of usable storage and 1,024 sockets hits the $3 million range.
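
As a quick, purely illustrative sanity check on the cabinet math above, using only the figures quoted in the article (16 blades per cabinet, 8 sockets per blade, 8 cores per socket):

BLADES_PER_CABINET = 16
SOCKETS_PER_BLADE = 8
CORES_PER_SOCKET = 8      # 8-core Intel Xeon, per the article

for cabinets in (1, 8):
    sockets = cabinets * BLADES_PER_CABINET * SOCKETS_PER_BLADE
    cores = sockets * CORES_PER_SOCKET
    print(f"{cabinets} cabinet(s): {sockets} sockets, {cores} cores")

# 1 cabinet(s): 128 sockets, 1024 cores
# 8 cabinet(s): 1024 sockets, 8192 cores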

To begin with, the XC30-AC supports only Intel Xeon processors with Sandy Bridge architecture. Those will be updated to server-class Ivy Bridge chips later on. Nvidia GPUs and Intel Xeon Phi chips will become available as co-processors by the end of the year, Cray said.

While XC30-AC systems will be smaller than traditional supercomputers, the Cray Aries interconnect makes them incredibly fast, Bolding said. He noted that Ethernet interconnects generally aren't fast enough for the world's fastest supercomputers. InfiniBand has really taken off, being used in about half of the top 100 systems and two of the top 10. But the top five systems in the world all use custom, proprietary interconnects such as Cray's or IBM's.

A look at the XC30 architecture. (Image: Cray)

Aries supports injection bandwidth of 10GB per second and 120 million get and put operations per second. "Injection bandwidth" is less than the full system's bandwidth. As a paper on the interconnect's architecture notes, the "global bandwidth of a full network exceeds the injection bandwidth—all of the traffic injected by a node can be routed to other groups."

Latency is another key factor, with Aries providing point to point latency of less than a microsecond, Bolding said. Moreover, latency remains strong when a cluster is going full blast. “When a system is very busy and sending messages from one end of the machine to another across a fully loaded network where everything’s working at once, Cray’s latencies are literally almost as good as they are in point to point. They go up to around two or three microseconds,” Bolding said.

The speed allows memory to be shared across processors. “No matter how many nodes you have you can actually treat it as if it’s a shared memory machine, every node can talk to every other node, directly into the memory of that compute node,” Bolding said. “That’s something that is very powerful for certain types of applications and programming models.”

Aries also features more sophisticated network congestion algorithms than the previous generation, preventing messages from getting backed up during times of high usage.

As for software, XC30-AC comes with the SUSE-based Cray Linux Environment also used in Titan, allowing customers to run almost any Linux application, Bolding said. While some of Cray’s other systems are designed to run any form of Linux a customer wants, the XC30-AC comes with software optimized for the system. This allows it to be ready to go shortly after it comes out of the box, instead of requiring a week of setup.

Who will buy an entry-level supercomputer?

Cray isn’t the financial success it once was, with its latest earnings announcement showing a year-over-year drop in quarterly revenue from $112.3 million to $79.5 million. The company also experienced a net loss of $7.6 million. Cray fared better in fiscal 2012, with full-year revenue of $421.1 million and net income of $161.2 million.

High-performance computing revenue is on the rise, with supercomputing products ($500,000 and up) leading the way, according to IDC. HPC and supercomputing revenue is growing faster than the server market as a whole.

Cray is hoping to take its share of that revenue by selling both the smallest and largest supercomputer-class systems. While the XC30-AC was just announced today, it's been shipping for a few weeks. Early customers include an unnamed "Fortune 100" commercial electronics firm whose R&D department needs a powerful machine for simulations.

The oil and gas industry has a need for such machines to model oil fields. Biotechnology, engineering, and various manufacturing industries may provide interested customers as well, Cray says.

We’ve written about the trend of Amazon and other cloud services being used for supercomputing, with one-off jobs costing up to several thousand dollars an hour. Those are generally for customers that have only occasional need for a supercomputer, however. Many businesses would use a supercomputer often enough that owning one is more cost-efficient. Cray is betting a lot of Fortune 500 companies and universities that can’t afford giant clusters costing tens of millions of dollars will be interested in systems like the XC30-AC.

“The complexity of problems that mid-range customers, technical enterprise customers face today are becoming so complex that they do need a tightly integrated supercomputer,” Bolding said. “They can’t always get away with a more conventional Ethernet cluster.”

(via arstechnica.com)

 

Titan Knocks Off Sequoia as Top Supercomputer

In the battle of the DOE labs, Oak Ridge Lab’s Titan supercomputer has taken the title from the former TOP500 champ, Lawrence Livermore’s Sequoia. The GPU-charged Titan, using the new NVIDIA K20X-equipped XK7 blades from Cray, delivered 17.6 petaflops to Sequoia’s 16.3 petaflops on Linpack, the sole metric for TOP500 rankings.

Titan looks like it will also take the energy-efficiency title from Sequoia and the Blue Gene/Q platform. The Oak Ridge super delivers 2,120 megaflops/watt, besting Sequoia’s current mark of 2,100 megaflops/watt. The results, however, won’t be official until the Green500 list is announced later this week.

Despite being knocked out of the top spot, IBM machines still claim 6 of the top 10 systems:

  1. 17.6 petaflops, Titan (Cray), United States
  2. 16.3 petaflops, Sequoia (IBM), United States
  3. 10.5 petaflops, K computer (Fujitsu), Japan
  4. 8.2 petaflops, Mira (IBM), United States
  5. 4.1 petaflops, JUQUEEN (IBM), Germany
  6. 2.9 petaflops, SuperMUC (IBM), Germany
  7. 2.7 petaflops, Stampede (Dell), United States
  8. 2.6 petaflops, Tianhe-1A (NUDT), China
  9. 1.7 petaflops, Fermi (IBM), Italy
  10. 1.5 petaflops, DARPA Trial Subset (IBM), United States

Although turnover was minimal, the aggregate performance at the top is growing rapidly. These systems now represent more than 68 petaflops; a year ago those top 10 machines encompassed just over 22 petaflops.
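
That aggregate figure checks out against the rankings above; a quick tally of the published Linpack numbers:

top10_petaflops = [17.6, 16.3, 10.5, 8.2, 4.1, 2.9, 2.7, 2.6, 1.7, 1.5]
print(f"Top 10 aggregate: {sum(top10_petaflops):.1f} petaflops")   # 68.1 petaflops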

A nice chunk of that is thanks to Titan, of course, but the ORNL super also brings GPU-accelerated supercomputing back to the head of the list. The last time such a machine held that title was November 2010, when China's Tianhe-1A system was the number one machine. Despite the ascendance of Titan, HPC accelerators still power a relatively small portion of the list: currently 62 systems.

But that’s four more than just six months ago, and with the launch of the teraflop accelerators this week from Intel (Knights Corner), NVIDIA (Kepler K20 GPUs), and AMD (FirePro S10000), those numbers will almost certainly grow. When you can buy a teraflop on a PCIe card for a few thousand dollars, it becomes a lot easier to string together a petaflop machine. While CPU-only supercomputers still have a lot of life in them, the smart money is on these vector-heavy coprocessors to expand the number of petaflop systems in the world.

Besides Titan, new to the top 10 are Dell's Stampede and IBM's DARPA Trial Subset machine. The Stampede machine, installed at the Texas Advanced Computing Center (TACC), debuts Intel's Knights Corner manycore accelerator, while IBM's DARPA Trial Subset is an implementation of the Power7-based PERCS architecture, developed in conjunction with the High Productivity Computing Systems (HPCS) program. JUQUEEN is not new to the top 10, but it has tripled its capacity since June, moving it from number 8 to number 5.

Stampede could also make its way further up the list by next June. The TACC super is slated to reach 10 peak petaflops when the system is fully deployed in 2013, which should push its Linpack mark to about 6.7 petaflops. By then, though, there is likely to be even more competition in the multi-petaflop realm.

On the interconnect front, InfiniBand-based supercomputers continue to steal share from Ethernet. Over the last six months, 15 InfiniBand systems were added, for a total of 226, while Ethernet lost 19 machines, reducing its share to 188. At the top of the list, though, custom interconnects rule: of the top 10, only one system (Stampede) uses InfiniBand; the rest employ custom interconnects of various stripes from Cray, IBM, Fujitsu, and China's NUDT.

The one TOP500 element that remained fairly constant this time around was the geographical distribution of Linpack FLOPS. The US is still the dominant nation with 251 systems (down one from last June). China is in second place with 72 systems (down two from June). The European superpowers (the UK, France, and Germany) have reached parity, more or less, with 24, 21, and 20 systems, respectively.

Perhaps the most significant development on this latest list is the growth of petascale supercomputers, which now constitute the top 23 systems. That's up from the top 10 just a year ago. It's projected that by 2015, all 500 machines will be a petaflop or greater.

(via HPCwire.com)