Virident vCache vs. FlashCache

Ease of basic installation. The setup process was simply a matter of installing two RPMs and running a couple of commands to enable vCache on the PCIe flash card (a Virident FlashMAX II) and set up the cache device with the command-line utilities supplied with one of the RPMs. Moreover, the vCache software is built in to the Virident driver, so there is no additional module to install. FlashCache, on the other hand, requires building a separate kernel module in addition to whatever flash memory driver you’ve already had to install, and then further configuration requires modification to assorted sysctls. I would also argue that the vCache documentation is superior. Winner: vCache.

Ease of post-setup modification / advanced installation. Many of the FlashCache device parameters can be easily modified by echoing the desired value to the appropriate sysctl setting; with vCache, there is a command-line binary which can modify many of the same parameters, but doing so requires a cache flush, detach, and reattach. Winner: FlashCache.

Operational Flexibility: Both solutions share many features here; both of them allow whitelisting and blacklisting of PIDs or simply running in a “cache everything” mode. Both of them have support for not caching sequential IO, adjusting the dirty page threshold, flushing the cache on demand, or having a time-based cache flushing mechanism, but some of these features operate differently with vCache than with FlashCache. For example, a manual cache flush with vCache is a blocking operation. With FlashCache, echoing “1” to the do_sync sysctl of the cache device triggers a cache flush, but it happens in the background, and while countdown messages are written to syslog as the operation proceeds, the device never reports that it’s actually finished. I think both kinds of flushing are useful in different situations, and I’d like to see a non-blocking background flush in vCache, but if I had to choose one or the other, I’ll take blocking and modal over fire-and-forget any day. FlashCache does have the nice ability to switch between FIFO and LRU for its flushing algorithm; vCache does not. This is something that could prove useful in certain situations. Winner: FlashCache.
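As a rough illustration of the FlashCache side of this, the sketch below writes “1” to a cache device's do_sync sysctl from Python. The /proc/sys/dev/flashcache/<cache device>/do_sync path and the example device name are assumptions based on how sysctl names usually map onto /proc/sys, so check them against your own installation. As noted above, the write returns immediately and says nothing about when the flush completes.

```python
from pathlib import Path


def trigger_flashcache_sync(cache_device, sysctl_root="/proc/sys/dev/flashcache"):
    """Ask FlashCache to start a background flush of `cache_device`.

    `sysctl_root` and the per-device directory layout are assumptions, not
    something taken from the post. Returns the path that was written.
    """
    target = Path(sysctl_root) / cache_device / "do_sync"
    if not target.exists():
        # Refuse to create a stray file if the sysctl is not where we expect.
        raise FileNotFoundError(f"no do_sync sysctl at {target}")
    target.write_text("1\n")  # returns at once; the flush runs in the background
    return target


# Hypothetical usage (requires root and a real FlashCache device):
# trigger_flashcache_sync("fioa+sdb")
```

Because the flush is fire-and-forget, a caller that needs to know when it has finished would still have to watch syslog or the cache statistics.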

Operational Monitoring: Both solutions offer plenty of statistics; the main difference is that FlashCache stats can be pulled from /proc but vCache stats have to be retrieved by running the vgc-vcache-monitor command. Personally, I prefer “cat /proc/something” but I’m not sure that’s sufficient to award this category to FlashCache. Winner: None.

Time-based Flushing: This wouldn’t seem like it should be a separate category, but because the behavior seems to be so different between the two cache solutions, I’m listing it here. The vCache manual indicates that “flush period” specifies the time after which dirty blocks will be written to the backing store, whereas FlashCache has a setting called “fallow_delay”, defined in the documentation as the time period before “idle” dirty blocks are cleaned from the cache device. It is not entirely clear whether or not these mechanisms operate in the same fashion, but based on the documentation, it appears that they do not. I find the vCache implementation more useful than the one present in FlashCache. Winner: vCache.

Although nobody likes a tie, if you add up the scores, usability is a 2-2-1 draw between vCache and FlashCache. There are things that I really liked better with FlashCache, and there are other things that I thought vCache did a much better job with. If I absolutely must pick a winner in terms of usability, then I’d give a slight edge to FlashCache due to configuration flexibility, but if the GA release of vCache added some of FlashCache’s additional configuration options and exposed statistics via /proc, I’d vote in the other direction.

Disclosure: The research and testing conducted for this post were sponsored by Virident.

First, some background information. All tests were conducted on Percona’s Cisco UCS C250 test machine, and both the vCache and FlashCache tests used the same 2.2TB Virident FlashMAX II as the cache storage device. EXT4 is the filesystem, and CentOS 6.4 the operating system, although the pre-release modules I received from Virident required the use of the CentOS 6.2 kernel, 2.6.32-220, so that was the kernel in use for all of the benchmarks on both systems. The benchmark tool used was sysbench 0.5 and the version of MySQL used was Percona Server 5.5.30-rel30.1-465. Each test was allowed to run for 7200 seconds, and the first 3600 seconds were discarded as warmup time; the remaining 3600 seconds were averaged into 10-second intervals. All tests were conducted with approximately 78GiB of data (32 tables, 10M rows each) and a 4GiB buffer pool. The cache devices were flushed to disk immediately prior to and immediately following each test run.
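The warmup-and-averaging step can be sketched in a few lines of Python. This is only an illustration of the procedure described above, not the script actually used for the post: it assumes the per-second tps values have already been extracted from the sysbench output into a list, and the function name is made up for the example.

```python
from statistics import mean


def windowed_average(samples, warmup=3600, window=10):
    """Drop the first `warmup` one-second samples, then average the rest
    in consecutive `window`-second buckets."""
    steady = samples[warmup:]
    return [mean(steady[i:i + window]) for i in range(0, len(steady), window)]


if __name__ == "__main__":
    # Synthetic stand-in for a 7200-second run: a flat warmup hour followed
    # by an hour of values that cycle from 600 to 609.
    tps = [100.0] * 3600 + [float(600 + (i % 10)) for i in range(3600)]
    buckets = windowed_average(tps)
    print(len(buckets), buckets[0])
```

With a 3600-second warmup and 10-second windows, a 7200-second run yields 360 plotted points per test.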

With that out of the way, let’s look at some numbers.

vCache vs. vCache – MySQL parameter testing

The first test was designed to look solely at vCache performance under some different sets of MySQL configuration parameters. For example, given that the front-end device is a very fast PCIe SSD, would it make more sense to configure MySQL as if it were using SSD storage or to just use an optimized HDD storage configuration? After creating a vCache device with the default configuration, I started with a baseline HDD configuration for MySQL (configuration A, listed at the bottom of this post) and then tried three additional sets of experiments. First, the baseline configuration plus:

innodb_read_io_threads = 16
innodb_write_io_threads = 16

We call this configuration B. The next one contained four SSD-specific optimizations based partially on some earlier work that I’d done with this Virident card (configuration C):

innodb_io_capacity = 30000
innodb_adaptive_flushing_method = keep_average
innodb_flush_neighbor_pages=none
innodb_max_dirty_pages_pct = 60

And then finally, a fourth test (configuration D) which combined the parameter changes from tests B and C. The graph below shows the sysbench throughput (tps) for these four configurations:
vcache_trx_params
As we can see, all of the configuration options produce numbers that, in the absence of outliers, are roughly identical, but it’s configuration C (shown in the graph as the blue line – SSD config) which shows the most consistent performance. The others all have assorted performance drops scattered throughout the graph. We see the exact same pattern when looking at transaction latency; the baseline numbers are roughly identical for all four configurations, but configuration C avoids the spikes and produces a very constant and predictable result.
vcache_response_params
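For anyone reproducing the four configurations, one way to keep them straight is to express B, C, and D as deltas on top of the baseline. The sketch below does that with plain dictionaries. The parameter values are the ones listed above; the two baseline entries are only a small excerpt of configuration A, and the helper names are invented for the example.

```python
# Excerpt of configuration A (the full baseline is listed at the end of the post).
BASELINE_A = {
    "innodb_buffer_pool_size": "4G",
    "innodb_flush_method": "O_DIRECT",
}

# Configuration B: more InnoDB IO threads.
DELTA_B = {
    "innodb_read_io_threads": 16,
    "innodb_write_io_threads": 16,
}

# Configuration C: the four SSD-specific settings.
DELTA_C = {
    "innodb_io_capacity": 30000,
    "innodb_adaptive_flushing_method": "keep_average",
    "innodb_flush_neighbor_pages": "none",
    "innodb_max_dirty_pages_pct": 60,
}


def build_config(*deltas):
    """Return the baseline with each delta applied in order."""
    config = dict(BASELINE_A)
    for delta in deltas:
        config.update(delta)
    return config


def render_mycnf(config):
    """Render a config dict as my.cnf-style 'name = value' lines."""
    return "\n".join(f"{name} = {value}" for name, value in config.items())


CONFIGS = {
    "A": build_config(),
    "B": build_config(DELTA_B),
    "C": build_config(DELTA_C),
    "D": build_config(DELTA_B, DELTA_C),  # B and C combined
}

if __name__ == "__main__":
    print(render_mycnf(CONFIGS["D"]))
```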

vCache vs. FlashCache – the basics

Once I’d determined that configuration C appeared to produce the best results, I moved on to reviewing FlashCache performance versus that of vCache, and for purposes of comparison I also included a “no cache” test run using the base HDD MySQL configuration. Given the apparent differences in time-based flushing in vCache and FlashCache, both cache devices were set up so that time-based flushing was disabled. Also, both devices were set up such that all IO would be cached (i.e., no special treatment of sequential writes) and with a 50% dirty page threshold. Again, for comparison purposes, I also include the numbers from the vCache test where the time-based flushing is enabled.
vcache_fcache_trx_params
As we’d expect, the HDD-only solution barely registered on the graph. With a buffer pool that’s much smaller than the working set, the no-cache approach is fairly crippled and ineffectual. FlashCache does substantially better, coming in at an average of around 600 tps, but vCache is about 3x better. The interesting item here is that vCache with time-based flushing enabled actually produces better and more consistent performance than vCache without time-based flushing, but even at its worst, the vCache test without time-based flushing still outperforms FlashCache by over 2x, on average.

Looking just at sysbench reads, vCache with time-based flushing consistently hit about 27000 per second, whereas without time-based flushing it averaged about 12500. FlashCache came in around 7500 or so. Sysbench writes came in just under 8000 for vCache + time-based flushing, around 6000 for vCache without time-based flushing, and somewhere around 2500 for FlashCache.
vcache_fcache_read_write
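To put those read and write rates side by side, the following sketch expresses each one as a multiple of the FlashCache figure. The inputs are the approximate per-second numbers quoted above (the post gives them as "about" and "around"), so the ratios are only as precise as those estimates.

```python
# Approximate sysbench operations per second, as quoted in the text.
READS = {
    "vCache + time-based flushing": 27000,
    "vCache, no time-based flushing": 12500,
    "FlashCache": 7500,
}
WRITES = {
    "vCache + time-based flushing": 8000,
    "vCache, no time-based flushing": 6000,
    "FlashCache": 2500,
}


def relative_to(rates, baseline="FlashCache"):
    """Express every rate as a multiple of the baseline's rate."""
    base = rates[baseline]
    return {name: round(rate / base, 2) for name, rate in rates.items()}


if __name__ == "__main__":
    for label, rates in (("reads", READS), ("writes", WRITES)):
        for name, ratio in relative_to(rates).items():
            print(f"{label:6} {name:32} {ratio:.2f}x FlashCache")
```

On these figures, vCache with time-based flushing handles roughly 3.6 times the reads and 3.2 times the writes of FlashCache.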

We can take a look at some vmstat data to see what’s actually happening on the system during all these various tests. Clockwise from the top left in the next graph, we have “no cache”, “FlashCache”, “vCache with no time-based flushing”, and “vCache with time-based flushing.” As the images demonstrate, the no-cache system is being crushed by IO wait. FlashCache and vCache both show improvements, but it’s not until we get to vCache with the time-based flushing that we see some nice, predictable, constant performance.
cpu-usage-all

So why is it the case that vCache with time-based flushing appears to outperform all the rest? My hypothesis here is that time-based flushing allows the backing store to be written to at a more constant and, potentially, submaximal, rate compared to dirty-page-threshold flushing, which kicks in at a given level and then attempts to flush as quickly as possible to bring the dirty pages back within acceptable bounds. This is, however, only a hypothesis.

vCache vs. FlashCache – dirty page threshold

Finally, we examine the impact of a couple of different dirty-page ratios on device performance, since this is the only parameter which can be reliably varied between the two in the same way. The following graph shows sysbench OLTP performance for FlashCache vs. vCache with a 10% dirty threshold versus the same metrics at a 50% dirty threshold. Time-based flushing has been disabled. In this case, both systems produce better performance when the dirty-page threshold is set to 50%, but once again, vCache at 10% outperforms FlashCache at 10%.

vcache-dirty_trx_params

The one interesting item here is that vCache actually appears to get *better* over time. I’m not entirely sure why that’s the case, or at what point the performance would level off, since these tests were all run for only 2 hours. Still, I think the overall results speak for themselves: even with a vCache volume where the dirty ratio is only 10%, such as might be the case where a deployment has a massive data set size in relation to both the working set and the cache device size, the numbers are encouraging.

Conclusion

Overall, I think the graphs speak for themselves. When the working set outstrips the available buffer pool memory but still fits into the cache device, vCache shines. Compared to a deployment with no SSD cache whatsoever, FlashCache still does quite well, massively outperforming the HDD-only setup, but it doesn’t even really come close to the numbers obtained with vCache. There may be ways to adjust the FlashCache configuration to produce better or more consistent results, or results that are more in line with the numbers put up by vCache, but when we consider that overall usability was one of the evaluation points and combine that with the fact that the best vCache performance results were obtained with the default vCache configuration, I think vCache can be declared the clear winner.

Base MySQL & Benchmark Configuration

All benchmarks were conducted with the following:

sysbench --num-threads=32 --test=tests/db/oltp.lua --oltp_tables_count=32 \
--oltp-table-size=10000000 --rand-init=on --report-interval=1 --rand-type=pareto \
--forced-shutdown=1 --max-time=7200 --max-requests=0 --percentile=95 \
--mysql-user=root --mysql-socket=/tmp/mysql.sock --mysql-table-engine=innodb \
--oltp-read-only=off run

The base MySQL configuration (configuration A) appears below:

#####fixed innodb options
innodb_file_format = barracuda
innodb_buffer_pool_size = 4G
innodb_file_per_table = true
innodb_data_file_path = ibdata1:100M
innodb_flush_method = O_DIRECT
innodb_log_buffer_size = 128M
innodb_flush_log_at_trx_commit = 1
innodb_log_file_size = 1G
innodb_log_files_in_group = 2
innodb_purge_threads = 1
innodb_fast_shutdown = 1
#not innodb options (fixed)
back_log = 50
wait_timeout = 120
max_connections = 5000
max_prepared_stmt_count=500000
max_connect_errors = 10
table_open_cache = 10240
max_allowed_packet = 16M
binlog_cache_size = 16M
max_heap_table_size = 64M
sort_buffer_size = 4M
join_buffer_size = 4M
thread_cache_size = 1000
query_cache_size = 0
query_cache_type = 0
ft_min_word_len = 4
thread_stack = 192K
tmp_table_size = 64M
server-id = 101
key_buffer_size = 8M
read_buffer_size = 1M
read_rnd_buffer_size = 4M
bulk_insert_buffer_size = 8M
myisam_sort_buffer_size = 8M
myisam_max_sort_file_size = 10G
myisam_repair_threads = 1
myisam_recover 

(Source: ssdperformanceblog.com)

Cray brings top supercomputer tech to businesses for a mere $500,000


A Cray XC30-AC server rack.

Cray, the company that built the world’s fastest supercomputer, is bringing its next generation of supercomputer technology to regular ol’ business customers with systems starting at just $500,000.

The new XC30-AC systems announced today range in price from $500,000 to roughly $3 million, providing speeds of 22 to 176 teraflops. That’s just a fraction of the speed of the aforementioned world’s fastest supercomputer, the $60 million Titan, which clocks in at 17.59 petaflops. (A teraflop represents a thousand billion floating point operations per second, while a petaflop is a million billion operations per second.)
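A quick back-of-the-envelope calculation shows how those prices and speeds relate. The sketch below uses only the figures quoted above and converts Titan's petaflops to teraflops. Treat it as a rough comparison under an assumption: the article does not say whether the XC30-AC numbers are peak or measured, so the figures may not be like for like.

```python
TERAFLOPS_PER_PETAFLOP = 1000

# (price in US dollars, speed in teraflops), taken from the article.
SYSTEMS = {
    "XC30-AC, entry level": (500_000, 22),
    "XC30-AC, largest": (3_000_000, 176),
    "Titan": (60_000_000, 17.59 * TERAFLOPS_PER_PETAFLOP),
}


def dollars_per_teraflop(price_usd, teraflops):
    """Purchase price divided by quoted speed."""
    return price_usd / teraflops


def fraction_of_titan(teraflops):
    """Quoted speed as a fraction of Titan's 17.59 petaflops."""
    return teraflops / SYSTEMS["Titan"][1]


if __name__ == "__main__":
    for name, (price, tflops) in SYSTEMS.items():
        print(f"{name:22} ${dollars_per_teraflop(price, tflops):>9,.0f} per teraflop, "
              f"{fraction_of_titan(tflops):.2%} of Titan")
```

On these numbers, even the largest XC30-AC delivers about 1 percent of Titan's speed.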

But in fact, the processors and interconnect used in XC30-AC are a step up from those used to build Titan. The technology Cray is selling to smaller customers today could someday be used to build supercomputers even faster than Titan.

Titan uses a mix of AMD Opteron and Nvidia processors for a total of 560,640 cores, and uses Cray’s proprietary Gemini interconnect.

XC30-AC systems ship with Intel Xeon 5400 Series processors (CORRECTION: Cray had told us this comes with Xeon 5400, but the product documents say it actually comes with the Xeon E5-2600). It’s the first Intel-based supercomputer Cray is selling into smaller businesses, what it calls the “technical enterprise” market. (Cray’s previous systems for this market used AMD processors.) Perhaps more importantly, XC30-AC uses Aries, an even faster interconnect than the one found in Titan.

In short, it’s “the latest Intel architectures with the latest Cray interconnect” that will be installed in “future #1 and top 10 class systems,” Cray VP of Marketing Barry Bolding told Ars. “It’s like buying a car model where you’re getting exactly same engine you’re getting in a top-of-the-line BMW. The only thing that’s changing are some of the peripherals that make the system easier to fit into a data center and make it more affordable.” Compared to systems like Titan, Cray says XC30-AC has “physically smaller compute cabinets with 16 vertical blades per cabinet.”

Oil and gas firms or electronics companies performing complex simulations are among the potential customers for an XC30-AC supercomputer.

XC30-AC is a followup to the XC30 systems which are meant for larger customers and typically cost tens of millions of dollars. The “AC” refers to the fact that the smaller systems are air-cooled instead of water-cooled. The power requirements aren’t as immense, and using air cooling makes it easier to install in a wider range of data centers.

XC30 systems scale up to 482 cabinets and 185,000 sockets with more than a million processor cores. The XC30-AC goes from one to eight cabinets, with each holding 16 blades of eight sockets each for 128 sockets in each cabinet. With an Intel 8-core Xeon processor in each socket, that adds up to 1,024 sockets and as many as 8,192 processor cores in an eight-cabinet system. A single cabinet with about 30TB of usable storage and 128 sockets would cost about $500,000, while eight-cabinet systems with 140TB of usable storage and 1,024 sockets hit the $3 million range.
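The socket and core counts follow directly from the per-cabinet layout. As a small worked example (the constant and function names are just labels for the figures in the paragraph above):

```python
BLADES_PER_CABINET = 16
SOCKETS_PER_BLADE = 8
CORES_PER_SOCKET = 8  # one 8-core Xeon per socket


def sockets(cabinets):
    """Total processor sockets in a system with the given number of cabinets."""
    return cabinets * BLADES_PER_CABINET * SOCKETS_PER_BLADE


def cores(cabinets):
    """Total processor cores, assuming every socket holds an 8-core chip."""
    return sockets(cabinets) * CORES_PER_SOCKET


if __name__ == "__main__":
    for cabinets in (1, 8):
        print(cabinets, "cabinet(s):", sockets(cabinets), "sockets,", cores(cabinets), "cores")
```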

To begin with, the XC30-AC supports only Intel Xeon processors with Sandy Bridge architecture. Those will be updated to server-class Ivy Bridge chips later on. Nvidia GPUs and Intel Xeon Phi chips will become available as co-processors by the end of the year, Cray said.

While XC30-AC systems will be smaller than traditional supercomputers, the Cray Aries interconnect makes it incredibly fast, Bolding said. He noted that Ethernet interconnects generally aren’t fast enough for the world’s fastest supercomputers. InfiniBand has really taken off, being used in about half of the top 100 systems and two of the top 10. But the top five systems in the world all use custom, proprietary interconnects such as Cray’s or IBM’s.


A look at the XC30 architecture.
Cray

Aries supports injection bandwidth of 10Gbps and 120 million get and put operations per second. “Injection bandwidth” is less than the full system’s bandwidth. As a paper on the interconnect’s architecture notes, the “global bandwidth of a full network exceeds the injection bandwidth—all of the traffic injected by a node can be routed to other groups.”

Latency is another key factor, with Aries providing point to point latency of less than a microsecond, Bolding said. Moreover, latency remains strong when a cluster is going full blast. “When a system is very busy and sending messages from one end of the machine to another across a fully loaded network where everything’s working at once, Cray’s latencies are literally almost as good as they are in point to point. They go up to around two or three microseconds,” Bolding said.

The speed allows memory to be shared across processors. “No matter how many nodes you have you can actually treat it as if it’s a shared memory machine, every node can talk to every other node, directly into the memory of that compute node,” Bolding said. “That’s something that is very powerful for certain types of applications and programming models.”

Aries also features more sophisticated network congestion algorithms than the previous generation, preventing messages from getting backed up during times of high usage.

As for software, XC30-AC comes with the SUSE-based Cray Linux Environment also used in Titan, allowing customers to run almost any Linux application, Bolding said. While some of Cray’s other systems are designed to run any form of Linux a customer wants, the XC30-AC comes with software optimized for the system. This allows it to be ready to go shortly after it comes out of the box, instead of requiring a week of setup.

Who will buy an entry-level supercomputer?

Cray isn’t the financial success it once was, with its latest earnings announcement showing a year-over-year drop in quarterly revenue from $112.3 million to $79.5 million. The company also experienced a net loss of $7.6 million. Cray fared better in fiscal 2012, with full-year revenue of $421.1 million and net income of $161.2 million.

High-performance computing revenue is on the rise, with supercomputing products ($500,000 and up) leading the way according to IDC. HPC and supercomputing revenue is growing faster than the server market as a whole.

Cray is hoping to take its share of that revenue by selling both the smallest and largest supercomputer-class systems. While the XC30-AC was just announced today, it’s been shipping for a few weeks. Early customers include an unnamed “Fortune 100” commercial electronics firm whose R&D department needs a powerful machine for simulations.

The oil and gas industry has a need for such machines to model oil fields. Biotechnology, engineering, and various manufacturing industries may provide interested customers as well, Cray says.

We’ve written about the trend of Amazon and other cloud services being used for supercomputing, with one-off jobs costing up to several thousand dollars an hour. Those are generally for customers that have only occasional need for a supercomputer, however. Many businesses would use a supercomputer often enough that owning one is more cost-efficient. Cray is betting a lot of Fortune 500 companies and universities that can’t afford giant clusters costing tens of millions of dollars will be interested in systems like the XC30-AC.

“The complexity of problems that mid-range customers, technical enterprise customers face today are becoming so complex that they do need a tightly integrated supercomputer,” Bolding said. “They can’t always get away with a more conventional Ethernet cluster.”

(via arstechnica.com)

 

Titan Knocks Off Sequoia as Top Supercomputer

In the battle of the DOE labs, Oak Ridge Lab’s Titan supercomputer has taken the title from the former TOP500 champ, Lawrence Livermore’s Sequoia. The GPU-charged Titan, using the new NVIDIA K20X-equipped XK7 blades from Cray, delivered 17.6 petaflops to Sequoia’s 16.3 petaflops on Linpack, the sole metric for TOP500 rankings.

Titan looks like it will also take the energy-efficiency title from Sequoia and the Blue Gene/Q platform. The Oak Ridge super delivers 2,120 megaflops/watt, besting Sequoia’s current mark of 2,100 megaflops/watt. The results, however, won’t be official until the Green500 list is announced later this week.

Despite being knocked out of the top spot, IBM machines still claim 6 of the top 10 systems:

  1. 17.6 petaflops, Titan (Cray), United States
  2. 16.3 petaflops, Sequoia (IBM), United States
  3. 10.5 petaflops, K computer (Fujitsu), Japan
  4. 8.2 petaflops, Mira (IBM), United States
  5. 4.1 petaflops, JUQUEEN (IBM), Germany
  6. 2.9 petaflops, SuperMUC (IBM), Germany
  7. 2.7 petaflops, Stampede (Dell), United States
  8. 2.6 petaflops, Tianhe-1A (NUDT), China
  9. 1.7 petaflops, Fermi (IBM), Italy
  10. 1.5 petaflops, DARPA Trial Subset (IBM), United States

Although turnover was minimal, the aggregate performance at the top is growing rapidly. These systems now represent more than 68 petaflops; a year ago those top 10 machines encompassed just over 22 petaflops.
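The aggregate figure and the IBM count can both be checked against the list above. The sketch below simply re-enters the ten rows and adds them up:

```python
from collections import Counter

# (system, vendor, Linpack petaflops), as given in the top 10 list.
TOP10 = [
    ("Titan", "Cray", 17.6),
    ("Sequoia", "IBM", 16.3),
    ("K computer", "Fujitsu", 10.5),
    ("Mira", "IBM", 8.2),
    ("JUQUEEN", "IBM", 4.1),
    ("SuperMUC", "IBM", 2.9),
    ("Stampede", "Dell", 2.7),
    ("Tianhe-1A", "NUDT", 2.6),
    ("Fermi", "IBM", 1.7),
    ("DARPA Trial Subset", "IBM", 1.5),
]


def total_petaflops(systems):
    """Sum of the Linpack results, in petaflops."""
    return sum(pflops for _, _, pflops in systems)


def systems_per_vendor(systems):
    """Number of listed systems built by each vendor."""
    return Counter(vendor for _, vendor, _ in systems)


if __name__ == "__main__":
    print(f"{total_petaflops(TOP10):.1f} petaflops in total")
    print(systems_per_vendor(TOP10).most_common())
```

The rows sum to 68.1 petaflops, in line with the "more than 68 petaflops" quoted, and six of the ten systems are IBM machines.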

A nice chunk of that is thanks to Titan, of course, but the ORNL super also brings GPU-accelerated supercomputing back to the head of the list. The last time such a machine held that title was November 2010, when China’s Tianhe-1A system was the number one machine. Despite the ascendance of Titan, HPC accelerators still constitute a relatively small portion of the list: currently 62 systems.

But that’s four more than just six months ago, and with the launch of the teraflop accelerators this week from Intel (Knights Corner), NVIDIA (Kepler K20 GPUs), and AMD (FirePro S10000), those numbers will almost certainly grow. When you can buy a teraflop on a PCIe card for a few thousand dollars, it becomes a lot easier to string together a petaflop machine. While CPU-only supercomputers still have a lot of life in them, the smart money is on these vector-heavy coprocessors to expand the number of petaflop systems in the world.

Besides Titan, new to the top 10 are Dell’s Stampede and IBM’s DARPA Trial Subset machine. The Stampede machine, installed at the Texas Advanced Computing Center (TACC), debuts Intel’s Knights Corner manycore accelerator, while IBM’s DARPA Trial Subset is an implementation of the Power7-based PERCS architecture, developed in conjunction with the High Productivity Computing Systems (HPCS) program. JUQUEEN is not new to the top 10, but it has tripled its capacity since June, moving it from number 8 to number 5.

Stampede could also make its way further up the list by next June. The TACC super is slated to reach 10 peak petaflops when the system is fully deployed in 2013, which should get the Linpack mark to about 6.7 petaflops. By then though, there is likely to be even more competition in the multi-petaflops realm.

On the interconnect front, InfiniBand-based supercomputers continue to steal share from Ethernet. Over the last six months, 15 InfiniBand systems were added, for a total of 226, while Ethernet lost 19 machines, reducing its share to 188. At the top of the list though, custom interconnects rule. On the top 10, there is but one that uses InfiniBand (Stampede); the rest employ custom interconnects of various stripes from Cray, IBM, Fujitsu, and China’s NUDT.
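Expressed as shares of the 500-system list, those counts work out as follows. The sketch assumes the "added" and "lost" figures are net changes since the previous list, which the article implies but does not state outright.

```python
TOTAL_SYSTEMS = 500

CURRENT = {"InfiniBand": 226, "Ethernet": 188}
CHANGE = {"InfiniBand": +15, "Ethernet": -19}  # assumed to be net changes


def share_percent(count, total=TOTAL_SYSTEMS):
    """Share of the list, as a percentage."""
    return 100.0 * count / total


def previous_count(name):
    """System count on the previous list, six months earlier."""
    return CURRENT[name] - CHANGE[name]


if __name__ == "__main__":
    for name, count in CURRENT.items():
        print(f"{name}: {previous_count(name)} -> {count} systems "
              f"({share_percent(count):.1f}% of the list)")
```

On that reading, the two interconnects were nearly level six months earlier, at 211 and 207 systems.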

The one TOP500 element that remained fairly constant this time around was the geographical distribution of Linpack FLOPS. The US is still the dominant nation with 251 systems (down one from last June). China is in second place with 72 systems (down two from June). The European superpowers (UK, France, and Germany) have reached parity, more or less, with 24, 21, and 20 systems, respectively.

Perhaps the most significant development on this latest list is the growth of petascale supercomputers, which currently constitute the top 23 systems. That’s up from the top 10 just a year ago. It’s projected that by 2015, all 500 machines will be a petaflop or greater.

(via HPCwire.com)