Virident vCache vs. FlashCache

Ease of basic installation. The setup process was simply a matter of installing two RPMs and running a couple of commands to enable vCache on the PCIe flash card (a Virident FlashMAX II) and set up the cache device with the command-line utilities supplied with one of the RPMs. Moreover, the vCache software is built in to the Virident driver, so there is no additional module to install. FlashCache, on the other hand, requires building a separate kernel module in addition to whatever flash memory driver you’ve already had to install, and then further configuration requires modification to assorted sysctls. I would also argue that the vCache documentation is superior. Winner: vCache.

Ease of post-setup modification / advanced installation. Many of the FlashCache device parameters can be easily modified by echoing the desired value to the appropriate sysctl setting; with vCache, there is a command-line binary which can modify many of the same parameters, but doing so requires a cache flush, detach, and reattach. Winner: FlashCache.

Operational Flexibility: Both solutions share many features here; both of them allow whitelisting and blacklisting of PIDs or simply running in a “cache everything” mode. Both of them have support for not caching sequential IO, adjusting the dirty page threshold, flushing the cache on demand, or having a time-based cache flushing mechanism, but some of these features operate differently with vCache than with FlashCache. For example, when doing a manual cache flush with vCache, this is a blocking operation. With FlashCache, echoing “1″ to the do_sync sysctl of the cache device triggers a cache flush, but it happens in the background, and while countdown messages are written to syslog as the operation proceeds, the device never reports that it’s actually finished. I think both kinds of flushing are useful in different situations, and I’d like to see a non-blocking background flush in vCache, but if I had to choose one or the other, I’ll take blocking and modal over fire-and-forget any day. FlashCache does have the nice ability to switch between FIFO and LRU for its flushing algorithm; vCache does not. This is something that could prove useful in certain situations. Winner: FlashCache.

Operational Monitoring: Both solutions offer plenty of statistics; the main difference is that FlashCache stats can be pulled from /proc but vCache stats have to be retrieved by running the vgc-vcache-monitor command. Personally, I prefer “cat /proc/something” but I’m not sure that’s sufficient to award this category to FlashCache. Winner: None.

Time-based Flushing: This wouldn’t seem like it should be a separate category, but because the behavior seems to be so different between the two cache solutions, I’m listing it here. The vCache manual indicates that “flush period” specifies the time after which dirty blocks will be written to the backing store, whereas FlashCache has a setting called “fallow_delay”, defined in the documentation as the time period before “idle” dirty blocks are cleaned from the cache device. It is not entirely clear whether or not these mechanisms operate in the same fashion, but based on the documentation, it appears that they do not. I find the vCache implementation more useful than the one present in FlashCache. Winner: vCache.

Although nobody likes a tie, if you add up the scores, usability is a 2-2-1 draw between vCache and FlashCache. There are things that I really liked better with FlashCache, and there are other things that I thought vCache did a much better job with. If I absolutely must pick a winner in terms of usability, then I’d give a slight edge to FlashCache due to configuration flexibility, but if the GA release of vCache added some of FlashCache’s additional configuration options and exposed statistics via /proc, I’d vote in the other direction.

Disclosure: The research and testing conducted for this post were sponsored by Virident.

First, some background information. All tests were conducted on Percona’s Cisco UCS C250test machine, and both the vCache and FlashCache tests used the same 2.2TB Virident FlashMAX II as the cache storage device. EXT4 is the filesystem, and CentOS 6.4 the operating system, although the pre-release modules I received from Virident required the use of the CentOS 6.2 kernel, 2.6.32-220, so that was the kernel in use for all of the benchmarks on both systems. The benchmark tool used was sysbench 0.5 and the version of MySQL used was Percona Server 5.5.30-rel30.1-465. Each test was allowed to run for 7200 seconds, and the first 3600 seconds were discarded as warmup time; the remaining 3600 seconds were averaged into 10-second intervals. All tests were conducted with approximately 78GiB of data (32 tables, 10M rows each) and a 4GiB buffer pool. The cache devices were flushed to disk immediately prior to and immediately following each test run.

With that out of the way, let’s look at some numbers.

vCache vs. vCache – MySQL parameter testing

The first test was designed to look solely at vCache performance under some different sets of MySQL configuration parameters. For example, given that the front-end device is a very fast PCIe SSD, would it make more sense to configure MySQL as if it were using SSD storage or to just use an optimized HDD storage configuration? After creating a vCache device with the default configuration, I started with a baseline HDD configuration for MySQL (configuration A, listed at the bottom of this post) and then tried three additional sets of experiments. First, the baseline configuration plus:

innodb_read_io_threads = 16
innodb_write_io_threads = 16

We call this configuration B. The next one contained four SSD-specific optimizations based partially on some earlier work that I’d done with this Virident card (configuration C):

innodb_io_capacity = 30000
innodb_adaptive_flushing_method = keep_average
innodb_flush_neighbor_pages=none
innodb_max_dirty_pages_pct = 60

And then finally, a fourth test (configuration D) which combined the parameter changes from tests B and C. The graph below shows the sysbench throughput (tps) for these four configurations:
vcache_trx_params
As we can see, all of the configuration options produce numbers that, in the absence of outliers, are roughly identical, but it’s configuration C (shown in the graph as the blue line – SSD config) which shows the most consistent performance. The others all have assorted performance drops scattered throughout the graph. We see the exact same pattern when looking at transaction latency; the baseline numbers are roughly identical for all four configurations, but configuration C avoids the spikes and produces a very constant and predictable result.
vcache_response_params

vCache vs. FlashCache – the basics

Once I’d determined that configuration C appeared to produce the most optimal results, I moved on to reviewing FlashCache performance versus that of vCache, and I also included a “no cache” test run as well using the base HDD MySQL configuration for purposes of comparison. Given the apparent differences in time-based flushing in vCache and FlashCache, both cache devices were set up so that time-based flushing was disabled. Also, both devices were set up such that all IO would be cached (i.e., no special treatment of sequential writes) and with a 50% dirty page threshold. Again, for comparison purposes, I also include the numbers from the vCache test where the time-based flushing is enabled.
vcache_fcache_trx_params
As we’d expect, the HDD-only solution barely registered on the graph. With a buffer pool that’s much smaller than the working set, the no-cache approach is fairly crippled and ineffectual. FlashCache does substantially better, coming in at an average of around 600 tps, but vCache is about 3x better. The interesting item here is that vCache with time-based flushing enabled actually produces better and more consistent performance than vCache without time-based flushing, but even at its worst, the vCache test without time-based flushing still outperforms FlashCache by over 2x, on average.

Looking just at sysbench reads, vCache with time-based flushing consistently hit about 27000 per second, whereas without time-based flushing it averaged about 12500. FlashCache came in around 7500 or so. Sysbench writes came in just under 8000 for vCache + time-based flushing, around 6000 for vCache without time-based flushing, and somewhere around 2500 for FlashCache.
vcache_fcache_read_write

We can take a look at some vmstat data to see what’s actually happening on the system during all these various tests. Clockwise from the top left in the next graph, we have “no cache”, “FlashCache”, “vCache with no time-based flushing”, and “vCache with time-based flushing.” As the images demonstrate, the no-cache system is being crushed by IO wait. FlashCache and vCache both show improvements, but it’s not until we get to vCache with the time-based flushing that we see some nice, predictable, constant performance.
cpu-usage-all

So why is it the case that vCache with time-based flushing appears to outperform all the rest? My hypothesis here is that time-based flushing allows the backing store to be written to at a more constant and, potentially, submaximal, rate compared to dirty-page-threshold flushing, which kicks in at a given level and then attempts to flush as quickly as possible to bring the dirty pages back within acceptable bounds. This is, however, only a hypothesis.

vCache vs. FlashCache – dirty page threshold

Finally, we examine the impact of a couple of different dirty-page ratios on device performance, since this is the only parameter which can be reliably varied between the two in the same way. The following graph shows sysbench OLTP performance for FlashCache vs. vCache with a 10% dirty threshold versus the same metrics at a 50% dirty threshold. Time-based flushing has been disabled. In this case, both systems produce better performance when the dirty-page threshold is set to 50%, but once again, vCache at 10% outperforms FlashCache at 10%.

vcache-dirty_trx_params

The one interesting item here is that vCache actually appears to get *better* over time; I’m not entirely sure why that’s the case or at what point the performance is going to level off since these tests were all run for 2 hours anyway, but I think the overall results still speak for themselves, and even with a vCache volume where the dirty ratio is only 10%, such as might be the case where a deployment has a massive data set size in relation to both the working set and the cache device size, the numbers are encouraging.

Conclusion

Overall, the I think the graphs speak for themselves. When the working set outstrips the available buffer pool memory but still fits into the cache device, vCache shines. Compared to a deployment with no SSD cache whatsoever, FlashCache still does quite well, massively outperforming the HDD-only setup, but it doesn’t even really come close to the numbers obtained with vCache. There may be ways to adjust the FlashCache configuration to produce better or more consistent results, or results that are more inline with the numbers put up by vCache, but when we consider that overall usability was one of the evaluation points and combine that with the fact that the best vCache performance results were obtained with the default vCache configuration, I think vCache can be declared the clear winner.

Base MySQL & Benchmark Configuration

All benchmarks were conducted with the following:

sysbench ­­--num­-threads=32 ­­--test=tests/db/oltp.lua ­­--oltp_tables_count=32 \
--oltp­-table­-size=10000000 ­­--rand­-init=on ­­--report­-interval=1 ­­--rand­-type=pareto \
--forced­-shutdown=1 ­­--max­-time=7200 ­­--max­-requests=0 ­­--percentile=95 ­­\
--mysql­-user=root --mysql­-socket=/tmp/mysql.sock ­­--mysql­-table­-engine=innodb ­­\
--oltp­-read­-only=off run

The base MySQL configuration (configuration A) appears below:

#####fixed innodb options
innodb_file_format = barracuda
innodb_buffer_pool_size = 4G
innodb_file_per_table = true
innodb_data_file_path = ibdata1:100M
innodb_flush_method = O_DIRECT
innodb_log_buffer_size = 128M
innodb_flush_log_at_trx_commit = 1
innodb_log_file_size = 1G
innodb_log_files_in_group = 2
innodb_purge_threads = 1
innodb_fast_shutdown = 1
#not innodb options (fixed)
back_log = 50
wait_timeout = 120
max_connections = 5000
max_prepared_stmt_count=500000
max_connect_errors = 10
table_open_cache = 10240
max_allowed_packet = 16M
binlog_cache_size = 16M
max_heap_table_size = 64M
sort_buffer_size = 4M
join_buffer_size = 4M
thread_cache_size = 1000
query_cache_size = 0
query_cache_type = 0
ft_min_word_len = 4
thread_stack = 192K
tmp_table_size = 64M
server­id = 101
key_buffer_size = 8M
read_buffer_size = 1M
read_rnd_buffer_size = 4M
bulk_insert_buffer_size = 8M
myisam_sort_buffer_size = 8M
myisam_max_sort_file_size = 10G
myisam_repair_threads = 1
myisam_recover 

(Source: ssdperformanceblog.com)

Dynamo Sure Works Hard

We tend to think of working hard as a good thing. We value a strong work ethic and determination is the face of adversity. But if you are working harder than you should to get the same results, then it’s not a virtue, it’s a waste of time and energy. If it’s your business systems that are working harder than they should, it’s a waste of your IT budget.

Dynamo based systems work too hard. SimpleDB/DynamoDB, Riak, Cassandra and Voldemort are all based, at least in part, on the design first described publicly in the Amazon Dynamo Paper. It has some very interesting concepts, but ultimately fails to provide a good balance of reliability, performance and cost. It’s pretty neat in that each transaction allows you dial in the levels of redundancy and consistency to trade off performance and efficiency. It can be pretty fast and efficient if you don’t need any consistency, but ultimately the more consistency you want the more have to pay for it via a lot of extra work.

Network Partitions are Rare, Server Failures are Not

… it is well known that when dealing with the possibility of network failures, strong consistency and high data availability cannot be achieved simultaneously. As such systems and applications need to be aware which properties can be achieved under which conditions.

For systems prone to server and network failures, availability can be increased by using optimistic replication techniques, where changes are allowed to propagate to replicas in the background, and concurrent, disconnected work is tolerated. The challenge with this approach is that it can lead to conflicting changes which must be detected and resolved. This process of conflict resolution introduces two problems: when to resolve them and who resolves them. Dynamo is designed to be an eventually consistent data store; that is all updates reach all replicas eventually.

- Amazon Dynamo Paper

The Dynamo system is a design that treats the probability of a network switch failure as having the same probability of machine failure, and pays the cost with every single read. This is madness. Expensive madness.

Within a datacenter, the Mean Time To Failure (MTTF) for a network switch is one to two orders of magnitude higher than servers, depending on the quality of the switch. This is according to data from Google about datacenter server failures, and the publish numbers of the MTBF of Cisco switches (There is a subtle difference between MTBF and MTTF, but for our purposes we can treat them the same)

It is claimed that when W + R > N you can get consistency. But it’s not true, because without distributed ACID transactions, it’s never possible to achieve W > 1 atomically.

Consider W=3, R=1 and N=3. If a network failure or more likely a client/app tier failure (hardware, OS or process crash) happens during the writing of data, it’s possible for only replica A to receive the write, with a lag until the cluster notices and syncs up. Then another client with R = 1 can do two consecutive reads, getting newer data first from a node A, and older data next from node B for the same key. But you don’t even need a failure or crash, once the first write occurs there is always a lag for the next server(s) to receive the write. It’s possible for a fast client to do the same read 2 times again, getting a newer version from one server, then an older version from another.

What is true is that if R > N / 2, then you get consistency where it’s not possible to read in a newer value, then a subsequent read get’s an older value.

For the vast majority of applications, it’s okay for a failure leading to temporary unavailability. Amazon believes it’s shopping cart is so important to capture writes it’s worth the cost of quorum reads, or inconsistency. Perhaps. But the problems and costs multiply. If you are doing extra reads to achieve high consistency, then you are putting extra load on each machine, requiring extra server hardware and extra networking infrastructure to provide the same baseline performance. All of this can increase the frequency of a component failure and increases operational costs (hardware, power, rack space and the personnel to maintain it all).

A Better Way?

What if a document had 1 master and N replicas to write to, but only a single master to read from? Clients know based on the document key and topology map which machine serves as the master. That would make the reads far cheaper and faster. All reads and writes for a document go to the same master, with writes replicated to replicas (which also serve as masters for other documents, each machine is both a master and replica).

But, you might ask, how do I achieve strong consistency if the master goes down or becomes unresponsive?

If when that happens, the cluster also notices the machine is unresponsive or too slow and removes it out of the cluster and fails over to a new master. Then the client tries again and has a successful read.

But, you might ask, what if the client asks the wrong server for a read?

If all machines in the cluster know their role and only one machine in the cluster can be a document master at any time, and the cluster manager (a regular server node elected by Paxos consensus) makes sure to remove the old master, and then assign the new master, and then tell the client about the new topology. Then the client updates it’s topology map, and retries at the new master.

But, you might ask, what if the topology has changed again, and the client again asks to read from the wrong server?

Then this wrong server will let the client know. The client will reload the topology maps, and re-request from the right server. If the right master server isn’t really right any more because of another topology change, it will reload and retry again. It will do this as many times as necessary, but typically it happens only once.

But, you might ask, what if there is a network partition, and the client is on the wrong (minor) side of the partition, and reads from a master server that doesn’t know it’s not a master server anymore?

Then it gets a stale read. But only for a little while, until the server itself realizes it’s no longer in heartbeat contact with the majority of the cluster. And partitions like this are the among the rarest form a cluster failure. It will require a network failure, and for the client to be on the wrong side of the partition.

But, you might ask, what if there is a network partition, and the client is on the wrong (smaller) side of the partition, and WRITES to a server that doesn’t know it’s not a master server anymore?

Then the write is lost. But if the client wanted true multi-node durability, then the write wouldn’t have succeeded (the client would timeout waiting for replicas(s) to receive the update) and the client wouldn’t unknowingly lose data.

What I’m describing is the Couchbase clustering system.

Let’s Run Some Numbers

Given the MTTF of a server, how much hardware and how quickly must the cluster failover to a new master and still meet our SLAs requirements vs a Dynamo based system?

Let’s start with some assumptions:

We want to achieve 4000 transactions/sec with 3 node replication factor. Our load mix is 75% reads/25% writes.

We want to have some consistency, so that we don’t read newer values, then older values, so for Dynamo:

R = 2, W = 2, N = 3

But for Couchbase:

R = 1, W = 1, N = 3

This means for a Dynamo style cluster, the load will be:
Read transactions/sec: 9000 reads (reads spread over 3 nodes)
Write transactions/sec: 3000 writes (writes spread over 3 nodes)

This means for a Couchbase style cluster, the load will be:
Read transactions/sec: 3000 reads (each document read only on 1 master node, but all document masters evenly spread across 3 nodes)
Write transaction/sec: 3000 writes (writes spread over 3 nodes)

Let’s assume both systems are equally as reliable at the machine level. Google’s research indicates in their datacenter each server has a MTTF of 3141 hrs or 2.7 failures per year. Google also reports a rack failure (usually power supply) of 10.2 years, roughly 30x a reliable as a server, so we’ll ignore that to make the analysis simpler. (This is from Googles paper studying server failures here)

The MTBF of Cisco network switch is published at 54,229 hrs on the low end, to 1,023,027 hrs on the high end. For our purposes, we’ll ignore switch failures, since the failures affects availability and consistency of both system about the same, and it’s 1 to 2 orders of magnitude rarer than a server failure. (This is from a Cisco product spreadsheet here)

Assume we want to meet a latency SLA 99.9% of the time (the actual latency SLA threshold number doesn’t matter here).

On Dynamo, that means each node can fail SLA the 1.837% of the time. Because it queries 3 nodes, but only uses the values from the first 2 nodes and the chances of SLA failure are the same across nodes, the formula is different (only two must meet the SLA):

0.0001 = (3 − 2 * P) * P ^ 2

or:

P = 0.001837

On Couchbase, if a master node fails, it must recognize it and fail it out. Given Google’s MTTF failure above and it can fail out a node in 30 secs, and let’s say it will take 4.5 minutes for it warm up the RAM cache, given 2.7 failures/year with 5 minutes of downtime for each before a failover completes, then queries will fail 0.00095% of time due to node failure.

For Couchbase meet the same SLA:

0.0001 = P(SlaFail) + P(NodeFail) - (P(SlaFail) * P(NodeFail))

0.0001 = P(SlaFail) + 0.0000095 - (P(SlaFail) * 0.0000095)

0.0001 ~= 0.00009 + 0.0000095 - (0.00009 * 0.0000095)

Note: Some things I’m omitting from the analysis are when a Dynamo node fails the lower latency requirement from meeting the SLA for 2 nodes vs. 3 (it would drop from 1.837% to ~0.05%), and also the increased work on the remaining servers when a Couchbase server fails. Both are only temporary and go away when a new server is added back and initialized in the cluster, and shouldn’t change the numbers significantly. Also there is the time to add in a new node and rebalance load on it. At Couchbase we work very hard to make that as fast and efficient as possible. I’ll assume Dynamo systems do the same, that the cost is the same and omit it, though I think we are the leaders in rebalance performance.

With this analysis, a Couchbase node can only fail it’s SLA 0.9% of the requests, and a Dynamo node can fail it 1.837%. Sounds good for Dynamo, but it must do for 2X the throughput per node on 3x the data, and with 2x the total network traffic. And for very low latency response times (our customers often want sub-millisecond latency) typically meeting the SLA means a DBMS must a large amount of relevant data and metadata in RAM, because there is a huge cost for random disk fetches on latency. With disk fetches 2 orders of magnitude slower on SSDs (100x), and 4 orders of magnitude slower on HDDs (10000x) the disk accesses pile up faster without much more RAM, so do the latencies.

So each Dynamo node can fail it’s SLA at a higher rate is very small win when it will still need to keep nearly 3X the working set ready in memory because each node will be serving 3x the data at all times for read requests (it can fail it’s SLA slightly more often, so it’s actually about 2.97x the necessary RAM), and will use 2x the network capacity.

Damn Dynamo, you sure do work hard!

Now Couchbase isn’t perfect either, far from it. Follow me on twitter @damienkatz. I’ll be posting more about the Couchbase shortcomings and capabilities, and technical roadmap soon.

(via planeterlang.org)

NoSQL Benchmark Compares Aerospike, Cassandra, Couchbase and MongoDB

A recent set of benchmarks compares Aerospike, Cassandra, Couchbase and MongoDB to see how they fare when it comes to insert throughput, maximum throughput, latency and behavior during a failover.

Thumbtack Technology has released two benchmark whitepapers with results from the comparison of several key-value stores: Ultra-High Performance NoSQL Benchmarking: Analyzing Durability and Performance Tradeoffs (PDF) and NoSQL Failover Characteristics: Aerospike, Cassandra, Couchbase, MongoDB (PDF). Both benchmarks attempted to test “consumer-facing applications which require extremely high throughput and low latency, and whose information can be represented using a key-value schema.”

Thumbtack used an improved version of Yahoo! Cloud Serving Benchmark (YCSB) one that is supposed to overcome some limitations reached when using very high volumes and multiple clients. The YCSB changes have been documented in the first whitepaper and committed back to the community.

The NoSQL databases tested were AerospikeCassandraCouchbase (1.8 and 2.0), andMongoDB. The first is a commercial product, and the last is a document data store not a key-value store, but it was included because “in our experience clients often consider it for similar kinds of applications.” All databases were optimized using recommendations from the vendors backing them. The test systems used SSD storage rather than rotating disks. The whitepapers contain detailed information regarding the methodology used, the client and workload configuration, the hardware configuration, etc.

Thumbtack acknowledged having “strategic and/or commercial relationships with Aerospike, Couchbase, and 10gen” and the hardware used was rented from Aerospike.

We are including some of the results of the benchmarks.

Insert Throughput

The databases were loaded with initial working sets using YCSB’s loading routing which performs a number of inserts. Couchbase had good results when the working sets were loaded into memory, but it had problems loading on SSD when Couchbase 1.8 did not finish the operation and a smaller set and asynchronous mode had to be used for Couchbase 2.0. That explains the gradient blue used for it. Aerospike came second.

image

Maximum Throughput

This test used a “a strong durability model, using a dataset that, when replicated, would be significantly larger than the server’s RAM. This test is intended to model usage for transactional data that requires strong durability guarantees.”

Couchbase does not appear on the graphic because it could not complete the test using synchronous replication.

image

When asynchronous replication was applied, the result in memory was:

image

Latency/Throughput

The benchmarks also measured read and update latency at different levels of traffic. The next graphic includes a full view and a zoomed one for each of those:

image

Failover

Thumbtack attempted to see what happens when a node goes down, a hardware failure being simulated:

image

The downtime was also measured, i.e. the time needed for the cluster the become responsive again after a failure, all databases showing reasonable values:

image

The Thumbtack benchmarks contains more results for different cases but were not included here.

Another NoSQL benchmark was published in October 2012 when Cassandra, HBase, MongoDB, and Riak were compared. MySQL was also included in those tests for a reference against SQL technologies.Another NoSQL benchmark was published in October 2012 when Cassandra, HBase, MongoDB, and Riak were compared. MySQL was also included in those tests for a reference against SQL technologies.

(via infoq.com)