ScaleDB: A NewSQL Alternative to Amazon RDS and Oracle RAC with 10x Performance Gain

In the world of relational databases, MySQL and Oracle RAC (Real Application Clusters) hold a significant segment of the market. Oracle RAC dominates the enterprise market, while MySQL has gained significant popularity among Web 2.0 sites.

Today, both of those databases are owned by Oracle (MySQL was acquired by Sun, which was later acquired by Oracle).

The following diagrams show the enterprise database market share, as covered by Gartner, and the cloud database market share, as covered by Jelastic, a Java PaaS provider.

The acquisition of MySQL by Oracle raised concern over the future of the project due to the inherent conflict of interest between its two database products. Oracle RAC is the company’s main “cash cow” product, while MySQL competes for the same audience.

Shortly after Oracle’s acquisition of MySQL, the open source database was forked by one of its original founders into a new project named MariaDB. MariaDB was established to provide an alternative development and support option to MySQL, and it is now becoming the default database of choice for Red Hat.

MySQL vs Oracle RAC Clustering Models

The two databases take a fairly different approach to scalability.

Oracle RAC is based on a shared storage model. With Oracle, the data is striped across multiple storage devices, and multiple database servers operate concurrently and in sync over the (shared) data.

MySQL, on the other hand, does not use a shared data model. With MySQL, a given data set is managed by a single server (a model called shared nothing), and scaling and High Availability (HA) are achieved by managing copies of the data. As only a single machine can update the data, this model can only scale up by adding more capacity to the machine that owns the database. Because machines have capacity limits yet must keep up with large amounts of data or many users, the database eventually needs to be broken into several independent databases (a process called sharding). However, sharding is a complex process: it is not dynamic, and it requires assumptions about data distribution, data growth, and the queries. For these reasons, sharding is not an option for many applications; in particular, a cloud deployment is expected to be elastic, dynamically supporting changing needs and user behaviors.
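To make the sharding constraint concrete, here is a minimal, hypothetical Python sketch of application-level sharding over independent MySQL databases; the shard names and hash scheme are assumptions for illustration only. It shows why any query not keyed by the shard key must fan out to every shard, and why the key-to-shard mapping bakes in assumptions about data distribution.

```python
# Minimal sketch of application-level sharding (hypothetical shard layout).
# Each shard is an independent MySQL database; the application must route
# every query itself, which is why resharding and cross-shard queries are hard.
import hashlib

SHARDS = ["mysql-shard-0", "mysql-shard-1", "mysql-shard-2", "mysql-shard-3"]

def shard_for(user_id):
    """Map a user id to one shard; changing len(SHARDS) remaps most keys."""
    digest = hashlib.md5(str(user_id).encode()).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

def route_query(user_id, sql):
    # A query keyed by user_id can be sent to a single shard...
    return shard_for(user_id), sql

def route_global_query(sql):
    # ...but a query that is not keyed by user_id must be fanned out to
    # every shard and the results merged by the application.
    return [(shard, sql) for shard in SHARDS]

print(route_query(42, "SELECT * FROM orders WHERE user_id = 42"))
```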

For High Availability, the MySQL shared-nothing approach uses a primary/backup model with a single master and a backup node (called a slave) that manages a copy of the data. Each update to the master must also be executed on the slave. The master and slave nodes use a handshake protocol to determine who is the master and to sync the changes to the data. The master node performs updates/writes to the persistent file system, and the level of High Availability is set by the DBA, who decides whether the slave may lag behind or must confirm the duplicated updates within each transaction.

For scaling, this model can use the slave as a read replica (or make additional copies of the data), a method called horizontal read scaling, in which read requests are spread across the different copies of the data. (However, all writes still need to go to the master and then be reflected on the replicas/slaves.)
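As a rough illustration of this read-scaling pattern, the hedged Python sketch below routes writes to a single master and spreads reads round-robin across replicas; all host names are hypothetical.

```python
# Sketch of master/slave read scaling: writes go to the single master,
# reads are spread over the replicas. Host names are hypothetical.
import itertools

MASTER = "mysql-master.example.com"
REPLICAS = ["mysql-replica-1.example.com", "mysql-replica-2.example.com"]
_replica_cycle = itertools.cycle(REPLICAS)

def pick_host(sql):
    """Send writes to the master, round-robin reads across the replicas."""
    if sql.lstrip().upper().startswith(("INSERT", "UPDATE", "DELETE", "REPLACE")):
        return MASTER               # every write still funnels through one machine
    return next(_replica_cycle)     # reads scale horizontally across copies

print(pick_host("SELECT name FROM users WHERE id = 7"))
print(pick_host("UPDATE users SET name = 'x' WHERE id = 7"))
```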

Relational Databases on the Cloud

The high-end positioning of Oracle RAC, the low-cost and open source nature of MySQL, and the adoption of the cloud as the infrastructure platform led to vastly different methods of deploying these databases in the cloud.

Oracle took an approach similar to the old mainframe model: producing a pre-engineered appliance (called Exadata) that comes with the software and hardware integrated. That approach was specifically aimed at existing Oracle RAC customers who needed a quick resolution to their scalability needs without redesigning their applications. Plugging in a more optimized stack helped push the scalability bar without changing the existing applications that rely on the database.

Amazon launched RDS, which is basically an on-demand version of the MySQL database. This approach fits nicely with the existing users of MySQL who are looking for a more affordable way to run their database in the cloud.

The main limitation of the Amazon RDS approach is that it inherits the limitations of MySQL: it provides limited and relatively complex read scalability, and it doesn’t provide a good solution for write scalability other than scaling up.

ScaleDB – a NewSQL Alternative to Amazon RDS and Oracle RAC

ScaleDB is a NewSQL database that takes an approach similar to the read/write scalability model of Oracle RAC and implements it as a platform that transparently supports the MySQL (or MariaDB) database. As a result, existing MySQL/MariaDB applications can leverage the platform without any changes: they use MySQL or MariaDB as the front-end engine, which connects to the ScaleDB platform that provides new, more concurrent database and storage services. This approach makes it possible to run multiple MySQL or MariaDB server instances against distributed storage in a shared data model (similar to Oracle RAC).

The diagram below shows at a high level how ScaleDB works.

Each database node runs a standard MariaDB database instance with ScaleDB plugged in as the database engine and as an alternative storage and indexing layer.

Scaling of the data and the index is done by processing requests concurrently from multiple database instances and leveraging the ScaleDB distributed storage tier (at the storage/file-system level) where data and index elements are spread evenly across multiple storage nodes of the cluster. Read more on how ScaleDB works.
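From the application's side, this transparency means ordinary MySQL/MariaDB access code keeps working. The sketch below is a hedged illustration using the pymysql client; the engine name (`ENGINE=ScaleDB`), host names, and credentials are assumptions for illustration, not taken from ScaleDB's documentation.

```python
# Hypothetical sketch: from the application's point of view nothing changes,
# it still speaks ordinary MySQL/MariaDB. Only the table's storage engine
# (assumed here to be registered as "ScaleDB") points at the ScaleDB tier.
import pymysql

conn = pymysql.connect(host="mariadb-node-1", user="app", password="secret",
                       database="shop")
with conn.cursor() as cur:
    # Engine name and cluster layout are assumptions for illustration only.
    cur.execute("""
        CREATE TABLE IF NOT EXISTS orders (
            id BIGINT PRIMARY KEY,
            user_id BIGINT,
            total DECIMAL(10,2)
        ) ENGINE=ScaleDB
    """)
    cur.execute("SELECT COUNT(*) FROM orders")   # unchanged application SQL
    print(cur.fetchone())
conn.commit()
```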

ScaleDB vs Other NewSQL Databases

Most NewSQL databases are based on MySQL as the engine behind their implementation. The main difference between many of them and ScaleDB is that those NewSQL databases bring the NoSQL approach into the SQL world, whereas ScaleDB takes the Oracle RAC shared storage approach to scale the database.

ScaleDB can deliver write/read scaling while offering close to 100% compatibility, whereas in many of the alternative NewSQL approaches, scaling often involves significant changes to the data model and query semantics.

Benchmark – ScaleDB vs Amazon RDS

To demonstrate the difference between a standard cloud deployment of MySQL and a ScaleDB deployment, and to find out whether the ScaleDB approach can live up to its promise, we conducted a benchmark comparing Amazon RDS scalability for write/read workloads with that of ScaleDB. We tested a data set that does not fit in the memory (RAM) of a single machine and used the largest machines offered by Amazon. We required that scaling be dynamic and that all types of queries be supported. These requirements made sharding a no-go option.

The benchmark is based on the popular YCSB (Yahoo! Cloud Serving Benchmark) as the benchmarking framework.
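For readers unfamiliar with YCSB, the real framework drives configurable mixes of reads, updates, inserts, and scans against a target store and reports throughput and latency. The toy Python sketch below only mimics that idea (a 95/5 read/update mix over roughly 1 KB rows against MySQL via pymysql); it is not the actual benchmark, and the table, column, host, and credential names are assumptions.

```python
# Not the real YCSB: a toy stand-in for a YCSB-style workload, useful only to
# show what the benchmark measures. Connection details are hypothetical.
import random, string, time
import pymysql

conn = pymysql.connect(host="db-under-test", user="ycsb", password="secret",
                       database="ycsb")
ROWS = 50_000
payload = "".join(random.choices(string.ascii_letters, k=1000))  # ~1 KB value

def one_operation(cur):
    key = random.randrange(ROWS)
    if random.random() < 0.95:                       # read-mostly mix
        cur.execute("SELECT field0 FROM usertable WHERE ycsb_key=%s", (key,))
        cur.fetchone()
    else:
        cur.execute("UPDATE usertable SET field0=%s WHERE ycsb_key=%s",
                    (payload, key))

start, ops = time.time(), 10_000
with conn.cursor() as cur:
    for _ in range(ops):
        one_operation(cur)
conn.commit()
print("throughput: %.0f ops/sec" % (ops / (time.time() - start)))
```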

The results of that benchmark are shown in the graphs below.

Both illustrate relatively flat scalability with Amazon RDS and close to linear scalability on the ScaleDB front.

Benchmark environment:

  • Benchmark framework: YCSB (Yahoo! Cloud Serving Benchmark)
  • Cloud environment: Amazon
  • Machine type: Extra Large
  • Data element size (per row): 1 KB
  • Data capacity: 50 GB
  • Zones: single zone
  • RDS provisioning: 1,000 provisioned IOPS
  • ScaleDB cluster setup: 2 database nodes, 2 data volumes (4 machines; data is striped over 2 machines and each machine had a mirror)
  • ScaleDB MySQL engine: MariaDB

Running ScaleDB on OpenStack and Other Clouds with Cloudify

What got me excited about the project is that it is a perfect fit for many of our enterprise customers who use Cloudify to move their existing applications to their private cloud without code changes. Those customers are looking for a simple way to scale their applications, and many of them run on Oracle RAC today for that purpose.

The move of enterprises to a cloud-based environment also drives the demand for a more affordable approach to handle their database scaling, which is provided by ScaleDB.

On the other hand, setting up a database cluster can be a fairly complex task.

By creating a Cloudify recipe for ScaleDB, we could remove a large part of that complexity and set up an entire ScaleDB database and storage environment through a single click.

In this way we can run ScaleDB on demand as we would with RDS and scale on both read and write as with Oracle Exadata, only in a more affordable fashion.

References

1. Cloudify recipe for running ScaleDB

2. YCSB (Yahoo! Cloud Serving Benchmark)

3. ScaleDB Architecture white paper

4. ScaleDB user manual

(Via Nati Shalom’s Blog)

 

Under the hood: MySQL Pool Scanner (MPS)

Facebook has one of the largest MySQL database clusters in the world. This cluster comprises many thousands of servers across multiple data centers on two continents.

Operating a cluster of this size with a small team is achieved by automating nearly everything a conventional MySQL Database Administrator (DBA) might do so that the cluster can almost run itself. One of the core components of this automation is a system we call MPS, short for “MySQL Pool Scanner.”

MPS is a sophisticated state machine written mostly in Python. It replaces a DBA for many routine tasks and enables us to perform maintenance operations in bulk with little or no human intervention.

A closer look at a single database node
Every one of the thousands of database servers we have can house a certain number of MySQL instances. An instance is a separate MySQL process, listening on a separate port with its own data set. For simplicity, we’ll assume exactly two instances per server in our diagrams and examples.

The entire data set is split into thousands of shards, and every instance holds a group of such shards, each in its own database schema. A Facebook user’s profile is assigned to a shard when the profile is created and every shard holds data relating to thousands of users.
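A hypothetical Python sketch of that layering: a profile is assigned a shard once, and a placement map says which instance (host and port) currently hosts that shard, with each shard stored in its own schema. The shard count and naming are made up for illustration.

```python
# Hypothetical sketch of the layering described above: a user profile maps to
# a shard at creation time, and each MySQL instance hosts a group of shards,
# each shard living in its own schema on that instance.
NUM_SHARDS = 4096                       # assumed shard count, for illustration

# Assumed placement map: shard id -> (host, port) of the instance holding it.
SHARD_TO_INSTANCE = {s: ("db%03d" % (s % 8), 3306 + (s % 2))
                     for s in range(NUM_SHARDS)}

def assign_shard(user_id):
    """Pick a shard once, when the profile is created; it then sticks."""
    return user_id % NUM_SHARDS

def locate(user_id):
    shard = assign_shard(user_id)
    host, port = SHARD_TO_INSTANCE[shard]
    return host, port, "shard_%d" % shard   # schema name for that shard

print(locate(123456789))
```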

It’s more easily explained by a diagram of a single database server:

 

Every instance has a few copies of its data on other instances that are hosted on different servers, and usually in a different data center. This is done to achieve two main goals:

1. High Availability – If a server goes down, we have the data available elsewhere, ready to be served.
2. Performance – Different geographical regions have their own replicas so that reads are served locally.

The way we achieve this is through simple MySQL master/slave replication. Every instance is part of a replica set. A replica set has a master and multiple slaves. All writes to a replica set must occur on the master. The slaves subscribe to a stream of these write events, and the events are replayed on them as soon as they arrive. Since the master and the slaves have nearly identical data, a read can occur on any one of the instances in the replica set.

Here is a diagram of a simple replica set, where each server hosts only one instance, and the other instance is empty (we call these spares):

 


A server is essentially a container for instances, so in reality things can get much more complicated.
For example, a single server hosting a master instance may also be hosting a slave instance for a different master, like so:

 

There are two important “building block” operations MPS relies on:

1.    Creating a copy/placing a server

The first building block is an operation that creates a copy of an instance on a different host. We use a modified version of Xtrabackup to perform most copy operations. A replacement is the same operation, with the additional step of removing an instance after the copy successfully completes.

First, the system allocates a new spare instance for the operation. We choose one of the slaves or the master and copy its data to the freshly allocated spare instance. This diagram shows a replacement operation, where an instance is removed when the copy is complete:
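In code form, the flow might look roughly like the hedged Python sketch below, with the Xtrabackup copy reduced to a placeholder and instance state kept in a plain dict; none of these names are the real MPS API.

```python
# Hedged sketch of the "create a copy / replace" building block. The real MPS
# uses Xtrabackup; here the copy step is a stand-in function and instance
# state lives in a plain dict. All names are hypothetical.
instances = {
    "db1:3306": {"state": "production", "role": "master"},
    "db2:3306": {"state": "production", "role": "slave"},
    "db3:3306": {"state": "spare",      "role": None},
}

def run_xtrabackup(source, target):
    print(f"streaming data {source} -> {target}")   # placeholder for the copy

def copy_instance(donor, spare):
    instances[spare]["state"] = "spare allocated"    # reserve the target
    run_xtrabackup(donor, spare)
    instances[spare].update(state="production", role="slave")  # joins the replica set

def replace_instance(broken, donor, spare):
    copy_instance(donor, spare)                      # same operation...
    instances[broken].update(state="spare deallocated", role=None)  # ...then retire the old one

replace_instance(broken="db2:3306", donor="db1:3306", spare="db3:3306")
print(instances)
```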

 

2.    Promoting a master instance

The second building block is the action of promoting a different instance to be the master in a replica set.
During a promotion, we select a target for the promotion, stop writes to the replica set, change the slaves to replicate from the new master, and resume writes. In the diagram, we show a deletion operation in which the old master is discarded after the promotion is completed successfully. For simplicity, the replica set below consists of only three instances:
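The promotion steps can be sketched in a few lines of hedged Python; the replica-set structure and the print statements stand in for the real coordination work (stopping writes, re-pointing replication, resuming writes).

```python
# Hedged sketch of a master promotion, following the steps in the text:
# pick a target, stop writes, re-point the slaves, resume writes.
# The data structure and helpers are hypothetical stand-ins.
replica_set = {"master": "db1", "slaves": ["db2", "db3"]}

def promote(replica_set, target, discard_old=False):
    old_master = replica_set["master"]
    print("stopping writes to the replica set")        # e.g. make the old master read-only
    replica_set["slaves"].remove(target)
    for slave in replica_set["slaves"]:
        print(f"re-pointing {slave} to replicate from {target}")
    replica_set["master"] = target
    if discard_old:                                    # deletion variant from the diagram
        print(f"discarding {old_master}")
    else:
        replica_set["slaves"].append(old_master)       # otherwise demote it to a slave
    print("resuming writes")

promote(replica_set, target="db2", discard_old=True)
print(replica_set)
```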

 

These two operations (which are usually complex procedures for most companies running MySQL) have been completely automated to a point where MPS can run them hundreds or thousands of times a day in a fast and safe manner, without any human intervention.

Host management and states
Now that we’ve got the basics out of the way, we can dive into more abstract concepts that utilize these building blocks.

MPS works with a repository that holds the current state and metadata for all our database hosts, and current and past MPS copy operations. This registry is managed by the database servers themselves so that it scales with the database cluster and MPS doesn’t need a complex application server installation. MPS itself is in fact stateless, running on its own pool of hosts and relying on the repository for state management. States are processed separately and in parallel.

When a server “wakes up” in a data center (for example, a fresh rack has been connected and provisioned), it will start running a local agent every few minutes. This agent performs the following steps:

  1. Collect data about itself. (Where am I? What hardware do I have? What software versions am I running?)
  2. Triage the host for problems. (Did I wake up in an active cluster? Are my disks OK? Are my flash cards healthy?)
  3. Make sure the server is registered and contains up-to-date metadata in the central repository.
  4. On the first run, place instances on the server in an initial “reimage” state if there is no current record of this server. This is where new servers start their lives in MPS.

Every few minutes, every healthy server “checks in” with this central repository and updates how it’s doing, allowing things like data use and system health to be kept in sync.
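A toy Python sketch of such an agent loop is shown below; the probes, the repository, and the initial “reimage” state for unknown servers are modeled with hypothetical placeholders, not the real MPS code.

```python
# Toy sketch of the local agent loop described above; every probe and the
# repository client are hypothetical placeholders.
import time

def collect_facts():
    return {"host": "db042", "datacenter": "prn", "mysql_version": "5.6"}

def triage(facts):
    return {"disks_ok": True, "flash_ok": True, "in_active_cluster": True}

def register_or_update(repository, facts, health):
    if facts["host"] not in repository:                   # first run ever seen
        repository[facts["host"]] = {"state": "reimage"}   # new servers start here
    repository[facts["host"]].update(facts=facts, health=health,
                                     last_checkin=time.time())

repository = {}                                            # stand-in for the central registry
for _ in range(3):                                         # "every few minutes" in reality
    facts = collect_facts()
    register_or_update(repository, facts, triage(facts))
print(repository["db042"]["state"])
```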

The smallest unit MPS manages at the moment is an instance. Each instance can be in various states. The important states are as follows:

  • Production: Instance is serving production traffic.
  • Spare: Instance is ready to be copied to or allocated to some other task.
  • Spare allocated: Instance has been chosen as the target for a copy, and a copy is in progress.
  • Spare deallocated: Temporary triaging state. Instance has been removed from production and is pending triaging and cleanup. No instances stay here for more than a few minutes.
  • Drained: The instance is not being used and is being reserved for testing, data center maintenance, etc. Operator intervention is required to take a host out of this state.
  • Reimage: Servers with all instances in this state are being reimaged or are in the process of being repaired. Servers in this state are handed off and managed by a co-system called Windex, which was discussed in a previous post.

An instance may move between states due to MPS executing an action or an operator intervention. This state diagram shows the main states and the actions that may cause an instance to move between those states.

 

The diagram above describes only a small subset of possible paths an instance may take in MPS. The state changes described here are the ones that result from simple copy and maintenance operations. There are many other reasons for instances to change states, and hardcoding all the options and checks would create software that is difficult and complex to maintain. Meet “problems,” another fundamental concept in MPS.

A “problem” is a property that is attached to an instance. If all instances on a host have this problem, we consider it to be attached to the server itself. Another way to think of problems is like tags. MPS consults a decision matrix that helps it make decisions about instances with a specific problem. It is basically a map between tuples: (state, problem) – (action, state).

It is easier to understand with some examples:

  • (production, low-space) – (replace, spare deallocated): Replace an instance in production with limited space, moving it to a different server.
  • (spare deallocated, old-kernel) – (move, reimage): If an instance happened to move through this state, it has no production data on it, so why not reimage it?
  • (production, master-in-fallback-location) – (promote, production): We should promote this master instance to the correct location, and leave the instance in the production state.
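In code, such a decision matrix can be as simple as a dictionary keyed by (state, problem) tuples. The hedged sketch below mirrors the three examples above; everything else about the real matrix is an assumption.

```python
# Hedged sketch of the decision matrix as a plain mapping from
# (state, problem) to (action, next_state); entries mirror the examples above.
DECISION_MATRIX = {
    ("production", "low-space"):                    ("replace", "spare deallocated"),
    ("spare deallocated", "old-kernel"):            ("move",    "reimage"),
    ("production", "master-in-fallback-location"):  ("promote", "production"),
}

def decide(state, problems):
    """Return the first (action, next_state) that matches one of the problems."""
    for problem in problems:
        if (state, problem) in DECISION_MATRIX:
            return DECISION_MATRIX[(state, problem)]
    return ("none", state)    # nothing to do, stay put

print(decide("production", ["low-space"]))   # -> ('replace', 'spare deallocated')
print(decide("spare", ["old-kernel"]))       # -> ('none', 'spare')
```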

The various states and “problems” in MPS allow us to create a flexible and maintainable infrastructure to manage a server’s life cycle.

Examples of common failure resolution and maintenance operations
In a large data center, there are tens or hundreds of server failures a day. Here are a few examples of common day-to-day failures that MPS takes care of without human intervention:

  • Broken slave instances are detected and disabled until they are replaced in the background.
  • Broken master instances are demoted so that healthy replicas take the place of their fallen brethren and get replaced in the background.
  • Instances on servers that might run out of space due to growth are moved to underutilized servers.

With thousands of servers, site-wide maintenance tasks like upgrading to a new kernel, changing partition sizes, or upgrading firmware on controllers become very complex. The same goes for localized operations such as moving some racks or allocating test servers for our engineering teams. Here are some common maintenance operations an operator can ask MPS to perform with a single command:

  • Drain any number of database racks for maintenance and take them out of production. Most such operations complete in less than 24 hours.
  • Re-image thousands of machines (to perform kernel upgrades, for example) at a specified concurrency. MPS will replace each machine and then send it to Windex.
  • Allocate any number of spares to be used for a new project or testing. Want 200 servers to run tests? No problem.
  • Create a copy of the entire Facebook data set at a new data center at a specified concurrency (building out our new Luleå data center, for example!)

Automating away the mundane tasks with MPS allows us to better scale the number of servers we manage, and it frees up the MySQL Operations team to work on more exciting challenges.

(source: Facebook Engineering)

Need Help With Database Scalability? Understand I/O

This is a guest post by Zardosht Kasheff, Software Developer at Tokutek, a storage engine company that delivers 21st-Century capabilities to the leading open source data management platforms.

As software developers, we value abstraction. The simpler the API, the more attractive it becomes. Arguably, MongoDB’s greatest strengths are its elegant API and its agility, which let developers simply code.

But when MongoDB runs into scalability problems on big data, developers need to peek underneath the covers to understand the underlying issues and how to fix them. Without understanding, one may end up with an inefficient solution that costs time and money. For example, one may shard prematurely, increasing hardware and management costs, when a simpler replication setup would do. Or, one may increase the size of a replica set when upgrading to SSDs would suffice.

This article shows how to reason about some big data scalability problems in an effort to find efficient solutions.

Defining the Issues

First, let’s define the application. We are discussing MongoDB applications. That means we are addressing a document-store database that supports secondary indexes and sharded clusters. In the context of other NoSQL technologies, such as Riak or Cassandra, we may discuss these I/O bottlenecks differently, but this article focuses on the properties of MongoDB.

Second, what does the application do? Are we processing transactions on-line (OLTP) or are we doing analytical processing (OLAP)? For this article, we are discussing OLTP applications. OLAP applications have big data challenges that MongoDB may or may not be able to address, but this article focuses on OLTP applications.

Third, what’s big data? By big data, we mean that we are accessing and using more data than we can fit in RAM on a single machine. As a result, if the data resides on one server, then most of it must reside on disk, and require I/O to access. Note that we are not discussing scenarios where the database is large, but the data accessed or used (sometimes called the “working set”) is small. An example would be storing years of data, but the application only frequently accesses the last day’s worth.

Fourth, what are the limiting factors in OLTP applications with big data? In short: I/O. Hard disk drives do at most hundreds of I/Os per second. RAM, on the other hand, accesses data millions of times per second. The disparity between these limits makes I/O the bottleneck for big data applications.

Lastly, how do we solve our I/O bottlenecks? With analytical thinking. Formulas and direct instructions can get us most of the way there, but a lasting solution requires understanding. Users must look at the I/O characteristics of their application and make design decisions to best fit those characteristics.

Cost Model

To solve I/O bottlenecks, we first need to understand what database operations induce I/O.

One can argue that MongoDB, and many other databases, underneath all of the bells and whistles, perform three basic operations:

● Point Query: Find a single document. Given the location of a document somewhere either on disk or in memory, retrieve the document. On big data, where the document is likely not in memory, this operation probably causes an I/O.

● Range Query: Find some number of documents in succession in an index. This is generally a MUCH more effective operation than a point query, because the data we are looking for is packed together on disk and brought into memory with very few I/Os. A range query used to retrieve 100 documents may induce 1 I/O, whereas 100 point queries to retrieve 100 documents may induce 100 I/Os.

● Write: Write a document to the database. For traditional databases such as MongoDB, this may cause I/O. For databases with write-optimized data structures, such as TokuMX, this induces very little I/O. Unlike traditional databases, write-optimized data structures are able to amortize the I/O performed across many inserts.
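A quick back-of-envelope sketch in Python makes the gap between point and range queries concrete; the IOPS figure and the documents-per-I/O figure are illustrative assumptions, not measurements.

```python
# Back-of-envelope sketch of why the basic operations differ so much on
# spinning disks. The numbers (200 IOPS per disk, 100 documents fetched per
# range I/O) are illustrative assumptions only.
DISK_IOPS = 200            # a hard disk does at most a few hundred seeks/second
DOCS_PER_RANGE_IO = 100    # documents packed together and read with one I/O

def seconds_for_point_queries(n_docs):
    return n_docs / DISK_IOPS              # roughly one I/O per document

def seconds_for_range_query(n_docs):
    ios = max(1, n_docs // DOCS_PER_RANGE_IO)
    return ios / DISK_IOPS                 # contiguous data amortizes the seeks

print(seconds_for_point_queries(100))      # ~0.5 s for 100 random point lookups
print(seconds_for_range_query(100))        # ~0.005 s for the same 100 documents in a range
```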

Understanding the I/O implications of these three basic operations leads to understanding the I/O used by MongoDB statements made against a database. MongoDB takes these three basic operations and builds four basic user level operations:

● Inserts. This writes a new document to the database.

● Queries. Using an index on a collection, this does a combination of range queries and point queries. If the index is a covering index or a clustering index (conceptually the same as TokuDB for MySQL’s clustering index), then the query is likely doing just range queries. If not, then a combination of range queries and point queries is used. I explain these concepts in an indexing talk. The talk uses MySQL, but all of the concepts apply to MongoDB and TokuMX as well.

● Updates and Deletes. These are a combination of queries and writes. A query is used to find the documents to be updated or deleted, and then writes are used to update or remove the found documents.

Now that we understand the cost model, to resolve I/O bottlenecks the user needs to understand where the application induces I/O. This is where we need to break the abstraction and peek at how the database behaves. Does the I/O come from queries? If so, how are the queries behaving in a way that causes the I/O? Does it come from updates? If it comes from updates, is it coming from the query used in the update or from the write used in the update? Once the user understands what is inducing the I/O, steps can be taken to resolve the bottleneck.

Assuming we understand the I/O characteristics of the application, we can talk about several approaches to addressing them. The approach I like to take is this: first attack the problem with software, and when that is not enough, then attack the problem with hardware. Software is cheaper and easier to maintain.

Possible Software Solutions

A possible software solution is one where we change the software or the application to reduce I/O usage. Here are possible solutions for different bottlenecks.

Problem: Inserts Causing Too Much I/O.

Possible Solution: Use a write-optimized database, such as TokuMX. One of TokuMX’s benefits is drastically reducing the I/O requirements of writes in databases by using Fractal Tree indexes.

Problem: Queries Causing Too Much I/O.

Possible Solutions: Use better indexing. Reduce point queries by using range queries instead.

In my talk, “Understanding Indexing”, I explain how to reason about indexes to reduce I/O for queries. It’s difficult to summarize the talk in one paragraph, but the gist is as follows. One can reduce the I/O of the application by avoiding doing individual point queries to retrieve each document. To do this, we use covering or clustering indexes that smartly filter the documents analyzed by the query, and can report results using range queries.
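As a concrete (and hedged) illustration in pymongo, the sketch below builds a compound index and issues a query whose projection only touches indexed fields, so it can be answered as a covered range query rather than as per-document point lookups; the collection and field names are made up.

```python
# Hedged pymongo sketch of trading point queries for range queries with a
# covering index; collection and field names are illustrative assumptions.
from pymongo import MongoClient, ASCENDING

coll = MongoClient("mongodb://localhost:27017")["app"]["users"]

# Compound index on (last_name, city): an equality query on last_name walks
# one contiguous index range instead of scattered documents on disk.
coll.create_index([("last_name", ASCENDING), ("city", ASCENDING)])

# Projecting only indexed fields (and excluding _id) lets the query be
# answered from the index alone (a covered query), avoiding a point lookup
# per matching document.
cursor = coll.find({"last_name": "Smith"},
                   {"_id": 0, "last_name": 1, "city": 1})
print(list(cursor))
```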

Better indexing may not be sufficient. If you have an OLTP application and your queries are essentially point queries (because they retrieve very few documents), then even with proper indexes, you may still have an I/O bottleneck. In such cases, a hardware solution is probably necessary.

Also, additional indexes increase the cost of insertions, as each insertion must keep the indexes up to date as well, but write-optimized databases mitigate that cost. This is where we need to approach the problem with an understanding of our application. For some applications, this is practical, and for others it is not.

Problem: Updates/Deletes Cause Too Much I/O

Solution: Combine the solutions above.

Updates and deletes are tricky in that they are a combination of queries and inserts. Improving their performance requires a good understanding of the cost of the operation. Which part of the operation induces I/O? Is it the query? If so, one can try to improve the indexing. Is it the write? Is it both? Based on what part is inducing the I/O, we apply the solutions above.

One mistake many make is taking a write-optimized database such as TokuMX and expecting it to eliminate the I/O bottlenecks of updates and deletes without changing any of the indexing. A write-optimized database is not enough. The implicit query within an update/delete must be handled as well.

Possible Hardware Solutions

As mentioned above, when software solutions are not enough, we look to hardware. Let’s look at these possibilities and analyze their benefits and drawbacks:

● Buy more memory to hopefully get more, if not all, of your working set into memory.

● Increase your IOPS by moving to an SSD.

● Buy more machines and move to a scaled out solution. This can be:

○ Read scaling via replication

○ Sharding

Buying more memory

RAM is expensive, and there is only so much RAM one can put on a single machine. Simply put, if data is large enough, keeping it in RAM is not a feasible option. This solution may work for a wide range of applications, but our focus here is tackling applications that cannot do this.

Moving to SSDs

Making the storage device an SSD is a practical solution for improving throughput. If I/O is the limiting factor of your application, an increase in IOPS (I/Os per second) naturally leads to an increase in throughput. This may be simple and practical.

The cost model of SSDs is different from the cost model of hard disks. SSDs dramatically improve your I/O throughput, but they are not cheap. They are smaller, cost more, and wear out faster. Therefore, the cost per GB of SSDs is considerably higher than the cost per GB of hard disks. To keep costs down, data compression becomes key.

So, the cost of the hardware increases, but the cost of managing the application does not.

Read scaling via replication

Read scaling with replication is effective for applications where queries are the bottleneck. The idea is as follows:

● Use replication to store multiple copies of your data on separate machines

● Distribute reads across the machines, thereby improving read throughput

By taking the reads that used to bottleneck on one machine and spreading them out, more resources are available for the application to either handle more concurrent queries with the same write workload or to increase the write workload on the application.
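With MongoDB, for example, this read distribution can be expressed through a read preference. The hedged pymongo sketch below sends writes to the primary while allowing reads to be served from secondaries; the host names and replica set name are hypothetical.

```python
# Hedged pymongo sketch of read scaling with a replica set: writes still go to
# the primary, but reads are allowed to spread over the secondaries.
from pymongo import MongoClient

client = MongoClient(
    "mongodb://db1.example.com,db2.example.com,db3.example.com",
    replicaSet="rs0",
    readPreference="secondaryPreferred",   # reads prefer secondaries, easing the primary
)
events = client["app"]["events"]

events.insert_one({"user": 7, "type": "login"})   # writes must hit the primary
print(events.count_documents({"user": 7}))        # this read may be served by a secondary
```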

If inserts, updates, or deletes are your bottleneck, then replication may not be very effective, because the write work is duplicated on all servers that are added to the replica set. The machine that takes the data input (called the master in MySQL or primary in MongoDB) will still have the same bottleneck.

Sharding

Sharding partitions your data across different replica sets based on a shard key. Different replica sets in the cluster are responsible for ranges of values in the shard key space. So, an application’s write throughput is increased by spreading the write workload across separate replica sets in a cluster. For high write workloads, sharding can be very effective.

By partitioning the data by ranges in the shard key space, queries that use the shard key can effectively do range queries on a few shards, making such queries very efficient. If one makes the shard key a hash, then all range queries must run on all shards in the cluster, but point queries on the shard key run on single shards.
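A hedged pymongo sketch of both choices, issued against a mongos router; the database, collection, and shard key names are illustrative assumptions.

```python
# Hedged pymongo sketch of sharding collections through mongos.
from pymongo import MongoClient

client = MongoClient("mongodb://mongos.example.com:27017")

client.admin.command("enableSharding", "app")

# Range-based shard key: queries that include user_id can target a few shards,
# and range queries over user_id stay efficient.
client.admin.command("shardCollection", "app.users", key={"user_id": 1})

# Hashed shard key (alternative): writes spread evenly across shards, but
# range queries on the key must now be broadcast to every shard.
client.admin.command("shardCollection", "app.events", key={"user_id": "hashed"})
```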

Because MongoDB is schema-less and does not support joins, sharding is elegant and relatively easy to use. If the solutions above are not enough to handle your application and more resources are required, then one can argue that sharding is inevitable.

Nevertheless, sharding is arguably the most heavyweight solution and has a high cost. For starters, your entire hardware budget is multiplied. You do not just add machines to a sharded setup, you add entire replica sets. You also need to add and manage config servers. Due to these costs, one should really consider whether sharding is truly necessary. Usually, any one of the solutions presented above is cheaper.

Another big challenge is selecting a shard key. A good shard key has the following properties:

● Many (if not all) of your queries use the shard key. Any query that does not use it must run on all shards. This can be a pain point.

● The shard key should do a good job of distributing writes to different replica sets in the cluster. If all writes are directed to the same replica set in the cluster, then that replica set becomes a bottleneck for writes, just as it was in a non-sharded setup. This makes something like a timestamp a very bad shard key.
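A tiny simulation illustrates the second property: with a monotonically increasing key (such as a timestamp) all new writes land in one key range, while a hashed key spreads them roughly evenly. The shard count and range boundaries below are arbitrary assumptions.

```python
# Toy simulation: a monotonically increasing key funnels all new writes to the
# last range (one hot shard), while a hashed key spreads them out.
import hashlib
from collections import Counter

NUM_SHARDS = 4

def range_shard(key, boundaries=(25_000, 50_000, 75_000)):
    return sum(key >= b for b in boundaries)           # which key range owns it

def hashed_shard(key):
    return int(hashlib.md5(str(key).encode()).hexdigest(), 16) % NUM_SHARDS

new_writes = range(90_000, 100_000)                     # "recent timestamps"
print(Counter(range_shard(k) for k in new_writes))      # all 10,000 writes on one shard
print(Counter(hashed_shard(k) for k in new_writes))     # roughly even across 4 shards
```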

These requirements are not always easy to fulfill. Sometimes a good shard key does not exist, making sharding ineffective.

In conclusion, many solutions may work, but none is always guaranteed, not even sharding. This is why understanding the characteristics of the application is crucial. These solutions are tools; how best to use them is up to the user.

(source: HighScalability.com)