This is the one-liner Amazon uses to describe DynamoDB, its NoSQL database solution. In the early days of TreeCrunch we decided to run all of our services in the cloud. After looking at many different providers we finally chose Amazon Web Services as the main beneficiary … of our credit card. The reasons:
- We had previous experience with AWS.
- AWS provides many convenient services (EC2, S3, Route53, VPC, etc.).
- AWS costs are somewhat manageable.
- Amazon keeps innovating with AWS.
- Our AWS account manager takes good care of us.
We are in the Big Data business, so we had to plan for worst- and best-case scenarios around data collection. People’s responses, and thus our data, can take practically any written form. Considering our largest possible customer at the time, adding a buffer on top, predicting a (totally unrealistic) 100% response rate, assuming some people write short phrases and others long paragraphs, and collecting all responses in Unicode UTF-8; we ended up with an estimated maximum data volume of 5 GB per campaign. As we wanted to be able to access that data – all of it – within a short period of time to run our analysis against, we had to think carefully about which database to use.
DynamoDB was our database of choice, and it has worked great for us so far. When Amazon announced the availability of DynamoDB in Singapore, we climbed on board right away. At TreeCrunch we strive to provide the best service to our customers at the fastest possible response times. While DynamoDB has been a great choice for us, it has some deficiencies that we have actually run into. Though the database scales easily in terms of storage size (it actually does that automatically), it does not scale as easily in terms of read/write capacity. DynamoDB charges for the space consumed and for the guaranteed read/write capacity throughput provisioned for each table. For example, if you want to read all data really quickly, the read capacity setting needs to be high enough. If it is too low, DynamoDB will throw an error and you have to deal with it. Of course, provisioning a higher read capacity costs more.
There is no easy dynamic automatic scaling mechanism provided by Amazon at the moment. It is possible to modify the capacity units, and you can raise them at any time, but you cannot lower them more than once within 24 hours. (Small tip: the counter actually resets at 00:00 UTC, not 24 hours after the last time a capacity unit was lowered.) So there is a limit to how dynamically we can scale the required performance up when we need it and back down when we don’t. Amazon does, of course, provide an API which we can use to find out how many capacity units we have consumed so far.
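That midnight-UTC reset rule is easy to get wrong, so here is a minimal sketch of it in Python. The helper name and the once-per-day limit are taken from the behaviour described above; this is an illustrative check, not code from our actual implementation.

```python
from datetime import datetime, timezone

def decrease_allowed(last_decrease_utc, now_utc):
    """Return True if a capacity decrease is permitted again.

    Per the behaviour described above: the once-per-day decrease
    counter resets at 00:00 UTC, not 24 hours after the last decrease.
    """
    if last_decrease_utc is None:
        return True  # never decreased yet
    # A new UTC calendar day means the counter has reset.
    return now_utc.date() > last_decrease_utc.date()

# Example: a decrease at 23:50 UTC allows another one ten minutes later,
# because the counter resets at midnight UTC.
last = datetime(2013, 4, 1, 23, 50, tzinfo=timezone.utc)
now = datetime(2013, 4, 2, 0, 0, tzinfo=timezone.utc)
print(decrease_allowed(last, now))  # True
```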
When we analyze a large data set of responses, we need to have fast access to all data at that point in time. Besides that, we also run the campaign and when people submit their responses our database needs to be ready to scale virtually indefinitely. So we had to create our own scaling mechanism for DynamoDB.
TreeCrunch’s DynamoDB Scaling Concept:
- DynamoDB reports its usage of ConsumedReadCapacity and ConsumedWriteCapacity to Amazon’s CloudWatch.
- CloudWatch alarms are set up to trigger actions.
- Once an alarm occurs (all read or write capacity units consumed), a message is sent via Amazon’s SNS (Simple Notification Service) to a server that takes care of all services.
- Utilizing the AWS SDK, we modify the capacity units.
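The last two steps can be sketched roughly like this in Python. CloudWatch delivers the alarm details as a JSON document in the SNS message body, and the capacity change goes through the SDK's table-update call (shown here with boto3's `update_table`). The function names and the control flow are illustrative assumptions, not our actual server code.

```python
import json

def parse_alarm_notification(sns_message):
    """Extract the alarm name and state from an SNS-delivered
    CloudWatch alarm notification (standard field names)."""
    body = json.loads(sns_message)
    return body["AlarmName"], body["NewStateValue"]

def set_table_throughput(table_name, read_units, write_units):
    """Apply new capacity units to a DynamoDB table via the AWS SDK.

    Illustrative sketch using boto3 (the current Python SDK); it
    requires AWS credentials to actually run.
    """
    import boto3  # imported here so the sketch loads without AWS set up
    client = boto3.client("dynamodb")
    client.update_table(
        TableName=table_name,
        ProvisionedThroughput={
            "ReadCapacityUnits": read_units,
            "WriteCapacityUnits": write_units,
        },
    )

# Example of the parsing step with a hypothetical alarm name:
sample = '{"AlarmName": "responses-read-capacity", "NewStateValue": "ALARM"}'
print(parse_alarm_notification(sample))  # ('responses-read-capacity', 'ALARM')
```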
Sounds simple so far, right? The issue arises when DynamoDB does not report anything to CloudWatch, which happens whenever nothing is going on in DynamoDB. The metric then shows “Insufficient Data”. When reads or writes then hit DynamoDB, we might suddenly get a very high reading, causing us to start scaling when we do not necessarily need to. DynamoDB reports point-in-time states rather than a continuous data stream to CloudWatch (by contrast, EC2 instances report a constant CPU load, which is easy to measure). This can cause an issue we call “flipping”, where values swing too high or too low and scaling actions overshoot what we actually want to achieve. On top of that, there are constraints on what we can do. Any new capacity value (increase or decrease) must be:
- At least current value + 10%
- At most current value * 2
TreeCrunch’s DynamoDB Scaling Implementation:
- If DynamoDB has no activity, the metric shows “Insufficient Data”. We simply treat this as a consumed capacity of zero.
- To reduce flipping alarms (a high consumption reading vs. insufficient data), we average the value over a longer period of time, say 5 minutes. Even if there is only one data point within the 5-minute range, that is still enough to calculate an average.
Using a 1-minute average with an evaluation period of 5 gives us the same effect. This allows us to set the trigger time to 7 or 9 minutes.
Flipping alarms are actually useful here, because once an alarm has reached the ALARM state, it executes its action only once. In some cases we actually want the alarm to trigger more than once.
- The minimum increase of 10% and the maximum of 100% mean we may have to trigger the alarm multiple times before reaching our desired value; thus, an alarm that moves in relation to a new trigger value is required.
- We need to periodically check and decrease the capacity; an ordinary alarm won’t help, because at the time that alarm is triggered, the table may not yet be eligible to be scaled down.
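The smoothing step above can be sketched as a small pure function: treat minutes where CloudWatch reported nothing (“Insufficient Data”) as zero consumption and average over the window. The function name and the threshold in the example are illustrative.

```python
def smoothed_consumption(datapoints, period_slots=5):
    """Average consumed-capacity readings over the last `period_slots`
    one-minute slots, treating missing CloudWatch datapoints
    ("Insufficient Data") as zero consumption.

    `datapoints` is a list of per-minute readings; None marks a minute
    for which CloudWatch reported nothing.
    """
    window = datapoints[-period_slots:]
    # Missing readings count as 0, so a single spike is diluted
    # across the window instead of instantly flipping the alarm.
    return sum(p or 0 for p in window) / period_slots

# One 400-unit spike among four silent minutes averages to 80, which
# stays below a hypothetical alarm threshold of, say, 100 units.
print(smoothed_consumption([None, None, 400, None, None]))  # 80.0
```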
While the following illustration is fairly basic, it does provide some visual insight into how the procedure works.
The great thing about Amazon is that they are interested in how we use their services, so we are constantly exchanging ideas and concerns with them. Though we are not sure whether our current DynamoDB scaling methodology works in all cases, we have implemented it and will observe over time how it works for us. We have also submitted our approach to Amazon for their internal review.