Backing up DynamoDB tables with Hippolyte – our open source solution
Ocado Technology uses a wide range of Amazon AWS services for our day to day operations, as well as the microservices backbone powering our Ocado Smart Platform (OSP).
We’re also heavy users of Amazon’s NoSQL database, DynamoDB. Having a fully managed database solution has allowed many development teams to get on with the work they care about, knowing they can scale when they need to in production. In the infrastructure department it means we no longer have to worry about disks filling up, patching or database downtime.
However, moving over to a managed NoSQL solution has had its own challenges; we’ve had to be be far more cognisant of provisioned throughput and the associated cost implications, tuning up and down to meet our consumption needs. We were also used to having recent snapshots of tables at our disposal, and for us, this became an important feature we wanted to retain as we moved into the cloud.
Motivation for a Backup Solution
Since DynamoDB is a fully managed service, and supports cross-region replication, you may wonder why you even need to backup data in the first place. If you’re running production applications on AWS then you probably already have a lot of confidence in the durability of data in services like DynamoDB or S3.
Our motivation for building this was to protect against application or user error; no matter how durable Amazon’s services are, they won’t protect you from unintended updates and deletes. Historical snapshots of data and state also provide additional value, allowing you to restore to a separate table and compare against live data.
Introducing Hippolyte from Ocado Technology
Hippolyte is an at-scale, point-in-time backup solution for DynamoDB that we’ve built ourselves to solve this problem. It is designed to handle frequent, recurring backups of large numbers of tables, scale read throughput and batch together backup jobs over multiple EMR clusters.
We chose Amazon Data Pipeline as a tool to create, manage and run our backup tasks. Originally we started with directly initialising EMR clusters to extract the contents from tables and loading them in S3, however as you grow to a larger number of tables and clusters this becomes more of a pain to manage and supervise. Data Pipeline helps with orchestration, automatic retries for failed jobs, and the potential to make use of SNS notifications for successful or failed EMR tasks.
We also use AWS Lambda to schedule and monitor jobs. This is responsible for dynamically generating Data Pipeline templates based on configuration and discovered tables, and modifying table throughputs to reduce the duration of the backup job.
Part of the role of our scheduling Lambda function is to attempt to optimally assign DynamoDB tables to individual EMR clusters. Since new tables may be created each day and size may grow significantly, this optimisation is performed each night during the scheduling step. By default, each data pipeline only supports 100 objects, but through a service limits increase we’ve raised this to 500. This means each pipeline can support 166 tables, as each table requires three Data Pipeline objects:
Two additional objects are then needed; the pipeline configuration and an EmrCluster node. (166 x 3 + 2 = 500). As well as this hard limit, we also want every backup to run between 12:00 AM and 7:00 AM. We can work out how long each pipeline will take to complete by starting with some static values
- EMR cluster bootstrap (10 min)
- EMR activity bootstrap (1 min)
From there we can calculate how long each table will take to backup. This can be done with the following formula:
Where Size is the table size in bytes, RCU is the provisioned Read Capacity Units for a given table and ConsumedPercentage is what proportion of this capacity the backup job will use. Since each EMR cluster will run backup jobs sequentially and we have limits to the number of tables and length of time, we can pack each pipeline with tables until one of those two constraints is met.
Another challenge we had was that some tables are either too large to backup during the window of time with their provisioned read capacity. Here we derive the ratio between the expected backup duration and what is desired, and increase our read capacity units by this ratio. We can also increase the percentage of provisioned throughput we consume while preserving the original amount needed for the application. Typically, since we are paying for cluster and capacity by the hour, it’s rarely worth reducing the total expected duration.
Right now we’re using Hippolyte in production to handle the backup of almost all our tables, and it’s a huge improvement over previous solutions in this area. There still some future improvements we have in mind such as intelligently choosing between increasing read capacity verses added more EMR clusters to increase the speed, and using Spot Instances to reduce cost and increase speed further.
The project is now available on GitHub at https://github.com/ocadotechnology/hippolyte, so feel free to have a look for yourself if you’re interested in contributing.
Alex Howard Whitaker, Cloud Services Engineer