After selecting Ceph as our default storage platform, we had to learn how to use it, so, like many companies, we created a small trial environment to test it out with a select few applications. Once we were happy with the feature set, we built a small production pilot to see if there was real interest in using it internally in a real-world scenario.
We quickly found that not only was there interest, but people were so excited that we had to upgrade the solution within the first week! As this was a small production pilot, it was designed to hold only around 500 GB of data; our next big challenge was how to grow beyond that.
One of the early applications was sending massive amounts of short-lived information from our automated and real-time warehouses, and our small test environment wasn’t able to cope with the load. Thankfully, the whole idea behind Ceph is to scale out and grow easily, so it was just a matter of adding more resources.
In addition, we spent the first few weeks learning how to tune the cluster, improving its performance, and understanding more about how the system worked, which meant the cluster spent more time in maintenance mode.
As time went by, we developed better resource strategies, such as keeping the metadata on fast storage while moving the bulk data to slower disks. We also kept live-upgrading the cluster to newer versions of Ceph that provided new features and resources. Eventually the pilot evolved into our main cluster; we originally started with the Bobtail release, then upgraded in turn to Cuttlefish, Dumpling, Emperor, Firefly, Giant, Hammer and Infernalis. Today, our cluster is running Jewel. Between these major upgrades, we also applied a few minor updates.
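In Jewel-era Ceph, splitting fast and slow media like this is typically done by giving each class of disk its own CRUSH root and pointing each pool at a rule that draws from the right root. A minimal sketch, assuming SSD- and HDD-backed OSD hosts and CephFS-style pools (the bucket, rule, and pool names here are illustrative, not our actual configuration):

```shell
# Separate CRUSH roots for SSD-backed and HDD-backed hosts
# (hypothetical names; real clusters would then move hosts under these roots)
ceph osd crush add-bucket ssd-root root
ceph osd crush add-bucket hdd-root root

# Replicated rules that place copies across hosts under each root
ceph osd crush rule create-simple fast-rule ssd-root host
ceph osd crush rule create-simple bulk-rule hdd-root host

# Point the metadata pool at fast storage and the data pool at bulk storage.
# The ruleset IDs are assumptions; check them with `ceph osd crush rule dump`.
ceph osd pool set cephfs_metadata crush_ruleset 1
ceph osd pool set cephfs_data crush_ruleset 2
```

Later Ceph releases (Luminous onward) simplify this with built-in device classes, but in the versions we ran, separate CRUSH roots were the usual approach.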
Originally, we started with four nodes, then quickly grew to six, eight, 12 and then 16. Nowadays, Ceph is running on 72 nodes, none of which are the original 16. To make things even more exciting, we’re running Ceph on a version of Linux that wasn’t available when we started, and it is soon to be upgraded again.
We started the pilot with a design target of hundreds of GB, and have now expanded to hundreds of TB. I really can’t remember the last time I participated in a project where the sizing grew by three orders of magnitude!
All of this upgrading and updating has been done live, with real dependencies on our warehouses and supporting continuous deployment of applications without any downtimes or disruptions. We’re at the point now where we can seamlessly update our cluster with VMs running on it such that only the people doing the work realise what’s going on.
Luis Periquito, Unix Team Lead