Being the world’s largest online-only supermarket means Ocado eats big data for breakfast. Since its inception more than three years ago, the data team at Ocado Technology has been finding ever more efficient ways to manage Ocado’s digital footprint.
One way to achieve this goal was to be at the forefront of adopting cloud technologies. This article aims to offer a brief overview of how the data team tackled a major project to move all of Ocado’s on-premise data to the cloud. There have been several important lessons we’ve learned along the way and I’d like to use this opportunity to share a few of them with you.
The main motivation for starting this project was threefold:
- Reducing costs: the old, on-premise stack was expensive to upgrade and maintain
- Gaining more performance: we were hoping to achieve more elastic scaling based on demand
- Data centralisation: we wanted to remove siloing of data between different departments and business divisions.
The project was initially resourced using our own internal data team; we felt confident the team had the required skills to do an initial proof of concept. We then used a third party provider who adopted a rinse and repeat approach based on our work.
From the start, we had a clear idea of when we could declare the project completed: all data from our on-prem analytics databases had to be migrated into the cloud into Google Cloud Storage or, ideally, BigQuery. This target would allow us to further exploit technologies like DataProc or TensorFlow on Google Cloud Machine Learning. Throughout the migration project, we could also easily quantify the benefit this move to the cloud was bringing as the cost of work (the humans and the system) was very obvious.
We found there was no need to involve other parts of the business initially, and treated the project as a fixed-scope piece of work. However, as it evolved, we reevaluated the possibility of getting other teams involved so we could have a more inclusive, business-wide approach once the technology was well understood.
The ultimate desire was to move this project into the product stream to support the parallel streaming of data into the cloud. The prioritisation of these streams was handled by a product owner who also engaged with a steering group that took into account the current business needs.
We also set up a data curation team that would help business owners classify their data and land it in appropriate storage areas with correct access levels/retention, especially with Privacy Shield and GDPR. The data curation team also worked with the other teams to define the meaning of the data and create a set of business definitions.
Moving data around is not difficult, but assuring its quality is. How could we convince our stakeholders that the data in the cloud was indeed the same as that which they trusted on-premise? When it came to the quality of data, we implemented QA in several ways:
- We validated that the source database and the cloud were in alignment.
- Only certain data stores were classified as clean and assured
- Data sources were prepared in Tableau to expose clean data
- Those sources were validated with business users as they landed so that issues could be identified
At the end of the project, we were able to develop a series of processes that were production ready and supported through our technology teams.
Since adopting the Google Cloud Platform, we’ve reduced storage costs to a tenth, increased our storage capacity over twenty times and improved performance by hundreds of times compared to our previous approach of hosting data on-premise. Furthermore our development cycles on the data in the cloud has been significantly reduced as we implemented on demand computation power which allows us to experiment and iterate with much less latency and friction. Our initial results show how a cloud-first strategy can really bring benefits to the business, and we look forward to working with other like-minded retailers through our cloud-based Ocado Smart Platform.
To learn more about how Ocado Technology adopted BigQuery and other Google Cloud services, please register for this webcast.
Dan Nelson, Head of Data