Ceph logo

After selecting Ceph as our default storage platform, we had to learn how to use it, so, like many companies out there, we created a small trial environment to test it with a select few applications. When we were happy with the feature set, we created a small production pilot to see if there was real interest in using it internally in a real-world scenario.

We quickly found that not only was there interest, people were so excited we had to upgrade the solution within the first week! As this was a small production pilot, it was designed to hold only around 500GB of data; our next big challenge was how to grow beyond that.

One of the early applications was sending massive amounts of short-lived information from our automated and real-time warehouses and our small test environment wasn’t able to cope with the load. Thankfully, the whole idea behind Ceph is to scale out and grow easily, and it was just a matter of adding more resources.

In addition, we spent the first few weeks learning how to tune the cluster, improve its performance, and understand more about how the system worked, which meant the cluster spent more time in maintenance mode.

As time went by, we created better resource strategies such as keeping the metadata in fast storage while the bulkier storage was moved to slower disks. We also kept live upgrading the cluster to newer versions of Ceph that provided newer features and resources. Eventually the pilot evolved into our main cluster; we originally started with version Bobtail, then later upgraded to Cuttlefish, Dumpling, Emperor, Firefly, Giant, Hammer and Infernalis. Today, our cluster is running Jewel. Between upgrading to the major versions above, we also applied a few minor updates.

Originally, we started with four nodes, then quickly grew to six, eight, 12 and then 16. Nowadays, Ceph is running on 72 nodes and none of them are the original 16. To make things even more exciting, we’re running Ceph on a version of Linux that wasn’t available at the time we started, and is soon to be upgraded again.

We started the pilot with a design target of hundreds of GB, and have now expanded to hundreds of TB. I really can’t remember the last time I participated in a project where the sizing grew by three orders of magnitude!

All of this upgrading and updating has been done live, with real dependencies on our warehouses and supporting continuous deployment of applications without any downtimes or disruptions. We’re at the point now where we can seamlessly update our cluster with VMs running on it such that only the people doing the work realise what’s going on.

Luis Periquito, Unix Team Lead

April 19th, 2017

Posted In: Blog


Cloud image

Being the world’s largest online-only supermarket means Ocado eats big data for breakfast. Since its inception more than three years ago, the data team at Ocado Technology has been finding ever more efficient ways to manage Ocado’s digital footprint.

One way to achieve this goal was to be at the forefront of adopting cloud technologies. This article aims to offer a brief overview of how the data team tackled a major project to move all of Ocado’s on-premise data to the cloud. There have been several important lessons we’ve learned along the way and I’d like to use this opportunity to share a few of them with you.

 

'Growth in data' diagram

The main motivation for starting this project was threefold:

  • Reducing costs: the old, on-premise stack was expensive to upgrade and maintain
  • Gaining more performance: we were hoping to achieve more elastic scaling based on demand
  • Data centralisation: we wanted to remove siloing of data between different departments and business divisions.

The project was initially resourced using our own internal data team; we felt confident the team had the required skills to do an initial proof of concept. We then used a third party provider who adopted a rinse and repeat approach based on our work.

From the start, we had a clear idea of when we could declare the project completed: all data from our on-prem analytics databases had to be migrated into Google Cloud Storage or, ideally, BigQuery. This target would allow us to further exploit technologies like DataProc or TensorFlow on Google Cloud Machine Learning. Throughout the migration project, we could also easily quantify the benefit this move to the cloud was bringing, as the cost of the work (the humans and the system) was very obvious.
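
To give a flavour of what a single migration step can look like, here is a minimal sketch (not our production pipeline) of loading an exported CSV that already sits in Google Cloud Storage into a BigQuery table using the Python client; the project, bucket, dataset and table names are placeholders.

from google.cloud import bigquery

# Placeholder project, bucket and table names -- not the real ones.
client = bigquery.Client(project="example-project")

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
    autodetect=True,  # let BigQuery infer a schema for the first pass
)

load_job = client.load_table_from_uri(
    "gs://example-migration-bucket/orders/2017-03-*.csv",
    "example-project.analytics.orders",
    job_config=job_config,
)
load_job.result()  # block until the load finishes

table = client.get_table("example-project.analytics.orders")
print(f"{table.num_rows} rows now in BigQuery")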

'BigQuery performance' diagram

We found there was no need to involve other parts of the business initially, and treated the project as a fixed-scope piece of work. However, as it evolved, we reevaluated the possibility of getting other teams involved so we could have a more inclusive, business-wide approach once the technology was well understood.

The ultimate desire was to move this project into the product stream to support the parallel streaming of data into the cloud. The prioritisation of these streams was handled by a product owner who also engaged with a steering group that took into account the current business needs.

We also set up a data curation team that would help business owners classify their data and land it in appropriate storage areas with correct access levels/retention, especially with Privacy Shield and GDPR. The data curation team also worked with the other teams to define the meaning of the data and create a set of business definitions.

Moving data around is not difficult, but assuring its quality is. How could we convince our stakeholders that the data in the cloud was indeed the same as that which they trusted on-premise? When it came to the quality of data, we implemented QA in several ways:

  • We validated that the source database and the cloud were in alignment (a sketch of this kind of check follows the list)
  • Only certain data stores were classified as clean and assured
  • Data sources were prepared in Tableau to expose clean data
  • Those sources were validated with business users as they landed so that issues could be identified
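
To make the first bullet concrete, here is a simplified sketch of the kind of alignment check we mean: compare row counts from the on-premise source with what landed in BigQuery. The table names and figures below are made up for illustration.

from google.cloud import bigquery

# Counts taken from the on-premise source, however your warehouse exposes
# them; the values here are illustrative only.
source_counts = {
    "orders": 1204551,
    "deliveries": 998310,
}

client = bigquery.Client()
for table, expected in source_counts.items():
    sql = f"SELECT COUNT(*) AS n FROM `example-project.analytics.{table}`"
    migrated = list(client.query(sql).result())[0].n
    status = "OK" if migrated == expected else "MISMATCH"
    print(f"{table}: source={expected} cloud={migrated} -> {status}")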

At the end of the project, we were able to develop a series of processes that were production ready and supported through our technology teams.

Since adopting the Google Cloud Platform, we’ve reduced storage costs to a tenth, increased our storage capacity over twenty times and improved performance by hundreds of times compared to our previous approach of hosting data on-premise. Furthermore, our development cycles on the data in the cloud have been significantly reduced, as we implemented on-demand computation power which allows us to experiment and iterate with much less latency and friction. Our initial results show how a cloud-first strategy can really bring benefits to the business, and we look forward to working with other like-minded retailers through our cloud-based Ocado Smart Platform.

To learn more about how Ocado Technology adopted BigQuery and other Google Cloud services, please register for this webcast.

Dan Nelson, Head of Data

March 28th, 2017

Posted In: Blog


Ocado van

Smart cars need smarter roads

One of the many exciting parts of starting my new job at Ocado Technology has been the ability to go on a buddy route. For those who don’t work at Ocado, the buddy route offers new employees the option of accompanying one of our drivers on a delivery run.

Experiencing first-hand what it takes to get an order from an Ocado warehouse to a customer’s doorstep helped me form a few opinions when it comes to the future of transportation that I’d like to share with you.

Intelligent transportation at Ocado

Many people know Ocado from the brightly colored vans that travel the UK from one home to the next. However, not everyone gets the chance to ride in these vans or (literally) go under the hood (or bonnet, for my British readers).

Each Ocado van houses not only the different compartments needed for storing groceries but also a vast collection of sensors and embedded computing devices that stream information to the cloud in real-time.

Ocado Technology engineers have transformed these vans into a living, breathing network of IoT nodes that collects vast amounts of data about the UK transportation infrastructure. For example, these low-power embedded sensors constantly measure wheel speed, fuel consumption, engine revs, gear changes, braking and cornering speeds, bumps in the road, temperature, and other useful data. When correlated to the map of public roads in the UK, this information helps the Ocado Technology data science team figure out optimal routes for delivery so that drivers can actually fulfill the one-hour slot promise to customers.
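
To give a flavour of the kind of record involved (the exact schema isn’t public, so the field names and values below are purely illustrative), a single telemetry sample streamed from a van might look something like this:

from dataclasses import dataclass, asdict
import json
import time

@dataclass
class VanTelemetry:
    van_id: str
    timestamp: float
    latitude: float
    longitude: float
    wheel_speed_kmh: float
    engine_rpm: int
    fuel_rate_lph: float   # litres per hour
    braking_g: float
    cornering_g: float
    box_temp_c: float      # temperature of a grocery compartment

sample = VanTelemetry("van-0042", time.time(), 51.70, -0.03,
                      47.5, 2100, 6.2, 0.05, 0.12, 3.5)
payload = json.dumps(asdict(sample))  # what would be streamed to the cloud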

The need for an infrastructure upgrade

After looking closely at various simulation models based on our route data, I believe we need to act more methodically about how we make cars smarter.

Implementing advanced computer vision capabilities is a step in the right direction for manufacturers looking to improve road safety. However, the self-driving cars of tomorrow may quickly find themselves stuck on the same congested roads we often experience today if the infrastructure doesn’t get a major upgrade as well.

If we want to fully realize the dream of drastically reducing (or even eliminating) road congestion in Europe and beyond, the computing we embed into our cars must be mirrored by a similar bump in the intelligence of our roads.

At Ocado, we’ve been building an infrastructure of connected vans which enables us to find the optimal routes I described earlier. Imagine if more (or all) cars driving on our public roads would have these sensors on board; we could then extend this concept to a larger scale and guide vehicles automatically on the best routes available. A simple sprinkle of smartness could make a big difference when it comes to driverless vehicles.

Routing map

Finally, driverless vehicles are going to have to be way smarter because they will have to share the roads with cars driven by humans. We therefore need to develop new communications protocols that enable cars (driverless or not) to talk to each other and to the environment around them (e.g. traffic lights). One example of such a protocol is the 4G-based network we’ve built for the robots in our warehouse; its properties could be extended to handle the low-latency vehicle-to-vehicle (V2V) communications protocols needed by the automotive market.

The economics of self-driving trucks

Whenever someone mentions Steven Spielberg, people automatically think of movies such as E.T., Jaws, Saving Private Ryan, Jurassic Park or more recently Lincoln. For me, one of the definitive Spielberg classics remains Duel, a 1971 film where a businessman is relentlessly pursued by the malevolent driver of a truck, resulting in 74 minutes of cinema glory.

What makes the movie particularly interesting for me is the way Spielberg cleverly suggests the truck has a will and intelligence of its own – making it perhaps the first self-driving truck to be captured on film. Fast forward to a few years ago, and almost everyone involved in commercial transportation started to get serious about the prospect of self-driving trucks.

As someone who’s worked in the past with companies developing technology for autonomous vehicles and is now part of an organization that relies heavily on transportation to grow its business, I believe there are indeed many benefits to deploying self-driving trucks on our roads.

Ocado and transportation

Before I get to the above topic however, I’d like to give you a short overview of our operational model to set the scene for the second part of this article.

Before we can get customers’ groceries into Ocado delivery vans, they first need to be shipped to one of our warehouses – we call them Customer Fulfilment Centers (or CFCs, for short). CFCs are where large orders get split into smaller deliveries thanks to a great team of dedicated people and a high degree of automation.

For our suppliers, the most common way of sending large quantities of products to our warehouses is to employ commercial trucks.

In addition to the CFCs, we also use our own trucks to transport products to smaller local distribution centers called spokes. You can think of a spoke as a very small warehouse where a large batch of orders comes in and then immediately gets distributed to our smaller vans.

Delivery trucks and vans

The diagram below shows an overview of our entire distribution model and covers some of the points I’ve touched on in my introduction:

Last Mile infographic

You can imagine that between goods coming into our warehouses and our warehouses sending larger orders to spokes, quite a few miles need to be travelled before our van drivers knock on your door to hand you the order.

Truck meets technology

Any company involved in long-distance haulage can attest to the inherent inefficiency of single-truck deliveries. One way to improve haulage management is to organize vehicles in fleet-type formations: the leading truck determines the fleet’s route and speed while the others receive instructions through a low latency wireless connection.

Even though self-driving trucks would have a high degree of automation on board, human drivers would still be able to assume control under certain conditions (e.g. if they would need to enter or exit the platoon formation).

A perfect analogy to describe the relationship between humans and self-driving trucks would be the fly-by-wire feature present in most aircraft today where the pilot assumes manual control only in exceptional circumstances while the computers handle most of the hard work.

The advantage of having such a convoy is that trucks drive at consistent speeds and on optimized routes, which would help relieve congestion on many European roads.

Self-driving truck convoys can benefit their human drivers too. Compared to the daily commute of regular motorists, trucks are driven mostly on highways for hours on end, making for a very uneventful and tiring journey for the person behind the wheel; many truck drivers are away from home for extended periods of time and can lead a very sedentary lifestyle.

Finally, self-driving truck convoys could dramatically improve road safety by reducing the number and severity of accidents caused by commercial vehicles.

Challenges under the hood

Before we get caught in the self-driving hype, there are still quite a few challenges we need to address first. Perhaps the biggest hurdle is the regulation needed to go from a few trials in remote areas to deploying these automated vehicles at a large scale. This will likely take years since routes can cover multiple countries; we therefore need to achieve consistency between the traffic codes and regulations of each territory in a region in order to implement a unified fleet.

Secondly, we need to develop better connectivity protocols that deliver the extended range, reliability and low latency required by the automotive market. These protocols must be able to handle a comprehensive list of common and corner case situations such as sudden changes in the road layout. At Ocado, we have developed a low latency system in collaboration with Cambridge Consultants: it works over the 4G standard and enables us to quickly coordinate more than 1,000 robots in a fraction of a second – you can read more about this project here.

Finally, companies need to be aware of the public perception when it comes to computer-controlled machines. There are potential security implications related to using these vehicles for other purposes than those they were originally designed for, including as weapons (hence my original reference to Duel in the introduction).

Overall, I think it’s important to remember that we are still some years away from fully automated vehicles becoming a familiar sight on our roads. In the meantime, more self-driving trials will probably get underway in Europe so try to keep your composure if the next time you look in your rearview mirror, no one appears to be behind the wheel.

It’s either that, or Steven Spielberg is working on Duel 2.

Are drones the answer to faster home delivery?

I remember a time when the word drone conjured memories of sitting next to a random stranger before a concert only to hear them talk endlessly to their friends about how much they loved the headliner, went to every single show they had ever played, and bought every piece of merchandise they ever sold – complete with photographic evidence.

However, browse through the headlines dominating the news cycle of today and you will see the word drone mentioned for entirely different reasons. Indeed, everyone from Intel and Qualcomm to Amazon and Walmart is talking up drones nowadays.

Adding to the hype, a recent US federal ruling has made it possible for commercial drones to be used over populated areas without the need for a pilot’s license.

This sent drone enthusiasts on a PR mission to convince the general public that deliveries via drones are (a) imminent and (b) a very good idea. But beyond a few publicity stunts aimed at getting shoppers excited about the prospect of burritos falling from the sky, few have provided compelling answers to justify why drones should be the absolute future for home deliveries. In fact, if anyone took the time to speak with any serious drone manufacturer or business user, they would hear about a comprehensive laundry list of safety implications that need to be addressed before commercial drones can be used around people.

The last mile

Ocado has a large team of engineers working on route optimization for our delivery vans. In retail-speak, this part of the chain is called the last mile.

Previously used in telecommunications to refer to the final segment of the network that delivers services to end-users, the last mile is used by the retail industry to describe the final part of the supply chain that makes it possible for customers to receive their orders. Experts often describe the last mile as the most expensive, least efficient, and most problematic part of the overall delivery process.

The apparent logic behind drone deliveries is that they will solve many of the headaches associated with e-commerce, including the eternal inefficiency of the last mile process.

However, many choose to stay silent about (or blissfully ignore) two essential metrics associated with home deliveries: route density and drop size. These are incredibly important when it comes to the entire delivery process, regardless of whether you’re sending goods using a van or a drone.

Route density is the number of drop offs for a given delivery route; the drop size is the number of items delivered to each customer along a route. A look at the latest statistics from our analytics department shows that a typical Ocado customer spends £110 per order on average and our logistics department currently achieves 166 deliveries per van per week. Given that most customers place an order once per week, a delivery typically includes tens of products weighing several kilograms altogether.

Most drones struggle to carry anything above a couple of kilograms and have a limited range of 10-15 miles; that’s good enough for a burrito or a USB stick but not suitable for a crate of ambient, chilled and frozen products flown tens of kilometers from a delivery center into your backyard. These limitations affect both route density and the drop size and mean vans still have the edge over drones when it comes to last mile deliveries for the foreseeable future.
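
A quick back-of-the-envelope calculation makes the point, using the figures quoted above plus a couple of clearly labelled assumptions:

AVG_ORDER_VALUE_GBP = 110          # typical Ocado basket
DELIVERIES_PER_VAN_PER_WEEK = 166  # current fleet average
ORDER_WEIGHT_KG = 10               # assumption: "several kilograms altogether"
DRONE_PAYLOAD_KG = 2               # "a couple of kilograms"

weekly_value_per_van = AVG_ORDER_VALUE_GBP * DELIVERIES_PER_VAN_PER_WEEK
drone_flights_per_order = ORDER_WEIGHT_KG / DRONE_PAYLOAD_KG

print(f"Order value served by one van per week: £{weekly_value_per_van:,}")        # £18,260
print(f"Drone flights needed for a single weekly order: ~{drone_flights_per_order:.0f}")

Five or more round trips of 10-15 miles each, per customer, per week, before you even think about keeping chilled and frozen items at temperature.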

That doesn’t mean that the technology to lift goods into the air doesn’t have immediate applications for grocery deliveries; it’s just that drones will be part of the solution, and not the solution. A company might choose to handle small, top-up or ad-hoc type orders using drones for example (imagine something the size of your lunch being flown in via drone) while larger, weekly orders will still be delivered using the more familiar van method.

Ocado is leading the retail industry in efficiency when it comes to the last mile process and other logistics operations. We are constantly evaluating new ways in which we could extend our leadership position, including the use of drones and other types of robotics inside and outside of our warehouses.

February 9th, 2017

Posted In: Blog


Contact Centre Agent

Being the world’s largest online-only grocery supermarket with over 500,000 active customers means we get the opportunity to interact with people all across the UK on a daily basis. Ocado prides itself on offering the best customer service in the industry which is one of the many reasons why our customers keep coming back.

Since Ocado doesn’t have physical stores, there are two main ways our customers and our employees interact directly. The first (and probably most common) is when our drivers deliver the groceries to the customers’ doorsteps; the second is when customers call or email us using our contact center based in the UK.

Today we’re going to tell you a bit more about how a customer contact center works and how Ocado is making it smarter.

The customer contact center

On the surface, Ocado operates the kind of contact center most people are already familiar with; we provide several ways for our customers to get in touch, including social media, a UK landline number, and a contact email.

Contact Centre

Customers can email, tweet or call Ocado

When it comes to emails, we get quite a variety of messages: from general feedback and redelivery requests to refund claims, payment or website issues – and even new product inquiries.

Getting in touch with a company can sometimes feel cumbersome. To make the whole process nice and easy for our customers, we don’t ask them to fill in any forms or self-categorise their emails. Instead, all messages get delivered into a centralised mailbox no matter what they contain.

Contact Centre

Ocado customer service representatives filtering customer emails

However, a quick analysis of the classes of emails mentioned above reveals that not all of them should be treated with the same priority. In an old-fashioned contact centre, each email would be read and categorised by one of the customer service representatives and then passed on to the relevant department.

This model has a few major flaws: if the business starts scaling up quickly, customer service representatives may find it challenging to keep up, leading to longer delays which will anger customers. In addition, sifting through emails is a very repetitive task that often causes frustration for contact centre workers.

Clearly there must be a better way!

Machine learning to the rescue

Unbeknownst to many, Ocado has a technology division of 1000+ developers, engineers, researchers and scientists working hard to build an optimal technology infrastructure that revolutionises the way people shop online. This division is called Ocado Technology and includes a data science team that constantly finds new ways to apply machine learning and AI techniques to improve the processes related to running retail operations and beyond.

After analysing the latest research on the topic, the data science team discovered that machine learning algorithms can be adapted to help customer centres cope with vast amounts of emails.

The diagram below shows how we created our AI-based software application that helps our customer service team sort through the emails they receive daily.

Cloud computing model

The new AI-enhanced contact centre at Ocado

One of the fields related to machine learning is natural language processing (NLP), a discipline that combines computer science, artificial intelligence, and computational linguistics to create a link between computers and humans. Let’s use an email from a recent customer as an example to understand how we’ve deployed machine learning and NLP in our contact centres:

Example of feedback

The machine learning model identifies that the email contains general feedback and that the customer is happy

The software solution we’ve built parses the body of the email and creates tags that help contact centre workers determine the priority of each email. In our example, there is no immediate need for a representative to get in touch; the customer is satisfied with their order and has written a message thanking Ocado for their service.

We strive to deliver the best shopping experience for all our 500,000 + active customers. However, working in an omni channel contact centre can be challenging, with the team receiving thousands of contacts each day via telephone, email, webchat, social media and SMS. The new software developed by the Ocado Technology data science team will help the contact centre filter inbound customer contacts faster, enabling a quicker response to our customers which in turn will increase customer satisfaction levels. – Debbie Wilson, contact centre operations manager

In the case of a customer raising an issue about an order, the system detects that a representative needs to reply to the message urgently and therefore assigns the appropriate tag and colour code.

Data science at Ocado, using Google Cloud Platform and TensorFlow

This new ML-enhanced contact centre demonstrates how Ocado is using the latest technologies to make online shopping better for everyone.

Ocado was able to successfully deploy this new product in record time as a result of the close collaboration between three departments: data science, contact centre systems, and quality and development. Working together allowed us to share data and update models quickly, which we could then deploy in a real-world environment. Unlike a scientific demonstration where you’re usually working with a known set of quantities, the contact centre provided a much more dynamic scenario, with new data arriving constantly. – Pawel Domagala, product owner, last mile systems

Our in-house team of data scientists (check out our job openings here) trained the machine learning model on a large set of past emails. During the research phase, the team compared different architectures to find a suitable solution: convolutional neural networks (CNNs), long short-term memory networks (LSTMs) and others. Once the software architecture was created, the model was then implemented using the TensorFlow library and the Python programming language.
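
As an illustration of the general approach (a deliberately minimal sketch, not our production model; the label set and training emails below are invented), a small LSTM-based email classifier in TensorFlow might look like this:

import numpy as np
import tensorflow as tf

# Hypothetical tag set and toy training data; the real labels and emails are not public.
CLASSES = ["general feedback", "refund request", "redelivery", "website issue"]
train_texts = np.array([
    "Thank you, the driver was lovely and everything arrived fresh.",
    "Two items were missing from my order, please can I have a refund?",
])
train_labels = np.array([0, 1])

encoder = tf.keras.layers.TextVectorization(max_tokens=20000, output_sequence_length=200)
encoder.adapt(train_texts)

model = tf.keras.Sequential([
    encoder,                                                      # raw strings in, token ids out
    tf.keras.layers.Embedding(input_dim=20000, output_dim=64, mask_zero=True),
    tf.keras.layers.LSTM(64),
    tf.keras.layers.Dense(len(CLASSES), activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(train_texts, train_labels, epochs=3)

# Tag a new, unseen email with its most likely class.
probs = model.predict(np.array(["My delivery never arrived and I would like my money back."]))
print(CLASSES[int(np.argmax(probs[0]))])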

TensorFlow and Python logos

Python is the most popular programming language in the data science community and provides the syntactic simplicity and expressiveness we were looking for.

TensorFlow is a popular open-source machine learning toolkit that scales from research to production. TensorFlow is built around data flow graphs that can easily be constructed in Python, but the underlying computation is handled in C++ which makes it extremely fast.

We’re thrilled that TensorFlow helped Ocado adapt and extend state-of-the-art machine learning techniques to communicate more responsively with their customers. With a combination of open-source TensorFlow and Google Cloud services, Ocado and other leading companies can develop and deploy advanced machine learning solutions more rapidly than ever before. – Zak Stone, Product Manager for TensorFlow on the Google Brain Team

Understanding natural language is a particularly hard problem for computers. To overcome this obstacle, data scientists need access to large amounts of computational resources and well-defined APIs for natural language processing. Thanks to the Google Cloud Platform, Ocado was able to use the power of cloud computing and train our models in parallel. Furthermore, Ocado has been an early adopter of Google Cloud Machine Learning (now available to all businesses in public beta) as well as the Cloud Natural Language API.

Google Cloud Platform logo

If you want to learn more about the technologies presented above, check out this presentation from Marcin Druzkowski, senior software engineer at Ocado Technology.

Make sure you also have a look at our Ocado Smart Platform for an overview of how Ocado is changing the game for online shopping and beyond.

October 13th, 2016

Posted In: Blog


Globe of food

Although several analysts have recently downplayed their predictions for the consumer side of the Internet of Things market, IoT adoption in the enterprise segment is currently experiencing a boom.

One of the areas where IoT is set to make a huge impact is the online grocery retail sector. This comes at a time when more consumers are starting to understand the benefits of shopping online.

 

For example, Ocado has an active customer base that counts over 500,000 users; in addition, we’ve noticed that customers tend to stay loyal to Ocado over time thanks to a combination of great customer service and an easy-to-use shopping platform.

However, we believe there are several areas where IoT is helping us improve efficiency, reduce waste, and enhance the shopping experience for our customers. The two examples mentioned below illustrate some of the projects we’re actively working on and the initial results we’ve achieved thanks to the amazing team of engineers working at Ocado Technology.

Warehouse robots communicating over 4G

The Ocado Smart Platform (OSP) represents the most important breakthrough in online grocery retail. One of the many innovations implemented by the OSP is the use of robots for collecting customers’ groceries; you can find a diagram of how that works below:

Image of the hive

To make such a complex system of software and hardware function correctly, we needed a new kind of communications protocol to enable thousands of robots to rapidly communicate over a wireless network. We’ve therefore partnered with Cambridge Consultants to build a wireless system like no other.

This new network is based on the same underlying technology that connects your 4G mobile phone to the internet but operates in a different spectrum that allows thousands of machines to talk to each other at the same time. Each robot integrates a radio chip that connects to a base station capable of handling over 1,000 requests at a time. A typical grocery warehouse can thus use up to 20 base stations to create a small army of connected robots on a mission to ensure that your delivery gets picked in a record time of less than five minutes.

Moreover, since this system uses an unlicensed part of the radio spectrum, it could potentially be deployed for many other IoT applications that require low latency communications between thousands of devices. In addition, it can be deployed quickly too, as there’s no need to submit any form of paperwork related to standards compliance.

Equipping delivery vans with IoT sensors

We employ a large fleet of vans to deliver orders from Customer Fulfilment Centres (CFCs) to Ocado customers who purchase their groceries online. In order to manage this fleet efficiently, we equipped our delivery vans with a range of IoT sensors logging valuable information such as the vehicle’s location, wheel speed, engine revs, braking, fuel consumption, and cornering speed.

Vans at our Park Royal spoke

The vans then stream back this data in real time and also in greater granularity when they return to their CFCs. Ocado engineers then feed the data into our routing systems so the routes we drive tomorrow will hopefully be even better than the ones we drove today. We can also direct vans to park at the best possible location for a given time of day and take into account factors such as the current day of the week or school holidays.

At a time when inner city pollution is a growing health concern, reducing fuel consumption is not only a wise business decision but also an easy way to cut back on our carbon footprint. Furthermore, having a fleet of connected vehicles that is constantly exploring every corner of the UK enables us to gather lots of useful mapping information, including potential traffic jams and road closures.

This information could then be shared with other connected cars and help drivers manage their journeys more effectively. An example of such an initiative is the recent partnership between Mobileye, GM, Volkswagen and Nissan to create a set of crowdsourced maps that acts as the digital infrastructure for the self-driving cars of the future.

Alex Voica, Technology Communications Manager

September 22nd, 2016

Posted In: Blog


Christofer Backlin

Some time ago I needed to schedule a weekly BigQuery job that involved some statistical testing. Normally I do all statistical work in R, but since our query scheduler wasn’t capable of talking to R I decided to try a pure BigQuery solution (rather than go through the hassle of calling R from a dataflow). After all, most statistical testing just involves computing and comparing a few summary statistics to some known distribution, so how hard could it be?

It did in fact turn out to be just as easy as I had hoped for, at least for the binomial test that I needed. The summary statistics were perfectly simple to calculate in SQL and the binomial distribution could be calculated using a user defined function (UDF). The solution is presented and explained below, and at the very end there’s also a section on how to implement other tests.

Binomial testing using a UDF

Let’s recap the maths behind the one-sided binomial test before looking at the code. Given that an event we want to study happened in k out of n independent trials, we want to make an inference about the probability p of observing the event. Under the null hypothesis we assume that p = p0 and under the alternative hypothesis we assume that p < p0. The probability of observing k or fewer events under the null hypothesis, i.e. the p-value, is calculated in the following way:

p-value = P(X ≤ k) = Σ_{i=0}^{k} C(n, i) · p0^i · (1 − p0)^(n − i)

This can be expressed as the UDF below. It includes a few tricks to deal with the fact that the JavaScript flavour used by BigQuery lacks many common mathematical functions, like fast and accurate distribution functions. The binomial distribution function is calculated by performing the multiplications as additions in logarithmic space to get around overflow and underflow problems. The base change is needed to get around the fact that both the exp and log10 functions were missing.

/*
* Binomial test for BigQuery
*
* Description
*
*    Performs an exact test of a simple null hypothesis that the probability of
*    success in a Bernoulli experiment is `p` with an alternative hypothesis
*    that the probability is less than `p`.
*
* Arguments
*
*    k   Number of successes.
*    n   Number of trials.
*    p   Probability of success under the null hypothesis.
*
* Details
*
*    The calculation is performed as a cumulative sum over the binomial
*    distribution. All calculations are done in logarithmic space since the
*    factors of each term are often very large, causing variable overflow and
*    NaN as a result.
*
* Example
* 
*    SELECT
*      id,
*      pvalue
*    FROM
*      binomial_test(
*        SELECT
*          *
*        FROM
*          (SELECT "test1" AS id,   100 AS total,   10 AS observed,    3 AS expected),
*          (SELECT "test2" AS id,  1775 AS total,    4 AS observed,    7 AS expected),
*          (SELECT "test3" AS id, 10000 AS total, 9998 AS observed, 9999 AS expected)
*      )
* 
* References
* 
*     https://en.wikipedia.org/wiki/Binomial_distribution
*     https://cloud.google.com/bigquery/user-defined-functions
*
* Author
*
*     Christofer Backlin, https://github.com/backlin
*/
function binomial_test(k, n, p){
  if(k < 0 || k > n || n <= 0 || p < 0 || p > 1) return NaN;
  // i = 0 term
  var logcoef = 0;
  var pvalue = Math.pow(Math.E, n*Math.log(1-p)); // Math.exp is not available
  // i > 0 terms
  for(var i = 1; i <= k; i++) {
    logcoef = logcoef + Math.log(n-i+1) - Math.log(i);
    pvalue = pvalue + Math.pow(Math.E, logcoef + i*Math.log(p) + (n-i)*Math.log(1-p));
  }
  return pvalue;
}

// Function registration
bigquery.defineFunction(
  // Name used to call the function from SQL
  'binomial_test', 
  // Input column names
  [
    'id',
    'observed',
    'total',
    'probability'
  ],
  // JSON representation of the output schema
  [
    { name: 'id', type: 'string' },
    { name: 'pvalue', type: 'float' }
  ],
  // Function definition
  function(row, emit) {
    emit({
        id: row.id,
        pvalue: binomial_test(row.observed, row.total, row.probability)
    })
  }
);
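
When writing something like this it is worth cross-checking the UDF off-line. For a one-sided binomial test the p-value P(X ≤ k) is just the binomial CDF, so scipy gives the same quantity directly (the inputs below are arbitrary, not taken from any real dataset):

from scipy.stats import binom

# P(X <= k) for X ~ Binomial(n, p): the quantity the UDF accumulates in log space.
k, n, p = 8, 475, 0.02
print(binom.cdf(k, n, p))   # compare with binomial_test(8, 475, 0.02) in BigQuery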

Demonstration on a toy example

To demonstrate the UDF let’s use it for figuring out something we all have wondered about at some point or another: Which Man v Food challenge was really the hardest? Each challenge of the show was presented together with some rough stats from previous attempts by other contestants. Compiled in table form the data looks something like this (download the complete dataset as a CSV file or SQL statement):

Row | City          | Challenge              | Attempts | Successes
1   | San Antonio   | Four horsemen          | 100      | 3
2   | Las Vegas     | B3 burrito             | 140      | 2
3   | Charleston    | Spicy tuna handroll    | 475      | 8
4   | San Francisco | Kitchen sink challenge | 150      | 4

Just dividing the number of successes by the number of attempts isn’t a very good strategy since some challenges have very few attempts. To take the amount of data into consideration we’ll instead rank them by binomial testing p-values (assuming that there is no bias in the performance of the challengers that seek out any particular challenge). Here’s the SQL you need to apply the test above:

SELECT
  Challenge, Attempts, Successes,
  pvalue,
  RANK() OVER (ORDER BY pvalue) Difficulty
FROM binomial_test(
  SELECT
    id,
    total,
    observed,
    sum_observed/sum_total probability
  FROM (
    SELECT
        Challenge id,
        Attempts total,
        Successes observed,
        SUM(Attempts) OVER () sum_total,
        SUM(Successes) OVER () sum_observed
        FROM tmp.man_v_food
        WHERE Attempts > 0
  )
) TestResults
JOIN (
  SELECT
    Challenge,
    Attempts,
    Successes
  FROM tmp.man_v_food
) ChallengeData
ON TestResults.id == ChallengeData.Challenge
;
Row | Challenge           | Attempts | Successes | pvalue    | Difficulty
1   | Shut-up-juice       | 4000     | 64        | 1.123E-56 | 1
2   | Stuffed pizza       | 442      | 2         | 8.548E-12 | 2
3   | Johnny B Goode      | 1118     | 30        | 1.875E-10 | 3
4   | Spicy tuna handroll | 475      | 8         | 1.035E-7  | 4
5   | Mac Daddy pancake   | 297      | 4         | 6.094E-6  | 5

An alternative way to tackle the problem – that is related and arguably better – is to infer and compare the success probability of each challenge. We can do this by finding the posterior probability distribution of the success probability q using Bayesian inference and extract the median and a 95% credible interval from it. Using Bayes’ theorem we have that

P(q | k, n) = P(k | q, n) · P(q) / P(k | n)

For computational simplicity we’ll choose uniform priors, turning the fraction to the right into a normalisation constant. Thus we arrive at the following expression for calculating any a-quantile qa of the posterior, which is the continuous analogue of the expression for the binomial test defined above:

a = ∫_0^{q_a} q^k (1 − q)^(n − k) dq / ∫_0^1 q^k (1 − q)^(n − k) dq

Implemented as a UDF (code here) we can use the following query to infer the success probability:

SELECT
  Challenge, Attempts, Successes,
  q_lower, q_median, q_upper
FROM (
  SELECT
    id, q_lower, q_median, q_upper
  FROM bayesian_ci(
    SELECT
        Challenge id,
        Attempts total,
        Successes observed
  FROM tmp.man_v_food
  WHERE
    Attempts IS NOT NULL
    AND Attempts > 0
  )
) TestResults
JOIN (
  SELECT
    Challenge,
    Attempts,
    Successes
  FROM tmp.man_v_food
) ChallengeData
ON TestResults.id == ChallengeData.Challenge
ORDER BY q_median
;
Row | Challenge               | Attempts | Successes | q_lower  | q_median | q_upper
1   | Stuffed pizza challenge | 442      | 2         | 5.529E-4 | 0.00618  | 0.0157
2   | Mac Daddy pancake       | 297      | 4         | 0.00514  | 0.0157   | 0.0341
3   | Shut-up-juice           | 4000     | 64        | 0.0115   | 0.0162   | 0.0213
4   | Spicy tuna handroll     | 475      | 8         | 0.00814  | 0.0182   | 0.0330
5   | B3 burrito              | 140      | 2         | 0.00399  | 0.0189   | 0.0503
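
A handy sanity check here: with a uniform prior, the posterior for q after k successes in n attempts is the Beta(k + 1, n − k + 1) distribution, so the quantiles can be read straight off scipy. This is a sketch for cross-checking only; the exact interval endpoints depend on how the quantiles are defined and integrated, so they may differ slightly from the table above.

from scipy.stats import beta

def credible_interval(k, n, level=0.95):
    a, b = k + 1, n - k + 1          # Beta posterior under a uniform prior
    tail = (1 - level) / 2
    return beta.ppf(tail, a, b), beta.ppf(0.5, a, b), beta.ppf(1 - tail, a, b)

# "Stuffed pizza challenge": 2 successes in 442 attempts.
print(credible_interval(2, 442))     # the median comes out around 0.006, as in the table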

Implementation of other tests

The binomial test lends itself particularly well to UDF implementation because the binomial distribution is easy to implement. Similar examples include Fisher’s exact test and the Poisson test. However, many commonly used tests do not fall into this category.

Tests whose null distribution is considerably harder to implement include Student’s t-test, the χ2 test, and Wilson’s test of proportions. For those you are probably better off using Dataflow or Spark, but if you desperately want to you can use BigQuery alone. In that case you need to precalculate the distribution functions for every degree of freedom you might want to use, store them in a separate table, calculate summary statistics and degrees of freedom for each test, join the two, and compare (just like in the good ol’ days of Neyman, Pearson, and Fisher!). If you go down this route you might want to use a theta join.

Non-parametric tests, like Wilcoxon’s rank sum and signed rank tests, require yet another approach because they use all the data points to define the distribution. To use them you must aggregate all data points of each test into a single row and pass it to the UDF. This is because UDFs cannot operate on multiple rows simultaneously (more info). Note that in order to do so you’ll have to use aggregation functions that are only available in standard SQL (ARRAY_AGG and friends), but not in legacy SQL, which is still the default. Also note that standard SQL is still in beta and that UDFs are wrapped in a different way.

Christofer Backlin, Data Scientist

August 3rd, 2016

Posted In: Blog


Ali Major

In this blog I’ll show you how to create simple, easy-to-follow roadmaps for the whole team to buy into (and stick to!).

My roadmap philosophy is ‘plan to achieve specified deliverables over a defined time period. This plan is dynamic, and goal and data driven (measurable)’.

A few key points:

  • My roadmap is NOT completed in isolation but involves other Data Department Product Owners (POs), the Business and my development teams.
  • The Data Department roadmaps span a three month period (we call them seasonal roadmaps).
  • My roadmaps must take into account: legacy, new development, and support across multiple stakeholders.
  • It’s dynamic, kept up to date, and not hidden away.

My approach

Step 1: Time to think

In the month prior to the start of the new roadmap, I ponder what it is we should be working on to meet the requirements of our Data Department, our clients and Ocado Technology. I create a draft and run it by my development team lead and the Business representative/s. This is all high level and, importantly, not the specifics of how we will do the work.

Step 2: Stop pondering, draw circles

After we, the Data Department POs, finish pondering, we schedule a circle session (I love this exercise). We each explain what we think we should be working on and jot this on a board in one or two words.

Where we have the same theme – for example, monitoring – we circle this with a coloured marker. Each colour represents one of our teams. We also discuss dependencies and mark these accordingly. At the end of the exercise we can gauge:

  • Range of work proposed and if we are trying to cover too much
  • Common themes e.g. we have lots of circles around monitoring
  • How many of these themes are dependent on other teams

Circle session

Step 3: Finalise and socialise

Next is to tweak and socialise the draft roadmaps with the development teams and relevant stakeholders.

During the finalisation process I do four more very important things:

  1. A retrospective on how we performed against our previous roadmap, and what lessons-learned we should concentrate on during this new roadmap.
  2. Story Mapping to break the work into epics and stories, the ‘how’ of achieving the goals and features (the Post-it note company must love the take-up of Agile…).

Roadmap goals

Story mapping

  3. Once I have prioritised the themes, the team then ranks the epics and stories in the order we’ll work on them. Then we estimate time based on a comparison to other epics.

I disagree that this is a non-agile approach. These are very rough estimations, done quickly. We will groom the epics and stories to a Ready state later, when we get closer to needing to work on them. The estimation gives us an idea of what we reasonably think we can commit to on this roadmap; for example, can we do story C within the three months, taking stories A and B into account?

  4. Lastly, we jot down story acceptance criteria AND we play some games throughout the process. I suggest Pictionary as it lets you know who not to let work on your UI!

Encountered Problems

Sometimes I am in the difficult position where I know one of our key features is lacking detail to adequately break into epics and stories. And if it’s not clearly defined, we shouldn’t work on it, right?

In reality, sometimes you just need to get on with it, so below are approaches I’ve taken in the past to mitigate the issue. Remember to always keep the stakeholders in the loop, and the team must agree.

  1. Add feature to roadmap and allocate a percentage of time during the three months. We postpone breaking the requirement into epics and stories until I have the missing information. Then we storymap, estimate what we think we can fit in, and update the roadmap because (repeat after me) the roadmap is dynamic.
  2. Add to roadmap with target metrics but agree between all stakeholders that this may not be achieved. As more information comes to light we update the roadmap.
  3. Leave it off the roadmap, with stakeholder agreement to add it to the next roadmap instead.

A dynamic roadmap should not mean mission-creep

By ‘dynamic’ I mean it stays relevant. It is a living document and kept up to date. I use the roadmap as a tool to push back, but I don’t become inflexible. The highest priority tasks can always be done.

My roadmap means I can identify the opportunity cost (e.g. feature X won’t be completed), and I can measure what changed and what was impacted during and at the end of the three months. It is a balancing act.

What’s crucial is communication, communication and, one more time, communication. I make a call based on discussions with senior managers, the business and the development team. It is also imperative that once the roadmap is updated it is socialised again.

Only three months? What about the long term vision?

Focusing on longer terms goals, but delivery in small chunks with metrics to measure against, works best for us. It allows flexibility where technology or goals are shifting.

I know many disagree with a roadmap covering less than six or even twelve months. It’s important that my roadmap goals are linked to the Data Department’s long term roadmap or the long term Business requirements. Both my products and the Data Department have far reaching vision statements, so we know if we are heading down the wrong path.

What my roadmap looks like

I use Roman Pichler’s Go Template. It is simple and doesn’t come with preconceptions like a Gantt chart does.

I was so proud when, for the first time, I could cut and paste directly from the Data Department long term goals roadmap into my seasonal roadmap! It seems for many POs their roadmaps don’t have a direct correlation with the next level above, or the level above that etc.

Road map

Roman Pichler source on roadmaps: http://www.romanpichler.com/blog/working-go-product-roadmap/

To wrap up

A roadmap is a great tool, but you need to find what style and process works for your team. Keep it dynamic and refer back to it each sprint planning.

Measure your success and enjoy crossing out deliverables as they are done.

Ali Major, Product Owner

Read more Product Owner advice from Ali

January 5th, 2016

Posted In: Blog


Matt Whelan

My team builds simulations of physical systems. Our work falls into three categories: experimental, tactical, and operational.

At the experimental end, we build simulations and design tools for new technologies and warehouse layouts, along with prototype control algorithms.

Tactically, we try out proposed changes to our warehouse topologies in silico and perform ROI analysis. We create and mine large data sets so we can spot and remove risk from our growth strategy.

Operationally, we pipe streams of production data into 3D visualisations, originally developed for playing back simulations, allowing real-time monitoring of our live control systems.

We get to work on some pretty bold conceptual projects because, when working at such a massive scale (last year our operation turned over £1 billion), even seemingly small percentage efficiency savings mean serious money to the business.

I read a lot about how the more theoretical aspects of computing – things that interested me in the subject in the first place – aren’t as important in the ‘real world’ of enterprise software development. There are big players in all kinds of industries getting left behind because they shy away from AI, robotics, and large scale automation. I think we’re really lucky that we get to spend our time creating novel path searches, travelling salesman solvers, discrete optimisers and the like, and it gives us an edge over our competitors in a fierce market.

The team is a real mixed bag of interests and hobbies. We have a physics doctor, a swing dancer, and a gaming software expert for starters. One thing we all have in common is that we’re unfazed by scale – an attitude which pervades Ocado Technology – and all looking to be the person with the big idea.

The beauty of the environment we’re in is that we can prove how big that idea is before millions are spent on building it.

If that sounds like a team you want to be a part of, these are the positions we’re recruiting for now:

Full Stack Django/Celery Software Engineer

Java Software Engineer (SE2) – Simulation

Senior Java Software Engineer – Simulation

Matt Whelan, Simulation Research Team Leader

September 16th, 2015

Posted In: Blog


Ben and Mike

In computer science, data is often modelled in a hierarchical tree-based structure, with a root node and subtrees of child nodes:

Table 1

Within Ocado Technology, a number of our systems contain data that follows this structure, and it’s commonly stored in traditional relational databases, such as Postgres. It’s a trivially easy structure for a relational database to store (a table with a foreign key to itself, normally called the parent field) – and it’s also trivial to retrieve the children or the parent for a given node. This is easily represented with a Django model and query code such as the following:

    from django.db import models

    class Node(models.Model):
        name = models.CharField(max_length=24)
        # null=True so that root nodes (which have no parent) can be stored
        parent = models.ForeignKey('self', related_name='children', null=True)

    >>> node = Node.objects.get(name='C')
    >>> node.children.all()
    [<Node: D>, <Node: E>]
    >>> node.parent
    <Node: A>

However, in order to fetch the entire subtree, recursion is required – this is far from ideal, as it will result in many queries (effectively one per node), especially for deep trees. This is where a closure table can help.
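
To see where that cost comes from, here is what the naive recursive fetch looks like with the Node model above: one SELECT per node visited, so a tree of N nodes needs roughly N queries.

def get_subtree(node):
    """Fetch every descendant of `node` by walking the tree recursively."""
    descendants = []
    for child in node.children.all():  # one SELECT per node visited
        descendants.append(child)
        descendants.extend(get_subtree(child))
    return descendants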

What is a closure table?

A closure table allows us to obtain all the children (descendants) or parents (ancestors) of a given node. It does this by building an additional auxiliary table containing the relationship between each parent and child, regardless of the tree depth between the two. For example, the closure table for the above diagram will contain the following entries:

Parent Child
A A
A B
A C
A D
A E
B B
C C
C D
C E
D D
E E

These relationships can be represented on the diagram above like so:

Table 2

The purple arrows represent the rows in the closure table, while the black lines represent the real relationships stored by the foreign key in our original database table.

This auxiliary table allows us to use a single query to fetch all the ancestors or descendants for a given node – it’s a simple SELECT query, fetching all the rows where either the child or parent field are the node you’re querying.

The table needs to be constructed and maintained as your original data structure evolves over time. Every time a node is inserted into the tree, or a relationship is removed, the closure table needs to be updated.

To illustrate adding a new node into the main tree, we’ll use the following diagram. We are adding the subtree containing the nodes F, G and H as a child of node E.

Subtree

To add the new subtree into the main tree, you need to first find all the parents of node F’s intended parent (a query on the closure table where child is E), and all the children of F (a query on the closure table where parent is F). In this case, the two queries will return A, C, E and F, G, H. Note that E is returned when querying for E’s parents, and F is returned when querying for F’s children – this is due to the closure table also containing self-referential entries; a relationship between each node and itself. This is because the final step when adding a new subtree requires the creation of parent-child relationships in the closure table between all the nodes in the product of those two returned lists:

      A      C      E
F  |  A->F   C->F   E->F
G  |  A->G   C->G   E->G
H  |  A->H   C->H   E->H

The closure table now contains all the information required for the full tree. Querying for the descendants of node A will now return B, C, D, E, F, G and H. Querying for the ancestors of node G will return F, E, C and A.
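
In code, the cross-product step might look something like the sketch below. This is a simplified illustration rather than django-closuretree’s actual implementation; NodeClosure is a hand-written stand-in for the auxiliary table.

from django.db import models

# Hand-written stand-in for the closure table (django-closuretree builds an
# equivalent model for you, as described later in this post).
class NodeClosure(models.Model):
    parent = models.ForeignKey(Node, related_name='+')
    child = models.ForeignKey(Node, related_name='+')
    depth = models.IntegerField()

def attach(subtree_root, new_parent):
    """Record closure rows when `subtree_root` (F) is attached under `new_parent` (E)."""
    # Ancestors of the new parent, including itself: A, C, E in the example.
    ancestors = NodeClosure.objects.filter(child=new_parent)
    # Descendants of the node being attached, including itself: F, G, H.
    descendants = NodeClosure.objects.filter(parent=subtree_root)
    # One new row per (ancestor, descendant) pair, with the combined depth.
    NodeClosure.objects.bulk_create(
        NodeClosure(parent=a.parent, child=d.child, depth=a.depth + d.depth + 1)
        for a in ancestors
        for d in descendants
    )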

Removing a node (or, rather, removing a relationship) requires the removal of all relationships in the closure table that traverse that node. For example, to disconnect the subtree we just added by removing link E->F, we need to remove all the relationships we added above. The idea is simple. An entry in the closure table X->Y is removed if there is both:

  • an entry in the closure table of the form X->E
  • an entry in the closure table of the form F->Y

It turns out the query to find all such nodes is relatively simple. We filter on the closure table and (taking the first part as an example) we look for entries with a parent that has an entry where the child is E, and similarly for the second query.

Often, a closure table will also store the depth of each relationship. This allows the query to easily be limited to a certain distance from the node.

Implementing Closure Tables in Django

As we mentioned above, there are a number of systems within Ocado Technology that store tree-based data in a relational database fronted by applications written in the Django framework. In order to make this simple for us, we wrote a utility that automatically creates and maintains the closure tables for your models.

Today, we’re open sourcing that library, django-closuretree. It was built with the goal of adding the power of a closure table to your models as transparently as possible, providing some convenience methods to take advantage of the efficient querying possible with closure tables. To extend the Django example from the very beginning, we can rewrite our model to the following:

    from django.db import models
    from closuretree.models import ClosureModel

    class Node(ClosureModel):
        name = models.CharField(max_length=24)
        parent = models.ForeignKey('self', related_name='children', null=True)

(The parent field name is configurable in django-closuretree.)

Note that we’re no longer inheriting from django.db.models.Model, but instead from closuretree.models.ClosureModel. For the majority of cases, this is the only change required to build and maintain the closure table for you. The query to return all subtree children for the given node would be:

>>> Node.objects.get(name='A').get_descendants()
[<Node: B>, <Node: C>, <Node: D>, <Node: E>]

This will only execute one query, and is therefore far more efficient than recursively querying for each node’s direct children. The same is true for the parents of the node:

>>> Node.objects.get(name='E').get_ancestors()
[<Node: C>, <Node: A>]

For more advanced information on how to use django-closuretree in your own projects, read the documentation, or browse the code on Github. But how is this implemented in the Django framework? What does extending from closuretree.models.ClosureModel do? It turns out that this wasn’t trivial.

Closure tables require an additional table for each model that stores a tree-based structure. In Django, tables map to models – but we didn’t want the user to have to define an extra model to create the closure table. In fact, inheriting from ClosureModel creates this additional model on the fly, by using Python’s ability to define classes dynamically using type. The ClosureModel base class actually uses a special metaclass (a Python class that is used to create other classes – note, classes, not objects) that hijacks the creation of your model’s class, and adds an additional django.db.models.Model class into your module that defines the Django model representing the closure table for your model. This happens on the import of your models.py, and therefore by the time that Django comes to create your database tables (when you run syncdb, for example), the closure table model is already there – and Django simply views it as another model that you yourself have defined.
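
As a stripped-down illustration of that mechanism (the idea only, not the library’s actual code), here is roughly how a closure model like the NodeClosure shown earlier could be conjured up at runtime with type():

from django.db import models

def make_closure_model(model_cls):
    """Build a <Model>Closure class at runtime instead of writing it by hand."""
    attrs = {
        # Django needs __module__ to work out which app the model belongs to.
        '__module__': model_cls.__module__,
        'parent': models.ForeignKey(model_cls, related_name='+'),
        'child': models.ForeignKey(model_cls, related_name='+'),
        'depth': models.IntegerField(),
    }
    return type(model_cls.__name__ + 'Closure', (models.Model,), attrs)

NodeClosure = make_closure_model(Node)  # same shape as the hand-written version above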

This table is simply a model that contains three fields: parent (a foreign key to your model); child (also a foreign key to your model); and depth (an integer representing the depth of the relationship).

Note: Django allows you to subclass models from other models. In this case, the closure table is only created for the base model, not any models that inherit from it – it’s unnecessary for any of the subclassed models, as they are essentially extensions of the base model.

That takes care of the creation of the closure table in the database, but that’s not much use unless this table is properly populated and maintained as data is inserted and removed from the tree. To do this, the ClosureModel that you inherited your model from overrides Django’s default save and delete methods to update the closure table whenever data changes. (This means that if you override save and delete yourself, you definitely must call the super versions of the methods, otherwise the closure table won’t properly reflect the state of your data – but you knew that already, right?)

On the save of a model instance, the ClosureModel will update the closure table as appropriate. If this is a new model (one that has never been saved before), the self-referential relationship is created (an instance of the closure table model is instantiated and saved with parent and child pointing to the current model instance, and depth set to 0), and the other relationships are instantiated in the same way (by using the method shown above – creating relationships for the product of the intended parent’s parents and the current object’s children).

If the model instance already existed when save was called, we don’t need to create the self-referential relationship (it will already be in the closure table). Instead, we only need to update the relationships in the closure table if the parent of this instance has changed. How do we know that? The ClosureModel class overrides Python’s magic method __setattr__, and keeps an eye on the parent field[1] on your model. If this is ever changed, an attribute is set on your instance (storing the reference to the old parent, if there was one) that is later checked in the save method. If the parent has changed, then the save method now deletes all the relationships with the old parent, and creates all the relationships with the new parent.

ClosureModel’s delete method simply removes all relationships from the closure table that traverse the instance you’re deleting, using the logic shown above. Django’s ORM makes this pretty painless and allows us to do it with a single queryset filter.

The ClosureModel class provides methods such as get_children, get_descendants and get_ancestors that use the closure table associated with your model to efficiently query for the objects you require. All transparently provided for you, just by subclassing from ClosureModel instead of Django’s db.models.Model – whoever said there’s no such thing as a free lunch?

Implementing closure tables in Django required (or allowed, who knows… perhaps we just like these things!) us to use some of the more advanced features of Python. The closure table model is created on the fly using Python’s dynamic class creation function type. Python’s metaclasses are often looked upon as magic, but they’re really just a clever way of dynamically manipulating the way that classes are created at run time – and these are also used to create the closure table model on the fly.

It’s always interesting to discover more about the power of the language you’re using, and stretch the boundaries of your knowledge of the frameworks you use. Check out the code for django-closuretree on Github, and see for yourself how we’ve implemented the power of closure tables in Django.

Ben Cardy, Network Engineer, and Mike Bryant, Systems Administrator

[1] There’s a configurable option that allows you to specify a sentinel attribute that’s watched instead. This is particularly useful in one of our applications where the parent relationship isn’t direct. That is, AModel->BModel->CModel->AModel. We define a property parent that traverses this relationship, and the sentinel attribute is the foreign key to BModel.

February 6th, 2015

Posted In: Blog

