Ocado van

The story of how to quickly check your team’s practices while having a lot of fun.

How do firefighters assess their readiness for an emergency situation? How do they ensure that everyone knows what to do in case of a fire?

The best way to check this is to organise drills that closely simulate a real-life emergency.

Firefighters train regularly to check and improve their procedures: they induct new members, test new equipment, and build trust in each other.
Nothing can teach you more than real practice.

Like firefighters, data scientists need to perform as a team. This means introducing ways of working to new joiners, making use of the best tools and techniques they have at their disposal, and knowing each other’s skills and personalities.

 

In this article we would like to describe how we organised an internal Kaggle-like competition to test and assess our data science procedures.

Our small data science competition

Recently, we organised a machine learning competition at Ocado Technology to check how well equipped our data science teams were to solve real problems under pressure. We invited data scientists from our five offices (i.e. Kraków, Wrocław, Sofia, Barcelona, and Hatfield) to our headquarters in the UK. We ordered some pizzas and started a hackday.

We formed teams and adopted only one rule: if two data scientists work on the same team in real life, they could not work together during the hackday. We wanted to encourage people to get to know each other.

We decided to use a Kaggle-style competition. For people who are not familiar with Kaggle, it is a platform for competitions in which the business problem, data, and evaluation metrics are defined by the organisers; participants then ‘only’ have to build the corresponding machine learning models.

Our problem

The goal of the competition was to predict the total time-at-door for Ocado delivery vans.  

We wanted to know how much time it would take to deliver groceries to a particular address. Ocado is using these delivery times to plan the van routes with more certainty and thereby open up more one-hour time slots for customers to choose from.

Ocado vans

Similar to Kaggle, we prepared some baselines and a leaderboard where we showed the best solutions; this gave participants additional motivation to build something better than everyone else. At the end of the competition, the teams presented their findings and models. We learned a lot about our data, our practices, and ourselves.

It was a great day so we wrapped things up with the customary pint in the pub.

Five lessons learned after the Kaggle competition at Ocado

You can easily apply these lessons in your data science team or data department:

 

    1. Hackdays are a great chance to socialise

There’s nothing like a competition to get people from different offices working together and getting to know each other. People can learn about each other’s strengths and weaknesses. We found problem-solving to be a great team-building exercise. After the event, we ran a survey which confirmed that people had indeed had a lot of fun.

    2. Machine learning models are only the tip of the iceberg

As an organiser, you need to choose the problem wisely: it cannot be too difficult to solve in one day but still should be challenging. You have to define the evaluation metrics, gather data, split it into training and test sets, write down the rules etc. During his presentation at NIPS 2016, Ben Hamner (CTO of Kaggle) confirmed that his employees invest hundreds of hours in properly setting up the competitions behind the scenes. In all data science projects, only 5-10% of the time is spent on modeling.

    3. Data science is all about iterations

During the competition, some teams over-complicated their models: they tried to check too many things at the same time and overestimated what is feasible to do during one day. At the end of the day, only working models really matter (all teams had plenty of ideas on what they would have liked to check but ran out of time).

It works pretty similarly in real life. We’ve written about this here as well.

Practice and contests like this can show your team the benefit of iterative work.

    4. Domain knowledge can make all the difference

Rather than trying a more complicated model, it’s better to first invest energy into understanding the metrics, analysing the data, and checking distributions and outliers. The team that won the competition used their knowledge of Ocado’s business to improve their model. In real life, domain knowledge is very often essential.

    5. Improve your engineering practices

Python and R are two of the most popular programming languages for data scientists.

To work effectively, you need to know your tools very well, including programming languages and frameworks. If you want to rapidly check hypotheses or add new variables, you cannot be blocked by technologies.

This hackday showed us that we need to work harder on unifying our technology stack and adjusting the induction process to ensure that everyone can easily get the data, make the analysis or model and share their results with the rest of the team.

 

As we’ve seen, a one day hackday event can provide a very useful health check for your team. You can check how people organise their work, what tools they are using and how they are working to solve problems. But hackdays can be beneficial not only for data science or engineering teams; management teams can use them to decide training budgets, investment in tools and technology, or for forming new teams. We therefore strongly encourage you to involve your managers or team leaders in these events as much as possible.

Try to conduct similar competitions in your company. We assure you that you will learn much more than you’re expecting while having a lot of fun.

Lukas Innig, Marcin Druzkowski

May 18th, 2017

Posted In: Blog


Contact Centre

From idea to production system – the story of how an NLP project in the Ocado contact centre improved reply times by up to 4x. Also, ten tips for other Data Science teams.

 

A few months ago, we described on our blog how machine learning (ML) improved efficiency in our contact centre. Today we would like to tell you how we built this system, what we have learned along the way, and how we were able to reduce response times for customer emails by up to 4x.

Email dump cartoon

Presenting the problem

Imagine that you are a manager of a sizeable contact center that is getting a few thousand customer emails on a daily basis. Your customers typically contact you about very different things. For example:

  • John wants to give feedback about how polite his driver was
  • Matthew asks for a refund because his product was damaged
  • Alice informs us that she isn’t at home so the delivery won’t be successful
  • Jane wants to say thank you for her great first delivery

As a manager you need to answer two questions: 1. How long can an email wait in the queue without a response? 2. Is Alice’s request more important than John’s feedback?

There are no easy answers for these questions. All contact center managers need to deal with these problems and Ocado is no exception.

Offering a solution

Imagine that you have a system that assigns appropriate tags based on the content of an email like in the example below:

Customer service email quote

Later, another function determines the priority of that email (and how quickly you should react) based on tags returned from the machine learning model. In our contact center, the tag cloud included labels such as Feedback, Food issue, Spam, Damaged item, Voucher, Quality and a few others.

You may wonder why we split this process into two steps rather than classifying priority directly.

This was one of our lessons learned. When you are building machine learning models for a real business, you need to take into account that the business will change and priorities will shift; baking those shifting priorities directly into your model is a bad idea. To stay agile, you need to give the business a lot of flexibility.

Here is an example of how assigning priorities would work:

Chart assigning email priorities

In our proposed solution, the contact center manager can decide that emails tagged “Thank you” (generally sent by happy customers) are not as important as “Payment issue”-type emails which must be answered in a matter of minutes.
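The split can be sketched in a few lines. The tag names and priority values below are illustrative, not Ocado’s actual configuration:

```python
# A minimal sketch of the two-step design: the ML model emits tags, and a
# separate, easily editable mapping turns tags into priorities. The tag names
# and priority values below are illustrative, not Ocado's actual configuration.

PRIORITY_BY_TAG = {       # lower number = more urgent
    "Payment issue": 1,
    "Damaged item": 2,
    "Food issue": 2,
    "Voucher": 3,
    "Feedback": 4,
    "Thank you": 5,
    "Spam": 9,
}
DEFAULT_PRIORITY = 3      # tags the mapping doesn't know get a middling priority

def email_priority(tags):
    """An email is as urgent as its most urgent tag."""
    if not tags:
        return DEFAULT_PRIORITY
    return min(PRIORITY_BY_TAG.get(tag, DEFAULT_PRIORITY) for tag in tags)

print(email_priority(["Thank you"]))                  # low urgency
print(email_priority(["Feedback", "Payment issue"]))  # the most urgent tag wins
```

Because the mapping lives outside the model, priorities can be reshuffled without retraining, which is exactly the flexibility the two-step split buys.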

First things first

Before we started gathering data, we wanted to ensure that we all understood the domain of the problem correctly. Nothing beats hands-on experience, so we switched off our computers and spent a day in the contact center to understand what work there really looks like. That experience was fun and, in hindsight, very useful; it helped us build relationships with many colleagues we hadn’t interacted with before and visualize their problems in greater detail.

To determine the success of this project, we defined a clear business goal: to minimize the amount of time which urgent emails need to wait in the queue before receiving a response.

At the end of project we wanted to see the following pattern appear on the contact center dashboard.

Graph showing decline in email queues

From a machine learning perspective, this is a classic multilabel text classification problem. In multilabel problems, evaluating solutions quickly usually means computing a single aggregate measure that combines the measures for individual labels. We decided to use the well-known F1 score, apply it to every label, and average the results (an approach known as macro averaging).
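For concreteness, here is the metric computed by hand on made-up toy data; this is the same quantity scikit-learn’s `f1_score(average="macro")` would report:

```python
# Macro-averaged F1 for a multilabel problem, sketched in plain Python:
# compute F1 per label, then take the unweighted mean, so rare labels count
# as much as common ones. The emails and tags below are made-up toy data.

def f1(tp, fp, fn):
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def macro_f1(y_true, y_pred, labels):
    """y_true / y_pred are lists of tag sets, one per email."""
    scores = []
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if label in t and label in p)
        fp = sum(1 for t, p in zip(y_true, y_pred) if label not in t and label in p)
        fn = sum(1 for t, p in zip(y_true, y_pred) if label in t and label not in p)
        scores.append(f1(tp, fp, fn))
    return sum(scores) / len(scores)

labels = ["Feedback", "Refund", "Spam"]
y_true = [{"Feedback"}, {"Refund"}, {"Refund", "Spam"}, set()]
y_pred = [{"Feedback"}, {"Feedback"}, {"Refund"}, set()]
print(macro_f1(y_true, y_pred, labels))  # the missed rare "Spam" label drags the score down
```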

The dataset

Ocado maintains a large dataset of inbound emails that has been manually categorised by our contact centre advisors over the course of several years; this gave us over one million training examples for our multilabel classification. We couldn’t use the data in its raw format, however; some emails contained confidential data like phone numbers, postal or email addresses and customer names. Before we did anything with the data, we had to anonymise it. The process of deleting personal information is a very complex task and could be the topic of a standalone blog post.
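As a flavour of what that involves, a first pass can be sketched with regular expressions. The patterns below are deliberately simplified illustrations, not anything production-grade:

```python
import re

# First-pass PII scrub sketch: replace obvious patterns with placeholder
# tokens before any other processing. These are simplified illustrations of
# rough UK formats; real anonymisation of names and addresses is far harder
# than a handful of regexes.
PATTERNS = [
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"), "<EMAIL>"),
    (re.compile(r"\b0\d{4}\s?\d{6}\b"), "<PHONE>"),                        # rough UK phone
    (re.compile(r"\b[A-Z]{1,2}\d[A-Z\d]?\s?\d[A-Z]{2}\b"), "<POSTCODE>"),  # rough UK postcode
]

def scrub(text):
    for pattern, token in PATTERNS:
        text = pattern.sub(token, text)
    return text

print(scrub("Please call 07700 900123 or email jane@example.com, AL10 9NE."))
```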

Building the machine learning model

Before building any machine learning model, it’s always worth creating a simple heuristic baseline to benchmark against. With our particular problem, we had a set of 19 sparsely distributed tags; if we always choose only the most common label or predict at random, our F1 score will be around 0.05.

We started the modeling phase with a Logistic Regression model on a Bag of Words representation of the data. This simple solution achieved an F1 score close to 0.35 and helped us ensure that all parts of the system worked properly so that we could later focus purely on improving the accuracy of the model. A neural network was an obvious choice to accomplish this. We decided to evaluate two different neural net architectures: the Convolutional Neural Network (CNN) and the Recurrent Neural Network (RNN). We found recurrent architectures such as GRUs and LSTMs harder to train and very close in terms of performance to CNNs (but not better). Although a bit surprising, our findings are probably a reflection of the simplicity of our problem: usually each tag is associated with the presence or absence of some particular phrases, so we don’t especially need to learn long-term dependencies the way LSTMs do.

Below you can find the structure of our neural network which consists of a word embedding layer, two parallel convolutional layers, and a max pooling over the entire text followed by two fully connected layers; for each layer we applied batch normalization. In order to speed up the training we used word2vec embeddings as an initialization to our word embedding layer.

Structure of neural network

The whole architecture is surprisingly shallow. It was trained with a sigmoid cross-entropy loss for around 20 epochs over our dataset and gives a production performance of around 0.8 macro-averaged F1.
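The loss is worth spelling out: each tag gets its own independent sigmoid output, so one email can activate several tags at once. A toy sketch of the standard formula (the same quantity TensorFlow’s sigmoid cross-entropy op computes):

```python
import math

# Sigmoid cross-entropy for multilabel classification: each tag has its own
# sigmoid output, so an email can carry several tags at once. This is the
# standard formula (what tf.nn.sigmoid_cross_entropy_with_logits computes),
# shown on made-up toy numbers.

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_cross_entropy(logits, targets):
    """Mean over tags of -[t*log(p) + (1 - t)*log(1 - p)], with p = sigmoid(logit)."""
    total = 0.0
    for logit, target in zip(logits, targets):
        p = sigmoid(logit)
        total -= target * math.log(p) + (1 - target) * math.log(1 - p)
    return total / len(logits)

# One email, three tags: the model is confidently right about tags 1 and 3
# and unsure about tag 2.
logits  = [4.0, 0.1, -3.0]
targets = [1.0, 1.0, 0.0]  # the email truly carries tags 1 and 2
print(sigmoid_cross_entropy(logits, targets))
```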

You can read more about text classification from the following list of useful papers:

Deploying the model into production

Many recent papers, articles, and blog posts on machine learning focus only on improving the accuracy of a model. It’s worth emphasizing that modeling is only one of many steps in a data science project, and there are other steps that are equally important for the project to be successful.

A model which does not work in production is worth nothing.

From the first day you embark on a data science project, you should think about how you will expose your model – the sooner, the better. There are many reasons why a project can fail during deployment into a production environment.

We found three top reasons why this might happen:

  • Using the wrong technologies
  • Forgetting about software engineering practices
  • The lack of monitoring and support

Using the wrong technologies

To be sure that the incorrect use of technology will not block your deployment, you need to choose your platforms and tools wisely. It’s worth using technology which can be easily moved between environments and modes (i.e. the code remains the same during training, prediction, and serving).

We decided to build our models in TensorFlow and deploy them with Google Cloud Machine Learning. TensorFlow allows you to specify the architecture in a high-level Python API and have those models run on distributed computing systems, including GPUs. Google Cloud Machine Learning provides managed services that enable you to easily expose your ML model as a REST API.

TensorFlow logo

Forgetting about software engineering practices

When you focus on building the best machine learning model, it’s very easy to forget that you write normal code. There is no magic to this: software engineering best practices will help make your code easier to maintain. For a software engineer’s perspective on data science, please have a look at this presentation.

Monitoring and support

At Ocado, we believe that teams work better when they are self-sufficient (as they don’t need to wait for other teams). Thanks to technologies like TensorFlow and Google Cloud Machine Learning, data scientists can also write and support production code. We feel we have ownership of the whole solution, i.e. the data product, machine learning model, dashboards, alerting policies, etc.

Dashboard screenshot

A screenshot from the production dashboard built with Google Data Studio

Reaping the benefits

Thanks to this project, we were able to significantly boost the efficiency of the contact centre. For example, we found that 7% of all inbound messages did not require a reply; this meant that our customer service advisors could spend more time working on higher-priority tasks.

Because the machine learning model automatically categorises emails, we have access to information quicker than ever before and can react much faster to sudden spikes in customer issues.

The project has also had an impact on the overall customer experience: urgent emails are being responded to up to four times faster than before.

Final remarks

We would love to hear your feedback about this article and project. If you have any questions or comments, feel free to drop us a line on social media.

If you enjoyed this article, spread the love around:

Share on Twitter

Share on Facebook

Thank you!

Marcin Druzkowski

Maciej Mnich and other data scientists from Ocado Technology contributed to this article

April 10th, 2017

Posted In: Blog


Contact Centre Agent

Being the world’s largest online-only grocery supermarket with over 500,000 active customers means we get the opportunity to interact with people all across the UK on a daily basis. Ocado prides itself on offering the best customer service in the industry which is one of the many reasons why our customers keep coming back.

Since Ocado doesn’t have physical stores, there are mainly two ways our customers and our employees interact directly. The first (and probably most common) is when our drivers deliver the groceries to the customers’ doorsteps; the second is when customers call or email us using our contact center based in the UK.

Today we’re going to tell you a bit more about how a customer contact center works and how Ocado is making it smarter.

The customer contact center

On the surface, Ocado operates the kind of contact center most people are already familiar with; we provide several ways for our customers to get in touch, including social media, a UK landline number, and a contact email.

Contact Centre

Customers can email, tweet or call Ocado

When it comes to emails, we get quite a variety of messages: from general feedback and redelivery requests to refund claims, payment or website issues – and even new product inquiries.

Getting in touch with a company can sometimes feel cumbersome. To make the whole process nice and easy for our customers, we don’t ask them to fill in any forms or self-categorise their emails. Instead, all messages get delivered into a centralised mailbox no matter what they contain.

Contact Centre

Ocado customer service representatives filtering customer emails

However, a quick analysis of the classes of emails mentioned above reveals that not all of them should be treated with the same priority. In an old-fashioned contact centre, each email would be read and categorised by one of the customer service representatives and then passed on to the relevant department.

This model has a few major flaws: if the business starts scaling up quickly, customer service representatives may find it challenging to keep up, leading to longer delays which will anger customers. In addition, sifting through emails is a very repetitive task that often causes frustration for contact centre workers.

Clearly there must be a better way!

Machine learning to the rescue

Unbeknownst to many, Ocado has a technology division of 1000+ developers, engineers, researchers and scientists working hard to build an optimal technology infrastructure that revolutionises the way people shop online. This division is called Ocado Technology and includes a data science team that constantly finds new ways to apply machine learning and AI techniques to improve the processes related to running retail operations and beyond.

After analysing the latest research on the topic, the data science team discovered that machine learning algorithms can be adapted to help customer centres cope with vast amounts of emails.

The diagram below shows how we created our AI-based software application that helps our customer service team sort through the emails they receive daily.

Cloud computing model

The new AI-enhanced contact centre at Ocado

One of the fields related to machine learning is natural language processing (NLP), a discipline that combines computer science, artificial intelligence, and computational linguistics to create a link between computers and humans. Let’s use an email from a recent customer as an example to understand how we’ve deployed machine learning and NLP in our contact centres:

Example of feedback

The machine learning model identifies that the email contains general feedback and that the customer is happy

The software solution we’ve built parses the body of the email and creates tags that help contact centre workers determine the priority of each email. In our example, there is no immediate need for a representative to get in touch; the customer is satisfied with their order and has written a message thanking Ocado for their service.

We strive to deliver the best shopping experience for all our 500,000+ active customers. However, working in an omni-channel contact centre can be challenging, with the team receiving thousands of contacts each day via telephone, email, webchat, social media and SMS. The new software developed by the Ocado Technology data science team will help the contact centre filter inbound customer contacts faster, enabling a quicker response to our customers which in turn will increase customer satisfaction levels. – Debbie Wilson, contact centre operations manager

In the case of a customer raising an issue about an order, the system detects that a representative needs to reply to the message urgently and therefore assigns the appropriate tag and colour code.

Data science at Ocado, using Google Cloud Platform and TensorFlow

This new ML-enhanced contact centre demonstrates how Ocado is using the latest technologies to make online shopping better for everyone.

Ocado was able to successfully deploy this new product in record time as a result of the close collaboration between three departments: data science, contact centre systems, and quality and development. Working together allowed us to share data and update models quickly, which we could then deploy in a real-world environment. Unlike a scientific demonstration where you’re usually working with a known set of quantities, the contact centre provided a much more dynamic scenario, with new data arriving constantly. – Pawel Domagala, product owner, last mile systems

Our in-house team of data scientists (check out our job openings here) trained the machine learning model on a large set of past emails. During the research phase, the team compared different architectures to find a suitable solution: convolutional neural networks (CNNs), long short-term memory networks (LSTMs) and others. Once the software architecture was chosen, the model was then implemented using the TensorFlow library and the Python programming language.

TensorFlow and Python logos

Python is the most popular programming language in the data science community and provides the syntactic simplicity and expressiveness we were looking for.

TensorFlow is a popular open-source machine learning toolkit that scales from research to production. TensorFlow is built around data flow graphs that can easily be constructed in Python, but the underlying computation is handled in C++ which makes it extremely fast.

We’re thrilled that TensorFlow helped Ocado adapt and extend state-of-the-art machine learning techniques to communicate more responsively with their customers. With a combination of open-source TensorFlow and Google Cloud services, Ocado and other leading companies can develop and deploy advanced machine learning solutions more rapidly than ever before. – Zak Stone, Product Manager for TensorFlow on the Google Brain Team

Understanding natural language is a particularly hard problem for computers. To overcome this obstacle, data scientists need access to large amounts of computational resources and well-defined APIs for natural language processing. Thanks to the Google Cloud Platform, Ocado was able to use the power of cloud computing and train our models in parallel. Furthermore, Ocado has been an early adopter of Google Cloud Machine Learning (now available to all businesses in public beta) as well as the Cloud Natural Language API.

Google Cloud Platform logo

If you want to learn more about the technologies presented above, check out this presentation from Marcin Druzkowski, senior software engineer at Ocado Technology.

Make sure you also have a look at our Ocado Smart Platform for an overview of how Ocado is changing the game for online shopping and beyond.

October 13th, 2016

Posted In: Blog


Paul at the meetup

Last week Ocado Technology had the pleasure of being invited to speak at the Data Science Festival organised at Google’s London headquarters in Soho. I was very lucky to be among the 200+ participants in the audience and would like to share with you a few insights from the Data Science Festival meetup as well as some information about how Ocado Technology uses machine learning to improve customer service and the overall efficiency of our Customer Fulfilment Centres (CFCs).

The meetup began with an introduction from Binesh Lad, head of retail for Google Cloud Platform UK & Ireland. He talked briefly about how Google is rapidly expanding its cloud offering, citing Coca-Cola, Best Buy, CCP Games (makers of EVE Online) and others as examples of customers using the Google Cloud Platform.

Binesh then jokingly played a video that introduced Google’s new, very exciting and definitely real product: the Actual Cloud (an April Fool’s prank that went viral a few months ago).

The second speaker of the evening was Paul Clarke, CTO at Ocado Technology. Paul offered a few quick facts about Ocado and how we have made online grocery shopping a reality over the last decade.

He then gave a few examples of how IoT, robotics and machine learning can be used together to improve the efficiency of warehouse operations and route optimisation for vans. Everyone in the audience was blown away by a sequence of short clips showing robots roaming around our new automation-based CFC in Andover, a real-time visualisation of the CFC in Dordon, and a live map of the vans delivering orders to Ocado customers in the UK.

Slide on screen of the new warehouse grid

Paul then moved on to the second part of his presentation, where he outlined how IoT is an unstoppable force that will usher in the true democratisation of hardware and software. Ocado Technology is already working on several IoT-related projects and is constantly adopting new ways of thinking into its product development cycles based on the innovation happening in the IoT community.

Closing the evening off was Marcin Druzkowski, senior software engineer at Ocado Technology.

Marcin offered his perspective on data science and how Ocado is applying software engineering principles like code versioning, code testing and review, and continuous improvement to machine learning.

Marcin on stage

He also provided some useful tips for TensorFlow developers and outlined tools such as Git, Docker, and Jupyter that his team uses in its data science work. Finally, Marcin offered an example of how Ocado Technology is using data science to analyze customer emails and improve its customer service by using machine learning.

Marcin presenting a slide on TensorFlow

After the event was over, I had the opportunity to chat with some of the people in the audience over beers and (free!) pizza. Many said it was definitely an amazing presentation (a few said it was one of the best data science meetups they’ve attended so far!) and were very excited to learn that Ocado Technology is a pioneer in machine learning and data science.

Alex Voica, Technology Communications Manager

 

September 1st, 2016

Posted In: Blog


Christofer Backlin

Some time ago I needed to schedule a weekly BigQuery job that involved some statistical testing. Normally I do all statistical work in R, but since our query scheduler wasn’t capable of talking to R I decided to try a pure BigQuery solution (rather than go through the hassle of calling R from a dataflow). After all, most statistical testing just involves computing and comparing a few summary statistics to some known distribution, so how hard could it be?

It did in fact turn out to be just as easy as I had hoped for, at least for the binomial test that I needed. The summary statistics were perfectly simple to calculate in SQL and the binomial distribution could be calculated using a user defined function (UDF). The solution is presented and explained below, and at the very end there’s also a section on how to implement other tests.

Binomial testing using a UDF

Let’s recap the maths behind the one-sided binomial test before looking at the code. Given that an event we want to study happened in k out of n independent trials, we want to make an inference about the probability p of observing the event. Under the null hypothesis we assume that p = p0 and under the alternative hypothesis we assume that p < p0. The probability of observing k or fewer events under the null hypothesis, i.e. the p-value, is calculated in the following way:

p-value = P(X ≤ k) = Σ_{i=0}^{k} C(n, i) p0^i (1 − p0)^(n − i)

This can be expressed as the UDF below. It includes a few tricks to deal with the fact that the JavaScript flavour used by BigQuery lacks many common mathematical functions, like fast and accurate distribution functions. The binomial distribution function is calculated by performing the multiplications as additions in logarithmic space to get around overflow and underflow problems. The base change is needed because both the exp and log10 functions were missing.

/*
* Binomial test for BigQuery
*
* Description
*
*    Performs an exact test of a simple null hypothesis that the probability of
*    success in a Bernoulli experiment is `p` with an alternative hypothesis
*    that the probability is less than `p`.
*
* Arguments
*
*    k   Number of successes.
*    n   Number of trials.
*    p   Probability of success under the null hypothesis.
*
* Details
*
*    The calculation is performed as a cumulative sum over the binomial
*    distribution. All calculations are done in logarithmic space since the
*    factors of each term are often very large, causing variable overflow and
*    NaN as a result.
*
* Example
* 
*    SELECT
*      id,
*      pvalue
*    FROM
*      binomial_test(
*        SELECT
*          *
*        FROM
*          (SELECT "test1" AS id,   100 AS total,   10 AS observed,   0.03 AS probability),
*          (SELECT "test2" AS id,  1775 AS total,    4 AS observed, 0.0039 AS probability),
*          (SELECT "test3" AS id, 10000 AS total, 9998 AS observed, 0.9999 AS probability)
*      )
* 
* References
* 
*     https://en.wikipedia.org/wiki/Binomial_distribution
*     https://cloud.google.com/bigquery/user-defined-functions
*
* Author
*
*     Christofer Backlin, https://github.com/backlin
*/
function binomial_test(k, n, p){
  if(k < 0 || k > n || n <= 0 || p < 0 || p > 1) return NaN;
  // i = 0 term
  var logcoef = 0;
  var pvalue = Math.pow(Math.E, n*Math.log(1-p)); // Math.exp is not available
  // i > 0 terms
  for(var i = 1; i <= k; i++) {
    logcoef = logcoef + Math.log(n-i+1) - Math.log(i);
    pvalue = pvalue + Math.pow(Math.E, logcoef + i*Math.log(p) + (n-i)*Math.log(1-p));
  }
  return pvalue;
}

// Function registration
bigquery.defineFunction(
  // Name used to call the function from SQL
  'binomial_test',
  // Input column names
  [
    'id',
    'observed',
    'total',
    'probability'
  ],
  // JSON representation of the output schema
  [
    { name: 'id', type: 'string' },
    { name: 'pvalue', type: 'float' }
  ],
  // Function definition
  function(row, emit) {
    emit({
        id: row.id,
        pvalue: binomial_test(row.observed, row.total, row.probability)
    })
  }
);
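If you want to sanity-check the numerical trick outside BigQuery, the same log-space accumulation can be reproduced in a few lines of Python and compared against a direct evaluation with exact binomial coefficients (math.comb requires Python 3.8+):

```python
import math

# A Python cross-check of the UDF's numerical core: the same cumulative
# binomial sum accumulated in log space (mirroring the JavaScript above),
# compared against a direct evaluation with exact binomial coefficients.

def binomial_test_logspace(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p); multiplications done as log-additions."""
    if k < 0 or k > n or n <= 0 or not 0 < p < 1:  # p strictly inside (0, 1) keeps the logs finite
        return float("nan")
    logcoef = 0.0
    pvalue = math.exp(n * math.log(1 - p))  # the i = 0 term
    for i in range(1, k + 1):
        logcoef += math.log(n - i + 1) - math.log(i)
        pvalue += math.exp(logcoef + i * math.log(p) + (n - i) * math.log(1 - p))
    return pvalue

def binomial_test_direct(k, n, p):
    return sum(math.comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

print(binomial_test_logspace(10, 100, 0.03))
print(binomial_test_direct(10, 100, 0.03))  # agrees to floating-point precision
```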

Demonstration on a toy example

To demonstrate the UDF let’s use it for figuring out something we all have wondered about at some point or another: Which Man v Food challenge was really the hardest? Each challenge of the show was presented together with some rough stats from previous attempts by other contestants. Compiled in table form the data looks something like this (download the complete dataset as a CSV file or SQL statement):

Row City Challenge Attempts Successes
1 San Antonio Four horsemen 100 3
2 Las Vegas B3 burrito 140 2
3 Charleston Spicy tuna handroll 475 8
4 San Francisco Kitchen sink challenge 150 4

Just dividing the number of successes by the number of attempts isn’t a very good strategy, since some challenges have very few attempts. To take the amount of data into consideration we’ll instead rank them by binomial testing p-values (assuming that there is no bias in the performance of the challengers that seek out any particular challenge). Here’s the SQL you need to apply the test above:

SELECT
  Challenge, Attempts, Successes,
  pvalue,
  RANK() OVER (ORDER BY pvalue) Difficulty
FROM binomial_test(
  SELECT
    id,
    total,
    observed,
    sum_observed/sum_total probability
  FROM (
    SELECT
        Challenge id,
        Attempts total,
        Successes observed,
        SUM(Attempts) OVER () sum_total,
        SUM(Successes) OVER () sum_observed
        FROM tmp.man_v_food
        WHERE Attempts > 0
  )
) TestResults
JOIN (
  SELECT
    Challenge,
    Attempts,
    Successes
  FROM tmp.man_v_food
) ChallengeData
ON TestResults.id = ChallengeData.Challenge
;
Row Challenge Attempts Successes pvalue Difficulty
1 Shut-up-juice 4000 64 1.123E-56 1
2 Stuffed pizza 442 2 8.548E-12 2
3 Johnny B Goode 1118 30 1.875E-10 3
4 Spicy tuna handroll 475 8 1.035E-7 4
5 Mac Daddy pancake 297 4 6.094E-6 5

An alternative way to tackle the problem – related and arguably better – is to infer and compare the success probability of each challenge. We can do this by finding the posterior probability distribution of the success probability q using Bayesian inference and extracting the median and a 95% credible interval from it. Using Bayes’ theorem we have that

P(q | n, k) = P(k | n, q) P(q) / P(k | n)

For computational simplicity we’ll choose uniform priors, turning the fraction to the right into a normalisation constant. Thus we arrive at the following expression for calculating any a-quantile qa of the posterior, which is the continuous analogue of the expression for the binomial test defined above:

a = ∫_0^{q_a} q^k (1 − q)^(n − k) dq / ∫_0^1 q^k (1 − q)^(n − k) dq
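With a uniform prior the posterior is a Beta(k + 1, n - k + 1) distribution, so the quantiles can also be sanity-checked outside BigQuery. Here is a dependency-free Python sketch that inverts the posterior CDF by brute-force grid integration; in practice scipy.stats.beta.ppf would do this job directly:

```python
import math

# With a uniform prior, the posterior over the success probability q is a
# Beta(k + 1, n - k + 1) distribution. This sketch finds a posterior
# a-quantile by midpoint-rule integration of the unnormalised density
# q^k * (1 - q)^(n - k), using nothing beyond the standard library.

def posterior_quantile(k, n, a, steps=100_000):
    """Approximate a-quantile of Beta(k + 1, n - k + 1)."""
    def logpdf(q):  # log-density up to an additive constant
        return k * math.log(q) + (n - k) * math.log(1 - q)
    qs = [(i + 0.5) / steps for i in range(steps)]  # grid midpoints in (0, 1)
    peak = max(logpdf(q) for q in qs)               # normalise in log space to avoid underflow
    weights = [math.exp(logpdf(q) - peak) for q in qs]
    total = sum(weights)
    acc = 0.0
    for q, w in zip(qs, weights):
        acc += w
        if acc >= a * total:
            return q
    return qs[-1]

# "Stuffed pizza" row from the table above: 2 successes in 442 attempts
print(posterior_quantile(2, 442, 0.025))  # lower end of the 95% credible interval
print(posterior_quantile(2, 442, 0.5))    # posterior median
print(posterior_quantile(2, 442, 0.975))  # upper end
```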

Implemented as a UDF (code here) we can use the following query to infer the success probability:

SELECT
  Challenge, Attempts, Successes,
  q_lower, q_median, q_upper
FROM (
  SELECT
    id, q_lower, q_median, q_upper
  FROM bayesian_ci(
    SELECT
        Challenge id,
        Attempts total,
        Successes observed
  FROM tmp.man_v_food
  WHERE
    Attempts IS NOT NULL
    AND Attempts > 0
  )
) TestResults
JOIN (
  SELECT
    Challenge,
    Attempts,
    Successes
  FROM tmp.man_v_food
) ChallengeData
ON TestResults.id = ChallengeData.Challenge
ORDER BY q_median
;
Row Challenge Attempts Successes q_lower q_median q_upper
1 Stuffed pizza challenge 442 2 5.529E-4 0.00618 0.0157
2 Mac Daddy pancake 297 4 0.00514 0.0157 0.0341
3 Shut-up-juice 4000 64 0.0115 0.0162 0.0213
4 Spicy tuna handroll 475 8 0.00814 0.0182 0.0330
5 B3 burrito 140 2 0.00399 0.0189 0.0503

Implementation of other tests

The binomial test lends itself particularly well to UDF implementation because the binomial distribution is easy to implement. Similar examples include Fisher’s exact test and the Poisson test. However, many commonly used tests do not fall into this category.

Tests whose null distribution is considerably harder to implement include Student’s t-test, the χ2 test, and Wilson’s test of proportions. For those you are probably better off using Dataflow or Spark, but if you desperately want to you can use BigQuery alone. In that case you need to precalculate the distribution functions for every degree of freedom you might want to use, store them in a separate table, calculate summary statistics and degrees of freedom for each test, join the two, and compare (just like in the good ol’ days of Neyman, Pearson, and Fisher!). If you go down this route you might want to use a theta join.
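As a flavour of that precalculation route, the χ2 distribution function can be generated offline with nothing but the Python standard library (via the usual series for the regularised lower incomplete gamma function) and the rows loaded into a lookup table:

```python
import math

# Sketch of the "precalculate the null distribution" route: generate the
# chi-squared CDF offline and load the rows into a BigQuery lookup table to
# join against per-test summary statistics. The CDF uses the standard series
# for the regularised lower incomplete gamma function, stdlib only.

def chi2_cdf(x, df, max_terms=500):
    """P(X <= x) for X ~ chi-squared(df)."""
    if x <= 0:
        return 0.0
    a, t = df / 2.0, x / 2.0
    # Regularised gamma P(a, t) = t^a e^{-t} / Gamma(a + 1) * sum_{n>=0} t^n / ((a+1)...(a+n))
    term, total = 1.0, 1.0
    for n in range(1, max_terms):
        term *= t / (a + n)
        total += term
        if term < 1e-16 * total:
            break
    return math.exp(a * math.log(t) - t - math.lgamma(a + 1.0)) * total

# Rows for the lookup table: (df, x, cdf), here for df = 1..3 and x up to 10
table = [(df, x / 10.0, chi2_cdf(x / 10.0, df))
         for df in (1, 2, 3) for x in range(1, 101)]
print(len(table))
print(round(chi2_cdf(3.841, 1), 3))  # the familiar 95% point for df = 1
```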

Non-parametric tests, like Wilcoxon’s rank sum and signed rank tests, require yet another approach because they use all the data points to define the distribution. To use them you must aggregate all data points of each test into a single row and pass it to the UDF, because UDFs cannot operate on multiple rows simultaneously (more info). Note that in order to do so you’ll have to use aggregation functions that are only available in standard SQL (ARRAY_AGG and friends), but not in legacy SQL, which is still the default. Also note that standard SQL is still in beta and that UDFs are wrapped in a different way.

Christofer Backlin, Data Scientist

August 3rd, 2016

Posted In: Blog

