People around a computer

These insights on how (and why) to do code reviews well are based on my own experiences.

Code reviews require a lot of different approaches depending on the situation, and it’s impossible to define them all. But while a completely definitive guide is probably impossible, with these recommendations, you should be able to steer your own course.

Required tool: development skills

To fix a hinge in your cupboard you need a screwdriver and a knowledge of how hinges work. You can try to fix it without a screwdriver but the outcome probably won’t be great… Code reviews are about ensuring code quality. If you don’t know how to write good code you simply cannot recognise bad code.

Similarly, like a screwdriver, your development skills can always be improved. Most advanced hinge fixers also use an electric drill. Unfortunately, improving development skills takes much more time and effort than just buying an electric drill, but you should continue to get better. That’s the only way to become a great code reviewer.

So, while reading the rest of the post, remember than even the best approach to code reviews is not enough if don’t have your basic tool: a knowledge of how to write good code.

The right approach

Agree purpose and time

You need to agree: why you are doing the review; how much time you have.

There are two basic things that code reviews are about: ensuring quality of code, and finding bugs. Although these two things should go together, some edge cases may require a different approach. For example, if the code you’re reviewing was well-tested and you know there aren’t any big bugs, you can concentrate more on its quality.

The second thing I mentioned was time. You could say that a good code review cannot be time-restricted. That’s true, but sometimes you don’t have a choice.

Whatsmore, if you spend too long on code reviews, sooner or later you’ll get tired, resent doing them, and the quality of your code reviews will dramatically decrease. A good team should have a few reviewers who split time for reviews between them.

Take time to understand

Don’t start reading every line before you have grasped the general concept of the code you’re reviewing. Here you can do one of two things, both of which I’ve found useful:

Firstly, you can talk with the person who created the review. They can tell you about the problems encountered and the approaches taken in order to solve them.

Secondly, you can take the completely opposite approach. At the beginning, don’t talk with author at all, and check if you can understand the code just by looking through it quickly. If it is good code you should be able to understand it just by reading class and function names. If you cannot understand it, that means that something is wrong with the code on highest level of abstraction.

Only when you know what the code you’re reviewing is about and how it was split into smaller units, can you start reviewing these units.

Know what makes good code

The header of this section is a bit of a cheat because I believe there is no good measurement of code quality. However, there is one common factor when comes to a definition of good code: good code is code which you can easily understand.

Scan the code

This does not constitute a full code review, but I believe you can say a lot about code quality simply by scanning it. Analysing how long functions and classes are, seeing how much indentation the code has, how many public and private methods a class has, can tell you a lot.

Although it wasn’t a great practice, I’ve sometimes had to do a few code reviews in a very restricted amount of time. Just by scanning through the code I was able to quickly find multiple places to improve it.

Also, scanning code can be good at the end of a review. You will be surprised how many things you can find that way, even though you thought you had finished reviewing.


Not sure if and how some part of the code should be changed? Share your concerns with the author (or other team members). In a lot of situations, analysing the problem together will produce a much better solution than you would have come up with alone.

Stack Exchange image

The wrong approach

Lack of care

Funny as it may sound, I think the most common reason for a poor review is simply lack of care. If the developer sees a code review as a box-ticking exercise, they won’t take any care and, frankly, they’d be more productive checking their Facebook updates.


Somewhat controversially, this also comes up surprisingly often. The reviewer feels as if they’re providing negative feedback, and are too scared or hesitant to point out errors out of respect for the code’s author.

Point scoring

Going into the code review, the mentality (of both reviewer and author) should be one of learning. The reviewer should constructively help the author learn from their mistakes, not treat the process as an opportunity for point-scoring.

Mind Browser image

Reliance on a checklist

There is a tendency to improve code reviews by creating checklists, but these should be used as recommendations rather than standards. They are helpful to emphasise common practices within a company, but if applied too rigidly can hamper a developer’s ability to solve problems the way they think best.

Too much attention on small details

Yes, reviewers should be scrupulous, but remember that small details aren’t the most important aspect of good code. For example, the poor naming of a variable is not very important if the function it’s used in is named well. You should point these things out, but don’t spend so much time on them you get bored of the review and stop focussing on the errors that actually matter.

Over emphasising performance

‘Premature optimisation is the route of all evil’, as the saying goes. Taking every possibility to reduce code execution by a microsecond, and as a result compromising code readability and changeability, is one of the worst things you can do.

Knowing how to write very fast code is a different topic, and in most cases you don’t need extremely fast programs. Your code should perform well, but normally this just means avoiding mistakes that will slow it down.

Providing an educational experience

Remember that code reviews are, first and foremost, about ensuring code quality. By doing this the author should be able to learn from the feedback. It’s also very good if reviewer learns something but that’s mostly a positive side effect.

If you treat the review as a learning process for the reviewer, the code will look like the outcome of a learning process.

Long reviews

If the reviewer needs to analyse a lot of changes through multiple parts of the system, they’re likely to miss something. Also, as the review drags on, their tiredness increases and the quality of the review declines.

Furthermore, a long review greatly increases the number of changes that need to be done afterwards. If the author can provide smaller sections for review, they’ll receive higher quality feedback faster.

Inconsistent reviews

Some people say that a single project should have multiple reviewers in order to get more insights. This is true, but you shouldn’t put dozens of reviewers on one project to get the highest possible number of insights.

Good code is consistent code. Having one sub-optimal approach applied consistently throughout the project is better than having lots of ‘best practice’ approaches that make the code hard to understand and follow.

Good luck with your next code review – I hope you find these hints helpful

Slawomir Dybiec, software engineer

August 10th, 2016

Posted In: Blog

Tags: ,

Christofer Backlin

Some time ago I needed to schedule a weekly BigQuery job that involved some statistical testing. Normally I do all statistical work in R, but since our query scheduler wasn’t capable of talking to R I decided to try a pure BigQuery solution (rather than go through the hassle of calling R from a dataflow). After all, most statistical testing just involves computing and comparing a few summary statistics to some known distribution, so how hard could it be?

It did in fact turn out to be just as easy as I had hoped for, at least for the binomial test that I needed. The summary statistics were perfectly simple to calculate in SQL and the binomial distribution could be calculated using a user defined function (UDF). The solution is presented and explained below, and at the very end there’s also a section on how to implement other tests.

Binomial testing using a UDF

Let’s recap the maths behind the one-sided binomial test before looking at the code. Given that an event we want to study happened in k out of n independent trials, we want to make an inference about the probability p of observing the event. Under the null hypothesis we assume that p = p0 and under the alternative hypothesis we assume that p < p0. The probability of observing k or fewer events under the null hypothesis, i.e. the p-value, is calculated in the following way:

One-sided binomial test

This can be expressed as the UDF below. It includes a few tricks to deal with the fact that the JavaScript flavour used by BigQuery lacks many common mathematical functions, like fast and accurate distribution functions. The binomial distribution function is calculated by performing the multiplications as additions in logarithmic space to get around over and underflow problems. The base change is needed to get around the fact that both the expand log10 functions were missing.

* Binomial test for BigQuery
* Description
*    Performs an exact test of a simple null hypothesis that the probability of
*    success in a Bernoulli experiment is `p` with an alternative hypothesis
*    that the probability is less than `p`.
* Arguments
*    k   Number of successes.
*    n   Number of trials.
*    p   Probability of success under the null hypothesis.
* Details
*    The calculation is performed as a cumulative sum over the binomial
*    distribution. All calculations are done in logarithmic space since the
*    factors of each term are often very large, causing variable overflow and
*    NaN as a result.
* Example
*      id,
*      pvalue
*    FROM
*      binomial_test(
*        SELECT
*          *
*        FROM
*          (SELECT "test1" AS id,   100 AS total,   10 AS observed,    3 AS expected),
*          (SELECT "test2" AS id,  1775 AS total,    4 AS observed,    7 AS expected),
*          (SELECT "test3" AS id, 10000 AS total, 9998 AS observed, 9999 AS expected)
*      )
* References
* Author
*     Christofer Backlin,
function binomial_test(k, n, p){
  if(k < 0 || k > n || n <= 0 || p < 0 || p > 1) return NaN;
  // i = 0 term
  var logcoef = 0;
  var pvalue = Math.pow(Math.E, n*Math.log(1-p)); // Math.exp is not available
  // i > 0 terms
  for(var i = 1; i <= k; i++) {
    logcoef = logcoef + Math.log(n-i+1) - Math.log(i);
    pvalue = pvalue + Math.pow(Math.E, logcoef + i*Math.log(p) + (n-i)*Math.log(1-p));
return pvalue;

// Function registration
  // Name used to call the function from SQL
  // Input column names
  // JSON representation of the output schema
    { name: 'id', type: 'string' },
    { name: 'pvalue', type: 'float' }
  // Function definition
  function(row, emit) {
        pvalue: binomial_test(row.observed,, row.probability)

Demonstration on a toy example

To demonstrate the UDF let’s use it for figuring out something we all have wondered about at some point or another: Which Man v Food challenge was really the hardest? Each challenge of the show was presented together with some rough stats from previous attempts by other contestants. Compiled in table form the data looks something like this (download the complete dataset as a CSV file or SQL statement):

Row City Challenge Attempts Successes
1 San Antonio Four horsemen 100 3
2 Las Vegas B3 burrito 140 2
3 Charleston Spicy tuna handroll 475 8
4 San Francisco Kitchen sink challenge 150 4

Just dividing the number of successes with the number of attempts isn’t a very good strategy since some challenges have very few attempts. To take the amount of data into consideration we’ll instead rank them by binomial testing p-values (assuming that there is no bias in the performance of the challengers that seek out any particular challenge). Here’s the SQL you need to apply the test above:

  Challenge, Attempts, Successes,
  RANK() OVER (ORDER BY pvalue) Difficulty
FROM binomial_test(
    sum_observed/sum_total probability
  FROM (
        Challenge id,
        Attempts total,
        Successes observed,
        SUM(Attempts) OVER () sum_total,
        SUM(Successes) OVER () sum_observed
        FROM tmp.man_v_food
        WHERE Attempts > 0
) TestResults
  FROM tmp.man_v_food
) ChallengeData
ON == ChallengeData.Challenge
Row Challenge Attempts Successes pvalue Difficulty
1 Shut-up-juice 4000 64 1.123E-56 1
2 Stuffed pizza 442 2 8.548E-12 2
3 Johnny B Goode 1118 30 1.875E-10 3
4 Spicy tuna handroll 475 8 1.035E-7 4
5 Mac Daddy pancake 297 4 6.094E-6 5

An alternative way to tackle the problem – that is related and arguably better – is to infer and compare the success probability of each challenge. We can do this by finding the posterior probability distribution of the success probability q using Bayesian inference and extract the median and a 95% credible interval from it. Using Bayes’ theorem we have that

Using Bayesian inference

For computational simplicity we’ll choose uniform priors, turning the fraction to the right into a normalisation constant. Thus we arrive at the following expression for calculating any a-quantile qa of the posterior, which is the continuous analogue of the expression for the binomial test defined above:

Uniform priors.png

Implemented as a UDF (code here) we can use the following query to infer the success probability:

  Challenge, Attempts, Successes,
  q_lower, q_median, q_upper
    id, q_lower, q_median, q_upper
  FROM bayesian_ci(
        Challenge id,
        Attempts total,
        Successes observed
  FROM tmp.man_v_food
    Attempts IS NOT NULL
    AND Attempts > 0
) TestResults
  FROM tmp.man_v_food
) ChallengeData
ON == ChallengeData.Challenge
ORDER BY q_median
Row Challenge Attempts Successes q_lower q_median q_upper
1 Stuffed pizza challenge 442 2 5.529E-4 0.00618 0.0157
2 Mac Daddy pancake 297 4 0.00514 0.0157 0.0341
3 Shut-up-juice 4000 64 0.0115 0.0162 0.0213
4 Spicy tuna handroll 475 8 0.00814 0.0182 0.0330
5 B3 burrito 140 2 0.00399 0.0189 0.0503

Implementation of other tests

The binomial tests lends itself particularly well for UDF implementation because it is easy to implement the binomial distribution. Similar examples include Fisher’s exact test and the Poisson test. However, many commonly used tests do not fall into this category.

Tests whose null distribution is considerably harder to implement include Student’s t-test, the χ2 test, and Wilson’s test of proportions. For those you are probably better off using Dataflow or Spark, but if you desperately want to you can use BigQuery alone. In that case you need to precalculate the distribution functions for every degree of freedom you might want to use, store in a separate table, calculate summary statistics and degrees of freedom for each test, join the two, and compare (just like in the good ol’ days of Neyman, Pearson, and Fisher!). If you go down this route you might want to use a theta join.

Non-parametric test, like Wilcoxon’s rank sum and signed rank tests, require yet another approach because they use all the data points to define the distribution. To use them you must aggregate all data points of each test into a single row and pass it to the UDF. This is because UDFs cannot operate on multiple rows simultaneously (more info). Note that in order to do so you’ll have to use aggregation functions that are only available in stardard SQL (ARRAY_AGG and friends), but not in legacy SQL, which is still the default. Also note that standard SQL is still in beta and that UDFs are wrapped in a different way.

Christofer Backlin, Data Scientist

August 3rd, 2016

Posted In: Blog

Tags: , , , , , , , , ,

Stand-up meeting

In October, my area within Ocado Technology undertook a radical transformation. Over the course of a single day we carried out an area-wide, self-selecting restructuring exercise. The team I now work in consists of six team members. None of us had worked together before. Although most of the other new teams decided to take on a fairly traditional Scrum approach to their process, we decided that such a radical reorganisation would be a good opportunity to introduce some experimental new process ideas and ways of working. We incorporated some of our favourite Agile/Lean development techniques and spent the next three months adapting the process to fit our team. Here’s a taste of some of the things we did.

We chose Kanban.

We decided to use a variation of the Arrow Kanban board, which is itself a variation on the traditional Kanban method. Where Kanban usually has “Work In Progress” (WIP) limits for each stage of the process, we simply have a “Story in Progress” (SIP) limit. While we were a new team (learning the systems and trying to get on top of in-office support) we set our SIP limit to two stories, which provided plenty of work for two pairs of developers. We’re now more familiar with things, so we’ve given ourselves room to take on a third story, but we only do this if there isn’t anything else which can be progressed or parallelised on the other stories.

Kanban board

Our take on Tomas Rybing’s “arrow kanban” board

Kanban is working great for us. We maintain two work queues; a “prioritised” queue and a “random” queue. Our Product Owner (PO) can move the user stories around between these queues, changing priorities whenever she likes. This means that we can be more flexible when it comes to avoiding taking on stories which we know will become blocked, we can respond to changing business needs or deadlines very easily and we can prioritise work which would otherwise cause us to block another team. We think that this is a huge benefit over sprint-based methods of working, where it’s hard to estimate how much work to commit to and – even with a one-week sprint – you are left with very little room to act in an agile way when priorities change.

We decided we didn’t like meetings. Well, some of them.

Most of the team were in agreement that we prefer writing code to holding endless meetings. Spending a few hours every two weeks in sprint planning sessions, estimation sessions and doing backlog grooming didn’t seem like our idea of fun… and the value we thought we would get from those meetings was questionable. Maintaining a small work queue and taking on stories whenever we are ready helps us avoid these lengthy planning meetings.

As a result of Kanban and a lack of meetings, we don’t estimate our stories, either. I’ve found that estimates are almost always wrong, even when only considered in a relative manner, so I’m skeptical that there is any value in estimating stories on a routine basis. Additionally, this removes the pressure upon the team to ensure that all of the stories estimated to be finished during a given iteration are complete by the end of the iteration – something which I’ve noticed often results in teams either speeding up or slowing down towards the end of the sprint or, worse, cutting corners such as leaving out testing in order to mark the story as “done”.

This doesn’t mean we don’t have meetings. We still talk about stories and break stories down into tasks, but we do this when we’re about to start work on the story so that we aren’t as affected by requirements changing between planning and coding. It means that we can dedicate more time to discussing complex stories, rather than bundling the conversation into a meeting with several other breakdowns also required. Very importantly, we don’t require the whole team to be involved in story breakdown – just the team members who have an interest in that story. Other team members are brought up to speed during stand-ups, or when we switch pair-programming duos; something we try to do as often as possible, but perhaps not as often as we might wish to.

Removing unnecessary meetings from our calendar also meant that we have been able to spend more time doing things with more business value, such as engaging more with our users in order to keep a tighter feedback loop.

We solve problems fast.

Our team has two stand-ups a day. Yes, two. Rather than giving individual status updates, we talk through the stories we have on the board (remember – a maximum of three!). We talk about the progress made, what the blockers are and what the next steps will be. Frequent stand-ups mean that blockers can be identified and discussed more quickly, and the process has become so well-oiled that we can finish each stand-up in around five minutes.


Despite ditching sprints, we still have retrospectives. In fact, we hold them weekly. This is probably more often than most of the scrum-based teams host retrospectives. Like our stand-ups, we’ve made these meetings as short as possible; they’re an hour. A different person hosts each week and, after discussing the things that went badly or well in the past week, we try to come out of every retro with some actionable items. If an action is specific, it goes on the board. If the action item is more culture or behaviour based, we make a meme about that aspect of our work and stick it onto our team board to remind us to take the retro discussion on-board. For example, on one occasion people disliked how our “let’s take this meeting offline” codeword, “Mango”, was being used to shut people up rather than genuinely move discussion out of a meeting (see the image).

Discussion points

Our “three things to discuss” meeting ideas canvas fills up

Sometimes, issues spring to mind which need team input but are perhaps too large to fall into the scope of a retrospective. This was most true when the team first had formed, and we wanted to make decisions about our release strategies, our code review policies, etc. To try and address these problems, we set up a section of our team whiteboard called “3 Things”. People could write down things they wanted to discuss in more depth and, once three things had been written on the board, we booked in an hour-long meeting to talk about those things. Now that we’ve spoken about many topics, we’ve started to notice that decisions can be made more quickly. We’re currently trialling discussing some of the smaller topics after our standups.

We measure ourselves. A lot.

Cumulative Flow Diagram

Our cumulative flow diagram (CFD) shows where stories get held up

Whenever our stories change state, we log it. From this data, we can track how much time stories spend in each state and we can measure our cycle time. I think we all agree that what we really want to be doing is writing code, rather than spending our time manually testing or working through a painful deployment process. By considering how much time we spend in each development phase, we can identify which types of stories get stuck in undesirable phases and work out how we can avoid this in future – usually by adding some more automation. We tend to look at the graphs of our data in our retrospectives so that we can look for problematic stories and follow our progress in improving the metrics we’re tracking. We find that our Cumulative Flow Diagram (CFD) is particularly useful at identifying bottlenecks where stories regularly get held-up.


Histograms captured during a reto. Green blocks are the current week, orange outlines are the result from the previous week

Another thing we measure in our retrospectives is our Productivity, Collaboration and Happiness. “Our” is loosely defined here – individuals tend to factor in their perceptions of how well the team is doing, and also their own view. We cast a vote (between one and five) anonymously on post-it notes for each aspect and then draw the results up on the board. Once the results are up, we compare them with the results from the previous week and discuss the reasons for any variations. People seem to enjoy this activity. Despite the voting being anonymous, I’ve observed that people who give one of the three aspects a low score are almost always happy to identify themselves to the group and give specific reasons as to why they voted this way. I like to think that this is a sign of excellent trust within the team.

We don’t tell people how to manage their workload.

Despite the fact that I think our process is awesome (awesome enough to write a blog post), we’re hesitant to tell other people how to work. A team located near us has decided to try the switch from Scrum to Kanban. I’m delighted. Although they are doing things very differently from us, we’ve tried to give them some advice along the way.

It’s important to note that there’s no silver bullet when it comes to Agile or Lean software development. Many people think that “Agile” means “Scrum”, and that “Scrum” is a rigid process, but this couldn’t be further from the truth. What I have realised is that if you don’t enjoy your process then you probably won’t enjoy following that process. Don’t do something because I’ve said it works for my team. Don’t do something because it’s what all of the rest of the teams in your company do. Break the mould, and choose a process which works well for you.

I hope that this post gives you some confidence to pick and choose the parts of each Agile tool that your team finds the most useful, and ditch the parts that don’t provide enough value to your team.

Lawrence Weetman, Software Engineer

February 17th, 2016

Posted In: Blog

Tags: , , , , , , ,

Scroll Up