Mysterious branch misprediction

Mysterious branch misprediction

Branch predictor is one of the crucial parts of a cpu that maximizes performance of pipelines. In software layer, we didn’t have any direct control on BP but is this means that we can ignore it when we are writing programs? The short answer is No, but why? In this post I will try to answer this question with some examples on Go.

What is a branch predictor and why we need it?

As wikipedia says, branch predictor is a digital circuit that tries to guess which way a branch (e.g., an if–then–else structure) will go before this is known definitively. But why we need to guess a branch result when we can execute it and get the definite result? In order to answer this question we need to know how a cpu executes given instructions.

In modern CPUs every instruction executed using a technique called pipelining. This technique helps us to achieve instruction-level parallelism on processors. Pipelining attempts to keep every part of the processor busy with some instruction by dividing incoming instructions into a series of sequential steps (i.e pipeline) performed by different processor units with different parts of instructions processed in parallel.

Let’s imagine that you want to do laundry and it’s consist of three steps:

  • Washing
  • Drying
  • Ironing

Each step will take some time to accomplish:

  • Washing -> 30 minutes
  • Drying -> 40 minutes
  • Ironing -> 20 minutes

So doing laundry for one load will take 90 minutes in total. What about multiple loads? If we have 4 loads and we decide to start new load when the previous one done, total time needed for finishing all loads would be 4 * 90 = 360 minutes or 6 hours.

Sequential laundry
Sequential laundry

Hmm, it seems that we can do better. When we are drying one load, washing machine and iron were not used at all and This goes for all the steps. So the better solution would be putting new load on the washing machine when we are drying washed ones and ironing dried ones. This way we are minimizing the overall time need for laundry. Now it will take only 3 hours and 30 minutes.

Pipelined laundry
Pipelined laundry

Pipelining steps varies between different cpu architectures but in this article we will use a simple 5 stage pipeline:

  • Fetch
  • Decode
  • Execute
  • Memory
  • Write Back

Misprediction Chaos

func main() {

    arr := make([]int, 32768)
    sum := 0

    for i := 0; i < len(arr); i++ {
        arr[i] = rand.Int() % 256


    for m := 0; m < 100000; m++ {
        for i := 0; i < len(arr); i++ {
            if arr[i] <= 128 {
                sum += arr[i]