What is branch prediction?

30Jun2024

The Pentium 4 appeared in 2002, and I remember the excitement it caused me. Not only did it make me fall in love with the two colours that remain my favourite combination (the orange and blue combination is still my favourite), but it also brought significant changes to the computing world.

When the Pentium 4 came out and replaced the Pentium III (which, by the way, was the CPU of my first personal computer and the computer I had at the time), it sought to push AMD out of the market and for a large part of its existence, it somehow did. I say somehow because the reality is that AMD's CPUs were much better.

If you look closely at this timeline of CPUs, Intel was the better choice from the Pentium I to Pentium III, then again through the Pentium M until AMD released their Ryzen CPUs; you can see that there have only been a few points where AMD was in the lead, and one of those points is during the days of the Pentium 4.

But why was the Pentium 4 so bad, especially since it was the first CPU to reach 2GHz and 3GHz? Well, it's for those reasons.

The Megahertz Myth

Hertz is the unit of frequency. A computer processes in hertz, which is called the clock frequency. A clock frequency is the number of pulses the CPU sends out in a single second. The more pulses it sends out, the more the computer can do in a second. Modern computers are in the GHz range, meaning billions of hertz. A clock pulse keeps the components (CPU, RAM, GPU, etc.) in sync. A pulse in the computer tells each device to act. If the CPU has a frequency of 400MHz and the RAM has a frequency of 200MHz, that means that for every operation the RAM does, the CPU does two.

The clock frequency is not the most reliable way of measuring performance, as other factors, such as the complexity of the instructions being carried out, also play a role.

In the early 2000s, Intel tried to play a marketing card called the Megahertz Myth. This is where pushing the clock frequency higher helps push the price up and make it look better than the competition. When Intel achieved 3.2GHz and beyond, AMD's most powerful CPU could only reach 2.4GHz. This made Intel look better on paper than AMD and helped sell Intel's CPUs.

But nobody realised that while Intel looked better on paper, AMD's CPUs achieved more operations per clock, meaning that a 1.8GHz AMD CPU would destroy a 2.6GHz Pentium 4 almost every time. Intel had to act on this, and they eventually did with the Pentium M, which was based on the architecture of the Pentium II from almost 8 years prior. This architecture is still part of the basis of today's Intel chips.

Why was the Pentium 4 so bad?

The Pentium 4 struggled with heat and had lower performance per watt than the competition, though, as mentioned, this didn't matter to Intel because they were selling them anyway. This was mainly through OEMs at the time, and custom PC builders didn't exist as much as they do now.

But what caused the Pentium 4 to really suck was not how much power it sucked or how much heat it generated, but it's longer pipeline.

What is a pipeline?

A pipeline is a part of the CPU where instructions go and sit before execution. The longer the pipeline, the more stages it is said to have. Take this program:

let x = 10
let y = x * 2
print y

This program consists of 4 operations:

setting x to 10
calculating x * 2
setting y to the result of x * 2
printing y

In a pipelined system, these four operations are fetched from the memory and put into the CPU instruction pipeline. In a simple pipeline, all stages are executed in order.

In modern pipelines, we can execute multiple stages at the same time. For example, if the CPU sets the variable x to 10 and multiplies x by 2, it could perform both operations simultaneously by replacing the x with 10 in the second and third operations. This is great and makes the CPU perform better than doing each in order.

Branch prediction

Branch prediction is one of the most useful architectural design changes in CPUs. It's simple to explain, too. Branch prediction is all about the CPU trying to identify patterns. For example:

YASS

$arr = [0] * 1000
for ($i = 0 to 1000) {
  $arr[$i] = $i % 2
  if($arr[$i] == 1) {
    print($arr[$i])
  }
}

Now you can see a pattern here, right? The pattern is that every number will be either 0 or 1.

But what about this one?

YASS

$arr = [0] * 1000
for ($i = 0 to 1000) {
  $arr[$i] = random_number(0, 1)
  if($arr[$i] == 1) {
    print($arr[$i])
  }
}

Not so deterministic. Changing to a random number means that the CPU cannot be sure what the random number would be at that exact time. Branch prediction forces the CPU to push the if statement code through the pipeline at the same time as doing the assignment, as it 'predicts' that the outcome of the if statement could be satisfied and the code will be executed. If not, it will have mispredicted (a branch prediction miss), resulting in a performance penalty, as it will have wasted stages in a cycle.

With the first program, it can almost figure out that this program is deterministic, i.e. the result will be the same repeatedly.

Now, this can become an issue with a long pipeline like the one the Pentium 4 had. Imagine the code inside the if statement has hundreds of instructions inside it, and these are pushed to the pipeline along with a long and complex calculation and condition to determine if the if statement is going to be satisfied or not; the CPU might go through these hundreds of instructions in the pipeline whilst it figures out the result of the condition at the same time but then the result of the condition ends up as not satisfying the if statement. Now it's wasted all of that time running through the code in the if statement that will never be executed used anyway. This is branch misprediction. When branch misprediction happens, the branch is flushed, and the CPU has wasted a large chunk of a cycle.

So why is a longer pipeline bad?

A longer pipeline means reading longer branches into the CPU only later to realise they are unnecessary due to branch misprediction. This wastes CPU resources, wastes energy and produces heat. Every incorrect prediction is detrimental, and the longer the pipeline, the more predictions the CPU will incur and waste time (and energy). The Pentium 4 started with 20 stages in the pipeline but maxed at 31 stages, whereas AMD only used 12 or 13 stages, and the Pentium M used in mobile computers by Intel was maxed at 12 stages. Modern CPUs have 14 - 19 stages in their pipelines.

So, while Intel had a higher frequency due to having a longer pipeline (remember, their CPUs could do more in 1 second than AMD), their CPUs were making many branch mispredictions that a large proportion of the clock cycles were wasted anyway.

That sounds fine, though, right? Surely, you can ignore a branch misprediction - especially when you can do so many clock cycles per second, right? Well, ideally, yes, you could. However, when a modern CPU hits a branch misprediction, it instead flushes the pipeline and starts again, after which the pipeline needs to be repopulated. And, with 31 stages needing to be refilled, that takes time.

But that's not just it. With 31 stages, you have 31 different points where it can mispredict the outcome, so what if it mispredicted on stage 30? Now, the whole pipeline would be flushed, and the CPU would need to figure out where it was again and fetch, decode, and execute the correct instructions from scratch, adding additional latency.

All of this means wasted clock cycles that could have been used more productively. This generates heat and slows things down, creating a bottleneck in the system.

So there you have what branch prediction is, why misprediction should be avoided, and why a longer pipeline is rarely a good thing.

Posted in Technology

Comments

There are no comments on this page.

New comment

Comments are welcome and encouraged, including disagreement and critique. However, this is not a space for abuse. Disagreement is welcome; personal attacks, harassment, or hate will be removed instantly. This site reflects personal opinions, not universal truths. If you can’t distinguish between the two, this probably isn’t the place for you. The system temporarily stores IP addresses and browser user agents for the purposes of spam prevention, moderation, and safeguarding. This data is automatically removed after fourteen days. Your email address is stored so that replies can be sent to your email address.

Comments powered by BalfComment

Powered by DASH 2.0

Scan and keep for the latest article or review every time!

https://www.jamiebalfour.scot/redirect/action/latest-article/

News

ZPE 1.14.6

ZPE 1.14.6 is available now. This major update brings performance improvements, major fixes, a System, Windows and ZPE module, built-in SQLite, full Unicode emoji support and many new UI improvements through BalfLaf.