17  Advanced Processor Features [WIP]

The first microprocessors I worked on, mostly the Intel 8086 and Z80s, were marvels of technology. I didn’t understand how they were made and I had pretty vague notions of the basic electronics (I bought a book in the early 80s by the legendary Steve Ciarcia called “Build Your Own Z80 Computer”, but I didn’t make it past the power supply chapter). On the other hand, they made sense at the level necessary to program them. The processor read instructions from memory; those instructions in turn read data from memory, put it in registers, and did things with it. The instructions were executed in the order they were read, and the results of one instruction were always available to the next. It was fairly easy to see the relationship between the logic of your program and the instructions that were executed on the processor.

Not anymore. We’ve talked about some of the complexities already, like the multiple levels of instruction and data caching. But almost all modern processors, including the M2 in my Mac, do some totally crazy stuff. I will discuss some of these features in increasing order of craziness. Only the first of these is something we can easily control in our programs.

17.1 SIMD

A common pattern in programming is to go through data sequentially and apply the same operations to each item. So, say, you have two arrays of numbers and you want to add the elements of one to the elements of the other. Pretty standard looping thing. At some point processor designers realized that they could move some of this right into the processor itself. This isn’t exactly a new idea, for example in my first job I got to work with Cray supercomputers, and one of the tricks they used to make them so fast was what they called “vectorization”, which you can think of as “unrolling” a loop so that one instruction operates on multiple data items. This is a form of what’s called SIMD, which stands for “single instruction, multiple data”.
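That looping pattern is the kind of thing every C programmer has written; a minimal sketch:

```c
#include <stddef.h>

// The classic SIMD-friendly pattern: the same operation applied
// to every pair of elements, with no dependencies between iterations.
void add_arrays(float *a, const float *b, size_t n) {
    for (size_t i = 0; i < n; i++)
        a[i] += b[i];
}
```

Because each iteration is independent of the others, a vectorizing compiler (or a Cray) can do several of them at once.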

The M2 has a special unit called Neon that enables SIMD operations. The instructions look fairly similar to the ones we’ve seen, but there’s a special syntax for the registers that indicates they are Neon registers, and the instructions themselves have a different format. For example, the code to add two vectors of four 32-bit floating-point values looks like this:

// Simple Neon Addition for 4 single-precision floats
.global _main
.align 4

_main:
    // 1. Load data from memory into Neon registers
    // adrp/add gets the address of our data labels
    adrp    x0, vec_a@PAGE
    add     x0, x0, vec_a@PAGEOFF
    adrp    x1, vec_b@PAGE
    add     x1, x1, vec_b@PAGEOFF

    // ld1 {v0.4s} loads four 's' (single-precision/32-bit) floats into v0
    ld1     {v0.4s}, [x0]
    ld1     {v1.4s}, [x1]

    // 2. The SIMD Instruction
    // fadd (Floating-point Add) 
    // .4s tells the engine to treat the registers as 4 separate lanes
    fadd    v2.4s, v0.4s, v1.4s

    // 3. Return 0 from main so we don't fall off the end of the code
    mov     w0, #0
    ret

.data
.align 4
vec_a:  .float 10.0, 20.0, 30.0, 40.0
vec_b:  .float  1.0,  2.0,  3.0,  4.0

The Neon unit has 128-bit registers rather than the normal 64-bit registers. These registers are divided into “lanes” based on the suffix added to the register name. Here for example, the v0, v1, and v2 registers use the suffix .4s, which means that each register is treated as 4 separate lanes of 32 bits each. The fadd instruction then adds the corresponding lanes of v0 and v1 together and puts the result in the corresponding lanes of v2. So, in this case, the first lane of v2 will be 11.0 (10.0 + 1.0), the second lane will be 22.0 (20.0 + 2.0), and so on. The options for the lane suffixes range from 16 byte-sized elements (.16b), through 8 half-words (.8h) and 4 words (.4s), to 2 double-word values (.2d, typically double-precision floats).
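In plain C, what fadd with the .4s arrangement computes is just four independent additions; a scalar sketch (the function name is mine):

```c
// Scalar equivalent of: fadd v2.4s, v0.4s, v1.4s
// Each of the four 32-bit lanes is added independently.
void fadd_4s(const float v0[4], const float v1[4], float v2[4]) {
    for (int lane = 0; lane < 4; lane++)
        v2[lane] = v0[lane] + v1[lane];
}
```

With vec_a and vec_b from the listing above, the lanes of v2 come out as 11.0, 22.0, 33.0, and 44.0; the difference is that the Neon unit does all four additions in a single instruction.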

As written here it looks like the same instructions we’ve seen before, just operating on different registers. But the instructions are really specific to the Neon unit. A disassembly of this code shows this:

    0x100000380 <+16>: ld1.4s { v0 }, [x0]
    0x100000384 <+20>: ld1.4s { v1 }, [x1]
    0x100000388 <+24>: fadd.4s v2, v0, v1

which makes it clearer that these are special instructions that operate on the Neon registers. The Neon unit has a whole set of instructions for doing various operations on vectors, including not just arithmetic but also things like comparisons and shuffling data between lanes. On the other hand, the SIMD instructions still operate within the machinery of the standard processor, with access to the same memory, caches, and so on.

While you can write SIMD instructions like this, and ARM provides intrinsics for doing so in C, for the most part you’ll rely on compilers to do this optimization for you.
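If you do want to hand-write SIMD in C, the intrinsics route looks something like this. The arm_neon.h names (float32x4_t, vld1q_f32, vaddq_f32, vst1q_f32) are the real ARM intrinsics; the scalar fallback is my addition so the sketch also builds on non-ARM machines:

```c
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

// Add four floats at a time: Neon intrinsics on AArch64, scalar elsewhere.
void add4(const float *a, const float *b, float *out) {
#if defined(__ARM_NEON)
    float32x4_t va = vld1q_f32(a);         // like: ld1 {v0.4s}, [x0]
    float32x4_t vb = vld1q_f32(b);
    vst1q_f32(out, vaddq_f32(va, vb));     // like: fadd v2.4s, v0.4s, v1.4s
#else
    for (int i = 0; i < 4; i++)
        out[i] = a[i] + b[i];
#endif
}
```

On Apple silicon with clang -O2, the Neon branch compiles down to essentially the ld1/fadd sequence above. More often, though, clang will auto-vectorize the naive loop for you at -O2 or -O3 without any intrinsics.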

17.2 Pipelining and Out-of-order execution

Pipelining is sort of orthogonal to SIMD. Instead of one instruction operating on a bunch of data, pipelining is the idea of having multiple instructions in different stages of execution at the same time. For example, while one instruction is being decoded, another instruction can be fetched from memory, and yet another instruction can be executed. If this hurts your brain, you are not alone. You can pretty easily imagine the complexity this adds, since one instruction might depend on the result of another. The processor might have to skip over an instruction that needs to wait for a result and execute a later instruction that doesn’t have to wait. This is called out-of-order execution, and it is a common feature of modern processors. The M2 is reported to have a pipeline of roughly a dozen stages, which means there can be a dozen or so instructions in various stages of execution at the same time.

There aren’t any special instructions for pipelining; it’s just something the processor does. However, there are ways to write code that is friendlier to pipelining. At the assembly language level, you can sometimes align instructions using the .align directive, or use the nop instruction to insert “no operation” instructions that help fill pipeline bubbles.
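At the C level, the main pipeline-friendliness trick is breaking up dependency chains. A hypothetical example (the function names are mine, not a standard API): summing an array with a single accumulator serializes every add, while two accumulators give the pipeline independent work to overlap:

```c
#include <stddef.h>

// One chain: each add must wait for the previous one to finish.
double sum_one_chain(const double *a, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

// Two chains: the adds into s0 and s1 are independent,
// so the pipeline can make progress on both at once.
double sum_two_chains(const double *a, size_t n) {
    double s0 = 0.0, s1 = 0.0;
    size_t i = 0;
    for (; i + 1 < n; i += 2) {
        s0 += a[i];
        s1 += a[i + 1];
    }
    if (i < n)
        s0 += a[i];   // leftover element when n is odd
    return s0 + s1;
}
```

Both functions compute the same sum; the second just expresses it in a way that leaves fewer bubbles in the pipeline. (Note that with floating point the compiler can’t make this transformation for you without flags like -ffast-math, because it reorders the additions.)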

17.3 Speculative prefetching and execution

There are two common things that defeat pipelining: cache misses and branching. If the processor has to go get data from main memory, the pipeline stalls while it waits. With branching, pipelining is tricky because the processor doesn’t know which instructions to fetch next until the condition is evaluated. Since pipelining is so important for performance, the M2 and most modern processors have a clever trick for handling these problems: they guess.

Of course they don’t call it guessing. There are two ways the processor predicts what might happen next. To reduce cache misses it can use speculative prefetching: it tries to predict which data will be needed next and fetches it into the cache before the program actually requests it. To deal with branching it uses branch prediction along with speculative execution: it predicts which way a branch will go and executes instructions along that path before it knows whether that path will actually be taken. If the prediction is correct, the results of those instructions can be used immediately; if it is wrong, the results are discarded and the correct path is executed instead.

Sometimes you can write code in certain ways to make it more friendly to speculative execution. For example, instead of writing:

#include <stdint.h>

int32_t filter_sum_branchy(int32_t* data, int32_t length, int32_t threshold) {
    int32_t sum = 0;
    for (int i = 0; i < length; i++) {
        if (data[i] > threshold) {
            sum += data[i];
        }
    }
    return sum;
}

You could write:

#include <stdint.h>

int32_t filter_sum_branchless(int32_t* data, int32_t length, int32_t threshold) {
    int32_t sum = 0;
    for (int i = 0; i < length; i++) {
        // (data[i] > threshold) evaluates to 1 (true) or 0 (false)
        // This is a "data dependency" instead of a "control dependency"
        sum += data[i] * (data[i] > threshold);
    }
    return sum;
}

The first version might work OK if the data has a pattern that the branch predictor can learn, so that it usually predicts the right branch. But if the data is random, then the branch predictor will be wrong a lot and the performance will suffer. The second version doesn’t have any branches, so it will perform more consistently regardless of the data pattern. Of course, the second version might be less readable to some people, so it’s a trade-off between readability and performance.

Note that some instructions are specifically designed to avoid branches altogether. For example, the ARM architecture has a set of “conditional select” instructions (csel and friends) that choose between two values based on a condition without branching. The second version of the C function above would probably compile down to something that uses conditional select or a similar branch-free sequence.
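A simple case where clang on AArch64 typically emits a conditional select rather than a branch is the ternary operator; a sketch:

```c
#include <stdint.h>

// On AArch64, clang typically compiles this to "cmp; csel"
// rather than a compare-and-branch sequence.
int32_t max32(int32_t a, int32_t b) {
    return a > b ? a : b;
}
```

Whether you get csel or a branch is the compiler’s call, but for small expressions like this the branch-free form is the usual outcome at -O1 and above.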

Also, note that some compilers let you give hints about branch prediction. With clang (and gcc) you can use the __builtin_expect() builtin to indicate that a condition is likely to be true or false, which helps the compiler lay out the code so the likely path is the fast one. In practice this builtin is often wrapped in macros named likely() and unlikely() for readability. These hints aren’t used that frequently, since in most cases the hardware and compiler are better than the programmer at predicting branches, but you’ll see them in some performance-critical code where one outcome is much more likely than the other.
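The usual wrapping looks like this; the parse_digit function is just a made-up example of a hot path with a rare error case:

```c
// Conventional wrappers around the clang/gcc builtin.
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

// A hot path where the error case is rare: the hint tells the
// compiler to lay out the common path as straight-line code.
int parse_digit(int c) {
    if (unlikely(c < '0' || c > '9'))
        return -1;           // rare error path
    return c - '0';
}
```

The !!(x) trick normalizes any truthy value to exactly 1, since __builtin_expect compares against the literal expected value.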