5 Instructions
When discussing registers, I mentioned that the actual machine instructions corresponding to each line of assembly language are all 32 bits long. This can be hard to grasp at first, because some lines just look way more complicated and it seems like they should take more space. But no, an instruction like
MOVK W0, #0x808, LSL #16
takes the same number of bits as
B loop
To manage this the ARM architecture defines a format (various formats, actually) to fit everything necessary into 32 bits. This includes the operation itself (aka the operation code, or opcode¹), and operands like memory addresses, literal values, and registers. Taken together, all of the different variations of those 32 bits constitute the instruction set.
Let’s take a look at some of the instructions from the programs we’ve already seen. In the registers episode we saw that the instruction for MOVZ W0, #0x8405 is the number
01010010100100001000000010100000 (0x529080a0)
The 32 bits for the instruction in this case need to encode (1) the opcode, (2) a register, and (3) the “immediate” value 0x8405.
The basic format of a MOV instruction is like this:
| 31 | 30_29 | 28________23 | 22_21 | 20_____________________________5 | 4_______0 |
| sf | opc | 1 0 0 1 0 1 | hw | imm16 | Rd |
| 0 | 1 0 | 1 0 0 1 0 1 | 00 | 1 0 0 0 0 1 0 0 0 0 0 0 0 1 0 1 | 0 0 0 0 0 |
The sf flag indicates either a 32- or 64-bit operation. Here it’s 0 because we’re using the 32-bit variant of the register; if we’d used the register X0 it would be 1. The next 8 bits are basically the opcode for this instruction (they change slightly for different variants of MOV). The hw field stands for “halfword”. A halfword is 16 bits, or half of a 32-bit word, and this field says how many halfwords to shift the immediate value left — that’s how an LSL #16 in the assembly gets encoded. Here it’s 00 because there’s no shift. The key parts are the imm16 bits, which are the value we’re putting into the register, and the Rd bits, which indicate the destination register. The register value here is 00000 since we’re using the W0 register. If it were 00001 it would indicate the W1 register.
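If you want to poke at these encodings yourself, the field boundaries from the table above are easy to check with a few shifts and masks. Here’s a small Python sketch (the helper is just for illustration; the field names follow the table):

```python
def decode_movz(insn):
    """Pull the MOVZ (wide immediate) fields out of a 32-bit instruction."""
    sf    = (insn >> 31) & 0x1      # 0 = 32-bit (Wn), 1 = 64-bit (Xn)
    opc   = (insn >> 29) & 0x3      # 10 (binary) for MOVZ
    fixed = (insn >> 23) & 0x3f     # always 100101 for this family
    hw    = (insn >> 21) & 0x3      # shift amount in halfwords
    imm16 = (insn >> 5)  & 0xffff   # the immediate value
    rd    = insn & 0x1f             # destination register number
    return sf, opc, fixed, hw, imm16, rd

# MOVZ W0, #0x8405 from the text:
print(decode_movz(0x529080A0))  # → (0, 2, 37, 0, 33797, 0)
```

The last two values are 0x8405 (33797) and register 0, exactly as in the table.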
Let’s look at another one. The multiply instruction in our register program (MUL W4, W3, W0) has the hex value 0x1b007c64, or 00011011000000000111110001100100 as 32-bit binary. The MUL instruction is actually an alias for a version of the MADD instruction (multiply and add), which does “multiply the values in two registers then add the value of another register”. The bits for this instruction break down thusly:
| 31 | 30_29 | 28_______24 | 23_21 | 20_______16 | 15 | 14______10 | 9______5 | 4______0 |
| sf | op54 | 1 1 0 1 1 | op31 | Rm | o0 | Ra | Rn | Rd |
| 0 | 0 0 | 1 1 0 1 1 | 0 0 0 | 0 0 0 0 0 | 0 | 1 1 1 1 1 | 0 0 0 1 1 | 0 0 1 0 0 |
You can see that the two register values we’re multiplying are in Rm (W0 = 00000) and Rn (W3 = 00011), and the result goes into Rd (W4 = 00100). The value to add would come from the Ra register, but here that field is 11111, which names the zero register (WZR) — zero gets added, so the processor effectively just does the multiply.
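Following the same pattern, the four register fields of the MADD encoding can be picked out with shifts and masks. A quick sketch (a hypothetical helper, with field positions taken from the table above):

```python
def decode_madd_regs(insn):
    """Extract the four register fields from a MADD/MUL encoding."""
    rm = (insn >> 16) & 0x1f  # multiplier
    ra = (insn >> 10) & 0x1f  # addend (31 = zero register, i.e. a plain MUL)
    rn = (insn >> 5)  & 0x1f  # multiplicand
    rd = insn & 0x1f          # destination
    return rm, ra, rn, rd

# MUL W4, W3, W0 from the text:
print(decode_madd_regs(0x1B007C64))  # → (0, 31, 3, 4)
```

Rm = W0, Ra = 31 (the zero register), Rn = W3, Rd = W4 — matching the bit breakdown above.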
OK, let’s look at some more instructions, this time things that deal with memory addresses. Way back in our first program we saw the instruction ADR X1, bel, which loads the address of the data item labeled bel into register X1. ADR is essentially a variation of an ADD or SUB instruction, where the things being added are the PC (program counter) register and a value that gives the desired memory address relative to the PC.
The hex value of this instruction is 0x100000a1 or 00010000000000000000000010100001 in binary. The instruction format looks like:
| 31 | 30-29 | 28______24 | 23_________________________________5 | 4______0 |
| op | immlo | 1 0 0 0 0 | immhi | Rd |
| 0 | 0 0 | 1 0 0 0 0 | 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 | 0 0 0 0 1 |
This one’s a little strange, but basically we tack the immlo bits onto the end of the immhi bits and we get 10100, or 20, which happens to be the number of bytes ahead of the current program location where the bel memory address resides. So 20 gets added to the current PC and the result is put in register Rd (00001).
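That immhi:immlo gluing is simple to express in code. Here’s a sketch of how the byte offset comes out of an ADR encoding — with sign extension included, since the 21-bit offset can also be negative (the helper name is my own, not an official one):

```python
def adr_offset(insn):
    """Recover the PC-relative byte offset encoded in an ADR instruction."""
    immlo = (insn >> 29) & 0x3      # the 2 low bits, up at positions 30-29
    immhi = (insn >> 5) & 0x7ffff   # the 19 high bits
    imm = (immhi << 2) | immlo      # glue them into a 21-bit value
    if imm & (1 << 20):             # sign-extend: the offset can be negative
        imm -= 1 << 21
    return imm

# ADR X1, bel from the text:
print(adr_offset(0x100000A1))  # → 20
```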
In the chapter on memory we saw a similar instruction: ADRP X1, mya1. The difference here is that the value being loaded into the register is not the exact location of the data, but the page in memory where that item resides. Page? Like books? Well, kinda. I’ll defer a complete discussion for now, but recall that I already mentioned that the memory addresses are virtual. When the virtual memory system loads memory it does so in 4KB chunks called pages. As a result, a specific memory address can be referred to by a page number and an offset within that page.
The ADRP instruction has the same basic format as the ADR instruction, with two exceptions. First, the leftmost bit is 1, and second, the immediate value counts pages rather than bytes: it’s the number of pages between the page holding the current PC and the page where the data item resides. Of course, to get the actual byte address you need a following instruction that adds the offset within the page. Since this is a small program and the data is close by, it sits in the same page as the one the PC points to, so the 21 bits that make up the relative page number (immhi plus immlo) are all 0.
| 31 | 30-29 | 28______24 | 23_________________________________5 | 4______0 |
| op | immlo | 1 0 0 0 0 | immhi | Rd |
| 1 | 0 0 | 1 0 0 0 0 | 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 | 0 0 0 0 1 |
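The same offset extraction works for ADRP; the difference is that the offset counts pages, so it gets shifted left by 12 (one 4KB page) and added to the page-aligned PC. A sketch, where the PC value is just an assumed example rather than one taken from the program:

```python
def adrp_target(insn, pc):
    """Compute the page address an ADRP instruction produces, given a PC."""
    immlo = (insn >> 29) & 0x3
    immhi = (insn >> 5) & 0x7ffff
    imm = (immhi << 2) | immlo      # 21-bit page count, as with ADR
    if imm & (1 << 20):             # sign-extend: pages can be behind the PC
        imm -= 1 << 21
    # Clear the low 12 bits of the PC (its page base), then add whole pages.
    return (pc & ~0xFFF) + (imm << 12)

# The ADRP encoding from the table, with an assumed PC of 0x400004:
print(hex(adrp_target(0x90000001, 0x400004)))  # → 0x400000
```

With the page-number bits all 0, the result is simply the base of the page the PC is already in — which is why the follow-up add-the-offset instruction is needed to reach the data itself.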
Anyway, you get the idea. Every instruction in the aarch64 instruction set is packed into 32 bits, which includes the operation, immediate values, and registers. As we’ll see, these instructions get much more complicated, but they’re still 32 bits. Conceptually the processor fetches these instructions from memory and then executes them in turn (accounting for branching); however, as we’ll see in future episodes, modern processors have various ways to make execution faster, including pipelining, operating on multiple data elements with a single instruction, and branch prediction.
One last note on instructions that connects them to the previous memory episode. As mentioned above, the processor goes through a process of fetching an instruction and then executing it, which is called the fetch-execute cycle. You might wonder what drives the processor to proceed from one instruction to the next. This is primarily the function of what’s called the clock. The clock isn’t really a timekeeper, but rather a signal that oscillates between a logical 1 and a logical 0 value. The rate at which this happens is called the clock speed, and generally a higher clock speed means a faster processor. My computer has a clock speed measured in GHz, while my original PC had a clock speed measured in MHz.
Instructions take some number of clock cycles to complete, though saying precisely how many is complicated by CPU caching and some of the things mentioned above (e.g., pipelining). As we’ll see later on, the clock speed on my computer is also variable to some degree, which is an energy-saving measure since faster clock speeds correlate with greater power consumption. Anyway, it’s part of the lingo: instructions and clock cycles are inextricably linked, and you’ll often hear talk of clock speeds and clock cycles in low-level programming and CPU speed comparisons.
¹ Everyone says opcode. It just sounds cooler.