16 Subroutines [WIP]

The word subroutine is hard to explain because it implies the existence of a routine of which it is a part or to which it is subordinate. Better words might be subprogram or procedure, and in some cases the word function makes more sense. But subroutine is what we have, and it’s still being used by Data on Star Trek in the 24th century, so I guess we need to stick with it. The ARM Developer docs (Arm Limited 2021) define it this way:

A subroutine is a block of code that performs a task based on some arguments and optionally returns a result.

The key thing about a subroutine is that it’s used more than once, meaning that you want to apply the same logic to a different set of data or parameters. Based on what we’ve learned so far about assembly language, you might be able to imagine ways to do this. For example, we know that we can put values in registers, including the program counter. We know that we can branch to a different part of the code. The only real trick then is to figure out how to get back to the place from which we branched to the subroutine.

Suppose you want some code that multiplies two numbers and then adds a third. You could repeat those instructions every time you needed them, but that’s not how we professionals do it (LOL, except when we do). For now, just think of this fairly trivial process as a proxy for something way more complicated that would be a bummer to repeat over and over again. You could probably do something like this:

.global _main             // Entry point for macOS
.align 4

_main:
  mov x0, #5
  mov x1, #10
  adr x4, r1
  b multiply_and_add
r1:
  mov x0, #2
  mov x1, #3
  adr x4, r2
  b multiply_and_add
r2:

bail:
  mov     x16, #1         // macOS syscall for exit
  svc     #0x80           // Invoke supervisor call

multiply_and_add:
  mul x2, x0, x1
  add x2, x2, #3
  br x4

This works, but it’s ugly. There are fancier ways to set up the return address (i.e., the thing we stick in X4), but since this is a very common thing to do, there are already instructions in the ARM64 instruction set for it. The proper way to write this code would be like this:

.global _main             // Entry point for macOS
.align 4

_main:
  mov x0, #5
  mov x1, #10
  bl multiply_and_add

ma2:
  mov x0, #2
  mov x1, #3
  bl multiply_and_add

bail:
  mov     x16, #1         // macOS syscall for exit
  svc     #0x80           // Invoke supervisor call

multiply_and_add:
  mul x2, x0, x1
  add x2, x2, #3
  ret

The key instructions here are bl and ret. The bl instruction stands for “branch with link”. It does two things: it branches to the specified address, and it saves the return address in the link register (LR, which is X30). The ret instruction simply branches to the address in the link register, which is where we want to return to after the subroutine is done. That saves us from needing to stick in weird labels or try to figure out the return address ourselves.

If we run this with the debugger, we can see that we get the two results we expect (5 * 10 + 3 = 53, and 2 * 3 + 3 = 9).

$ lldb ./subs2
(lldb) target create "./subs2"
Current executable set to '/Documents/Dev/hdmcw_book/examples/subs/subs2' (arm64).
(lldb) break set -n ma2
Breakpoint 1: where = subs2`ma2, address = 0x00000001000002ec
(lldb) r
Process 1386 launched: '/Documents/Dev/hdmcw_book/examples/subs/subs2' (arm64)
Process 1386 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = breakpoint 1.1
    frame #0: 0x00000001000002ec subs2`ma2
subs2`ma2:
->  0x1000002ec <+0>: mov    x0, #0x2 ; =2 
    0x1000002f0 <+4>: mov    x1, #0x3 ; =3 
    0x1000002f4 <+8>: bl     0x100000300    ; multiply_and_add

subs2`bail:
    0x1000002f8 <+0>: mov    x16, #0x1 ; =1 
Target 0: (subs2) stopped.
(lldb) re read -fd x2
      x2 = 53
(lldb) break set -n bail
Breakpoint 2: where = subs2`bail, address = 0x00000001000002f8
(lldb) c
Process 1386 resuming
Process 1386 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = breakpoint 2.1
    frame #0: 0x00000001000002f8 subs2`bail
subs2`bail:
->  0x1000002f8 <+0>: mov    x16, #0x1 ; =1 
    0x1000002fc <+4>: svc    #0x80

subs2`multiply_and_add:
    0x100000300 <+0>: mul    x2, x0, x1
    0x100000304 <+4>: add    x2, x2, #0x3
Target 0: (subs2) stopped.
(lldb) re read -fd x2
      x2 = 9

Obviously multiply_and_add is super basic, but a subroutine could be quite complicated and it’s even possible that it would call its own subroutines, and so forth. A subroutine could even call itself if structured properly (recursion: see recursion). As we’ll see, those types of subroutines require more machinery, but the basic idea is the same.

As in bowling, with subroutines there are rules. One is that you should only use certain registers for passing arguments and returning values, and another is that certain registers must be saved and restored if you use them in the subroutine. The ARM64 calling convention specifies that X0-X7 are used for passing arguments and returning values, and that X19-X28 are callee-saved registers (meaning that if a subroutine uses them, it must save their original values and restore them before returning). The link register (X30) is used for storing the return address. One of the best bugs I ever had was caused by forgetting to restore a register in an assembly language subroutine that was called by a C function. The result was that memory was afterward allocated in the video buffer. Ah, the good ol’ days.
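To make the convention concrete, here is a minimal sketch of multiply_and_add rewritten to follow it: all three inputs arrive in X0-X2 and the result comes back in X0. Using MADD to do the multiply and add in one instruction is my choice for this sketch, not something the earlier examples do. Because the exit syscall takes its status from X0, running this from a shell and then typing echo $? should show 53.

```asm
.global _main             // Entry point for macOS
.align 4

_main:
  mov x0, #5              // first argument
  mov x1, #10             // second argument
  mov x2, #3              // third argument
  bl multiply_and_add     // result comes back in x0

bail:
  mov     x16, #1         // macOS syscall for exit; x0 is the exit status
  svc     #0x80           // Invoke supervisor call

multiply_and_add:
  madd x0, x0, x1, x2     // x0 = (x0 * x1) + x2
  ret
```

Since this version only touches X0-X2, none of which are callee-saved, there is nothing to save or restore.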

16.1 The Stack

Since subroutine calls can be nested and recursive, sometimes relying on registers to hold state is not enough. We need a place to stash things that can grow arbitrarily large. Remember way back in the Chapter on Memory where we talked about the stack? The stack sits at the top of the virtual memory space and it grows downwards. It’s called “the stack” because it’s like a stack of books: the last thing you put on top is the first thing you take off. The operations of putting something on the stack and taking it off are called “pushing” and “popping”. On my first computer with 8086 assembly language there were literal PUSH and POP instructions, but ARM64 uses a different approach. Also note that the processor keeps track of the stack with a special register called the stack pointer (SP). SP is not one of the general-purpose registers X0-X30; in the instruction encoding it shares register number 31 with the zero register (XZR), and which one you get depends on the instruction. The stack pointer points to the top of the stack. Nothing updates it behind our backs: we move it ourselves when we push or pop values, usually as a side effect of the load or store instruction that does the pushing or popping.
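As a rough illustration of what “a different approach” means, here is a sketch of a push and a pop built from ordinary store and load instructions. The value 42 and the choice of X0 are arbitrary. Each push moves SP by a full 16 bytes even though we only store 8, which keeps the stack pointer aligned the way the processor wants. Run from a shell, echo $? should show 42.

```asm
.global _main             // Entry point for macOS
.align 4

_main:
  mov x0, #42
  str x0, [sp, #-16]!     // "push": move SP down 16 bytes, then store x0 there
  mov x0, #0              // clobber x0 so we can tell the pop worked
  ldr x0, [sp], #16       // "pop": load x0 from SP, then move SP back up 16 bytes

bail:
  mov     x16, #1         // macOS syscall for exit; x0 is the exit status
  svc     #0x80           // Invoke supervisor call
```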

Using the stack, we could write our multiply_and_add subroutine like this:

.global _main             // Entry point for macOS
.align 4

_main:
  mov x0, #5
  mov x1, #10
  stp x0, x1, [sp, #-0x10]!
  bl multiply_and_add

bail:
  mov     x16, #1         // macOS syscall for exit
  svc     #0x80           // Invoke supervisor call

multiply_and_add:
  ldp x3, x4, [sp], #0x10
  mul x2, x3, x4
  add x2, x2, #3
  ret

The key bits here are the STP and LDP instructions. These are a bit wonky. You can see that both of them operate on pairs of registers (X0 and X1 in the first case, X3 and X4 in the second case). The STP instruction stands for “store pair”, and it stores the values of the two registers on the stack. The LDP instruction stands for “load pair”, and it loads the values from the stack into the two registers. The [sp, #-0x10]! operand on the STP is called pre-indexed addressing: the processor first moves the stack pointer down by 16 bytes so that it points at the new “top”, and then stores the 16 bytes from the two registers there. The ! is what tells it to write the new address back into SP. The LDP does the reverse with post-indexed addressing: [sp], #0x10 loads the two values from the current top and then moves the stack pointer back up by 16 bytes. What seems like weirdness here comes from the fact that the ARM64 processor wants the stack to be 16-byte aligned (aka “quad-word aligned”). We can also use more conventional instructions to save and retrieve values from the stack (after all, it’s just memory), but you have to be careful to keep the stack pointer aligned.
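Here is a sketch of the “more conventional instructions” version, where the stack pointer is adjusted with plain SUB and ADD and the values are stored and loaded one at a time. The offsets 0 and 8 are just where I chose to put the two values inside the 16 bytes we reserve. The result goes in X0 this time so that it becomes the exit status (echo $? should show 53).

```asm
.global _main             // Entry point for macOS
.align 4

_main:
  mov x0, #5
  mov x1, #10
  sub sp, sp, #16         // reserve 16 bytes; SP stays 16-byte aligned
  str x0, [sp]            // store x0 at the new top of the stack
  str x1, [sp, #8]        // store x1 in the 8 bytes above it

  ldr x3, [sp]            // read the values back into different registers
  ldr x4, [sp, #8]
  add sp, sp, #16         // release the 16 bytes

  mul x0, x3, x4          // x0 = 5 * 10
  add x0, x0, #3          // x0 = 53

bail:
  mov     x16, #1         // macOS syscall for exit; x0 is the exit status
  svc     #0x80           // Invoke supervisor call
```

Note that SP only ever moves in multiples of 16 here, even though each individual store is 8 bytes.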

In practice, we have to be more careful about saving certain registers when creating a subroutine, especially when there are multiple levels of subroutine calls. Recall from above that the BL instruction stores the return address in the link register (LR, which is X30). If we have a subroutine that calls another subroutine, the inner BL overwrites LR, so we need to make sure to save the LR before calling the next subroutine, and then restore it before returning. Similarly, the ARM64 standard uses register X29 as what’s called the frame pointer (FP). The frame pointer is used to keep track of the current stack frame, which is a section of the stack that is used for a particular subroutine call. When we enter a subroutine, we typically save the current frame pointer on the stack and then set the frame pointer to the current stack pointer. This allows us to easily access local variables and function arguments using fixed offsets from the frame pointer. When we exit the subroutine, we restore the previous frame pointer from the stack.
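A minimal sketch of a subroutine that calls another subroutine might look like the following. The outer routine and its job (doubling the result) are made up for illustration. Without the STP/LDP pair, the RET at the end of outer would jump to the address left in X30 by the inner BL, which is inside outer itself, and we would loop forever. Run from a shell, echo $? should show 106.

```asm
.global _main             // Entry point for macOS
.align 4

_main:
  mov x0, #5
  mov x1, #10
  bl outer                // x0 = (5 * 10 + 3) * 2 = 106

bail:
  mov     x16, #1         // macOS syscall for exit; x0 is the exit status
  svc     #0x80           // Invoke supervisor call

// outer is a hypothetical subroutine: it calls multiply_and_add
// and then doubles the result.
outer:
  stp x29, x30, [sp, #-16]!   // save the caller's FP and our return address
  mov x29, sp                 // point FP at the current stack frame
  bl multiply_and_add         // this overwrites x30
  add x0, x0, x0              // double the result
  ldp x29, x30, [sp], #16     // restore FP and the original return address
  ret

multiply_and_add:
  mul x0, x0, x1
  add x0, x0, #3
  ret
```

The inner multiply_and_add calls nothing else, so it can leave X29 and X30 alone.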

In subroutines generated by a C compiler, you will often see something like this at the beginning and end of the subroutine:

->  0x100000474 <+0>:   sub    sp, sp, #0x20
    0x100000478 <+4>:   stp    x29, x30, [sp, #0x10]

...

    0x1000004d4 <+96>:  ldp    x29, x30, [sp, #0x10]
    0x1000004d8 <+100>: add    sp, sp, #0x20
    0x1000004dc <+104>: ret

A couple of things are going on here. First, we are reserving space on the stack for our local variables by subtracting from the stack pointer (SP). Then we are saving the current frame pointer (X29) and link register (X30) on the stack using the STP instruction. At the end of the subroutine, we are restoring the frame pointer and link register using the LDP instruction, and then we are adding back to the stack pointer to clean up the stack before returning.
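To see the same pattern in hand-written code, here is a sketch that reserves 32 bytes, keeps FP and LR in the upper 16, and uses part of the lower 16 for a local variable. The subroutine names and the offset chosen for the local are made up for this example; a compiler would pick its own layout. Run from a shell, echo $? should show 56.

```asm
.global _main             // Entry point for macOS
.align 4

_main:
  mov x0, #7
  bl square_plus_self     // x0 = 7 * 7 + 7 = 56

bail:
  mov     x16, #1         // macOS syscall for exit; x0 is the exit status
  svc     #0x80           // Invoke supervisor call

// Hypothetical subroutine: returns x0 * x0 + x0
square_plus_self:
  sub sp, sp, #0x20           // reserve 32 bytes of stack
  stp x29, x30, [sp, #0x10]   // save FP and LR in the upper 16 bytes
  add x29, sp, #0x10          // FP now points at the saved pair
  str x0, [sp, #0x8]          // local variable: remember the original argument
  bl square                   // x0 = x0 * x0 (overwrites x30)
  ldr x1, [sp, #0x8]          // reload the original argument
  add x0, x0, x1              // x0 = square + original
  ldp x29, x30, [sp, #0x10]   // restore FP and LR
  add sp, sp, #0x20           // release the 32 bytes
  ret

square:
  mul x0, x0, x0
  ret
```

The local has to live on the stack (or in a callee-saved register) because X0 is overwritten by the call to square.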