Skip to content

3.59

Problem

The following code computes the 128-bit product of two 64-bit signed values x and y and stores the result in memory:

1   typedef __int128 int128_t;
2   
3   void store_prod(int128_t *dest, int64_t x, int64_t y) {
4   *dest = x * (int128_t) y;
5   }

Gcc generates the following assembly code implementing the computation:

1   store_prod:
2   movq    %rdx, %rax
3   cqto    
4   movq    %rsi, %rcx
5   sarq    $63, %rcx
6   imulq   %rax, %rcx
7   imulq   %rsi, %rdx
8   addq    %rdx, %rcx
9   mulq    %rsi
10  addq    %rcx, %rdx
11  movq    %rax, (%rdi)
12  movq    %rdx, 8(%rdi)
13  ret 

This code uses three multiplications for the multiprecision arithmetic required to implement 128-bit arithmetic on a 64-bit machine. Describe the algorithm used to compute the product, and annotate the assembly code to show how it realizes your algorithm. Hint: When extending arguments of x and y to 128 bits, they can be rewritten as x = 264 · xh + xl and y = 264 · yh + y<sub<l, where xh, xl, yh, and y<sub<l are 64-bit values. Similarly, the 128-bit product can be written as p = 264 · ph + pl, where ph and pl are 64-bit values. Show how the code computes the values of ph and pl in terms of xh, xl, yh, and y<sub<l.

Solution

While reading the assembly code, it's important to remember that the mulq instruction computes a 128-bit result stored in %rdx:%rax%. That is, the high 64 bits in %rdx and the low 64 bits in %rax.

\[ x = 2^{64} \cdot x_{h} + x_{l},\ y = 2^{64} \cdot y_{h} + y_{l} \]
\[ x \cdot y = (2^{64} \cdot x_{h} + x_{l}) \cdot (2^{64} \cdot y_{h} + y_{l}) \]
\[ x \cdot y = 2^{128} \cdot x_{h} \cdot y_{h} + 2^{64} \cdot (x_{h} \cdot y_{l} + y_{h} \cdot x_{l}) + x_{l} \cdot y_{l} \]

We can ignore the term with \(2^{128}\) because it cannot fit in 128 bits, and we'd end up with the remainder.

Since \(p\) can also be represented similarly in terms of \(p_{h}\) and \(p_{l}\), we get

\[ p = 2^{64} \cdot p_{h} + p_{l} \]
\[ 2^{64} \cdot p_{h} + p_{l} = 2^{64} \cdot (x_{h} \cdot y_{l} + y_{h} \cdot x_{l}) + x_{l} \cdot y_{l} \]

Besides the coefficient of \(2^{64}\) on the right hand side, \(x_{l} \cdot y_{l}\) can also add to the value of \(p_{h}\).

So, we get:

\[ p_{h} = x_{h} \cdot y_{l} + y_{h} \cdot x_{l} + quotient(x_{l} \cdot y_{l},\ 2^{64}) \]

And,

\[ p_{l} = (x_{l} \cdot y_{l}) \bmod 2^{64} \]

This is reflected in our assembly code, we sign extend both \(x\) (to %rdx:%rax with line 2 cqto) and \(y\) (to %rcx:%rsi after line 5 sarq).

We first calculate the value of \(p_{h}\) without considering the factor from \(x_{l} \cdot y_{l}\) (after line 8) in %rcx.

The mulq instruction then sets %rax and %rdx to the low and high 64-bits of \(x_{l} \cdot y_{l}\) respectively.

We then add %rcx to %rdx to get \(p_{h}\) by adding the missing factor. %rax is already \(p_{l}\).