3.59
Problem
The following code computes the 128-bit product of two 64-bit signed values x and y and stores the result in memory:
1 typedef __int128 int128_t;
2
3 void store_prod(int128_t *dest, int64_t x, int64_t y) {
4 *dest = x * (int128_t) y;
5 }
Gcc generates the following assembly code implementing the computation:
1 store_prod:
2 movq %rdx, %rax
3 cqto
4 movq %rsi, %rcx
5 sarq $63, %rcx
6 imulq %rax, %rcx
7 imulq %rsi, %rdx
8 addq %rdx, %rcx
9 mulq %rsi
10 addq %rcx, %rdx
11 movq %rax, (%rdi)
12 movq %rdx, 8(%rdi)
13 ret
This code uses three multiplications for the multiprecision arithmetic required to implement 128-bit arithmetic on a 64-bit machine. Describe the algorithm used to compute the product, and annotate the assembly code to show how it realizes your algorithm. Hint: When extending arguments of x and y to 128 bits, they can be rewritten as x = 264 · xh + xl and y = 264 · yh + y<sub<l, where xh, xl, yh, and y<sub<l are 64-bit values. Similarly, the 128-bit product can be written as p = 264 · ph + pl, where ph and pl are 64-bit values. Show how the code computes the values of ph and pl in terms of xh, xl, yh, and y<sub<l.
Solution
While reading the assembly code, it's important to remember
that the mulq
instruction computes a 128-bit result stored
in %rdx:%rax%
. That is, the high 64 bits in %rdx
and the
low 64 bits in %rax
.
We can ignore the term with \(2^{128}\) because it cannot fit in 128 bits, and we'd end up with the remainder.
Since \(p\) can also be represented similarly in terms of \(p_{h}\) and \(p_{l}\), we get
Besides the coefficient of \(2^{64}\) on the right hand side, \(x_{l} \cdot y_{l}\) can also add to the value of \(p_{h}\).
So, we get:
And,
This is reflected in our assembly code, we sign extend both \(x\) (to
%rdx:%rax
with line 2 cqto
) and \(y\) (to %rcx:%rsi
after
line 5 sarq
).
We first calculate the value of \(p_{h}\) without considering the
factor from \(x_{l} \cdot y_{l}\) (after line 8) in %rcx
.
The mulq
instruction then sets %rax
and %rdx
to the low and
high 64-bits of \(x_{l} \cdot y_{l}\) respectively.
We then add %rcx
to %rdx
to get \(p_{h}\) by adding the missing
factor. %rax
is already \(p_{l}\).