"Too much precision" bug
I am currently working on the Rosalind Franklin rover, part of the Exomars 2022 mission to Mars. I am developing a software layer that integrates the autonomous navigation algorithms (developed by CNES) with the rover hardware (built by Airbus). CNES has been working on image processing algorithms to give some autonomy to robots, and its expertise is being used, among other projects, in Exomars. Airbus is developing the actual rover and has defined the APIs that are exposed so that CNES’s algorithms can run on it.
The real hardware that will fly is a LEON2-FT, a radiation-tolerant Sparc-based CPU. Testing on this hardware is very slow (the software must first be uploaded, and the CPU is clocked at 80 MHz). To speed up development, the software is first tested on a standard x86 computer, to validate the algorithms and iron out most bugs, and is later run on Sparc for validation.
There are a few differences between the two, which require some work on Sparc. Most issues I have encountered so far are due to two things:
- Endianness: x86 is little-endian while Sparc is big-endian. This requires some care when sending input or reading output data.
- Memory alignment: Sparc requires reading an int from an address that is a multiple of 4, and a double from a multiple of 8. x86 does not. Therefore, the same code can work on x86 but raise a trap on Sparc. This typically happens when using __attribute__((packed)) on structs (see the sketch below).
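As an illustration (a hypothetical struct, not code from the project), this is the kind of pattern that works on x86 but can trap on Sparc:

```c
#include <stdio.h>
#include <stddef.h>

/* Hypothetical packed struct: "packed" removes the padding, so the
 * double ends up at offset 1 instead of a multiple of 8. */
struct __attribute__((packed)) sample {
    char   flag;
    double value;
};

int main(void)
{
    struct sample s = { 1, 3.5 };

    /* Fine on x86; on Sparc, reading the misaligned double through a
     * plain pointer can raise an alignment trap. */
    double *p = &s.value;

    printf("offsetof(value) = %zu\n", offsetof(struct sample, value));
    printf("value = %f\n", *p);
    return 0;
}
```

Recent GCC versions even warn about taking the address of a packed member here, which is a good hint that something is off.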
Failing test
One of my tests was returning a different result on x86 vs Sparc: 0xBF FD 67 34 50 FF A7 AE and 0xBF FD 67 34 50 FF A7 AD.
The difference is subtle: the last byte is 0xAE versus 0xAD.
These bytes are to be interpreted as doubles, which gives -1.8376963771829468 and -1.8376963771829466.
Again, the difference is small: it is on the order of 10**-16.
I could have justified this small difference in the test report, told my boss I had fixed another test, and called it a day. But that would have required updating the test runner to check floats up to a given precision, which would have added some complexity. Moreover, a few things bothered me:
- The input data is exactly the same (of course)
- This part of the algorithm contains few operations and no loops, so an error accumulation over hundreds of loops does not look like a good explanation
- The difference is not a whole byte but a single unit in the last place: interpreted as 64-bit patterns, the two values differ by exactly 1. In other words, it is the smallest possible difference between two doubles (the snippet after this list double-checks this). Coincidence? I don’t know yet, but that is pretty weird.
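Here is a small standalone check (not project code) that reinterprets the two byte patterns as doubles and verifies that they are indeed adjacent (link with -lm):

```c
#include <stdio.h>
#include <string.h>
#include <stdint.h>
#include <math.h>

int main(void)
{
    /* The two bit patterns from the failing test. */
    uint64_t bits_a = 0xBFFD673450FFA7AEULL;
    uint64_t bits_b = 0xBFFD673450FFA7ADULL;
    double a, b;

    memcpy(&a, &bits_a, sizeof a);
    memcpy(&b, &bits_b, sizeof b);

    printf("a = %.17g\nb = %.17g\n", a, b);

    /* nextafter(a, 0.0) is the representable double right next to a,
     * in the direction of zero: if it equals b, they are neighbours. */
    printf("adjacent: %s\n", nextafter(a, 0.0) == b ? "yes" : "no");
    return 0;
}
```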
Let’s investigate this together.
Investigation
Locating the bug
When debugging, the first step is to reproduce the bug to be able to study it. In my case, it was pretty easy: I already had a test, and it was completely deterministic: the bug happened 100% of the time.
From the failed test report, I followed the function calls passing this value up to the place where it was computed, sprinkling printfs along the way.
I quickly located the exact line, and checked the inputs and outputs: the error happened when converting a value from degrees to radians.
Preparing a minimal reproducible example
A minimal reproducible example is useful to reproduce and play with the bug with the least amount of code possible. This allows testing several hypotheses with fast feedback loops.
I wrote a small program to reproduce the issue. I suspected that the order of the operations (* then /, versus / then *) could be the root cause, so I included both orderings.
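A minimal sketch along those lines (the helper and the input angle are illustrative, not the values from the failing test) could look like this:

```c
#include <stdio.h>
#include <string.h>
#include <stdint.h>

/* Print a double both as its raw 64-bit pattern (most significant byte
 * first) and as a decimal value, so outputs can be diffed across targets. */
static void dump(const char *label, double d)
{
    uint64_t bits;
    memcpy(&bits, &d, sizeof bits);
    printf("%s: %016llx (%.17g)\n", label, (unsigned long long)bits, d);
}

int main(void)
{
    const double pi  = 3.141592653589793;
    const double deg = -105.3;                 /* illustrative input */

    dump("multiply first", deg * pi / 180.0);  /* (deg * pi) / 180 */
    dump("divide first  ", deg / 180.0 * pi);  /* (deg / 180) * pi */
    return 0;
}
```

Building it for both targets and diffing the two outputs is then enough to see where they diverge.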
Compiled with:
gcc -m32 -O0 main.c -o a.out.gcc
sparc-rtems-gcc -m32 -O0 main.c -o a.out.sparc
The outputs were not identical: the order of the operations does matter, and the issue can be fixed easily.
Counter-intuitively, you get more precision when you first divide and then multiply. The reason is that floats have a fixed precision (a fixed-size mantissa), shared between the integer and fractional parts. So you want to keep the numbers small (divide first) to preserve as much precision as possible in the fractional part.
Now that I had found a difference, I wanted to understand the reason. So I tried to look under the hood.
Suspecting the generated ASM
Since the C code shown previously gives different results, I suspected the lower level: ASM. Maybe the order or types of instructions used could have an impact. So I disassembled and studied the assembly.
Disassembly
The project is compiled in 32-bit mode. To make debugging easier, I added -O0 to disable optimizations. The Sparc version is compiled with the RTEMS (a real-time operating system) toolchain.
To disassemble, I used each toolchain’s disassembler.
In the dump shown below, I have shortened the ASM to keep only the relevant part. If you’re trying to follow along, here are some manuals: x86 and Sparc.
x86
I assume you know a bit of x86 assembly, but just in case, here is a quick ref to refresh your memory:
- fdivrp: FPU division
- fld: FPU load
- fmulp: FPU multiplication
- fstp: FPU store
st and st(1) refer to the FPU register stack (st being the top of the stack).
(x86 disassembly: a short sequence of fld / fmulp / fdivrp / fstp operations on the FPU register stack.)
Sparc
You might have some knowledge of x86 or ARM assembly, but Sparc is less common and has some major differences. Here is a crash course on Sparc assembly to get you up to speed:
- Each call is followed by a nop. This is because of Sparc’s multi-stage instruction decode/execute pipeline: when the call instruction is executed, the next instruction (the delay slot) has already been loaded and will be executed. A nop is used to avoid executing something that should not be.
- Loads from memory take three instructions. This is because instructions have a fixed size of 4 bytes (you’ll notice that x86 has variable-size instructions), so a single instruction cannot embed a full 32-bit value and several are needed: first a sethi (which sets the 22 high bits), followed by an or (which sets the 10 low bits), and then the load itself using the resulting address.
- The register system is a bit complex. You can have a look at this or that if you’re curious. Just note that there are different categories of registers: global (g0 to g7), out (o0 to o7) and floating point (f0 to f31), plus a few others that we won’t need here.
(Sparc disassembly: addresses are built with sethi/or pairs, values go through the floating-point registers, and the division is a call to a software routine.)
I noticed the software division: it is a function call, not a dedicated assembly instruction. This project uses a math library from ESA instead of math.h, but I could not find the function in it. I guess it actually comes from the compiler’s own runtime support code…
I tried to understand whether this software division was the reason for the issue, but could not find any evidence, so I kept searching…
Finding the root cause
Since the beginning of the project, and in particular during the previous days, I had found and fixed several other bugs impacting Sparc. They were mostly related to alignment. I was expecting the issue to be in my code, as usual. But this time it was not. What a surprise.
Extended precision
I don’t really remember how I found it; probably a bit of luck. I saw that Sparc was using software division, so I thought it could be the root cause. I looked at the Makefile and in GCC’s manpage for references to hard/soft float operations. One thing leading to another, I probably ended up seeing a mention of extended precision, which I had already heard of. It corresponds to 80-bit floats.
The page I found is actually a bit funny:
> Known Math Bugs
> Implicit use of extended precision on x86 when using the x87 FPU
So, the root cause is that x86 uses these 80-bit floats instead of 64-bit ones, giving more precision. The Sparc result is “wrong”, but it is the “expected” result nonetheless.
Fortunately, this behavior can be disabled with compiler flags: -mfpmath=sse -msse2.
I tried the option, and it worked! That was the culprit!
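As a side note, the C99 FLT_EVAL_METHOD macro from float.h tells you which mode a given build uses: 2 means intermediate results are kept in long double (x87 extended) precision, while 0 means they are kept in the type’s own precision, which is what -mfpmath=sse -msse2 gives (and what Sparc does). A quick check:

```c
#include <stdio.h>
#include <float.h>

int main(void)
{
    /* 0: operations are evaluated in the type's own precision (SSE, Sparc)
     * 2: operations are evaluated in long double precision (x87)          */
    printf("FLT_EVAL_METHOD = %d\n", (int)FLT_EVAL_METHOD);
    return 0;
}
```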
New x86 assembly
Here is the new x86 assembly with the extended precision disabled.
(x86 SSE disassembly: movsd, mulsd and divsd operating on xmm registers.)
We can see that instead of the f* instructions, the standard movsd, mulsd and divsd are used. Also, the xmm registers are used instead of st. The XMM registers are the ones used by SSE, which does not support 80-bit floats.
Conclusion
This bug was fun to investigate, and it took me less than half a day. I like looking at ASM, but there are few opportunities to do so, because there is (almost) always a higher-level solution that is faster. I learned a few things about x86 and Sparc which could be useful in the future, and I like understanding how computers work under the hood.
The use of extended precision on x86 makes sense, and having more precision is great in most cases. Except here, where it is actually very stupid, because Sparc does not support it and I need exactly the same results on both platforms.
Alternative solutions
Instead of disabling extended precision from the command line, there are a few other options, which I decided were not as good.
I could have kept two datasets, one for x86 and one for Sparc. But that means two sources of truth, thus more complexity.
I could also have limited the precision required when comparing floats, on the Test Runner side (for an angle, a tolerance of 10**-3 is surely enough). I would probably have done that if the issue had not been so easy to find and solve.
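Such a comparison could be as simple as an absolute-tolerance check; here is a sketch, with an illustrative function name and the tolerance suggested above:

```c
#include <math.h>
#include <stdbool.h>

/* Sketch of a tolerant comparison for the test runner: treat two
 * angles as equal if they differ by less than 1e-3. */
static bool angles_match(double expected, double actual)
{
    return fabs(expected - actual) < 1e-3;
}
```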
Finally, I also found a purely code-level solution. Although it would avoid relying on compiler options, it seemed a bit hacky.
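One common code-level workaround on x86 with glibc (shown here purely as an illustration, not necessarily the solution in question) is to reduce the x87 precision control to 53 bits at startup:

```c
#include <fpu_control.h>   /* glibc-specific, x86 only */

/* Ask the x87 FPU to round results to 53-bit (double) mantissas
 * instead of the default 64-bit (extended) ones. */
static void force_double_precision(void)
{
    fpu_control_t cw;
    _FPU_GETCW(cw);
    cw = (cw & ~_FPU_EXTENDED) | _FPU_DOUBLE;
    _FPU_SETCW(cw);
}
```

Note that this only narrows the mantissa: intermediate results still benefit from the extended exponent range, so it is not strictly equivalent to SSE math.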
Debugging method improvements
I think the steps I took to debug were efficient. I started by locating the bug, and I then wrote a minimal version to reproduce it and quickly test hypotheses. There is not much to improve.
When working with embedded systems, the edit-build-run feedback loop is usually very slow and inefficient. In my case, my setup makes it OK: there are not too many manual steps, and the main bottlenecks are flashing the software and executing it, which cannot really be improved.
Final words
To conclude, here is a related XKCD. I feel a bit like LongTimeUser4, relying on a weird behavior that nobody would normally need (less precision, what a weird requirement)…