"Too much precision" bug
I am currently working on the Rosalind Franklin rover, part of the Exomars 2022 mission to Mars. I am developing a software layer that integrates the autonomous navigation algorithms (developed by CNES) with the rover hardware (built by Airbus). CNES has been working on image processing algorithms to give some autonomy to robots, and its expertise is being used, among other projects, in Exomars. Airbus is developing the actual rover and has defined the APIs that are exposed so that CNES’s algorithms can run on it.
The real hardware that will fly is a LEON2-FT, a radiation-tolerant Sparc-based CPU. Testing on this hardware is very slow (the software must first be uploaded, and the CPU is clocked at 80 MHz). To speed up development, the software is first tested on a standard x86 computer, to validate the algorithms and iron out most bugs, and is later run on Sparc for validation.
There are a few differences between the two, which require some work on Sparc. Most issues I have encountered so far are due to two things:
- Endianness: x86 is little-endian while Sparc is big-endian. This requires some care when sending input or reading output data.
- Memory alignment: Sparc requires reading an int from an address that is a multiple of 4, and a double from a multiple of 8. x86 does not. Therefore, the same code can work on x86 but raise a trap on Sparc. This typically happens when using __attribute__((packed)) on structs (see the sketch below).
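As an illustration (a hypothetical struct, not code from the project), this is the kind of pattern that works on x86 but can trap on Sparc:

```c
#include <stdio.h>
#include <stddef.h>

/* Hypothetical packed struct: "packed" removes the padding, so the
 * double ends up at offset 1 instead of a multiple of 8. */
struct __attribute__((packed)) sample {
    char   flag;
    double value;
};

int main(void)
{
    struct sample s = { 1, 3.5 };

    /* Fine on x86; on Sparc, reading the misaligned double through a
     * plain pointer can raise an alignment trap. */
    double *p = &s.value;

    printf("offsetof(value) = %zu\n", offsetof(struct sample, value));
    printf("value = %f\n", *p);
    return 0;
}
```

Recent GCC versions even warn about taking the address of a packed member here, which is a good hint that something is off.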
Failing test
One of my tests was returning a different result on x86 vs Sparc: 0xBF FD 67 34 50 FF A7 AE and 0xBF FD 67 34 50 FF A7 AD.
The difference is subtle: the last byte is 0xAE versus 0xAD.
These bytes are to be interpreted as doubles, which gives -1.8376963771829468 and -1.8376963771829466.
Again, the difference is small: it is on the order of 10**-16.
I could have justified this small difference in the test report, told my boss I had fixed another test, and called it a day. But that would have required updating the test runner to check floats up to a given precision, which would have added some complexity. Moreover, a few things bothered me:
- The input data is exactly the same (of course)
- This part of the algorithm contains few operations and no loops, so an error accumulation over hundreds of loops does not look like a good explanation
- The difference is not a whole byte but a single unit in the last place: interpreted as 64-bit patterns, the two values differ by exactly 1. In other words, it is the smallest possible difference between two doubles (the snippet after this list double-checks this). Coincidence? I don’t know yet, but that is pretty weird.
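Here is a small standalone check (not project code) that reinterprets the two byte patterns as doubles and verifies that they are indeed adjacent (link with -lm):

```c
#include <stdio.h>
#include <string.h>
#include <stdint.h>
#include <math.h>

int main(void)
{
    /* The two bit patterns from the failing test. */
    uint64_t bits_a = 0xBFFD673450FFA7AEULL;
    uint64_t bits_b = 0xBFFD673450FFA7ADULL;
    double a, b;

    memcpy(&a, &bits_a, sizeof a);
    memcpy(&b, &bits_b, sizeof b);

    printf("a = %.17g\nb = %.17g\n", a, b);

    /* nextafter(a, 0.0) is the representable double right next to a,
     * in the direction of zero: if it equals b, they are neighbours. */
    printf("adjacent: %s\n", nextafter(a, 0.0) == b ? "yes" : "no");
    return 0;
}
```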
Let’s investigate this together.
Investigation
Locating the bug
When debugging, the first step is to reproduce the bug to be able to study it. In my case, it was pretty easy: I already had a test, and it was completely deterministic: the bug happened 100% of the time.
From the failed test report, I followed the function calls passing this value up to the place where it was computed, sprinkling printfs along the way.
I quickly located the exact line, and checked the inputs and outputs: the error happened when converting a value from degrees to radians.
Preparing a minimal reproducible example
A minimal reproducible example is useful to reproduce and play with the bug with the least amount of code possible. This allows testing several hypotheses with fast feedback loops.
I wrote a small program to reproduce the issue. I suspected that the order of the operations (* then /, versus / then *) could be the root cause, so I included both orderings.
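A minimal sketch along those lines (the helper and the input angle are illustrative, not the values from the failing test) could look like this:

```c
#include <stdio.h>
#include <string.h>
#include <stdint.h>

/* Print a double both as its raw 64-bit pattern (most significant byte
 * first) and as a decimal value, so outputs can be diffed across targets. */
static void dump(const char *label, double d)
{
    uint64_t bits;
    memcpy(&bits, &d, sizeof bits);
    printf("%s: %016llx (%.17g)\n", label, (unsigned long long)bits, d);
}

int main(void)
{
    const double pi  = 3.141592653589793;
    const double deg = -105.3;                 /* illustrative input */

    dump("multiply first", deg * pi / 180.0);  /* (deg * pi) / 180 */
    dump("divide first  ", deg / 180.0 * pi);  /* (deg / 180) * pi */
    return 0;
}
```

Building it for both targets and diffing the two outputs is then enough to see where they diverge.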
Compiled with:
gcc -m32 -O0 main.c -o a.out.gcc
sparc-rtems-gcc -m32 -O0 main.c -o a.out.sparc
The outputs were not identical: the order of the operations does matter, and the issue can be fixed easily.
Counter-intuitively, you get more precision when you first divide and then multiply. The reason is that floats have a fixed precision (a fixed-size mantissa), shared between the integer and fractional parts. So you want to keep the numbers small (divide first) to preserve as much precision as possible in the fractional part.
Now that I had found a difference, I wanted to understand the reason. So I tried to look under the hood.
Suspecting the generated ASM
Since the C code shown previously gives different results, I suspected the lower level: ASM. Maybe the order or types of instructions used could have an impact. So I disassembled and studied the assembly.
Disassembly
The project is compiled in 32-bit mode. To make debugging easier, I added -O0 to disable optimizations. The Sparc version is compiled with the RTEMS (a real-time operating system) toolchain.
To disassemble, I used each toolchain’s disassembler.
In the dump shown below, I have shortened the ASM to keep only the relevant part. If you’re trying to follow along, here are some manuals: x86 and Sparc.
x86
I assume you know a bit of x86 assembly, but just in case, here is a quick ref to refresh your memory:
- fdivrp: FPU division
- fld: FPU load
- fmulp: FPU multiplication
- fstp: FPU store
st and st(1) refer to the FPU register stack (st being the top of the stack).
(x86 disassembly: a short sequence of fld / fmulp / fdivrp / fstp operations on the FPU register stack.)
Sparc
You might have some knowledge of x86 or ARM assembly, but Sparc is less common and has some major differences. Here is a crash course on Sparc assembly to get you up to speed:
- Each call is followed by a nop. This is because of Sparc’s multi-stage instruction decode/execute pipeline: when the call instruction is executed, the next instruction (the delay slot) has already been loaded and will be executed. A nop is used to avoid executing something that should not be.
- Loads from memory take three instructions. This is because instructions have a fixed size of 4 bytes (you’ll notice that x86 has variable-size instructions), so a single instruction cannot embed a full 32-bit value and several are needed: first a sethi (which sets the 22 high bits), followed by an or (which sets the 10 low bits), and then the load itself using the resulting address.
- The register system is a bit complex. You can have a look at this or that if you’re curious. Just note that there are different categories of registers: global (g0 to g7), out (o0 to o7) and floating point (f0 to f31), plus a few others that we won’t need here.
(Sparc disassembly: addresses are built with sethi/or pairs, values go through the floating-point registers, and the division is a call to a software routine.)
I noticed the software division: it is a function call, not a dedicated assembly instruction. This project uses a math library from ESA instead of math.h, but I could not find the function in it. I guess it actually comes from the compiler’s own runtime support code…
I tried to understand whether this software division was the reason for the issue, but could not find any evidence, so I kept searching…
Finding the root cause
Since the beginning of the project, and in particular during the previous days, I had found and fixed several other bugs impacting Sparc. They were mostly related to alignment. I was expecting the issue to be in my code, as usual. But this time it was not. What a surprise.
Extended precision
I don’t really remember how I found it; probably a bit of luck. I saw that Sparc was using software division, so I thought it could be the root cause. I looked at the Makefile and in GCC’s manpage for references to hard/soft float operations. One thing leading to another, I probably ended up seeing a mention of extended precision, which I had already heard of. It corresponds to 80-bit floats.
The page I found is actually a bit funny:
> Known Math Bugs
> Implicit use of extended precision on x86 when using the x87 FPU
So, the root cause is that x86 uses these 80-bit floats instead of 64-bit ones, giving more precision. The Sparc result is “wrong”, but it is the “expected” result nonetheless.
Fortunately, this behavior can be disabled with compiler flags: -mfpmath=sse -msse2.
I tried the option, and it worked! That was the culprit!
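As a side note, the C99 FLT_EVAL_METHOD macro from float.h tells you which mode a given build uses: 2 means intermediate results are kept in long double (x87 extended) precision, while 0 means they are kept in the type’s own precision, which is what -mfpmath=sse -msse2 gives (and what Sparc does). A quick check:

```c
#include <stdio.h>
#include <float.h>

int main(void)
{
    /* 0: operations are evaluated in the type's own precision (SSE, Sparc)
     * 2: operations are evaluated in long double precision (x87)          */
    printf("FLT_EVAL_METHOD = %d\n", (int)FLT_EVAL_METHOD);
    return 0;
}
```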
New x86 assembly
Here is the new x86 assembly with the extended precision disabled.
(x86 SSE disassembly: movsd, mulsd and divsd operating on xmm registers.)
We can see that instead of the f* instructions, the standard movsd, mulsd and divsd are used. Also, the xmm registers are used instead of st. The XMM registers are the ones used by SSE, which does not support 80-bit floats.
Conclusion
This bug was fun to investigate, and it took me less than half a day. I like looking at ASM, but there are few opportunities to do so, because there is (almost) always a higher-level solution that is faster. I learned a few things about x86 and Sparc which could be useful in the future, and I like understanding how computers work under the hood.
The use of extended precision on x86 makes sense, and having more precision is great in most cases. Except here, where it is actually very stupid, because Sparc does not support it and I need exactly the same results on both platforms.
Alternative solutions
Instead of disabling extended precision from the command line, there are a few other options, which I decided were not as good.
I could have kept two datasets, one for x86 and one for Sparc. But that means two sources of truth, thus more complexity.
I could also have limited the precision required when comparing floats, on the Test Runner side (for an angle, a tolerance of 10**-3 is surely enough). I would probably have done that if the issue had not been so easy to find and solve.
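Such a comparison could be as simple as an absolute-tolerance check; here is a sketch, with an illustrative function name and the tolerance suggested above:

```c
#include <math.h>
#include <stdbool.h>

/* Sketch of a tolerant comparison for the test runner: treat two
 * angles as equal if they differ by less than 1e-3. */
static bool angles_match(double expected, double actual)
{
    return fabs(expected - actual) < 1e-3;
}
```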
Finally, I also found a purely code-level solution. Although it would avoid relying on compiler options, it seemed a bit hacky.
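One common code-level workaround on x86 with glibc (shown here purely as an illustration, not necessarily the solution in question) is to reduce the x87 precision control to 53 bits at startup:

```c
#include <fpu_control.h>   /* glibc-specific, x86 only */

/* Ask the x87 FPU to round results to 53-bit (double) mantissas
 * instead of the default 64-bit (extended) ones. */
static void force_double_precision(void)
{
    fpu_control_t cw;
    _FPU_GETCW(cw);
    cw = (cw & ~_FPU_EXTENDED) | _FPU_DOUBLE;
    _FPU_SETCW(cw);
}
```

Note that this only narrows the mantissa: intermediate results still benefit from the extended exponent range, so it is not strictly equivalent to SSE math.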
Debugging method improvements
I think the steps I took to debug were efficient. I started by locating the bug, and I then wrote a minimal version to reproduce it and quickly test hypotheses. There is not much to improve.
When working with embedded systems, the edit-build-run feedback loop is usually very slow and inefficient. In my case, my setup makes it OK: there are not too many manual steps, and the main bottlenecks are flashing the software and executing it, which cannot really be improved.
Final words
To conclude, here is a related XKCD. I feel a bit like LongTimeUser4, relying on a weird behavior that nobody would normally need (less precision, what a weird requirement)…