
XCC RG-2017.8-linux: generated object code is affected by -g2 level (the default level) + longer source paths #7114

@marc-hb


tl;dr: if you're failing to reproduce a build, try a longer or shorter west topdir source directory. That's the workaround. There seems to be a threshold somewhere around 27 characters.
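To see which side of that threshold a workspace falls on, a small helper can print the path length. This is only a sketch: the function name `topdir_report` and the `TOPDIR_THRESHOLD` variable are made up for illustration, and 27 is the approximate value quoted above, not a verified limit.

```shell
# Approximate threshold from the report ("somewhere around 27 characters").
# Assumption: the exact value and what exactly is measured are not known.
TOPDIR_THRESHOLD=${TOPDIR_THRESHOLD:-27}

# Print the length of a workspace path and which side of the threshold it is on.
topdir_report() {
    path=$1
    len=${#path}
    if [ "$len" -gt "$TOPDIR_THRESHOLD" ]; then
        echo "$len chars: longer than threshold"
    else
        echo "$len chars: at or below threshold"
    fi
}

# Usage, from inside a west workspace:
#   topdir_report "$(west topdir)"
```

If two build hosts land on opposite sides of the threshold, moving one checkout to a shorter or longer directory is the quickest way to test the workaround.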

Summary

A very strange and complex toolchain issue where the -g level and debug symbols affect the object code on some Intel CI build system(s) for a totally unknown reason.

Unexpected variations like this break build reproducibility, which is how this issue was discovered in the first place.

Variations like these could also make it harder to reproduce race conditions and other difficult bugs.

I focused on .text.k_ticks_to_us_floor64 and RG-2017.8-linux below because it just happened to be the first difference I found, but I noticed that the -g level also affects .text.z_swap_irqlock in the same conditions, and there could be unexpected differences in other files.

=> Don't trust RG-2017.8-linux to produce consistent object code

This is not a problem with the Zephyr SDK toolchain: it produces identical object code across Linux and Windows, as routinely verified in CI across all targets: https://github.com/thesofproject/sof/actions/runs/4189511798/jobs/7262044804

I haven't observed any problem yet with RI-2020.5-linux (and MTL), which Zephyr still identifies as GNU 4.2.2. Note there is ongoing work to switch to xt-clang.

Longer story

Reproduced on SOF commit 64fe2ea + west update + sof/scripts/xtensa-build-zephyr.py -p tgl

Reproduced with a couple other SOF commits too.

The code generated for k_ticks_to_us_floor64() normally looks like this:

xt-objdump -s --section=.text.k_ticks_to_us_floor64 ../build-tgl/zephyr/kernel/CMakeFiles/kernel.dir/sched.c.obj

Contents of section .text.k_ticks_to_us_floor64:
 0000 3661000c 05b10000 61000041 00000c07  6a......a..A....
 0010 b08682b0 b6a2a2c8 fe87ba01 1bbb0c0d  ................
 0020 c2afff81 0000e008 00a0d4a2 a0e582a0  ................
 0030 c482b0f4 827cfb0c 0afaddea dd16fd07  .....|..........
 0040 9cea40a2 8250d282 40b2a240 c382dabb  ..@..P..@..@....
 0050 cabbdd07 cd068100 00e00800 3d0b2d0a  ............=.-.
 0060 1df0bd03 ad02dd07 cd068100 00e00800  ................
 0070 b0d482a0 c582a0b4 a2a0a482 dabbcabb  ................
 0080 dd07cd06 810000e0 0800cd06 dd07a911  ................
 0090 b901ad02 bd038100 00e00800 3801a095  ............8...
 00a0 82e811b0 b482a084 a2a02482 ba882a2e  ..........$...*.
 00b0 9a888a33 e7b2151b 331df000 00000000  ...3....3.......   <== padding / alignment
 00c0 ceebb0c0 41f73f0d 0c1a86dc ff1df0    ....A.?........ 

=> Note the five 00 bytes of padding.

@andyross, who answered A LOT of questions during this investigation (thank you!), suspects xt-as adds this padding for cache alignment and performance reasons. This padding is there in most but unfortunately not all cases, which is a bug.

You don't need SOF to observe this code and padding, you can produce it with a pure upstream zephyr workspace:

ZEPHYR_TOOLCHAIN_VARIANT=xcc

XTENSA_TOOLCHAIN_PATH=$XCCLOC/install/tools
TOOLCHAIN_VER=RG-2017.8-linux
XTENSA_SYSTEM=$XCCLOC/install/builds/RG-2017.8-linux/cavs2x_LX6HiFi3_2017_8/config
export ZEPHYR_TOOLCHAIN_VARIANT XTENSA_TOOLCHAIN_PATH TOOLCHAIN_VER XTENSA_SYSTEM

cd zephyr/
git checkout d9c4ec31fc49e7eef3c
west build -p -b intel_adsp_cavs25 samples/hello_world/ -- -DCONFIG_SYS_CLOCK_TICKS_PER_SEC=15000 -DCONFIG_SPEED_OPTIMIZATIONS=y
xt-objdump -s --section=.text.k_ticks_to_us_floor64 zephyr/kernel/CMakeFiles/kernel.dir/sched.c.obj

Make sure you use ZEPHYR_TOOLCHAIN_VARIANT=xcc and the other variables above. If .text.k_ticks_to_us_floor64 is missing then you probably forgot to switch to that toolchain.

I could unfortunately not reproduce this issue with hello_world in any configuration; so far reproduction requires SOF.

On at least two of our automated build systems (sofbld07 and 08, Ubuntu 20.04), the padding disappears in the default configuration. This breaks build reproducibility. There are other differences in sched.c.obj caused by -g (apparently not in other .c.obj files, which don't have 150+ ELF sections)
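A quick way to check whether two builds agree on a section is to compare only the hex rows of their dumps. The sketch below assumes the dumps were saved to text files first; the names `dump_rows` and `same_section` and the file names in the usage comment are hypothetical.

```shell
# Keep only the hex-dump rows of "objdump -s" output (offset, bytes, ASCII),
# dropping the header lines that contain the object file name.
dump_rows() {
    grep -E '^ [0-9a-f]+ ' "$1"
}

# Succeed (exit 0) when both saved dumps contain identical rows.
same_section() {
    rows_a=$(dump_rows "$1")
    rows_b=$(dump_rows "$2")
    [ "$rows_a" = "$rows_b" ]
}

# Usage (file names are placeholders):
#   xt-objdump -s --section=.text.k_ticks_to_us_floor64 sched.c.obj > local.txt
#   ... same command on the other build system, saved as ci.txt ...
#   same_section local.txt ci.txt && echo identical || echo DIFFERENT
```

Filtering out the header matters because the first lines of the dump name the object file, so they can differ between build directories even when the section contents are the same.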

EDIT: this happens because these systems use longer source paths / debug symbols; that is what triggers this compiler bug.

When -g is given without an explicit level, the default debug level is '-g2'. All other things equal, the high debug level set by default by Zephyr makes the usual padding disappear on these particular systems.

When decreasing the debug level in cmake/compiler/gcc/compiler_flags.cmake to -g1 or -g0 or no -g at all and making no other change whatsoever, the usual padding re-appears! In other words, a lower -g debug level makes the object code that runs on the DSP normal again.

--- a/cmake/compiler/gcc/compiler_flags.cmake
+++ b/cmake/compiler/gcc/compiler_flags.cmake
@@ -168,11 +168,11 @@ check_set_compiler_property(APPEND PROPERTY hosted -fno-freestanding)
 check_set_compiler_property(PROPERTY freestanding -ffreestanding)
 
 # Flag to enable debugging
-set_compiler_property(PROPERTY debug -g)
+set_compiler_property(PROPERTY debug -g1 ) 
 
 # GCC 11 by default emits DWARF version 5 which cannot be parsed by

For quick testing use:

cd build-tgl/
touch ../zephyr/kernel/sched.c

ninja -j1 -v

... then copy/paste and edit the line that compiles sched.c
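Editing the copied command by hand is error-prone, so the -g flag can be swapped with a small helper. This is a sketch only: `set_debug_level` is a made-up name, and passing the command unquoted re-splits it on whitespace, so it is not safe for commands containing quoted arguments with spaces.

```shell
# Print the given command line with a bare -g or -gN replaced by -g<level>.
# Other options that merely start with -g (e.g. -gdwarf-4) are left alone.
set_debug_level() {
    level=$1
    shift
    out=
    for word in "$@"; do
        case $word in
            -g|-g[0-9]) word=-g$level ;;
        esac
        out="$out${out:+ }$word"
    done
    printf '%s\n' "$out"
}

# Usage: paste the sched.c compile line printed by "ninja -v" after the level:
#   set_debug_level 1 <compile command copied from ninja>
```

Running the rewritten command once per level and saving each output under a different name gives the files needed for a side-by-side comparison.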

The difference appears during the -S sched.i -> sched.s compilation step.

The compiled sched-g1.s and sched-g2.s have no actual difference in the assembly code; only a lot of .byte lines differ. Somehow these bytes affect the assembly phase and the object code on those systems.

The pre-processed files sched-g1.i and sched-g2.i are strictly identical.
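The claim about the two .s files can be checked by comparing them with the .byte lines filtered out. A minimal sketch, assuming the debug data shows up only as `.byte` directives as described; the helper names are hypothetical.

```shell
# Print an assembly file without its ".byte" directive lines.
asm_without_bytes() {
    grep -v -E '^[[:space:]]*\.byte([[:space:]]|$)' "$1"
}

# Succeed (exit 0) when two assembly files match once .byte lines are ignored.
same_asm() {
    asm_a=$(asm_without_bytes "$1")
    asm_b=$(asm_without_bytes "$2")
    [ "$asm_a" = "$asm_b" ]
}

# Usage:
#   same_asm sched-g1.s sched-g2.s && echo "only .byte lines differ (if any)"
#   cmp sched-g1.i sched-g2.i && echo "pre-processed files identical"
```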

Another proof that the -g level triggers the issue: "hiding" the zephyr source code during the -S sched.i -> sched.s step restores the usual padding. Running the compiler under strace with -e openat at that step shows that it reads hundreds of source files when using -g (even at the -g1 level that preserves the usual padding).
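To list what the compiler actually opened, the strace log can be reduced to the unique paths of successful openat calls. This is a sketch that assumes the usual strace line format (`openat(AT_FDCWD, "path", flags) = fd`); the function name and the log file name are hypothetical.

```shell
# Print the unique paths of successful openat() calls found in an strace log.
opened_files() {
    grep 'openat(' "$1" \
        | grep -v ' = -1 ' \
        | sed -n 's/.*openat([^,]*, "\([^"]*\)".*/\1/p' \
        | sort -u
}

# Usage:
#   strace -f -e trace=openat -o openat.log <the -S compile command>
#   opened_files openat.log | wc -l
```

Comparing the resulting list between a -g build and a build without -g shows which source files are read only for debug information.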

I tried to reproduce on a few other systems, including another Ubuntu 20.04, but could not: the padding is always there no matter what I tried; only a few systems are affected. I'm of course using the very same toolchain.

I could not reproduce on the "guilty" systems with samples/hello_world/ either: with hello_world the padding is always there too. Too few debug symbols in hello_world to trigger this bug?

I unfortunately couldn't find what's unusual about these CI systems. Whatever makes them special, I don't think that, all other things being equal, the generated object code should EVER differ between -g1 and -g2 in any circumstance. It should even less differ on some systems but not on others when using the very same toolchain. So this qualifies as a compiler bug IMHO and makes that toolchain untrustworthy for reproducible builds.

Labels: P3 (Low-impact bugs or features), bug (Something isn't working as expected), won't fix (This will not be worked on atm, e.g. a bug closed for lack of user request, hardware etc)
