[BUG] [APL] FW core dump has empty oops data if FW built with GCC #1346

@kv2019i

Describe the bug
In case DSP hits an exception, it dumps oops data for the driver to read out.

What actually happens is that I get a dump file that is only partially OK. "struct sof_ipc_panic_info" is correctly filled: I can find SOF_IPC_PANIC_MAGIC, and the rest of the struct seems OK. The start of the dump, however, is just zeroes, and there don't seem to be any valid values for "struct sof_ipc_dsp_oops_xtensa". The stack dump seems OK and I can find the correct functions by manually looking up symbols from the stack dump, but the coredumper Python scripts cannot make sense of this dump, as many key fields are just zeroes.
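A quick way to confirm this symptom on a saved dump is to count the zero bytes at the start of the file and look for the panic magic further in. The sketch below is only illustrative: the magic and mask constants are assumptions that should be checked against SOF_IPC_PANIC_MAGIC and SOF_IPC_PANIC_MAGIC_MASK in the firmware tree in use, the helper names are hypothetical, and the dump is assumed to consist of little-endian 32-bit words.

```python
import struct

# Assumed values, not taken from this report: verify them against
# SOF_IPC_PANIC_MAGIC / SOF_IPC_PANIC_MAGIC_MASK in your SOF tree.
ASSUMED_PANIC_MAGIC = 0x0DEAD000
ASSUMED_PANIC_MAGIC_MASK = 0x0FFFF000


def leading_zero_bytes(data: bytes) -> int:
    """Return how many bytes at the very start of the dump are zero."""
    return len(data) - len(data.lstrip(b"\x00"))


def find_panic_magic(data: bytes,
                     magic: int = ASSUMED_PANIC_MAGIC,
                     mask: int = ASSUMED_PANIC_MAGIC_MASK) -> list:
    """Return byte offsets of 32-bit little-endian words matching the magic.

    A word matches when (word & mask) == magic, so the low bits may carry
    a panic code.
    """
    offsets = []
    for off in range(0, len(data) - 3, 4):
        (word,) = struct.unpack_from("<I", data, off)
        if (word & mask) == magic:
            offsets.append(off)
    return offsets
```

Reading oops.bin into a bytes object and calling both helpers gives a rough picture: a large leading-zero count together with a magic hit later in the file would be consistent with the behaviour described above, where the panic info is present but the register section is empty.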

To Reproduce
Cause the FW to hit an oops. I did this by adding the following code to volume.c:volume_copy():

        panic(1234);
        while (1) {}

Expected behavior
I can extract the oops file by doing:

scp root@dut:/sys/kernel/debug/sof/exception oops.bin

and then feed the data to the coredumper tool.

Impact
If a DSP oops cannot be successfully saved, debugging hard-to-reproduce bugs is severely impacted.

Environment

  1. Branch name and commit hash of 3 repositories: sof (firmware), linux (kernel driver) and soft (tools & topology).
    linux sofdev 2e94569
    sof master e14ab70

  2. Name of the topology file
    n/a

  3. Name of the platform(s) on which the bug is observed.
    APL UP2

  4. Reproducibility Rate. If you can only reproduce it randomly, it’s useful to report how many times the bug has been reproduced vs. the number of attempts it’s taken to reproduce the bug.
    100%

Screenshots or console output
Two example dumps attached.

oopses-20190429.zip


Highlights from the comments below:

The problem is definitely in how the dump routine handles WINDOWBASE updates. If ROTW is called even once (as happens with 1 iteration of store_register_loop), the result is an invalid dump. The code looks correct and I fail to see how a single ROTW can have such an impact (there are only a few ops on the core after this
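For readers unfamiliar with Xtensa register windows, the toy model below sketches what ROTW does to the mapping between logical registers a0..a15 and the physical AR file. It is not the SOF dump routine and does not model exceptions or WINDOWSTART. The class name, the dump_all helper and the 64-register configuration are assumptions made for illustration.

```python
class WindowedRegisterFile:
    """Toy model of the Xtensa windowed AR register file."""

    def __init__(self, num_physical: int = 64):
        # Assumption: 64 physical AR registers (some cores use 32).
        self.num_physical = num_physical
        self.num_panes = num_physical // 4  # WINDOWBASE counts 4-register panes
        self.ar = [0] * num_physical
        self.windowbase = 0

    def _phys(self, n: int) -> int:
        """Map logical register a<n> to its physical AR index."""
        return (self.windowbase * 4 + n) % self.num_physical

    def read(self, n: int) -> int:
        return self.ar[self._phys(n)]

    def write(self, n: int, value: int) -> None:
        self.ar[self._phys(n)] = value

    def rotw(self, imm: int) -> None:
        """ROTW adds a signed 4-bit immediate to WINDOWBASE, with wraparound."""
        if not -8 <= imm <= 7:
            raise ValueError("ROTW immediate must be in -8..7")
        self.windowbase = (self.windowbase + imm) % self.num_panes


def dump_all(rf: WindowedRegisterFile) -> list:
    """Walk the whole AR file: save a0..a3, rotate by one pane, repeat."""
    out = []
    for _ in range(rf.num_panes):
        out.extend(rf.read(n) for n in range(4))
        rf.rotw(1)
    return out  # WINDOWBASE is back at its starting value here
```

The point of the model is that after a single `rotw(1)` the name a0 refers to what used to be a4, so any value the dump loop keeps in a0..a3 (a buffer pointer, for instance) has to be carried across the rotation explicitly. Whether that is related to the failure seen here is not established in this thread.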

Got basic gdb working, at least to a degree, within QEMU, and it seems ROTW causes another exception and we end up in the DoubleExceptionVector handler. But that's probably just a symptom; the same code works when compiled with XT-CC. I

I now got an OK exception dump (for another bug) on WHL (cnl image), built with GCC, so at least this is not happening in all cases. The root cause is still unknown.

Fwiw, the ABI differs slightly between GCC and XCC with respect to calling convention and register windows, hence there are some incompatibilities with some of the dump data.

Metadata

Labels:
  P3: Low-impact bugs or features
  bug: Something isn't working as expected
  stale: Issue/PR marked as stale and will be closed after 14 days if there is no activity.
