
Conversation

@jsarha
Contributor

@jsarha jsarha commented Feb 10, 2023

The known issues with the pipeline-free chain DMA implementation have now been solved, so we can re-enable chain DMA usage.

This reverts commit c3807ae.

Signed-off-by: Jyri Sarha jyri.sarha@intel.com

Contributor

@ujfalusi ujfalusi left a comment


@jsarha, @makarukp, thank you, tested and works on my upx-i11 board!

Collaborator

@kv2019i kv2019i left a comment


Let's wait for CI results.

@kv2019i kv2019i requested review from lyakh and makarukp February 10, 2023 16:13
@kv2019i
Collaborator

kv2019i commented Feb 10, 2023

Something fishy:
https://sof-ci.01.org/sofpr/PR7073/build3851/devicetest/index.html?model=TGLU_UP_HDA_IPC4ZPH&testcase=multiple-pause-resume-5

[  577.894633] kernel: snd_sof:sof_ipc4_log_header: sof-audio-pci-intel-tgl 0000:00:1f.3: ipc tx      : 0xe050101|0x0: GLB_CHAIN_DMA
[  578.400201] kernel: sof-audio-pci-intel-tgl 0000:00:1f.3: ipc timed out for 0xe050101|0x0
[  578.400327] kernel: sof-audio-pci-intel-tgl 0000:00:1f.3: preventing DSP entering D3 state to preserve context

@ranj063
Collaborator

ranj063 commented Feb 10, 2023

@jsarha TGL HDA has quite a few failures. Can you please check?

@jsarha
Contributor Author

jsarha commented Feb 13, 2023

@jsarha TGL HDA has quite a few failures. Can you please check?

Yes, indeed the multiple-suspend-resume test now fails with a FW crash. I can reproduce the problem on my upx-i11, but there is not much information to go on to close in on the problem. However, it's quite clear that we should not merge this PR yet.

@kv2019i
Collaborator

kv2019i commented Feb 14, 2023

Let's test this again in CI (I'll trigger this). @jsarha @makarukp The latest run did not have the Linux kernel fix to get the last DSP logs to mtrace. This looks like a FW crash, and if that's the case, we should get a clean DSP crash dump with a backtrace in mtrace. (https://sof-ci.01.org/sofpr/PR7073/build3851/devicetest/index.html does not have it)

@kv2019i
Collaborator

kv2019i commented Feb 14, 2023

SOFCI TEST

@kv2019i
Collaborator

kv2019i commented Feb 14, 2023

@jsarha @makarukp This will get published later, but the CI now captures a full DSP panic. I'll copy an early peek here:

[   34.201618] <inf> ipc: rx	: 0xe050101|0x0
[   34.201640] <inf> chain_dma: comp:129 0x810000 chain_host_stop(): dma_stop() host chan_index = 1
[   34.201653] <inf> chain_dma: comp:129 0x810000 chain_link_stop(): dma_stop() link chan_index = 1
[   34.201761] <err> os:  ** FATAL EXCEPTION
[   34.201770] <err> os:  ** CPU 0 EXCCAUSE 13 (load/store PIF data error)
[   34.201775] <err> os:  **  PC 0xbe01cdc0 VADDR 0x1
[   34.201780] <err> os:  **  PS 0x60b20
[   34.201790] <err> os:  **    (INTLEVEL:0 EXCM: 0 UM:1 RING:0 WOE:1 OWB:11 CALLINC:2)
[   34.201798] <err> os:  **  A0 0xbe021ac8  SP 0xbe0854f0  A2 0xbe098c80  A3 0xbe09ab6c
[   34.201805] <err> os:  **  A4 (nil)  A5 0xbe09ab50  A6 0x8  A7 0x1
[   34.201811] <err> os:  **  A8 0xbe01cdc0  A9 0xbe0854d0 A10 0xbe098c80 A11 0xbe085500
[   34.201820] <err> os:  ** A12 0x60b20 A13 0x120 A14 0x200 A15 0x400
[   34.201826] <err> os:  ** LBEG 0xbe03fc86 LEND 0xbe03fc95 LCOUNT 0xbe040695
[   34.201831] <err> os:  ** SAR 0x20

Backtrace:0xbe01cdbd:0xbe0854f0 0xbe021ac5:0xbe085530 0xbe012d62:0xbe085570 

[   34.201868] <err> os: >>> ZEPHYR FATAL ERROR 0: CPU exception on CPU 0
[   34.201875] <err> os: Current thread: 0x9e0989f8 (unknown)
[   34.203343] <err> os: Halting systeTerminated

@lgirdwood
Member

@kv2019i nice to see the mtrace, but it looks like it's missing %8.8x formatting?

@jsarha jsarha force-pushed the chain-dma-re-enable branch from 74f6cc7 to cd64e68 Compare February 17, 2023 21:41
@jsarha jsarha changed the title Revert "topology2: avs-tplg: disable USE_CHAIN_DMA" Fix chain dma task handling issue and re-enable it. Feb 17, 2023
@jsarha
Contributor Author

jsarha commented Feb 17, 2023

I think I found the fix for: #7084
The multiple-pause-resume test now passes, at least locally, on my upx-i11.
I piggybacked the fix onto this old chain-DMA re-enable PR to get proper CI results with chain DMA enabled.

When schedule_task_free() is called on a scheduled zephyr_ll task it
sometimes causes a crash. Canceling the task before freeing appears to
fix the problem.

Signed-off-by: Jyri Sarha <jyri.sarha@intel.com>
The known issues with the pipeline-free chain DMA implementation have
now been solved, so we can re-enable chain DMA usage.

This reverts commit c3807ae.

Signed-off-by: Jyri Sarha <jyri.sarha@intel.com>
@jsarha jsarha force-pushed the chain-dma-re-enable branch from cd64e68 to cf2115d Compare February 17, 2023 22:04
@marc-hb
Collaborator

marc-hb commented Feb 17, 2023

This will get published later, but the CI now captures a full DSP panic. I'll copy an early peek here:

It got published later on the public site but then @jsarha force-pushed a new version so we'll never find the complete results.

Always save links to test results before force-pushing.

@jsarha
Contributor Author

jsarha commented Feb 20, 2023

This will get published later, but the CI now captures a full DSP panic. I'll copy an early peek here:

It got published later on the public site but then @jsarha force-pushed a new version so we'll never find the complete results.

Always save links to test results before force-pushing.

Sorry about that. Just fixed a stupid typo in the commit message.

if (!ret)
ret = ret2;

schedule_task_cancel(&cd->chain_task);
Contributor


@jsarha, 'appears to fix' implies that we don't know the real issue?
Or must we stop the task before freeing it unconditionally, because that is the correct way?

@lyakh, @kv2019i, can you clarify on this?

Contributor Author


I had several locks and prints in place while debugging this on Friday. Based on those prints it looked like the crash is caused by the task being executed after it has been freed. I am pretty sure this is a correct fix. Before the fix, multiple-pause-resume crashed a matter of seconds after it was started. To pass, the test needs to run something like 30 minutes without a glitch.

Collaborator


@ujfalusi yes, I think it's needed

Contributor

@ujfalusi ujfalusi Feb 20, 2023


Could something like this be added instead, as a generic fix?

diff --git a/src/include/sof/schedule/schedule.h b/src/include/sof/schedule/schedule.h
index 5d069443ab07..136f3ffbc8ef 100644
--- a/src/include/sof/schedule/schedule.h
+++ b/src/include/sof/schedule/schedule.h
@@ -363,8 +363,15 @@ static inline int schedule_task_free(struct task *task)
 
 	list_for_item(slist, &schedulers->list) {
 		sch = container_of(slist, struct schedule_data, list);
-		if (task->type == sch->type)
+		if (task->type == sch->type) {
+			/* Make sure that the task is not scheduled, cancel it first */
+			int ret = sch->ops->schedule_task_cancel(sch->data, task);
+
+			if (ret)
+				return ret;
+
 			return sch->ops->schedule_task_free(sch->data, task);
+		}
 	}
 
 	return -ENODEV;

Probably the ret check is not needed? I'm not sure what the return value is if the task is not scheduled; my guess would be 0, but if it is not, then the ret check is not right.

Contributor Author

@jsarha jsarha Feb 20, 2023


That, or add clear documentation to schedule_task_free() stating that one should not call it on a task that may still be scheduled. E.g. keep the API symmetric: schedule_task_init() pairs with schedule_task_free(), and schedule_task() with schedule_task_cancel().

Collaborator


@kv2019i do you mean waiting on a semaphore in zephyr_ll_task_free()? Well, that's just one particular case. I think that so far we have what @jsarha described as a symmetric API. It could in principle be changed, I guess, but we'd need to verify all the cases - XTOS, Zephyr, LL, EDF,...

Contributor Author

@jsarha jsarha Feb 20, 2023


Now that I try to think of how the task can be scheduled after being freed, I cannot put my finger on it. But since I have seen it, I am pretty sure there can be some unfortunate serialization of zephyr_ll_run() and zephyr_ll_task_free() where this could happen. But indeed it starts to look like there is a bug in zephyr_ll.c.

Contributor Author

@jsarha jsarha Feb 20, 2023


Do zephyr_ll_lock() (= Zephyr arch_irq_lock()) and zephyr_ll_unlock() (= Zephyr arch_irq_unlock()) do anything about the assignments actually being flushed to memory? Should they? It looks to me that their implementation on Xtensa reduces to a single instruction:

static ALWAYS_INLINE unsigned int arch_irq_lock(void)
{
	unsigned int key;

	__asm__ volatile("rsil %0, %1"
			 : "=r"(key) : "i"(XCHAL_EXCM_LEVEL) : "memory");
	return key;
}

static ALWAYS_INLINE void arch_irq_unlock(unsigned int key)
{
	__asm__ volatile("wsr.ps %0; rsync"
			 :: "r"(key) : "memory");
}

Even if the threads run on the same core, the functions trust that the values are immediately readable from another thread before the function exits. How do we know the values are actually written to memory instead of still being held in registers, if there are no explicit memory barriers or even volatile qualifiers?

Collaborator


Not sure why I came to the conclusion that schedule_task_cancel() has to be called before schedule_task_free() with the Zephyr LL scheduler. Looking at zephyr_ll_task_free() now, it should handle freeing tasks in any state, including running. Note that, as described there, it must only be called on the same core as where the task is running, and only from a thread context. Is this all the case? Have we decoded the crash? @jsarha

Contributor Author

@jsarha jsarha Feb 21, 2023


The crash always happens at the rsil instruction when taking the lock (= forbidding interrupts?) in zephyr_ll_run() after the do_task_run() call. At that point - according to my debug prints - the task has already been freed. After thinking through different serializations of the functions for some time, I cannot think of any other reason for that to happen than the "locking" not working as it should (e.g. not acting as an explicit memory barrier). Maybe a more correct way to fix this particular problem would be not to init() and free() the task all the time, but just to schedule() and cancel() it. But it still appears that there is something fishy about how zephyr_ll works.

So the crash always happens here:

zephyr_ll_lock(sch, &flags);

@lgirdwood
Member

SOFCI TEST

@lgirdwood
Member

Double test to be sure.

Collaborator

@lyakh lyakh left a comment


There's a bug in the Zephyr LL scheduler; I tried to "quickly" fix it in #7138 and it didn't work. We can merge this for now, and I'll debug this more and fix the scheduler properly when I get some time. In fact this fix shouldn't be needed, but it's needed to compensate for that scheduler bug.

@marc-hb
Collaborator

marc-hb commented Feb 21, 2023

SOFCI TEST, Double test to be sure.

The previous URL was not saved, so we lost the previous test results again. The new 01.org lacks directory listing.

@jsarha
Contributor Author

jsarha commented Feb 21, 2023

I do not know why this version works, but it appears to be the only one that does. #7138 does a much better job of theorizing about what goes wrong with the original version, and fixes that, but unfortunately that version does not pass the test.
