
Conversation

@jsarha
Contributor

@jsarha jsarha commented Feb 10, 2023

The known issues with the pipeline-free chain DMA implementation have now been solved, so we can re-enable chain DMA usage.

This reverts commit c3807ae.

Signed-off-by: Jyri Sarha jyri.sarha@intel.com

Contributor

@ujfalusi ujfalusi left a comment


@jsarha, @makarukp, thank you, tested and works on my upx-i11 board!

Collaborator

@kv2019i kv2019i left a comment


Let's wait for CI results.

@kv2019i kv2019i requested review from lyakh and makarukp February 10, 2023 16:13
@kv2019i
Collaborator

kv2019i commented Feb 10, 2023

Something fishy:
https://sof-ci.01.org/sofpr/PR7073/build3851/devicetest/index.html?model=TGLU_UP_HDA_IPC4ZPH&testcase=multiple-pause-resume-5

[  577.894633] kernel: snd_sof:sof_ipc4_log_header: sof-audio-pci-intel-tgl 0000:00:1f.3: ipc tx      : 0xe050101|0x0: GLB_CHAIN_DMA
[  578.400201] kernel: sof-audio-pci-intel-tgl 0000:00:1f.3: ipc timed out for 0xe050101|0x0
[  578.400327] kernel: sof-audio-pci-intel-tgl 0000:00:1f.3: preventing DSP entering D3 state to preserve context

@ranj063
Collaborator

ranj063 commented Feb 10, 2023

@jsarha TGL HDA has quite a few failures. Can you please check?

@jsarha
Contributor Author

jsarha commented Feb 13, 2023

@jsarha TGL HDA has quite a few failures. Can you please check?

Yes, indeed the multiple-suspend-resume test now fails with a FW crash. I can reproduce the problem on my upx-i11, but there is not much information to go on to close in on the problem. However, it's quite clear that we should not merge this PR yet.

@kv2019i
Collaborator

kv2019i commented Feb 14, 2023

Let's test this again in CI (I'll trigger this). @jsarha @makarukp The latest run did not have the Linux kernel fix to get the last DSP logs to mtrace. This looks like a FW crash, and if that's the case, we should get a clean DSP crash dump with a backtrace in mtrace. (https://sof-ci.01.org/sofpr/PR7073/build3851/devicetest/index.html does not have it)

@kv2019i
Collaborator

kv2019i commented Feb 14, 2023

SOFCI TEST

@kv2019i
Collaborator

kv2019i commented Feb 14, 2023

@jsarha @makarukp This will get published later, but the CI now captures a full DSP panic. I'll copy an early peek here:

[   34.201618] <inf> ipc: rx	: 0xe050101|0x0
[   34.201640] <inf> chain_dma: comp:129 0x810000 chain_host_stop(): dma_stop() host chan_index = 1
[   34.201653] <inf> chain_dma: comp:129 0x810000 chain_link_stop(): dma_stop() link chan_index = 1
[   34.201761] <err> os:  ** FATAL EXCEPTION
[   34.201770] <err> os:  ** CPU 0 EXCCAUSE 13 (load/store PIF data error)
[   34.201775] <err> os:  **  PC 0xbe01cdc0 VADDR 0x1
[   34.201780] <err> os:  **  PS 0x60b20
[   34.201790] <err> os:  **    (INTLEVEL:0 EXCM: 0 UM:1 RING:0 WOE:1 OWB:11 CALLINC:2)
[   34.201798] <err> os:  **  A0 0xbe021ac8  SP 0xbe0854f0  A2 0xbe098c80  A3 0xbe09ab6c
[   34.201805] <err> os:  **  A4 (nil)  A5 0xbe09ab50  A6 0x8  A7 0x1
[   34.201811] <err> os:  **  A8 0xbe01cdc0  A9 0xbe0854d0 A10 0xbe098c80 A11 0xbe085500
[   34.201820] <err> os:  ** A12 0x60b20 A13 0x120 A14 0x200 A15 0x400
[   34.201826] <err> os:  ** LBEG 0xbe03fc86 LEND 0xbe03fc95 LCOUNT 0xbe040695
[   34.201831] <err> os:  ** SAR 0x20

Backtrace:0xbe01cdbd:0xbe0854f0 0xbe021ac5:0xbe085530 0xbe012d62:0xbe085570 

[   34.201868] <err> os: >>> ZEPHYR FATAL ERROR 0: CPU exception on CPU 0
[   34.201875] <err> os: Current thread: 0x9e0989f8 (unknown)
[   34.203343] <err> os: Halting systeTerminated

@lgirdwood
Member

@kv2019i nice to see the mtrace, but it looks like it's missing %8.8x formatting?

@jsarha jsarha force-pushed the chain-dma-re-enable branch from 74f6cc7 to cd64e68 Compare February 17, 2023 21:41
@jsarha jsarha changed the title Revert "topology2: avs-tplg: disable USE_CHAIN_DMA" Fix chain dma task handling issue and re-enable it. Feb 17, 2023
@jsarha
Contributor Author

jsarha commented Feb 17, 2023

I think I found the fix for: #7084
The multiple-pause-resume test now passes, at least locally, on my upx-i11.
I piggybacked the fix onto this old chain-DMA re-enable PR to get proper CI results with chain DMA enabled.

When schedule_task_free() is called on a scheduled zephyr_ll task it
sometimes causes a crash. Canceling the task before freeing appears to
fix the problem.

Signed-off-by: Jyri Sarha <jyri.sarha@intel.com>
The known issues with the pipeline-free chain DMA implementation have
now been solved, so we can re-enable chain DMA usage.

This reverts commit c3807ae.

Signed-off-by: Jyri Sarha <jyri.sarha@intel.com>
@jsarha jsarha force-pushed the chain-dma-re-enable branch from cd64e68 to cf2115d Compare February 17, 2023 22:04
@marc-hb
Collaborator

marc-hb commented Feb 17, 2023

This will get published later, but the CI now captures a full DSP panic. I'll copy an early peek here:

It got published later on the public site but then @jsarha force-pushed a new version so we'll never find the complete results.

Always save links to test results before force-pushing.

@jsarha
Contributor Author

jsarha commented Feb 20, 2023

This will get published later, but the CI now captures a full DSP panic. I'll copy an early peek here:

It got published later on the public site but then @jsarha force-pushed a new version so we'll never find the complete results.

Always save links to test results before force-pushing.

Sorry about that. Just fixed a stupid typo in the commit message.

if (!ret)
ret = ret2;

schedule_task_cancel(&cd->chain_task);
Contributor


@jsarha, 'appears to fix' implies that we don't know the real issue?
Or must we stop the task before freeing it unconditionally, because that is the correct way?

@lyakh, @kv2019i, can you clarify on this?

Contributor Author


I had several locks and prints in place while debugging this on Friday. Based on those prints it looked like the crash is caused by the task being executed after it has been freed. I am pretty sure this is a correct fix. Before the fix, multiple-pause-resume crashed a matter of seconds after it was started. To pass, the test needs to run something like 30 minutes without a glitch.

Collaborator


@ujfalusi yes, I think it's needed

Contributor

@ujfalusi ujfalusi Feb 20, 2023


Could something like this be added instead, as a generic fix?

diff --git a/src/include/sof/schedule/schedule.h b/src/include/sof/schedule/schedule.h
index 5d069443ab07..136f3ffbc8ef 100644
--- a/src/include/sof/schedule/schedule.h
+++ b/src/include/sof/schedule/schedule.h
@@ -363,8 +363,15 @@ static inline int schedule_task_free(struct task *task)
 
 	list_for_item(slist, &schedulers->list) {
 		sch = container_of(slist, struct schedule_data, list);
-		if (task->type == sch->type)
+		if (task->type == sch->type) {
+			/* Make sure that the task is not scheduled, cancel it first */
+			int ret = sch->ops->schedule_task_cancel(sch->data, task);
+
+			if (ret)
+				return ret;
+
 			return sch->ops->schedule_task_free(sch->data, task);
+		}
 	}
 
 	return -ENODEV;

Probably the ret check is not needed? I'm not sure what the return value is if the task is not scheduled; my guess would be 0, but if it is not, then the ret check is not right.

Contributor Author

@jsarha jsarha Feb 20, 2023


That, or add clear documentation to schedule_task_free() stating that one should not call it on a task that may still be scheduled. E.g. keep the API symmetric: schedule_task_init() pairs with schedule_task_free(), and schedule_task() with schedule_task_cancel().

Collaborator


@kv2019i do you mean waiting on a semaphore in zephyr_ll_task_free()? Well, that's just one particular case. I think that so far we have what @jsarha described as a symmetric API. It could in principle be changed, I guess, but we'd need to verify all the cases - XTOS, Zephyr, LL, EDF,...

Contributor Author

@jsarha jsarha Feb 20, 2023


Now that I try to think of how the task can be scheduled after being freed, I cannot put my finger on it. But since I have seen it, I am pretty sure there can be some unfortunate serialization of zephyr_ll_run() and zephyr_ll_task_free() where this could happen. But indeed it starts to look like there is a bug in zephyr_ll.c.

Contributor Author

@jsarha jsarha Feb 20, 2023


Do zephyr_ll_lock() (= Zephyr arch_irq_lock()) and zephyr_ll_unlock() (= Zephyr arch_irq_unlock()) do anything about the assignments actually being flushed to memory? Should they? It looks to me that their implementation on Xtensa reduces to a single instruction:

static ALWAYS_INLINE unsigned int arch_irq_lock(void)
{
	unsigned int key;

	__asm__ volatile("rsil %0, %1"
			 : "=r"(key) : "i"(XCHAL_EXCM_LEVEL) : "memory");
	return key;
}

static ALWAYS_INLINE void arch_irq_unlock(unsigned int key)
{
	__asm__ volatile("wsr.ps %0; rsync"
			 :: "r"(key) : "memory");
}

Even if the threads run on the same core, the functions trust that the values are immediately readable from another thread before the function exits. How do we know the values are actually written to memory instead of still being held in registers, if there are no explicit memory barriers or even volatile qualifiers?

Collaborator


Not sure why I came to the conclusion that schedule_task_cancel() has to be called before schedule_task_free() with the Zephyr LL scheduler. Looking at zephyr_ll_task_free() now, it should handle freeing tasks in any state, including running. Note that, as described there, it must only be called on the same core as where the task is running, and only from a thread context. Is this all the case? Have we decoded the crash? @jsarha

Contributor Author

@jsarha jsarha Feb 21, 2023


The crash always happens at the rsil instruction when taking the lock (= forbidding interrupts?) in zephyr_ll_run() after the do_task_run() call. At that point - according to my debug prints - the task has already been freed. After thinking through different serializations of the functions for some time, I cannot think of any other reason for that to happen than the "locking" not working as it should (e.g. not acting as an explicit memory barrier). Maybe a more correct way to fix this particular problem would be not to init() and free() the task all the time, but just to schedule() and cancel() it. But it still appears that there is something fishy about how zephyr_ll works.

So the crash always happens here:

zephyr_ll_lock(sch, &flags);

@lgirdwood
Member

SOFCI TEST

@lgirdwood
Member

Double test to be sure.

Collaborator

@lyakh lyakh left a comment


There's a bug in the Zephyr LL scheduler; I tried to "quickly" fix it in #7138 and it didn't work. We can merge this for now, and I'll debug this more and fix the scheduler properly when I get some time. In fact this fix shouldn't be needed, but it's needed to compensate for that scheduler bug.

@marc-hb
Collaborator

marc-hb commented Feb 21, 2023

SOFCI TEST, Double test to be sure.

The previous URL was not saved, so we lost the previous test results again. The new 01.org lacks directory listing.

@jsarha
Contributor Author

jsarha commented Feb 21, 2023

I do not know why this version works, but it appears to be the only one that does. #7138 does a much better job of theorizing about what goes wrong with the original version, and fixes that, but unfortunately that version does not pass the test.
