ASoC: SOF: ipc: fix a race, leading to IPC timeouts #943
Conversation
@mengdonglin @keqiaozhang can we please stress test this patch on CML/APL with the IPC flood test PR #852 and make sure we no longer see the IPC timeouts?
@lyakh please rebase, hsw.c is no longer part of the code... |
I am having difficulties figuring out what the race condition is. The last part of the IRQ thread is about unmasking the interrupt, so how can the IRQ thread be pre-empted by another interrupt? |
And I don't even understand why we need to have interrupt masks in the first place. |
Does byt_get_reply() still also have spinlocks? So spinlocks inside spinlocks?
Oops, right, thanks for catching that! bdw too, fixed now.
That I don't know either, but changing that would be a different topic, it would be a rather significant change and would require intensive testing. |
ranj063 left a comment:
And I don't even understand why we need to have interrupt masks in the first place.
That I don't know either, but changing that would be a different topic, it would be a rather significant change and would require intensive testing.
@lyakh I think Pierre is right. We should remove the masking in the IRQ threads. If we are fixing IPC issues, I think it makes sense to do it right this time. This has been lingering for a while.
An IRQ thread is just a thread: it can be preempted like any other thread, not only by interrupts but also by ordinary context switches. That's why you can sleep in it and do other kinds of scheduling. Does that answer your question, or would you like me to sketch a diagram to explain the race scenario?
I don't have any issue with the preemption; I just don't get what other entity is competing with this thread or doing bad things while the thread is preempted.
@ranj063 @plbossart I'm not at all against removing that masking, as long as that doesn't break anything, but my understanding is that at the moment we have to fix the "IPC timeout" bug quickly, reliably and with minimal risk. IMHO changing the masking strategy at this point would be more intrusive and risky than what we want. But maybe I'm wrong. If the common opinion is that we do have the time for that change and all the follow-up testing and validation, I can do that. I can remove those masks tomorrow and see if I can detect any breakage. Then, if I don't find anything, we can take that for a spin with CI. Should we do that, or should we rather take this and continue fixing other IPC bugs?
The sender thread that got woken up by this IRQ thread. It wakes up and sends the next IPC; the DSP receives it and replies, but the host cannot receive that reply because the interrupt is still masked.
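To make the window concrete, here is a minimal sketch of the pre-patch IRQ thread flow and where the sender can sneak in; `sof_dev_sketch`, `handle_dsp_reply()` and `unmask_ipc_interrupt()` are illustrative stand-ins, not the actual SOF driver symbols:

```c
#include <linux/interrupt.h>

/* Illustrative context type, not the real struct snd_sof_dev. */
struct sof_dev_sketch {
	int dummy;
};

/* Hypothetical stand-ins for the per-platform reply handling and
 * interrupt-unmask helpers. */
static void handle_dsp_reply(struct sof_dev_sketch *sdev) { }
static void unmask_ipc_interrupt(struct sof_dev_sketch *sdev) { }

static irqreturn_t ipc_irq_thread_sketch(int irq, void *context)
{
	struct sof_dev_sketch *sdev = context;

	handle_dsp_reply(sdev);		/* wakes the sender in tx_wait_done() */

	/*
	 * Preemption window: the woken sender can run here and send the
	 * next IPC.  The DSP's reply to that new message raises an
	 * interrupt that is still masked below, so the sender eventually
	 * times out waiting for it.
	 */

	unmask_ipc_interrupt(sdev);	/* too late if we were preempted above */

	return IRQ_HANDLED;
}
```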
Isn't the masking part of the "hardware programming requirement"? |
I don't think there's such a universal requirement. It depends on many factors: whether to mask interrupts at all, and if so, at which level and for how long.
@lyakh sorry, now I get it. I read your commit message sideways and thought you had added a new spin lock without anyone using it. Here you are explicitly preventing a new IPC from being sent, using the same spin lock as before. That looks fine; regardless of whether we need the interrupt masks, this looks like a valid change.
@plbossart @lyakh I think this PR will solve one case below: So with the solution here it will become: But how about the following situation: And this will also block the HDA IRQ handler/thread for the HDA streams. I am not sure our thread handler is short enough that we can safely block all IRQs for its whole duration.
What exactly are IPC A and B in your description? We have a single IPC doorbell register so what mechanism would the DSP use to signal an IPC B? |
@lyakh is there a recursive spin_lock in snd_sof_ipc_reply() with your change applied?
Basically, we should not call _ipc_dsp_done() before the whole processing of the previous host->DSP IPC has finished. That is, we should call _ipc_dsp_done() (to allow the next host->DSP TX IPC to start) at the end of tx_wait_done(). In today's code we call snd_sof_ipc_reply() to wake tx_wait_done() and then try to set DONE immediately afterwards. So the race I can see is that the next host->DSP TX IPC in thread C may happen before _ipc_dsp_done() has finished in thread A, which leaves the IPC doorbell registers disordered. I previously added a check of the DONE bits inside _ipc_send_msg() and saw that we did have cases where the DONE bit was not cleared yet! In the old IPC logic we had an is_dsp_ready() check before sending a new IPC, which could fix this kind of issue; now we might think about similar logic, or use a spin lock to make sure tx_wait_done() finishes after _ipc_dsp_done(), so that thread C transfers the next IPC only once the DONE bit has been cleared. To make sure tx_wait_done() finishes after _ipc_dsp_done(), we need the change @lyakh makes here, plus this IMO:
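For illustration, an is_dsp_ready()-style guard of the kind mentioned above could look roughly like this; the register offset, the DONE bit position and the helper name are assumptions for the sketch, not the actual SOF register map:

```c
#include <linux/bits.h>
#include <linux/io.h>
#include <linux/iopoll.h>
#include <linux/types.h>

/* Hypothetical register layout, for the sketch only. */
#define SKETCH_IPC_DOORBELL	0x48		/* host -> DSP doorbell offset */
#define SKETCH_IPC_DONE		BIT(30)		/* previous reply not yet acknowledged */

/*
 * is_dsp_ready()-style guard: poll until the DONE bit of the previous
 * transaction has been cleared before writing a new command.
 * Returns 0 when it is safe to send, -ETIMEDOUT otherwise.
 */
static int sketch_wait_dsp_ready(void __iomem *ipc_base)
{
	u32 val;

	return readl_poll_timeout(ipc_base + SKETCH_IPC_DOORBELL, val,
				  !(val & SKETCH_IPC_DONE), 10, 1000);
}
```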
@lyakh I think your PR can fix the risk I listed above, as thread C will be held at the spin_lock_irq() in sof_ipc_tx_message_unlocked() before it does the real IPC sending via snd_sof_dsp_send_msg(), since you hold the lock in thread A until _ipc_dsp_done() has finished.
@plbossart @xiulipan Please study Keyon's comments carefully.
No, the spin-lock is removed from snd_sof_ipc_reply() by this patch, please check again |
@keyonjie no, I don't think this should be happening. Host IPC sending is now fully serialised by the ipc->tx_mutex |
Have you checked my description about threads A/B/C? I mean that before applying your PR we had a race.
@keyonjie yes, I don't think that scenario is possible, because thread B will be holding the mutex, so thread C will not be able to begin sending the next IPC. |
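A rough sketch of that serialisation, assuming a tx_mutex like the one referenced above; the `ipc_sketch` type and helper names are simplified stand-ins, not the real driver code:

```c
#include <linux/mutex.h>
#include <linux/types.h>

/* Simplified stand-in for the IPC bookkeeping structure. */
struct ipc_sketch {
	struct mutex tx_mutex;	/* serialises all host -> DSP transmissions */
};

static int sketch_send_and_wait(struct ipc_sketch *ipc,
				const void *msg, size_t bytes)
{
	/* write the message, ring the doorbell, block in tx_wait_done() */
	return 0;
}

/*
 * Every sender funnels through this path, so "thread C" cannot start
 * its message until "thread B" has received its reply and dropped
 * tx_mutex; the scenario with two concurrent senders cannot happen.
 */
static int sketch_tx_message(struct ipc_sketch *ipc,
			     const void *msg, size_t bytes)
{
	int ret;

	mutex_lock(&ipc->tx_mutex);
	ret = sketch_send_and_wait(ipc, msg, bytes);
	mutex_unlock(&ipc->tx_mutex);

	return ret;
}
```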
@lyakh It is very tricky, so let me describe my understanding. Your idea is correct: host IPC sending is now fully serialised by the ipc->tx_mutex. But the issue was the IPC register setting. Without this PR, the error sequence was: here (4) should happen before (5), because we should first set the IPC DONE register and only then send the next IPC message. In the original code we checked the IPC DONE register before sending an IPC message, do you remember? Your mutex protects the IPC message sending, but does not cover the IPC register setting. Now cnl_ipc_dsp_done() is protected by a spin lock, and step (4) will try to take the spin lock and has to wait until (5) has finished and released the spin lock, so this issue is resolved. If you agree with me, please refine your commit message.
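In other words, the DONE acknowledgement and the next doorbell write now contend for the same spin lock. A hedged sketch of that ordering follows; the struct, field and function names here are placeholders, not the exact cnl.c code:

```c
#include <linux/spinlock.h>

/* Placeholder context; the real lock lives in the SOF device data. */
struct doorbell_sketch {
	spinlock_t ipc_lock;
};

/* IRQ-thread side, corresponding to cnl_ipc_dsp_done(): write the DONE
 * acknowledgement to the IPC doorbell register under the lock. */
static void sketch_dsp_done(struct doorbell_sketch *db)
{
	unsigned long flags;

	spin_lock_irqsave(&db->ipc_lock, flags);
	/* write the DONE acknowledgement register here */
	spin_unlock_irqrestore(&db->ipc_lock, flags);
}

/* Sender side: the next command write contends for the same lock, so
 * it can only land after the DONE acknowledgement has been written. */
static void sketch_send_next_cmd(struct doorbell_sketch *db)
{
	unsigned long flags;

	spin_lock_irqsave(&db->ipc_lock, flags);
	/* write the new command to the doorbell register here */
	spin_unlock_irqrestore(&db->ipc_lock, flags);
}
```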
@lyakh you can review Keyon's comment; he mentioned that the DONE bit was not correct for some IPC timeout cases.
Currently on all supported platforms the IPC IRQ thread first signals the sender when an IPC response is received from the DSP, then unmasks the IPC interrupt. Those actions are performed without holding any locks, so the thread can be interrupted between them. IPC timeouts have been observed in such scenarios: if the sender is woken up and proceeds with sending the next message before the IPC interrupt has been unmasked, it can miss the next response. This patch takes a spin-lock to prevent the IRQ thread from being preempted at that point. It also makes sure that the next IPC transmission by the host cannot take place before the IRQ thread has finished updating all the required IPC registers. Signed-off-by: Guennadi Liakhovetski <guennadi.liakhovetski@linux.intel.com>
@RanderWang better now? |
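Put together, the commit message above amounts to the IRQ thread running its whole tail inside one critical section that the TX path also has to take. A minimal sketch under that assumption (all names below are illustrative, not the actual patch):

```c
#include <linux/interrupt.h>
#include <linux/spinlock.h>

/* Illustrative device context; the real driver keeps its lock elsewhere. */
struct irq_tail_sketch {
	spinlock_t ipc_lock;	/* shared with the TX path */
};

/* Stubs standing in for snd_sof_ipc_reply(), the platform *_dsp_done()
 * handler and the IPC interrupt unmask, respectively. */
static void sketch_signal_sender(struct irq_tail_sketch *s) { }
static void sketch_ack_done(struct irq_tail_sketch *s) { }
static void sketch_unmask_ipc_irq(struct irq_tail_sketch *s) { }

static irqreturn_t sketch_fixed_irq_thread(int irq, void *context)
{
	struct irq_tail_sketch *s = context;

	/*
	 * The whole tail of the handler runs inside one critical section.
	 * The TX path takes the same lock before writing the next command,
	 * so it cannot interleave with the register updates below.
	 */
	spin_lock_irq(&s->ipc_lock);
	sketch_signal_sender(s);	/* wake tx_wait_done() */
	sketch_ack_done(s);		/* update the IPC doorbell registers */
	sketch_unmask_ipc_irq(s);	/* re-arm the IPC interrupt */
	spin_unlock_irq(&s->ipc_lock);

	return IRQ_HANDLED;
}
```

Whether unmasking really needs to be inside the lock at all is exactly the open question raised earlier in the thread; the sketch only mirrors the ordering the commit message describes.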
@emilchudzik can you check IPC timeouts with this PR?
RanderWang left a comment:
LGTM
plbossart left a comment:
@plbossart @RanderWang |
@RanderWang Agreed that this race condition was once fixed by our old IPC status check, but after the IPC refactoring that check was removed. Either way, this will sync our IPC status with our IPC thread.
I see many approvals, so maybe the code looks good, but I strongly recommend NOT merging. Kernel based on 465c8e4 2019-05-15 07:31:34 Pierre-Louis Bossart "ASoC: SOF: core: fix error handling with the probe workqueue". The IPC timeout occurs after:
So I mark repro: 10/10 |
emilchudzik left a comment:
Please check my comments.
From GK VAL - I have 100% (10/10) repro of IPC timeouts with this commit.
@plbossart @lgirdwood @ranj063 @emilchudzik But without this PR I can easily reproduce this in every stress test within 10000 IPCs. I will update the test results on WHL and CFL later.
Update on the WHL test result: resume from D3 always fails with this; it needs @ranj063's #895 and thesofproject/sof#1354. I suggest merging this PR first to fix one type of IPC issue, then figuring out what is missing on WHL. On APL, 15 minutes of flooding passed. On CFL, 15 minutes of flooding passed.
Update for WHL: enabling pulseaudio or playing something on an HDA PCM causes the boot timeout issue. With pulseaudio disabled for the IPC flood testing with this patch, WHL can also pass the 15-minute flood test.
@xiulipan this is good information, thank you. It's the first time we have a clear result that this patch improves things. Will merge now, let's debug WHL separately. |