
Conversation

@lyakh (Collaborator) commented May 14, 2019

Currently on all supported platforms the IPC IRQ thread first signals
the sender when an IPC response is received from the DSP, then
unmasks the IPC interrupt. Those actions are performed without
holding any locks, so the thread can be interrupted between them. IPC
timeouts have been observed in such scenarios: if the sender is woken
up and it proceeds with sending the next message without unmasking
the IPC interrupt, it can miss the next response. This patch takes a
spin-lock to prevent the IRQ thread from being preempted at that
point.
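(For illustration, a minimal sketch of the IRQ-thread pattern described above; sdev->ipc_lock is the lock referenced later in this thread, while the two helpers are hypothetical stand-ins rather than the driver's actual functions.)

#include <linux/interrupt.h>
#include <linux/spinlock.h>

static irqreturn_t ipc_irq_thread(int irq, void *context)
{
	struct snd_sof_dev *sdev = context;
	unsigned long flags;

	/*
	 * Hold the IPC lock across both steps so the woken sender cannot
	 * start the next transmission before the interrupt is unmasked
	 * and the DONE bit is acknowledged.
	 */
	spin_lock_irqsave(&sdev->ipc_lock, flags);

	ipc_signal_sender(sdev);	/* hypothetical: wakes tx_wait_done() */
	ipc_unmask_and_ack(sdev);	/* hypothetical: unmask IRQ, ack DONE */

	spin_unlock_irqrestore(&sdev->ipc_lock, flags);

	return IRQ_HANDLED;
}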

@ranj063 (Collaborator) commented May 14, 2019

@mengdonglin @keqiaozhang can we please stress test this patch on CML/APL with the IPC flood test PR #852 and make sure we don't see the IPC timeouts?

@plbossart (Member)

@lyakh please rebase, hsw.c is no longer part of the code...

@lyakh (Collaborator, Author) commented May 14, 2019

Fixes #876 and possibly #940; maybe also some other IPC timeout issues with non-deterministic reproduction patterns.

@plbossart (Member)

I am having difficulties figuring out what the race condition is. The last part of the IRQ thread is about unmasking the interrupt, so how can the IRQ thread be pre-empted by another interrupt?

@plbossart (Member)

And I don't even understand why we need to have interrupt masks in the first place.

@juimonen

does byt_get_reply still also have spinlocks? So spinlocks inside spinlocks?

@lyakh (Collaborator, Author) commented May 14, 2019

> @lyakh please rebase, hsw.c is no longer part of the code...

done

@lyakh (Collaborator, Author) commented May 14, 2019

> does byt_get_reply still also have spinlocks? So spinlocks inside spinlocks?

Oops, right, thanks for catching! bdw too, fixed now.

@lyakh (Collaborator, Author) commented May 14, 2019

> And I don't even understand why we need to have interrupt masks in the first place.

That I don't know either, but changing that would be a different topic, it would be a rather significant change and would require intensive testing.

@ranj063 (Collaborator) left a comment

> And I don't even understand why we need to have interrupt masks in the first place.

> That I don't know either, but changing that would be a different topic, it would be a rather significant change and would require intensive testing.

@lyakh I think Pierre is right. We should remove the masking in the IRQ threads. If we are fixing IPC issues, I think it makes sense to do it right this time. This has been lingering for a while

@lyakh (Collaborator, Author) commented May 14, 2019

> I am having difficulties figuring out what the race condition is. The last part of the IRQ thread is about unmasking the interrupt, so how can the IRQ thread be pre-empted by another interrupt?

IRQ thread is just a thread, it can be preempted just like any other thread. Not only by interrupts, but by simple context switches. That's why you can sleep in them and do other types of scheduling. Does that answer your question or would you like me to sketch a diagram to explain the race scenario?
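(For background, a minimal example of how such a threaded handler is registered with request_threaded_irq(); the names here are illustrative, not SOF's actual handlers. The hard handler runs in interrupt context and only defers the work; thread_fn runs in a schedulable kernel thread.)

#include <linux/interrupt.h>

/* Hard IRQ context: acknowledge the hardware and defer the real work. */
static irqreturn_t demo_ipc_irq_handler(int irq, void *context)
{
	return IRQ_WAKE_THREAD;	/* run demo_ipc_irq_thread() in a kernel thread */
}

/* Process context: may sleep, take mutexes and be preempted like any thread. */
static irqreturn_t demo_ipc_irq_thread(int irq, void *context)
{
	/* ... handle the IPC reply here ... */
	return IRQ_HANDLED;
}

static int demo_register_ipc_irq(int irq, void *context)
{
	return request_threaded_irq(irq, demo_ipc_irq_handler,
				    demo_ipc_irq_thread, 0, "demo-ipc",
				    context);
}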

@plbossart (Member)

> I am having difficulties figuring out what the race condition is. The last part of the IRQ thread is about unmasking the interrupt, so how can the IRQ thread be pre-empted by another interrupt?

> IRQ thread is just a thread, it can be preempted just like any other thread. Not only by interrupts, but by simple context switches. That's why you can sleep in them and do other types of scheduling. Does that answer your question or would you like me to sketch a diagram to explain the race scenario?

I don't have any issue with the pre-emption, I just don't get what other entity is competing with this thread or doing bad things while the thread is pre-empted.

@lyakh (Collaborator, Author) commented May 14, 2019

> And I don't even understand why we need to have interrupt masks in the first place.

> That I don't know either, but changing that would be a different topic, it would be a rather significant change and would require intensive testing.

> @lyakh I think Pierre is right. We should remove the masking in the IRQ threads. If we are fixing IPC issues, I think it makes sense to do it right this time. This has been lingering for a while

@ranj063 @plbossart I'm not at all against removing that masking, as long as it doesn't break anything, but my understanding is that at the moment we have to fix the "IPC timeout" bug quickly, reliably and with minimal risk. IMHO changing the masking strategy at this point would be more intrusive and risky than what we want. But maybe I'm wrong. If the common opinion is that we do have the time for that change and all the follow-up testing and validation, I can do that: tomorrow I can remove those masks and see if I can detect any breakage. Then, if I don't find anything, we can take that for a spin with CI. Should we do that, or should we rather take this and continue fixing other IPC bugs?

@lyakh (Collaborator, Author) commented May 14, 2019

> I am having difficulties figuring out what the race condition is. The last part of the IRQ thread is about unmasking the interrupt, so how can the IRQ thread be pre-empted by another interrupt?

> IRQ thread is just a thread, it can be preempted just like any other thread. Not only by interrupts, but by simple context switches. That's why you can sleep in them and do other types of scheduling. Does that answer your question or would you like me to sketch a diagram to explain the race scenario?

> I don't have any issue with the pre-emption, I just don't get what other entity is competing with this thread or doing bad things while the thread is pre-empted.

The sender thread that got woken up by this IRQ thread. It wakes up and sends the next IPC. The DSP receives it and replies to it, but the host cannot receive that reply because the interrupt is still masked.
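(Putting the scenario above on a timeline, illustrative only:)

IRQ thread (reply to IPC N)            Sender thread
---------------------------            -------------
signal sender: reply N received
  <preempted before unmasking>         wakes up, tx_wait_done() returns
                                       sends IPC N+1
                                       DSP replies to N+1, but the IPC
                                       interrupt is still masked
unmask IPC interrupt                   reply to N+1 missed -> IPC timeout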

@wenqingfu requested review from RanderWang and keyonjie, May 14, 2019 16:24
@wenqingfu

> And I don't even understand why we need to have interrupt masks in the first place.

> That I don't know either, but changing that would be a different topic, it would be a rather significant change and would require intensive testing.

> @lyakh I think Pierre is right. We should remove the masking in the IRQ threads. If we are fixing IPC issues, I think it makes sense to do it right this time. This has been lingering for a while

> @ranj063 @plbossart I'm not at all against removing that masking, as long as it doesn't break anything, but my understanding is that at the moment we have to fix the "IPC timeout" bug quickly, reliably and with minimal risk. IMHO changing the masking strategy at this point would be more intrusive and risky than what we want. But maybe I'm wrong. If the common opinion is that we do have the time for that change and all the follow-up testing and validation, I can do that: tomorrow I can remove those masks and see if I can detect any breakage. Then, if I don't find anything, we can take that for a spin with CI. Should we do that, or should we rather take this and continue fixing other IPC bugs?

Isn't the masking part of the "hardware programming requirement"?

@lyakh (Collaborator, Author) commented May 14, 2019

> Isn't the masking part of the "hardware programming requirement"?

I don't think there's such a universal requirement. It depends on many factors: whether to mask interrupts at all and, if so, at which level and for how long.

@plbossart (Member)

@lyakh sorry, now I get it. I read your commit message sideways and thought you added a new spin lock without seeing anyone use it. Here you are explicitly preventing a new IPC from being sent, using the same spin lock as before. That looks fine; regardless of whether we need the interrupt masks, this looks like a valid change.
Has anyone given it a quick spin to see if this improves anything?

@xiulipan

@plbossart @lyakh
I just went through the patch and thought about the race cases:

I think this PR will solve one case, shown below:

IPC A IRQ comes to host
IPC A IRQ thread start
IPC B IRQ comes to host
IPC B IRQ thread is blocked, and IRQ B dropped
IPC A IRQ thread end
IPC B timeout due to reply IRQ dropped

So with the solution here it will become:

IPC A IRQ comes to host
IPC A IRQ thread start with spin_lock_irqsave
IPC B IRQ comes to host, saved
IPC A IRQ thread end
IPC B IRQ restored
IPC B IRQ thread start
..
Good to go.

But how about the following situation:

IPC A IRQ comes to host
IPC A IRQ thread start with spin_lock_irqsave
IPC B IRQ comes to host, saved
IPC A IRQ thread end after 300ms (IPC timeout)
IPC B timeout due to IPC B IRQ not handled in time
IPC B IRQ restored
IPC B timeout already.

And this will also block the HDA IRQ handler/thread for the HDA streams. I am not sure our thread handler is short enough that we can safely block all IRQs for the duration of the thread.

@plbossart (Member)

@xiulipan

> IPC A IRQ comes to host
> IPC A IRQ thread start
> IPC B IRQ comes to host

What exactly are IPC A and B in your description? We have a single IPC doorbell register so what mechanism would the DSP use to signal an IPC B?

@keyonjie

@lyakh is there a recursive spin_lock in snd_sof_ipc_reply() with your change applied?

@keyonjie

Basically, we should not call _ipc_dsp_done() before the whole processing of the previous Host->DSP IPC is finished. That is, we should call _ipc_dsp_done() (to enable starting of the next host->DSP TX IPC) at the end of tx_wait_done().

In today's code, we are calling snd_sof_ipc_reply() to wake tx_wait_done(), and then try to set DONE immediately, but:

Thread A                                               Thread B                 Thread C
_ipc_irq_thread()
    snd_sof_ipc_reply() //wake up tx_wait_done()       tx_wait_done()
    _ipc_dsp_done()                                                            sof_ipc_tx_message //Tx next IPC

So the race I can see is that the next host->DSP TX IPC in Thread C may happen before _ipc_dsp_done() is finished in Thread A. This will leave the IPC doorbell registers disordered. I previously added a check of the DONE bits inside _ipc_send_msg() and saw that we did have cases where the DONE bit was not cleared yet!

In the old IPC logic we had is_dsp_ready() checks before sending a new IPC, which could fix this kind of issue. Now we might think about similar logic, or use a spin_lock to make sure tx_wait_done() finishes after _ipc_dsp_done(), so that Thread C transfers the next IPC only once the DONE bit is cleared.

To make sure tx_wait_done() finishes after _ipc_dsp_done(), we need the change @lyakh makes here, plus this IMO:

/* wait for IPC message reply */
static int tx_wait_done(struct snd_sof_ipc *ipc, struct snd_sof_ipc_msg *msg,
			void *reply_data)
{
	struct snd_sof_dev *sdev = ipc->sdev;
	struct sof_ipc_cmd_hdr *hdr = msg->msg_data;
+	unsigned long flags;
	int ret;

	/* wait for DSP IPC completion */
	ret = wait_event_timeout(msg->waitq, msg->ipc_complete,
				 msecs_to_jiffies(IPC_TIMEOUT_MS));

+	spin_lock_irqsave(&sdev->ipc_lock, flags);

	if (ret == 0) {
		dev_err(sdev->dev, "error: ipc timed out for 0x%x size %d\n",
			hdr->cmd, hdr->size);
		snd_sof_dsp_dbg_dump(ipc->sdev, SOF_DBG_REGS | SOF_DBG_MBOX);
		snd_sof_ipc_dump(ipc->sdev);
		snd_sof_trace_notify_for_error(ipc->sdev);
		ret = -ETIMEDOUT;
	} else {
		/* copy the data returned from DSP */
		ret = msg->reply_error;
		if (msg->reply_size)
			memcpy(reply_data, msg->reply_data, msg->reply_size);
		if (ret < 0)
			dev_err(sdev->dev, "error: ipc error for 0x%x size %zu\n",
				hdr->cmd, msg->reply_size);
		else
			ipc_log_header(sdev->dev, "ipc tx succeeded", hdr->cmd);
	}
+	spin_unlock_irqrestore(&sdev->ipc_lock, flags);
	return ret;
}

@keyonjie

@lyakh I think your PR can fix the risk I listed above, as Thread C will block on the spin_lock_irq() in sof_ipc_tx_message_unlocked() before it does the real IPC sending via snd_sof_dsp_send_msg(), since you hold the lock in Thread A until _ipc_dsp_done() is finished.
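(A simplified sketch of the sender-side ordering described above, assuming the sender serialises on ipc->tx_mutex and takes the same sdev->ipc_lock before writing the doorbell; this is illustrative only, not the driver's actual sof_ipc_tx_message_unlocked(), and the helper signatures are approximate.)

static int ipc_tx_message_sketch(struct snd_sof_ipc *ipc,
				 struct snd_sof_ipc_msg *msg,
				 void *reply_data)
{
	struct snd_sof_dev *sdev = ipc->sdev;
	int ret;

	/* serialises host senders (Thread B vs Thread C) */
	mutex_lock(&ipc->tx_mutex);

	/*
	 * Same lock the IRQ thread (Thread A) holds while it signals the
	 * sender and runs _ipc_dsp_done(): the doorbell cannot be written
	 * until the previous reply is fully acknowledged.
	 */
	spin_lock_irq(&sdev->ipc_lock);
	ret = snd_sof_dsp_send_msg(sdev, msg);	/* write the doorbell */
	spin_unlock_irq(&sdev->ipc_lock);

	if (!ret)
		ret = tx_wait_done(ipc, msg, reply_data);

	mutex_unlock(&ipc->tx_mutex);
	return ret;
}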

@RanderWang

> @xiulipan
>
> IPC A IRQ comes to host
> IPC A IRQ thread start
> IPC B IRQ comes to host
>
> What exactly are IPC A and B in your description? We have a single IPC doorbell register so what mechanism would the DSP use to signal an IPC B?

@plbossart @xiulipan
The sequence is totally wrong! To discuss IPC issues, please first understand both the kernel and the FW. In the FW, before each message is sent, the IPC registers are checked to make sure the host has finished the last message, so all messages are serialized.

Please study Keyon's comments carefully.

@lyakh (Collaborator, Author) commented May 15, 2019

> @lyakh is there a recursive spin_lock in snd_sof_ipc_reply() with your change applied?

No, the spin-lock is removed from snd_sof_ipc_reply() by this patch; please check again.

@lyakh (Collaborator, Author) commented May 15, 2019

> So the race I can see is that the next host->DSP TX IPC in Thread C may happen before _ipc_dsp_done() is finished in Thread A

@keyonjie no, I don't think this should be happening. Host IPC sending is now fully serialised by the ipc->tx_mutex.

@keyonjie

> So the race I can see is that the next host->DSP TX IPC in Thread C may happen before _ipc_dsp_done() is finished in Thread A

> @keyonjie no, I don't think this should be happening. Host IPC sending is now fully serialised by the ipc->tx_mutex.

Have you checked my description about Thread A/B/C? I mean that before applying your PR, we have a race.

@lyakh (Collaborator, Author) commented May 15, 2019

> So the race I can see is that the next host->DSP TX IPC in Thread C may happen before _ipc_dsp_done() is finished in Thread A

> @keyonjie no, I don't think this should be happening. Host IPC sending is now fully serialised by the ipc->tx_mutex.

> Have you checked my description about Thread A/B/C? I mean that before applying your PR, we have a race.

@keyonjie yes, I don't think that scenario is possible, because thread B will be holding the mutex, so thread C will not be able to begin sending the next IPC.

@RanderWang commented May 15, 2019

> So the race I can see is that the next host->DSP TX IPC in Thread C may happen before _ipc_dsp_done() is finished in Thread A

> @keyonjie no, I don't think this should be happening. Host IPC sending is now fully serialised by the ipc->tx_mutex.

@lyakh It is very tricky, so let me explain my understanding. Your idea is correct: host IPC sending is now fully serialised by the ipc->tx_mutex. But the issue was the IPC register setting.

Without this PR, the error sequence was:
(1) a thread got the mutex and sent an IPC message, then waited for done.
(2) the host received the IPC reply message and invoked snd_sof_ipc_reply in the irq thread function.
(3) the waiting thread woke up and then released the mutex.
(4) the next thread sent an IPC again, setting the IPC register to send the message.
(5) for step (2), the irq thread went to cnl_ipc_dsp_done and set the IPC Done register.

Here (4) should happen after (5), because we should first set the IPC Done register and then send the IPC message. In the original code, before we sent an IPC message we checked the IPC Done register, do you remember?

Your mutex protects the sending of the IPC message but does not cover the IPC register setting. Now cnl_ipc_dsp_done is protected by the spin_lock, and step (4) has to wait until (5) finishes and releases the spin_lock, so this issue is resolved.

If you agree with me, please refine your commit message.

@RanderWang

@lyakh you can review Keyon's comment; he mentioned the Done bit was not correct in some IPC timeout cases.

Currently on all supported platforms the IPC IRQ thread first signals
the sender when an IPC response is received from the DSP, then
unmasks the IPC interrupt. Those actions are performed without
holding any locks, so the thread can be interrupted between them. IPC
timeouts have been observed in such scenarios: if the sender is woken
up and it proceeds with sending the next message without unmasking
the IPC interrupt, it can miss the next response. This patch takes a
spin-lock to prevent the IRQ thread from being preempted at that
point. It also makes sure, that the next IPC transmission by the host
cannot take place before the IRQ thread has finished updating all the
required IPC registers.

Signed-off-by: Guennadi Liakhovetski <guennadi.liakhovetski@linux.intel.com>
@lyakh (Collaborator, Author) commented May 15, 2019

> If you agree with me, please refine your commit message.

@RanderWang better now?

@jajanusz

@emilchudzik can u check ipc timeouts with this pr?

@RanderWang left a comment

LGTM

@plbossart (Member) left a comment

I am fine with this change, but I am far from the expert here, so I would like @keyonjie and possibly @ranj063 to approve so that we have a record of consensus.
It would also be good to have tests before merging.

@xiulipan

@plbossart @RanderWang
I think @keyonjie has given a more precise explanation than me; I neglected to mention the directions of the IPC.
The patch LGTM.

@xiulipan

@RanderWang Agreed that this race condition was once fixed by our old IPC status check, but after the IPC code was refined, that check was removed. Either way, this will sync our IPC status with our IPC thread.

@emilchudzik commented May 15, 2019

> @emilchudzik can u check ipc timeouts with this pr?

I see many approvals, so maybe the code looks good, but I strongly recommend to NOT MERGE.

Kernel based on 465c8e4 2019-05-15 07:31:34 .. Pierre-Louis Bossart ASoC: SOF: core: fix error handling with the probe workqueue
+this commit 29fbf66 ASoC: SOF: ipc: fix a race, leading to IPC timeouts

IPC timeout occurs after:

  • 8/10 reboots
  • reboot ok, but after S3 -> IPC timeout
  • reboot ok, but after aplay -> IPC timeout

So I mark repro: 10/10

@emilchudzik self-requested a review, May 15, 2019 21:32
@emilchudzik left a comment

Please check my comments.
From GK VAL - I have 100% (10/10) repro of IPC timeouts with this commit.

@xiulipan mentioned this pull request, May 18, 2019
@xiulipan

@plbossart @lgirdwood @ranj063 @emilchudzik
I strongly suggest merging this PR.
I tested this with #852 and could not see any timeout on my UP2 board with 2 million IPCs:

[34032.136537] sof-audio-pci 0000:00:0e.0: IPC Flood test duration: 100000ms
[34032.136539] sof-audio-pci 0000:00:0e.0: IPC Flood test count: 2157064, Avg response time: 61745ns

But without this PR, I can easily reproduce it in every stress test within 10000 IPCs.

I will update the test results on WHL and CFL later.

@xiulipan

Update on the WHL test results:

It always fails with this patch when trying to resume from D3; it needs @ranj063's #895 and thesofproject/sof#1354.

I suggest merging this PR first to fix one type of IPC issue, then figuring out what is missing on WHL.

[   62.764537] sof-audio-pci 0000:00:1f.3: FW Poll Status: reg=0xf0f0f0f successful
[   62.764548] sof-audio-pci 0000:00:1f.3: FW Poll Status: reg=0xf0f0f0e successful
[   62.764554] sof-audio-pci 0000:00:1f.3: unstall/run core: core_mask = 1
[   62.764560] sof-audio-pci 0000:00:1f.3: DSP core(s) enabled? 1 : core_mask 1
[   62.765461] sof-audio-pci 0000:00:1f.3: FW Poll Status: reg=0x80000000 successful
[   62.765469] sof-audio-pci 0000:00:1f.3: FW Poll Status: reg=0x1010e0e successful
[   62.765477] sof-audio-pci 0000:00:1f.3: FW Poll Status: reg=0x6000021 successful
[   62.765581] sof-audio-pci 0000:00:1f.3: error: no reply expected, received 0x0
[   65.765954] sof-audio-pci 0000:00:1f.3: FW Poll Status: reg=0x5000001 timedout
[   65.765965] sof-audio-pci 0000:00:1f.3: error: load fw failed ret: -110
[   65.765998] sof-audio-pci 0000:00:1f.3: unknown ROM status value 05000001
[   65.766005] sof-audio-pci 0000:00:1f.3: error: status = 0x00000000 panic = 0x00000000
[   65.766012] sof-audio-pci 0000:00:1f.3: unknown ROM status value 05000001
[   65.766019] sof-audio-pci 0000:00:1f.3: error: failed to reset DSP
[   65.766023] sof-audio-pci 0000:00:1f.3: error: failed to boot DSP firmware after resume -110
[   65.766031] sof-audio-pci 0000:00:1f.3: error: debugfs write failed to resume -110

On APL a 15-minute flood test passed:

[35453.006427] sof-audio-pci 0000:00:0e.0: IPC Flood test duration: 900000ms
[35453.006429] sof-audio-pci 0000:00:0e.0: IPC Flood test count: 19702935, Avg response time: 50521ns
[35453.006431] sof-audio-pci 0000:00:0e.0: IPC Flood test Max response time: 17427482ns
[35453.006432] sof-audio-pci 0000:00:0e.0: IPC Flood test Min response time: 29665ns

On CFL a 15-minute flood test passed:

[ 1113.485978] sof-audio-pci 0000:00:1f.3: IPC Flood test duration: 900000ms
[ 1113.485979] sof-audio-pci 0000:00:1f.3: IPC Flood test count: 14939385, Avg response time: 66822ns
[ 1113.485979] sof-audio-pci 0000:00:1f.3: IPC Flood test Max response time: 1043658ns
[ 1113.485980] sof-audio-pci 0000:00:1f.3: IPC Flood test Min response time: 28362ns

@xiulipan

Update for WHL: enabling pulseaudio or playing something through an HDA PCM will cause the boot timeout issue. Disable pulseaudio for flood IPC testing with this patch.

WHL can also pass the 15-minute flood test.

@plbossart (Member)

@xiulipan this is good information, thank you. It's the first time we have a clear result that this patch improves things. Will merge now, let's debug WHL separately.
