fix the deadlock issue when mailbox trace is configured. #4700

keyonjie · 2021-08-31T06:29:30Z

trace: don't hold trace->lock during dma_trace_on/off

dma_trace_on() and dma_trace_off() call trace functions internally, so we
must not call them with trace->lock held, otherwise we deadlock.

lgirdwood

Good find @keyonjie - I agree we have a race between the trace work being scheduled and the trace_off() call. I've had a quick check and it's made even more complex by the following:

1. We have a trace->enabled flag and a dma_trace_data->enabled flag; we should only have one flag.
2. It does not look like we are flushing when we turn trace off (so we are missing the last data).
3. The trace_work() is not checking the trace->enabled flag at entry. Checking it would be a big simplification against scheduling races, i.e.

if (!trace->enabled)
    return SOF_TASK_STATE_CANCEL;

If we fix 1, 2 & 3 then we do not need IRQ off locking around the trace_on() and trace_off() calls.
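
The entry check suggested in point 3 could look roughly like the sketch below. It is a self-contained illustration only, not the SOF code: `struct trace_ctx`, `enum task_state` and `trace_work_sketch()` are hypothetical names, and a plain `bool` stands in for whatever flag type and memory ordering the firmware would actually need.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the scheduler return codes. */
enum task_state { TASK_STATE_RESCHEDULE, TASK_STATE_CANCEL };

/* Hypothetical trace context; not the SOF struct layout. */
struct trace_ctx {
	bool enabled;      /* cleared by the "trace off" path */
	unsigned int runs; /* how many times the work body actually ran */
};

/* Periodic work: bail out at entry if tracing was switched off. */
static enum task_state trace_work_sketch(struct trace_ctx *t)
{
	assert(t != NULL);

	if (!t->enabled)
		return TASK_STATE_CANCEL; /* a late-scheduled run does nothing */

	t->runs++; /* placeholder for copying trace data out */
	return TASK_STATE_RESCHEDULE;
}
```

With this shape, a work item that was already queued when the flag is cleared simply cancels itself instead of racing with the off path.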

plbossart · 2021-08-31T13:15:57Z

src/trace/dma-trace.c

+
 	schedule_task(&trace_data->dmat_work, DMA_TRACE_PERIOD,
 		      DMA_TRACE_PERIOD);
+	trace_data->enabled = 1;


You're setting this flag protected by the spinlock, but you test it on line 480 without any protection.
Is this intentional?

plbossart · 2021-08-31T13:19:41Z

src/trace/trace.c

-	spin_lock_irq(&trace->lock, flags);
-
-	trace->enable = 1;
+	/* should not do this with trace->lock held as there is trace calling internal */


I was not able to parse the sentence above. 'there is trace calling internal' -> missing a complement. Or that was meant to be 'trace calling internally', which doesn't make any more sense.

plbossart · 2021-08-31T13:20:06Z

src/trace/trace.c

 	dma_trace_on();

+	spin_lock_irq(&trace->lock, flags);
+	trace->enable = 1;


so the trace is enabled after the 'dma trace'. that seems surprising?

plbossart · 2021-08-31T13:20:25Z

src/trace/trace.c

-	spin_lock_irq(&trace->lock, flags);
-
-	trace->enable = 0;
+	/* should not do this with trace->lock held as there is trace calling internal */


same, this comment needs to be reworded.

lyakh · 2021-09-01T09:50:11Z

src/trace/dma-trace.c

 		return;

-	trace_data->enabled = 1;
+	spin_lock_irq(&trace_data->lock, flags);


Do we have a good understanding of what this lock is protecting, vs. what the other trace lock is protecting? The use of this lock doesn't seem to be fully consistent. E.g. in dtrace_event() this lock also covers the test of .copy_in_progress, but not in a consistent way: neither of the two locations where .copy_in_progress is set to 1 is protected by that lock, so holding the lock while testing it doesn't help. Why do we have to lock here at all?

keyonjie · 2021-09-01T09:58:55Z

> Good find @keyonjie - I agree we have a race between the trace work being scheduled and the trace_off() call. I've had a quick check and it's made even more complex by the following:
>
> 1. We have a trace->enabled flag and a dma_trace_data->enabled flag; we should only have one flag.
> 2. It does not look like we are flushing when we turn trace off (so we are missing the last data).
> 3. The trace_work() is not checking the trace->enabled flag at entry. Checking it would be a big simplification against scheduling races, i.e.
>
>     if (!trace->enabled)
>         return SOF_TASK_STATE_CANCEL;
>
> If we fix 1, 2 & 3 then we do not need IRQ off locking around the trace_on() and trace_off() calls.

We could have different use cases, e.g. disabling the DMA trace while the mailbox trace is enabled; that's why we need two separate 'enabled' flags.

The race here is not at the flushing point. The schedule_task() call in dma_trace_on() will try to log something with trace_info(), which needs to take trace->lock again:

	schedule_task(&trace_data->dmat_work, DMA_TRACE_PERIOD,
		      DMA_TRACE_PERIOD);

#if CONFIG_TRACEM
	/* send event by mail box too. */
	if (send_atomic) {
		mtrace_event((const char *)data, MESSAGE_SIZE(arg_count));
	} else {
		spin_lock_irq(&trace->lock, flags); //deadlock happens here
		mtrace_event((const char *)data, MESSAGE_SIZE(arg_count));
		spin_unlock_irq(&trace->lock, flags);
	}
#else
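
The re-entrancy described above can be modelled with a toy, single-threaded sketch. Everything here is hypothetical (`toy_lock`, `toy_log()`, `toy_dma_on()` and the two `toy_trace_on_*()` variants are illustration names, not SOF functions): instead of spinning forever, the toy lock just counts attempts to take a lock that is already held, which is where a real non-recursive spinlock would hang.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy non-recursive lock plus counters; not the SOF data structures. */
struct toy_trace {
	bool lock_held;     /* models trace->lock being owned */
	bool enable;        /* models the trace enable flag */
	int would_deadlock; /* attempts to take the lock while it is held */
	int logged;         /* log events that got through */
};

static bool toy_lock_acquire(struct toy_trace *t)
{
	if (t->lock_held) {
		t->would_deadlock++; /* a real spinlock would spin forever here */
		return false;
	}
	t->lock_held = true;
	return true;
}

static void toy_lock_release(struct toy_trace *t)
{
	assert(t->lock_held);
	t->lock_held = false;
}

/* Logging path that takes the trace lock, like the non-atomic mailbox branch. */
static void toy_log(struct toy_trace *t)
{
	if (!toy_lock_acquire(t))
		return;
	t->logged++;
	toy_lock_release(t);
}

/* Stand-in for dma_trace_on(): it logs internally. */
static void toy_dma_on(struct toy_trace *t)
{
	toy_log(t);
}

/* Problem pattern: call into the DMA trace path with the lock held. */
static void toy_trace_on_locked(struct toy_trace *t)
{
	(void)toy_lock_acquire(t);
	t->enable = true;
	toy_dma_on(t); /* re-enters the lock via toy_log() */
	toy_lock_release(t);
}

/* Pattern proposed in this PR: do the DMA trace call first, then lock. */
static void toy_trace_on_unlocked(struct toy_trace *t)
{
	toy_dma_on(t);
	(void)toy_lock_acquire(t);
	t->enable = true;
	toy_lock_release(t);
}
```

In the locked variant the log call finds the lock already held and the event is lost (in firmware, the core would hang); in the unlocked variant the log call completes before the lock is taken.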

lgirdwood · 2021-09-01T11:26:36Z

> We could have different use cases, e.g. disabling the DMA trace while the mailbox trace is enabled; that's why we need two separate 'enabled' flags.

OK, so we don't need 1, but 2 & 3 are needed.

> The race here is not at the flushing point. The schedule_task() call in dma_trace_on() will try to log something with trace_info(), which needs to take trace->lock again.

Doing 3 above means no locking is needed for the on/off calls.

> 	schedule_task(&trace_data->dmat_work, DMA_TRACE_PERIOD,
> 		      DMA_TRACE_PERIOD);
>
> #if CONFIG_TRACEM
> 	/* send event by mail box too. */
> 	if (send_atomic) {
> 		mtrace_event((const char *)data, MESSAGE_SIZE(arg_count));
> 	} else {
> 		spin_lock_irq(&trace->lock, flags); //deadlock happens here
> 		mtrace_event((const char *)data, MESSAGE_SIZE(arg_count));
> 		spin_unlock_irq(&trace->lock, flags);
> 	}
> #else

Oh, this code looks wrong. We are entering atomic context (no IRQs) if (!send_atomic)??

keyonjie · 2021-09-14T03:08:49Z

> Oh, this code looks wrong. We are entering atomic context (no IRQs) if (!send_atomic)??

So it looks like we are doing the opposite here?

marc-hb

I'm sure there are plenty of valid points in this discussion but let's please first do a couple of small reverts and get back to the "last known good state", fixing both #4676 and #4699 with a single line. The two recent commits that I'm reverting in #4760 started a "chain reaction" that may end up completely refactoring the existing trace and I believe we don't want that. See the longer commit message in #4760.

> Oh, this code looks wrong. We are entering atomic context (no IRQs) if (!send_atomic)??

send_atomic is very confusing, see revert of very old confusion in #4246

keyonjie · 2021-09-16T08:46:51Z

Closing this, as @marc-hb has gone with #4760.

keyonjie requested a review from akloniex as a code owner August 31, 2021 06:29

keyonjie requested review from marc-hb and ranj063 August 31, 2021 06:29

keyonjie mentioned this pull request Aug 31, 2021

[BUG] FW boot failure with SOF main when trace mailbox CONFIG_TRACEM is enabled #4699

Closed

paulstelian97 approved these changes Aug 31, 2021

lgirdwood reviewed Aug 31, 2021

plbossart requested changes Aug 31, 2021

lyakh reviewed Sep 1, 2021

keyonjie added 2 commits September 14, 2021 11:59

dma-trace: hold the lock during switching on/off the trace

eca9f70

Hold the dma-trace lock when performing on/off switching, to make sure the status is consistent.

Signed-off-by: Keyon Jie <yang.jie@linux.intel.com>

trace: don't hold trace->lock during dma_trace_on/off

d639e5e

As there is trace calling in the dma_trace_on/off() internal, we should not do that with trace->lock held, to avoid deadlock.

Signed-off-by: Keyon Jie <yang.jie@linux.intel.com>

keyonjie force-pushed the ipc-fix branch from 47679c4 to d639e5e Compare September 14, 2021 04:00

marc-hb requested changes Sep 14, 2021

keyonjie closed this Sep 16, 2021
