Skip to content

Conversation

@keyonjie
Copy link
Contributor

trace: don't hold trace->lock during dma_trace_on/off

As there is trace calling in the dma_trace_on/off() internal, we should
not do that with trace->lock held, to avoid deadlock.

Copy link
Member

@lgirdwood lgirdwood left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good find @keyonjie - I agree we have a race between the trace work being scheduled and the the trace_off() call. I've had a quick check and Its made even more complex due to

  1. We have a trace->enabled flag and a dma_trace_data->enabled flag, we should only have a one flag.
  2. It does not look like we are flushing when we turn trace off (so missing the last data).
  3. The trace_work() is not checking trace->enabled flag at entry. This would be a big simplification against scheduling races. i.e.
if (!trace->enabled)
    return SOF_TASK_STATE_CANCEL;

If we fix 1,2 & 3 then we do not need IRQ off locking around the trace_on() and trace_off() calls.


schedule_task(&trace_data->dmat_work, DMA_TRACE_PERIOD,
DMA_TRACE_PERIOD);
trace_data->enabled = 1;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you're setting this flag protected by the spinlock, but you test it on like 480 without any protection.
Is this intentional?

spin_lock_irq(&trace->lock, flags);

trace->enable = 1;
/* should not do this with trace->lock held as there is trace calling internal */
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was not able to parse the sentence above. 'there is trace calling internal' -> missing a complement. Or that was 'trace calling internally', which doesn't make more sense.

dma_trace_on();

spin_lock_irq(&trace->lock, flags);
trace->enable = 1;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so the trace is enabled after the 'dma trace'. that seems surprising?

spin_lock_irq(&trace->lock, flags);

trace->enable = 0;
/* should not do this with trace->lock held as there is trace calling internal */
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same, this comment needs to be reworded.

return;

trace_data->enabled = 1;
spin_lock_irq(&trace_data->lock, flags);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we have a good understanding of what this lock is protecting, vs. what the other trace lock is protecting? The use of this lock doesn't seem to be fully consistent. E.g. dtrace_event() this lock is also covering the test of .copy_in_progress, but not in a consistent way, because both locations where .copy_in_progress is set to 1 aren't protected by that lock, so, holding the lock while testing it doesn't help. Why do we have to lock here at all?

@keyonjie
Copy link
Contributor Author

keyonjie commented Sep 1, 2021

Good find @keyonjie - I agree we have a race between the trace work being scheduled and the the trace_off() call. I've had a quick check and Its made even more complex due to

  1. We have a trace->enabled flag and a dma_trace_data->enabled flag, we should only have a one flag.
  2. It does not look like we are flushing when we turn trace off (so missing the last data).
  3. The trace_work() is not checking trace->enabled flag at entry. This would be a big simplification against scheduling races. i.e.
if (!trace->enabled)
    return SOF_TASK_STATE_CANCEL;

If we fix 1,2 & 3 then we do not need IRQ off locking around the trace_on() and trace_off() calls.

we could have different use cases, e.g. disable dma trace while mailbox trace is enabled, that's why we need 2 'enabled' flags correspondingly.

The race here is not on the flushing point, the schedule_task() calling in the dma_trace_on() will try to logging out something with trace_info(), which will require to hold the trace->log again.

	schedule_task(&trace_data->dmat_work, DMA_TRACE_PERIOD,
		      DMA_TRACE_PERIOD);
#if CONFIG_TRACEM
	/* send event by mail box too. */
	if (send_atomic) {
		mtrace_event((const char *)data, MESSAGE_SIZE(arg_count));
	} else {
		spin_lock_irq(&trace->lock, flags); //deadlock happens here
		mtrace_event((const char *)data, MESSAGE_SIZE(arg_count));
		spin_unlock_irq(&trace->lock, flags);
	}
#else

@lgirdwood
Copy link
Member

Good find @keyonjie - I agree we have a race between the trace work being scheduled and the the trace_off() call. I've had a quick check and Its made even more complex due to

  1. We have a trace->enabled flag and a dma_trace_data->enabled flag, we should only have a one flag.
  2. It does not look like we are flushing when we turn trace off (so missing the last data).
  3. The trace_work() is not checking trace->enabled flag at entry. This would be a big simplification against scheduling races. i.e.
if (!trace->enabled)
    return SOF_TASK_STATE_CANCEL;

If we fix 1,2 & 3 then we do not need IRQ off locking around the trace_on() and trace_off() calls.

we could have different use cases, e.g. disable dma trace while mailbox trace is enabled, that's why we need 2 'enabled' flags correspondingly.

ok, so we dont need 1 - but 2 & 3 are needed.

The race here is not on the flushing point, the schedule_task() calling in the dma_trace_on() will try to logging out something with trace_info(), which will require to hold the trace->log again.

Doing 3 above means no locking for on/off calls..

	schedule_task(&trace_data->dmat_work, DMA_TRACE_PERIOD,
		      DMA_TRACE_PERIOD);
#if CONFIG_TRACEM
	/* send event by mail box too. */
	if (send_atomic) {
		mtrace_event((const char *)data, MESSAGE_SIZE(arg_count));
	} else {
		spin_lock_irq(&trace->lock, flags); //deadlock happens here
		mtrace_event((const char *)data, MESSAGE_SIZE(arg_count));
		spin_unlock_irq(&trace->lock, flags);
	}
#else

Oh, this code looks wrong. We are entering atomic context (no IRQs) if (!send_atomic) ??

@keyonjie
Copy link
Contributor Author

Oh, this code looks wrong. We are entering atomic context (no IRQs) if (!send_atomic) ??

So looks we are doing things oppositely here?

Hold the dma-trace lock when performing on/off switching, to make sure
the status is consistent.

Signed-off-by: Keyon Jie <yang.jie@linux.intel.com>
As there is trace calling in the dma_trace_on/off() internal, we should
not do that with trace->lock held, to avoid deadlock.

Signed-off-by: Keyon Jie <yang.jie@linux.intel.com>
Copy link
Collaborator

@marc-hb marc-hb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm sure there are plenty valid points in this discussion but let's please first do a couple small reverts and get back to the "last known good state", fixing both #4676 and #4699 with a single line. The two recent commits that I'm reverting in #4760 started a "chain reaction" that may end up in completely refactoring the existing trace and I believe we don't want that. See longer commit message in #4760.

Oh, this code looks wrong. We are entering atomic context (no IRQs) if (!send_atomic) ??

send_atomic is very confusing, see revert of very old confusion in #4246

@keyonjie
Copy link
Contributor Author

close this as @marc-hb has gone with #4760

@keyonjie keyonjie closed this Sep 16, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants