[Not For Merge] PR to demonstrate trace crash issue #4683

keyonjie · 2021-08-26T02:08:37Z

The PR is created for demonstrating a potential trace issue: if we try to use trace_xx() to print out logs before the trace is initialized, the FW will crash and we will see "firmware boot failure" from the dmesg.

@marc-hb FYI.

Hold the dma-trace lock when performing on/off switching, to make sure the status is consistent. Signed-off-by: Keyon Jie <yang.jie@linux.intel.com>

As there is trace calling in the dma_trace_on/off() internal, we should not do that with trace->lock held, to avoid deadlock. Signed-off-by: Keyon Jie <yang.jie@linux.intel.com>

marc-hb · 2021-09-14T05:00:31Z

I'm not sure what this proved. #4759 finally demonstrates the issue, thanks.

This reverts commit 7df3674. This restores the ability to use CONFIG_TRACEM (copy everything to mailbox) without crashing, in other words it fixes thesofproject#4699 This also fixes the other DSP panic thesofproject#4676 and removes the need for logical changes in thesofproject#4678, which can be reverted too. commit 7df3674 ("trace: enable trace after it is ready") was meant to fix a crash when tr_xxx() was used early. However I've used very early tracing for months and it never caused any crash (see thesofproject#4334) I tried adding a tr_err() statement immediately after trace_init(sof) in primary_core_init() and it works just fine. primary_core_init() runs extremely early so I don't think it's too demanding not to use an tr_XXX() before the trace even exists. The reverted commits confused initializing and enabling. Reproduction thesofproject#4683 did not seem to demonstrate anything obvious, there's not even a link to a failed test run. I don't understand how playing with spin locks is relevant to this. Later, reproduction thesofproject#4759 finally demonstrated the real issue: through DEBUG_TRACE_PTR(), some tr_XXX() can indeed be called (in very unusal debug circumstances specific to the original author) before the trace is initialized. The previous commit in this series fixes that by simply guarding it with if(trace_get()) -------- I am _not_ pretending that these reverts make the tracing code bug-free and perfect again, absolutely not and very far from it. I'm merely saying that: - The first reverted commit caused at least two regressions: thesofproject#4676 and thesofproject#4699 - These two commits added yet another variable (time) in an already complex situation with an already existing combinatorial "explosion": compile-time Kconfigs, run-time settings, platform-specific bugs (thesofproject#4333, thesofproject#4573, ...), various races, mbox + DMA, different DMA engines, Zephyr vs XTOS, etc. - Last but not least, we don't want to invest in making the exist trace implementation better. We want to switch to the Zephyr implementation instead So let's go back to a previous known good state, I mean _relatively_ good and stay there if we can. Signed-off-by: Marc Herbert <marc.herbert@intel.com>

This reverts commit 7df3674. This restores the ability to use CONFIG_TRACEM (copy everything to mailbox) without crashing, in other words it fixes #4699 This also fixes the other DSP panic #4676 and removes the need for logical changes in #4678, which can be reverted too. commit 7df3674 ("trace: enable trace after it is ready") was meant to fix a crash when tr_xxx() was used early. However I've used very early tracing for months and it never caused any crash (see #4334) I tried adding a tr_err() statement immediately after trace_init(sof) in primary_core_init() and it works just fine. primary_core_init() runs extremely early so I don't think it's too demanding not to use an tr_XXX() before the trace even exists. The reverted commits confused initializing and enabling. Reproduction #4683 did not seem to demonstrate anything obvious, there's not even a link to a failed test run. I don't understand how playing with spin locks is relevant to this. Later, reproduction #4759 finally demonstrated the real issue: through DEBUG_TRACE_PTR(), some tr_XXX() can indeed be called (in very unusal debug circumstances specific to the original author) before the trace is initialized. The previous commit in this series fixes that by simply guarding it with if(trace_get()) -------- I am _not_ pretending that these reverts make the tracing code bug-free and perfect again, absolutely not and very far from it. I'm merely saying that: - The first reverted commit caused at least two regressions: #4676 and #4699 - These two commits added yet another variable (time) in an already complex situation with an already existing combinatorial "explosion": compile-time Kconfigs, run-time settings, platform-specific bugs (#4333, #4573, ...), various races, mbox + DMA, different DMA engines, Zephyr vs XTOS, etc. - Last but not least, we don't want to invest in making the exist trace implementation better. We want to switch to the Zephyr implementation instead So let's go back to a previous known good state, I mean _relatively_ good and stay there if we can. Signed-off-by: Marc Herbert <marc.herbert@intel.com>

This reverts commit 7df3674. This restores the ability to use CONFIG_TRACEM (copy everything to mailbox) without crashing, in other words it fixes thesofproject#4699 This also fixes the other DSP panic thesofproject#4676 and removes the need for logical changes in thesofproject#4678, which can be reverted too. commit 7df3674 ("trace: enable trace after it is ready") was meant to fix a crash when tr_xxx() was used early. However I've used very early tracing for months and it never caused any crash (see thesofproject#4334) I tried adding a tr_err() statement immediately after trace_init(sof) in primary_core_init() and it works just fine. primary_core_init() runs extremely early so I don't think it's too demanding not to use an tr_XXX() before the trace even exists. The reverted commits confused initializing and enabling. Reproduction thesofproject#4683 did not seem to demonstrate anything obvious, there's not even a link to a failed test run. I don't understand how playing with spin locks is relevant to this. Later, reproduction thesofproject#4759 finally demonstrated the real issue: through DEBUG_TRACE_PTR(), some tr_XXX() can indeed be called (in very unusal debug circumstances specific to the original author) before the trace is initialized. The previous commit in this series fixes that by simply guarding it with if(trace_get()) -------- I am _not_ pretending that these reverts make the tracing code bug-free and perfect again, absolutely not and very far from it. I'm merely saying that: - The first reverted commit caused at least two regressions: thesofproject#4676 and thesofproject#4699 - These two commits added yet another variable (time) in an already complex situation with an already existing combinatorial "explosion": compile-time Kconfigs, run-time settings, platform-specific bugs (thesofproject#4333, thesofproject#4573, ...), various races, mbox + DMA, different DMA engines, Zephyr vs XTOS, etc. - Last but not least, we don't want to invest in making the exist trace implementation better. We want to switch to the Zephyr implementation instead So let's go back to a previous known good state, I mean _relatively_ good and stay there if we can. Signed-off-by: Marc Herbert <marc.herbert@intel.com> (cherry picked from commit f2c13f5)

This reverts commit 7df3674. This restores the ability to use CONFIG_TRACEM (copy everything to mailbox) without crashing, in other words it fixes #4699 This also fixes the other DSP panic #4676 and removes the need for logical changes in #4678, which can be reverted too. commit 7df3674 ("trace: enable trace after it is ready") was meant to fix a crash when tr_xxx() was used early. However I've used very early tracing for months and it never caused any crash (see #4334) I tried adding a tr_err() statement immediately after trace_init(sof) in primary_core_init() and it works just fine. primary_core_init() runs extremely early so I don't think it's too demanding not to use an tr_XXX() before the trace even exists. The reverted commits confused initializing and enabling. Reproduction #4683 did not seem to demonstrate anything obvious, there's not even a link to a failed test run. I don't understand how playing with spin locks is relevant to this. Later, reproduction #4759 finally demonstrated the real issue: through DEBUG_TRACE_PTR(), some tr_XXX() can indeed be called (in very unusal debug circumstances specific to the original author) before the trace is initialized. The previous commit in this series fixes that by simply guarding it with if(trace_get()) -------- I am _not_ pretending that these reverts make the tracing code bug-free and perfect again, absolutely not and very far from it. I'm merely saying that: - The first reverted commit caused at least two regressions: #4676 and #4699 - These two commits added yet another variable (time) in an already complex situation with an already existing combinatorial "explosion": compile-time Kconfigs, run-time settings, platform-specific bugs (#4333, #4573, ...), various races, mbox + DMA, different DMA engines, Zephyr vs XTOS, etc. - Last but not least, we don't want to invest in making the exist trace implementation better. We want to switch to the Zephyr implementation instead So let's go back to a previous known good state, I mean _relatively_ good and stay there if we can. Signed-off-by: Marc Herbert <marc.herbert@intel.com> (cherry picked from commit f2c13f5)

keyonjie requested review from akloniex, bardliao, dbaluta, lbetlej, lgirdwood, libinyang, mmaka1 and plbossart as code owners August 26, 2021 02:08

keyonjie marked this pull request as draft August 26, 2021 02:08

keyonjie mentioned this pull request Aug 26, 2021

trace: enable trace after it is ready #4636

Merged

keyonjie added 2 commits August 31, 2021 14:19

dma-trace: hold the lock during switching on/off the trace

de7119f

Hold the dma-trace lock when performing on/off switching, to make sure the status is consistent. Signed-off-by: Keyon Jie <yang.jie@linux.intel.com>

trace: don't hold trace->lock during dma_trace_on/off

47679c4

As there is trace calling in the dma_trace_on/off() internal, we should not do that with trace->lock held, to avoid deadlock. Signed-off-by: Keyon Jie <yang.jie@linux.intel.com>

keyonjie force-pushed the ipc-fix branch from 987c531 to 47679c4 Compare August 31, 2021 06:26

keyonjie closed this Aug 31, 2021

This was referenced Sep 14, 2021

[Not For Merge] PR to demonstrate trace crash issue #4759

Closed

[DRAFT][TEST][DNM] Super early trace #4758

Closed

marc-hb mentioned this pull request Sep 14, 2021

Revert trace enable commits, fix DEBUG_TRACE_PTR() macro so it can be used early #4760

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[Not For Merge] PR to demonstrate trace crash issue #4683

[Not For Merge] PR to demonstrate trace crash issue #4683

Uh oh!

keyonjie commented Aug 26, 2021

Uh oh!

marc-hb commented Sep 14, 2021

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

[Not For Merge] PR to demonstrate trace crash issue #4683

[Not For Merge] PR to demonstrate trace crash issue #4683

Uh oh!

Conversation

keyonjie commented Aug 26, 2021

Uh oh!

marc-hb commented Sep 14, 2021

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants