Revert trace changes to fix CML suspend resume #4573

plbossart · 2021-07-29T18:28:20Z

Experiments with #4566 show that the variadic argument (va_list) usage and/or the extra indirection (calling log_func() from within _log_sofdict() instead of calling it directly) cause surprising DMA issues and timeouts on CML.

There is just no explanation so far, and after 4 days of shooting in the dark, it's time to consider reverting 0def905 ("trace: move CONFIG_TRACEM implementation up a couple levels").
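The two call shapes mentioned above can be sketched in plain C. This is a hypothetical illustration, not SOF's log_func() or _log_sofdict(): `log_direct()`, `log_indirect()`, `vlog_two_sinks()` and the two sink buffers are invented names. It shows one well-known constraint when a va_list is forwarded through an extra layer: the list can be walked only once, so every additional pass needs its own va_copy(). It is not offered as an explanation of the CML failures.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-ins for two trace destinations (e.g. mailbox and DMA). */
char sink_a[64];
char sink_b[64];

/* Direct shape: the variadic entry point formats straight into one sink. */
void log_direct(const char *fmt, ...)
{
	va_list ap;

	assert(fmt != NULL);
	va_start(ap, fmt);
	vsnprintf(sink_a, sizeof(sink_a), fmt, ap);
	va_end(ap);
}

/* Extra layer receiving an already-started va_list. A va_list can be walked
 * only once: after vsnprintf() consumes `ap`, its value is indeterminate, so
 * the second sink must format from its own va_copy(). */
void vlog_two_sinks(const char *fmt, va_list ap)
{
	va_list ap2;

	va_copy(ap2, ap);
	vsnprintf(sink_a, sizeof(sink_a), fmt, ap);
	vsnprintf(sink_b, sizeof(sink_b), fmt, ap2);
	va_end(ap2);
}

/* Indirect shape: variadic wrapper -> va_list layer -> formatter. */
void log_indirect(const char *fmt, ...)
{
	va_list ap;

	assert(fmt != NULL);
	va_start(ap, fmt);
	vlog_two_sinks(fmt, ap);
	va_end(ap);
}
```

Both shapes produce the same text here; the indirect one simply has to manage va_list lifetimes explicitly, which is what #4574 ("trace: avoid passing va_list") proposed avoiding.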

This reverts commit 57ee04f.

Signed-off-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com>

This reverts commit 0def905.

Signed-off-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com>

marc-hb

[Please do NOT close #4558 after this revert is merged]

I'm pretty sure this is just "shooting the messenger", not least because:

- the much older v1.8 tag compiled with GCC10 fails the same way (not "about the same" but exactly the same)
- why would this super generic change fail only on CML and only on suspend/resume?
- #4558 has the most details, see also links from there.

However, getting the tests to pass is important, so: approved.

Unfortunately, I'm on a long vacation abroad for the next 3 weeks, so I won't have time to investigate issue #4558 soon. People who need sof-logger on Zephyr should re-apply these commits locally.

@keqiaozhang don't enable sof-logger in Zephyr CI yet, after these reverts it won't work.

plbossart · 2021-07-29T21:47:19Z

rather spectacular failure on CML_HELIOS https://sof-ci.01.org/sofpr/PR4573/build9818/devicetest/

wow. just wow. let's redo the tests....

plbossart · 2021-07-29T21:47:27Z

SOFCI TEST

plbossart · 2021-07-29T22:39:06Z

The failing tests on CML_HELIOS are due to a new PR test (multiple_pause_resume) added by @fredoh9. This test fails because of another issue, described in #4560 and likely introduced by 9a7a5ce17dcb.

We should probably revert this commit as well to bring CML back to normal.

ranj063 · 2021-07-29T22:45:56Z

SOFCI TEST

plbossart · 2021-07-30T17:19:47Z

replaced by PR #4578

This reverts commit 7df3674.

This restores the ability to use CONFIG_TRACEM (copy everything to mailbox) without crashing; in other words, it fixes #4699. This also fixes the other DSP panic #4676 and removes the need for logical changes in #4678, which can be reverted too.

Commit 7df3674 ("trace: enable trace after it is ready") was meant to fix a crash when tr_xxx() was used early. However, I've used very early tracing for months and it never caused any crash (see #4334). I tried adding a tr_err() statement immediately after trace_init(sof) in primary_core_init() and it works just fine. primary_core_init() runs extremely early, so I don't think it's too demanding not to use a tr_XXX() before the trace even exists. The reverted commits confused initializing and enabling.

Reproduction #4683 did not seem to demonstrate anything obvious; there's not even a link to a failed test run. I don't understand how playing with spin locks is relevant to this.

Later, reproduction #4759 finally demonstrated the real issue: through DEBUG_TRACE_PTR(), some tr_XXX() can indeed be called (in very unusual debug circumstances specific to the original author) before the trace is initialized. The previous commit in this series fixes that by simply guarding it with if(trace_get()).

--------

I am _not_ pretending that these reverts make the tracing code bug-free and perfect again, absolutely not and very far from it. I'm merely saying that:

- The first reverted commit caused at least two regressions: #4676 and #4699.
- These two commits added yet another variable (time) to an already complex situation with an already existing combinatorial "explosion": compile-time Kconfigs, run-time settings, platform-specific bugs (#4333, #4573, ...), various races, mbox + DMA, different DMA engines, Zephyr vs XTOS, etc.
- Last but not least, we don't want to invest in making the existing trace implementation better. We want to switch to the Zephyr implementation instead.

So let's go back to a previous known good state, I mean _relatively_ good, and stay there if we can.

Signed-off-by: Marc Herbert <marc.herbert@intel.com>
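The commit message distinguishes initializing the trace (its context exists) from enabling it (entries are actually recorded), and fixes the early DEBUG_TRACE_PTR() call by guarding it with if(trace_get()). A minimal C sketch of that distinction follows. Only the name `trace_get()` comes from the message and its body here is assumed; `struct trace`, `trace_init()`, `trace_enable()`, `trace_log()` and `DEBUG_TRACE_SKETCH()` are hypothetical stand-ins, not SOF's actual code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical trace context, NOT the real SOF structure. */
struct trace {
	bool enabled;       /* run-time switch, can change at any time */
	unsigned int count; /* number of entries recorded so far */
};

static struct trace trace_storage;
static struct trace *trace_ctx; /* stays NULL until trace_init() runs */

struct trace *trace_get(void)
{
	return trace_ctx;
}

/* Initializing: make the context exist (once, very early at boot). */
void trace_init(void)
{
	trace_storage.enabled = false;
	trace_storage.count = 0;
	trace_ctx = &trace_storage;
}

/* Enabling: a separate, later decision about whether to record anything. */
void trace_enable(bool on)
{
	struct trace *t = trace_get();

	if (t)
		t->enabled = on;
}

/* tr_XXX()-like logger: assumes trace_init() already ran. */
void trace_log(const char *msg)
{
	struct trace *t = trace_get();

	assert(t != NULL); /* calling this before trace_init() is a bug */
	(void)msg;         /* a real logger would copy msg into a buffer */
	if (t->enabled)
		t->count++;
}

/* Early-callable debug hook guarded with if (trace_get()): before
 * initialization it does nothing instead of touching a NULL context. */
#define DEBUG_TRACE_SKETCH(msg)         \
	do {                            \
		if (trace_get())        \
			trace_log(msg); \
	} while (0)
```

In this sketch, DEBUG_TRACE_SKETCH() is a no-op before trace_init(); afterwards it reaches trace_log(), which only records entries once trace_enable(true) has been called. Keeping the NULL guard at the one early call site, rather than delaying initialization, mirrors the "initializing vs enabling" point above.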

The same message also appears on a cherry-picked copy of this commit (cherry picked from commit f2c13f5).

plbossart added 2 commits July 27, 2021 17:02

Revert "trace: add _log_nodict() and enable DMA trace for Zephyr"

e7977ee

This reverts commit 57ee04f.

Signed-off-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com>

Revert "trace: move CONFIG_TRACEM implementation up a couple levels"

e7e1eb6

This reverts commit 0def905.

Signed-off-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com>

plbossart requested review from akloniex, dbaluta, lbetlej, lgirdwood and mmaka1 as code owners July 29, 2021 18:28

plbossart requested review from lyakh, marc-hb and ranj063 July 29, 2021 18:28

plbossart mentioned this pull request Jul 29, 2021

[TEST] decompose "trace: move CONFIG_TRACEM implementation up a couple levels" #4566

Closed

marc-hb requested review from cujomalainey, iuliana-prodan and kv2019i July 29, 2021 18:42

marc-hb approved these changes Jul 29, 2021

ranj063 approved these changes Jul 29, 2021

ranj063 mentioned this pull request Jul 29, 2021

trace: avoid passing va_list #4574

Closed

plbossart closed this Jul 30, 2021

marc-hb mentioned this pull request Sep 14, 2021

Revert trace enable commits, fix DEBUG_TRACE_PTR() macro so it can be used early #4760

Merged

3 participants
