Skip to content

Conversation

@plbossart
Copy link
Member

@plbossart plbossart commented Jul 29, 2021

Experiments with #4566 show that the va_args usage and/or the indirection from using log_func() to calling log_func() within _log_sofdict() causes surprising DMA issues and timeouts on CML.

There is just no explanation so far, and after 4 days of shooting in the dark it's time to consider a revert of 0def905 "trace: move CONFIG_TRACEM implementation up a couple levels"

This reverts commit 57ee04f.

Signed-off-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com>
This reverts commit 0def905.

Signed-off-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com>
Copy link
Collaborator

@marc-hb marc-hb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Please do NOT close #4558 after this revert is merged]

I'm pretty sure this is just "shooting the messenger", not the least because:

  • the much older v1.8 tag compiled with GCC10 fails the same (not "about the same" but exactly the same)
  • why would this super generic change fail only on CML and only on suspend/resume?
    #4558 has the most details, see also links from there.

However getting the tests to pass is important so: approved.

Unfortunately I'm on long vacation abroad for the next 3 weeks so I won't have time to investigate this #4558 issue soon. People who need sof-logger on Zephyr should re-apply these locally.

@keqiaozhang don't enable sof-logger in Zephyr CI yet, after these reverts it won't work.

@plbossart
Copy link
Member Author

rather spectacular failure on CML_HELIOS https://sof-ci.01.org/sofpr/PR4573/build9818/devicetest/

wow. just wow. let's redo the tests....

@plbossart
Copy link
Member Author

SOFCI TEST

@plbossart
Copy link
Member Author

The failing tests on CML_HELIOS are due to the addition of a new test for PRs (multiple_pause_resume) added by @fredoh9. This test fails because of another issue described in #4560 and likely introduced by 9a7a5ce17dcb

We should probably revert this commit as well to bring CML back to normal.

@ranj063
Copy link
Collaborator

ranj063 commented Jul 29, 2021

SOFCI TEST

@plbossart
Copy link
Member Author

replaced by PR #4578

@plbossart plbossart closed this Jul 30, 2021
marc-hb added a commit to marc-hb/sof that referenced this pull request Sep 14, 2021
This reverts commit 7df3674.

This restores the ability to use CONFIG_TRACEM (copy everything to
mailbox) without crashing, in other words it fixes thesofproject#4699

This also fixes the other DSP panic thesofproject#4676 and removes the need for
logical changes in thesofproject#4678, which can be reverted too.

commit 7df3674 ("trace: enable trace after it is ready") was meant
to fix a crash when tr_xxx() was used early. However I've used very
early tracing for months and it never caused any crash (see thesofproject#4334)

I tried adding a tr_err() statement immediately after trace_init(sof) in
primary_core_init() and it works just fine. primary_core_init() runs
extremely early so I don't think it's too demanding not to use an
tr_XXX() before the trace even exists.

The reverted commits confused initializing and enabling.

Reproduction thesofproject#4683 did not seem to demonstrate anything obvious,
there's not even a link to a failed test run. I don't understand how
playing with spin locks is relevant to this.

Later, reproduction thesofproject#4759 finally demonstrated the real issue: through
DEBUG_TRACE_PTR(), some tr_XXX() can indeed be called (in very unusal
debug circumstances specific to the original author) before the trace is
initialized. The previous commit in this series fixes that by simply
guarding it with if(trace_get())

     --------

I am _not_ pretending that these reverts make the tracing code bug-free
and perfect again, absolutely not and very far from it. I'm merely
saying that:

- The first reverted commit caused at least two regressions: thesofproject#4676 and
  thesofproject#4699

- These two commits added yet another variable (time) in an already
  complex situation with an already existing combinatorial "explosion":
  compile-time Kconfigs, run-time settings, platform-specific bugs
  (thesofproject#4333, thesofproject#4573, ...), various races, mbox + DMA, different DMA engines,
  Zephyr vs XTOS, etc.

- Last but not least, we don't want to invest in making the exist trace
  implementation better. We want to switch to the Zephyr implementation
  instead

So let's go back to a previous known good state, I mean _relatively_
good and stay there if we can.

Signed-off-by: Marc Herbert <marc.herbert@intel.com>
lgirdwood pushed a commit that referenced this pull request Sep 15, 2021
This reverts commit 7df3674.

This restores the ability to use CONFIG_TRACEM (copy everything to
mailbox) without crashing, in other words it fixes #4699

This also fixes the other DSP panic #4676 and removes the need for
logical changes in #4678, which can be reverted too.

commit 7df3674 ("trace: enable trace after it is ready") was meant
to fix a crash when tr_xxx() was used early. However I've used very
early tracing for months and it never caused any crash (see #4334)

I tried adding a tr_err() statement immediately after trace_init(sof) in
primary_core_init() and it works just fine. primary_core_init() runs
extremely early so I don't think it's too demanding not to use an
tr_XXX() before the trace even exists.

The reverted commits confused initializing and enabling.

Reproduction #4683 did not seem to demonstrate anything obvious,
there's not even a link to a failed test run. I don't understand how
playing with spin locks is relevant to this.

Later, reproduction #4759 finally demonstrated the real issue: through
DEBUG_TRACE_PTR(), some tr_XXX() can indeed be called (in very unusal
debug circumstances specific to the original author) before the trace is
initialized. The previous commit in this series fixes that by simply
guarding it with if(trace_get())

     --------

I am _not_ pretending that these reverts make the tracing code bug-free
and perfect again, absolutely not and very far from it. I'm merely
saying that:

- The first reverted commit caused at least two regressions: #4676 and
  #4699

- These two commits added yet another variable (time) in an already
  complex situation with an already existing combinatorial "explosion":
  compile-time Kconfigs, run-time settings, platform-specific bugs
  (#4333, #4573, ...), various races, mbox + DMA, different DMA engines,
  Zephyr vs XTOS, etc.

- Last but not least, we don't want to invest in making the exist trace
  implementation better. We want to switch to the Zephyr implementation
  instead

So let's go back to a previous known good state, I mean _relatively_
good and stay there if we can.

Signed-off-by: Marc Herbert <marc.herbert@intel.com>
marc-hb added a commit to marc-hb/sof that referenced this pull request Sep 21, 2021
This reverts commit 7df3674.

This restores the ability to use CONFIG_TRACEM (copy everything to
mailbox) without crashing, in other words it fixes thesofproject#4699

This also fixes the other DSP panic thesofproject#4676 and removes the need for
logical changes in thesofproject#4678, which can be reverted too.

commit 7df3674 ("trace: enable trace after it is ready") was meant
to fix a crash when tr_xxx() was used early. However I've used very
early tracing for months and it never caused any crash (see thesofproject#4334)

I tried adding a tr_err() statement immediately after trace_init(sof) in
primary_core_init() and it works just fine. primary_core_init() runs
extremely early so I don't think it's too demanding not to use an
tr_XXX() before the trace even exists.

The reverted commits confused initializing and enabling.

Reproduction thesofproject#4683 did not seem to demonstrate anything obvious,
there's not even a link to a failed test run. I don't understand how
playing with spin locks is relevant to this.

Later, reproduction thesofproject#4759 finally demonstrated the real issue: through
DEBUG_TRACE_PTR(), some tr_XXX() can indeed be called (in very unusal
debug circumstances specific to the original author) before the trace is
initialized. The previous commit in this series fixes that by simply
guarding it with if(trace_get())

     --------

I am _not_ pretending that these reverts make the tracing code bug-free
and perfect again, absolutely not and very far from it. I'm merely
saying that:

- The first reverted commit caused at least two regressions: thesofproject#4676 and
  thesofproject#4699

- These two commits added yet another variable (time) in an already
  complex situation with an already existing combinatorial "explosion":
  compile-time Kconfigs, run-time settings, platform-specific bugs
  (thesofproject#4333, thesofproject#4573, ...), various races, mbox + DMA, different DMA engines,
  Zephyr vs XTOS, etc.

- Last but not least, we don't want to invest in making the exist trace
  implementation better. We want to switch to the Zephyr implementation
  instead

So let's go back to a previous known good state, I mean _relatively_
good and stay there if we can.

Signed-off-by: Marc Herbert <marc.herbert@intel.com>
(cherry picked from commit f2c13f5)
keyonjie pushed a commit that referenced this pull request Nov 2, 2021
This reverts commit 7df3674.

This restores the ability to use CONFIG_TRACEM (copy everything to
mailbox) without crashing, in other words it fixes #4699

This also fixes the other DSP panic #4676 and removes the need for
logical changes in #4678, which can be reverted too.

commit 7df3674 ("trace: enable trace after it is ready") was meant
to fix a crash when tr_xxx() was used early. However I've used very
early tracing for months and it never caused any crash (see #4334)

I tried adding a tr_err() statement immediately after trace_init(sof) in
primary_core_init() and it works just fine. primary_core_init() runs
extremely early so I don't think it's too demanding not to use an
tr_XXX() before the trace even exists.

The reverted commits confused initializing and enabling.

Reproduction #4683 did not seem to demonstrate anything obvious,
there's not even a link to a failed test run. I don't understand how
playing with spin locks is relevant to this.

Later, reproduction #4759 finally demonstrated the real issue: through
DEBUG_TRACE_PTR(), some tr_XXX() can indeed be called (in very unusal
debug circumstances specific to the original author) before the trace is
initialized. The previous commit in this series fixes that by simply
guarding it with if(trace_get())

     --------

I am _not_ pretending that these reverts make the tracing code bug-free
and perfect again, absolutely not and very far from it. I'm merely
saying that:

- The first reverted commit caused at least two regressions: #4676 and
  #4699

- These two commits added yet another variable (time) in an already
  complex situation with an already existing combinatorial "explosion":
  compile-time Kconfigs, run-time settings, platform-specific bugs
  (#4333, #4573, ...), various races, mbox + DMA, different DMA engines,
  Zephyr vs XTOS, etc.

- Last but not least, we don't want to invest in making the exist trace
  implementation better. We want to switch to the Zephyr implementation
  instead

So let's go back to a previous known good state, I mean _relatively_
good and stay there if we can.

Signed-off-by: Marc Herbert <marc.herbert@intel.com>
(cherry picked from commit f2c13f5)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants