Restore early tracing #4334

marc-hb · 2021-06-11T06:30:48Z

~~2 main commits~~ Only one left

- ~~add mtrace_printf()~~ merged in "Add new mtrace_printf() and use it for DMA init error handling and demoting banner to INFO level" #4389
- restore early tracing, "logical" revert of commit eca2089

Feb 2022 EDIT: I did this before adding mtrace_printf(). At the time there was no way to print to the etrace ONLY, so having the DMA trace working earlier was much more valuable.

marc-hb · 2021-06-11T07:53:54Z

BTW I've been using mtrace_printf() a lot to try to understand why the DMA trace works in plain SOF but not in Zephyr. It's useful for any early init problem
cc: @keyonjie

lgirdwood · 2021-06-11T09:56:33Z

> BTW I've been using mtrace_printf() a lot to try to understand why the DMA trace works in plain SOF but not in Zephyr. It's useful for any early init problem
> cc: @keyonjie

I'm assuming we would have a flow like:

1. Zephyr early boot.
2. Zephyr mailbox trace is initialised. All trace now goes here.
3. SOF init runs and inits DMAC.
4. SOF init registers DMA trace backend. Trace now goes here.
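
The flow above can be sketched in C as a single active trace sink that is swapped as boot progresses. Everything in this sketch (`trace_register_backend()`, `trace_msg()`, the two buffers) is a hypothetical stand-in and not the SOF or Zephyr logging API; it only illustrates the ordering, assuming that messages emitted before any backend exists are simply dropped.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical backend interface: one sink receives all trace output. */
typedef void (*trace_sink_fn)(const char *msg);

#define SINK_BUF_SIZE 256

static char mailbox_buf[SINK_BUF_SIZE]; /* stands in for the mailbox (etrace) window */
static char dma_buf[SINK_BUF_SIZE];     /* stands in for the DMA trace buffer */

/* Append msg to buf without overflowing; extra characters are truncated. */
static void append(char *buf, const char *msg)
{
	size_t used = strlen(buf);
	size_t room = SINK_BUF_SIZE - 1 - used;

	strncat(buf, msg, room);
}

static void mailbox_sink(const char *msg) { append(mailbox_buf, msg); }
static void dma_sink(const char *msg) { append(dma_buf, msg); }

static trace_sink_fn active_sink; /* NULL until a backend is registered */

static void trace_register_backend(trace_sink_fn sink)
{
	active_sink = sink;
}

static void trace_msg(const char *msg)
{
	if (active_sink) /* assumption: no backend yet means the message is dropped */
		active_sink(msg);
}

/* Walks through the four steps of the flow in order. */
static void boot_flow_demo(void)
{
	trace_msg("early boot;");             /* 1. no backend yet */
	trace_register_backend(mailbox_sink); /* 2. mailbox trace is up */
	trace_msg("dmac init;");              /* 3. goes to the mailbox */
	trace_register_backend(dma_sink);     /* 4. DMA backend takes over */
	trace_msg("running;");                /*    goes to the DMA buffer */
}
```

After `boot_flow_demo()` the mailbox buffer holds only the message emitted between steps 2 and 4, and the DMA buffer holds everything emitted afterwards.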

marc-hb · 2021-06-11T16:11:51Z

https://sof-ci.01.org/sofpr/PR4334/build9344/devicetest/?model=CML_RVP_SDW&testcase=check-suspend-resume-5 is the usual thesofproject/linux/issues/2943

In the same run check-sof-logger.sh errors should be fixed by thesofproject/sof-test#707, re-running now.

Everything else is green in that run.

However QEMU failures like https://github.com/thesofproject/sof/pull/4334/checks?check_run_id=2800700000 , https://sof-ci.01.org/sofpr/PR4334/build9353/boottest/ and others are worrying

Kept as a draft for now because of recently found #4333 failure too

marc-hb · 2021-06-11T16:12:46Z

SOFCI TEST

kv2019i

Looks good otherwise, but the "if (false)" is a bit strange and was not in the original-now-to-be-reverted patch.

kv2019i · 2021-06-11T18:40:27Z

src/trace/dma-trace.c

This looks a bit weird and something to address before merging.

I don't like this either but I'm very reluctant to spend a few extra days debugging the probably half-initialized memory allocator that I don't know anything about yet. With this, #4327 and others I've already spent an inordinate amount of time fixing issues in a tracing solution that we want to deprecate eventually. @lgirdwood, @keyonjie any clue about this crash or any name of an expert in this area?

The impact of this issue seems small: a tiny amount of memory wasted in the very unusual case where DMA tracing fails to initialize here, which is itself a much "bigger problem"! In fact I bet no one would have noticed if I had simply omitted this rfree() entirely or even worse: not tested error handling (typical). Now that I narrowed down this crash, I think it's better to have this "strange"/commented out code here to save a lot of time for whoever would try to add it back if it were missing. It's not like we can use gdb, can we?

Thoughts?

Let's not waste time on fixing this, let's spend our time getting the Zephyr DMA backend working.

I fundamentally disagree on the decision-making process @lgirdwood. Both memory allocation and trace issues are now being pushed under the rug with a let's use Zephyr argument, that's a shortcut that needs a wider discussion.

The direction has to be one of

a) we have no plans to have Zephyr used in public and Chrome releases on APL..TGL, so we need to fix the trace and memory allocator
b) we formally decide at the TSG that we ditch the existing SOF with XTOS baseline to use Zephyr only (edit: on all APL..TGL platforms).

I am not against Zephyr, far from it, but we need a formal plan for a transition.

Direction is b. Makes no sense maintaining two RTOSes. The trace/memory issues are not critical yet and we have also been living with them for a while.
I will schedule TSC for immediately after v1.8 to discuss transition plan.

paulstelian97

Conditional approve -- in my view an improved comment stating the observed behaviour and why it is fine to ignore in this specific situation would do it justice.

marc-hb · 2021-06-18T01:08:09Z

BYT and CHT were crashing on QEMU, see https://sof-ci.01.org/sofpr/PR4334/build9353/boottest/ and https://github.com/thesofproject/sof/pull/4334/checks?check_run_id=2800699973

This was because platform_timer_get(NULL) was being invoked in vatrace_log(). Fixed with new commit and new function platform_safe_get_time()
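
The thread does not show the body of platform_safe_get_time(), so the following is only a minimal sketch of the general idea: return a harmless timestamp instead of dereferencing a timer that does not exist yet. `struct timer`, `timer_read()`, the `platform_timer` global and `safe_get_time()` are all hypothetical names used for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the platform timer object. */
struct timer {
	uint64_t ticks;
};

/* Stand-in for a timer read that dereferences its argument,
 * so calling it with NULL would crash.
 */
static uint64_t timer_read(const struct timer *t)
{
	return t->ticks;
}

/* Hypothetical global, set once the platform timer has been initialised. */
static struct timer *platform_timer;

/* NULL-safe wrapper: very early callers get timestamp 0 instead of a crash. */
static uint64_t safe_get_time(void)
{
	return platform_timer ? timer_read(platform_timer) : 0;
}
```

With a wrapper of this kind, a trace call made before timer initialisation produces a log entry with a zero timestamp rather than a NULL dereference.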

Funny how everything else was running OK.

Back to draft because of etrace mbox errors in BDW and BYT in https://sof-ci.01.org/sofpr/PR4334/build9425/devicetest

This reverts commit 7df3674.

This restores the ability to use CONFIG_TRACEM (copy everything to mailbox) without crashing, in other words it fixes thesofproject#4699. This also fixes the other DSP panic thesofproject#4676 and removes the need for logical changes in thesofproject#4678, which can be reverted too.

commit 7df3674 ("trace: enable trace after it is ready") was meant to fix a crash when tr_xxx() was used early. However I've used very early tracing for months and it never caused any crash (see thesofproject#4334). I tried adding a tr_err() statement immediately after trace_init(sof) in primary_core_init() and it works just fine. primary_core_init() runs extremely early so I don't think it's too demanding not to use a tr_XXX() before the trace even exists.

The reverted commits confused initializing and enabling.

Reproduction thesofproject#4683 did not seem to demonstrate anything obvious, there's not even a link to a failed test run. I don't understand how playing with spin locks is relevant to this.

Later, reproduction thesofproject#4759 finally demonstrated the real issue: through DEBUG_TRACE_PTR(), some tr_XXX() can indeed be called (in very unusual debug circumstances specific to the original author) before the trace is initialized. The previous commit in this series fixes that by simply guarding it with if(trace_get())

--------

I am _not_ pretending that these reverts make the tracing code bug-free and perfect again, absolutely not and very far from it. I'm merely saying that:

- The first reverted commit caused at least two regressions: thesofproject#4676 and thesofproject#4699
- These two commits added yet another variable (time) in an already complex situation with an already existing combinatorial "explosion": compile-time Kconfigs, run-time settings, platform-specific bugs (thesofproject#4333, thesofproject#4573, ...), various races, mbox + DMA, different DMA engines, Zephyr vs XTOS, etc.
- Last but not least, we don't want to invest in making the existing trace implementation better. We want to switch to the Zephyr implementation instead.

So let's go back to a previous known good state, I mean _relatively_ good and stay there if we can.

Signed-off-by: Marc Herbert <marc.herbert@intel.com>
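
The `if(trace_get())` guard mentioned in the commit message above can be illustrated with a small sketch. The types and helpers here are simplified stand-ins (the real `struct trace`, `trace_get()` and `DEBUG_TRACE_PTR()` in SOF are not shown in this thread), and the macro name `DEBUG_TRACE_PTR_SKETCH` is hypothetical. The point is only that the debug macro becomes a no-op until the trace context exists.

```c
#include <stddef.h>

/* Simplified stand-in for the trace context. */
struct trace {
	int emitted; /* counts messages that actually went out */
};

/* NULL until the (hypothetical) trace init has run. */
static struct trace *global_trace;

/* Stand-in accessor: returns NULL when tracing is not initialised yet. */
static struct trace *trace_get(void)
{
	return global_trace;
}

/* Stand-in for a tr_XXX() call that requires a valid context. */
static void tr_emit(struct trace *t)
{
	t->emitted++;
}

/* Guarded debug macro: silently does nothing before initialisation. */
#define DEBUG_TRACE_PTR_SKETCH() \
	do { \
		struct trace *t_ = trace_get(); \
		if (t_) \
			tr_emit(t_); \
	} while (0)
```

Calling the macro before `global_trace` is set is harmless; once it is set, each call emits one message.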

This reverts commit 7df3674.

This restores the ability to use CONFIG_TRACEM (copy everything to mailbox) without crashing, in other words it fixes thesofproject#4699. This also fixes the other DSP panic thesofproject#4676 and removes the need for logical changes in thesofproject#4678, which can be reverted too.

commit 7df3674 ("trace: enable trace after it is ready") was meant to fix a crash when tr_xxx() was used early. However I've used very early tracing for months and it never caused any crash (see thesofproject#4334). I tried adding a tr_err() statement immediately after trace_init(sof) in primary_core_init() and it works just fine. primary_core_init() runs extremely early so I don't think it's too demanding not to use a tr_XXX() before the trace even exists.

The reverted commits confused initializing and enabling.

Reproduction thesofproject#4683 did not seem to demonstrate anything obvious, there's not even a link to a failed test run. I don't understand how playing with spin locks is relevant to this.

Later, reproduction thesofproject#4759 finally demonstrated the real issue: through DEBUG_TRACE_PTR(), some tr_XXX() can indeed be called (in very unusual debug circumstances specific to the original author) before the trace is initialized. The previous commit in this series fixes that by simply guarding it with if(trace_get())

--------

I am _not_ pretending that these reverts make the tracing code bug-free and perfect again, absolutely not and very far from it. I'm merely saying that:

- The first reverted commit caused at least two regressions: thesofproject#4676 and thesofproject#4699
- These two commits added yet another variable (time) in an already complex situation with an already existing combinatorial "explosion": compile-time Kconfigs, run-time settings, platform-specific bugs (thesofproject#4333, thesofproject#4573, ...), various races, mbox + DMA, different DMA engines, Zephyr vs XTOS, etc.
- Last but not least, we don't want to invest in making the existing trace implementation better. We want to switch to the Zephyr implementation instead.

So let's go back to a previous known good state, I mean _relatively_ good and stay there if we can.

Signed-off-by: Marc Herbert <marc.herbert@intel.com> (cherry picked from commit f2c13f5)

marc-hb · 2021-10-27T20:52:26Z

#4333 is supposedly fixed, so maybe time to resurrect this

cc: @ujfalusi

ujfalusi · 2021-10-28T06:09:17Z

src/trace/dma-trace.c

Why not gracefully try to allocate a new buffer as best effort?

Because when I wrote this a long time ago this was never supposed to happen. The buffer was immutable and initialized once.

I think it would be worth resurrecting this, or "nonstop dtrace" support. That is not really the correct term, as we do want to stop it, so probably: permanent dtrace buffer support + dynamic dtrace DMA support (in start/stop terms).

marc-hb · 2021-12-13T19:00:36Z

SOFCI TEST

marc-hb · 2021-12-13T20:01:19Z

Months and multiple DMA changes later, https://sof-ci.01.org/sofpr/PR4334/build11360/devicetest/ shows BDW and BYT still failing the same, in a "stuck ~~DMA~~etrace" #4333 way. Everything else passes.

lgirdwood · 2021-12-14T10:41:46Z

> Months and multiple DMA changes later, https://sof-ci.01.org/sofpr/PR4334/build11360/devicetest/ shows BDW and BYT still failing the same, in a "stuck DMA" #4333 way. Everything else passes.

Sorry, do you mean we have the stuck DMA with or without this PR ?

marc-hb · 2021-12-14T15:08:10Z

My bad: the etrace (not DMA) is empty / stuck WITH this PR on these two platforms.

This is very strange, sometimes the DMA trace has been missing, other times it's the etrace.

EDIT: empty trace bug filed in #5385. It can happen even without this PR (but much much more rarely)

lgirdwood · 2022-01-05T13:01:06Z

@marc-hb any update here ?

marc-hb · 2022-01-05T19:32:15Z

Still blocked by mysterious trace corruption. Who knows, maybe #5120 will help?

A logical, not "real" git revert of commit eca2089 ("dma-trace: allocate trace buffer only after enabling traces")

Move dma_trace_buffer_init() away from dma_trace_enable() and back to dma_trace_init_early(). Restore the early traces removed by that commit. There was no rationale in the commit message and it was merged without review in PR thesofproject#255.

Earlier refactorings tried hard to trace early:

- commit 36e425e ("dma-trace: move to earlier initialisation point")
- commit 2b86cb3 ("trace: dma: Make sure we can trace platform device initialisation.")

... so it's really not clear why early tracing was removed later. "Rage quit" to avoid unidentified bugs?

This could have been to save one or two 4k HOST_PAGE_SIZE when the kernel does not enable DMA traces but I can't see what real-world use case would leverage such savings _at run-time_; what happens then when the user changes her mind? We don't have any validation for a use case so dynamic.

Tracing can be disabled at _compile-time_, recently fixed by commit 571cc29 ("xtensa/cmake: fix !CONFIG_TRACE")

Signed-off-by: Marc Herbert <marc.herbert@intel.com>
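
To illustrate why the commit message above moves buffer allocation back into early init, here is a minimal sketch assuming a simple linear buffer. `struct dma_trace_sketch` and the `sketch_*()` functions are hypothetical and much simpler than the real dma_trace_init_early() / dma_trace_enable() code; the sketch only shows that messages logged before the host enables DMA are retained when the buffer already exists.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

#define TRACE_BUF_SIZE 4096 /* assumption: one 4k page, as mentioned in the message */

/* Hypothetical, simplified trace state. */
struct dma_trace_sketch {
	char *buf;        /* allocated early, before DMA is enabled */
	size_t w;         /* write offset */
	bool dma_enabled; /* set later, when the host enables DMA tracing */
};

/* Early init: allocate the buffer so that early messages can be stored. */
static int sketch_init_early(struct dma_trace_sketch *d)
{
	d->buf = calloc(1, TRACE_BUF_SIZE);
	d->w = 0;
	d->dma_enabled = false;
	return d->buf ? 0 : -1;
}

/* Logging works as soon as the buffer exists, with or without DMA. */
static void sketch_log(struct dma_trace_sketch *d, char c)
{
	if (d->buf && d->w < TRACE_BUF_SIZE)
		d->buf[d->w++] = c;
}

/* Enabling only flips the switch and reports how much is already queued. */
static size_t sketch_enable(struct dma_trace_sketch *d)
{
	d->dma_enabled = true;
	return d->w; /* bytes logged before enabling, ready to be copied out */
}
```

If allocation happened in the enable step instead, everything passed to `sketch_log()` before that point would be lost, which is the behaviour the revert is trying to undo.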

marc-hb · 2022-02-17T02:38:31Z

> Still blocked by mysterious trace corruption. Who knows, maybe #5120 will help?

#5120 was fixed by DMA alignment fix #5329 but it did not seem to help. I rebased this and BDW is still missing its etrace in https://sof-ci.01.org/sofpr/PR4334/build12063/devicetest

BYT was not available, some reservation problem. Will re-run after it's fixed.

In fact #5329 alignment made this PR more complicated because I had to hardcode the alignment value (see 2ae3328) because of a new (circular?) DMAC dependency. @ujfalusi , @jyri any idea here?

Other unrelated errors spotted in https://sof-ci.01.org/sofpr/PR4334/build12063/devicetest

marc-hb · 2022-02-17T08:03:43Z

SOFCI TEST

Now that BYT is available again it failed exactly the same as BDW: no etrace.

https://sof-ci.01.org/sofpr/PR4334/build12071/devicetest/

EDIT: empty trace bug filed in #5385. It can happen even without this PR (but much much more rarely)

lyakh · 2022-02-17T09:14:40Z

src/trace/dma-trace.c

 	if (err < 0)
 		return err;
+#else
+	addr_align = 0x80;


at least a comment explaining the magic value, even if it was just a vague test-driven double cache-line size guess

This is a hack, I merely copied the value from the disabled code above.
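
One possible way to address the request for a comment is to give the value a name and document where it came from. The sketch below is only an illustration: the macro name and the `align_up()` helper are hypothetical, and the "double cache-line" reading is the reviewer's guess above, not a confirmed fact.

```c
#include <stdint.h>

/*
 * Fallback alignment used when the DMA controller cannot be queried yet.
 * The value 0x80 is copied from the disabled code path; it may correspond
 * to twice the cache-line size, but that is an unverified guess.
 */
#define TRACE_BUF_FALLBACK_ALIGN 0x80u

/* Round addr up to the next multiple of align (align must be a power of two). */
static uintptr_t align_up(uintptr_t addr, uintptr_t align)
{
	return (addr + align - 1) & ~(align - 1);
}
```

For example, `align_up(0x1001, TRACE_BUF_FALLBACK_ALIGN)` yields `0x1080`, while an already aligned address such as `0x1000` is returned unchanged.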

ujfalusi · 2022-02-17T13:36:26Z

> Still blocked by mysterious trace corruption. Who knows, maybe #5120 will help?
>
> #5120 was fixed by DMA alignment fix #5329 but it did not seem to help. I rebased this and BDW is still missing its etrace in https://sof-ci.01.org/sofpr/PR4334/build12063/devicetest

etrace has nothing to do with dma-trace or am I missing something?

It could be possible that the dtrace init fails earlier than we would print the marker to the mtrace window (which is the etrace) and since the sof-logger is looking for the banner it is not printing out anything?

It would be interesting to see what is in there when the banner can not be found... It might just have the right information and we might want to add the SHM banner when the mtrace is initialized, which happens before the dtrace?

marc-hb · 2022-02-17T15:11:48Z

> etrace has nothing to do with dma-trace or am I missing something?

Nothing in theory.

> It would be interesting to see what is in there when the banner can not be found... It might just have the right information and we might want to add the SHM banner when the mtrace is initialized, which happens before the dtrace?

The etrace is completely empty. Did you look at the test logs? I think they answer several of your questions.

ujfalusi · 2022-02-17T15:54:14Z

> The etrace is completely empty. Did you look at the test logs? I think they answer several of your questions.

Skipped 880 bytes after the last statement.
So it skipped 880 bytes, and what? The info could be in the 40 bytes it did not skip? Or within the skipped bytes? Or somewhere else. I don't know how the sof-logger deals with the etrace... so likely you are right.

marc-hb · 2022-02-17T18:21:38Z

I filed a new "empty etrace" bug in

[BUG] etrace sometimes empty on BDW and BYT #5385
I should have filed this a long time ago sorry.

This draft PR always reproduces an empty etrace on BDW and BYT but it can also happen WITHOUT this PR.

sys-pt1s · 2022-04-20T07:03:58Z

Can one of the admins verify this patch?

kv2019i · 2024-03-04T13:13:48Z

Stale PR

marc-hb added a commit to marc-hb/sof-test that referenced this pull request Jun 11, 2021

check-sof-logger.sh: do not assume FW ABI banner starts with ERROR

6a2e92c

... because it's not an error. Required by thesofproject/sof#4334 Signed-off-by: Marc Herbert <marc.herbert@intel.com>

marc-hb mentioned this pull request Jun 11, 2021

check-sof-logger.sh: do not assume FW ABI banner starts with ERROR thesofproject/sof-test#707

Merged

marc-hb changed the title ~~Early tracing~~ Restore early tracing Jun 11, 2021

marc-hb mentioned this pull request Jun 11, 2021

dma-trace: allocate trace buffer only after enabling traces #255

Merged

marc-hb force-pushed the early-tracing branch from ec54b32 to 14a2f9c Compare June 11, 2021 07:23

marc-hb added a commit to thesofproject/sof-test that referenced this pull request Jun 11, 2021

check-sof-logger.sh: do not assume FW ABI banner starts with ERROR

cde1ea0

... because it's not an error. Required by thesofproject/sof#4334 Signed-off-by: Marc Herbert <marc.herbert@intel.com>

marc-hb requested review from keyonjie, ktrzcinx, kv2019i, lyakh, mmaka1, paulstelian97, plbossart, ranj063, slawblauciak and ujfalusi June 11, 2021 16:21

lgirdwood approved these changes Jun 11, 2021

kv2019i reviewed Jun 11, 2021

paulstelian97 approved these changes Jun 15, 2021

marc-hb force-pushed the early-tracing branch from 14a2f9c to 740614a Compare June 18, 2021 00:58

marc-hb marked this pull request as ready for review June 18, 2021 01:06

marc-hb requested review from akloniex, dbaluta and lbetlej as code owners June 18, 2021 01:06

marc-hb marked this pull request as draft June 18, 2021 02:46

marc-hb mentioned this pull request Sep 14, 2021

Revert trace enable commits, fix DEBUG_TRACE_PTR() macro so it can be used early #4760

Merged

ujfalusi reviewed Oct 28, 2021

ujfalusi mentioned this pull request Oct 28, 2021

dma-trace: Fixes aimed to make the re-configuration robust and cleaner #4879

Closed

marc-hb force-pushed the early-tracing branch from 3dc1f1b to 2ae3328 Compare February 17, 2022 01:06

ujfalusi mentioned this pull request Feb 17, 2022

dma-trace: Full support for re-configuration and make dmatb persistent from the time it is allocated #5106

Merged

lyakh reviewed Feb 17, 2022

marc-hb mentioned this pull request Feb 17, 2022

[BUG] etrace sometimes empty on BDW and BYT #5385

Closed

marc-hb mentioned this pull request Feb 20, 2022

Possible missed trace logs at the beginning thesofproject/linux#3448

Closed

kv2019i closed this Mar 4, 2024
