Skip to content

Conversation

@marc-hb
Copy link
Collaborator

@marc-hb marc-hb commented Jun 11, 2021

2 main commits Only one left

Feb 2022 EDIT: I did this before adding mtrace_printf(). At the time there was no way to print to the etrace ONLY, so having the DMA trace working earlier was much more valuable.

marc-hb added a commit to marc-hb/sof-test that referenced this pull request Jun 11, 2021
... because it's not an error.

Required by thesofproject/sof#4334

Signed-off-by: Marc Herbert <marc.herbert@intel.com>
@marc-hb marc-hb changed the title Early tracing Restore early tracing Jun 11, 2021
@marc-hb
Copy link
Collaborator Author

marc-hb commented Jun 11, 2021

BTW I've been using mtrace_printf() a lot to try to understand why the DMA trace works in plain SOF but not in Zephyr. It's useful for any early init problem
cc: @keyonjie

@lgirdwood
Copy link
Member

BTW I've been using mtrace_printf() a lot to try to understand why the DMA trace works in plain SOF but not in Zephyr. It's useful for any early init problem
cc: @keyonjie

I assuming we would have a flow like

  1. Zephyr early boot.
  2. Zephyr mailbox trace is initialised. All trace now goes here.
  3. SOF init runs and inits DMAC.
  4. SOF init registers DMA trace backend. Trace now goes here.

marc-hb added a commit to thesofproject/sof-test that referenced this pull request Jun 11, 2021
... because it's not an error.

Required by thesofproject/sof#4334

Signed-off-by: Marc Herbert <marc.herbert@intel.com>
@marc-hb
Copy link
Collaborator Author

marc-hb commented Jun 11, 2021

https://sof-ci.01.org/sofpr/PR4334/build9344/devicetest/?model=CML_RVP_SDW&testcase=check-suspend-resume-5 is the usual thesofproject/linux/issues/2943

In the same run check-sof-logger.sh errors should be fixed by thesofproject/sof-test#707, re-running now.

Everything else is green in that run.

However QEMU failures like https://github.com/thesofproject/sof/pull/4334/checks?check_run_id=2800700000 , https://sof-ci.01.org/sofpr/PR4334/build9353/boottest/ and others are worrying

Kept as a draft for now because of recently found #4333 failure too

@marc-hb
Copy link
Collaborator Author

marc-hb commented Jun 11, 2021

SOFCI TEST

Copy link
Collaborator

@kv2019i kv2019i left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good otherwise, but the "if (false)" is a bit strange and was not in the the original-now-to-be-reverted patch.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks a bit weird and something to address before merging.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't like this either but I'm very reluctant to spend a few extra days debugging the probably half-initialized memory allocator that I don't know anything about yet. With this, #4327 and others I've already spent an inordinate amount of time fixing issues in a tracing solution that we want to deprecate eventually. @lgirdwood, @keyonjie any clue about this crash or any name of an expert in this area?

The impact of this issue seems small: a tiny amount of memory wasted in the very unusual case where DMA tracing fails to initialize here = itself a much "bigger problem"! In fact I bet no one would have noticed if I had simply omitted this rfree() entirely or even worse: not tested error handling (typical). Now that I narrowed down this crash, I think it's better to have this "strange"/commented out code here to save a lot of time to whoever would try to add it back if I it were missing. It's not like we can use gdb, can we?

Thoughts?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lets not waste time on fixing this, lest spend our time getting the Zephyr DMA backend working.

Copy link
Member

@plbossart plbossart Jun 14, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I fundamentally disagree on the decision-making process @lgirdwood. Both memory allocation and trace issues are now being pushed under the rug with a let's use Zephyr argument, that's a shortcut that needs a wider discussion.

The direction has to be one of

a) we have no plans to have Zephyr used in public and Chrome releases on APL..TGL, so we need to fix the trace and memory allocator
b) we formally decide at the TSG that we ditch the existing SOF with XTOS baseline to use Zephyr only (edit: on all APL..TGL platforms).

I am not against Zephyr, far from it, but we need a formal plan for a transition.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Direction is b. Makes no sense maintaining two RTOSes. The trace/memory issue are not critical yet and we have also been living with them for a while.
I will schedule TSC for immediately after v1.8 to discuss transition plan.

Copy link
Collaborator

@paulstelian97 paulstelian97 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Conditional approve -- in my view an improved comment stating the observed behaviour and why it is fine to ignore in this specific situation would do it justice.

@marc-hb marc-hb marked this pull request as ready for review June 18, 2021 01:06
@marc-hb
Copy link
Collaborator Author

marc-hb commented Jun 18, 2021

BYT and CHT were crashing on QEMU, see https://sof-ci.01.org/sofpr/PR4334/build9353/boottest/ and https://github.com/thesofproject/sof/pull/4334/checks?check_run_id=2800699973

This was because platform_timer_get(NULL) was being invoked in vatrace_log(). Fixed with new commit and new function platform_safe_get_time()

Funny how everything else was running OK.

Back do draft because etrace mbox errors in BDW and BYT in https://sof-ci.01.org/sofpr/PR4334/build9425/devicetest

@marc-hb marc-hb marked this pull request as draft June 18, 2021 02:46
marc-hb added a commit to marc-hb/sof that referenced this pull request Sep 14, 2021
This reverts commit 7df3674.

This restores the ability to use CONFIG_TRACEM (copy everything to
mailbox) without crashing, in other words it fixes thesofproject#4699

This also fixes the other DSP panic thesofproject#4676 and removes the need for
logical changes in thesofproject#4678, which can be reverted too.

commit 7df3674 ("trace: enable trace after it is ready") was meant
to fix a crash when tr_xxx() was used early. However I've used very
early tracing for months and it never caused any crash (see thesofproject#4334)

I tried adding a tr_err() statement immediately after trace_init(sof) in
primary_core_init() and it works just fine. primary_core_init() runs
extremely early so I don't think it's too demanding not to use an
tr_XXX() before the trace even exists.

The reverted commits confused initializing and enabling.

Reproduction thesofproject#4683 did not seem to demonstrate anything obvious,
there's not even a link to a failed test run. I don't understand how
playing with spin locks is relevant to this.

Later, reproduction thesofproject#4759 finally demonstrated the real issue: through
DEBUG_TRACE_PTR(), some tr_XXX() can indeed be called (in very unusal
debug circumstances specific to the original author) before the trace is
initialized. The previous commit in this series fixes that by simply
guarding it with if(trace_get())

     --------

I am _not_ pretending that these reverts make the tracing code bug-free
and perfect again, absolutely not and very far from it. I'm merely
saying that:

- The first reverted commit caused at least two regressions: thesofproject#4676 and
  thesofproject#4699

- These two commits added yet another variable (time) in an already
  complex situation with an already existing combinatorial "explosion":
  compile-time Kconfigs, run-time settings, platform-specific bugs
  (thesofproject#4333, thesofproject#4573, ...), various races, mbox + DMA, different DMA engines,
  Zephyr vs XTOS, etc.

- Last but not least, we don't want to invest in making the exist trace
  implementation better. We want to switch to the Zephyr implementation
  instead

So let's go back to a previous known good state, I mean _relatively_
good and stay there if we can.

Signed-off-by: Marc Herbert <marc.herbert@intel.com>
lgirdwood pushed a commit that referenced this pull request Sep 15, 2021
This reverts commit 7df3674.

This restores the ability to use CONFIG_TRACEM (copy everything to
mailbox) without crashing, in other words it fixes #4699

This also fixes the other DSP panic #4676 and removes the need for
logical changes in #4678, which can be reverted too.

commit 7df3674 ("trace: enable trace after it is ready") was meant
to fix a crash when tr_xxx() was used early. However I've used very
early tracing for months and it never caused any crash (see #4334)

I tried adding a tr_err() statement immediately after trace_init(sof) in
primary_core_init() and it works just fine. primary_core_init() runs
extremely early so I don't think it's too demanding not to use an
tr_XXX() before the trace even exists.

The reverted commits confused initializing and enabling.

Reproduction #4683 did not seem to demonstrate anything obvious,
there's not even a link to a failed test run. I don't understand how
playing with spin locks is relevant to this.

Later, reproduction #4759 finally demonstrated the real issue: through
DEBUG_TRACE_PTR(), some tr_XXX() can indeed be called (in very unusal
debug circumstances specific to the original author) before the trace is
initialized. The previous commit in this series fixes that by simply
guarding it with if(trace_get())

     --------

I am _not_ pretending that these reverts make the tracing code bug-free
and perfect again, absolutely not and very far from it. I'm merely
saying that:

- The first reverted commit caused at least two regressions: #4676 and
  #4699

- These two commits added yet another variable (time) in an already
  complex situation with an already existing combinatorial "explosion":
  compile-time Kconfigs, run-time settings, platform-specific bugs
  (#4333, #4573, ...), various races, mbox + DMA, different DMA engines,
  Zephyr vs XTOS, etc.

- Last but not least, we don't want to invest in making the exist trace
  implementation better. We want to switch to the Zephyr implementation
  instead

So let's go back to a previous known good state, I mean _relatively_
good and stay there if we can.

Signed-off-by: Marc Herbert <marc.herbert@intel.com>
marc-hb added a commit to marc-hb/sof that referenced this pull request Sep 21, 2021
This reverts commit 7df3674.

This restores the ability to use CONFIG_TRACEM (copy everything to
mailbox) without crashing, in other words it fixes thesofproject#4699

This also fixes the other DSP panic thesofproject#4676 and removes the need for
logical changes in thesofproject#4678, which can be reverted too.

commit 7df3674 ("trace: enable trace after it is ready") was meant
to fix a crash when tr_xxx() was used early. However I've used very
early tracing for months and it never caused any crash (see thesofproject#4334)

I tried adding a tr_err() statement immediately after trace_init(sof) in
primary_core_init() and it works just fine. primary_core_init() runs
extremely early so I don't think it's too demanding not to use an
tr_XXX() before the trace even exists.

The reverted commits confused initializing and enabling.

Reproduction thesofproject#4683 did not seem to demonstrate anything obvious,
there's not even a link to a failed test run. I don't understand how
playing with spin locks is relevant to this.

Later, reproduction thesofproject#4759 finally demonstrated the real issue: through
DEBUG_TRACE_PTR(), some tr_XXX() can indeed be called (in very unusal
debug circumstances specific to the original author) before the trace is
initialized. The previous commit in this series fixes that by simply
guarding it with if(trace_get())

     --------

I am _not_ pretending that these reverts make the tracing code bug-free
and perfect again, absolutely not and very far from it. I'm merely
saying that:

- The first reverted commit caused at least two regressions: thesofproject#4676 and
  thesofproject#4699

- These two commits added yet another variable (time) in an already
  complex situation with an already existing combinatorial "explosion":
  compile-time Kconfigs, run-time settings, platform-specific bugs
  (thesofproject#4333, thesofproject#4573, ...), various races, mbox + DMA, different DMA engines,
  Zephyr vs XTOS, etc.

- Last but not least, we don't want to invest in making the exist trace
  implementation better. We want to switch to the Zephyr implementation
  instead

So let's go back to a previous known good state, I mean _relatively_
good and stay there if we can.

Signed-off-by: Marc Herbert <marc.herbert@intel.com>
(cherry picked from commit f2c13f5)
@marc-hb
Copy link
Collaborator Author

marc-hb commented Oct 27, 2021

#4333 is supposedly fixed, so maybe time to resurrect this

cc: @ujfalusi

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not gracefully try to allocate a new buffer as best effort?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because we I wrote this a long time ago this was never supposed to happen. The buffer was immutable and initialized once.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would worth resurrecting this or a nonstop dtrace support, which is not really correct term as we do want to stop it, so probably permanent dtrace buffer support + dynamic dtrace DMA support (in start/stop terms)

keyonjie pushed a commit that referenced this pull request Nov 2, 2021
This reverts commit 7df3674.

This restores the ability to use CONFIG_TRACEM (copy everything to
mailbox) without crashing, in other words it fixes #4699

This also fixes the other DSP panic #4676 and removes the need for
logical changes in #4678, which can be reverted too.

commit 7df3674 ("trace: enable trace after it is ready") was meant
to fix a crash when tr_xxx() was used early. However I've used very
early tracing for months and it never caused any crash (see #4334)

I tried adding a tr_err() statement immediately after trace_init(sof) in
primary_core_init() and it works just fine. primary_core_init() runs
extremely early so I don't think it's too demanding not to use an
tr_XXX() before the trace even exists.

The reverted commits confused initializing and enabling.

Reproduction #4683 did not seem to demonstrate anything obvious,
there's not even a link to a failed test run. I don't understand how
playing with spin locks is relevant to this.

Later, reproduction #4759 finally demonstrated the real issue: through
DEBUG_TRACE_PTR(), some tr_XXX() can indeed be called (in very unusal
debug circumstances specific to the original author) before the trace is
initialized. The previous commit in this series fixes that by simply
guarding it with if(trace_get())

     --------

I am _not_ pretending that these reverts make the tracing code bug-free
and perfect again, absolutely not and very far from it. I'm merely
saying that:

- The first reverted commit caused at least two regressions: #4676 and
  #4699

- These two commits added yet another variable (time) in an already
  complex situation with an already existing combinatorial "explosion":
  compile-time Kconfigs, run-time settings, platform-specific bugs
  (#4333, #4573, ...), various races, mbox + DMA, different DMA engines,
  Zephyr vs XTOS, etc.

- Last but not least, we don't want to invest in making the exist trace
  implementation better. We want to switch to the Zephyr implementation
  instead

So let's go back to a previous known good state, I mean _relatively_
good and stay there if we can.

Signed-off-by: Marc Herbert <marc.herbert@intel.com>
(cherry picked from commit f2c13f5)
@marc-hb
Copy link
Collaborator Author

marc-hb commented Dec 13, 2021

SOFCI TEST

@marc-hb
Copy link
Collaborator Author

marc-hb commented Dec 13, 2021

Months and multiple DMA changes later, https://sof-ci.01.org/sofpr/PR4334/build11360/devicetest/ shows BDW and BYT still failing the same, in a "stuck DMAetrace" #4333 way. Everything else passes.

@lgirdwood
Copy link
Member

Months and multiple DMA changes later, https://sof-ci.01.org/sofpr/PR4334/build11360/devicetest/ shows BDW and BYT still failing the same, in a "stuck DMA" #4333 way. Everything else passes.

Sorry, do you mean we have the stuck DMA with or without this PR ?

@marc-hb
Copy link
Collaborator Author

marc-hb commented Dec 14, 2021

My bad: the etrace (not DMA) is empty / stuck WITH this PR on these two platforms.

This is very strange, sometimes the DMA trace has been missing, other times it's the etrace.

EDIT: empty trace bug filed in #5385. It can happen even without this PR (but much much more rarely)

@lgirdwood
Copy link
Member

@marc-hb any update here ?

@marc-hb
Copy link
Collaborator Author

marc-hb commented Jan 5, 2022

Still blocked by mysterious trace corruption. Who knows, maybe #5120 will help?

A logical, not "real" git revert of commit eca2089 ("dma-trace:
allocate trace buffer only after enabling traces")

Move dma_trace_buffer_init() away from dma_trace_enable() and back to
dma_trace_init_early()

Restore the early traces removed by that commit. There was no rationale in
the commit message and it was merged without review in PR thesofproject#255.

Earlier refactorings tried hard to trace early:

- commit 36e425e ("dma-trace: move to earlier initialisation point")
- commit 2b86cb3 ("trace: dma: Make sure we can trace platform
  device initialisation.")

... so it's really not clear why early tracing was removed later. "Rage
quit" to avoid unidentified bugs?

This could have been to save one or two 4k HOST_PAGE_SIZE when the
kernel does not enable DMA traces but I can't see what real-world use
case would leverage such savings _at run-time_; what happens then when
the user changes her mind? We don't have any validation for a use case
so dynamic.

Tracing can be disabled at _compile-time_, recently fixed by
commit 571cc29 ("xtensa/cmake: fix !CONFIG_TRACE")

Signed-off-by: Marc Herbert <marc.herbert@intel.com>
@marc-hb
Copy link
Collaborator Author

marc-hb commented Feb 17, 2022

Still blocked by mysterious trace corruption. Who knows, maybe #5120 will help?

#5120 was fixed by DMA alignment fix #5329 but it did not seem to help. I rebased this and BDW is still missing its etrace in https://sof-ci.01.org/sofpr/PR4334/build12063/devicetest

BYT was not available, some reservation problem. Will re-run after it's fixed.

In fact #5329 alignment made this PR more complicated because I had to hardcode the alignment value (see 2ae3328) because of a new (circular?) DMAC dependency. @ujfalusi , @jyri any idea here?

Other unrelated errors spotted in https://sof-ci.01.org/sofpr/PR4334/build12063/devicetest

@marc-hb
Copy link
Collaborator Author

marc-hb commented Feb 17, 2022

SOFCI TEST

Now that BYT is available again it failed exactly the same as BDW: no etrace.

https://sof-ci.01.org/sofpr/PR4334/build12071/devicetest/

EDIT: empty trace bug filed in #5385. It can happen even without this PR (but much much more rarely)

if (err < 0)
return err;
#else
addr_align = 0x80;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

at least a comment explaining the magic value, even if it was just a vague test-driven double cache-line size guess

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a hack, I merely copied the value from the disabled code above.

@ujfalusi
Copy link
Contributor

Still blocked by mysterious trace corruption. Who knows, maybe #5120 will help?

#5120 was fixed by DMA alignment fix #5329 but it did not seem to help. I rebased this and BDW is still missing its etrace in https://sof-ci.01.org/sofpr/PR4334/build12063/devicetest

etrace has nothing to do with dma-trace or am I missing something?

It could be possible that the dtarce init fails earlier than we would print the marker to the mtrace window (which is the etrace) and since the sof-logger is looking for the banner it is not printing out anything?

It would be interesting to see what is in there when the banner can not be found... It might just have the right information and we might want to add the SHM banner when the mtrace is initialized, which happens before the dtrace?

@marc-hb
Copy link
Collaborator Author

marc-hb commented Feb 17, 2022

etrace has nothing to do with dma-trace or am I missing something?

Nothing in theory.

It would be interesting to see what is in there when the banner can not be found... It might just have the right information and we might want to add the SHM banner when the mtrace is initialized, which happens before the dtrace?

The etrace is completely empty. Did you look at the test logs? I think they answer several of your questions.

@ujfalusi
Copy link
Contributor

The etrace is completely empty. Did you look at the test logs? I think they answer several of your questions.

Skipped 880 bytes after the last statement.
So it skipped 880 bytes, and what? The info could be in the 40 bytes it did not skipped? Or within the skipped bytes? Or somewhere else. I don't know how the sof-logger deals with the etrace... so likely you are right.

@marc-hb
Copy link
Collaborator Author

marc-hb commented Feb 17, 2022

I filed new "empty etrace" bug in

This draft PR always reproduces an empty etrace on BDW and BYT but it can also happen WITHOUT this PR.

@sys-pt1s
Copy link

Can one of the admins verify this patch?

@kv2019i
Copy link
Collaborator

kv2019i commented Mar 4, 2024

Stale PR

@kv2019i kv2019i closed this Mar 4, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants