-
Notifications
You must be signed in to change notification settings - Fork 349
Restore early tracing #4334
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Restore early tracing #4334
Conversation
... because it's not an error. Required by thesofproject/sof#4334 Signed-off-by: Marc Herbert <marc.herbert@intel.com>
|
BTW I've been using |
I assuming we would have a flow like
|
... because it's not an error. Required by thesofproject/sof#4334 Signed-off-by: Marc Herbert <marc.herbert@intel.com>
|
https://sof-ci.01.org/sofpr/PR4334/build9344/devicetest/?model=CML_RVP_SDW&testcase=check-suspend-resume-5 is the usual thesofproject/linux/issues/2943 In the same run check-sof-logger.sh errors should be fixed by thesofproject/sof-test#707, re-running now. Everything else is green in that run. However QEMU failures like https://github.com/thesofproject/sof/pull/4334/checks?check_run_id=2800700000 , https://sof-ci.01.org/sofpr/PR4334/build9353/boottest/ and others are worrying Kept as a draft for now because of recently found #4333 failure too |
|
SOFCI TEST |
kv2019i
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good otherwise, but the "if (false)" is a bit strange and was not in the the original-now-to-be-reverted patch.
src/trace/dma-trace.c
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks a bit weird and something to address before merging.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't like this either but I'm very reluctant to spend a few extra days debugging the probably half-initialized memory allocator that I don't know anything about yet. With this, #4327 and others I've already spent an inordinate amount of time fixing issues in a tracing solution that we want to deprecate eventually. @lgirdwood, @keyonjie any clue about this crash or any name of an expert in this area?
The impact of this issue seems small: a tiny amount of memory wasted in the very unusual case where DMA tracing fails to initialize here = itself a much "bigger problem"! In fact I bet no one would have noticed if I had simply omitted this rfree() entirely or even worse: not tested error handling (typical). Now that I narrowed down this crash, I think it's better to have this "strange"/commented out code here to save a lot of time to whoever would try to add it back if I it were missing. It's not like we can use gdb, can we?
Thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lets not waste time on fixing this, lest spend our time getting the Zephyr DMA backend working.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I fundamentally disagree on the decision-making process @lgirdwood. Both memory allocation and trace issues are now being pushed under the rug with a let's use Zephyr argument, that's a shortcut that needs a wider discussion.
The direction has to be one of
a) we have no plans to have Zephyr used in public and Chrome releases on APL..TGL, so we need to fix the trace and memory allocator
b) we formally decide at the TSG that we ditch the existing SOF with XTOS baseline to use Zephyr only (edit: on all APL..TGL platforms).
I am not against Zephyr, far from it, but we need a formal plan for a transition.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Direction is b. Makes no sense maintaining two RTOSes. The trace/memory issue are not critical yet and we have also been living with them for a while.
I will schedule TSC for immediately after v1.8 to discuss transition plan.
paulstelian97
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Conditional approve -- in my view an improved comment stating the observed behaviour and why it is fine to ignore in this specific situation would do it justice.
|
BYT and CHT were crashing on QEMU, see https://sof-ci.01.org/sofpr/PR4334/build9353/boottest/ and https://github.com/thesofproject/sof/pull/4334/checks?check_run_id=2800699973 This was because Funny how everything else was running OK. Back do draft because etrace mbox errors in BDW and BYT in https://sof-ci.01.org/sofpr/PR4334/build9425/devicetest |
This reverts commit 7df3674. This restores the ability to use CONFIG_TRACEM (copy everything to mailbox) without crashing, in other words it fixes thesofproject#4699 This also fixes the other DSP panic thesofproject#4676 and removes the need for logical changes in thesofproject#4678, which can be reverted too. commit 7df3674 ("trace: enable trace after it is ready") was meant to fix a crash when tr_xxx() was used early. However I've used very early tracing for months and it never caused any crash (see thesofproject#4334) I tried adding a tr_err() statement immediately after trace_init(sof) in primary_core_init() and it works just fine. primary_core_init() runs extremely early so I don't think it's too demanding not to use an tr_XXX() before the trace even exists. The reverted commits confused initializing and enabling. Reproduction thesofproject#4683 did not seem to demonstrate anything obvious, there's not even a link to a failed test run. I don't understand how playing with spin locks is relevant to this. Later, reproduction thesofproject#4759 finally demonstrated the real issue: through DEBUG_TRACE_PTR(), some tr_XXX() can indeed be called (in very unusal debug circumstances specific to the original author) before the trace is initialized. The previous commit in this series fixes that by simply guarding it with if(trace_get()) -------- I am _not_ pretending that these reverts make the tracing code bug-free and perfect again, absolutely not and very far from it. I'm merely saying that: - The first reverted commit caused at least two regressions: thesofproject#4676 and thesofproject#4699 - These two commits added yet another variable (time) in an already complex situation with an already existing combinatorial "explosion": compile-time Kconfigs, run-time settings, platform-specific bugs (thesofproject#4333, thesofproject#4573, ...), various races, mbox + DMA, different DMA engines, Zephyr vs XTOS, etc. - Last but not least, we don't want to invest in making the exist trace implementation better. We want to switch to the Zephyr implementation instead So let's go back to a previous known good state, I mean _relatively_ good and stay there if we can. Signed-off-by: Marc Herbert <marc.herbert@intel.com>
This reverts commit 7df3674. This restores the ability to use CONFIG_TRACEM (copy everything to mailbox) without crashing, in other words it fixes #4699 This also fixes the other DSP panic #4676 and removes the need for logical changes in #4678, which can be reverted too. commit 7df3674 ("trace: enable trace after it is ready") was meant to fix a crash when tr_xxx() was used early. However I've used very early tracing for months and it never caused any crash (see #4334) I tried adding a tr_err() statement immediately after trace_init(sof) in primary_core_init() and it works just fine. primary_core_init() runs extremely early so I don't think it's too demanding not to use an tr_XXX() before the trace even exists. The reverted commits confused initializing and enabling. Reproduction #4683 did not seem to demonstrate anything obvious, there's not even a link to a failed test run. I don't understand how playing with spin locks is relevant to this. Later, reproduction #4759 finally demonstrated the real issue: through DEBUG_TRACE_PTR(), some tr_XXX() can indeed be called (in very unusal debug circumstances specific to the original author) before the trace is initialized. The previous commit in this series fixes that by simply guarding it with if(trace_get()) -------- I am _not_ pretending that these reverts make the tracing code bug-free and perfect again, absolutely not and very far from it. I'm merely saying that: - The first reverted commit caused at least two regressions: #4676 and #4699 - These two commits added yet another variable (time) in an already complex situation with an already existing combinatorial "explosion": compile-time Kconfigs, run-time settings, platform-specific bugs (#4333, #4573, ...), various races, mbox + DMA, different DMA engines, Zephyr vs XTOS, etc. - Last but not least, we don't want to invest in making the exist trace implementation better. We want to switch to the Zephyr implementation instead So let's go back to a previous known good state, I mean _relatively_ good and stay there if we can. Signed-off-by: Marc Herbert <marc.herbert@intel.com>
This reverts commit 7df3674. This restores the ability to use CONFIG_TRACEM (copy everything to mailbox) without crashing, in other words it fixes thesofproject#4699 This also fixes the other DSP panic thesofproject#4676 and removes the need for logical changes in thesofproject#4678, which can be reverted too. commit 7df3674 ("trace: enable trace after it is ready") was meant to fix a crash when tr_xxx() was used early. However I've used very early tracing for months and it never caused any crash (see thesofproject#4334) I tried adding a tr_err() statement immediately after trace_init(sof) in primary_core_init() and it works just fine. primary_core_init() runs extremely early so I don't think it's too demanding not to use an tr_XXX() before the trace even exists. The reverted commits confused initializing and enabling. Reproduction thesofproject#4683 did not seem to demonstrate anything obvious, there's not even a link to a failed test run. I don't understand how playing with spin locks is relevant to this. Later, reproduction thesofproject#4759 finally demonstrated the real issue: through DEBUG_TRACE_PTR(), some tr_XXX() can indeed be called (in very unusal debug circumstances specific to the original author) before the trace is initialized. The previous commit in this series fixes that by simply guarding it with if(trace_get()) -------- I am _not_ pretending that these reverts make the tracing code bug-free and perfect again, absolutely not and very far from it. I'm merely saying that: - The first reverted commit caused at least two regressions: thesofproject#4676 and thesofproject#4699 - These two commits added yet another variable (time) in an already complex situation with an already existing combinatorial "explosion": compile-time Kconfigs, run-time settings, platform-specific bugs (thesofproject#4333, thesofproject#4573, ...), various races, mbox + DMA, different DMA engines, Zephyr vs XTOS, etc. - Last but not least, we don't want to invest in making the exist trace implementation better. We want to switch to the Zephyr implementation instead So let's go back to a previous known good state, I mean _relatively_ good and stay there if we can. Signed-off-by: Marc Herbert <marc.herbert@intel.com> (cherry picked from commit f2c13f5)
src/trace/dma-trace.c
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not gracefully try to allocate a new buffer as best effort?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because we I wrote this a long time ago this was never supposed to happen. The buffer was immutable and initialized once.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would worth resurrecting this or a nonstop dtrace support, which is not really correct term as we do want to stop it, so probably permanent dtrace buffer support + dynamic dtrace DMA support (in start/stop terms)
This reverts commit 7df3674. This restores the ability to use CONFIG_TRACEM (copy everything to mailbox) without crashing, in other words it fixes #4699 This also fixes the other DSP panic #4676 and removes the need for logical changes in #4678, which can be reverted too. commit 7df3674 ("trace: enable trace after it is ready") was meant to fix a crash when tr_xxx() was used early. However I've used very early tracing for months and it never caused any crash (see #4334) I tried adding a tr_err() statement immediately after trace_init(sof) in primary_core_init() and it works just fine. primary_core_init() runs extremely early so I don't think it's too demanding not to use an tr_XXX() before the trace even exists. The reverted commits confused initializing and enabling. Reproduction #4683 did not seem to demonstrate anything obvious, there's not even a link to a failed test run. I don't understand how playing with spin locks is relevant to this. Later, reproduction #4759 finally demonstrated the real issue: through DEBUG_TRACE_PTR(), some tr_XXX() can indeed be called (in very unusal debug circumstances specific to the original author) before the trace is initialized. The previous commit in this series fixes that by simply guarding it with if(trace_get()) -------- I am _not_ pretending that these reverts make the tracing code bug-free and perfect again, absolutely not and very far from it. I'm merely saying that: - The first reverted commit caused at least two regressions: #4676 and #4699 - These two commits added yet another variable (time) in an already complex situation with an already existing combinatorial "explosion": compile-time Kconfigs, run-time settings, platform-specific bugs (#4333, #4573, ...), various races, mbox + DMA, different DMA engines, Zephyr vs XTOS, etc. - Last but not least, we don't want to invest in making the exist trace implementation better. We want to switch to the Zephyr implementation instead So let's go back to a previous known good state, I mean _relatively_ good and stay there if we can. Signed-off-by: Marc Herbert <marc.herbert@intel.com> (cherry picked from commit f2c13f5)
|
SOFCI TEST |
|
Months and multiple DMA changes later, https://sof-ci.01.org/sofpr/PR4334/build11360/devicetest/ shows BDW and BYT still failing the same, in a "stuck |
Sorry, do you mean we have the stuck DMA with or without this PR ? |
|
My bad: the This is very strange, sometimes the DMA trace has been missing, other times it's the etrace. EDIT: empty trace bug filed in #5385. It can happen even without this PR (but much much more rarely) |
|
@marc-hb any update here ? |
|
Still blocked by mysterious trace corruption. Who knows, maybe #5120 will help? |
A logical, not "real" git revert of commit eca2089 ("dma-trace: allocate trace buffer only after enabling traces") Move dma_trace_buffer_init() away from dma_trace_enable() and back to dma_trace_init_early() Restore the early traces removed by that commit. There was no rationale in the commit message and it was merged without review in PR thesofproject#255. Earlier refactorings tried hard to trace early: - commit 36e425e ("dma-trace: move to earlier initialisation point") - commit 2b86cb3 ("trace: dma: Make sure we can trace platform device initialisation.") ... so it's really not clear why early tracing was removed later. "Rage quit" to avoid unidentified bugs? This could have been to save one or two 4k HOST_PAGE_SIZE when the kernel does not enable DMA traces but I can't see what real-world use case would leverage such savings _at run-time_; what happens then when the user changes her mind? We don't have any validation for a use case so dynamic. Tracing can be disabled at _compile-time_, recently fixed by commit 571cc29 ("xtensa/cmake: fix !CONFIG_TRACE") Signed-off-by: Marc Herbert <marc.herbert@intel.com>
3dc1f1b to
2ae3328
Compare
|
SOFCI TEST Now that BYT is available again it failed exactly the same as BDW: no etrace. https://sof-ci.01.org/sofpr/PR4334/build12071/devicetest/ EDIT: empty trace bug filed in #5385. It can happen even without this PR (but much much more rarely) |
| if (err < 0) | ||
| return err; | ||
| #else | ||
| addr_align = 0x80; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
at least a comment explaining the magic value, even if it was just a vague test-driven double cache-line size guess
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a hack, I merely copied the value from the disabled code above.
etrace has nothing to do with dma-trace or am I missing something? It could be possible that the dtarce init fails earlier than we would print the marker to the mtrace window (which is the etrace) and since the sof-logger is looking for the banner it is not printing out anything? It would be interesting to see what is in there when the banner can not be found... It might just have the right information and we might want to add the SHM banner when the mtrace is initialized, which happens before the dtrace? |
Nothing in theory.
The etrace is completely empty. Did you look at the test logs? I think they answer several of your questions. |
|
|
I filed new "empty etrace" bug in
This draft PR always reproduces an empty etrace on BDW and BYT but it can also happen WITHOUT this PR. |
|
Can one of the admins verify this patch? |
|
Stale PR |
2 main commitsOnly one leftadd mtrace_printf()merged in Add new mtrace_printf() and use it for DMA init error handling and demoting banner to INFO level #4389Feb 2022 EDIT: I did this before adding
mtrace_printf(). At the time there was no way to print to theetraceONLY, so having the DMA trace working earlier was much more valuable.