Skip to content

Conversation

@keyonjie
Copy link
Contributor

If the log tracing (e.g. tr_err()) is called before the trace itself is
available, the FW will crash and FW boot fail happen.

Enable the trace after it is ready, and don't try to perform tracing
when it is unavailable.

Signed-off-by: Keyon Jie yang.jie@linux.intel.com

@keyonjie keyonjie requested a review from akloniex as a code owner August 13, 2021 12:31
@keyonjie keyonjie requested a review from lgirdwood August 13, 2021 12:32
Copy link
Collaborator

@dbaluta dbaluta Aug 13, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, calling dma_trace_init_complete will turn the tracing on BUT calling trace_init will not turn the tracing ON.

I feel that the normal sequence of enabling tracing should be:

dma_trace_init_complete()
trace_on()

I think is called DMA trace.

And for the other trace we shold do:

trace_init()
trace_on()

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good point, moving and unifying the trace_on() invoking to the end of primary_core_init() looks better to me.

Copy link
Member

@lgirdwood lgirdwood left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dbaluta good for you now ? Want to get this in the v1.9 stable branch

@lgirdwood
Copy link
Member

@keyonjie seeing this in the CI for CMl

[  554.456130] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: ASoC: error at snd_soc_pcm_component_pm_runtime_get on 0000:00:1f.3: -22
[  554.456174] kernel:  SSP0-Codec: soc_pcm_open() failed (-22)
[  554.456193] kernel:  Port1: ASoC: dpcm_be_dai_startup() failed at SSP0-Codec (-22)
[  554.456212] kernel:  Port1: dpcm_fe_dai_startup() failed (-22)
[  555.096177] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: Copying cached debugfs data
[  555.096667] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: Copying cached debugfs data
[  555.098959] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: Copying cached debugfs data
[  555.098981] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: Copying cached debugfs data
[  556.647310] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: error: enabling device failed: -22
[  556.647348] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: error: fail in sof_ipc_trace_update_filter -22
[  556.647613] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: Copying cached debugfs data
[  556.648022] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: Copying cached debugfs data
[  556.650207] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: Copying cached debugfs data
[  556.650226] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: Copying cached debugfs data
[  557.242890] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: error: enabling device failed: -22
[  557.242925] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: error: fail in sof_ipc_trace_update_filter -22
[  557.243190] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: Copying cached debugfs data
[  557.243519] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: Copying cached debugfs data
[  557.245698] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: Copying cached debugfs data
[  557.245718] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: Copying cached debugfs data

Will rerun CI tests to rule out script or DUT.

@lgirdwood
Copy link
Member

SOFCI TEST

@keyonjie
Copy link
Contributor Author

@lgirdwood looks the failure on CML module unload/reload is valid, let me try to figure out it tomorrow.

@dbaluta
Copy link
Collaborator

dbaluta commented Aug 18, 2021

@lgirdwood looks good to me. When do you plan to create the v1.9 branch/tag?

Good to merge once @keyonjie sorts out the CI failure.

@keyonjie
Copy link
Contributor Author

@lgirdwood @dbaluta looks the CI failure is not related, just rebased and force push to trigger a re-test.

@lgirdwood
Copy link
Member

@keyonjie looks like a null pointer deref race after latest CI reults.
https://sof-ci.01.org/sofpr/PR4636/build10036/devicetest/?model=TGLU_RVP_SDW&testcase=check-suspend-resume-5

@lgirdwood
Copy link
Member

@lgirdwood looks good to me. When do you plan to create the v1.9 branch/tag?

Sorry, now done (everyone is on holiday so my workload is high), we have the stable-v1.9 branch, there are a couple of fixes pending though prior to rc1. I've no issues if you need to do a rc1 tag if needed. The fixes Intel fixes will probably land next week (which I could tag as rc2).

@dbaluta
Copy link
Collaborator

dbaluta commented Aug 23, 2021

Thanks, no need for tagging on our side. I will go with your pace for tagging and RCs.

If the log tracing (e.g. tr_err()) is called before the trace itself is
available, the FW will crash and FW boot fail happen.

Enable the trace after it is ready, and don't try to perform tracing
when it is unavailable.

We have the empty version of trace_init/on(), so the extra "#ifdef" in
primary_core_init() is superfluous.

Signed-off-by: Keyon Jie <yang.jie@linux.intel.com>
@lgirdwood
Copy link
Member

@kkarask @wszypelt good to merge ? looks like CI is pending ?

@marc-hb
Copy link
Collaborator

marc-hb commented Aug 25, 2021

If the log tracing (e.g. tr_err()) is called before the trace itself is available, the FW will crash and FW boot fail happen.

That's not my experience; I've seen tr_xxx() called thousands of times before the trace was initialized and this never caused any issue because the code was smart enough not to do anything in this case. See for instance PR #4334

What crashed exactly and where? Bug fixes should fix well identified bug sequences, not just "something crashes"

@marc-hb
Copy link
Collaborator

marc-hb commented Aug 25, 2021

... and now the trace is crashing!! See #4676 report.

Revert this ASAP?

@plbossart
Copy link
Member

@marc-hb a 'fix' was merged already with PR #4678

I still have strong questions on how this slipped through the cracks. Something's not right here.

@marc-hb
Copy link
Collaborator

marc-hb commented Aug 25, 2021

Something's not right here.

Easy: merging bug fixes for bugs that have no description, no test and no reproduction steps.

@keyonjie
Copy link
Contributor Author

If the log tracing (e.g. tr_err()) is called before the trace itself is available, the FW will crash and FW boot fail happen.

That's not my experience; I've seen tr_xxx() called thousands of times before the trace was initialized and this never caused any issue because the code was smart enough not to do anything in this case. See for instance PR #4334

What crashed exactly and where? Bug fixes should fix well identified bug sequences, not just "something crashes"

I hit this during the memory allocation debugging, let me submit a PR to demonstrate it.

@keyonjie
Copy link
Contributor Author

@marc-hb please check the APL result here: #4683
Didn't file an issue for it as it requires code change to find it, and you were not here :)

@keyonjie
Copy link
Contributor Author

@marc-hb a 'fix' was merged already with PR #4678

I still have strong questions on how this slipped through the cracks. Something's not right here.

I just recheck the CI result of the PR, we didn't capture failure on multiple platforms but it did observe the issue on ADLP_RVP_NOCODEC, maybe the "always failure" on _ZEPHYR platforms had reduced attentions from reviewers and it got slipped.

@marc-hb
Copy link
Collaborator

marc-hb commented Aug 31, 2021

Another regression in #4699

@lgirdwood
Copy link
Member

Another regression in #4699

@marc-hb Is this blocking you with the Zephyr trace fix ?

@marc-hb
Copy link
Collaborator

marc-hb commented Aug 31, 2021

I don't know that yet, I'm just getting back into it. I was just connecting github dots above.

@keyonjie
Copy link
Contributor Author

keyonjie commented Sep 1, 2021

Another regression in #4699

indeed, @ranj063 raised this to me yesterday, I didn't observe it as the mailbox trace is disabled by default, so as the CI.

@marc-hb
Copy link
Collaborator

marc-hb commented Sep 1, 2021

I didn't observe it as the mailbox trace is disabled by default, so as the CI.

I don't understand what this means. If you look at the sof-logger results in some random, recent PR you can see that the logger.etrace header is present on every platform: https://sof-ci.01.org/sofpr/PR4701/build10153/devicetest/

@paulstelian97
Copy link
Collaborator

@marc-hb Mailbox trace being disabled by default means that etrace will not get the regular traces, although error traces might go through it anyway. That's the default behaviour I observe on my end.

@marc-hb
Copy link
Collaborator

marc-hb commented Sep 1, 2021

Yes that's correct, I remember now. I had forgotten that very poor name in the Kconfig interface, if someone had said TRACEM instead I would have understood immediately :-)

marc-hb added a commit to marc-hb/sof that referenced this pull request Sep 14, 2021
As reported in thesofproject#4759, thesofproject#4636 and a few others linked from there.

Signed-off-by: Marc Herbert <marc.herbert@intel.com>
marc-hb added a commit to marc-hb/sof that referenced this pull request Sep 14, 2021
This shows that thesofproject#4636 did not fix any real-world issue. thesofproject#4636 changed
some sof->trace logic, however sof->trace does not exist before
trace_init() and calling tr_err() immediately after trace_init() works.

Note DSP panics were detected immediately after thesofproject#4636 was merged, see
reports in thesofproject#4676

Signed-off-by: Marc Herbert <marc.herbert@intel.com>
@marc-hb
Copy link
Collaborator

marc-hb commented Sep 14, 2021

Finally submitted a revert in #4760

lgirdwood pushed a commit that referenced this pull request Sep 15, 2021
As reported in #4759, #4636 and a few others linked from there.

Signed-off-by: Marc Herbert <marc.herbert@intel.com>
marc-hb added a commit to marc-hb/sof that referenced this pull request Sep 21, 2021
As reported in thesofproject#4759, thesofproject#4636 and a few others linked from there.

Signed-off-by: Marc Herbert <marc.herbert@intel.com>
(cherry picked from commit 3ff1dc0)
keyonjie pushed a commit that referenced this pull request Nov 2, 2021
As reported in #4759, #4636 and a few others linked from there.

Signed-off-by: Marc Herbert <marc.herbert@intel.com>
(cherry picked from commit 3ff1dc0)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants