trace: enable trace after it is ready #4636

keyonjie · 2021-08-13T12:31:55Z

If the log tracing (e.g. tr_err()) is called before the trace itself is
available, the FW will crash and FW boot fail happen.

Enable the trace after it is ready, and don't try to perform tracing
when it is unavailable.

Signed-off-by: Keyon Jie yang.jie@linux.intel.com

dbaluta · 2021-08-13T12:54:23Z

src/trace/dma-trace.c

So, calling dma_trace_init_complete will turn the tracing on BUT calling trace_init will not turn the tracing ON.

I feel that the normal sequence of enabling tracing should be:

dma_trace_init_complete() trace_on()

I think is called DMA trace.

And for the other trace we shold do:

trace_init() trace_on()

good point, moving and unifying the trace_on() invoking to the end of primary_core_init() looks better to me.

lgirdwood

@dbaluta good for you now ? Want to get this in the v1.9 stable branch

lgirdwood · 2021-08-17T16:11:00Z

@keyonjie seeing this in the CI for CMl

[  554.456130] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: ASoC: error at snd_soc_pcm_component_pm_runtime_get on 0000:00:1f.3: -22
[  554.456174] kernel:  SSP0-Codec: soc_pcm_open() failed (-22)
[  554.456193] kernel:  Port1: ASoC: dpcm_be_dai_startup() failed at SSP0-Codec (-22)
[  554.456212] kernel:  Port1: dpcm_fe_dai_startup() failed (-22)
[  555.096177] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: Copying cached debugfs data
[  555.096667] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: Copying cached debugfs data
[  555.098959] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: Copying cached debugfs data
[  555.098981] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: Copying cached debugfs data
[  556.647310] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: error: enabling device failed: -22
[  556.647348] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: error: fail in sof_ipc_trace_update_filter -22
[  556.647613] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: Copying cached debugfs data
[  556.648022] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: Copying cached debugfs data
[  556.650207] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: Copying cached debugfs data
[  556.650226] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: Copying cached debugfs data
[  557.242890] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: error: enabling device failed: -22
[  557.242925] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: error: fail in sof_ipc_trace_update_filter -22
[  557.243190] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: Copying cached debugfs data
[  557.243519] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: Copying cached debugfs data
[  557.245698] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: Copying cached debugfs data
[  557.245718] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: Copying cached debugfs data

Will rerun CI tests to rule out script or DUT.

lgirdwood · 2021-08-17T16:11:07Z

SOFCI TEST

keyonjie · 2021-08-18T05:13:52Z

@lgirdwood looks the failure on CML module unload/reload is valid, let me try to figure out it tomorrow.

dbaluta · 2021-08-18T07:00:47Z

@lgirdwood looks good to me. When do you plan to create the v1.9 branch/tag?

Good to merge once @keyonjie sorts out the CI failure.

keyonjie · 2021-08-18T10:11:38Z

@lgirdwood @dbaluta looks the CI failure is not related, just rebased and force push to trigger a re-test.

lgirdwood · 2021-08-18T12:00:57Z

@keyonjie looks like a null pointer deref race after latest CI reults.
https://sof-ci.01.org/sofpr/PR4636/build10036/devicetest/?model=TGLU_RVP_SDW&testcase=check-suspend-resume-5

lgirdwood · 2021-08-19T10:36:09Z

@lgirdwood looks good to me. When do you plan to create the v1.9 branch/tag?

Sorry, now done (everyone is on holiday so my workload is high), we have the stable-v1.9 branch, there are a couple of fixes pending though prior to rc1. I've no issues if you need to do a rc1 tag if needed. The fixes Intel fixes will probably land next week (which I could tag as rc2).

dbaluta · 2021-08-23T15:58:55Z

Thanks, no need for tagging on our side. I will go with your pace for tagging and RCs.

If the log tracing (e.g. tr_err()) is called before the trace itself is available, the FW will crash and FW boot fail happen. Enable the trace after it is ready, and don't try to perform tracing when it is unavailable. We have the empty version of trace_init/on(), so the extra "#ifdef" in primary_core_init() is superfluous. Signed-off-by: Keyon Jie <yang.jie@linux.intel.com>

lgirdwood · 2021-08-24T16:33:57Z

@kkarask @wszypelt good to merge ? looks like CI is pending ?

marc-hb · 2021-08-25T15:24:02Z

If the log tracing (e.g. tr_err()) is called before the trace itself is available, the FW will crash and FW boot fail happen.

That's not my experience; I've seen tr_xxx() called thousands of times before the trace was initialized and this never caused any issue because the code was smart enough not to do anything in this case. See for instance PR #4334

What crashed exactly and where? Bug fixes should fix well identified bug sequences, not just "something crashes"

marc-hb · 2021-08-25T15:26:15Z

... and now the trace is crashing!! See #4676 report.

Revert this ASAP?

plbossart · 2021-08-25T15:28:25Z

@marc-hb a 'fix' was merged already with PR #4678

I still have strong questions on how this slipped through the cracks. Something's not right here.

marc-hb · 2021-08-25T15:32:15Z

Something's not right here.

Easy: merging bug fixes for bugs that have no description, no test and no reproduction steps.

keyonjie · 2021-08-26T02:00:56Z

If the log tracing (e.g. tr_err()) is called before the trace itself is available, the FW will crash and FW boot fail happen.

That's not my experience; I've seen tr_xxx() called thousands of times before the trace was initialized and this never caused any issue because the code was smart enough not to do anything in this case. See for instance PR #4334

What crashed exactly and where? Bug fixes should fix well identified bug sequences, not just "something crashes"

I hit this during the memory allocation debugging, let me submit a PR to demonstrate it.

keyonjie · 2021-08-26T02:24:07Z

@marc-hb please check the APL result here: #4683
Didn't file an issue for it as it requires code change to find it, and you were not here :)

keyonjie · 2021-08-26T02:58:07Z

@marc-hb a 'fix' was merged already with PR #4678

I still have strong questions on how this slipped through the cracks. Something's not right here.

I just recheck the CI result of the PR, we didn't capture failure on multiple platforms but it did observe the issue on ADLP_RVP_NOCODEC, maybe the "always failure" on _ZEPHYR platforms had reduced attentions from reviewers and it got slipped.

marc-hb · 2021-08-31T20:19:55Z

Another regression in #4699

lgirdwood · 2021-08-31T20:47:57Z

Another regression in #4699

@marc-hb Is this blocking you with the Zephyr trace fix ?

marc-hb · 2021-08-31T21:18:05Z

I don't know that yet, I'm just getting back into it. I was just connecting github dots above.

keyonjie · 2021-09-01T02:01:15Z

Another regression in #4699

indeed, @ranj063 raised this to me yesterday, I didn't observe it as the mailbox trace is disabled by default, so as the CI.

marc-hb · 2021-09-01T02:54:50Z

I didn't observe it as the mailbox trace is disabled by default, so as the CI.

I don't understand what this means. If you look at the sof-logger results in some random, recent PR you can see that the logger.etrace header is present on every platform: https://sof-ci.01.org/sofpr/PR4701/build10153/devicetest/

paulstelian97 · 2021-09-01T07:32:56Z

@marc-hb Mailbox trace being disabled by default means that etrace will not get the regular traces, although error traces might go through it anyway. That's the default behaviour I observe on my end.

marc-hb · 2021-09-01T16:05:31Z

Yes that's correct, I remember now. I had forgotten that very poor name in the Kconfig interface, if someone had said TRACEM instead I would have understood immediately :-)

As reported in thesofproject#4759, thesofproject#4636 and a few others linked from there. Signed-off-by: Marc Herbert <marc.herbert@intel.com>

This shows that thesofproject#4636 did not fix any real-world issue. thesofproject#4636 changed some sof->trace logic, however sof->trace does not exist before trace_init() and calling tr_err() immediately after trace_init() works. Note DSP panics were detected immediately after thesofproject#4636 was merged, see reports in thesofproject#4676 Signed-off-by: Marc Herbert <marc.herbert@intel.com>

marc-hb · 2021-09-14T07:34:52Z

Finally submitted a revert in #4760

As reported in #4759, #4636 and a few others linked from there. Signed-off-by: Marc Herbert <marc.herbert@intel.com>

As reported in thesofproject#4759, thesofproject#4636 and a few others linked from there. Signed-off-by: Marc Herbert <marc.herbert@intel.com> (cherry picked from commit 3ff1dc0)

As reported in #4759, #4636 and a few others linked from there. Signed-off-by: Marc Herbert <marc.herbert@intel.com> (cherry picked from commit 3ff1dc0)

keyonjie requested a review from akloniex as a code owner August 13, 2021 12:31

keyonjie requested a review from lgirdwood August 13, 2021 12:32

dbaluta reviewed Aug 13, 2021

View reviewed changes

keyonjie force-pushed the master branch from fa6489e to f071af1 Compare August 16, 2021 01:20

keyonjie requested review from lbetlej, mmaka1 and plbossart as code owners August 16, 2021 01:20

keyonjie requested a review from dbaluta August 16, 2021 01:20

lgirdwood approved these changes Aug 16, 2021

View reviewed changes

dbaluta approved these changes Aug 18, 2021

View reviewed changes

keyonjie force-pushed the master branch from f071af1 to 5822c9f Compare August 18, 2021 10:00

keyonjie force-pushed the master branch from 5822c9f to 03059c1 Compare August 24, 2021 11:41

lgirdwood merged commit 7df3674 into thesofproject:main Aug 24, 2021

keyonjie mentioned this pull request Aug 25, 2021

dma-trace: add check to avoid dereference from NULL #4678

Merged

plbossart mentioned this pull request Aug 25, 2021

[BUG] DSP panic when initializing the DMA trace #4676

Closed

lgirdwood pushed a commit that referenced this pull request Sep 15, 2021

alloc.c: fix DEBUG_TRACE_PTR() not to trace before trace is initialized

3ff1dc0

As reported in #4759, #4636 and a few others linked from there. Signed-off-by: Marc Herbert <marc.herbert@intel.com>

trace: enable trace after it is ready #4636

trace: enable trace after it is ready #4636

Uh oh!

Conversation

keyonjie commented Aug 13, 2021

Uh oh!

dbaluta Aug 13, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

keyonjie Aug 16, 2021

Choose a reason for hiding this comment

Uh oh!

lgirdwood left a comment

Choose a reason for hiding this comment

Uh oh!

lgirdwood commented Aug 17, 2021

Uh oh!

lgirdwood commented Aug 17, 2021

Uh oh!

keyonjie commented Aug 18, 2021

Uh oh!

dbaluta commented Aug 18, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

keyonjie commented Aug 18, 2021

Uh oh!

lgirdwood commented Aug 18, 2021

Uh oh!

lgirdwood commented Aug 19, 2021

Uh oh!

dbaluta commented Aug 23, 2021

Uh oh!

lgirdwood commented Aug 24, 2021

Uh oh!

marc-hb commented Aug 25, 2021

Uh oh!

marc-hb commented Aug 25, 2021

Uh oh!

plbossart commented Aug 25, 2021

Uh oh!

marc-hb commented Aug 25, 2021

Uh oh!

keyonjie commented Aug 26, 2021

Uh oh!

keyonjie commented Aug 26, 2021

Uh oh!

keyonjie commented Aug 26, 2021

Uh oh!

marc-hb commented Aug 31, 2021

Uh oh!

lgirdwood commented Aug 31, 2021

Uh oh!

marc-hb commented Aug 31, 2021

Uh oh!

keyonjie commented Sep 1, 2021

Uh oh!

marc-hb commented Sep 1, 2021

Uh oh!

paulstelian97 commented Sep 1, 2021

Uh oh!

marc-hb commented Sep 1, 2021

Uh oh!

marc-hb commented Sep 14, 2021

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

6 participants

dbaluta Aug 13, 2021 •

edited

Loading

dbaluta commented Aug 18, 2021 •

edited

Loading