@xiulipan
Contributor

@xiulipan xiulipan commented Nov 4, 2020

The latest test results show a DSP reset failure with random status and panic values:
sof-audio-pci 0000:00:0e.0: status = 0xecc00301 panic = 0x00000000
Update the regex to cover this.

Signed-off-by: Pan Xiuli <xiuli.pan@linux.intel.com>

@xiulipan xiulipan requested a review from a team as a code owner November 4, 2020 09:02
@xiulipan
Contributor Author

xiulipan commented Nov 4, 2020

latest random error log [ 1279.579151] sof-audio-pci 0000:00:0e.0: status = 0xecc00301 panic = 0x00000000

@marc-hb
Collaborator

marc-hb commented Nov 4, 2020

@xiulipan can you confirm that the rest of the logs look exactly like thesofproject/sof#3395?

Can you also please share an approximate version/date for "the latest DSP"?

@kv2019i, @fredoh9, @ranj063, @lgirdwood, others: any idea what caused this sudden change?

marc-hb previously requested changes Nov 4, 2020
Collaborator

@marc-hb marc-hb left a comment

Can we have at least a couple of specific words in the commit subject and message that refer to the underlying bug? If lacking imagination, you can copy some keywords from thesofproject/sof#3395

 # Buglink: https://github.com/thesofproject/sof/issues/3395
-ignore_str="$ignore_str"'|sof-audio-pci 0000:00:..\..: status = 0x[0]{8} panic = 0x[0]{8}'
+ignore_str="$ignore_str"'|sof-audio-pci 0000:00:..\..: status = 0x[0-f]{8} panic = 0x[0-f]{8}'

Collaborator

Did you observe random panic values too?

Contributor Author

I can't remember; I made both values regexes to avoid random failures.

Collaborator

How will ignoring any status and any panic not ignore other, totally unrelated panics?

Contributor Author

@xiulipan xiulipan Nov 6, 2020

All "sof-audio-pci 0000:00:..\..: status" lines should be ignored, as they all come from DSP reset attempts. Real panic dumps start with "sof-audio-pci 0000:00:..\..: error: status" (see thesofproject/linux#2382).
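
For illustration only (not part of the PR), here is a minimal sketch that checks two sample lines against the reset-retry pattern. The helper name is_ignored is hypothetical, and the error-tagged sample line is constructed from the prefix quoted above rather than taken from a real log. The hex class is spelled [0-9a-f] as an assumption; in the C locale the [0-f] range used in the PR also covers uppercase letters and some punctuation, so it is wider than strictly needed.

```shell
#!/bin/sh
# Sketch only: mirrors the pattern added in this PR, but with the hex class
# written as [0-9a-f] (assumption; the PR itself uses [0-f]).
reset_retry_re='sof-audio-pci 0000:00:..\..: status = 0x[0-9a-f]{8} panic = 0x[0-9a-f]{8}'

# is_ignored LINE: hypothetical helper, succeeds when LINE matches the pattern.
is_ignored() {
    printf '%s\n' "$1" | LC_ALL=C grep -Eq "$reset_retry_re"
}

# Line taken from the log in this thread (reset retry, no "error:" tag).
retry_line='sof-audio-pci 0000:00:0e.0: status = 0xecc00301 panic = 0x00000000'
# Constructed example of a dump tagged as an error (illustrative, not a real log).
error_line='sof-audio-pci 0000:00:0e.0: error: status = 0xecc00301 panic = 0x00000000'

is_ignored "$retry_line" && echo "ignored: reset retry dump"
is_ignored "$error_line" || echo "kept: dump tagged with error:"
```

Because the pattern requires ": status" directly after the PCI address, a line carrying the "error: " tag in between does not match and would still be reported.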

Contributor

It indeed seems we are getting multiple panic codes for #3395. @marc-hb This is potentially OK, as in all current cases the dump is printed with sof_dev_dbg_or_err() and will always have an "error: " prefix in the message if the dump is really for an error case. The problem is that this relies on the callers (of hda_dsp_dump()) to set the error flag, which might break in the future.

@xiulipan Would it be OK to limit this to the ICL platform only? We shouldn't have random DSP resets on any other platforms currently, right?

Contributor

For ICL,
sof-audio-pci 0000:00:1f.3: status = 0x00000000 panic = 0x00000000
For GLK,
sof-audio-pci 0000:00:0e.0: status = 0xecc00301 panic = 0x00000000
So should we NOT ignore panic codes other than 0? I can't imagine there is a non-zero panic code that we would want to ignore.
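
A rough sketch of what that narrower alternative could look like, assuming any status value is accepted but the panic value must be all zeros. The pattern and the helper name matches_zero_panic are hypothetical and were not proposed verbatim in this thread. The third sample line is the ICL log posted later in this conversation.

```shell
#!/bin/sh
# Hypothetical narrower pattern: any status value, panic must be 0x00000000.
zero_panic_re='sof-audio-pci 0000:00:..\..: status = 0x[0-9a-f]{8} panic = 0x0{8}'

# matches_zero_panic LINE: hypothetical helper, succeeds when LINE matches.
matches_zero_panic() {
    printf '%s\n' "$1" | LC_ALL=C grep -Eq "$zero_panic_re"
}

icl_line='sof-audio-pci 0000:00:1f.3: status = 0x00000000 panic = 0x00000000'
glk_line='sof-audio-pci 0000:00:0e.0: status = 0xecc00301 panic = 0x00000000'
nonzero_panic='sof-audio-pci 0000:00:1f.3: status = 0x0cbada8f panic = 0x1303de51'

for line in "$icl_line" "$glk_line" "$nonzero_panic"; do
    if matches_zero_panic "$line"; then
        echo "ignored:  $line"
    else
        echo "reported: $line"
    fi
done
```

With this variant the ICL and GLK lines quoted above would be ignored, while the later log with a non-zero panic value would still be reported, so it would not cover every case seen in this thread.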

Contributor Author

@kv2019i The DSP reset retry attempts were not added only for ICL platforms; this is a common issue on all platforms (but at a lower rate on the others). We do see the issue on GLK, CNL, ICL and TGL. If ICCMAX is enabled, we may have the same failure rate for DSP reset.

PS: "DSP reset" is not only a DSP reset; it also includes init communication with CSME.

Contributor

@xiulipan Sorry for the late response to this. Ack, I can see you added a mention of other platforms to linux#3395. So that does complicate matters.

I'm still a bit concerned that this could hide problems at some point in the future, as the dbg_dump() method is a generic one. E.g. we have snd_sof_handle_fw_exception(), which calls snd_sof_dsp_dbg_dump(). I did check all current instances where this is called, and in each and every place there is a dev_err() print on the same code path, so CI would catch the error.

Given the options on the table, I'd say it's OK to ignore panic prints that are not tagged as errors. Conversely, if the DSP status is dumped because of an error (like an exception), the driver is expected to emit at least one error trace, which CI can catch.
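
The reasoning above can be sketched as a small filter, assuming the checker simply drops every line that matches the accumulated ignore_str alternation and reports whatever remains. The seed entry, the filter_log helper and the error message text are hypothetical; only the way ignore_str is extended comes from the diff in this PR.

```shell
#!/bin/sh
# Hypothetical seed entry so the alternation never starts with an empty branch.
ignore_str='hypothetical-first-entry'
# Extended the same way as in the PR diff (hex class spelled out as an assumption).
ignore_str="$ignore_str"'|sof-audio-pci 0000:00:..\..: status = 0x[0-9a-f]{8} panic = 0x[0-9a-f]{8}'

# filter_log: hypothetical helper, reads log lines on stdin and prints
# only those that do NOT match any ignored pattern.
filter_log() {
    LC_ALL=C grep -Ev "$ignore_str"
}

# The second line stands in for the dev_err() print expected on a real error path.
remaining=$(printf '%s\n' \
    'sof-audio-pci 0000:00:0e.0: status = 0xecc00301 panic = 0x00000000' \
    'sof-audio-pci 0000:00:0e.0: error: hypothetical failure message' | filter_log)

echo "$remaining"
```

Only the line tagged with "error: " survives the filter, which is the property the review relies on: the untagged status dump can be ignored as long as a separate error trace is emitted on the same code path.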

@xiulipan
Contributor Author

xiulipan commented Nov 5, 2020

@marc-hb logs pasted below: same pattern as thesofproject/sof#3395 but with different values. I updated the commit message to make the issue clearer.

sh-glk-bob-da7219-03 kernel: [ 1279.078255] sof-audio-pci 0000:00:0e.0: Attempting iteration 0 of Core En/ROM load...
sh-glk-bob-da7219-03 kernel: [ 1279.078605] sof-audio-pci 0000:00:0e.0: FW Poll Status: reg=0x3030303 successful
sh-glk-bob-da7219-03 kernel: [ 1279.078628] sof-audio-pci 0000:00:0e.0: FW Poll Status: reg=0x3030302 successful
sh-glk-bob-da7219-03 kernel: [ 1279.078633] sof-audio-pci 0000:00:0e.0: unstall/run core: core_mask = 1
sh-glk-bob-da7219-03 kernel: [ 1279.078639] sof-audio-pci 0000:00:0e.0: DSP core(s) enabled? 1 : core_mask 1
sh-glk-bob-da7219-03 kernel: [ 1279.579125] sof-audio-pci 0000:00:0e.0: FW Poll Status: reg=0x0 timedout
sh-glk-bob-da7219-03 kernel: [ 1279.579140] sof-audio-pci 0000:00:0e.0: unknown ROM status value 80000000
sh-glk-bob-da7219-03 kernel: [ 1279.579151] sof-audio-pci 0000:00:0e.0: status = 0xecc00301 panic = 0x00000000
sh-glk-bob-da7219-03 kernel: [ 1279.579168] sof-audio-pci 0000:00:0e.0: extended rom status:  0x80000000 0xecc00301 0x0 0x0 0x0 0x0 0x0 0x0
sh-glk-bob-da7219-03 kernel: [ 1279.579185] sof-audio-pci 0000:00:0e.0: unknown ROM status value 80000000
sh-glk-bob-da7219-03 kernel: [ 1279.579196] sof-audio-pci 0000:00:0e.0: FW Poll Status: reg=0x3030303 successful
sh-glk-bob-da7219-03 kernel: [ 1279.579207] sof-audio-pci 0000:00:0e.0: FW Poll Status: reg=0x303 successful
sh-glk-bob-da7219-03 kernel: [ 1279.579217] sof-audio-pci 0000:00:0e.0: DSP core(s) enabled? 0 : core_mask 3
sh-glk-bob-da7219-03 kernel: [ 1279.579225] sof-audio-pci 0000:00:0e.0: Attempting iteration 1 of Core En/ROM load...
sh-glk-bob-da7219-03 kernel: [ 1279.579240] sof-audio-pci 0000:00:0e.0: FW Poll Status: reg=0x3030303 successful
sh-glk-bob-da7219-03 kernel: [ 1279.579256] sof-audio-pci 0000:00:0e.0: FW Poll Status: reg=0x3030302 successful
sh-glk-bob-da7219-03 kernel: [ 1279.579264] sof-audio-pci 0000:00:0e.0: unstall/run core: core_mask = 1
sh-glk-bob-da7219-03 kernel: [ 1279.579284] sof-audio-pci 0000:00:0e.0: DSP core(s) enabled? 1 : core_mask 1
sh-glk-bob-da7219-03 kernel: [ 1279.580448] sof-audio-pci 0000:00:0e.0: FW Poll Status: reg=0x40000000 successful
sh-glk-bob-da7219-03 kernel: [ 1279.580476] sof-audio-pci 0000:00:0e.0: FW Poll Status: reg=0x1010202 successful
sh-glk-bob-da7219-03 kernel: [ 1279.608595] sof-audio-pci 0000:00:0e.0: FW Poll Status: reg=0x5000001 successful
sh-glk-bob-da7219-03 kernel: [ 1279.624602] sof-audio-pci 0000:00:0e.0: FW Poll Status: reg=0x5 successful
sh-glk-bob-da7219-03 kernel: [ 1279.624614] sof-audio-pci 0000:00:0e.0: FW Poll Status: reg=0x140000 successful
sh-glk-bob-da7219-03 kernel: [ 1279.624623] sof-audio-pci 0000:00:0e.0: Firmware download successful, booting...
sh-glk-bob-da7219-03 kernel: [ 1279.628071] sof-audio-pci 0000:00:0e.0: ipc rx: 0x70000000: FW_READY
sh-glk-bob-da7219-03 kernel: [ 1279.628078] sof-audio-pci 0000:00:0e.0: ipc: DSP is ready 0x70000000 offset 0x81000
sh-glk-bob-da7219-03 kernel: [ 1279.628087] sof-audio-pci 0000:00:0e.0: ipc rx done: 0x70000000: FW_READY
sh-glk-bob-da7219-03 kernel: [ 1279.628109] sof-audio-pci 0000:00:0e.0: firmware boot complete

@xiulipan xiulipan changed the title from "sof-kernel-log-check.sh: update the regex with latest random log" to "sof-kernel-log-check.sh: update the regex for DSP reset fails random logs" Nov 6, 2020
@xiulipan
Contributor Author

xiulipan commented Nov 6, 2020

@marc-hb updated commit message.

@xiulipan xiulipan requested a review from marc-hb November 6, 2020 03:28
Contributor

@kv2019i kv2019i left a comment

This does do the right thing with the current codebase, but to make this more robust against possible future changes in the driver (that could hide errors), proposal inline.

@marc-hb marc-hb dismissed their stale review November 12, 2020 01:52

github sucks. It shows no requested changes in the top right but it does at the bottom

@xiulipan
Contributor Author

update with a random log

Nov 13 00:03:16 sh-icl-rvp-hda-06 kernel: [ 9287.931455] sof-audio-pci 0000:00:1f.3: status = 0x0cbada8f panic = 0x1303de51

@marc-hb Are we ok to merge this

@marc-hb
Collaborator

marc-hb commented Nov 13, 2020

@marc-hb Are we ok to merge this

I have obviously no problem with the "code style" considering there's no "real" code change, just a regular expression/configuration change.

I've been trying but I am still nowhere close to a serious understanding of what these error(s) are so I won't approve myself. People who understand these errors should approve.

Contributor

@kv2019i kv2019i left a comment

See my rationale inline.

@xiulipan xiulipan merged commit 91a5604 into thesofproject:master Nov 18, 2020
@marc-hb marc-hb added the "area:logs" (Log and results collection, storage, etc.) and "False Pass / green failure" labels Jul 3, 2021