-
Notifications
You must be signed in to change notification settings - Fork 59
sof-kernel-log-check.sh: update the regex for DSP reset fails random logs #487
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
latest random error log |
|
@xiulipan can you confirm that the rest of the logs look exactly like thesofproject/sof#3395 ? Can you also please share some approximative version/date for "the latest DSP"? @kv2019i , @fredoh9 , @ranj063 , @lgirdwood , others any idea what caused this sudden change? |
marc-hb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we have at least a couple specific words in the commit subject and message that refer to the underlying bug? If lacking imagination you can copy some keywords from thesofproject/sof#3395
| # Buglink: https://github.com/thesofproject/sof/issues/3395 | ||
|
|
||
| ignore_str="$ignore_str"'|sof-audio-pci 0000:00:..\..: status = 0x[0]{8} panic = 0x[0]{8}' | ||
| ignore_str="$ignore_str"'|sof-audio-pci 0000:00:..\..: status = 0x[0-f]{8} panic = 0x[0-f]{8}' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did you observe random panic values too?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could not remember, make all into regex to avoid random failure.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How will ignoring any status and any panic not ignore other, totally unrelated panics?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All sof-audio-pci 0000:00:..\..: status should be ignored as they are all from DSP reset attempts. Real panic dump start with sof-audio-pci 0000:00:..\..: error: status thesofproject/linux#2382
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It indeed seems we are getting multiple panic codecs for #3395 . @marc-hb This is ok potentially ok as in all current cases, dump is printed with sof_dev_dbg_or_err() and will always has "error: " prefix in the message if the dump is really for an error case. The problem is that this relies on the callers (of hda_dsp_dump()) to set the error flag, which might break in the future.
@xiulipan Would it be ok to limit this only for ICL platform? We shouldn't have random DSP resets on any other platforms currently, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For ICL,
sof-audio-pci 0000:00:1f.3: status = 0x00000000 panic = 0x00000000
For GLK,
sof-audio-pci 0000:00:0e.0: status = 0xecc00301 panic = 0x00000000
then, should we NOT ignore panic code other than 0? I can't imagine there is panic code to ignore.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@kv2019i The DSP reset retry attempts is not only added for ICL platforms, it is a common issue for all platforms (but lower rate on others). We do see issue on GLK, CNL, ICL and TGL. If ICCMAX is enabled, we may have same fail rate for DSP reset.
PS: DSP reset is not only DSP reset, it also include init communication with CSME
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@xiulipan Sorry for late response to this. Ack, I can see you added mention of other platforms to linux#3395. So that does complicate matters.
I'm still a bit concerned that this could hide problems at some point in future as dbg_dump() method is a generic one. E.g. we have snd_sof_handle_fw_exception() which calls snd_sof_dsp_dbg_dump(). I did check all current instances where this is called, and in each and every place, there is a dev_err() print on the same code path, so CI would catch the error.
Given options on the tables, I'd say it's ok to ignore panic prints that are not tagged as errors. And reverse, if DSP status is dumped because of an error (like exception), driver is expected to emit at least one error trace, which CI can catch.
|
@marc-hb logs paste below, same pattern with thesofproject/sof#3395 but different value. Update the commit message to make issue more clear. |
de2cbe9 to
f828a31
Compare
…logs latest test result got DSP reset failure with random status and panic value sof-audio-pci 0000:00:0e.0: status = 0xecc00301 panic = 0x00000000 update the regex for this. Signed-off-by: Pan Xiuli <xiuli.pan@linux.intel.com>
f828a31 to
8fca82b
Compare
|
@marc-hb updated commit message. |
kv2019i
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This does do the right thing with current codebase, but to make this more robust against possible future changes in driver (that could hide erorrs), proposal inline.
| # Buglink: https://github.com/thesofproject/sof/issues/3395 | ||
|
|
||
| ignore_str="$ignore_str"'|sof-audio-pci 0000:00:..\..: status = 0x[0]{8} panic = 0x[0]{8}' | ||
| ignore_str="$ignore_str"'|sof-audio-pci 0000:00:..\..: status = 0x[0-f]{8} panic = 0x[0-f]{8}' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It indeed seems we are getting multiple panic codecs for #3395 . @marc-hb This is ok potentially ok as in all current cases, dump is printed with sof_dev_dbg_or_err() and will always has "error: " prefix in the message if the dump is really for an error case. The problem is that this relies on the callers (of hda_dsp_dump()) to set the error flag, which might break in the future.
@xiulipan Would it be ok to limit this only for ICL platform? We shouldn't have random DSP resets on any other platforms currently, right?
github sucks. It shows no requested changes in the top right but it does at the bottom
|
update with a random log
@marc-hb Are we ok to merge this |
I have obviously no problem with the "code style" considering there's no "real" code change, just a regular expression/configuration change. I've been trying but I am still nowhere close to a serious understanding of what these error(s) are so I won't approve myself. People who understand these errors should approve. |
kv2019i
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See my rationale inline.
| # Buglink: https://github.com/thesofproject/sof/issues/3395 | ||
|
|
||
| ignore_str="$ignore_str"'|sof-audio-pci 0000:00:..\..: status = 0x[0]{8} panic = 0x[0]{8}' | ||
| ignore_str="$ignore_str"'|sof-audio-pci 0000:00:..\..: status = 0x[0-f]{8} panic = 0x[0-f]{8}' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@xiulipan Sorry for late response to this. Ack, I can see you added mention of other platforms to linux#3395. So that does complicate matters.
I'm still a bit concerned that this could hide problems at some point in future as dbg_dump() method is a generic one. E.g. we have snd_sof_handle_fw_exception() which calls snd_sof_dsp_dbg_dump(). I did check all current instances where this is called, and in each and every place, there is a dev_err() print on the same code path, so CI would catch the error.
Given options on the tables, I'd say it's ok to ignore panic prints that are not tagged as errors. And reverse, if DSP status is dumped because of an error (like exception), driver is expected to emit at least one error trace, which CI can catch.
latest test result got DSP reset failure with random status and panic value
sof-audio-pci 0000:00:0e.0: status = 0xecc00301 panic = 0x00000000
update the regex for this.
Signed-off-by: Pan Xiuli xiuli.pan@linux.intel.com