Skip to content

Conversation

@aiChaoSONG
Copy link

On ICL, the info "sof-audio-pci 0000:00:1f.3:
status = 0x00000000 panic = 0x00000000" causes
test failure because we detect the word "panic".

This patch ignores this pattern.

Signed-off-by: Amery Song chao.song@intel.com

@aiChaoSONG aiChaoSONG requested a review from a team as a code owner September 8, 2020 02:25
@marc-hb marc-hb requested review from kv2019i and plbossart September 8, 2020 03:18
Copy link
Collaborator

@marc-hb marc-hb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • I don't think we want to ignore every panic message like this but only the ones with
    status = 0x00000000 panic = 0x00000000.

  • Are we sure all ICL devices are located on PCI 0000:00:1f.3?

  • @kv2019i do we have a bug about this?

  • Can you please help review and merge #362 first? I bet it will conflict with this. #362 has been submitted for longer and this is much smaller so it will easier to rebase. UPDATE: done, thanks.

@aiChaoSONG
Copy link
Author

@marc-hb

*I don't think we want to ignore every panic message like this but only the ones with status = 0x00000000 panic = 0x00000000.

Pattern like status = 0x???????? panic = 0x???????? is just info, not error, and as Xiuli stated, we do see other code, so we need to ignore this pattern.

Are we sure all ICL devices are located on PCI 0000:00:1f.3

Yes, all ICL devices have the same PCI ID: 0000:00:1f.3, but other platforms may have a different PCI ID, and false error message also seen on other platform, so here I will use an RE also.

@marc-hb
Copy link
Collaborator

marc-hb commented Sep 8, 2020

Pattern like status = 0x???????? panic = 0x???????? is just info, not error,

Then why not just add --level like this in the right place:

dmesg --level=err,warn

Grepping for "panic" in info messages seems silly.

(In fact searching for keywords at all instead of using levels seems a bit silly but let's not go there yet)

@aiChaoSONG
Copy link
Author

why not just add --level like this in the right place

I regard this PR (or even other PR related to ignore some specified message) as a workaround. Just as you said, the best way is to use journalctl. I will try dmesg --level, and see if it has bad impact. The only concern I have is that we may lose error context.

Copy link
Contributor

@kv2019i kv2019i left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add a reference to the bug in the filter comment, but with that addressed, this seems good to go for me.

@marc-hb
Copy link
Collaborator

marc-hb commented Sep 8, 2020

The only concern I have is that we may lose error context.

Agreed: we should use --level to decide whether there is an error or not but we also want to collect all levels anyway for the record. If this requires a big code change then this is off-topic, forget it for now.

journalctl will have the same requirement.

BTW #361 could be helpful here.

On some platforms, the info "sof-audio-pci 0000:00:1f.3:
status = 0x00000000 panic = 0x00000000" causes test
failure because we detect the word "panic".

This patch ignores this pattern.

Signed-off-by: Amery Song <chao.song@intel.com>
@xiulipan xiulipan self-requested a review September 9, 2020 03:12
@xiulipan
Copy link
Contributor

xiulipan commented Sep 9, 2020

@kv2019i @marc-hb This workaround LGTM, so I will merge.
Let's force on the journalctl now. I think with journalctl we can have full support for the level and ignore the dmesg size limitation.

@xiulipan xiulipan merged commit ac415de into thesofproject:master Sep 9, 2020
@ranj063
Copy link
Contributor

ranj063 commented Sep 9, 2020

@aiChaoSONG can you please add a fix to this change to only filter this message when it is not prefixed by "error:" and not otherwise?

@aiChaoSONG
Copy link
Author

@aiChaoSONG can you please add a fix to this change to only ignore this message when it is prefixed by "error:" and not otherwise?

We have to ignore this pattern even it is not prefixed by "error", because the word "panic" also causes test case failure.

@marc-hb
Copy link
Collaborator

marc-hb commented Sep 9, 2020

Pattern like status = 0x???????? panic = 0x???????? is just info, not error, and as Xiuli stated, we do see other code, so we need to ignore this pattern.

Since this message appeared in CI (see thesofproject/sof#3395 for some history) I've never seen any other values than status = 0x00000000 panic = 0x00000000. Do you have examples of other values? I mean examples that should also be ignored for some (other?) well known reason? Which other reason besides 3395?

@marc-hb
Copy link
Collaborator

marc-hb commented Sep 9, 2020

I will try dmesg --level, and see if it has bad impact.

@aiChaoSONG did you try --level before @xiulipan merged this?

Let's force on the journalctl now.

Do you mean "accelerate"?

I think with journalctl we can have full support for the level

I'm not sure what better --level option journalctl offers that we would use. The value of journalctl is (as you wrote) unlimited log size and also interleaving with user space errors.

UPDATE, thanks @aiChaoSONG : journalctl may not have better a --level option than dmesg but for sure it has a "better" one than kern.log which has... none.

@aiChaoSONG
Copy link
Author

@marc-hb I am afraid it doesn't work.

[[ "$begin_line" -eq 0 ]] && cmd="dmesg || cmd="sed -n '$begin_line,\$p' /var/log/kern.log"

dmesg is used only when there is no new logs in /var/log/kern.log, even I add dmesg --level=warn here, it may not function.

that's why we want journalctl to replace all the log thing.

@kv2019i
Copy link
Contributor

kv2019i commented Sep 9, 2020

@ranj063 wrote:

@aiChaoSONG can you please add a fix to this change to only ignore this message when it is prefixed by "error:" and not otherwise?

You meant the reverse, right @ranj063 ? We want to raise CI errors only if this trace is printed with "error:" prefix. I..e we should filter out traces that do not have "error:" (the 0...2 attempts).

@marc-hb
Copy link
Collaborator

marc-hb commented Sep 9, 2020

You meant the reverse, right @ranj063 ?

@ranj063 confirmed she did the reverse in another, private conversation. Anyway we don't have levels in kern.log so this will have to wait for journalctl as clarified by @aiChaoSONG

@marc-hb
Copy link
Collaborator

marc-hb commented Sep 10, 2020

After some more online and offline discussions it's clear this PR ignores way too many possible errors / potential "green failures". Also, hiding which platforms experience the issue and what status they print is not going to help address it in the long run. So I'm making this exclude much narrower in #372 so we can spot and record other occurrences, see source code comment there (@xiuli please wait until that all questions are answered and all discussions are finished before merging a PR)

BTW this is not a "false" error, this is a known error.

@marc-hb marc-hb added the area:non-audio Failure False positives: failing when we don't want to label Oct 15, 2021
@aiChaoSONG aiChaoSONG deleted the icl_fix branch October 24, 2022 08:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area:non-audio Failure False positives: failing when we don't want to

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants