Fix error filters and use journaltcl #354

plbossart · 2020-08-27T21:16:02Z

Remove useless filters, add new ones for Dell CML-U

marc-hb

Can you please add the bug numbers in the code itself?

plbossart · 2020-08-27T21:53:34Z

Can you please add the bug numbers in the code itself?

Done

tools/sof-kernel-log-check.sh

case-lib/lib.sh

The kernel was modified to only throw errors after the last boot attempt, there should be no need to filter in sof-test. Signed-off-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com>

This error is now handled in the topic/sof-dev kernel. Signed-off-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com>

TPM, iwlwifi and thermal zone warnings All warnings are checked with '.', '(' and ')' treated as escapes to avoid regexp issues. BugLink: thesofproject#307 BugLink: thesofproject#343 Signed-off-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com>

'.' is a special character in a regexp. Signed-off-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com>
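To illustrate why the escaping matters, here is a minimal sketch. The log lines and the pattern are made up for illustration and are not taken from the actual ignore list in sof-kernel-log-check.sh.

```bash
#!/usr/bin/env bash
# Illustrative log lines (made up, not from a real kernel log).
log='thermal zone 1.5 exceeded
thermal zone 1x5 exceeded'

# Unescaped '.' matches any single character, so both lines match.
unescaped_count=$(printf '%s\n' "$log" | grep -cE 'zone 1.5')

# Escaped '\.' matches only a literal dot, so a single line matches.
escaped_count=$(printf '%s\n' "$log" | grep -cE 'zone 1\.5')

echo "unescaped: $unescaped_count, escaped: $escaped_count"
```

An unescaped filter can therefore silently ignore more messages than intended.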

Not sure what this is except legacy cruft. Signed-off-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com>

plbossart · 2020-08-28T22:55:27Z

I will need some help here, I am completely out of my depth and I don't have a clue how these scripts are supposed to work. It's just horrible to even look at this code, sorry.
@marc-hb @aiChaoSONG @xiulipan please be mindful of folks like me who are users and don't have any plans to learn advanced bash. CI tests work and run-all-tests.sh doesn't but says PASS.

```
/underrun!!! (at least 177.843 ms long)
2020-08-28 22:50:48 UTC Sub-Test: [COMMAND] kill process: kill -9 36819
/root/sof-test/tools/sof-kernel-log-check.sh: line 3: 1: "NONE": syntax error: operand expected (error token is ""NONE"")
Failed to parse timestamp: 
Failed to parse timestamp: 
2020-08-28 22:50:49 UTC Sub-Test: [INFO] Test Result: PASS!
```

marc-hb · 2020-08-29T05:39:23Z

Not many people have a clue how these possibly buggy scripts work, and some of these people are gone. Switching to journalctl and error levels is awesome but it's not so easy to implement and very difficult to test because it requires injecting some errors manually. This (past) mistake caught by @aiChaoSONG (thx!) tells me that testing didn't happen yet:

KERNEL_LAST_TIMESTAMP=date "+%Y-%m-%d %H:%M:%S.%6N"
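For readers less familiar with bash, a short sketch of why that line does not do what it looks like. It assumes GNU `date` (for `%6N`); the `broken_value` name is only for illustration.

```bash
#!/usr/bin/env bash
# Broken form: this does NOT run date. The shell tries to run a command
# literally named "+%Y-%m-%d %H:%M:%S.%6N" with KERNEL_LAST_TIMESTAMP=date
# set only in that command's environment, so the shell variable stays unset.
unset KERNEL_LAST_TIMESTAMP
KERNEL_LAST_TIMESTAMP=date "+%Y-%m-%d %H:%M:%S.%6N" 2>/dev/null || true
broken_value=${KERNEL_LAST_TIMESTAMP:-<unset>}

# Fixed form: command substitution captures the output of date.
KERNEL_LAST_TIMESTAMP=$(date "+%Y-%m-%d %H:%M:%S.%6N")

echo "broken: $broken_value"
echo "fixed:  $KERNEL_LAST_TIMESTAMP"
```

The broken form leaves the variable empty, which is consistent with the "Failed to parse timestamp:" lines in the log above.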

So I don't think the switch to journalctl and error levels should be combined with unrelated changes and refactorings in the same PR and - hopefully no offence - I think it should be performed by someone who is (unfortunately) willing to "learn some advanced bash" or has already done so.

Fred said he already started to work on this, can we please wait until he's back from vacation next week and keep filtering "as usual" in the mean time? I already offered to help him any way I can.

A middle-ground is maybe to simply append a --level option to the existing dmesg and not perform any major surgery.

xiulipan

Thanks Pierre for implementing the journalctl feature. Comments on the fixes are inline.

tools/sof-kernel-log-check.sh

xiulipan

Some bash string quoting issues.

tools/sof-kernel-log-check.sh

test-case/check-ipc-flood.sh

plbossart · 2020-09-02T17:13:11Z

@xiulipan your suggestions don't work

2020-09-02 16:12:21 UTC [REMOTE_COMMAND] last timestamp check: 2020-09-02 16:12:20.480743
check begin_timestamp: 2020-09-02
check cmd: journalctl -k --since='2020-09-02'

I really hate this bash thing, it's just horrible.

Using dmesg lines precludes the filtering of dmesg error levels. Move to journalctl, use a severity filter defined as warning and use a timestamp to record previous checks. Signed-off-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com>

A lot more errors reported by CI, maybe due to longer logs? Signed-off-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com>

plbossart · 2020-09-02T20:18:57Z

@xiulipan your suggestions don't work
2020-09-02 16:12:21 UTC [REMOTE_COMMAND] last timestamp check: 2020-09-02 16:12:20.480743
check begin_timestamp: 2020-09-02
check cmd: journalctl -k --since='2020-09-02'
I really hate this bash thing, it's just horrible.

It turns out the parameter also needed to be in double-quotes. Not all scripts did it. Fixed now.
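The truncated `--since='2020-09-02'` in the log above is what word splitting produces when the timestamp is passed unquoted. A minimal sketch, where `first_arg` is a hypothetical stand-in for a script that reads `$1`:

```bash
#!/usr/bin/env bash
# Hypothetical stand-in for a script that only looks at its first argument.
first_arg() { printf '%s\n' "$1"; }

ts='2020-09-02 16:12:20.480743'

# Unquoted: the shell splits on the space, so $1 is only the date.
unquoted=$(first_arg $ts)

# Quoted: the whole timestamp arrives as a single argument.
quoted=$(first_arg "$ts")

echo "unquoted: $unquoted"
echo "quoted:   $quoted"
```

Every caller has to quote the variable, which is why fixing only some of the scripts was not enough.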

plbossart · 2020-09-02T20:19:54Z

@marc-hb @xiulipan @aiChaoSONG this should be ready to review now, I cleaned-up the commits. Phew.

plbossart · 2020-09-02T21:05:24Z

SOFCI TEST

marc-hb

I'm afraid the --since logic doesn't work and falls back on scanning the entire log in every test.

How did you test this locally? To inject errors easily I would locally replace the -k option in journalctl -k with --boot and add logger hello $0 statements in tests.

marc-hb · 2020-09-02T20:46:34Z

case-lib/lib.sh

 {
-    # shellcheck disable=SC2034 # external script will use it
-    KERNEL_LAST_LINE=$(wc -l /var/log/kern.log|awk '{print $1;}')
+    KERNEL_LAST_TIMESTAMP=$(date "+%Y-%m-%d %H:%M:%S.%6N")


Note this will break twice a year on systems with daylight saving time, see systemd/systemd#5194. Can you please add a comment with this URL? I can also add it later.

I believe all our CI systems are now configured with UTC, i.e. no daylight savings. @xiulipan , @aiChaoSONG confirm? UTC is the only sensible choice for an international project anyway...

I couldn't find a way to make journalctl understand a date in UTC time. If you find a solution this can be a follow-up PR.

I don't think there is any simple solution either, I'm merely asking to add a comment.
I'll do it in a follow-up PR (please don't "resolve" this thread)

marc-hb · 2020-09-02T21:41:02Z

test-case/check-capture.sh

-        # clean up dmesg
-        sudo dmesg -C
+        # discard old kernel logs
+	func_lib_setup_kernel_last_timestamp


Nit: nothing gets "discarded"

marc-hb · 2020-09-02T21:41:25Z

test-case/check-ipc-flood.sh

-# cleanup dmesg buffer before test
-sudo dmesg -c > /dev/null
+# discard old kernel logs
+func_lib_setup_kernel_last_timestamp


Nit: nothing gets "discarded".

marc-hb · 2020-09-02T21:48:48Z

tools/sof-kernel-log-check.sh

+# confirm begin_timestamp is valid, if it is not the number, search full log
+re='^[0-9]+$'
+if [[ $begin_timestamp =~ $re ]] ; then
+    cmd="journalctl -k"


I don't see how this can work because you have defined above: KERNEL_LAST_TIMESTAMP=$(date "+%Y-%m-%d %H:%M:%S.%6N")

Did you mean KERNEL_LAST_TIMESTAMP=$(date +%s.%6N)?

I'm afraid this code scans the complete kernel log since boot in each test and that we don't see it because you carefully filtered out all known issues. This would also explain why you had KERNEL_LAST_TIMESTAMP=date "+%Y-%m-%d %H:%M:%S.%6N" before and didn't notice.
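The mismatch can be checked in isolation. This sketch reuses the regular expression from the diff; the `classify` helper and the sample values are illustrative only.

```bash
#!/usr/bin/env bash
re='^[0-9]+$'

# Report which branch the check would take for a given value.
classify() {
    if [[ $1 =~ $re ]]; then
        echo "digits only"
    else
        echo "not digits only"
    fi
}

old_style='1234'                          # a line count, as used before this PR
new_style='2020-09-02 16:12:20.480743'    # format produced by the new date call

echo "old: $(classify "$old_style")"
echo "new: $(classify "$new_style")"
```

With the new format the value never matches, so whichever branch handles the non-matching case is the one taken every time.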

The commit message of the next commit is also a clue:

A lot more errors reported by CI, maybe due to longer logs?

The only (painful) way to test error handling is to make tests fail, there is no other option. It doesn't have to be in CI, instead you can just locally replace the -k option in journalctl -k with --boot and insert some logger hello1 commands in some tests.

this is for the case where no argument is passed. But in practice we always pass one, so that branch is never taken.

I still don't get all the details but for sure the comment is either very confusing or wrong because $begin_timestamp =~ $re doesn't check that the timestamp is "valid", I mean not valid in a way compatible with journalctl --since

If this is merely supposed to check the default value then the following code is enough and much simpler and easier to read:

DEFAULT_BEGIN=0 # or any better name
begin_timestamp=${1:-${DEFAULT_BEGIN}}
...
...
...
if [ "$begin_timestamp" = "$DEFAULT_BEGIN" ]; then

(BTW $begin_timestamp has spaces so it most definitely needs double quoting. Use shellcheck)
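A runnable sketch of that suggestion, under the assumption that the script only needs to choose between scanning the full log and scanning from a timestamp. The `build_cmd` function is a hypothetical wrapper added for illustration; it only prints the command instead of running it.

```bash
#!/usr/bin/env bash
# Sketch only: DEFAULT_BEGIN and build_cmd are illustrative names.
DEFAULT_BEGIN=0

build_cmd() {
    local begin_timestamp=${1:-${DEFAULT_BEGIN}}
    if [ "$begin_timestamp" = "$DEFAULT_BEGIN" ]; then
        # No argument given: scan the whole kernel log.
        echo "journalctl -k"
    else
        # Argument given: keep it inside quotes because it contains a space.
        printf "journalctl -k --since='%s'\n" "$begin_timestamp"
    fi
}

build_cmd
build_cmd '2020-09-02 16:12:20.480743'
```

Comparing against an explicit default makes the intent visible and avoids guessing at what a "valid" timestamp looks like.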

the test means that if there's anything other than numbers, it's not a timestamp.

I don't know why we are arguing about this.

tools/sof-kernel-log-check.sh

marc-hb · 2020-09-02T22:06:00Z

case-lib/lib.sh

 fi

-func_lib_setup_kernel_last_line()
+func_lib_setup_kernel_last_timestamp()


I think you could have kept this function name for now, it would have made this (risky) refactoring so much smaller. Too late.

it's just a rename, not a refactoring.

Your answer is only about the least relevant and accurate word of my comment.
I'm just saying this could have been much shorter hence significantly easier to review.

marc-hb · 2020-09-02T22:06:53Z

tools/sof-kernel-log-check.sh

    builtin exit 1
 fi

+# Handle Call Trace separately


Can you help me understand how this was done before? (if at all)

I have no idea what the previous code did:

err=$(eval $cmd|grep 'Call Trace' -A5 -B3)$(eval $cmd | grep -E "$err_str"|grep -vE "$ignore_str")

This previous code simply:

1. looked for Call Traces
2. looked for errors, ignoring known errors
3. concatenated 1. and 2. into $err

If $err is empty then it means both 1. and 2. are empty.

If either is non-empty then both were shown whereas you're not showing Call Traces anymore when there is an error.
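The behaviour described above can be reproduced on a made-up log. This is a restructured sketch of the quoted one-liner, not the original code: `check_log`, the sample logs and the two patterns are illustrative assumptions.

```bash
#!/usr/bin/env bash
# Made-up patterns for illustration only.
err_str='error'
ignore_str='known harmless'

check_log() {
    local log=$1 traces errors
    # 1. Call Traces with a little context (empty if none).
    traces=$(printf '%s\n' "$log" | grep -A5 -B3 'Call Trace' || true)
    # 2. Errors, minus the known ones (empty if none).
    errors=$(printf '%s\n' "$log" | grep -E "$err_str" | grep -vE "$ignore_str" || true)
    # 3. Concatenate: the result is empty only when both parts are empty.
    printf '%s' "$traces$errors"
}

clean_log='audio: firmware boot complete
wifi: error: known harmless failure'

bad_log='audio: firmware boot complete
audio: error: ipc timed out
wifi: error: known harmless failure'

[ -z "$(check_log "$clean_log")" ] && echo "clean log: PASS"
[ -n "$(check_log "$bad_log")" ] && echo "bad log: FAIL"
```

Because the two parts are concatenated, a single emptiness test on the result covers both Call Traces and unfiltered errors.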

Can you explain why you changed that code? Only because you didn't understand it?

yes, I didn't understand it, and that's a good enough reason, no?

Needed to (manually and locally) test our overly complex error handling framework, see for instance PR thesofproject#354 Signed-off-by: Marc Herbert <marc.herbert@intel.com>

marc-hb

tl;dr: I finally tested this PR for real and it introduces at least 1 or 2 regressions.

Based on some previous issues (now fixed) and discussions I think it's clear you haven't been testing this PR by making some tests FAIL. Tests that don't fail are not tested, so for such a risky change of our core (and messy) error detection "framework", green results in CI don't prove anything.

So I bit the bullet, implemented and tested a (small but very useful) function that injects fake kernel errors in the logs to properly test this part of the code at last. It's in PR #361, everyone please help review. It's small.

Now of course I used this new fake_kern_error() function to manually and locally add one-line errors in a few tests. Then I used run-all-tests.sh to locally check that all the tests where I injected errors were failing; and only those.

So far I had time to try only one combination of artificially failing tests yet this very small scale testing was enough to immediately find that you're introducing a "green failure" regression in check-playback.sh, see more details below.

I also observed a "red success" that is also introduced by this PR in another test but I didn't have time to analyze it yet. Hopefully tomorrow.

Please use PR #361 (or better) too to make some tests fail and actually test this PR. Draft PRs with fake errors should help too, I will likely launch a few tomorrow too.

I really do appreciate the huge effort to attempt to clean up this mess but we really can't afford to make it worse than it already is and the only way to avoid making it worse is to test any change very thoroughly. The worse the code is, the more testing any change requires.

marc-hb · 2020-09-03T05:59:02Z

case-lib/lib.sh

 fi

-func_lib_setup_kernel_last_line()
+func_lib_setup_kernel_last_timestamp()


Your answer is only about the least relevant and accurate word of my comment.
I'm just saying this could have been much shorter hence significantly easier to review.

marc-hb · 2020-09-03T06:00:41Z

case-lib/lib.sh

 {
-    # shellcheck disable=SC2034 # external script will use it
-    KERNEL_LAST_LINE=$(wc -l /var/log/kern.log|awk '{print $1;}')
+    KERNEL_LAST_TIMESTAMP=$(date "+%Y-%m-%d %H:%M:%S.%6N")


I don't think there is any simple solution either, I'm merely asking to add a comment.
I'll do it in a follow-up PR (please don't "resolve" this thread)

tools/sof-kernel-log-check.sh

marc-hb · 2020-09-03T07:20:35Z

tools/sof-kernel-log-check.sh

    builtin exit 1
 fi

+# Handle Call Trace separately


This previous code simply:

1. looked for Call Traces
2. looked for errors, ignoring known errors
3. concatenated 1. and 2. into $err

If $err is empty then it means both 1. and 2. are empty.

If either is non-empty then both were shown whereas you're not showing Call Traces anymore when there is an error.

Can you explain why you changed that code? Only because you didn't understand it?

marc-hb · 2020-09-03T07:34:24Z

test-case/multiple-pipeline-capture.sh

 done

-sof-kernel-log-check.sh $KERNEL_LAST_LINE >/dev/null
+sof-kernel-log-check.sh "$KERNEL_LAST_TIMESTAMP" >/dev/null


This call here did not seem to make much sense and still does not...

I just performed a dmesg->journalctl change. You can't possibly expect me to transform this bag of idiotic scripts into something perfect in one shot.

marc-hb · 2020-09-03T07:36:44Z

test-case/check-playback.sh

-        sudo dmesg -C
+        # discard old kernel logs
+	func_lib_setup_kernel_last_timestamp
+


dmesg -C here was a bug but a harmless bug because the sof-kernel-log-check.sh call at the end generally checks kern.log, not dmesg. Now you're turning this into a real, "green failure" bug.

I don't understand the comment, sorry. what is the issue?

marc-hb · 2020-09-03T07:59:04Z

tools/sof-kernel-log-check.sh

+# confirm begin_timestamp is valid, if it is not the number, search full log
+re='^[0-9]+$'
+if [[ $begin_timestamp =~ $re ]] ; then
+    cmd="journalctl -k"


I still don't get all the details but for sure the comment is either very confusing or wrong because $begin_timestamp =~ $re doesn't check that the timestamp is "valid", I mean not valid in a way compatible with journalctl --since

If this is merely supposed to check the default value then the following code is enough and much simpler and easier to read:

DEFAULT_BEGIN=0 # or any better name
begin_timestamp=${1:-${DEFAULT_BEGIN}}
...
...
...
if [ "$begin_timestamp" = "$DEFAULT_BEGIN" ]; then

(BTW $begin_timestamp has spaces so it most definitely needs double quoting. Use shellcheck)

plbossart · 2020-09-03T15:45:36Z

I AM DONE.

marc-hb · 2020-09-03T18:02:51Z

You can't possibly expect me to transform this bag of idiotic scripts into something perfect in one shot.

Not "perfect", certainly not. But not giving more misleading results than it gives now, which is unfortunately a huge challenge that I'm afraid you underestimated despite my "don't go there" warnings and my various, less ambitious suggestions. BTW I will resubmit now the "simpler" yet very useful commits of this PR in a smaller PR.

"Not giving more misleading results" = not adding any new green failure which unfortunately requires a significant test effort involving some fake errors (see small #361)

marc-hb · 2020-09-03T23:42:34Z

BTW I will resubmit now the "simpler" yet very useful commits of this PR in a smaller PR.

Done in #362

plbossart requested a review from a team as a code owner August 27, 2020 21:16

marc-hb reviewed Aug 27, 2020

plbossart force-pushed the fix/error-filters branch from 6f98c6c to 75da0ce Compare August 27, 2020 21:53

plbossart requested review from aiChaoSONG, fredoh9 and marc-hb August 27, 2020 21:53

marc-hb reviewed Aug 27, 2020

tools/sof-kernel-log-check.sh (outdated)

plbossart force-pushed the fix/error-filters branch 2 times, most recently from 01e350d to 9f549a9 Compare August 28, 2020 02:00

aiChaoSONG mentioned this pull request Aug 28, 2020

Ignore iwlwifi, thermal errors on Dell laptop #343

Closed

plbossart changed the title ~~Fix error filters~~ [WIP] Fix error filters and use journaltcl Aug 28, 2020

plbossart mentioned this pull request Aug 28, 2020

tools: sof-kernel-log-check: use journalctl command to query kernel log #262

Closed

aiChaoSONG reviewed Aug 28, 2020

case-lib/lib.sh (outdated)

aiChaoSONG mentioned this pull request Aug 28, 2020

tools: update the checklist and ignore list with boot time #355

Closed

plbossart added 5 commits August 28, 2020 15:38

sof-kernel-log-check: remove filter on initial boot issues

65c125b

The kernel was modified to only throw errors after the last boot attempt, there should be no need to filter in sof-test. Signed-off-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com>

sof-kernel-log-check.sh: remove Realtek parity filter

f9fc06c

This error is now handled in the topic/sof-dev kernel. Signed-off-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com>

sof-kernel-log-check.sh: handle escape characters

f5bf952

'.' is a special character in a regexp. Signed-off-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com>

sof-kernel-log-check: remove commented out code

418d6dd

Not sure what this is except legacy cruft. Signed-off-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com>

plbossart force-pushed the fix/error-filters branch 2 times, most recently from 3f65a37 to 38a7b8c Compare August 28, 2020 21:24

xiulipan reviewed Aug 31, 2020

tools/sof-kernel-log-check.sh (outdated)

tools/sof-kernel-log-check.sh

tools/sof-kernel-log-check.sh (outdated)

tools/sof-kernel-log-check.sh (outdated)

plbossart force-pushed the fix/error-filters branch 5 times, most recently from d18c272 to 93a03fa Compare September 1, 2020 00:07

plbossart force-pushed the fix/error-filters branch from 8e61fae to 2f34fd2 Compare September 1, 2020 19:30

xiulipan suggested changes Sep 2, 2020

tools/sof-kernel-log-check.sh (outdated)

test-case/check-ipc-flood.sh (outdated)

kv2019i mentioned this pull request Sep 2, 2020

Use inclusive language for DSP cores thesofproject/linux#2415

Merged

plbossart added 2 commits September 2, 2020 15:12

sof-kernel-log-check: add more filters

e25c6b3

A lot more errors reported by CI, maybe due to longer logs? Signed-off-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com>

plbossart force-pushed the fix/error-filters branch from 788ea40 to e25c6b3 Compare September 2, 2020 20:17

plbossart requested review from aiChaoSONG, marc-hb and xiulipan September 2, 2020 20:17

plbossart changed the title ~~[WIP] Fix error filters and use journaltcl~~ Fix error filters and use journaltcl Sep 2, 2020

marc-hb requested a review from a team September 2, 2020 22:08

marc-hb reviewed Sep 2, 2020

marc-hb mentioned this pull request Sep 3, 2020

case-lib/lib.sh: add new fake_kern_error() #361

Merged

marc-hb requested changes Sep 3, 2020

marc-hb requested a review from a team September 3, 2020 08:21

plbossart closed this Sep 3, 2020

marc-hb mentioned this pull request Sep 3, 2020

ignore_str filter updates from Pierre #362

Merged

marc-hb mentioned this pull request Oct 16, 2020

sof-kernel-log-check.sh: change sed -n $begin,\$p to tail -n +$begin #435

Merged

marc-hb mentioned this pull request Oct 24, 2020

[RFC] use journalctl for kernel log #468

Closed

marc-hb mentioned this pull request Apr 12, 2021

ASoC: Intel: boards: use software node API thesofproject/linux#2810

Merged

marc-hb added the False Pass / green failure label Jul 3, 2021

marc-hb added the area:non-audio Failure False positives: failing when we don't want to label Oct 15, 2021
