Skip to content

Conversation

@plbossart
Copy link
Member

Remove useless filters, add new ones for Dell CML-U

@plbossart plbossart requested a review from a team as a code owner August 27, 2020 21:16
Copy link
Collaborator

@marc-hb marc-hb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please add the bug numbers in the code itself?

@plbossart
Copy link
Member Author

Can you please add the bug numbers in the code itself?

Done

@plbossart plbossart force-pushed the fix/error-filters branch 2 times, most recently from 01e350d to 9f549a9 Compare August 28, 2020 02:00
@plbossart plbossart changed the title Fix error filters [WIP] Fix error filters and use journaltcl Aug 28, 2020
The kernel was modified to only throw errors after the last boot
attempt, there should be no need to filter in sof-test.

Signed-off-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com>
This error is now handled in the topic/sof-dev kernel.

Signed-off-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com>
TPM, iwlwifi and thermal zone warnings

All warnings are checked with '.', '(' and ')' treated as escapes to
avoid regexp issues.

BugLink: thesofproject#307
BugLink: thesofproject#343
Signed-off-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com>
'.' is a special character in a regexp.

Signed-off-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com>
Not sure what this is except legacy cruft.

Signed-off-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com>
@plbossart plbossart force-pushed the fix/error-filters branch 2 times, most recently from 3f65a37 to 38a7b8c Compare August 28, 2020 21:24
@plbossart
Copy link
Member Author

plbossart commented Aug 28, 2020

I will need some help here, I am completely out of my depth and I don't have a clue how these scripts are supposed to work. It's just horrible to even look at this code, sorry.
@marc-hb @aiChaoSONG @xiulipan please be mindful of folks like me who are users and don't have any plans to learn advanced bash. CI tests work and run-all-tests.sh doesn't but says PASS.

/underrun!!! (at least 177.843 ms long)
2020-08-28 22:50:48 UTC Sub-Test: [COMMAND] kill process: kill -9 36819
/root/sof-test/tools/sof-kernel-log-check.sh: line 3: 1: "NONE": syntax error: operand expected (error token is ""NONE"")
Failed to parse timestamp: 
Failed to parse timestamp: 
2020-08-28 22:50:49 UTC Sub-Test: [INFO] Test Result: PASS!
```
 

@marc-hb
Copy link
Collaborator

marc-hb commented Aug 29, 2020

Not many people have a clue how these possibly buggy scripts work, and some of these people are gone. Switching to journalctl and error levels is awesome but it's not so easy to implement and very difficult to test because it requires injecting some errors manually. This (past) mistake caught by @aiChaoSONG (thx!) tells me that testing didn't happen yet:

KERNEL_LAST_TIMESTAMP=date "+%Y-%m-%d %H:%M:%S.%6N"

So I don't think the switch to journalctl and error levels should be combined with unrelated changes and refactorings in the same PR and - hopefully no offence - I think it should be performed by someone who is (unfortunately) willing to "learn some advanced bash" or has already done so.

Fred said he already started to work on this, can we please wait until he's back from vacation next week and keep filtering "as usual" in the mean time? I already offered to help him any way I can.

A middle-ground is maybe to simply append a --level option to the existing dmesg and not perform any major surgery.

Copy link
Contributor

@xiulipan xiulipan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Pierre to implement the journalctl feature. Comments for fix inline.

@plbossart plbossart force-pushed the fix/error-filters branch 5 times, most recently from d18c272 to 93a03fa Compare September 1, 2020 00:07
Copy link
Contributor

@xiulipan xiulipan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some bash string issue with quote.

@plbossart
Copy link
Member Author

@xiulipan your suggestions don't work

2020-09-02 16:12:21 UTC [REMOTE_COMMAND] last timestamp check: 2020-09-02 16:12:20.480743
check begin_timestamp: 2020-09-02
check cmd: journalctl -k --since='2020-09-02'

I really hate this bash thing, it's just horrible.

Using dmesg lines precludes the filtering of dmesg error levels.

Move to journalctl, use a severity filter defined as warning and use a
timestamp to record previous checks.

Signed-off-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com>
A lot more errors reported by CI, maybe due to longer logs?

Signed-off-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com>
@plbossart plbossart changed the title [WIP] Fix error filters and use journaltcl Fix error filters and use journaltcl Sep 2, 2020
@plbossart
Copy link
Member Author

@xiulipan your suggestions don't work

2020-09-02 16:12:21 UTC [REMOTE_COMMAND] last timestamp check: 2020-09-02 16:12:20.480743
check begin_timestamp: 2020-09-02
check cmd: journalctl -k --since='2020-09-02'

I really hate this bash thing, it's just horrible.

It turns out the parameter also needed to be in double-quotes. Not all scripts did it. Fixed now.

@plbossart
Copy link
Member Author

@marc-hb @xiulipan @aiChaoSONG this should be ready to review now, I cleaned-up the commits. Phew.

@plbossart
Copy link
Member Author

SOFCI TEST

@marc-hb marc-hb requested a review from a team September 2, 2020 22:08
Copy link
Collaborator

@marc-hb marc-hb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm afraid the --since logic doesn't work and falls back on scanning the entire log in every test.

How did you test this locally? To inject errors easily I would locally replace the -k option in journalctl -k with --boot and add logger hello $0 statements in tests.

{
# shellcheck disable=SC2034 # external script will use it
KERNEL_LAST_LINE=$(wc -l /var/log/kern.log|awk '{print $1;}')
KERNEL_LAST_TIMESTAMP=$(date "+%Y-%m-%d %H:%M:%S.%6N")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note this will break twice a year on system with daylight savings, see systemd/systemd#5194. Can you please add a comment with this URL? I can also add it later.

I believe all our CI systems are now configured with UTC, i.e. no daylight savings. @xiulipan , @aiChaoSONG confirm? UTC is the only sensible choice for an international project anyway...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I couldn't find a way to make journalctl understand a date in UTC time. If you find a solution this can be a follow-up PR.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think there is any simple solution either, I'm merely asking to add a comment.
I'll do it in a follow-up PR (please don't "resolve" this thread)

# clean up dmesg
sudo dmesg -C
# discard old kernel logs
func_lib_setup_kernel_last_timestamp
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: nothing gets "discarded"

# cleanup dmesg buffer before test
sudo dmesg -c > /dev/null
# discard old kernel logs
func_lib_setup_kernel_last_timestamp
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: nothing gets "discarded".

# confirm begin_timestamp is valid, if it is not the number, search full log
re='^[0-9]+$'
if [[ $begin_timestamp =~ $re ]] ; then
cmd="journalctl -k"
Copy link
Collaborator

@marc-hb marc-hb Sep 2, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see how this can work because you have defined above: KERNEL_LAST_TIMESTAMP=$(date "+%Y-%m-%d %H:%M:%S.%6N")

Did you mean KERNEL_LAST_TIMESTAMP=$(date +%s.%6N)?

I'm afraid this code scans the complete kernel log since boot in each test and that we don't see it because you carefully filtered out all known issues. This would also explain why you had KERNEL_LAST_TIMESTAMP=date "+%Y-%m-%d %H:%M:%S.%6N" before and didn't notice.

The commit message of the next commit is also a clue:

A lot more errors reported by CI, maybe due to longer logs?

The only (painful) way to test error handling is to make tests fail, there is no other option. It doesn't have to be in CI, instead you can just locally replace the -k option in journalctl -k with --boot and insert some logger hello1 commands in some tests.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is for the case where no argument is passed. But in practice we always pass one, so that branch is never taken.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still don't get all the details but for sure the comment is either very confusing or wrong because $begin_timestamp =~ $re doesn't check that the timestamp is "valid", I mean not valid in a way compatible with journalctl --since

If this merely supposed to check the default value then the following code is enough and much simpler and easier to read:

DEFAULT_BEGIN=0 # or any better name
begin_timestamp=${1:-${DEFAULT_BEGIN}}
...
...
...
if [ "$begin_timestamp" = "$DEFAULT_BEGIN" ]; then

(BTW $begin_timestamp has spaces so it most definitely needs double quoting. Use shellcheck)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the test means that if there's anything other than numbers, it's not a timestamp.

I don't know why we are arguing about this.

fi

func_lib_setup_kernel_last_line()
func_lib_setup_kernel_last_timestamp()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you could have kept this function name for now, it would have made this (risky) refactoring so much smaller. Too late.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's just a rename, not a refactoring.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your answer is only about the least relevant and accurate word of my comment.
I'm just saying this could have been much shorter hence significantly easier to review.

builtin exit 1
fi

# Handle Call Trace separately
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you help me understand how this was done before? (if at all)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have no idea what the previous code did:

    err=$(eval $cmd|grep 'Call Trace' -A5 -B3)$(eval $cmd | grep -E "$err_str"|grep -vE "$ignore_str")	

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This previous code simply:

  1. looked for Call Traces
  2. looked for errors, ignoring known errors
  3. concatenated 1. and 2. into $err

If $err is empty then it means both 1. and 2. are empty.

If either is non-empty then both were shown whereas you're not showing Call Traces anymore when there is an error.

Can you explain why you changed that code? Only because you didn't understand it?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, I didn't understand it, and that's a good enough reason, no?

marc-hb added a commit to marc-hb/sof-test that referenced this pull request Sep 3, 2020
Needed to (manually and locally) test our overly complex error handling
framework, see for instance PR thesofproject#354

Signed-off-by: Marc Herbert <marc.herbert@intel.com>
Copy link
Collaborator

@marc-hb marc-hb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tl;dr: I finally tested this PR for real and it introduces at least 1 or 2 regressions.

Based on some previous issues (now fixed) and discussions I think it's clear you haven't been testing this PR by making some tests FAIL. Tests that don't fail are not tested, so for such a risky change of our core (and messy) error detection "framework", green results in CI don't prove anything.

So I bit the bullet, implemented and tested a (small but very useful) function that injects fake kernel errors in the logs to properly test this part of the code at last. It's in PR #361, everyone please help review. It's small.

Now of course I used this new fake_kern_error() function to manually and locally add one-line errors in a few tests. Then I used run-all-tests.sh to locally check that all the tests where I injected errors were failing; and only those.

So far I had time to try only one combination of artificially failing tests yet this very small scale testing was enough to immediately find that you're introducing a "green failure" regression in check-playback.sh, see more details below.

I also observed a "red success" that is also introduced by this PR in another test but I didn't have time to analyze it yet. Hopefully tomorrow.

Please use PR #361 (or better) too to make some tests fail and actually test this PR. Draft PRs with fake errors should help too, I will likely launch a few tomorrow too.

I really do appreciate the huge effort to attempt to clean up this mess but we really can't afford to make it worse than it already is and the only way to avoid making it worse is to test any change very thoroughly. The worse the code is, the more testing any change requires.

fi

func_lib_setup_kernel_last_line()
func_lib_setup_kernel_last_timestamp()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your answer is only about the least relevant and accurate word of my comment.
I'm just saying this could have been much shorter hence significantly easier to review.

{
# shellcheck disable=SC2034 # external script will use it
KERNEL_LAST_LINE=$(wc -l /var/log/kern.log|awk '{print $1;}')
KERNEL_LAST_TIMESTAMP=$(date "+%Y-%m-%d %H:%M:%S.%6N")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think there is any simple solution either, I'm merely asking to add a comment.
I'll do it in a follow-up PR (please don't "resolve" this thread)

builtin exit 1
fi

# Handle Call Trace separately
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This previous code simply:

  1. looked for Call Traces
  2. looked for errors, ignoring known errors
  3. concatenated 1. and 2. into $err

If $err is empty then it means both 1. and 2. are empty.

If either is non-empty then both were shown whereas you're not showing Call Traces anymore when there is an error.

Can you explain why you changed that code? Only because you didn't understand it?

done

sof-kernel-log-check.sh $KERNEL_LAST_LINE >/dev/null
sof-kernel-log-check.sh "$KERNEL_LAST_TIMESTAMP" >/dev/null
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This call here did not seem to make much sense and still does not...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just performed a dmesg->journactl change. You can't possibly expect me to transform this bag of idiotic scripts into something perfect in one shot.

sudo dmesg -C
# discard old kernel logs
func_lib_setup_kernel_last_timestamp

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dmesg -C here was a bug but a harmless bug because the sof-kernel-log-check.sh call at the end generally checks kern.log, not dmesg. Now you're turning this into a real, "green failure" bug.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand the comment, sorry. what is the issue?

# confirm begin_timestamp is valid, if it is not the number, search full log
re='^[0-9]+$'
if [[ $begin_timestamp =~ $re ]] ; then
cmd="journalctl -k"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still don't get all the details but for sure the comment is either very confusing or wrong because $begin_timestamp =~ $re doesn't check that the timestamp is "valid", I mean not valid in a way compatible with journalctl --since

If this merely supposed to check the default value then the following code is enough and much simpler and easier to read:

DEFAULT_BEGIN=0 # or any better name
begin_timestamp=${1:-${DEFAULT_BEGIN}}
...
...
...
if [ "$begin_timestamp" = "$DEFAULT_BEGIN" ]; then

(BTW $begin_timestamp has spaces so it most definitely needs double quoting. Use shellcheck)

@marc-hb marc-hb requested a review from a team September 3, 2020 08:21
@plbossart
Copy link
Member Author

I AM DONE.

@plbossart plbossart closed this Sep 3, 2020
@marc-hb
Copy link
Collaborator

marc-hb commented Sep 3, 2020

You can't possibly expect me to transform this bag of idiotic scripts into something perfect in one shot.

Not "perfect", certainly not. But not giving more misleading results than it gives now, which is unfortunately a huge challenge that I'm afraid you underestimated despite my "don't go there" warnings and my various, less ambitious suggestions. BTW I will resubmit now the "simpler" yet very useful commits of this PR in a smaller PR.

"Not giving more misleading results" = not adding any new green failure which unfortunately requires a significant test effort involving some fake errors (see small #361)

@marc-hb
Copy link
Collaborator

marc-hb commented Sep 3, 2020

BTW I will resubmit now the "simpler" yet very useful commits of this PR in a smaller PR.

Done in #362

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area:non-audio Failure False positives: failing when we don't want to False Pass / green failure

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants