-
Notifications
You must be signed in to change notification settings - Fork 59
Fix error filters and use journaltcl #354
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
marc-hb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you please add the bug numbers in the code itself?
6f98c6c to
75da0ce
Compare
Done |
01e350d to
9f549a9
Compare
The kernel was modified to only throw errors after the last boot attempt, there should be no need to filter in sof-test. Signed-off-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com>
This error is now handled in the topic/sof-dev kernel. Signed-off-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com>
TPM, iwlwifi and thermal zone warnings
All warnings are checked with '.', '(' and ')' treated as escapes to
avoid regexp issues.
BugLink: thesofproject#307
BugLink: thesofproject#343
Signed-off-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com>
'.' is a special character in a regexp. Signed-off-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com>
Not sure what this is except legacy cruft. Signed-off-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com>
3f65a37 to
38a7b8c
Compare
|
I will need some help here, I am completely out of my depth and I don't have a clue how these scripts are supposed to work. It's just horrible to even look at this code, sorry. |
|
Not many people have a clue how these possibly buggy scripts work, and some of these people are gone. Switching to journalctl and error levels is awesome but it's not so easy to implement and very difficult to test because it requires injecting some errors manually. This (past) mistake caught by @aiChaoSONG (thx!) tells me that testing didn't happen yet: KERNEL_LAST_TIMESTAMP=date "+%Y-%m-%d %H:%M:%S.%6N"So I don't think the switch to journalctl and error levels should be combined with unrelated changes and refactorings in the same PR and - hopefully no offence - I think it should be performed by someone who is (unfortunately) willing to "learn some advanced bash" or has already done so. Fred said he already started to work on this, can we please wait until he's back from vacation next week and keep filtering "as usual" in the mean time? I already offered to help him any way I can. A middle-ground is maybe to simply append a |
xiulipan
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks Pierre to implement the journalctl feature. Comments for fix inline.
d18c272 to
93a03fa
Compare
8e61fae to
2f34fd2
Compare
xiulipan
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some bash string issue with quote.
|
@xiulipan your suggestions don't work I really hate this bash thing, it's just horrible. |
Using dmesg lines precludes the filtering of dmesg error levels. Move to journalctl, use a severity filter defined as warning and use a timestamp to record previous checks. Signed-off-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com>
A lot more errors reported by CI, maybe due to longer logs? Signed-off-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com>
788ea40 to
e25c6b3
Compare
It turns out the parameter also needed to be in double-quotes. Not all scripts did it. Fixed now. |
|
@marc-hb @xiulipan @aiChaoSONG this should be ready to review now, I cleaned-up the commits. Phew. |
|
SOFCI TEST |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm afraid the --since logic doesn't work and falls back on scanning the entire log in every test.
How did you test this locally? To inject errors easily I would locally replace the -k option in journalctl -k with --boot and add logger hello $0 statements in tests.
| { | ||
| # shellcheck disable=SC2034 # external script will use it | ||
| KERNEL_LAST_LINE=$(wc -l /var/log/kern.log|awk '{print $1;}') | ||
| KERNEL_LAST_TIMESTAMP=$(date "+%Y-%m-%d %H:%M:%S.%6N") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note this will break twice a year on system with daylight savings, see systemd/systemd#5194. Can you please add a comment with this URL? I can also add it later.
I believe all our CI systems are now configured with UTC, i.e. no daylight savings. @xiulipan , @aiChaoSONG confirm? UTC is the only sensible choice for an international project anyway...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I couldn't find a way to make journalctl understand a date in UTC time. If you find a solution this can be a follow-up PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think there is any simple solution either, I'm merely asking to add a comment.
I'll do it in a follow-up PR (please don't "resolve" this thread)
| # clean up dmesg | ||
| sudo dmesg -C | ||
| # discard old kernel logs | ||
| func_lib_setup_kernel_last_timestamp |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: nothing gets "discarded"
| # cleanup dmesg buffer before test | ||
| sudo dmesg -c > /dev/null | ||
| # discard old kernel logs | ||
| func_lib_setup_kernel_last_timestamp |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: nothing gets "discarded".
| # confirm begin_timestamp is valid, if it is not the number, search full log | ||
| re='^[0-9]+$' | ||
| if [[ $begin_timestamp =~ $re ]] ; then | ||
| cmd="journalctl -k" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see how this can work because you have defined above: KERNEL_LAST_TIMESTAMP=$(date "+%Y-%m-%d %H:%M:%S.%6N")
Did you mean KERNEL_LAST_TIMESTAMP=$(date +%s.%6N)?
I'm afraid this code scans the complete kernel log since boot in each test and that we don't see it because you carefully filtered out all known issues. This would also explain why you had KERNEL_LAST_TIMESTAMP=date "+%Y-%m-%d %H:%M:%S.%6N" before and didn't notice.
The commit message of the next commit is also a clue:
A lot more errors reported by CI, maybe due to longer logs?
The only (painful) way to test error handling is to make tests fail, there is no other option. It doesn't have to be in CI, instead you can just locally replace the -k option in journalctl -k with --boot and insert some logger hello1 commands in some tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is for the case where no argument is passed. But in practice we always pass one, so that branch is never taken.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I still don't get all the details but for sure the comment is either very confusing or wrong because $begin_timestamp =~ $re doesn't check that the timestamp is "valid", I mean not valid in a way compatible with journalctl --since
If this merely supposed to check the default value then the following code is enough and much simpler and easier to read:
DEFAULT_BEGIN=0 # or any better name
begin_timestamp=${1:-${DEFAULT_BEGIN}}
...
...
...
if [ "$begin_timestamp" = "$DEFAULT_BEGIN" ]; then(BTW $begin_timestamp has spaces so it most definitely needs double quoting. Use shellcheck)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the test means that if there's anything other than numbers, it's not a timestamp.
I don't know why we are arguing about this.
| fi | ||
|
|
||
| func_lib_setup_kernel_last_line() | ||
| func_lib_setup_kernel_last_timestamp() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you could have kept this function name for now, it would have made this (risky) refactoring so much smaller. Too late.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's just a rename, not a refactoring.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Your answer is only about the least relevant and accurate word of my comment.
I'm just saying this could have been much shorter hence significantly easier to review.
| builtin exit 1 | ||
| fi | ||
|
|
||
| # Handle Call Trace separately |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you help me understand how this was done before? (if at all)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have no idea what the previous code did:
err=$(eval $cmd|grep 'Call Trace' -A5 -B3)$(eval $cmd | grep -E "$err_str"|grep -vE "$ignore_str") There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This previous code simply:
- looked for Call Traces
- looked for errors, ignoring known errors
- concatenated 1. and 2. into
$err
If $err is empty then it means both 1. and 2. are empty.
If either is non-empty then both were shown whereas you're not showing Call Traces anymore when there is an error.
Can you explain why you changed that code? Only because you didn't understand it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, I didn't understand it, and that's a good enough reason, no?
Needed to (manually and locally) test our overly complex error handling framework, see for instance PR thesofproject#354 Signed-off-by: Marc Herbert <marc.herbert@intel.com>
marc-hb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
tl;dr: I finally tested this PR for real and it introduces at least 1 or 2 regressions.
Based on some previous issues (now fixed) and discussions I think it's clear you haven't been testing this PR by making some tests FAIL. Tests that don't fail are not tested, so for such a risky change of our core (and messy) error detection "framework", green results in CI don't prove anything.
So I bit the bullet, implemented and tested a (small but very useful) function that injects fake kernel errors in the logs to properly test this part of the code at last. It's in PR #361, everyone please help review. It's small.
Now of course I used this new fake_kern_error() function to manually and locally add one-line errors in a few tests. Then I used run-all-tests.sh to locally check that all the tests where I injected errors were failing; and only those.
So far I had time to try only one combination of artificially failing tests yet this very small scale testing was enough to immediately find that you're introducing a "green failure" regression in check-playback.sh, see more details below.
I also observed a "red success" that is also introduced by this PR in another test but I didn't have time to analyze it yet. Hopefully tomorrow.
Please use PR #361 (or better) too to make some tests fail and actually test this PR. Draft PRs with fake errors should help too, I will likely launch a few tomorrow too.
I really do appreciate the huge effort to attempt to clean up this mess but we really can't afford to make it worse than it already is and the only way to avoid making it worse is to test any change very thoroughly. The worse the code is, the more testing any change requires.
| fi | ||
|
|
||
| func_lib_setup_kernel_last_line() | ||
| func_lib_setup_kernel_last_timestamp() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Your answer is only about the least relevant and accurate word of my comment.
I'm just saying this could have been much shorter hence significantly easier to review.
| { | ||
| # shellcheck disable=SC2034 # external script will use it | ||
| KERNEL_LAST_LINE=$(wc -l /var/log/kern.log|awk '{print $1;}') | ||
| KERNEL_LAST_TIMESTAMP=$(date "+%Y-%m-%d %H:%M:%S.%6N") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think there is any simple solution either, I'm merely asking to add a comment.
I'll do it in a follow-up PR (please don't "resolve" this thread)
| builtin exit 1 | ||
| fi | ||
|
|
||
| # Handle Call Trace separately |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This previous code simply:
- looked for Call Traces
- looked for errors, ignoring known errors
- concatenated 1. and 2. into
$err
If $err is empty then it means both 1. and 2. are empty.
If either is non-empty then both were shown whereas you're not showing Call Traces anymore when there is an error.
Can you explain why you changed that code? Only because you didn't understand it?
| done | ||
|
|
||
| sof-kernel-log-check.sh $KERNEL_LAST_LINE >/dev/null | ||
| sof-kernel-log-check.sh "$KERNEL_LAST_TIMESTAMP" >/dev/null |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This call here did not seem to make much sense and still does not...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just performed a dmesg->journactl change. You can't possibly expect me to transform this bag of idiotic scripts into something perfect in one shot.
| sudo dmesg -C | ||
| # discard old kernel logs | ||
| func_lib_setup_kernel_last_timestamp | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
dmesg -C here was a bug but a harmless bug because the sof-kernel-log-check.sh call at the end generally checks kern.log, not dmesg. Now you're turning this into a real, "green failure" bug.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand the comment, sorry. what is the issue?
| # confirm begin_timestamp is valid, if it is not the number, search full log | ||
| re='^[0-9]+$' | ||
| if [[ $begin_timestamp =~ $re ]] ; then | ||
| cmd="journalctl -k" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I still don't get all the details but for sure the comment is either very confusing or wrong because $begin_timestamp =~ $re doesn't check that the timestamp is "valid", I mean not valid in a way compatible with journalctl --since
If this merely supposed to check the default value then the following code is enough and much simpler and easier to read:
DEFAULT_BEGIN=0 # or any better name
begin_timestamp=${1:-${DEFAULT_BEGIN}}
...
...
...
if [ "$begin_timestamp" = "$DEFAULT_BEGIN" ]; then(BTW $begin_timestamp has spaces so it most definitely needs double quoting. Use shellcheck)
|
I AM DONE. |
Not "perfect", certainly not. But not giving more misleading results than it gives now, which is unfortunately a huge challenge that I'm afraid you underestimated despite my "don't go there" warnings and my various, less ambitious suggestions. BTW I will resubmit now the "simpler" yet very useful commits of this PR in a smaller PR. "Not giving more misleading results" = not adding any new green failure which unfortunately requires a significant test effort involving some fake errors (see small #361) |
Done in #362 |
Remove useless filters, add new ones for Dell CML-U