Skip to content

Conversation

@xiulipan
Copy link
Contributor

@xiulipan xiulipan commented Oct 23, 2020

Try to use journalctl to read kernel log from system.
Use UNIX style timestamp to split kernel logs from test case to test case.
Print same format like dmesg in journalctl kernel log.
Update all test case used to relay on demsg logs

@xiulipan xiulipan requested a review from a team as a code owner October 23, 2020 11:48
@marc-hb
Copy link
Collaborator

marc-hb commented Oct 24, 2020

Previous attempts (and previous mistakes) in #263 and #354

"Those who ignore history are doomed to repeat it"

PS: I did NOT look at this newer attempt yet

@xiulipan
Copy link
Contributor Author

@marc-hb Yep, I reviewed both #263 and #354 and some patch is cherry-pick from previous work.
Still trying to find a clear way to get the log collection feature work with journalctl.

Directly use journalctl in the testcase to avoid kernel log missing
issues.

Signed-off-by: Pan Xiuli <xiuli.pan@linux.intel.com>
use journalctl by default for any kernel log reader usage.
Also add header for the file.

Signed-off-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com>
Signed-off-by: Pan Xiuli <xiuli.pan@linux.intel.com>
Add path for the sof-dump-status.py to avoid error:
sof-dump-status.py: command not found

Signed-off-by: Pan Xiuli <xiuli.pan@linux.intel.com>
use journalctl to get kernl logs.
use timestamp to split logs for each test case.
replace all test case to use the timestamp

Signed-off-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com>
Signed-off-by: Pan Xiuli <xiuli.pan@linux.intel.com>
@xiulipan xiulipan force-pushed the pr/journalctl branch 2 times, most recently from 6f7c349 to 8d567e1 Compare October 28, 2020 07:22
Add common error logs from systemd, initramfs, atkbd, i801_smbus,
ata3, genirq, DMAR, dw_dmac, proc_thermal, SDWm, touch pad

Those error level logs did not break functionality of the DUTs.

Signed-off-by: Pan Xiuli <xiuli.pan@linux.intel.com>
remove the redundant kernel log check for all case.
enable kernel log check for each iteration.

Signed-off-by: Pan Xiuli <xiuli.pan@linux.intel.com>
remove the redundant kernel log check for all case.
enable kernel log check for each iteration.

Signed-off-by: Pan Xiuli <xiuli.pan@linux.intel.com>
remove the redundant kernel log check for all case.
enable kernel log check for each iteration.

Signed-off-by: Pan Xiuli <xiuli.pan@linux.intel.com>
remove the redundant kernel log check for all case.
enable kernel log check for each iteration.

Signed-off-by: Pan Xiuli <xiuli.pan@linux.intel.com>
…iteration

remove the redundant kernel log check for all case.
enable kernel log check for each iteration.
also do some clean up for this script

Signed-off-by: Pan Xiuli <xiuli.pan@linux.intel.com>
…teration

remove the redundant kernel log check for all case.
enable kernel log check for each iteration.

Signed-off-by: Pan Xiuli <xiuli.pan@linux.intel.com>
remove the redundant kernel log check for all case.
enable kernel log check for each iteration.

Signed-off-by: Pan Xiuli <xiuli.pan@linux.intel.com>
…tion

remove the redundant kernel log check for all case.
enable kernel log check for each iteration.

Signed-off-by: Pan Xiuli <xiuli.pan@linux.intel.com>
remove the redundant kernel log check for all case.
enable kernel log check for each iteration.

Signed-off-by: Pan Xiuli <xiuli.pan@linux.intel.com>
…eration

remove the redundant kernel log check for all case.
enable kernel log check for each iteration.

Signed-off-by: Pan Xiuli <xiuli.pan@linux.intel.com>
try to flush and sync journalctl to avoid empty logs.

Signed-off-by: Pan Xiuli <xiuli.pan@linux.intel.com>
@xiulipan xiulipan changed the title [RFC] use journalctl for kernel log [WIP] use journalctl for kernel log Oct 28, 2020
# sof-kernel-log-check script parameter number is 0/Non-Number will force check from dmesg
sof-kernel-log-check.sh 0 || die "Catch error in dmesg"
# check kernel log for each iteration to catch issues
sof-kernel-log-check.sh $KERNEL_LAST_TIMESTAMP || die "Catch error in kernel log"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caught error instead of Catch error


sof-kernel-log-check.sh 0 || die "Catched error in dmesg"
# check kernel log for each iteration to catch issues
sof-kernel-log-check.sh $KERNEL_LAST_TIMESTAMP || die "Catch error in kernel log"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same comment here. lets fix it everywhere

Copy link
Collaborator

@marc-hb marc-hb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is fixing an extremely critical feature so this must be broken down in smaller pull requests and merged progressively = "continuous" integration. This is especially important considering the regular stream of other, unrelated CI changes and issues will not stop and will still need to be investigated and addressed at the same time. I mean we cannot stop everyone else (including you!) from making other CI fixes and improvements while performing this critical change. We also need the ability to quickly revert the last PR if anything goes wrong - and from past experience a number of things will go wrong.

  • As long as you use the -k flags you do not need to modify the ignore_str at the same time. The last thing we want is to change the implementation and the ignore_str at the same time. So let's forget about any ignore_str changes for now.

  • The --priority flag must also be postponed.

  • The first PR to submit is the PR that renames func_lib_setup_kernel_last_line to func_lib_setup_kernel_last_timestamp and KERNEL_LAST_LINE to KERNEL_LAST_TIMESTAMP without making any real code change. That will first get a lot of changed lines out of the way with no risk. The name will not match the old implementation for the transition period but we don't care.

  • The second PR to submit is quoting all the instances of $KERNEL_LAST_TIMESTAMP and then wait a day or two to make sure everything still works.

  • Then we can start flipping the switch with high impact but small number of lines PR.

Most importantly: what sort of (fake) errors did you test this with? Testing a change like this requires at least two throw-away PRs:

  • one throw-away PR with strategically placed fake errors + old code
  • a second throw-away PR with the exact same fake errors + the new code

The test results for both must be exactly the same.

for i in $(seq 1 $loop_cnt)
do
# cleanup dmesg buffer for each iteration
sudo dmesg -c > /dev/null
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this still required?

@xiulipan xiulipan changed the title [WIP] use journalctl for kernel log [RFC] use journalctl for kernel log Oct 29, 2020
@xiulipan
Copy link
Contributor Author

@marc-hb Thanks for the tips, I am also think about how to spilt the patches. Please help to check my plan:

  1. replace LINE with TIMESTAMP, log check with no --priority but still with err_str
  2. quoting clean up for the $KERNEL_LAST_TIMESTAMP
  3. remove err_str and use --priority, add new ignore_str
  4. clean up test case to remove duplicate kernel log check.

some discussion I want to have here:

  • What priority level we want to track here? @plbossart @ranj063 In this RFC, I have tried --priority=err Is this enough? Will warning be overkill?
  • verify-sof-firmware-load.sh seems still have issue even with sudo journalctl --sync --flush, I could not tell what happened to it
    and we still got a fail https://sof-ci.01.org/softestpr/PR468/build344/devicetest/ on BSW. I will try to fallback to dmesg for this test.

Keep this PR for testing and I will start with task 1.

@xiulipan
Copy link
Contributor Author

xiulipan commented Nov 2, 2020

close with latest part 1 PR #480

@xiulipan xiulipan closed this Nov 2, 2020
@marc-hb
Copy link
Collaborator

marc-hb commented Nov 5, 2020

Please help to check my plan:

Much finer-grained, much better!

For such a "core" change like this we really need "Totally Continuous" Integration: small, well tested changes, quick reviews, and then at least a couple days careful observation between merged PRs. See my request in #480.
One day observation is enough after non-functional changes like renames, more days needed to observe more impactful changes.

What priority level we want to track here? @plbossart @ranj063 In this RFC, I have tried --priority=err Is this enough? Will warning be overkill?

We can talk about changing levels and filtering last, after everything else is in place. Then we should involve many more people. For now the priority is to switch to journalctl while preserving everything else exactly the same - and then find and fix everything that broke with just that before even tweaking the filtering.

@marc-hb
Copy link
Collaborator

marc-hb commented Nov 5, 2020

verify-sof-firmware-load.sh seems still have issue

Still now after fixing all clock and timezone issues?

even with sudo journalctl --sync --flush, I could not tell what happened to it

These options make absolutely no difference except if you use them a split second before of system crash or sudden power loss. We are not really dealing with crashes anyway so don't use them, they're just confusing.

and we still got a fail https://sof-ci.01.org/softestpr/PR468/build344/devicetest/ on BSW.

BSW has been experiencing a lot of storage issues (see internal issue 539 + links), could that be the cause?

I will try to fallback to dmesg for this test.

Please no, that sounds brittle and complex. Let's understand what the real issue is first and/or temporarily disable BSW in the mean time.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants