Use journalctl for log collection and remove dependency on /var/log/kern.log #639

aiChaoSONG · 2021-03-29T07:09:59Z

No description provided.

marc-hb

Thanks @chao for looking at this, really appreciated.

Please don't hijack the checkpoint function(s) and KERNEL_CHECKPOINT variable for this. The entire purpose of the checkpoints is to NOT scan the logs from boot, so having a special checkpoint case that is not actually a checkpoint is quite confusing. EDIT: another normal "checkpoint" assumption is that they move only forward.

The few special tests that want to scan all the logs from boot should ideally invoke journalctl_cmd --boot directly and specifically without a date, then it will be more obvious what they do. If that's too complicated for some reason then as an alternative they could define some new, IGNORE_CHECKPOINTS=true or SCAN_LOGS_FROM_BOOT=true global variable.

Not changing the nature of checkpoints should also reduce the number of lines changed considerably.

case-lib/lib.sh

marc-hb · 2021-03-30T06:23:43Z

case-lib/lib.sh

+            ;;
+        "now")
+            # There is a 5s delay between two cases by sof-framework,
+            # make the check point 6s earlier to avoid log loss.


This is too brittle. The simpler and safer way to solve this problem is to set checkpoints only at (almost) the same time as when collecting logs and never at any other time.

More importantly: I think this is an important but different problem, not directly related to this Pull Request, so best left aside for now.

@marc-hb I was not able to understand your feedback.

a) please describe when checkpoints should be set.

b) why can it be done in a separate PR, since it's a key functionality to enable journalctl.

There is "log loss" if one checkpoint is set and then another checkpoint is set again (5 seconds) later and logs between these two checkpoints are ignored. I believe this is the problem that this -6seconds hack is trying to address, however it is:

hard to be sure because there is not much description for this bug

very brittle because it relies on this 5 seconds gap to be always exactly the same between two consecutive tests

duplicating approximately 1 second of logs into two consecutive tests "to be sure"

duplicating approximately 6 seconds of logs when there is no gap

This seems to be an interesting bug but it's unrelated to this PR and it's not even related to whether journalctl or whatever else is used to collect logs.

BTW journalctl is already in use for most logs, this PR is just ~~trying to finish the job~~ removing kern.log. Then we should remove dmesg.

The proper fix is to simply never leave any such gap. In other word don't set a checkpoint and then set another checkpoint later without scanning or collecting the logs in between.

@marc-hb Before implementing this, I was thinking to use a new Global Variable to collect boot logs, too. But then I found that this can be integrated in the same function, setup_kernel_check_point, to provide a unified API for test case.

A test case designer can decide whether they want to collect logs from boot, or from now. When you say another check point, I think you mean setup_kernel_check_point "boot", Actually this is only used in the first test case currently. Moving forward principle is not influenced, because we give test case the flexibility setup check point. For this reason, we take "boot" as a virtual check point.

With current code, the command for collecting log from boot is: journalctl -k -q --no-pager --utc --output=short-monotonic --no-hostname --boot , the command for collecting log from check point is: journalctl -k -q --no-pager --utc --output=short-monotonic --no-hostname --boot --since=@CHECK_POINT, already good enough to handle both conditions.

By the way, the --boot option means collect log of the latest boot, it doesn't mean collect boot logs.

@plbossart @marc-hb On the potential log loss.

sof-framework will add a 5s delay between two test cases, we are not sure there will be new logs in this 5s period, so I make the checkpoint 6s earlier with KERNEL_CHECKPOINT=$(($(date +%s) - 6)). Actually, no case is running on DUT in this 5s period, very low possibility to have kernel logs, I think. The point here becomes, should we care and collect very likely unrelated and sometime may be empty logs between test cases? With 6 here, we may have 1s log duplicated. We can choose 0, 1, 2, 3, 4, 5 or 6 to decrease log loss. In my opinion now, 5 may be the best with very low possibility of log loss and no log duplication.

in the same function, setup_kernel_check_point, to provide a unified API for test case.

I don't think it's good to have a unified API for two different things, especially not when the API is named after only one of the two things.

Actually this is only used in the first test case currently.

So it's not great either to change and require almost every invocation to over-specify "now" and introduce the abstract concept of a "virtual" checkpoints that are not checkpoints, all this for just one special case. BTW it should be possible to run tests in any order.

In fact the only test that uses setup_kernel_check_point "boot" in this PR is apparently verify-firmware-presence.sh That test is super short and simple and merely making sure that /lib/firmware/intel/sof/platform.ri exists, so it should not even check kernel logs at all, kernel logs are not relevant for such a test. Is it really the only test that needed kern.log?

In the future we probably want a test that does nothing and checks boot logs but that would be a different test, it shouldn't be confused with checking /lib/firmware.

By the way, the --boot option means collect log of the latest boot, it doesn't mean collect boot logs.

Sorry I used both somewhat interchangeably.

The point here becomes, should we care and collect very likely unrelated and sometime may be empty logs between test cases?

Good question but IMHO not new and independent from this PR. One problem at a time.

This patch sets up kernel check point 5s earlier, because CI will add a 5s delay between two cases, and also renames func_lib_setup_kernel_check_point to setup_kernel_check_point, because the prefix doesn't provide much information. Signed-off-by: Chao Song <chao.song@linux.intel.com>

Previously, we are using /var/log/kern.log to collect kernel logs, but this file is not present on Chrome/Fedora, etc. Use journalctl to make log collection more portable. Signed-off-by: Chao Song <chao.song@linux.intel.com>

Signed-off-by: Chao Song <chao.song@linux.intel.com>

marc-hb

I have two "big" questions, maybe related to each other:

If one test wants to scan the entire boot log with this new design then how it is supposed to do it? Do we even have any such test right now?
I see many more checkpoint additions than DMESG_LOG_START_LINE removals. In other words some tests are now setting a checkpoint and they were doing nothing before. What part of the kernel log were they scanning before?

marc-hb · 2021-04-07T22:55:25Z

case-lib/lib.sh

-    KERNEL_CHECKPOINT=$(date +%s)
+    # There is a 5s delay between two cases by sof-framework,
+    # make the check point 5s earlier to avoid log loss.
+    export KERNEL_CHECKPOINT=$(($(date +%s) - 5))


This still belong to a different PR but OK, it's in a different commit and it's a small enough change. Small in number of lines, not in consequences...

Can you please

Add this to the comment:

# There is a 5s delay between two cases by sof-framework, # make the check point 5s earlier to avoid log loss. # Note this can make an error caused by one test appear in the next one

Make the 5s a global, $SOF_TEST_INTERVAL variable and use it here?

marc-hb · 2021-04-07T22:58:28Z

case-lib/lib.sh

 {
   journalctl -k -q --no-pager --utc --output=short-monotonic \
-     --no-hostname  "$@"
+     --no-hostname --boot "$@"


-k implies --boot, please remove this.

Feel free to expand -k to --kernel. Short options are meant to save typing time on the command line but they are less readable in scripts which are read many times more than they are written.

marc-hb · 2021-04-07T23:06:12Z

case-lib/lib.sh

-    DMESG_LOG_START_LINE=$(wc -l /var/log/kern.log|awk '{print $1;}')
-fi
-
 is_sof_used()


In an ideal world every commit should be usable for bisecatability, cherry-picking, reverting, etc. reasons. I guess this implies the entire DMESG_LOG_START_LINE removal should be a single commit? Good the second commit does not even have a commit message :-)

marc-hb · 2021-04-07T23:07:28Z

case-lib/hijack.sh

+        journalctl_cmd --since=@"$KERNEL_CHECKPOINT" > "$LOG_ROOT/dmesg.txt"
    else
-        cut -f5- -d ' ' /var/log/kern.log > "$LOG_ROOT/dmesg.txt"
+        die "Kernel check point is not properly setup"


die "KERNEL_CHECKPOINT is not properly setup"

Much faster git grep when this happens.

normally, this should not happen, I just want to use this to tell the case owner that he/she should setup up check point.
could you please explain more on git grep?

When someone sees the message with KERNEL_CHECKPOINT then they can immediately git grep that and find all relevant locations in sof-test.

marc-hb · 2021-04-07T23:13:55Z

test-case/check-reboot.sh

 export LOG_ROOT='$(realpath $LOG_ROOT)'
-export DMESG_LOG_START_LINE=$(wc -l /var/log/kern.log|awk '{print $1;}')
+setup_kernel_check_point
+export KERNEL_CHECKPOINT


Can you add a comment explaining why this one needs the export when the others don't?

Wouldn't it be simpler to export KERNEL_CHECKPOINT always? If not why not?

Sorry, this is a mistake from previously version, the export KERNEL_CHECKPOINT should be removed.

aiChaoSONG · 2021-04-08T02:27:34Z

If one test wants to scan the entire boot log with this new design then how it is supposed to do it? Do we even have any such test right now?

This is in my next PR which add a new function disable_kernel_check_point, which set KERNEL_CHECKPOINT to "disabled", and add one more check, once it is disabled, collect log from boot, and also a new test case to collect boot log.

I see many more checkpoint additions than DMESG_LOG_START_LINE removals. In other words some tests are now setting a checkpoint and they were doing nothing before. What part of the kernel log were they scanning before?

Some cases don't have DMESG_LOG_START_LINE because it is set implicitly in lib.sh, and now I want test case writer to explicitly set a check point. Why are some other cases have DMESG_LOG_START_LINE? actually I don't know, ask the owner!

DMESG_LOG_START_LINE=$(wc -l /var/log/kern.log|awk '{print $1;}')

marc-hb · 2021-04-12T21:59:45Z

Why are some other cases have DMESG_LOG_START_LINE? actually I don't know, ask the owner!

We must make 100% sure every single test catch errors when they happen and especially when refactoring this most critical part of sof-test. Otherwise when end up with "green failures" like this most recent one and many others before it:
thesofproject/linux#2810 (comment)

That requires significant time and effort but taking shortcuts and running and maintaining tests that catch errors "maybe" is pointless, better not spend time at all if you can't afford to fully understand test this particular refactoring thoroughly. It affects our entire test suite and product quality.

aiChaoSONG · 2021-04-20T07:13:06Z

this PR is separated to two other PRs #643, #649, and they are merged, close this.

aiChaoSONG requested a review from a team as a code owner March 29, 2021 07:09

aiChaoSONG force-pushed the journalctl branch 2 times, most recently from a7bcac4 to 1d13104 Compare March 29, 2021 08:23

marc-hb mentioned this pull request Mar 30, 2021

[FEATURE] sof-test needs to support Fedora 34 #633

Closed

aiChaoSONG force-pushed the journalctl branch from 1d13104 to db67f6b Compare March 30, 2021 02:45

marc-hb requested changes Mar 30, 2021

View reviewed changes

marc-hb changed the title ~~[TEST] Use jorunalctl for log collection and remove dependency on /var/log/kern.log~~ [TEST] Use journalctl for log collection and remove dependency on /var/log/kern.log Mar 30, 2021

aiChaoSONG force-pushed the journalctl branch from db67f6b to e86979f Compare April 6, 2021 02:41

aiChaoSONG force-pushed the journalctl branch 2 times, most recently from 5079921 to 9a5d9d3 Compare April 6, 2021 09:47

Chao Song added 2 commits April 6, 2021 17:50

hijack: use journalctl to collect logs

ffd227f

Previously, we are using /var/log/kern.log to collect kernel logs, but this file is not present on Chrome/Fedora, etc. Use journalctl to make log collection more portable. Signed-off-by: Chao Song <chao.song@linux.intel.com>

sof-test: remove dependency on /var/log/kern.log

9a5d9d3

Signed-off-by: Chao Song <chao.song@linux.intel.com>

aiChaoSONG changed the title ~~[TEST] Use journalctl for log collection and remove dependency on /var/log/kern.log~~ Use journalctl for log collection and remove dependency on /var/log/kern.log Apr 7, 2021

aiChaoSONG requested a review from marc-hb April 7, 2021 03:33

marc-hb requested changes Apr 7, 2021

View reviewed changes

aiChaoSONG mentioned this pull request Apr 8, 2021

sof-test: include test interval in kernel check point #643

Merged

marc-hb mentioned this pull request Apr 12, 2021

ASoC: Intel: boards: use software node API thesofproject/linux#2810

Merged

aiChaoSONG closed this Apr 20, 2021

This was referenced Apr 22, 2021

sof-test: use journalctl for log collection #649

Merged

check-kmod: set the kernel check point before starting a test #655

Closed

marc-hb added the False Pass / green failure label Jul 3, 2021

marc-hb added area:non-audio Failure False positives: failing when we don't want to area:dmesg filter labels Oct 15, 2021

marc-hb mentioned this pull request Oct 21, 2022

lib.sh: simplify the check for firmware rtos #973

Merged

Use journalctl for log collection and remove dependency on /var/log/kern.log #639

Use journalctl for log collection and remove dependency on /var/log/kern.log #639

Uh oh!

Conversation

aiChaoSONG commented Mar 29, 2021

Uh oh!

marc-hb left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

marc-hb Mar 30, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

marc-hb left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

marc-hb Apr 7, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

aiChaoSONG Apr 8, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

aiChaoSONG commented Apr 8, 2021

Uh oh!

marc-hb commented Apr 12, 2021

Uh oh!

aiChaoSONG commented Apr 20, 2021

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

marc-hb left a comment •

edited

Loading

marc-hb Mar 30, 2021 •

edited

Loading

marc-hb Apr 7, 2021 •

edited

Loading

aiChaoSONG Apr 8, 2021 •

edited

Loading