-
Notifications
You must be signed in to change notification settings - Fork 59
Use journalctl for log collection and remove dependency on /var/log/kern.log #639
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
a7bcac4 to
1d13104
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @chao for looking at this, really appreciated.
Please don't hijack the checkpoint function(s) and KERNEL_CHECKPOINT variable for this. The entire purpose of the checkpoints is to NOT scan the logs from boot, so having a special checkpoint case that is not actually a checkpoint is quite confusing. EDIT: another normal "checkpoint" assumption is that they move only forward.
The few special tests that want to scan all the logs from boot should ideally invoke journalctl_cmd --boot directly and specifically without a date, then it will be more obvious what they do. If that's too complicated for some reason then as an alternative they could define some new, IGNORE_CHECKPOINTS=true or SCAN_LOGS_FROM_BOOT=true global variable.
Not changing the nature of checkpoints should also reduce the number of lines changed considerably.
case-lib/lib.sh
Outdated
| ;; | ||
| "now") | ||
| # There is a 5s delay between two cases by sof-framework, | ||
| # make the check point 6s earlier to avoid log loss. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is too brittle. The simpler and safer way to solve this problem is to set checkpoints only at (almost) the same time as when collecting logs and never at any other time.
More importantly: I think this is an important but different problem, not directly related to this Pull Request, so best left aside for now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@marc-hb I was not able to understand your feedback.
a) please describe when checkpoints should be set.
b) why can it be done in a separate PR, since it's a key functionality to enable journalctl.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is "log loss" if one checkpoint is set and then another checkpoint is set again (5 seconds) later and logs between these two checkpoints are ignored. I believe this is the problem that this -6seconds hack is trying to address, however it is:
- hard to be sure because there is not much description for this bug
- very brittle because it relies on this 5 seconds gap to be always exactly the same between two consecutive tests
- duplicating approximately 1 second of logs into two consecutive tests "to be sure"
- duplicating approximately 6 seconds of logs when there is no gap
This seems to be an interesting bug but it's unrelated to this PR and it's not even related to whether journalctl or whatever else is used to collect logs.
BTW journalctl is already in use for most logs, this PR is just trying to finish the job removing kern.log. Then we should remove dmesg.
The proper fix is to simply never leave any such gap. In other word don't set a checkpoint and then set another checkpoint later without scanning or collecting the logs in between.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@marc-hb Before implementing this, I was thinking to use a new Global Variable to collect boot logs, too. But then I found that this can be integrated in the same function, setup_kernel_check_point, to provide a unified API for test case.
A test case designer can decide whether they want to collect logs from boot, or from now. When you say another check point, I think you mean setup_kernel_check_point "boot", Actually this is only used in the first test case currently. Moving forward principle is not influenced, because we give test case the flexibility setup check point. For this reason, we take "boot" as a virtual check point.
With current code, the command for collecting log from boot is: journalctl -k -q --no-pager --utc --output=short-monotonic --no-hostname --boot , the command for collecting log from check point is: journalctl -k -q --no-pager --utc --output=short-monotonic --no-hostname --boot --since=@CHECK_POINT, already good enough to handle both conditions.
By the way, the --boot option means collect log of the latest boot, it doesn't mean collect boot logs.
@plbossart @marc-hb On the potential log loss.
sof-framework will add a 5s delay between two test cases, we are not sure there will be new logs in this 5s period, so I make the checkpoint 6s earlier with KERNEL_CHECKPOINT=$(($(date +%s) - 6)). Actually, no case is running on DUT in this 5s period, very low possibility to have kernel logs, I think. The point here becomes, should we care and collect very likely unrelated and sometime may be empty logs between test cases? With 6 here, we may have 1s log duplicated. We can choose 0, 1, 2, 3, 4, 5 or 6 to decrease log loss. In my opinion now, 5 may be the best with very low possibility of log loss and no log duplication.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in the same function, setup_kernel_check_point, to provide a unified API for test case.
I don't think it's good to have a unified API for two different things, especially not when the API is named after only one of the two things.
Actually this is only used in the first test case currently.
So it's not great either to change and require almost every invocation to over-specify "now" and introduce the abstract concept of a "virtual" checkpoints that are not checkpoints, all this for just one special case. BTW it should be possible to run tests in any order.
In fact the only test that uses setup_kernel_check_point "boot" in this PR is apparently verify-firmware-presence.sh That test is super short and simple and merely making sure that /lib/firmware/intel/sof/platform.ri exists, so it should not even check kernel logs at all, kernel logs are not relevant for such a test. Is it really the only test that needed kern.log?
In the future we probably want a test that does nothing and checks boot logs but that would be a different test, it shouldn't be confused with checking /lib/firmware.
By the way, the --boot option means collect log of the latest boot, it doesn't mean collect boot logs.
Sorry I used both somewhat interchangeably.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The point here becomes, should we care and collect very likely unrelated and sometime may be empty logs between test cases?
Good question but IMHO not new and independent from this PR. One problem at a time.
This patch sets up kernel check point 5s earlier, because CI will add a 5s delay between two cases, and also renames func_lib_setup_kernel_check_point to setup_kernel_check_point, because the prefix doesn't provide much information. Signed-off-by: Chao Song <chao.song@linux.intel.com>
5079921 to
9a5d9d3
Compare
Previously, we are using /var/log/kern.log to collect kernel logs, but this file is not present on Chrome/Fedora, etc. Use journalctl to make log collection more portable. Signed-off-by: Chao Song <chao.song@linux.intel.com>
Signed-off-by: Chao Song <chao.song@linux.intel.com>
marc-hb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have two "big" questions, maybe related to each other:
- If one test wants to scan the entire boot log with this new design then how it is supposed to do it? Do we even have any such test right now?
- I see many more checkpoint additions than
DMESG_LOG_START_LINEremovals. In other words some tests are now setting a checkpoint and they were doing nothing before. What part of the kernel log were they scanning before?
| KERNEL_CHECKPOINT=$(date +%s) | ||
| # There is a 5s delay between two cases by sof-framework, | ||
| # make the check point 5s earlier to avoid log loss. | ||
| export KERNEL_CHECKPOINT=$(($(date +%s) - 5)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This still belong to a different PR but OK, it's in a different commit and it's a small enough change. Small in number of lines, not in consequences...
Can you please
- Add this to the comment:
# There is a 5s delay between two cases by sof-framework,
# make the check point 5s earlier to avoid log loss.
# Note this can make an error caused by one test appear in the next one
- Make the 5s a global,
$SOF_TEST_INTERVALvariable and use it here?
| { | ||
| journalctl -k -q --no-pager --utc --output=short-monotonic \ | ||
| --no-hostname "$@" | ||
| --no-hostname --boot "$@" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
-k implies --boot, please remove this.
Feel free to expand -k to --kernel. Short options are meant to save typing time on the command line but they are less readable in scripts which are read many times more than they are written.
| DMESG_LOG_START_LINE=$(wc -l /var/log/kern.log|awk '{print $1;}') | ||
| fi | ||
|
|
||
| is_sof_used() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In an ideal world every commit should be usable for bisecatability, cherry-picking, reverting, etc. reasons. I guess this implies the entire DMESG_LOG_START_LINE removal should be a single commit? Good the second commit does not even have a commit message :-)
| journalctl_cmd --since=@"$KERNEL_CHECKPOINT" > "$LOG_ROOT/dmesg.txt" | ||
| else | ||
| cut -f5- -d ' ' /var/log/kern.log > "$LOG_ROOT/dmesg.txt" | ||
| die "Kernel check point is not properly setup" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
die "KERNEL_CHECKPOINT is not properly setup"
Much faster git grep when this happens.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
normally, this should not happen, I just want to use this to tell the case owner that he/she should setup up check point.
could you please explain more on git grep?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When someone sees the message with KERNEL_CHECKPOINT then they can immediately git grep that and find all relevant locations in sof-test.
| export LOG_ROOT='$(realpath $LOG_ROOT)' | ||
| export DMESG_LOG_START_LINE=$(wc -l /var/log/kern.log|awk '{print $1;}') | ||
| setup_kernel_check_point | ||
| export KERNEL_CHECKPOINT |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add a comment explaining why this one needs the export when the others don't?
Wouldn't it be simpler to export KERNEL_CHECKPOINT always? If not why not?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, this is a mistake from previously version, the export KERNEL_CHECKPOINT should be removed.
This is in my next PR which add a new function
Some cases don't have DMESG_LOG_START_LINE because it is set implicitly in lib.sh, and now I want test case writer to explicitly set a check point. Why are some other cases have DMESG_LOG_START_LINE? actually I don't know, ask the owner! |
We must make 100% sure every single test catch errors when they happen and especially when refactoring this most critical part of sof-test. Otherwise when end up with "green failures" like this most recent one and many others before it: That requires significant time and effort but taking shortcuts and running and maintaining tests that catch errors "maybe" is pointless, better not spend time at all if you can't afford to fully understand test this particular refactoring thoroughly. It affects our entire test suite and product quality. |
No description provided.