@kamalesh-babulal
Contributor

Rebase the tests onto the closest upstream kernel to RHEL 8.0. As per
https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/8/html-single/8.0_release_notes/index#overview
that is kernel v4.18, so rebase on top of upstream v4.18.

Signed-off-by: Kamalesh Babulal <kamalesh@linux.vnet.ibm.com>

@kamalesh-babulal
Contributor Author

I have only compile-tested v4.18 and RHEL 8.0, with all of the patches applied.

@joe-lawrence
Contributor

Hi @kamalesh-babulal,

If I understand correctly, this PR rebases the integration tests on top of upstream v4.18. That's going to be very close to RHEL-8.0 GA kernel, but not exactly the same. For example, the tracepoints-sections.patch modifies run_timer_softirq() and expects to see base->must_forward_clk = false in the code context leading up to the added code. When I look at kernel-4.18.0-80.el8.src.rpm however, its run_timer_softirq() doesn't touch must_forward_clk. So I think RHEL-8.0 must have included a backport of 363e934d8811 ("timers: Clear timer_base::must_forward_clk with timer_base::lock held"), which is upstream v4.19+.

FWIW, I think that is the only patch that had a v4.18 vs. RHEL-8.0 discrepancy. If you can update the PR with rebase against kernel-4.18.0-80.el8.src.rpm, I can verify that they build / run on one of our internal test boxes.

@kamalesh-babulal
Contributor Author

@joe-lawrence Thanks for the review and for helping with a test run. Agreed, there is a difference between the upstream v4.18 and RHEL 8.0 kernel sources. When I tried, the patches applied with minor fuzz on a couple of test cases over the RHEL 8.0 sources. Yeah, it makes sense to rebase against the RHEL 8.0 sources instead of the closest upstream. I will redo it and push it.

@kamalesh-babulal
Contributor Author

v2 changes:

  • Rebased to RHEL 8.0 kernel version 4.18.0-80.el8
  • Added a README file stating the kernel version the integration tests were rebased onto.

@joe-lawrence
Contributor

Thanks for rebasing @kamalesh-babulal. Here are results from an x86_64 test run:
combined.log
gcc-static-local-var-5.log
make.log
test.log

One build failure, perhaps a result of the objtool warnings?

build: gcc-static-local-var-5
Using cache at /root/.kpatch/src
Testing patch file(s)
Reading special section data
Building original source
Building patched source
Extracting new and modified ELF sections
ERROR: audit.o: reference to static local variable lock.65414 in audit_log_end was removed
/root/kpatch/kpatch-build/create-diff-object: unreconcilable difference
ERROR: 1 error(s) encountered. Check /root/.kpatch/build.log for more details.
ERROR: gcc-static-local-var-5: build failed

and one test script failure, due to print formatting changes (so something like data-new-LOADED.patch.txt is needed to fix it):

loading patch module: test-data-new.ko
waiting (up to 15 seconds) for patch transition to complete...
transition complete (2 seconds)
ERROR: data-new: rhel-8.0/data-new-LOADED.test failed after kpatch load
disabling patch module: test_data_new
waiting (up to 15 seconds) for patch transition to complete...
transition complete (2 seconds)
unloading patch module: test_data_new

I'll try to do a similar ppc64le test run when I get a chance.

@joe-lawrence
Contributor

And the same for ppc64le

make.out.txt
test.log
macro-printk.1.log
module-call-external.1.log

Build errors:

build: macro-printk
Using cache at /root/.kpatch/src
Testing patch file(s)
Reading special section data
Building original source
Building patched source
Extracting new and modified ELF sections
fib_frontend.o: changed function: inet_rtm_newroute
ERROR: fib_semantics.o: object size mismatch: __msg.66237
/root/kpatch/kpatch-build/create-diff-object: unreconcilable difference
fib_trie.o: changed function: fib_table_insert
ERROR: 1 error(s) encountered. Check /root/.kpatch/build.log for more details.
ERROR: macro-printk: build failed
build: module-call-external
Using cache at /root/.kpatch/src
Testing patch file(s)
Reading special section data
Building original source
Building patched source
Extracting new and modified ELF sections
export.o: changed function: e_show
af_netlink.o: new function: kpatch_string
Patched objects: fs/nfsd/nfsd.ko vmlinux
Building patch module: test-module-call-external.ko
ERROR: Undefined symbols: kpatch_string . Check /root/.kpatch/build.log for more details.
ERROR: module-call-external: build failed

Test errors:

data-new (same as x86_64 above)

shadow-newpid crashes the machine:

livepatch: enabling patch 'test_shadow_newpid'
livepatch: 'test_shadow_newpid': starting patching transition
livepatch: 'test_shadow_newpid': patching complete
Unable to handle kernel paging request for data at address 0x00000000
Faulting instruction address: 0xc000000000cc6b38
Oops: Kernel access of bad area, sig: 11 [#1]
LE SMP NR_CPUS=2048 NUMA pSeries
Modules linked in: test_shadow_newpid(OEK) sg pseries_rng xts vmx_crypto xfs libcrc32c sd_mod ibmvscsi ibmveth scsi_transport_srp dm_mirror dm_region_hash dm_log dm_mod [last unloaded: test_parainstructions_section]
CPU: 7 PID: 579 Comm: systemd-journal Kdump: loaded Tainted: G           OE K  --------- -  - 4.18.0-80.el8.ppc64le #1
NIP:  c000000000cc6b38 LR: c0000000004e84d0 CTR: c0000000004e8450
REGS: c0000007f8fa7850 TRAP: 0300   Tainted: G           OE K  --------- -  -  (4.18.0-80.el8.ppc64le)
MSR:  8000000000009033 <SF,EE,ME,IR,DR,RI,LE>  CR: 88000228  XER: 20040000
CFAR: c000000000008934 DAR: 0000000000000000 DSISR: 40000000 IRQMASK: 0 
GPR00: d00000000b4a3030 c0000007f8fa7ad0 c0000000015bd100 c0000007f2020361 
GPR04: 000000000000fc9f 0000000000000000 c0000007f8fa7c20 0000000000000000 
GPR08: 0000000000000030 000000007fffffff 0000000000000000 d00000000b4a3660 
GPR12: c0000000004e8450 c000000007fa6200 0000000000000000 0000000000000000 
GPR16: 0000000000000000 0000000000000000 c0000007f8fa7c20 0000000000000000 
GPR20: 0000000000000000 0000000000000000 000000000000003f 00000000ffffffff 
GPR24: 000000000000003f 00000000ffffffff 000000000000003f 00000000ffffffff 
GPR28: 0000000000000000 000000000000fc9f c0000007f2020361 c0000007f2030000 
NIP [c000000000cc6b38] vsnprintf+0x58/0x5c0
LR [c0000000004e84d0] seq_printf+0x80/0xd0
Call Trace:
[c0000007f8fa7ad0] [c0000007f8fa7b10] 0xc0000007f8fa7b10 (unreliable)
[c0000007f8fa7bc0] [c0000000004e84d0] seq_printf+0x80/0xd0
[c0000007f8fa7bf0] [d00000000b4a3030] proc_pid_status+0xcf8/0x2598 [test_shadow_newpid]
[c0000007f8fa7ce0] [c00000000007610c] livepatch_handler+0x30/0x64
[c0000007f8fa7d30] [c0000000004e6b6c] seq_read+0x1cc/0x650
[c0000007f8fa7dd0] [c0000000004a09d8] sys_read+0x108/0x310
[c0000007f8fa7e30] [c00000000000b388] system_call+0x5c/0x70
Instruction dump:
7fe32214 fa210078 fa410080 fba100d8 fbc100e0 fa810090 7c9d2378 7fa3f840 
7c7e1b78 7cb12b78 7cd23378 419d043c <89310000> 7fd4f378 2f890000 419e0134 
---[ end trace eb35e9a9bf3754e2 ]---

I'll look into this crash later today...

@kamalesh-babulal
Contributor Author

@joe-lawrence Thanks a lot for testing. I was trying gcc-static-local-var-5 on ppc64le and hit a segfault. I will debug more and update.

@joe-lawrence
Contributor

Hmm, shadow-newpid.patch crashes even when I sub out all the code for a nop(). Some more files:
crash.tar.gz

@joe-lawrence
Contributor

FYI, shadow-newpid.patch no longer builds with kpatch v0.7.0, specifically 4f4870d ("create-diff-object: Don't allow jump labels"):

fork.o: WARNING: unable to correlate static local variable ctr.65763 used by .toc, assuming variable is new
fork.o: changed function: _do_fork.constprop.7
fork.o: changed function: sys_fork
fork.o: changed function: sys_vfork
fork.o: changed function: _do_fork
array.o: changed function: proc_pid_status
/root/kpatch/kpatch-build/create-diff-object: ERROR: exit.o: kpatch_regenerate_special_section: 2091: Found a jump label at do_exit()+0x87c, using key devmap_managed_key.  Jump labels aren't currently supported.  Use static_key_enabled() instead.

but as the crash was in the patched proc_pid_status() and not do_exit(), I suspect that jump labels may not be the real culprit.

@jpoimboe
Member

I can confirm the proc_pid_status() panic is fixed by #1007.

@joe-lawrence
Contributor

Nice, thanks for verifying @jpoimboe!

@julien-thierry
Contributor

For context:

  • The first 3 patches aim to fix the issue on x86_64 pointed out by Joe on gcc-static-local-var-5.
    However, those changes (especially treating .part. functions as child functions) introduce a sibling call warning on ppc64le in the same test. This is because the parent of the .part. function is now considered for inclusion, since one of its children's sections has changed.
    If we agree to include those changes, I can add the no-optimize attribute to the test patch.

  • The fourth patch fixes the error on macro-printk, although it feels like it only makes the issue less likely to happen; I do not have an absolute solution for that.

The rest of the patches are just the actual rebase and the silencing of issues that won't be solved right now.

sm00th added a commit to sm00th/kpatch that referenced this pull request Sep 18, 2019
When these from internal depths of Red Hat upstream paths changed and
now we are one level deeper in directory tree.

The issue probably also exists in rhel8.0 rebase pr dynup#993.

Signed-off-by: Artem Savkov <asavkov@redhat.com>
Contributor

@joe-lawrence left a comment

Looks mostly good (with a few nits noted). @jpoimboe, could you sanity check the symbol correlation changes, in particular the ("kpatch-build: Look for local static variables in child functions") commit? It reads correct to me, but you know that part of the code better than I do.

{
	parent->nb_children++;
	parent->children = realloc(parent->children,
				   sizeof(*parent->children) * parent->nb_children);
Contributor

A nit, but realloc() could fail so we should catch that like other allocation failures.

Contributor

Yes, will update.

Contributor

It also now needs to be freed at some point.
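
A minimal sketch of what the two notes above ask for: check the realloc() result before overwriting the pointer, and release the array once the symbol is torn down. Only the children and nb_children fields come from the quoted hunk; struct symbol, symbol_add_child() and symbol_free_children() are hypothetical names for illustration, and the -1 return stands in for whatever error handling the real code uses for other allocation failures.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for the symbol structure; only `children` and
 * `nb_children` appear in the quoted hunk. */
struct symbol {
	const char *name;
	struct symbol **children;
	size_t nb_children;
};

/* Append `child` to parent->children.
 * Returns 0 on success, -1 if realloc() fails; on failure the parent
 * keeps its old array and count, so nothing leaks or dangles. */
static int symbol_add_child(struct symbol *parent, struct symbol *child)
{
	struct symbol **tmp;

	assert(parent != NULL && child != NULL);

	tmp = realloc(parent->children,
		      sizeof(*parent->children) * (parent->nb_children + 1));
	if (!tmp)
		return -1;

	parent->children = tmp;
	parent->children[parent->nb_children++] = child;
	return 0;
}

/* Release the array allocated by symbol_add_child(). The children
 * themselves are owned elsewhere and are not freed here. */
static void symbol_free_children(struct symbol *parent)
{
	free(parent->children);
	parent->children = NULL;
	parent->nb_children = 0;
}
```

Assigning the realloc() result to a temporary first matters: writing it straight into parent->children would lose the only pointer to the old array when the call fails.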


	if (!sym->parent)
		ERROR("failed to find parent function for %s", sym->name);
	if (!parent)
Contributor

Are we missing ERROR("failed to find parent function for %s", parent) here?

Contributor

Yes, it seems I messed up my patch splitting, thanks for spotting this!

I'll update as well.

@@ -1,3 +1,5 @@
Disabled due to https://github.com/dynup/kpatch/issues/940
Contributor

Perhaps a follow up commit could rework these tests to avoid functions that use jump labels? I realize that's ignoring the problem, but would let us continue to test read-mostly and shadow variables in the meantime.

Contributor

I can have a look at it, but what might end up happening is that the test becomes completely different (patching other functions or maybe files) from the similar test in the fedora/rhel folders, which might be confusing if we keep the same name.

Perhaps we can keep this one disabled and add a second test "data-read-mostly-2.patch" that would (hopefully) work on RHEL-8 with the current state of kpatch, by not modifying functions with jump labels.

Contributor

You're right, creating a newly named test file would be less confusing. Anyway, that was just an idea for a follow-up PR if you wanted to explore that, not needed for this one.

{
const unsigned char *b = buf;
@@ -2383,6 +2383,12 @@ static ssize_t n_tty_write(struct tty_struct *tty, struct file *file,
return (b - buf) ? b - buf : retval;
Contributor

Note for future work: we should probably explain this in the patch author guide.

Contributor

> Note for future work: we should probably explain this in the patch author guide.

I'm working on it now, I'll post up a PR with author guide changes soon.

#!/bin/bash

SCRIPTDIR="$(readlink -f $(dirname $(type -p $0)))"
ROOTDIR="$(readlink -f $SCRIPTDIR/../..)"
Contributor

As @sm00th mentioned in PR #1040, this needs to be ROOTDIR="$(readlink -f $SCRIPTDIR/../../..)"

Contributor

Thanks, will update.

Member

@jpoimboe left a comment

It would be helpful if each commit description mentioned the patch which triggers the behavior.

Also we might want to consider adding unit tests for some of these issues.

Also it might make sense to split this up into several PRs. It's not clear to me how some of these fixes are related to the integration tests. We usually try to do one logical change per PR. So maybe the integration tests should be their own PR, and the other fixes should be in another (or maybe even multiple other) PRs.

* subfunctions. kpatch_detect_child_functions detects such subfunctions and
* subfunctions. Some functions can also be split into multiple *.part
* functions.
* kpatch_detect_child_functions detects such subfunctions and
Member

I'm not sure a ".part" function can be considered a "child". In the context of this code, a child (or subfunction) means that it's a ".cold" extension of the parent function, where the parent function can branch to the child subfunction.

But in the case of ".part", I believe it's an optimized clone of the original function. But otherwise the functions have no relation -- i.e. they are independent and the parent doesn't branch to the clone.

Or maybe I'm misunderstanding. Was there a patch which showed this behavior?

Contributor

So the initial motivation for this change was a failure in the gcc-static-local-var-5 test. The failure was due to a static local variable "lock" in function audit_log_end() being moved, in the patched object, into audit_log_end.part.17.

One particular thing about the patched object is that both the audit_log_end and audit_log_end.part.* symbols exist, while the original only has audit_log_end. So the original audit_log_end is twinned with the patched audit_log_end, static local references are only looked for in the patched section related to audit_log_end, and kpatch misses the fact that a reference was moved to the .part. section.

So, maybe the "child" consideration is not correct. But there needs to be a link between .part. functions and their original symbol (when it is also present in the patched object; most of the time I've noticed it isn't), and the static variable correlation needs to be able to follow that link when looking for relocations.

Should I create a separate list of "parts" of a symbol? Other suggestions?

Member

Hm. Does the patched audit_log_end() also reference the "lock" static local variable? IIRC, "part" functions are clones of the originals. So it seems like audit_log_end() should also reference the static local, unless I'm missing something.

Contributor

No, it doesn't have a reference to the lock. In this case, audit_log_end() contains little more than a jump to audit_log_end.part.

Member

Ah... I think I had a fundamental misunderstanding of this optimization. Looking at gcc/ipa-split.c in the GCC source confirms that. I had been thinking this was more like the constprop and isra optimizations.

So I take back my original comment! It's very similar to the .cold optimization and I think the parent/child relationship makes sense after all.
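
To make the parent/child link discussed above concrete, here is a rough sketch of name-based matching between a split-off symbol and its parent. It is not the kpatch_detect_child_functions implementation; child_parent_len() and is_child_of() are hypothetical helpers, and the only assumption taken from the thread is that children are named like "<parent>.cold" or "<parent>.part.N", while clones such as ".constprop.N" are not children.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* If `name` looks like a split-off piece of another function
 * ("<parent>.cold", "<parent>.cold.N" or "<parent>.part.N"), return the
 * length of the "<parent>" prefix; otherwise return 0. */
static size_t child_parent_len(const char *name)
{
	const char *p;

	assert(name != NULL);

	p = strstr(name, ".cold");
	if (!p)
		p = strstr(name, ".part.");
	if (!p || p == name)
		return 0;

	return (size_t)(p - name);
}

/* Return 1 if `child` is a .cold/.part piece of `parent`, else 0. */
static int is_child_of(const char *child, const char *parent)
{
	size_t len = child_parent_len(child);

	return len != 0 &&
	       strlen(parent) == len &&
	       strncmp(child, parent, len) == 0;
}
```

With such a link in place, a static local correlation pass could also scan the relocations of every child when a reference is missing from the parent, which is the situation described for audit_log_end and audit_log_end.part.17.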

Contributor

Thanks for having a look!

So how do you suggest I proceed with this PR? Split it in two, one with the test addition and patch modifications and the other with the create-diff-object changes?

Any suggestion for which one should be based on the other?

Member

Since we don't have a CI running these yet, I think I'd suggest doing the integration tests first, in one PR. That way the other fixes can refer to them.

Contributor

Makes sense, I'll try sending that later today after rebasing on master and checking there aren't new issues!

Julien Thierry and others added 6 commits October 31, 2019 12:47
Tests intentionally disabled should be skipped by multiple.test

Signed-off-by: Julien Thierry <jthierry@redhat.com>
Disabled patches won't trigger a build, but the combined load test
will still attempt to run their associated LOADED.test script. The
combined test will fail due to voluntarily disabled tests.

Do not run test scripts associated with disabled tests.

Signed-off-by: Julien Thierry <jthierry@redhat.com>
Rebase the integration test cases on top of RHEL 8.0 kernel version
4.18.0-80.el8.

Suggested-by: Joe Lawrence <joe.lawrence@redhat.com>
Signed-off-by: Kamalesh Babulal <kamalesh@linux.vnet.ibm.com>
[JT: adapt data-new-LOADED to new meminfo format,
     use common template for multiple.test]
Signed-off-by: Julien Thierry <jthierry@redhat.com>
Jump labels are unsupported, so tests modifying functions using them are
expected to fail. So disable them, for now...

Signed-off-by: Julien Thierry <jthierry@redhat.com>
On ppc64le, test gcc-static-local-var-4 impacts a jump label reference
which is currently unsupported.

Signed-off-by: Julien Thierry <jthierry@redhat.com>
new-function test fails on ppc64le with the following message:

create-diff-object: ERROR: n_tty.o: kpatch_no_sibling_calls_ppc64le: 3445: Found an unsupported sibling call at n_tty_write()+0x20.  Add __attribute__((optimize("-fno-optimize-sibling-calls"))) to n_tty_write() definition.

Add the suggested attribute, as was done for rhel-7.[5-7] versions of
the test.

Signed-off-by: Julien Thierry <jthierry@redhat.com>
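
For readers unfamiliar with the attribute named in the create-diff-object message above, this is a small standalone sketch of how it is applied to a single function with GCC. The function names are made up for illustration and are unrelated to n_tty_write(); the attribute string is the one quoted in the error message.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical callee; noinline keeps the call visible in the object code. */
__attribute__((noinline))
static size_t count_nonzero(const unsigned char *buf, size_t len)
{
	size_t i, n = 0;

	for (i = 0; i < len; i++)
		if (buf[i])
			n++;
	return n;
}

/*
 * The last statement is a candidate for sibling call optimization: an
 * optimizing GCC may emit a plain branch to count_nonzero() instead of a
 * call followed by a return. The attribute asks GCC to keep a regular
 * call, for this one function only.
 */
__attribute__((optimize("-fno-optimize-sibling-calls")))
size_t write_like(const unsigned char *buf, size_t len)
{
	assert(buf != NULL);
	return count_nonzero(buf, len);
}
```

Scoping the option to one function leaves the optimization level of the rest of the translation unit untouched, which keeps the patched object close to the original build.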
@julien-thierry
Contributor

After the last update, the following errors are outstanding:
ppc64le:

  • build: macro-printk
    ERROR: fib_semantics.o: object size mismatch: __msg.66237
    /root/kpatch/kpatch-build/create-diff-object: unreconcilable difference

This can be fixed with the following PR:
#1053

x86:

  • build: gcc-static-local-var-5
    ERROR: audit.o: reference to static local variable lock.65414 in audit_log_end was removed
    /root/kpatch/kpatch-build/create-diff-object: unreconcilable difference

This can be fixed with the following PR:
#1054

Contributor

@sm00th left a comment

I see gcc-static-local-var-5 failing on x86_64 with:
ERROR: audit.o: reference to static local variable lock.65414 in audit_log_end was removed

And macro-printk on ppc64le with:
ERROR: fib_semantics.o: object size mismatch: __msg.66237
(which I guess is caused by one of recent merges)

But I think it will be easier to deal with those in a separate PR since this one is already hard to review, but it has improved drastically from where we started. Thank you.

@julien-thierry
Contributor

> I see gcc-static-local-var-5 failing on x86_64 with:
> ERROR: audit.o: reference to static local variable lock.65414 in audit_log_end was removed
>
> And macro-printk on ppc64le with:
> ERROR: fib_semantics.o: object size mismatch: __msg.66237
> (which I guess is caused by one of recent merges)
>
> But I think it will be easier to deal with those in a separate PR since this one is already hard to review, but it has improved drastically from where we started. Thank you.

Both of these issues are expected (in the current state) and are fixed by the pending PRs #1054 and #1053

@jpoimboe jpoimboe merged commit 34a45ba into dynup:master Jan 15, 2020