AArch64: Enable kernel fragment on aarch64#1006
Conversation
|
Hi guys |
After backporting patch series of enabling memory hot remove on aarch64 to v5.4.x, we finally could enable nvdimm/dax on aarch64. Fixes: kata-containers#2601 Depends-on: github.com/kata-containers/packaging#1006 Signed-off-by: Penny Zheng <penny.zheng@arm.com>
jodh-intel
left a comment
There was a problem hiding this comment.
Very nice! Thanks @Pennyzct !
lgtm
|
@Pennyzct - label added. Note that you get the same behaviour (of stopping the PR from being merged) if you add "WIP" to the title of the PR. |
|
@jodh-intel thanks for the explanation. ;) |
grahamwhaley
left a comment
There was a problem hiding this comment.
Overall, l g t m @Pennyzct - very nice.
Thanks for clearing up all the x86 specifics that were in common - sorry about that ;-)
Just a couple of bits of feedback we should discuss with some others - but overall, looking good!
Let's hope this does make future kernel updates easier!
| # FIXME - check if we should be setting this | ||
| # https://github.com/kata-containers/packaging/issues/483 | ||
| #CONFIG_RANDOMIZE_BASE=y | ||
| CONFIG_RANDOMIZE_BASE=y |
There was a problem hiding this comment.
@grahamwhaley Enabling KSLR is recommended from a security point of view, but I dont think you can leverage KSM after this is enabled.
cc @sameo
There was a problem hiding this comment.
@amshinde thinking about this, KSM works by looking for hash identical pages - so, I suspect KSM will still work with RANDOMIZE_BASE - but, I suspect you cannot do any native address based sharing (as the page addresses will be different) - but then I'm not sure we do that between guest kernels loaded by the VM anyway?
I had a dig around to see if there were any documented and notable disadvantages to setting CONFIG_RANDOMIZE_BASE. The only real one I can identify as being real/current is that it will hinder trying to debug crashed kernels, as the address map will change between each boot. I think we debug guest kernels so infrequently that the gains from the positive security aspect probably outweigh the negatives of the debug situation.
We probably want to ask @kata-containers/architecture-committee if they have any thoughts here.
There was a problem hiding this comment.
I remember noticing that the 'crash' tool had something for this (possibly x86 only and not sure how this works on arm and a guest):
commit ad1a44f5d9b9a41d2c43bb386c765913275f05bf
Author: Dave Anderson anderson@redhat.com
Date: Fri Jan 13 15:38:39 2017 -0500
Fix for support of /proc/kcore as the live memory source in Linux 4.8
and later x86_64 kernels configured with CONFIG_RANDOMIZE_BASE, which
randomizes the unity-mapping PAGE_OFFSET value. Without the patch,
the crash session fails during session initialization with the error
message "crash: seek error: kernel virtual address: <address>
type: page_offset_base".
(anderson@redhat.com)
| # Compaction is the only memory management component to form high order | ||
| # (larger physically contiguous) memory blocks reliably. The lack of the | ||
| # feature can lead to unexpected OOM killer invocations for high order memory requests. | ||
| CONFIG_COMPACTION=y |
There was a problem hiding this comment.
ooh, nice catch - yes, having checked, we probably do want this enabled.
Note to others, this is a config change for x86.
| CONFIG_SCSI_VIRTIO=y | ||
| CONFIG_VIRTIO_BLK=y | ||
| CONFIG_TTY=y | ||
| CONFIG_VIRTIO_BALLOON=y |
There was a problem hiding this comment.
I'm not sure if we actually use balloon - @jcvenegas can you check/config pls.
There was a problem hiding this comment.
okay, I'll delete it. ;)
| # Use the maximum number of CPUs supported by KVM (255) | ||
| CONFIG_NR_CPUS=255 | ||
|
|
||
| CONFIG_PERF_EVENTS=y |
There was a problem hiding this comment.
just to note, iirc, we are debating if we need perf on/off by default in the kata kernels I think - iirc, they may come with a startup perf cost?
There was a problem hiding this comment.
IMO, we could make perf off by default.
But for x86_64, this config CONFIG_PERF_EVENTS is automatically selected by CONFIG_KVM, See https://github.com/torvalds/linux/blob/master/arch/x86/kvm/Kconfig#L41.
And we select CONFIG_CGROUP_PERF=y in common cgroup config file, which is dependent on CONFIG_PERF_EVENTS.
I'll try to dump kernel start up info to track the cost. ;)
There was a problem hiding this comment.
Oh, that's interesting - didn't know KVM selected PERF_EVENTS. I wonder if they are needed for something specific - would you know @dagrh ?
I guess having them on may do no harm - but, I also guess that is now an architecture specific decision at some level - so probably your choice for this arch @Pennyzct :-)
There was a problem hiding this comment.
That also surprises me; my reading of it is that it's used to emulate a PMU in the guest on x86 - although I thought the hardware PMU helped with that these days,
| @@ -0,0 +1,498 @@ | |||
| From ba91422b18892bceacf3b4aa60354cf36fcabf9b Mon Sep 17 00:00:00 2001 | |||
There was a problem hiding this comment.
I'm sure this has probably been discussed - but, do we do hot remove on Kata - I thought we pretty much only ever added, as remove can be very expensive (as you need to 'clear out' any used pages in the 'dimm' you want to remove).
/cc @jcvenegas for any thoughts.
There was a problem hiding this comment.
Hi~ @grahamwhaley
The reason that I enabled memory hot remove is that ZONE_DEVICE is dependent on it, and
NVDIMM/DAX needs ZONE_DEVICE. ;). maybe I should elaborate more in commit message.
| # This can still be dynamically disabled on the kernel command line/kata config if needed. | ||
| # Disable for now, as it upsets the entropy test, and we need to improve those: FIXME: see: | ||
| # https://github.com/kata-containers/tests/issues/1543 | ||
| # CONFIG_RANDOM_TRUST_CPU is not set |
There was a problem hiding this comment.
I'm not sure that is/was a typo - iirc enabling it broke entropy - we need to review this one I think.
| # Estimated cost (detailed in the kernel config files) | ||
| # is maybe 2.3% for both | ||
| CONFIG_STACKPROTECTOR=y | ||
| CONFIG_STACKPROTECTOR_STRONG |
| # | ||
| # Automatically generated file; DO NOT EDIT. | ||
| # Linux/arm64 5.4.3 Kernel Configuration | ||
| # Linux/arm64 5.4.15 Kernel Configuration |
There was a problem hiding this comment.
Nice to have the update! - ooi, have you done any metrics comparisons between the old and new config?
There was a problem hiding this comment.
yep, I'll do the metrics comparisons asap. ;)
|
/test-ubuntu |
|
@Pennyzct - np, it's documented in the table here: https://github.com/kata-containers/community/blob/master/CONTRIBUTING.md#temporarily-blocking-a-pr. |
c5b957f to
88899cb
Compare
|
Hi guys @grahamwhaley @amshinde @kata-containers/kernel I've re-based and addressed a few comments. ;) Still, we have a few configs left to discuss: |
grahamwhaley
left a comment
There was a problem hiding this comment.
Thanks for the metrics! Nice it all headed in the 'good' direction.
I'll drop my lgmt here in preparation for other feedback/decisions and the dropping of the WIP
| # FIXME - check if we should be setting this | ||
| # https://github.com/kata-containers/packaging/issues/483 | ||
| #CONFIG_RANDOMIZE_BASE=y | ||
| CONFIG_RANDOMIZE_BASE=y |
There was a problem hiding this comment.
@amshinde thinking about this, KSM works by looking for hash identical pages - so, I suspect KSM will still work with RANDOMIZE_BASE - but, I suspect you cannot do any native address based sharing (as the page addresses will be different) - but then I'm not sure we do that between guest kernels loaded by the VM anyway?
I had a dig around to see if there were any documented and notable disadvantages to setting CONFIG_RANDOMIZE_BASE. The only real one I can identify as being real/current is that it will hinder trying to debug crashed kernels, as the address map will change between each boot. I think we debug guest kernels so infrequently that the gains from the positive security aspect probably outweigh the negatives of the debug situation.
We probably want to ask @kata-containers/architecture-committee if they have any thoughts here.
| # Use the maximum number of CPUs supported by KVM (255) | ||
| CONFIG_NR_CPUS=255 | ||
|
|
||
| CONFIG_PERF_EVENTS=y |
There was a problem hiding this comment.
Oh, that's interesting - didn't know KVM selected PERF_EVENTS. I wonder if they are needed for something specific - would you know @dagrh ?
I guess having them on may do no harm - but, I also guess that is now an architecture specific decision at some level - so probably your choice for this arch @Pennyzct :-)
For now, a few configs as follows in common acpi dir are truly x86-spcecific or disable by default on arm64. CONFIG_ACPI_CPU_FREQ_PSS=y CONFIG_ACPI_HOTPLUG_IOAPIC=y CONFIG_ACPI_LEGACY_TABLES_LOOKUP CONFIG_ACPI_LPIT=y CONFIG_ARCH_MIGHT_HAVE_ACPI_PDC=y CONFIG_ACPI_PROCESSOR_CSTATE=y CONFIG_ACPI_SYSTEM_POWER_STATES_SUPPORT=y CONFIG_HAVE_ACPI_APEI_NMI=y And I also add a few configs which are aarch64-specific. Like CONFIG_ACPI_REDUCED_HARDWARE_ONLY=y, since ARM64 can run properly in ACPI hardware reduced mode. Fixes: kata-containers#1004 Signed-off-by: Penny Zheng <penny.zheng@arm.com>
There exists a few configs about linux guest support or optimization that are not supported on aarch64. CONFIG_HYPERVISOR_GUEST is only defined under arch/x86/Kconfig and unfortunately, CONFIG_KVM_GUEST is not supported on aarch64 for now. Fixes: kata-containers#1004 Signed-off-by: Penny Zheng <penny.zheng@arm.com>
There exists a few security-related configs, which are x86-64 specific. CONFIG_LEGACY_VSYSCALL_NONE=y CONFIG_RETPOLINE=y CONFIG_RELOCATABLE and CONFIG_RANDOMIZE_BASE are kinds of tangled on aarch64, if CONFIG_RANDOMIZE_BASE=y, then CONFIG_RELOCATABLE will be selected automatically. CONFIG_RANDOMIZE_BASE will randomize the virtual address at which the kernel image is loaded, which as a security feature could deter exploit attempts relying on knowledge of the location of kernel internals. Fixes: kata-containers#1004 Signed-off-by: Penny Zheng <penny.zheng@arm.com>
mmio devices are required in firecracker, and for now, x86_64 and aarch64 are all supporting kata containers with firecracker. So, we need to move mmio-related configs to common dir. Fixes: kata-containers#1004 Signed-off-by: Penny Zheng <penny.zheng@arm.com>
Compaction is the only memory management component to form high order (larger physically contiguous) memory blocks reliably. The page allocator relies on compaction heavily and the lack of the feature can lead to unexpected OOM killer invocations for high order memory requests. We shouldn't disable this option unless there really is a strong reason. Fixes: kata-containers#1004 Signed-off-by: Penny Zheng <penny.zheng@arm.com>
Add a few arm64-specific configs and classify them into seven new categories , that is, 1. base architecture-dependent options(base.conf) It also includes varient-specific features, like CONFIG_ARM64_PMEM is one ARMv8.2 arichitectural features. 2. crypto-related options(crypto.conf) ARMv8 adds cryptographic instructions that could significantly improve performance on tasks such as AES encryption and SHA1 and SHA256 hashing. 3. device tree related options(dt.conf) The "Open Firmware Device Tree", or simply Device Tree (DT), is a data structure and language for describing hardware, which is commonly used in arm architecture. 4. ARM errata workarounds options(errata.conf) There are many Kconfig entires under "Kernel Features" -> "ARM errata workarounds via the alternatives framework", which provides software workarounds to mitigate systems affected by those erratum. Vendor-specific option will be left to users to decide. 5. pci related options(pci.conf) a simplified pci host controller for mach-virt. 6. serial devices options(serial.conf) CONFIG_SERIAL_OF_PLATFORM is used for all 8250 compatible serial ports that are probed through device tree. 7. rtc related options(rtc.conf) we don't have KVM’s paravirtualized clock and ptp implementation is still under experimental mode, so we need rtc on aarch64. QEMU provides an emulated ARM AMBA PrimeCell PL031 RTC. Fixes: kata-containers#1004 Signed-off-by: Penny Zheng <penny.zheng@arm.com>
Backport Anshuman Khandual's patch series of Enabling memory hot remove on aarch64(https://patchwork.kernel.org/cover/11419305/) to v5.4.x. XONE_DEVICE is dependent on the implementation of memory hot remove. This patch series has already been merged, and queued for 5.7. After backporting this series, we could finally enable nvdimm/dax on arm64. Fixes: kata-containers#1004 Signed-off-by: Penny Zheng <penny.zheng@arm.com>
fix a few typo errors. Fixes: kata-containers#1004 Signed-off-by: Penny Zheng <penny.zheng@arm.com>
Since we disable pci shpc hotplug for arm64, see kata-containers#498 for detailed reason. We need to move CONFIG_HOTPLUG_PCI_SHPC from common conf to x86_64-specific. Fixes: kata-containers#1004 Signed-off-by: Penny Zheng <penny.zheng@arm.com> Signed-off-by: Jianyong Wu <jianyong.wu@arm.com>
88899cb to
d9b1e94
Compare
|
Hi guys @grahamwhaley @dagrh |
|
Hi guys |
kvm-ptp is critical for mitigating time drift between host and guest. This implementation in kernel side is still one experimental feature on aarch64, and see kata-containers#998 for detailed instructions. Fixes: kata-containers#1004 Signed-off-by: Penny Zheng <penny.zheng@arm.com> Signed-off-by: Jianyong Wu <jianyong.wu@arm.com>
The config file created by kernel fragments scheme is quite different with the old arm64_kata_kvm_5.4.x. So I will update arm64_kata_kvm_5.4.x for consistency. Fixes: kata-containers#1004 Signed-off-by: Penny Zheng <penny.zheng@arm.com>
d9b1e94 to
14a281d
Compare
|
/test-ubuntu |
|
/AzurePipelines run |
|
Commenter does not have sufficient privileges for PR 1006 in repo kata-containers/packaging |
|
Hi guys |
|
/AzurePipelines run |
|
Azure Pipelines successfully started running 2 pipeline(s). |
|
Hi guys |
The linux kernel feature RANDOMIZE_BASE improved the security and at the same time increased the memory footprint of a kata container, this feature was enabled in kata-containers/packaging#1006. In order to mitigate this increase in memory consumption, we can boot container using the uncompressed kernel. fixes # Signed-off-by: Julio Montes <julio.montes@intel.com>
The linux kernel feature RANDOMIZE_BASE improved the security and at the same time increased the memory footprint of a kata container, this feature was enabled in kata-containers/packaging#1006. In order to mitigate this increase in memory consumption, we can boot container using the uncompressed kernel. fixes kata-containers#669 Signed-off-by: Julio Montes <julio.montes@intel.com>
The linux kernel feature RANDOMIZE_BASE improved the security and at the same time increased the memory footprint of a kata container, this feature was enabled in kata-containers/packaging#1006. In order to mitigate this increase in memory consumption, we can boot container using the uncompressed kernel. fixes kata-containers#669 Signed-off-by: Julio Montes <julio.montes@intel.com>
The linux kernel feature RANDOMIZE_BASE improved the security and at the same time increased the memory footprint of a kata container, this feature was enabled in kata-containers/packaging#1006. In order to mitigate this increase in memory consumption, we can boot container using the uncompressed kernel. Reduce boot time by ~5% Reduce KSM memory footprint by ~14% Reduce noKSM memory footprint by ~27% fixes kata-containers#669 Signed-off-by: Julio Montes <julio.montes@intel.com>
The linux kernel feature RANDOMIZE_BASE improved the security and at the same time increased the memory footprint of a kata container, this feature was enabled in kata-containers/packaging#1006. In order to mitigate this increase in memory consumption, we can boot container using the uncompressed kernel. Reduce boot time by ~5% Reduce KSM memory footprint by ~14% Reduce noKSM memory footprint by ~27% fixes kata-containers#669 Signed-off-by: Julio Montes <julio.montes@intel.com>


Hi guys @kata-containers/kernel @grahamwhaley
I've been trying to enable kernel fragments on arm64. It's quite a large patch, plz review. ;)