Skip to content
This repository was archived by the owner on May 12, 2021. It is now read-only.

AArch64: Enable kernel fragment on aarch64#1006

Merged
jodh-intel merged 11 commits into
kata-containers:masterfrom
Pennyzct:kernel_fragment_on_aarch64
Apr 23, 2020
Merged

AArch64: Enable kernel fragment on aarch64#1006
jodh-intel merged 11 commits into
kata-containers:masterfrom
Pennyzct:kernel_fragment_on_aarch64

Conversation

@Pennyzct
Copy link
Copy Markdown
Contributor

@Pennyzct Pennyzct commented Apr 9, 2020

Hi guys @kata-containers/kernel @grahamwhaley

I've been trying to enable kernel fragments on arm64. It's quite a large patch, plz review. ;)

@Pennyzct
Copy link
Copy Markdown
Contributor Author

Pennyzct commented Apr 9, 2020

Hi guys
Could you plz add a WIP tag here?? I've waiting for #998 got merged and then do a few twist here. ;)

@Pennyzct Pennyzct closed this Apr 9, 2020
@Pennyzct Pennyzct reopened this Apr 9, 2020
Pennyzct added a commit to Pennyzct/runtime that referenced this pull request Apr 9, 2020
After backporting patch series of enabling memory hot remove on aarch64
to v5.4.x, we finally could enable nvdimm/dax on aarch64.

Fixes: kata-containers#2601
Depends-on: github.com/kata-containers/packaging#1006

Signed-off-by: Penny Zheng <penny.zheng@arm.com>
Copy link
Copy Markdown

@jodh-intel jodh-intel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice! Thanks @Pennyzct !

lgtm

@jodh-intel jodh-intel added the wip Work in Progress (PR incomplete - needs more work or rework) label Apr 9, 2020
@jodh-intel
Copy link
Copy Markdown

@Pennyzct - label added. Note that you get the same behaviour (of stopping the PR from being merged) if you add "WIP" to the title of the PR.

@Pennyzct
Copy link
Copy Markdown
Contributor Author

Pennyzct commented Apr 9, 2020

@jodh-intel thanks for the explanation. ;)

Copy link
Copy Markdown
Contributor

@grahamwhaley grahamwhaley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall, l g t m @Pennyzct - very nice.
Thanks for clearing up all the x86 specifics that were in common - sorry about that ;-)

Just a couple of bits of feedback we should discuss with some others - but overall, looking good!
Let's hope this does make future kernel updates easier!

# FIXME - check if we should be setting this
# https://github.com/kata-containers/packaging/issues/483
#CONFIG_RANDOMIZE_BASE=y
CONFIG_RANDOMIZE_BASE=y
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a change to x86 as well I think? We should contemplate from the security pov then
@amshinde @egernst

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@grahamwhaley Enabling KSLR is recommended from a security point of view, but I dont think you can leverage KSM after this is enabled.
cc @sameo

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@amshinde thinking about this, KSM works by looking for hash identical pages - so, I suspect KSM will still work with RANDOMIZE_BASE - but, I suspect you cannot do any native address based sharing (as the page addresses will be different) - but then I'm not sure we do that between guest kernels loaded by the VM anyway?

I had a dig around to see if there were any documented and notable disadvantages to setting CONFIG_RANDOMIZE_BASE. The only real one I can identify as being real/current is that it will hinder trying to debug crashed kernels, as the address map will change between each boot. I think we debug guest kernels so infrequently that the gains from the positive security aspect probably outweigh the negatives of the debug situation.

We probably want to ask @kata-containers/architecture-committee if they have any thoughts here.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I remember noticing that the 'crash' tool had something for this (possibly x86 only and not sure how this works on arm and a guest):
commit ad1a44f5d9b9a41d2c43bb386c765913275f05bf
Author: Dave Anderson anderson@redhat.com
Date: Fri Jan 13 15:38:39 2017 -0500

Fix for support of /proc/kcore as the live memory source in Linux 4.8
and later x86_64 kernels configured with CONFIG_RANDOMIZE_BASE, which
randomizes the unity-mapping PAGE_OFFSET value.  Without the patch,
the crash session fails during session initialization with the error
message "crash: seek error: kernel virtual address: <address>
type: page_offset_base".
(anderson@redhat.com)

# Compaction is the only memory management component to form high order
# (larger physically contiguous) memory blocks reliably. The lack of the
# feature can lead to unexpected OOM killer invocations for high order memory requests.
CONFIG_COMPACTION=y
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ooh, nice catch - yes, having checked, we probably do want this enabled.
Note to others, this is a config change for x86.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice catch @Pennyzct

CONFIG_SCSI_VIRTIO=y
CONFIG_VIRTIO_BLK=y
CONFIG_TTY=y
CONFIG_VIRTIO_BALLOON=y
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if we actually use balloon - @jcvenegas can you check/config pls.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dont think we use this today.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

okay, I'll delete it. ;)

# Use the maximum number of CPUs supported by KVM (255)
CONFIG_NR_CPUS=255

CONFIG_PERF_EVENTS=y
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just to note, iirc, we are debating if we need perf on/off by default in the kata kernels I think - iirc, they may come with a startup perf cost?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO, we could make perf off by default.
But for x86_64, this config CONFIG_PERF_EVENTS is automatically selected by CONFIG_KVM, See https://github.com/torvalds/linux/blob/master/arch/x86/kvm/Kconfig#L41.
And we select CONFIG_CGROUP_PERF=y in common cgroup config file, which is dependent on CONFIG_PERF_EVENTS.
I'll try to dump kernel start up info to track the cost. ;)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, that's interesting - didn't know KVM selected PERF_EVENTS. I wonder if they are needed for something specific - would you know @dagrh ?

I guess having them on may do no harm - but, I also guess that is now an architecture specific decision at some level - so probably your choice for this arch @Pennyzct :-)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That also surprises me; my reading of it is that it's used to emulate a PMU in the guest on x86 - although I thought the hardware PMU helped with that these days,

@@ -0,0 +1,498 @@
From ba91422b18892bceacf3b4aa60354cf36fcabf9b Mon Sep 17 00:00:00 2001
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm sure this has probably been discussed - but, do we do hot remove on Kata - I thought we pretty much only ever added, as remove can be very expensive (as you need to 'clear out' any used pages in the 'dimm' you want to remove).
/cc @jcvenegas for any thoughts.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi~ @grahamwhaley
The reason that I enabled memory hot remove is that ZONE_DEVICE is dependent on it, and
NVDIMM/DAX needs ZONE_DEVICE. ;). maybe I should elaborate more in commit message.

# This can still be dynamically disabled on the kernel command line/kata config if needed.
# Disable for now, as it upsets the entropy test, and we need to improve those: FIXME: see:
# https://github.com/kata-containers/tests/issues/1543
# CONFIG_RANDOM_TRUST_CPU is not set
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure that is/was a typo - iirc enabling it broke entropy - we need to review this one I think.

# Estimated cost (detailed in the kernel config files)
# is maybe 2.3% for both
CONFIG_STACKPROTECTOR=y
CONFIG_STACKPROTECTOR_STRONG
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ouch! nice catch :-)

Comment thread kernel/configs/arm64_kata_kvm_5.4.x Outdated
#
# Automatically generated file; DO NOT EDIT.
# Linux/arm64 5.4.3 Kernel Configuration
# Linux/arm64 5.4.15 Kernel Configuration
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice to have the update! - ooi, have you done any metrics comparisons between the old and new config?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep, I'll do the metrics comparisons asap. ;)

@grahamwhaley
Copy link
Copy Markdown
Contributor

/test-ubuntu
because I'm curious...

@jodh-intel
Copy link
Copy Markdown

@Pennyzct Pennyzct force-pushed the kernel_fragment_on_aarch64 branch 2 times, most recently from c5b957f to 88899cb Compare April 17, 2020 06:41
@Pennyzct
Copy link
Copy Markdown
Contributor Author

Pennyzct commented Apr 17, 2020

Hi guys @grahamwhaley @amshinde @kata-containers/kernel

I've re-based and addressed a few comments. ;)
The metrics report on boot-up time and memory footprint is here. For boot-up time, we have 3%~5% decrease after importing kernel fragments and enabling dax.
And for memory foortprint, we only could run a few tests, since we need a few extra configuration to run memory hotplug(need to boot from uefi). Still, we could see around 6% decrease on memory footprint with ksm and without ksm.
image

image

Still, we have a few configs left to discuss:
CONFIG_RANDOMIZE_BASE=y
CONFIG_PERF_EVENTS=y and CONFIG_CGROUP_PERF=y

Copy link
Copy Markdown
Contributor

@grahamwhaley grahamwhaley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the metrics! Nice it all headed in the 'good' direction.
I'll drop my lgmt here in preparation for other feedback/decisions and the dropping of the WIP

# FIXME - check if we should be setting this
# https://github.com/kata-containers/packaging/issues/483
#CONFIG_RANDOMIZE_BASE=y
CONFIG_RANDOMIZE_BASE=y
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@amshinde thinking about this, KSM works by looking for hash identical pages - so, I suspect KSM will still work with RANDOMIZE_BASE - but, I suspect you cannot do any native address based sharing (as the page addresses will be different) - but then I'm not sure we do that between guest kernels loaded by the VM anyway?

I had a dig around to see if there were any documented and notable disadvantages to setting CONFIG_RANDOMIZE_BASE. The only real one I can identify as being real/current is that it will hinder trying to debug crashed kernels, as the address map will change between each boot. I think we debug guest kernels so infrequently that the gains from the positive security aspect probably outweigh the negatives of the debug situation.

We probably want to ask @kata-containers/architecture-committee if they have any thoughts here.

# Use the maximum number of CPUs supported by KVM (255)
CONFIG_NR_CPUS=255

CONFIG_PERF_EVENTS=y
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, that's interesting - didn't know KVM selected PERF_EVENTS. I wonder if they are needed for something specific - would you know @dagrh ?

I guess having them on may do no harm - but, I also guess that is now an architecture specific decision at some level - so probably your choice for this arch @Pennyzct :-)

For now, a few configs as follows in common acpi dir are truly x86-spcecific
or disable by default on arm64.
CONFIG_ACPI_CPU_FREQ_PSS=y
CONFIG_ACPI_HOTPLUG_IOAPIC=y
CONFIG_ACPI_LEGACY_TABLES_LOOKUP
CONFIG_ACPI_LPIT=y
CONFIG_ARCH_MIGHT_HAVE_ACPI_PDC=y
CONFIG_ACPI_PROCESSOR_CSTATE=y
CONFIG_ACPI_SYSTEM_POWER_STATES_SUPPORT=y
CONFIG_HAVE_ACPI_APEI_NMI=y
And I also add a few configs which are aarch64-specific.
Like CONFIG_ACPI_REDUCED_HARDWARE_ONLY=y, since ARM64 can run properly
in ACPI hardware reduced mode.

Fixes: kata-containers#1004

Signed-off-by: Penny Zheng <penny.zheng@arm.com>
There exists a few configs about linux guest support or optimization
that are not supported on aarch64.
CONFIG_HYPERVISOR_GUEST is only defined under arch/x86/Kconfig and
unfortunately, CONFIG_KVM_GUEST is not supported on aarch64 for now.

Fixes: kata-containers#1004

Signed-off-by: Penny Zheng <penny.zheng@arm.com>
There exists a few security-related configs, which are x86-64 specific.
CONFIG_LEGACY_VSYSCALL_NONE=y
CONFIG_RETPOLINE=y

CONFIG_RELOCATABLE and CONFIG_RANDOMIZE_BASE are kinds of tangled on
aarch64, if CONFIG_RANDOMIZE_BASE=y, then CONFIG_RELOCATABLE will be
selected automatically.
CONFIG_RANDOMIZE_BASE will randomize the virtual address at which the
kernel image is loaded, which as a security feature could deter exploit
attempts relying on knowledge of the location of kernel internals.

Fixes: kata-containers#1004

Signed-off-by: Penny Zheng <penny.zheng@arm.com>
mmio devices are required in firecracker, and for now, x86_64 and
aarch64 are all supporting kata containers with firecracker.
So, we need to move mmio-related configs to common dir.

Fixes: kata-containers#1004

Signed-off-by: Penny Zheng <penny.zheng@arm.com>
Compaction is the only memory management component to form high order
(larger physically contiguous) memory blocks reliably.
The page allocator relies on compaction heavily and the lack of the feature
can lead to unexpected OOM killer invocations for high order memory requests.
We shouldn't disable this option unless there really is a strong reason.

Fixes: kata-containers#1004

Signed-off-by: Penny Zheng <penny.zheng@arm.com>
Add a few arm64-specific configs and classify them into seven new categories
, that is,
1. base architecture-dependent options(base.conf)
It also includes varient-specific features, like CONFIG_ARM64_PMEM is
one ARMv8.2 arichitectural features.
2. crypto-related options(crypto.conf)
ARMv8 adds cryptographic instructions that could significantly improve
performance on tasks such as AES encryption and SHA1 and SHA256 hashing.
3. device tree related options(dt.conf)
The "Open Firmware Device Tree", or simply Device Tree (DT), is a data
structure and language for describing hardware, which is commonly
used in arm architecture.
4. ARM errata workarounds options(errata.conf)
There are many Kconfig entires under "Kernel Features" ->
"ARM errata workarounds via the alternatives framework", which provides
software workarounds to mitigate systems affected by those erratum.
Vendor-specific option will be left to users to decide.
5. pci related options(pci.conf)
a simplified pci host controller for mach-virt.
6. serial devices options(serial.conf)
CONFIG_SERIAL_OF_PLATFORM is used for all 8250 compatible serial ports
that are probed through device tree.
7. rtc related options(rtc.conf)
we don't have KVM’s paravirtualized clock and ptp implementation is
still under experimental mode, so we need rtc on aarch64.
QEMU provides an emulated ARM AMBA PrimeCell PL031 RTC.

Fixes: kata-containers#1004

Signed-off-by: Penny Zheng <penny.zheng@arm.com>
Backport Anshuman Khandual's patch series of Enabling memory hot
remove on aarch64(https://patchwork.kernel.org/cover/11419305/)
to v5.4.x.
XONE_DEVICE is dependent on the implementation of memory hot remove.
This patch series has already been merged, and queued for 5.7.
After backporting this series, we could finally enable nvdimm/dax
on arm64.

Fixes: kata-containers#1004

Signed-off-by: Penny Zheng <penny.zheng@arm.com>
fix a few typo errors.

Fixes: kata-containers#1004

Signed-off-by: Penny Zheng <penny.zheng@arm.com>
Since we disable pci shpc hotplug for arm64, see
kata-containers#498 for detailed
reason.
We need to move CONFIG_HOTPLUG_PCI_SHPC from common conf to
x86_64-specific.

Fixes: kata-containers#1004

Signed-off-by: Penny Zheng <penny.zheng@arm.com>
Signed-off-by: Jianyong Wu <jianyong.wu@arm.com>
@Pennyzct Pennyzct force-pushed the kernel_fragment_on_aarch64 branch from 88899cb to d9b1e94 Compare April 21, 2020 06:49
@Pennyzct
Copy link
Copy Markdown
Contributor Author

Pennyzct commented Apr 21, 2020

Hi guys @grahamwhaley @dagrh
I will let config_perf_events on arm64 for consistency.
And BTW, in @justin-he's new PR, kata-containers/runtime@105673b, we will add pmu=off in qemu command line to deliberately turn off pmu, since pmu will cost significant overhead on boot time and virtualization context switch.

@Pennyzct
Copy link
Copy Markdown
Contributor Author

Hi guys
I've added a new commit to enable kvm-ptp, plz re-review. ;)
And I've done all the work here, so maybe we could remove the wip flag here.
And btw, since we have no arm jenkins job here in packaging, I raised a PR kata-containers/runtime#2602 in runtime repo to run all tests.

kvm-ptp is critical for mitigating time drift between host and guest.
This implementation in kernel side is still one experimental feature on
aarch64, and see kata-containers#998
for detailed instructions.

Fixes: kata-containers#1004

Signed-off-by: Penny Zheng <penny.zheng@arm.com>
Signed-off-by: Jianyong Wu <jianyong.wu@arm.com>
The config file created by kernel fragments scheme is quite different
with the old arm64_kata_kvm_5.4.x.
So I will update arm64_kata_kvm_5.4.x for consistency.

Fixes: kata-containers#1004

Signed-off-by: Penny Zheng <penny.zheng@arm.com>
@Pennyzct Pennyzct force-pushed the kernel_fragment_on_aarch64 branch from d9b1e94 to 14a281d Compare April 21, 2020 09:52
@grahamwhaley grahamwhaley removed the wip Work in Progress (PR incomplete - needs more work or rework) label Apr 21, 2020
@Pennyzct
Copy link
Copy Markdown
Contributor Author

/test-ubuntu

@Pennyzct
Copy link
Copy Markdown
Contributor Author

/AzurePipelines run

@azure-pipelines
Copy link
Copy Markdown

Commenter does not have sufficient privileges for PR 1006 in repo kata-containers/packaging

@Pennyzct
Copy link
Copy Markdown
Contributor Author

Pennyzct commented Apr 22, 2020

Hi guys
obs ci is not responding. ;)
Need help trigger it~~~

@jodh-intel
Copy link
Copy Markdown

/AzurePipelines run

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 2 pipeline(s).

@Pennyzct
Copy link
Copy Markdown
Contributor Author

Hi guys
CI is all happy now~~~~~~🎉🎊

@jodh-intel jodh-intel merged commit 56d7074 into kata-containers:master Apr 23, 2020
devimc pushed a commit to devimc/kata-containers that referenced this pull request Sep 3, 2020
The linux kernel feature RANDOMIZE_BASE improved the security and at
the same time increased the memory footprint of a kata container,
this feature was enabled in kata-containers/packaging#1006.
In order to mitigate this increase in memory consumption, we can
boot container using the uncompressed kernel.

fixes #

Signed-off-by: Julio Montes <julio.montes@intel.com>
devimc pushed a commit to devimc/kata-containers that referenced this pull request Sep 3, 2020
The linux kernel feature RANDOMIZE_BASE improved the security and at
the same time increased the memory footprint of a kata container,
this feature was enabled in kata-containers/packaging#1006.
In order to mitigate this increase in memory consumption, we can
boot container using the uncompressed kernel.

fixes kata-containers#669

Signed-off-by: Julio Montes <julio.montes@intel.com>
devimc pushed a commit to devimc/kata-containers that referenced this pull request Sep 4, 2020
The linux kernel feature RANDOMIZE_BASE improved the security and at
the same time increased the memory footprint of a kata container,
this feature was enabled in kata-containers/packaging#1006.
In order to mitigate this increase in memory consumption, we can
boot container using the uncompressed kernel.

fixes kata-containers#669

Signed-off-by: Julio Montes <julio.montes@intel.com>
devimc pushed a commit to devimc/kata-containers that referenced this pull request Sep 4, 2020
The linux kernel feature RANDOMIZE_BASE improved the security and at
the same time increased the memory footprint of a kata container,
this feature was enabled in kata-containers/packaging#1006.
In order to mitigate this increase in memory consumption, we can
boot container using the uncompressed kernel.

Reduce boot time by ~5%
Reduce KSM memory footprint by ~14%
Reduce noKSM memory footprint by ~27%

fixes kata-containers#669

Signed-off-by: Julio Montes <julio.montes@intel.com>
devimc pushed a commit to devimc/kata-containers that referenced this pull request Sep 4, 2020
The linux kernel feature RANDOMIZE_BASE improved the security and at
the same time increased the memory footprint of a kata container,
this feature was enabled in kata-containers/packaging#1006.
In order to mitigate this increase in memory consumption, we can
boot container using the uncompressed kernel.

Reduce boot time by ~5%
Reduce KSM memory footprint by ~14%
Reduce noKSM memory footprint by ~27%

fixes kata-containers#669

Signed-off-by: Julio Montes <julio.montes@intel.com>
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants