Skip to content
This repository was archived by the owner on May 12, 2021. It is now read-only.

kernel: enable config CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE by default#335

Merged
jodh-intel merged 1 commit into
kata-containers:masterfrom
wenlxie:fixmemory
Mar 13, 2019
Merged

kernel: enable config CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE by default#335
jodh-intel merged 1 commit into
kata-containers:masterfrom
wenlxie:fixmemory

Conversation

@wenlxie
Copy link
Copy Markdown
Contributor

@wenlxie wenlxie commented Feb 19, 2019

In my test env, I assigned 120Gi memory to the container, but found that Sandbox can't be launched. For details: kata-containers/runtime#1244
kata-agent and udev process was killed because of OOM triggerd.
https://lwn.net/Articles/668944/

Signed-off-by: Wenli Xie wenlxie@ebay.com

Copy link
Copy Markdown

@jodh-intel jodh-intel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

We might want input from @grahamwhaley in case this has an impact on the fragements work (#314).

@grahamwhaley
Copy link
Copy Markdown
Contributor

:-) I think more importantly is how this does or does not fit in with the memory hotplug flow of kata - @jcvenegas @devimc for input from the 'how is kata hotplug meant to work' side.
But, this is also very good evidence for the 120Gb hotplug failure case as well - nice find @wenlxie

@devimc
Copy link
Copy Markdown

devimc commented Feb 19, 2019

@wenlxie great job! do you know why kata-agent receives OOM?
Currently we connect the memory (and CPUs) manually through gRPC/kata-agent, with this patch this implementation should change. I'm not sure but I think this patch can break compatibility with ARM (cc @Pennyzct @Weichen81 )

@jcvenegas
Copy link
Copy Markdown
Member

@wenlxie intesting what is the maximal amount of memory you can hotplug currently with out get the agent killed ?

@sboeuf
Copy link
Copy Markdown

sboeuf commented Feb 19, 2019

@wenlxie the way memory hotplug works for ARM has been implemented here by @Pennyzct.

@Pennyzct could you take a look at this PR and provide some feedback since I'm worried having the config CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE enabled might not be the right fix here.

@wenlxie
Copy link
Copy Markdown
Contributor Author

wenlxie commented Feb 20, 2019

@jcvenegas

@wenlxie intesting what is the maximal amount of memory you can hotplug currently with out get the agent killed ?

In my env, the limitation is 120Gi. I tested 119Gi memory, it is ok.

@wenlxie
Copy link
Copy Markdown
Contributor Author

wenlxie commented Feb 20, 2019

@devimc
I think this is the reason. https://lwn.net/Articles/668944/

Currently, all newly added memory blocks remain in 'offline' state unless
someone onlines them, some linux distributions carry special udev rules
like:

SUBSYSTEM=="memory", ACTION=="add", ATTR{state}=="offline", ATTR{state}="online"

to make this happen automatically. This is not a great solution for virtual
machines where memory hotplug is being used to address high memory pressure
situations as such onlining is slow and a userspace process doing this
(udev) has a chance of being killed by the OOM killer as it will probably
require to allocate some memory.

@Pennyzct
Copy link
Copy Markdown
Contributor

Hi~@wenlxie @devimc @sboeuf
For ARM, we also rely on listenToUdevEvents and onlineCPUMem to online hot-plugged memory blocks.
And I have tried this CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE, it works well, which means all above manual dependency also could be eliminated on arm platform. ;)

@jcvenegas
Copy link
Copy Markdown
Member

@chavafg I am afraid we dont have a full kata testing in this repository we need to add one to fully test.
@wenlxie meanwhile.

  • Seems that CI SoB-check is failing due to missing commit body, please add something similar that you put in this PR. Also please open an github issue for this. And also add to the commit body a line like:
Fixes: #ISSUE_NUMBER
  • While we have a job here to fully test kata we need to open a dummy in runtime and add a depends on I can do that for you.

@jcvenegas
Copy link
Copy Markdown
Member

@wenlxie @sboeuf if this works well we will need to sync to also remove onlineCPUMem the memory part.

@sboeuf
Copy link
Copy Markdown

sboeuf commented Feb 20, 2019

@jcvenegas
Ok so if I summarize here, by having this flag, the kernel will take care of hotplugging any new memory added on the fly while the system is already booted, right?

@Pennyzct confirmed this feature works well for ARM and we don't need the manual logic of doing OnlineCpuMem from the agent, is that right?

We need to try this out with x86_64 and s390, to validate that the manual online cpu and memory logic can be totally removed.
@jcvenegas can you check this for x86?
@alicefr can you check this for s390?

@amshinde
Copy link
Copy Markdown
Member

@jcvenegas @chavafg Can we add full CI testing to this PR?
Would be good to test changes to be made in the agent (to remove manual online memory) with this change.

@wenlxie wenlxie force-pushed the fixmemory branch 2 times, most recently from 2533a47 to 313e8b9 Compare February 21, 2019 02:11
@Pennyzct
Copy link
Copy Markdown
Contributor

@sboeuf yep. that's what I mean. I deleted two parts to test this config, one in listenToUdevEvents, and the whole onlineCPUMem, and it works well. all memory blocks automatically online. ;)

@wenlxie
Copy link
Copy Markdown
Contributor Author

wenlxie commented Feb 21, 2019

@chavafg I am afraid we dont have a full kata testing in this repository we need to add one to fully test.
@wenlxie meanwhile.

  • Seems that CI SoB-check is failing due to missing commit body, please add something similar that you put in this PR. Also please open an github issue for this. And also add to the commit body a line like:
Fixes: #ISSUE_NUMBER
  • While we have a job here to fully test kata we need to open a dummy in runtime and add a depends on I can do that for you.

@jcvenegas Thanks, now the SoB check had passed.

@alicefr
Copy link
Copy Markdown

alicefr commented Feb 21, 2019

@wenlxie you can remove the option for s390x. The memory hotplug is not supported by qemu KVM yet

@wenlxie
Copy link
Copy Markdown
Contributor Author

wenlxie commented Feb 21, 2019

@wenlxie you can remove the option for s390x. The memory hotplug is not supported by qemu KVM yet

@alicefr removed, thanks

@sboeuf
Copy link
Copy Markdown

sboeuf commented Feb 21, 2019

/test

@wenlxie wenlxie force-pushed the fixmemory branch 3 times, most recently from 28abbef to 78cb669 Compare February 22, 2019 02:25
@wenlxie
Copy link
Copy Markdown
Contributor Author

wenlxie commented Feb 22, 2019

/test

@wenlxie
Copy link
Copy Markdown
Contributor Author

wenlxie commented Feb 22, 2019

kata_config_version is also need to be changed. Or the Ci job will fail.

@sboeuf
Copy link
Copy Markdown

sboeuf commented Feb 22, 2019

/test

@wenlxie
Copy link
Copy Markdown
Contributor Author

wenlxie commented Feb 27, 2019

@jodh-intel thanks.
I had bump the kata_config_version

@wenlxie
Copy link
Copy Markdown
Contributor Author

wenlxie commented Feb 27, 2019

Need to run CI job again. Can someone help to trigger it?

@jodh-intel
Copy link
Copy Markdown

Sure....

/retest

@jodh-intel
Copy link
Copy Markdown

/test

Comment thread kernel/kata_config_version Outdated
@@ -1 +1 @@
25
26
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @wenlxie - sorry about this, but I think the kernel config version has moved ahead in the master branch again - can you rebase and repush... sorry for the hassle - let's see if we can get this one moving again...

@wenlxie wenlxie force-pushed the fixmemory branch 2 times, most recently from e05719c to 12d3cd6 Compare March 6, 2019 09:12
@wenlxie
Copy link
Copy Markdown
Contributor Author

wenlxie commented Mar 6, 2019

@grahamwhaley I rebase it and need someone help to trigger the test, thank.

@nitkon
Copy link
Copy Markdown
Contributor

nitkon commented Mar 6, 2019

/test

@wenlxie wenlxie force-pushed the fixmemory branch 2 times, most recently from 42263e0 to ee28a84 Compare March 7, 2019 01:23
@wenlxie
Copy link
Copy Markdown
Contributor Author

wenlxie commented Mar 7, 2019

The kernel config version bumps again. :(

@jcvenegas
Copy link
Copy Markdown
Member

/test

@jcvenegas
Copy link
Copy Markdown
Member

@jodh-intel could you merge to another bump race condition :P ?

@jodh-intel
Copy link
Copy Markdown

Restarted CI which failed due to a timeout.

@jcvenegas
Copy link
Copy Markdown
Member

time out again copy here to check if is a not stable test

• [SLOW TEST:5.021 seconds]
docker exit code
/tmp/jenkins/workspace/kata-containers-packaging-ubuntu-18-04-PR/go/src/github.com/kata-containers/tests/integration/docker/exitcode_test.go:20
  check exit codes
  /tmp/jenkins/workspace/kata-containers-packaging-ubuntu-18-04-PR/go/src/github.com/kata-containers/tests/vendor/github.com/onsi/ginkgo/extensions/table/table.go:92
    with exit code '256' when interactive mode is: 'false', it should exit '0'
    /tmp/jenkins/workspace/kata-containers-packaging-ubuntu-18-04-PR/go/src/github.com/kata-containers/tests/vendor/github.com/onsi/ginkgo/extensions/table/table_entry.go:46
------------------------------
Build timed out (after 50 minutes). Marking the build as aborted.

@jcvenegas
Copy link
Copy Markdown
Member

/test

@jodh-intel
Copy link
Copy Markdown

18.04 timed out. Prodded it with a sharper stick this time...

@amshinde
Copy link
Copy Markdown
Member

@wenlxie This needs another update to kata_config_version! Can you please do that?
Would like to see this merged soon.

@wenlxie
Copy link
Copy Markdown
Contributor Author

wenlxie commented Mar 13, 2019

@amshinde I rebased it. But the ci job always fails with timeout, no more infos to debug the issue.

@grahamwhaley
Copy link
Copy Markdown
Contributor

/test
let's see how the CI is now - and keep an eye open to determine if this is a CI fail specific to this PR, or a more generic instability...
thx for your patience @wenlxie

@jodh-intel
Copy link
Copy Markdown

/me pounces...

@jodh-intel jodh-intel merged commit 594ce2f into kata-containers:master Mar 13, 2019
@wenlxie
Copy link
Copy Markdown
Contributor Author

wenlxie commented Mar 14, 2019

/test
let's see how the CI is now - and keep an eye open to determine if this is a CI fail specific to this PR, or a more generic instability...
thx for your patience @wenlxie

@grahamwhaley @jodh-intel Thanks

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.