Skip to content

cmdlib: bump supermin VM memory to 3G#2940

Closed
jlebon wants to merge 1 commit intocoreos:mainfrom
jlebon:pr/bump-supermin-mem
Closed

cmdlib: bump supermin VM memory to 3G#2940
jlebon wants to merge 1 commit intocoreos:mainfrom
jlebon:pr/bump-supermin-mem

Conversation

@jlebon
Copy link
Copy Markdown
Member

@jlebon jlebon commented Jun 23, 2022

Something has changed recently which causes us to hit the ENOMEM issue
more easily now:

openshift/os#594 (comment)

Mid-term, we could rework the compose so that only the OCI archive is
pulled through 9p rather than a full pull-local. Long-term, the fix is
to stop using 9p.

But for now to unblock CI, let's just bump the VM memory to 3G which
should help.

jmarrero
jmarrero previously approved these changes Jun 23, 2022
Copy link
Copy Markdown
Member

@jmarrero jmarrero left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

travier
travier previously approved these changes Jun 23, 2022
@jlebon
Copy link
Copy Markdown
Member Author

jlebon commented Jun 23, 2022

Mid-term, we could rework the compose so that only the OCI archive is
pulled through 9p rather than a full pull-local.

Started working on a patch for this.

@dustymabe
Copy link
Copy Markdown
Member

dustymabe commented Jun 23, 2022

Mid-term, we could rework the compose so that only the OCI archive is
pulled through 9p rather than a full pull-local. Long-term, the fix is
to stop using 9p.

Could we attach the ociarchive file directly as a block device (read-only) with a fixed size to the VM and access it directly (i.e. cat/pipe it to stdin of a process that will extract it)?

@jlebon
Copy link
Copy Markdown
Member Author

jlebon commented Jun 23, 2022

Mid-term, we could rework the compose so that only the OCI archive is
pulled through 9p rather than a full pull-local. Long-term, the fix is
to stop using 9p.

Could we attach the ociarchive file directly as a block device (read-only) with a fixed size to the VM and access it directly (i.e. cat/pipe it to stdin of a process that will extract it)?

You're talking about the image creation part, right? The issue we're hitting here is rpm-ostree compose time. Image creation actually already works that way (pushing the container only through 9p and extracting inside of there). The path I'm proposing is applying the same idea but in reverse for pulling out the OSTree commit.

@jlebon jlebon enabled auto-merge (rebase) June 23, 2022 13:23
@jlebon jlebon disabled auto-merge June 23, 2022 13:59
Something has changed recently which causes us to hit the ENOMEM issue
more easily now:

openshift/os#594 (comment)

Mid-term, we could rework the compose so that only the OCI archive is
pulled through 9p rather than a full `pull-local`. Long-term, the fix is
to stop using 9p.

But for now to unblock CI, let's just bump the VM memory to 4G which
should help.
@jlebon
Copy link
Copy Markdown
Member Author

jlebon commented Jun 23, 2022

OK this is useful. Prow failed with ENOMEM, so clearly it's not just a matter of throwing more RAM at it. Unless this regression we're facing has really made 9p more memory hungry. Anyway, I bumped it to 4G now just to see if that works, but let's not merge this yet. I'll try to get the OCI archive-based patch working.

@jlebon jlebon dismissed stale reviews from travier and jmarrero via 83ce8d9 June 23, 2022 14:05
@jlebon jlebon force-pushed the pr/bump-supermin-mem branch from 3593869 to 83ce8d9 Compare June 23, 2022 14:05
@jlebon
Copy link
Copy Markdown
Member Author

jlebon commented Jun 23, 2022

golangci/golangci-lint info checking GitHub for latest tag
golangci/golangci-lint crit unable to find '' - use 'latest' or see https://github.com/golangci/golangci-lint/releases for details 

Hmm, not sure what's going on here.

/retest

@openshift-ci
Copy link
Copy Markdown

openshift-ci Bot commented Jun 23, 2022

@jlebon: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/images 83ce8d9 link true /test images
ci/prow/rhcos 83ce8d9 link true /test rhcos

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@cgwalters
Copy link
Copy Markdown
Member

OK this is useful. Prow failed with ENOMEM, so clearly it's not just a matter of throwing more RAM at it.

Yeah I should have highlighted this comment more. It's not that we're running out of memory exactly, it's that 9p specifically is trying to do something broken. We're provoking that bug by doing lots of little files.

Adding more memory only papers over it in that it makes it less likely for the kernel to try reclaiming inodes.

Only real fixes are:

  • use virtiofs and hope that's better (I think it is)
  • Actually fix 9p in the kernel to stop trying to allocate more memory in a reclaim path
  • Stop doing lots of little files over 9p

@jlebon
Copy link
Copy Markdown
Member Author

jlebon commented Jun 24, 2022

Closing in favour of #2946.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants