[PULP-402] Greatly improve memory usage during collection syncs #2454

Merged: gerrod3 merged 1 commit into pulp:main from gerrod3:memory-sync on Mar 17, 2026
Contributor

@gerrod3 gerrod3 commented Mar 6, 2026

The memory reduction comes in two parts:

  1. Refactor the first stage to limit the number of coroutines running at once to 100
  2. Avoid JSON serialization of the docs_blob, files, manifest, and contents fields where possible

Memory growth should remain flat (or grow only minimally) after the first batch (500) of collection versions is processed. Most syncs should now complete with less than 1 GB of memory usage.

https://issues.redhat.com/browse/PULP-402
Assisted by: clause-opus-4.6

Results

Perform a sync on a clean system with this requirements file:

{
'collections': 
  [
    {'name': 'amazon.aws'},
    {'name': 'community.general'}, 
    {'name': 'ansible.netcommon'}, 
    {'name': 'ansible.utils'}, 
    {'name': 'ansible.posix'}, 
    {'name': 'community.crypto'}
  ]
}
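| Metric    | Original | No Docs/Files JSON serialize | All Optimizations |
|-----------|----------|------------------------------|-------------------|
| Peak RSS  | ~4160 MB | 897 MB                       | 598 MB            |
| Duration  | ~10 min  | ~8 min                       | ~6.5 min          |
| Reduction | -        | 78%                          | 86%               |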
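
As a quick check of the Reduction row, the percentages can be recomputed from the Peak RSS values. This is only arithmetic on the numbers reported above; the dictionary keys and the helper name are illustrative, and the original figure is itself approximate.

```python
# Peak RSS values in MB, copied from the results above.
peak_rss_mb = {
    "original": 4160,
    "no_docs_files_json": 897,
    "all_optimizations": 598,
}


def reduction_percent(before: float, after: float) -> int:
    """Return the relative saving, rounded to a whole percent."""
    return round(100 * (1 - after / before))


reductions = {
    name: reduction_percent(peak_rss_mb["original"], value)
    for name, value in peak_rss_mb.items()
    if name != "original"
}
print(reductions)
```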

Explanation

The majority of the savings comes from not loading the JSON fields into memory as dicts. The fields can be relatively large dicts, with docs_blob sometimes being multiple MB. Python's pymalloc allocator fragments memory for the many small nested objects within the field's dict and never returns it to the OS, causing RSS to grow monotonically. We bypass the object allocation with three changes:

  1. Only read the field into memory as a JSON string, right before it is inserted into the database
  2. Use raw SQL to insert the JSON strings directly into PostgreSQL, avoiding the ORM serialization
  3. Use a custom manager that defers these memory-hungry fields during sync
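
A minimal sketch of the first two changes, under loud assumptions: sqlite3 stands in for PostgreSQL so the example runs anywhere, and the table name, column names, and file path are illustrative rather than taken from the actual change. The point it tries to show is that the JSON text goes from disk to the database as a bound string parameter, so json.loads() is never called and no nested dict is built. The deferring manager from the third change is not shown.

```python
import json
import sqlite3
import tempfile
from pathlib import Path

# Assumption: a small file on disk stands in for an extracted docs_blob.
blob_path = Path(tempfile.mkdtemp()) / "docs_blob.json"
blob_path.write_text(json.dumps({"contents": [{"name": "ping", "doc": "x" * 1000}]}))

# Assumption: sqlite3 stands in for PostgreSQL; the schema is illustrative.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE collectionversion (pulp_id TEXT PRIMARY KEY, docs_blob TEXT)")
conn.execute("INSERT INTO collectionversion (pulp_id, docs_blob) VALUES (?, '{}')", ("cv-1",))

# Read the field as text right before the write and pass it as a bound
# parameter. The text is never parsed into Python objects on the way.
raw = blob_path.read_text()
conn.execute("UPDATE collectionversion SET docs_blob = ? WHERE pulp_id = ?", (raw, "cv-1"))
del raw  # drop the only reference once the statement has run

stored = conn.execute(
    "SELECT docs_blob FROM collectionversion WHERE pulp_id = ?", ("cv-1",)
).fetchone()[0]
```

The only long-lived Python object here is a single flat string, which is released in one piece, instead of many small nested objects.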

The second part of the savings comes from refactoring the first stage to limit the number of concurrently running coroutines. We spawn 1 coroutine per collection version to sync and each one does a fetch to grab its metadata. This metadata is again JSON that gets trapped in memory until the coroutine finishes. Because we ran them all at once this would cause a rapid increase of memory per batch that wouldn't decrease till the first stage finished. We fix this with these three changes:

  1. All the methods return lists of coroutines instead of creating tasks that auto-start
  2. The run method uses a single gather with a batched (100) list of the coroutines.
  3. Reorder the metadata processing inside add_collection_version so the metadata can be freed as soon as possible
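
The first two changes can be sketched roughly as follows. This is one possible reading, not the merged code: fetch_metadata, build_coroutines, and the stats dictionary are hypothetical names, and the sketch awaits one gather per batch of 100. It relies on the fact that calling an async function only creates a coroutine object, which does not run until it is awaited.

```python
import asyncio
from collections.abc import Coroutine, Iterator


def batched(items: list, size: int) -> Iterator[list]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


async def fetch_metadata(version: int, stats: dict) -> int:
    """Hypothetical stand-in for one per-version metadata fetch."""
    stats["running"] += 1
    stats["peak"] = max(stats["peak"], stats["running"])
    await asyncio.sleep(0)  # yield to the event loop, as a network call would
    stats["running"] -= 1
    return version


def build_coroutines(count: int, stats: dict) -> list[Coroutine]:
    # Nothing starts here: these are plain coroutine objects, unlike
    # asyncio.create_task(), which schedules the work immediately.
    return [fetch_metadata(v, stats) for v in range(count)]


async def run(count: int, batch_size: int = 100) -> dict:
    stats = {"running": 0, "peak": 0, "done": 0}
    for batch in batched(build_coroutines(count, stats), batch_size):
        results = await asyncio.gather(*batch)  # at most batch_size in flight
        stats["done"] += len(results)
    return stats


stats = asyncio.run(run(250))
```

With 250 hypothetical versions, the counter in stats never sees more than one batch in flight at a time, which is the property that keeps the per-batch metadata from piling up.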

Member

@mdellweg mdellweg left a comment

This is a rather huge change. But there is a recurring theme: too many unbounded data fields in the database (do we really need them?) put a lot of pressure on memory. Can we gain something by moving them into a separate one-to-one table?

- async def _fetch_collection_version_metadata(self, api_version, collection_version_url):
+ async def _fetch_collection_version_metadata(
+     self, api_version, collection_version_url
+ ) -> list[Coroutine]:
Member

There's probably no use for adding type hints if you don't check them. Also I think what you want to return is Awaitable.
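
To illustrate the Awaitable suggestion, here is a small sketch with hypothetical names (fetch and plan are not from the PR). It shows that a coroutine object satisfies both collections.abc.Coroutine and the broader collections.abc.Awaitable, while a task is an Awaitable but not a Coroutine, so the Awaitable annotation accepts either.

```python
import asyncio
from collections.abc import Awaitable, Coroutine


async def fetch(url: str) -> str:
    """Hypothetical stand-in for a metadata fetch."""
    await asyncio.sleep(0)
    return url


def plan(urls: list[str]) -> list[Awaitable[str]]:
    # Coroutine objects satisfy the broader Awaitable annotation.
    return [fetch(u) for u in urls]


async def main() -> dict:
    coro = fetch("a")
    task = asyncio.ensure_future(fetch("b"))
    checks = {
        "coroutine_is_awaitable": isinstance(coro, Awaitable),
        "coroutine_is_coroutine": isinstance(coro, Coroutine),
        "task_is_awaitable": isinstance(task, Awaitable),
        "task_is_coroutine": isinstance(task, Coroutine),
    }
    # gather accepts any mix of awaitables and keeps the argument order.
    checks["results"] = await asyncio.gather(coro, task, *plan(["c"]))
    return checks


checks = asyncio.run(main())
```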

with open(docs_blob_path, "r") as docs_blob_file:
    blob = docs_blob_file.read()
sql = (
    "UPDATE ansible_collectionversion"
Member

This sounds particularly troubling. Content is supposed to be immutable, because it is shared.
Update violates it. That means there is a race condition here.

Contributor Author

My current reasoning is that there is no race condition, based on the checks I do beforehand. Before we reach the ContentSaver stage we go through the QueryContent stage, which has two outcomes: (1) the content did not exist before, or (2) the content exists and was replaced on the dcontent. If the content already existed, this part isn't run.

If we are creating the content, there are again two scenarios: (1) we are the first task to create the content, or (2) we are actually the second (or later) task to create it, because another parallel task or upload got to it first. The check on whether the pulp_id is the same across _pre_save and _post_save is meant to distinguish between these two scenarios. We only want to update in scenario one, so that only one task updates the content after the fact. This should prevent race conditions between parallel tasks.
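
The reasoning above can be modelled with a toy example. Everything in it is hypothetical: the Store class, get_or_create, and save_content are illustrative and do not represent Pulp's stages or its API. The sketch only mirrors the idea that a task remembers the id it proposed before the save, compares it with the id of the row that actually exists after the save, and writes the large field only when the two match.

```python
import uuid


class Store:
    """Hypothetical in-memory table keyed by a natural key (not Pulp's API)."""

    def __init__(self) -> None:
        self.rows: dict[str, dict] = {}

    def get_or_create(self, natural_key: str, candidate_id: str) -> dict:
        # The first writer's id wins; later writers get the existing row back.
        return self.rows.setdefault(
            natural_key, {"pulp_id": candidate_id, "docs_blob": None}
        )


def save_content(store: Store, natural_key: str, blob: str) -> bool:
    candidate_id = str(uuid.uuid4())  # id known before the save
    row = store.get_or_create(natural_key, candidate_id)
    created_here = row["pulp_id"] == candidate_id  # id compared after the save
    if created_here:
        row["docs_blob"] = blob  # only the creating task writes the field
    return created_here


store = Store()
first = save_content(store, "amazon.aws-1.0.0", '{"from": "task-1"}')
second = save_content(store, "amazon.aws-1.0.0", '{"from": "task-2"}')
```

In this model the second call finds the row created by the first, sees a different id, and leaves the stored field untouched.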

@gerrod3 gerrod3 merged commit afdaf7c into pulp:main Mar 17, 2026
13 checks passed
@gerrod3 gerrod3 deleted the memory-sync branch March 17, 2026 18:58
patchback Bot commented Mar 17, 2026

Backport to 0.24: 💔 cherry-picking failed — conflicts found

❌ Failed to cleanly apply afdaf7c on top of patchback/backports/0.24/afdaf7cdf96f38dda4b92e8d9eca3e661c2c181a/pr-2454

Backporting merged PR #2454 into main

  1. Ensure you have a local repo clone of your fork. Unless you cloned it
    from the upstream, this would be your origin remote.
  2. Make sure you have an upstream repo added as a remote too. In these
    instructions you'll refer to it by the name upstream. If you don't
    have it, here's how you can add it:
    $ git remote add upstream https://github.com/pulp/pulp_ansible.git
  3. Ensure you have the latest copy of upstream and prepare a branch
    that will hold the backported code:
    $ git fetch upstream
    $ git checkout -b patchback/backports/0.24/afdaf7cdf96f38dda4b92e8d9eca3e661c2c181a/pr-2454 upstream/0.24
  4. Now, cherry-pick PR [PULP-402] Greatly improve memory usage during collection syncs #2454 contents into that branch:
    $ git cherry-pick -x afdaf7cdf96f38dda4b92e8d9eca3e661c2c181a
    If it'll yell at you with something like fatal: Commit afdaf7cdf96f38dda4b92e8d9eca3e661c2c181a is a merge but no -m option was given., add -m 1 as follows instead:
    $ git cherry-pick -m1 -x afdaf7cdf96f38dda4b92e8d9eca3e661c2c181a
  5. At this point, you'll probably encounter some merge conflicts. You must
    resolve them in to preserve the patch from PR [PULP-402] Greatly improve memory usage during collection syncs #2454 as close to the
    original as possible.
  6. Push this branch to your fork on GitHub:
    $ git push origin patchback/backports/0.24/afdaf7cdf96f38dda4b92e8d9eca3e661c2c181a/pr-2454
  7. Create a PR, ensure that the CI is green. If it's not — update it so that
    the tests and any other checks pass. This is it!
    Now relax and wait for the maintainers to process your pull request
    when they have some cycles to do reviews. Don't worry — they'll tell you if
    any improvements are necessary when the time comes!

🤖 @patchback
I'm built with octomachinery and
my source is open — https://github.com/sanitizers/patchback-github-app.

patchback Bot commented Mar 17, 2026

Backport to 0.29: 💚 backport PR created

✅ Backport PR branch: patchback/backports/0.29/afdaf7cdf96f38dda4b92e8d9eca3e661c2c181a/pr-2454

Backported as #2472

patchback Bot commented Mar 17, 2026

Backport to 0.25: 💔 cherry-picking failed — conflicts found

❌ Failed to cleanly apply afdaf7c on top of patchback/backports/0.25/afdaf7cdf96f38dda4b92e8d9eca3e661c2c181a/pr-2454

patchback Bot commented Mar 17, 2026

Backport to 0.28: 💔 cherry-picking failed — conflicts found

❌ Failed to cleanly apply afdaf7c on top of patchback/backports/0.28/afdaf7cdf96f38dda4b92e8d9eca3e661c2c181a/pr-2454

Contributor

dralley commented Mar 24, 2026

@gerrod3 Is the json-stream package, which we already depend on for import/export, helpful at all?

https://github.com/daggaz/json-stream?tab=readme-ov-file#-what-are-the-problems-with-the-standard-json-package

patchback Bot commented Apr 20, 2026

Backport to 0.22: 💔 cherry-picking failed — conflicts found

❌ Failed to cleanly apply afdaf7c on top of patchback/backports/0.22/afdaf7cdf96f38dda4b92e8d9eca3e661c2c181a/pr-2454
