[PULP-402] Greatly improve memory usage during collection syncs by gerrod3 · Pull Request #2454 · pulp/pulp_ansible

gerrod3 · 2026-03-06T21:50:59Z

The memory reduction comes in two parts:

1. Refactor the first stage to limit the number of coroutines running at once to 100
2. Avoid, when possible, JSON serialization of the docs_blob, files, manifest and contents fields

Memory growth should remain flat (or extremely small) after the first batch (500) of collection versions is processed. Most syncs should now complete with less than 1 GB of memory usage.

https://issues.redhat.com/browse/PULP-402
Assisted by: clause-opus-4.6

Results

Perform a sync on a clean system with this requirements file:

{
'collections': 
  [
    {'name': 'amazon.aws'},
    {'name': 'community.general'}, 
    {'name': 'ansible.netcommon'}, 
    {'name': 'ansible.utils'}, 
    {'name': 'ansible.posix'}, 
    {'name': 'community.crypto'}
  ]
}

| Metric | Original | No Docs/Files JSON serialize | All Optimizations |
|---|---|---|---|
| Peak RSS | ~4160 MB | 897 MB | 598 MB |
| Duration | ~10 min | ~8 min | ~6.5 min |
| Reduction (peak RSS) | n/a | 78% | 86% |
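As a quick check, the reduction row can be recomputed from the peak RSS figures in the table. This is only arithmetic on the reported numbers; the variable names are illustrative.

```python
# Peak RSS values (MB) copied from the results table above.
peak_rss_mb = {
    "Original": 4160,
    "No Docs/Files JSON serialize": 897,
    "All Optimizations": 598,
}

baseline = peak_rss_mb["Original"]

# Reduction relative to the original run, rounded to a whole percent.
reductions = {
    name: round((1 - mb / baseline) * 100)
    for name, mb in peak_rss_mb.items()
    if name != "Original"
}

for name, pct in reductions.items():
    print(f"{name}: {pct}% lower peak RSS")
```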

Explanation

The majority of the savings comes from not loading the JSON fields into memory as dicts. The fields can be relatively large dicts, with docs_blob sometimes being multiple MB. Python's pymalloc allocator fragments memory for the many small nested objects within each field's dict and rarely returns it to the OS, so RSS keeps growing. We bypass the object allocation with three changes:

1. Only read each field into memory as a JSON string, right before inserting it into the database
2. Use raw SQL to insert the JSON strings directly into PostgreSQL, avoiding the ORM serialization
3. Use a custom manager that defers these memory-hungry fields during sync
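A minimal sketch of the idea behind the first two changes, not the PR's actual code: it compares the memory held by a parsed JSON document with the raw string, then writes the raw string with a parameterized SQL statement. The in-memory `sqlite3` database stands in for PostgreSQL, and the table, column names, and sample document are hypothetical.

```python
import json
import sqlite3
import sys
import tracemalloc

# Hypothetical stand-in for a large docs_blob: many small nested objects.
raw = json.dumps(
    {
        "contents": [
            {"name": f"module_{i}", "options": {"id": 1000 + i, "tags": [f"t{i}"]}}
            for i in range(5000)
        ]
    }
)

# Measure how much memory the parsed object tree keeps alive.
tracemalloc.start()
before, _ = tracemalloc.get_traced_memory()
parsed = json.loads(raw)
after, _ = tracemalloc.get_traced_memory()
tracemalloc.stop()
parsed_bytes = after - before
raw_bytes = sys.getsizeof(raw)
print(f"raw string: {raw_bytes} bytes, parsed tree: {parsed_bytes} bytes")
del parsed  # the sync path never needs the parsed form

# Store the JSON as a string, without building Python dicts first.
conn = sqlite3.connect(":memory:")  # sqlite3 stands in for PostgreSQL here
conn.execute("CREATE TABLE collectionversion (pulp_id TEXT PRIMARY KEY, docs_blob TEXT)")
conn.execute("INSERT INTO collectionversion (pulp_id) VALUES (?)", ("cv-1",))
conn.execute(
    "UPDATE collectionversion SET docs_blob = ? WHERE pulp_id = ?",
    (raw, "cv-1"),
)
stored = conn.execute(
    "SELECT docs_blob FROM collectionversion WHERE pulp_id = ?", ("cv-1",)
).fetchone()[0]
```

The parsed tree is several times larger than the string because each nested dict, list, and string carries its own object overhead, which is the allocation pattern the explanation above attributes the RSS growth to.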

The second part of the savings comes from refactoring the first stage to limit the number of concurrently running coroutines. We spawn one coroutine per collection version to sync, and each one does a fetch to grab its metadata. This metadata is again JSON that stays trapped in memory until the coroutine finishes. Because we ran them all at once, memory would increase rapidly per batch and wouldn't decrease until the first stage finished. We fix this with these three changes:

1. All the methods return lists of coroutines instead of creating tasks that auto-start
2. The run method uses a single gather with a batched (100) list of the coroutines
3. Move the metadata processing inside add_collection_version so the metadata can be freed as soon as possible
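The batching can be sketched as follows. This is one plausible reading of the description, not the PR's actual run method: the helper names (`ConcurrencyTracker`, `run_batched`) are hypothetical, and `asyncio.sleep(0)` stands in for the network fetch. The key point is that unstarted coroutine objects do no work until they are awaited, so at most one batch is in flight.

```python
import asyncio

BATCH_SIZE = 100  # the concurrency limit named in the description


class ConcurrencyTracker:
    """Counts how many fetches are in flight at once (for illustration only)."""

    def __init__(self):
        self.running = 0
        self.peak = 0

    async def fetch_metadata(self, version: int) -> int:
        self.running += 1
        self.peak = max(self.peak, self.running)
        try:
            await asyncio.sleep(0)  # stand-in for the metadata download
            return version
        finally:
            self.running -= 1


async def run_batched(coroutines, batch_size=BATCH_SIZE):
    """Await coroutines in fixed-size batches so only one batch runs at a time."""
    results = []
    for start in range(0, len(coroutines), batch_size):
        batch = coroutines[start:start + batch_size]
        results.extend(await asyncio.gather(*batch))
    return results


tracker = ConcurrencyTracker()
# Creating coroutine objects does not start them, unlike asyncio.create_task().
coroutines = [tracker.fetch_metadata(v) for v in range(250)]
results = asyncio.run(run_batched(coroutines))
print(f"fetched {len(results)} versions, peak concurrency {tracker.peak}")
```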

mdellweg

This is a rather huge change. But there is a recurring theme: too many unbounded data fields in the database (do we really need them?) put a lot of pressure on memory. Can we gain something by moving them into a separate one-to-one table?

mdellweg · 2026-03-10T09:05:19Z

-    async def _fetch_collection_version_metadata(self, api_version, collection_version_url):
+    async def _fetch_collection_version_metadata(
+        self, api_version, collection_version_url
+    ) -> list[Coroutine]:


There's probably no use for adding type hints if you don't check them. Also I think what you want to return is Awaitable.
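To illustrate the distinction between the two annotations, here is a small sketch using only the standard library. The function names are hypothetical; the point is that every coroutine object is an `Awaitable`, while other awaitables such as futures are not `Coroutine` instances, so `Awaitable` is the less restrictive hint.

```python
import asyncio
from collections.abc import Awaitable, Coroutine


async def fetch_version(n: int) -> int:
    """Hypothetical stand-in for a metadata fetch."""
    await asyncio.sleep(0)
    return n


def build_fetches(count: int) -> list[Awaitable[int]]:
    # Calling an async function only creates a coroutine object; nothing runs yet.
    return [fetch_version(n) for n in range(count)]


async def main() -> dict:
    pending = build_fetches(3)
    future = asyncio.get_running_loop().create_future()
    future.set_result(99)
    checks = {
        "coroutine_is_awaitable": all(isinstance(p, Awaitable) for p in pending),
        "coroutine_is_coroutine": all(isinstance(p, Coroutine) for p in pending),
        "future_is_awaitable": isinstance(future, Awaitable),
        "future_is_coroutine": isinstance(future, Coroutine),
    }
    # gather accepts any mix of awaitables.
    checks["results"] = await asyncio.gather(*pending, future)
    return checks


checks = asyncio.run(main())
print(checks)
```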

mdellweg · 2026-03-10T09:05:29Z

+                        with open(docs_blob_path, "r") as docs_blob_file:
+                            blob = docs_blob_file.read()
+                        sql = (
+                            "UPDATE ansible_collectionversion"


This sounds particularly troubling. Content is supposed to be immutable, because it is shared.
Update violates it. That means there is a race condition here.

gerrod3 · 2026-03-10

My current reasoning is that there is no race condition, based on the checks I do beforehand. Before we reach the ContentSaver stage we go through the QueryContent stage, which has two outcomes:

1. The content did not exist before.
2. The content exists and was replaced on the dcontent.

If the content already existed, this part isn't run. If we are creating the content, there are two scenarios:

1. We are the first task to create the content.
2. We are actually the second (or later) task to create it; another parallel task or upload got to it first.

The check on whether the pulp_id is the same across _pre_save and _post_save is meant to distinguish between these two scenarios. We only want to update in scenario one, so that only one task updates the content after the fact. This should prevent race conditions between parallel tasks.
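The check described above can be modeled in a few lines. This is a simplified, single-process sketch and not the pulpcore or pulp_ansible implementation: `Content`, `FakeStore`, and `save_with_blob` are hypothetical names, and a dict keyed by digest stands in for the database's uniqueness constraint.

```python
import uuid
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Content:
    sha256: str  # natural key shared by every task that builds this content
    pulp_id: uuid.UUID = field(default_factory=uuid.uuid4)
    docs_blob: Optional[str] = None


class FakeStore:
    """Stand-in for the database: the first saved row per natural key wins."""

    def __init__(self):
        self.rows = {}

    def save_or_get(self, content: Content) -> Content:
        return self.rows.setdefault(content.sha256, content)


def save_with_blob(store: FakeStore, content: Content, blob: str):
    pre_save_id = content.pulp_id  # remembered before saving
    saved = store.save_or_get(content)  # may resolve to another task's row
    created_here = saved.pulp_id == pre_save_id  # compared after saving
    if created_here:
        # Only the task whose row was actually created fills in the blob.
        saved.docs_blob = blob
    return saved, created_here


store = FakeStore()
first, first_created = save_with_blob(store, Content("abc"), '{"from": "task 1"}')
second, second_created = save_with_blob(store, Content("abc"), '{"from": "task 2"}')
```

Here the second call resolves to the row created by the first, sees a different pulp_id, and skips the update, so the stored blob is written exactly once.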

patchback · 2026-03-17T18:58:41Z

Backport to 0.24: 💔 cherry-picking failed — conflicts found

❌ Failed to cleanly apply afdaf7c on top of patchback/backports/0.24/afdaf7cdf96f38dda4b92e8d9eca3e661c2c181a/pr-2454

Backporting merged PR #2454 into main

Ensure you have a local repo clone of your fork. Unless you cloned it
from the upstream, this would be your origin remote.
Make sure you have an upstream repo added as a remote too. In these
instructions you'll refer to it by the name upstream. If you don't
have it, here's how you can add it:
```
$ git remote add upstream https://github.com/pulp/pulp_ansible.git
```

Ensure you have the latest copy of upstream and prepare a branch
that will hold the backported code:

$ git fetch upstream
$ git checkout -b patchback/backports/0.24/afdaf7cdf96f38dda4b92e8d9eca3e661c2c181a/pr-2454 upstream/0.24

Now, cherry-pick PR [PULP-402] Greatly improve memory usage during collection syncs #2454 contents into that branch:
```
$ git cherry-pick -x afdaf7cdf96f38dda4b92e8d9eca3e661c2c181a
```
If it fails with something like `fatal: Commit afdaf7cdf96f38dda4b92e8d9eca3e661c2c181a is a merge but no -m option was given.`, add `-m 1` as follows instead:
```
$ git cherry-pick -m1 -x afdaf7cdf96f38dda4b92e8d9eca3e661c2c181a
```
At this point, you'll probably encounter some merge conflicts. You must
resolve them so as to preserve the patch from PR [PULP-402] Greatly improve memory usage during collection syncs #2454 as close to the
original as possible.

Push this branch to your fork on GitHub:

$ git push origin patchback/backports/0.24/afdaf7cdf96f38dda4b92e8d9eca3e661c2c181a/pr-2454

Create a PR, ensure that the CI is green. If it's not — update it so that
the tests and any other checks pass. This is it!
Now relax and wait for the maintainers to process your pull request
when they have some cycles to do reviews. Don't worry — they'll tell you if
any improvements are necessary when the time comes!

🤖 @patchback
I'm built with octomachinery and
my source is open — https://github.com/sanitizers/patchback-github-app.

patchback · 2026-03-17T18:58:41Z

Backport to 0.29: 💚 backport PR created

✅ Backport PR branch: patchback/backports/0.29/afdaf7cdf96f38dda4b92e8d9eca3e661c2c181a/pr-2454

Backported as #2472

patchback · 2026-03-17T18:58:41Z

Backport to 0.25: 💔 cherry-picking failed — conflicts found

❌ Failed to cleanly apply afdaf7c on top of patchback/backports/0.25/afdaf7cdf96f38dda4b92e8d9eca3e661c2c181a/pr-2454

patchback · 2026-03-17T18:58:42Z

Backport to 0.28: 💔 cherry-picking failed — conflicts found

❌ Failed to cleanly apply afdaf7c on top of patchback/backports/0.28/afdaf7cdf96f38dda4b92e8d9eca3e661c2c181a/pr-2454

dralley · 2026-03-24T20:53:51Z

@gerrod3 Is the json-stream package, which we already depend on for import/export, helpful at all?

https://github.com/daggaz/json-stream?tab=readme-ov-file#-what-are-the-problems-with-the-standard-json-package

patchback · 2026-04-20T13:48:21Z

Backport to 0.22: 💔 cherry-picking failed — conflicts found

❌ Failed to cleanly apply afdaf7c on top of patchback/backports/0.22/afdaf7cdf96f38dda4b92e8d9eca3e661c2c181a/pr-2454

github-actions Bot added the no-issue label Mar 6, 2026

gerrod3 force-pushed the memory-sync branch from 9131a60 to 64fc7f6 Compare March 6, 2026 21:59

gerrod3 force-pushed the memory-sync branch from 64fc7f6 to d15d2b0 Compare March 6, 2026 22:04

mdellweg reviewed Mar 10, 2026

mdellweg approved these changes Mar 13, 2026

gerrod3 merged commit afdaf7c into pulp:main Mar 17, 2026
13 checks passed

gerrod3 deleted the memory-sync branch March 17, 2026 18:58

gerrod3 added backport-0.24 backport-0.29 backport-0.25 backport-0.28 labels Mar 17, 2026

patchback Bot mentioned this pull request Mar 17, 2026

[PR #2454/afdaf7cd backport][0.29] [PULP-402] Greatly improve memory usage during collection syncs #2472

Merged

gerrod3 mentioned this pull request Mar 18, 2026

Sync only latest version of collections by default #2471

Open

4 tasks

Funi1234 mentioned this pull request Apr 2, 2026

Reduce memory usage during collection deletion #2492

Merged

gerrod3 added the backport-0.22 label Apr 20, 2026
