
Sync only latest version of collections by default #2471

Open
dhageman wants to merge 1 commit into pulp:main from dhageman:synclatest


@dhageman

The behavior of the current collection sync without a requirements file is to pull all collections and every version available of each collection. This can lead to failures in the collection sync process due to memory inefficiencies in how collection syncs are performed.

This PR switches the default behavior to only sync the latest version of each collection that is available.
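As a rough illustration of what "latest version only" means here, the sketch below picks the highest version per collection from a flat list of entries. It is not the code in this PR: the names `version_key` and `latest_per_collection` are hypothetical, and the version comparison is deliberately simplified (it compares only the numeric `major.minor.patch` core and ignores pre-release ordering, which a real implementation would likely delegate to a semver library).

```python
from typing import Dict, Iterable, Tuple

Collection = Tuple[str, str]  # (namespace, name)


def version_key(version: str) -> Tuple[int, ...]:
    """Simplified sort key: numeric core only, pre-release/build parts dropped."""
    core = version.split("+", 1)[0].split("-", 1)[0]
    return tuple(int(part) for part in core.split("."))


def latest_per_collection(
    entries: Iterable[Tuple[str, str, str]],
) -> Dict[Collection, str]:
    """Map each (namespace, name) to the highest version seen for it."""
    latest: Dict[Collection, str] = {}
    for namespace, name, version in entries:
        key = (namespace, name)
        if key not in latest or version_key(version) > version_key(latest[key]):
            latest[key] = version
    return latest


available = [
    ("community", "general", "9.5.0"),
    ("community", "general", "10.0.1"),
    ("ansible", "posix", "1.5.4"),
]
selected = latest_per_collection(available)
```

Note that the comparison is numeric rather than lexical, so `10.0.1` is treated as newer than `9.5.0`.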

The current behavior can be achieved by providing a requirements file with a version set to a wildcard or version range.

Assisted By: Claude Code

📜 Checklist

  • Commits are cleanly separated with meaningful messages (simple features and bug fixes should be squashed to one commit)
  • A changelog entry or entries has been added for any significant changes
  • Follows the Pulp policy on AI Usage
  • (For new features) - User documentation and test coverage has been added

@mdellweg
Member

I am uncomfortable with changing default behavior.
This will almost certainly break long-standing existing workflows.

@gerrod3
Contributor

gerrod3 commented Mar 18, 2026

@dhageman We just merged #2454 to reduce the memory usage of collection syncs. I would wait until we release it and try it out before making further changes to the sync behavior. It was also backported to 0.24, 0.25, 0.28, and 0.29.

@dhageman
Author

@mdellweg, @gerrod3 - I appreciate the feedback!

I have confidence that memory issues will be better with the recent patches. Those are welcome changes!

I was hesitant to submit this PR because memory consumption was the most pressing issue, but I changed my mind after discussing it with some colleagues. I think it is a good way to start a conversation about whether the current behavior is the best default.

I don't believe old collections are reaped from the current repositories for collections. This means the amount of data that is synchronized continues to grow. New organizations starting fresh syncs will discover ever-growing synchronization times paired with increased storage requirements.

Is it reasonable to assume that an organization will want a local copy of every version of every collection available since the beginning of time?

Is it more reasonable to sync all the collections but only the latest version? The additive nature of syncs will continue to pick up the latest version.
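To make the "additive" point concrete, here is a toy model of two successive syncs. It assumes the repository only ever gains content on sync (nothing is removed), and that each sync fetches just the latest upstream version at that moment. The function `additive_sync` and the data shapes are hypothetical and are not pulp_ansible APIs.

```python
from typing import Dict, Set, Tuple

Collection = Tuple[str, str]    # (namespace, name)
Content = Tuple[str, str, str]  # (namespace, name, version)


def additive_sync(
    repo: Set[Content], upstream_latest: Dict[Collection, str]
) -> Set[Content]:
    """Return the repo content plus the current latest version of each collection."""
    fetched = {(ns, name, ver) for (ns, name), ver in upstream_latest.items()}
    return repo | fetched


repo: Set[Content] = set()
# First sync: upstream's latest community.general is 9.5.0.
repo = additive_sync(repo, {("community", "general"): "9.5.0"})
# Later sync: upstream has released 10.0.1; 9.5.0 stays in the repo.
repo = additive_sync(repo, {("community", "general"): "10.0.1"})
```

Under these assumptions the repository ends up holding every version that was the latest at the time of some sync, rather than the full upstream history.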

Remember, the original behavior can be achieved with an explicit requirements file.

I have reached out to a few additional people to get their feedback on whether a change of this nature is worth the effort. If not - all good - at least we had the conversation!

@gerrod3
Contributor

gerrod3 commented Mar 18, 2026

We probably won't accept changing the default behavior, but I think we would accept a new field on the remote, such as latest_version or sync_latest_versions=#, to allow modifying the behavior of the sync. This is what other plugins do.
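A minimal sketch of how a count-based option like `sync_latest_versions=#` could filter the versions of a single collection. This is only an illustration of the idea: the helper names are hypothetical, the field does not exist on the remote today, and the version ordering is simplified to the numeric `major.minor.patch` core (pre-release precedence is ignored).

```python
from typing import List, Tuple


def version_key(version: str) -> Tuple[int, ...]:
    """Simplified sort key: numeric core only, pre-release/build parts dropped."""
    core = version.split("+", 1)[0].split("-", 1)[0]
    return tuple(int(part) for part in core.split("."))


def keep_latest(versions: List[str], count: int) -> List[str]:
    """Return the `count` newest versions, newest first."""
    if count < 1:
        raise ValueError("count must be at least 1")
    return sorted(versions, key=version_key, reverse=True)[:count]


upstream_versions = ["1.0.0", "2.1.0", "10.0.0", "2.0.5"]
kept = keep_latest(upstream_versions, 2)
```

With a field like this, a count of 1 would correspond to the latest-only behavior proposed in this PR, while leaving the field unset could keep the current sync-everything default.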

As for reducing disk space usage on initial sync, this is what on-demand syncs are for. I think we are pretty close to being able to solve this in pulp-ansible now (see #712).
