Use reproducible builds for provider packages #35693
What’s the intention behind using different timestamps in different providers? (But some of them are the same? Not sure if I’m reading the changes correctly.)
Glad you asked. It's a deliberate choice: I spent some time thinking about the implications of using one timestamp vs. many, and I chose "many" for a good reason. Happy to explain it.
The intention is to keep the time in wheels "real", reflecting some actual time that people could tie to an actual event. There are other solutions you could choose: you could always put a fixed time (0, or a 2000-01-01 equivalent, or another arbitrary date), or you could use a single time for all releases and just update it from time to time, moving it forward to the current date. Neither of them has actual meaning. But I figured that with the way we currently release providers and how our process looks, we can make the dates in the wheel actually MEAN something, so that they are not artificial.
We release different providers at different times: sometimes we release only amazon and google, sometimes http, sometimes all of them. The choice is based on several factors: are there any changes to this provider (we don't release when there aren't), are they documentation-only (then we just update the documentation but do not release the provider), or maybe we release all of them even if there are no changes (this is when we release a wave of providers where we update the min-airflow version, for example, or when we have a change that affects all providers - for example when we added auto-generated
Now, our provider preparation consists of two steps (this briefly describes the process that is described in detail here: https://github.com/apache/airflow/blob/main/dev/README_RELEASE_PROVIDER_PACKAGES.md)
Effectively, the timestamp stored in provider.yaml is the timestamp when the release manager ran `breeze release-management prepare-provider-documentation` for that particular provider. Note that this can also change individually for each provider. It's quite possible that there are a few iterations of this. This effectively means that the time when the documentation (and provider.yaml) was updated by the release manager will be different for each provider - even within the same "wave" of providers being released. The wave is really there to streamline the voting process, but in fact each provider has an individual release cycle.
And for that, it seems most natural to use the timestamp that was frozen during documentation preparation FOR THAT PROVIDER (which, again, might be different for each provider even in the same wave). We release the code from the merged commit where the documentation update happened (this is where the rc* tags are added) - so effectively the timestamp when provider.yaml was updated by the release manager for THIS PROVIDER becomes the timestamp that we use to generate the package. It seems most natural: the timestamp actually means something (the time when the release manager prepared documentation for that provider), and it seems reasonable and desirable to keep it different for each provider even if they were released in the same wave, because their documentation might be prepared at different times. It's also a nice record of the last time documentation was updated for that provider - we could of course get it from git history, but seeing it in the code is a bit more accurate, because it shows the actual time of "generation", not the time of the commit (which might be minutes or hours later).
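To make the idea above concrete, here is a minimal sketch of how a per-provider timestamp could be frozen at documentation-preparation time and later reused at build time. It is not the breeze implementation: the helper names `freeze_documentation_time` and `build_env_for_provider` are hypothetical, and provider metadata is modelled as a plain dict rather than a parsed provider.yaml. The only things taken as given are the "source-date-epoch" field name mentioned in this PR and the standard `SOURCE_DATE_EPOCH` environment variable.

```python
import os
import time
from typing import Dict, Optional


def freeze_documentation_time(provider_metadata: dict, now: Optional[float] = None) -> dict:
    """Return a copy of the metadata with "source-date-epoch" set to the current time.

    Hypothetical helper: stands in for the moment the release manager
    prepares documentation for ONE provider.
    """
    updated = dict(provider_metadata)
    updated["source-date-epoch"] = int(time.time() if now is None else now)
    return updated


def build_env_for_provider(
    provider_metadata: dict, base_env: Optional[Dict[str, str]] = None
) -> Dict[str, str]:
    """Return a build environment where SOURCE_DATE_EPOCH is the frozen timestamp.

    Hypothetical helper: every rebuild of the same provider gets the same value,
    no matter when or where the build runs.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["SOURCE_DATE_EPOCH"] = str(int(provider_metadata["source-date-epoch"]))
    return env


# Two providers in the same wave, prepared an hour apart (illustrative values).
amazon = freeze_documentation_time({"name": "amazon"}, now=1700000000)
google = freeze_documentation_time({"name": "google"}, now=1700003600)
amazon_env = build_env_for_provider(amazon, base_env={})
google_env = build_env_for_provider(google, base_env={})
```

Because the value comes from the provider's own metadata and not from the clock at build time, two providers in the same wave can carry different timestamps while each one stays stable across rebuilds.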
Sometimes we don't have to commit the provider.yaml
We always do when we want to release a provider - at minimum we add a new version in "versions"
Flit allows building reproducible packages (packages that can be compared bit-by-bit), provided that the source date epoch is set to a repeatable value when the package is built. This PR implements reproducibility of our builds by freezing the documentation preparation time in provider.yaml as the "source date epoch" and always using it when building the package. This way anyone using breeze to build the package will get exactly the same binary package, which will make it much easier for a PMC member to verify that the packages are ready for release. We will no longer have to check the sources: PMC members will simply need to build the same packages locally using breeze and check that the generated packages are exactly the same.
That also includes permission bits. Flit sets the permissions of the generated packages to "no permissions for other/group" in order to get bit-for-bit reproducibility, because on some systems umask is set differently by default and the created file would differ in its permission bits. This caused problems when building docker images, so we used to change the permissions when moving packages to the 'dist' folder; this PR changes it so that the CI build does it when moving packages from dist to docker-context-files instead.
The "source-date-epoch" fields have been regenerated in this PR as well.
Also, this PR replaces the `lru_cache` method of storing the output of `get_provider_metadata_packages` with a custom-stored dictionary. Thanks to that, instead of invalidating the whole cache of provider metadata read from yaml files, we can refresh individual provider metadata entries after they have been updated. This saves a lot of validation time: every time a provider yaml is updated we need to re-read it and re-validate it against the json schema, and with this change we only do it for the updated provider yaml. That saves about 0.5 seconds per provider yaml update, and when you update all providers it is done much faster.
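The caching change described above can be illustrated with a small sketch. It is not the code from this PR: the class name `ProviderMetadataCache`, its methods, and the `loader` callback (standing in for "read the provider yaml and validate it against the json schema") are all hypothetical. The point is only to show why a plain dictionary allows refreshing one entry, where clearing an `lru_cache` would force every provider to be re-read.

```python
from typing import Callable, Dict, List


class ProviderMetadataCache:
    """Per-provider metadata cache backed by a plain dict (illustrative only)."""

    def __init__(self, loader: Callable[[str], dict]) -> None:
        # loader is a hypothetical callback: read + validate one provider yaml.
        self._loader = loader
        self._entries: Dict[str, dict] = {}

    def get(self, provider_id: str) -> dict:
        """Return cached metadata, loading it on first access."""
        if provider_id not in self._entries:
            self._entries[provider_id] = self._loader(provider_id)
        return self._entries[provider_id]

    def refresh(self, provider_id: str) -> dict:
        """Re-load a single provider, leaving all other entries untouched."""
        self._entries[provider_id] = self._loader(provider_id)
        return self._entries[provider_id]


# Usage: count how often the (expensive) loader runs for each provider.
load_calls: List[str] = []


def fake_loader(provider_id: str) -> dict:
    load_calls.append(provider_id)
    return {"name": provider_id, "loads": load_calls.count(provider_id)}


cache = ProviderMetadataCache(fake_loader)
cache.get("amazon")
cache.get("google")
cache.get("amazon")      # served from the dict, no reload
cache.refresh("amazon")  # only amazon is re-read and re-validated
```

With a function-level `lru_cache`, the equivalent of `refresh` would be `cache_clear()`, which discards every provider's entry at once; the dictionary keeps the cost proportional to the number of providers that actually changed.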
When building wheel providers took a lot of time (12 minutes), an optimisation was implemented to build only the affected providers, and in that case we could not run verification, because having only a subset of providers would generate errors during imports. However, the changes that made our package builds reproducible with flit (apache#35693) also improved the build time for provider packages (all of them are built in under 1 minute), and .whl installation had always been rather quick - so we can remove the optimisation now, because a side effect of it was that in some cases (like apache#36799) it caused the backwards compatibility check to succeed - and subsequently keep failing in the main canary build.
Based on #35617 so it should only be merged after that one
(Only last commit counts)