@potiuk
Member

@potiuk potiuk commented Nov 17, 2023

Flit can build reproducible packages (packages that can be
compared bit-by-bit), provided that the source date epoch is set to a
repeatable value when the package is built. This PR implements
reproducibility of our builds by freezing the documentation preparation
time in provider.yaml as "source date epoch" and always using it when
building the package. This way anyone using breeze to build the package
will produce exactly the same binary package, which makes it much
easier for a PMC member to verify that the packages are ready for
release.

We will no longer have to check the sources: PMC members will simply
need to build the same packages locally using breeze and check that the
generated packages are exactly the same.
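
A minimal sketch of why a fixed source date epoch gives bit-identical archives. This is not flit's implementation and not the breeze code from this PR; it only uses the standard-library `zipfile` module, and the function name `build_archive` and the sample file contents are hypothetical.

```python
import hashlib
import io
import time
import zipfile


def build_archive(files: dict[str, bytes], source_date_epoch: int) -> bytes:
    """Build a zip archive whose bytes depend only on the inputs."""
    # Every entry gets the same UTC timestamp derived from the epoch.
    date_time = time.gmtime(source_date_epoch)[:6]
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        # Sort names so entry order never depends on dict or filesystem order.
        for name in sorted(files):
            info = zipfile.ZipInfo(name, date_time=date_time)
            info.compress_type = zipfile.ZIP_DEFLATED
            # Fixed mode bits (rw-r--r--) so the builder's umask does not leak in.
            info.external_attr = 0o644 << 16
            archive.writestr(info, files[name])
    return buffer.getvalue()


files = {"pkg/__init__.py": b"", "pkg/hook.py": b"print('hello')\n"}
first = build_archive(files, source_date_epoch=1700000000)
second = build_archive(files, source_date_epoch=1700000000)
# One hour later: the stored timestamps differ, so the bytes differ too.
other = build_archive(files, source_date_epoch=1700003600)

first_digest = hashlib.sha256(first).hexdigest()
second_digest = hashlib.sha256(second).hexdigest()
other_digest = hashlib.sha256(other).hexdigest()
```

Comparing `first_digest` with `second_digest` is the same kind of check a verifier would do on two independently built packages; `other_digest` differs only because the timestamp changed.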

That also includes permission bits. Flit sets permissions of
the generated packages to "no permissions for other/group" in
order to get bit-by-bit reproducibility, because on some systems
umask is set differently by default and the created files would
otherwise differ in their permission bits. This caused problems
when building docker images, so we used to change the permissions
when moving packages to the 'dist' folder; this PR changes it so
that the CI build does it when moving packages from dist to
docker-context-files instead.
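
A rough sketch of the "widen permissions while copying" step described above, assuming a POSIX filesystem. The helper name `copy_with_read_permissions` and the example wheel name are hypothetical; the actual CI code in breeze may do this differently.

```python
import shutil
import stat
import tempfile
from pathlib import Path


def copy_with_read_permissions(source_dir: Path, target_dir: Path) -> list[Path]:
    """Copy wheels and make the copies readable by group and others."""
    target_dir.mkdir(parents=True, exist_ok=True)
    copied = []
    for package in sorted(source_dir.glob("*.whl")):
        target = target_dir / package.name
        # copyfile copies content only, so the restrictive mode is not carried over.
        shutil.copyfile(package, target)
        # rw for the owner, read-only for group and others (0o644).
        target.chmod(stat.S_IRUSR | stat.S_IWUSR | stat.S_IRGRP | stat.S_IROTH)
        copied.append(target)
    return copied


with tempfile.TemporaryDirectory() as tmp:
    dist = Path(tmp) / "dist"
    dist.mkdir()
    wheel = dist / "example_provider-1.0.0-py3-none-any.whl"
    wheel.write_bytes(b"not a real wheel")
    wheel.chmod(0o600)  # simulate "no permissions for other/group"
    copied = copy_with_read_permissions(dist, Path(tmp) / "docker-context-files")
    original_mode = stat.S_IMODE(wheel.stat().st_mode)
    copied_modes = [stat.S_IMODE(path.stat().st_mode) for path in copied]
```

The file in `dist` keeps its restrictive mode, so the reproducible artifact is untouched; only the copy used for the docker build is made readable.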

The "source-date-epoch" fields have been regenerated in this PR as
well. This PR also replaces the lru_cache method of storing the output
of get_provider_metadata_packages with a custom-stored dictionary.
Thanks to that, instead of invalidating the whole cache of provider
metadata read from the yaml files, we can refresh individual provider
metadata entries after they have been updated. This saves a lot
of time for validation: every time a provider yaml is
updated we need to re-read it and re-validate it against the json schema,
and with this change we only do it for the updated provider yaml. That
saves about half a second per provider yaml update, and when you
update all providers it is done much faster.
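
A minimal sketch of the caching idea, under the assumption that metadata is keyed by provider id. The class `ProviderMetadataCache`, the `fake_loader` function and the returned dictionary shape are hypothetical and only illustrate per-entry refresh versus clearing a whole `lru_cache`; the real breeze code is organised differently.

```python
from typing import Any, Callable


class ProviderMetadataCache:
    """Dictionary-backed cache that can refresh a single provider entry."""

    def __init__(self, loader: Callable[[str], dict[str, Any]]) -> None:
        self._loader = loader
        self._entries: dict[str, dict[str, Any]] = {}

    def get(self, provider_id: str) -> dict[str, Any]:
        # Load (read + validate) only on first access.
        if provider_id not in self._entries:
            self._entries[provider_id] = self._loader(provider_id)
        return self._entries[provider_id]

    def refresh(self, provider_id: str) -> dict[str, Any]:
        # Re-load just this entry; all other cached providers stay as they are.
        self._entries[provider_id] = self._loader(provider_id)
        return self._entries[provider_id]


load_calls: list[str] = []


def fake_loader(provider_id: str) -> dict[str, Any]:
    # Stand-in for "read provider.yaml and validate it against the json schema".
    load_calls.append(provider_id)
    return {"package-name": f"apache-airflow-providers-{provider_id}"}


cache = ProviderMetadataCache(fake_loader)
cache.get("amazon")
cache.get("google")
cache.get("amazon")      # served from the cache, no extra load
cache.refresh("amazon")  # only amazon is re-read after its yaml changed
```

With `functools.lru_cache` on a function that returns all providers at once, the only invalidation available is `cache_clear()`, which forces every provider to be re-read; here `load_calls` shows that "google" is loaded exactly once.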

Based on #35617, so it should only be merged after that one.

(Only last commit counts)



@boring-cyborg boring-cyborg bot added provider:cncf-kubernetes Kubernetes (k8s) provider related issues provider:common-io labels Nov 17, 2023
@potiuk potiuk changed the title Add reproducible builds support Use reproducible builds for provider packages Nov 17, 2023
@uranusjr
Member

What’s the intention behind using different timestamps in different providers? (But some of them are the same? Not sure if I’m reading the changes correctly.)

@potiuk
Member Author

potiuk commented Nov 17, 2023

What’s the intention behind using different timestamps in different providers? (But some of them are the same? Not sure if I’m reading the changes correctly.)

Glad you asked. It's a deliberate choice: I spent some time thinking about the implications of using one vs. many and I chose "many" for a good reason - happy to explain it.

The intention is to keep the time in the wheels "real", reflecting an "actual" time that people could even tie to an actual event. One alternative would be to always use a fixed time (0, or the equivalent of 2000-01-01, or another arbitrary date). Another would be to use a single time for all releases and just update it from time to time, moving it forward to the current date. Neither of these has any actual meaning. But I figured that, given the way we currently release providers and what our process looks like, we can make the dates in the wheel actually MEAN something, so that they are not artificial.

We release different providers at different times - sometimes we release only amazon and google, sometimes http, sometimes all of them - and the choice is based on several factors: are there any changes to this provider (we don't release when there aren't), are they documentation-only (then we just update the documentation but do not release the provider), or maybe we release all of them even if there are no changes (this is when we release a wave of providers where we update the min-airflow version, for example, or when we have a change that affects all providers - for example when we added the auto-generated __init__.py). But generally speaking, each provider is released independently on its own schedule.

Now, our provider preparation consists of two steps (this is a short description of the process that is described in detail here: https://github.com/apache/airflow/blob/main/dev/README_RELEASE_PROVIDER_PACKAGES.md)

  1. Step 1 -> preparing provider documentation. This is where we make sure that all the changes for that provider are present in the CHANGELOG, we decide whether to release the provider or not, we decide what the version bump is (patchlevel, feature, breaking) and we update the documentation. This results in a commit where we update provider.yaml + CHANGELOG + commits, and this is the commit that is used to generate the packages. We only update provider.yaml files for those packages that are going to be released. The other provider.yaml files remain untouched. And really, the time of that update to provider.yaml is the time when the provider.yaml gets effectively "frozen" for the upcoming release.

Effectively, the timestamp stored in provider.yaml is the timestamp when the release manager ran "breeze release-management prepare-provider-documentation" for that particular provider.

Note that this can also change individually for each provider. It's quite possible that there are a few iterations of prepare-provider-documentation. The whole process is designed so that, even before you prepare an RC, you can run and merge such documentation changes before preparing the whole wave - until you actually prepare the packages you can incrementally add new changes that people merge to main, and effectively add new changes to only some providers - this way only one or two provider.yaml files might still get updated while you are doing it. And even later - when you decide to remove one or two providers from an RC wave and move them to RC2 - you continue updating documentation and provider.yaml only for those providers that you removed from the RC1 wave - the other documentation and provider.yaml files are untouched while you are doing it.

This effectively means that the time when the documentation (and provider.yaml) was updated by the release manager will be different for each provider - even in the same "wave" of providers being released. The wave is really there to streamline the voting process, but in fact each provider has an individual release cycle.

  2. Step 2 -> provider package generation. This is done some time later, and in the future we also want to make sure that whoever generates a provider package from the same tag (for example a PMC member verifying the release) gets a binary-identical package - so the "timestamp" for each package has to be stored in the source code tagged with the RC (later final) tag.

And for that it seems most natural to use the timestamp that was frozen during documentation preparation FOR THAT PROVIDER (which - again - might be different for each provider even in the same wave). We release the code from the merged commit where the documentation update happened (this is where the rc* tags are added) - so effectively the timestamp when the provider.yaml was updated by the release manager for THIS PROVIDER becomes the timestamp that we use to generate the package.

It seems most natural: the timestamp actually means something (the time when the release manager prepared documentation for that provider), and it seems reasonable and desirable to keep it different for each provider even if they were released in the same wave, because their documentation might be prepared at different times.

It's also a nice record of the last time documentation was updated for that provider. We could of course get it from git history, but seeing it in the code is a bit more accurate, because it shows the actual time of "generation", not the time of the commit (which might be minutes or hours later).
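
To make the two steps concrete, here is a small sketch of the idea, not the actual breeze code. It assumes the metadata for each provider is a dictionary with a "source-date-epoch" key (the field name used in this PR); the function names `freeze_source_date_epoch` and `build_environment` and the sample epoch values are hypothetical. `SOURCE_DATE_EPOCH` is the standard environment variable that reproducible-build tooling reads.

```python
import os
import time
from typing import Optional


def freeze_source_date_epoch(
    providers: dict[str, dict[str, int]],
    released: list[str],
    now: Optional[int] = None,
) -> None:
    """Step 1: stamp only the providers whose documentation is being prepared."""
    timestamp = int(time.time()) if now is None else now
    for provider_id in released:
        providers[provider_id]["source-date-epoch"] = timestamp


def build_environment(provider: dict[str, int]) -> dict[str, str]:
    """Step 2: build with the frozen timestamp instead of the current time."""
    env = dict(os.environ)
    env["SOURCE_DATE_EPOCH"] = str(provider["source-date-epoch"])
    return env


providers = {
    "amazon": {"source-date-epoch": 1699000000},
    "google": {"source-date-epoch": 1699000000},
    "http": {"source-date-epoch": 1690000000},
}

# Documentation is prepared for amazon only; the other providers stay untouched.
freeze_source_date_epoch(providers, released=["amazon"], now=1700300000)
amazon_env = build_environment(providers["amazon"])
```

Anyone who later builds from the same tag reads the same stored value, so the timestamp embedded in the package no longer depends on when or where the build runs.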

@potiuk potiuk force-pushed the add-reproducible-builds-support branch 3 times, most recently from 3211525 to 3060c4e Compare November 18, 2023 16:02
@potiuk potiuk requested a review from kaxil as a code owner November 18, 2023 16:02
@potiuk potiuk force-pushed the add-reproducible-builds-support branch 4 times, most recently from b9017cc to e931861 Compare November 18, 2023 21:30
@eladkal
Contributor

eladkal commented Nov 18, 2023

This change results in a commit where we update provider.yaml + CHANGELOG + commits

Sometimes we don't have to commit the provider.yaml,
mostly in major releases. The PR that added the breaking change also bumps the version in the yaml.

@potiuk
Member Author

potiuk commented Nov 18, 2023

This change results in a commit where we update provider.yaml + CHANGELOG + commits

Sometimes we don't have to commit the provider.yaml, mostly in major releases. The PR that added the breaking change also bumps the version in the yaml.

We always do when we want to release a provider - at minimum we add a new version in "versions".

@potiuk potiuk force-pushed the add-reproducible-builds-support branch from e931861 to 880ec25 Compare November 18, 2023 22:39
@potiuk potiuk merged commit 99534e4 into main Nov 18, 2023
@ephraimbuddy ephraimbuddy added this to the Airflow 2.8.0 milestone Nov 20, 2023
@potiuk potiuk deleted the add-reproducible-builds-support branch November 21, 2023 09:38
potiuk added a commit to potiuk/airflow that referenced this pull request Jan 16, 2024
When building wheel providers took a lot of time (12 minutes), an
optimisation was implemented to only build the affected providers,
and in this case we could not run verification, because having only a
subset of providers would generate errors during imports.

However, the changes to make our package builds reproducible with flit (apache#35693)
also improved build time for provider packages (all of them are
built in under 1 minute) and .whl installation has always been rather
quick - so we can remove the optimisation now, because a side effect
of it was that in some cases (like apache#36799) it caused the backwards
compatibility check to succeed and then keep failing in the
main canary build.
potiuk added a commit that referenced this pull request Jan 16, 2024
…36825)

potiuk added a commit that referenced this pull request Feb 7, 2024
…36825)


(cherry picked from commit 7086192)
ephraimbuddy pushed a commit that referenced this pull request Feb 22, 2024
…36825)


(cherry picked from commit 7086192)