@potiuk
Member

@potiuk potiuk commented Nov 16, 2023

Flit can build reproducible packages (packages that can
be compared bit-by-bit) provided that the source date epoch is
set to a repeatable value when the package is built. This PR implements
reproducibility of our builds by freezing the documentation
preparation time in provider.yaml as "source date epoch" and
always using it when building the package. This way anyone
using breeze to build the package will produce exactly the same
binary package, which will make it much easier for PMC members
to verify that the packages are ready for release.

We will no longer have to check the sources: PMC members will
simply need to build the same packages locally using breeze and
check that the generated packages are exactly the same.
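
The effect of a fixed source date epoch can be illustrated with a
minimal sketch. This is not how flit or breeze build packages; it only
shows the principle with the standard library `zipfile` module (wheels
are zip archives). The names `build_archive` and `digest`, the file
contents and the epoch values are assumptions made up for illustration.

```python
import hashlib
import io
import time
import zipfile


def build_archive(files: dict, source_date_epoch: int) -> bytes:
    """Build a zip archive whose bytes depend only on its inputs.

    Hypothetical helper: every entry gets the same timestamp, derived
    from source_date_epoch, and entries are written in sorted order.
    """
    # Zip timestamps are (year, month, day, hour, minute, second) tuples.
    date_time = time.gmtime(source_date_epoch)[:6]
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for name in sorted(files):
            info = zipfile.ZipInfo(name, date_time=date_time)
            info.compress_type = zipfile.ZIP_DEFLATED
            info.external_attr = 0o644 << 16  # fixed file permissions
            archive.writestr(info, files[name])
    return buffer.getvalue()


def digest(data: bytes) -> str:
    """Return the SHA-256 hex digest used to compare two builds."""
    return hashlib.sha256(data).hexdigest()


if __name__ == "__main__":
    sources = {"pkg/__init__.py": b"x = 1\n", "pkg/mod.py": b"y = 2\n"}
    first = build_archive(sources, 1700000000)
    second = build_archive(sources, 1700000000)
    print("identical:", digest(first) == digest(second))
```

With the epoch frozen, two independent builds of the same sources can be
compared by digest alone, which is the kind of check a PMC member would
run on locally built packages.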

Based on #35617 so it should only be merged after that one

(Only last commit counts)


Member Author

This was not used so I removed it.

@potiuk potiuk force-pushed the add-reproducible-builds-support branch 3 times, most recently from 40b1938 to c3ec014 Compare November 16, 2023 22:34
This is a follow-up to apache#35586 and it depends on it. It
moves the whole functionality of preparing provider packages to
breeze, removing the need to do it in the Breeze CI image.

Since we have Python breeze with its own environment managed via
`pipx` we can now make sure that all the necessary packages are
installed in this environment and run package building in the
same environment Breeze uses.

Previously we ran all the package building inside the
CI image for two reasons:

* we could rely on the same version of build tools (wheel/setuptools)
  being installed in the CI image
* security of the provider package preparation, which used the
  pre-PEP-517 setuptools way of building packages that executed
  setup.py code

The second reason was about isolating execution of potentially
arbitrary code in setup.py from the HOST environment in CI, where the
host environment might have access to secrets and tokens that would
allow such code to break out of the sandbox for PRs coming from forks.
The setup.py file was prepared by breeze using Jinja templates, but it
was potentially possible to manipulate the provider package directory
structure and get "Python" injection into the generated setup.py, so it
was safer to run it in the isolated Breeze CI environment.

This PR makes it secure to run the build in the Host environment,
because instead of generating imperative setup.cfg and setup.py we
generate a declarative pyproject.toml with all the necessary
information and we use a PEP-517 compliant way of building provider
packages. No arbitrary code can be executed via setup.py this way,
so we can safely build provider packages on the host.

We are using flit as the build tool - it is one of the popular build
tools, maintained under the Python Packaging Authority (PyPA). It is
simple and not too opinionated, and it supports PEP-517 as well as
PEP-621, so most of the project metadata in pyproject.toml can be added
to the PEP-621 compliant "project" section of pyproject.toml.

Together with this change we improve the process of generating the
extracted sources for the providers. Originally we copied the whole
sources of Airflow to a single directory (provider_packages) and ran
provider package builds sequentially from that single directory.
This made it impossible to parallelise such builds - all
providers had to be built sequentially.

We change the approach now - instead of copying all airflow
sources once to the single directory, we build providers in separate
subdirectories of files/provider_packages/PROVIDER_ID and we only
copy the relevant sources there (i.e. only the provider's subfolder from
"airflow/providers"). This is quite a bit faster (each provider
gets built using only its own sources, so just scanning the
directory is faster) and it also allows running package preparation
in parallel because each provider is fully isolated from the others.
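
The per-provider layout described above can be sketched as follows.
This is a hedged illustration, not the breeze implementation: the
functions `stage_provider` and `stage_all` are hypothetical, and the
mapping of a dotted provider id such as "apache.hive" to the
"apache/hive" subfolder is an assumption made for the example.

```python
import shutil
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def stage_provider(providers_root: Path, target_root: Path, provider_id: str) -> Path:
    """Copy only one provider's sources into its own build directory.

    Hypothetical helper: returns target_root / provider_id, which then
    contains airflow/providers/<provider path> and nothing else.
    """
    relative = Path(*provider_id.split("."))  # assumed id-to-path mapping
    source = providers_root / relative
    build_dir = target_root / provider_id
    # copytree creates the missing parent directories of the destination.
    shutil.copytree(
        source,
        build_dir / "airflow" / "providers" / relative,
        dirs_exist_ok=True,
    )
    return build_dir


def stage_all(providers_root: Path, target_root: Path, provider_ids, max_workers: int = 4) -> dict:
    """Stage several providers in parallel; each gets an isolated directory."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {
            provider_id: pool.submit(stage_provider, providers_root, target_root, provider_id)
            for provider_id in provider_ids
        }
        # result() re-raises any exception from the worker thread.
        return {provider_id: future.result() for provider_id, future in futures.items()}
```

Because no two providers share a working directory, the staging step
(and, by the same argument, the build that follows it) can run
concurrently without one provider's files affecting another.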

This PR also removes the no-longer-needed `prepare_providers_package.py`
and the unneeded `provider_packages` folder previously used to prepare
providers, as well as the bash script to build the providers and some
unused bash functions.
@potiuk potiuk force-pushed the add-reproducible-builds-support branch from c3ec014 to add0f14 Compare November 16, 2023 22:49
The "source-date-epoch" fields have been regenerated in this PR as
well. This PR also replaces the `lru_cache` method of storing the output
of `get_provider_metadata_packages` with a custom-stored dictionary.
Thanks to that, instead of invalidating the whole cache of provider
metadata read from yaml files, we can refresh individual provider
metadata entries after they have been updated. This saves a lot
of time on validation: every time a provider yaml is updated we need
to re-read it and re-validate it against the json schema, and with
this change we only do that for the updated provider yaml. This
saves about 0.5 seconds per provider yaml update, so updating
all providers is much faster.
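
The difference from `lru_cache` can be sketched with a small
dictionary-backed cache. This is only an illustration under stated
assumptions: the class `ProviderMetadataCache` and its methods are
hypothetical, and JSON files stand in for the yaml loading and json
schema validation so that the example needs only the standard library.

```python
import json
from pathlib import Path


class ProviderMetadataCache:
    """Dictionary-backed cache that can refresh a single entry.

    Hypothetical stand-in for the custom-stored dictionary: unlike
    functools.lru_cache, whose cache_clear() drops every entry, this
    lets one provider be re-read while the others stay cached.
    """

    def __init__(self) -> None:
        self._entries = {}
        self.load_count = 0  # counts expensive loads, for demonstration

    def _load(self, path: Path) -> dict:
        # Stand-in for reading provider yaml and validating it.
        self.load_count += 1
        return json.loads(path.read_text())

    def get(self, provider_id: str, path: Path) -> dict:
        """Return cached metadata, loading it only on first access."""
        if provider_id not in self._entries:
            self._entries[provider_id] = self._load(path)
        return self._entries[provider_id]

    def refresh(self, provider_id: str, path: Path) -> dict:
        """Re-read one provider's metadata, leaving other entries alone."""
        self._entries[provider_id] = self._load(path)
        return self._entries[provider_id]
```

Calling `refresh` for one provider costs one load, whereas clearing an
`lru_cache` would force every provider to be re-read and re-validated on
its next access.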
@potiuk potiuk force-pushed the add-reproducible-builds-support branch from add0f14 to 68100ef Compare November 16, 2023 22:50
@potiuk
Member Author

potiuk commented Nov 16, 2023

Need to wait with PROD build until #35617 gets merged

@potiuk potiuk closed this Nov 17, 2023
@potiuk
Member Author

potiuk commented Nov 17, 2023

Closing in favour of #35693, which runs from the Apache repository, to get the PROD image build working.

@potiuk potiuk deleted the add-reproducible-builds-support branch November 17, 2023 16:22