Skip to content

Conversation

@bolkedebruin
Copy link
Contributor

@bolkedebruin bolkedebruin commented Nov 13, 2023

This refactors the object storage implementation to
inherit from UPath and therefore from pathlib.Path.
This makes its behavior semantically stricter and
lowers the maintenance burden.

Furthermore the refactoring includes moving
ObjectStoragePath up a level to airflow.io.path making
it more intuitive to import.

Work is underway to allow storage_options to be passed
directly to the underlying filesystem.

cc: @jscheffl @Taragolis @uranusjr


^ Add meaningful description above
Read the Pull Request Guidelines for more information.
In case of fundamental code changes, an Airflow Improvement Proposal (AIP) is needed.
In case of a new dependency, check compliance with the ASF 3rd Party License Policy.
In case of backwards incompatible changes please leave a note in a newsfragment file, named {pr_number}.significant.rst or {issue_number}.significant.rst, in newsfragments.

@potiuk
Copy link
Member

potiuk commented Nov 13, 2023

It also means that comon-io (currently released) will stop working because of the old import. I know it has not been usable - but sill we likely do not want to yank or delete it. Possibly a good solution will be to ADD PEP-363 dynamic import in "store" package in Airflow - symilarly to https://github.com/apache/airflow/blob/main/airflow/contrib/utils/__init__.py

@bolkedebruin
Copy link
Contributor Author

I know; is that really worthwhile doing? It feels unclean and if we time a new release of the providers before 2.8 then who will install the old provider? You cannot even install it currently except when you call pip install and explicitly ignore the dependencies (afaik).

@potiuk
Copy link
Member

potiuk commented Nov 13, 2023

Only if we release new provider version and yank the others - this is another path we can take.

@kaxil
Copy link
Member

kaxil commented Nov 13, 2023

Yup, this is why I had commented in the dev mailing list about not releasing the provider version. The chicked-egg situation with 2.8 Airflow core depending on that provider -- was only for FileTransferOperator as opposed to anything else. So actually we don't need to depend on it in the core, until we want to, to avoid chicken-egg.

Since this (Airflow Object Storage) is such a new feature, I expect things to change for a bit since it is still evolving.

@kaxil
Copy link
Member

kaxil commented Nov 13, 2023

We can just yank them once we release Airflow 2.8 (tentatively planned for Dec 14) and the provider

@potiuk
Copy link
Member

potiuk commented Nov 13, 2023

Yup, this is why I had commented in the dev mailing list about not releasing the provider version. The chicked-egg situation with 2.8 Airflow core depending on that provider -- was only for FileTransferOperator as opposed to anything else. So actually we don't need to depend on it in the core, until we want to, to avoid chicken-egg.

Yeah. I also think making common.io preinstalled is not really needed for now to be honest

@kaxil
Copy link
Member

kaxil commented Nov 13, 2023

@bolkedebruin Are you planning to expose storage_options in this PR or separate?

Work is underway to allow storage_options to be passed
directly to the underlying filesystem.

I would have loved for you to have separate PRs for:

  1. Moving from airflow.io.store.path -> airflow.io.path
  2. Change from os.Pathlike -> upath.CloudPath
  3. Expose storage_options

That way it is easier to review, the diff currently is big :)

@bolkedebruin
Copy link
Contributor Author

@bolkedebruin Are you planning to expose storage_options in this PR or separate?

Work is underway to allow storage_options to be passed
directly to the underlying filesystem.

I would have loved for you to have separate PRs for:

1. Moving from `airflow.io.store.path` -> `airflow.io.path`

2. Change from `os.Pathlike` -> `upath.CloudPath`

3. Expose `storage_options`

That way it is easier to review, the diff currently is big :)

Well, after this feedback I cannot put the storage_options in this PR eh ;-). But you are right, it should have been smaller. I'll add here some considerations and I hope you can review it as is. I mostly have what you want in separate commits in this pR.

Deriving from Path / upath.CloudPath

pathlib.Path makes it very hard to inherit from and it even does that intentionally by (poorly motivated?) design (This probably changes in python 3.13). So, when I started out with ObjectStoragePath I decided to make it os.Pathlike. The downside is that it has a pretty weak API and doesn't specify what methods should be available. So I, well intended, just added a few that seemed to make sense by looking at the Path interface and adding some for convenience.

Enter upath. upath does derive from Path and therefore has a stricter API. It still has the downside of being difficult to derive from, hence the __new__ refactor, but it has implemented the semantics that we need. The refactor of ObjectStoragePath therefore keeps the API the same as before minus find and du and walk. walk might be re-added again as it is - i think - useful for the dag processor in the future.

Considerations

  • ObjectStoragePath is a more opinionated version of upath.CloudPath and it is the only implementation of Path that supports the Airflow-isms (conn_id, getting init from providers etc, serialization). While you can attach other filesystems (e.g. dbfs or webhdfs) the semantics will be as an object store. upath does implement different semantics if required, not sure of that is useful for us.

  • Naming: some discussion happened with @uranusjr on whether ObjectStoragePath is the right name. CloudStoragePath could be al alternave, but it rubs me the wrong way somehow.

airflow.io.store.path --> airflow.io.path

@jscheffl was already in favor of moving it up the hierarchy, I think it has a more natural fit here and unties it from store
which is still there, but not as important.

@bolkedebruin
Copy link
Contributor Author

We can just yank them once we release Airflow 2.8 (tentatively planned for Dec 14) and the provider

I think this is the best way forward.

Yup, this is why I had commented in the dev mailing list about not releasing the provider version. The chicked-egg situation with 2.8 Airflow core depending on that provider -- was only for FileTransferOperator as opposed to anything else. So actually we don't need to depend on it in the core, until we want to, to avoid chicken-egg.

Yeah. I also think making common.io preinstalled is not really needed for now to be honest

I agree. It isn't really required.

@bolkedebruin bolkedebruin force-pushed the refactor_os branch 2 times, most recently from 16e1a4f to 15786b2 Compare November 14, 2023 10:14
@potiuk
Copy link
Member

potiuk commented Nov 14, 2023

I agree. It isn't really required.

Cool. I will separately remove it then. I am now working on simplifying and ultimately improving the security and robustness of our release processes - currently in #35586 and #35617 - but 2 or 3 more are coming so I will make sure to remove the comment and note abotu common.io becoming pre-installed before 2.8.0 (I am touching those areas and I currently keep the preinstalled providers list in two places but eventually I want to move the preinstalled flag to provider.yaml).

@bolkedebruin
Copy link
Contributor Author

I agree. It isn't really required.

Cool. I will separately remove it then. I am now working on simplifying and ultimately improving the security and robustness of our release processes - currently in #35586 and #35617 - but 2 or 3 more are coming so I will make sure to remove the comment and note abotu common.io becoming pre-installed before 2.8.0 (I am touching those areas and I currently keep the preinstalled providers list in two places but eventually I want to move the preinstalled flag to provider.yaml).

Great. I think in the future the common.io might be required or just makes sense to be included. But I think having the API stabilise first makes sense.

@uranusjr
Copy link
Member

The way I’m thinking is ObjectStoragePath is long enough for people to want to from ... import it, and the name they get there is not specific enough. But Path is so vague (it even clashes with pathlib.Path) people will be forced to do a fully qualified import (which is ergonomic enough as long as the import path is short) that eliminates the vagueness of the name.

@bolkedebruin
Copy link
Contributor Author

The way I’m thinking is ObjectStoragePath is long enough for people to want to from ... import it, and the name they get there is not specific enough. But Path is so vague (it even clashes with pathlib.Path) people will be forced to do a fully qualified import (which is ergonomic enough as long as the import path is short) that eliminates the vagueness of the name.

Ah you mean allowing both?

@jscheffl
Copy link
Contributor

The way I’m thinking is ObjectStoragePath is long enough for people to want to from ... import it, and the name they get there is not specific enough. But Path is so vague (it even clashes with pathlib.Path) people will be forced to do a fully qualified import (which is ergonomic enough as long as the import path is short) that eliminates the vagueness of the name.

Ah you mean allowing both?

Sorry, late to the party. Agree on that ObjectStoragePath as too many letters (getting an o15h :-D) and Path would be great but raises a lot of dis-ambiguation with pathlib.Path(so should NOT be done!). My previous favority was to name it StoragePath because it might be not solely Object-Storage being used. Can be actually "any" Storage, also FileStorage, correct? As StoragePath we only have s9h letters and is a compromise between o15h and p2h :-D

@potiuk
Copy link
Member

potiuk commented Nov 14, 2023

Sorry, late to the party. Agree on that ObjectStoragePath as too many letters (getting an o15h :-D) and Path would be great but raises a lot of dis-ambiguation with pathlib.Path(so should NOT be done!). My previous favority was to name it StoragePath because it might be not solely Object-Storage being used. Can be actually "any" Storage, also FileStorage, correct? As StoragePath we only have s9h letters and is a compromise between o15h and p2h :-D

+1 to what Jens wrote. The way how Python works with making class available without namespace after importing it, ambiguity between pathlib.Path will be far to likely to cause confusion.

It's rather likely that someone will still want to use Path from Pathlib in the same module.

I think StoragePath is good.

Copy link
Contributor

@jscheffl jscheffl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First review with a few comments. Mentally an approve but not made full tests. Shall I do a deep testing already now or wait until other comments have settled?

return isinstance(other, _AirflowCloudAccessor) and self._store == other._store


class ObjectStoragePath(CloudPath):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In my previous comment I was thinking of shortening the name to StoragePath but now one step later I realize this is really specific to all sorts of object storage? Means I can not use this IO utility (anymore, compared to previous version?) to also do IO operations on local files with the same API?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can do operations on local files, http (with an attach), dbfs (with an attach). It just means that such a Path will exhibit the semantics of object storage. This was also the case with the previous implementation.

We could extend the implementation with a registry - as upath does - to support different semantics. However, dus to the nature of pathlib.Path and therefore upath it just means repeating quite a lot of code and not being able to fully rely upon upstream implementations.

In that way ObjectStoragePath is more true to its behavior than StoragePath would be. An alternative could be OPath, but that is also a bit weird I think.

@kaxil
Copy link
Member

kaxil commented Nov 14, 2023

Yeah, worth marking it as experimental in 2.8 with an intention of a stable API in 2.9

@uranusjr
Copy link
Member

uranusjr commented Nov 15, 2023

(Edited; see below) I read the implementation and it seems the parsing logic in ObjectStoragePath is a bit fragile if the user passes in some weird combo like

P = ObjectStoragePath

P(P("s3://bucket/path"), P("file://storage/path"))

In pathlib, passing in an absolute path would simply overwrite everything passed in previous arguments and simply gives you back that absolute path, but from I can tell the current implementation cannot do that, and returns a somewhat nonsensical result. I’ll provide some improvements maybe after this is merged or at least stablises.

Edited: It seems like this weirdness is inherited from UPath. I guess I’ll attempt to submit a fix upstream at some point.


Another not directly related point, I’m thinking we can probably merge conn_id into the URI itself like this: s3://conn_id@bucket/. This seems cleaner from the user’s standpoint. The conn_id value should still be parsed out in the implementation, but I feel the user shouldn’t be required to pass it in as a separate argument. We can discuss this later after this is merged.

@bolkedebruin
Copy link
Contributor Author

(Edited; see below) I read the implementation and it seems the parsing logic in ObjectStoragePath is a bit fragile if the user passes in some weird combo like

P = ObjectStoragePath

P(P("s3://bucket/path"), P("file://storage/path"))

In pathlib, passing in an absolute path would simply overwrite everything passed in previous arguments and simply gives you back that absolute path, but from I can tell the current implementation cannot do that, and returns a somewhat nonsensical result. I’ll provide some improvements maybe after this is merged or at least stablises.

Edited: It seems like this weirdness is inherited from UPath. I guess I’ll attempt to submit a fix upstream at some point.

In my previous implementation I checked for the same backing store and would raise an exception if they wouldn't be equal. I think we should do the same here, cause the behavior or pathlib.Path doesn't make sense here as pathlib.Path has the same backing store / filesystem per design.

Another not directly related point, I’m thinking we can probably merge conn_id into the URI itself like this: s3://conn_id@bucket/. This seems cleaner from the user’s standpoint. The conn_id value should still be parsed out in the implementation, but I feel the user shouldn’t be required to pass it in as a separate argument. We can discuss this later after this is merged.

I like that, but it should be indeed a separate PR.

@bolkedebruin
Copy link
Contributor Author

Yeah, worth marking it as experimental in 2.8 with an intention of a stable API in 2.9

I think we should settle on the API as much as we can - following pathlib.Path that should be relatively easy, but marking it experimental due to its implementation still moving is smart. How to do that?

Since attach() already caches store objects, I feel we shouldn't need to
store all the extra information. We can simply take conn_id on Path, and
rely on the internal caching to retrieve the correct information
afterwards.
@uranusjr
Copy link
Member

uranusjr commented Nov 15, 2023

I just pushed some recommendations to bolkedebruin#3

Regarding marking as experiemental, I don’t feel it conflicts with settling the API as much as possible. We should do our best, but the timeline is a bit tight and some rough edges may not be easy to discover without feedback from wider users, which we lack a lot at the moment. From past experience it only requires adding a callout in documentation saying something is not stable (one example).

@potiuk
Copy link
Member

potiuk commented Nov 15, 2023

I think we should settle on the API as much as we can - following pathlib.Path that should be relatively easy, but marking it experimental due to its implementation still moving is smart. How to do that?

add |experimental| directive in the documentation (you will find a few examples in our .rst files)

@jscheffl
Copy link
Contributor

I also your favor marking it experimental - not that I would believe it is in-stable from it's nature but because it would signal that we need a bit of freedom if we see that we need to move some pieces of API if we learn from users feedback. Otherwise we need to follow a strict deprecation. (Such refactoring PR would not be possible if already released!)

@kaxil
Copy link
Member

kaxil commented Nov 15, 2023

Yup, some examples:

and @ephraimbuddy can take care of mentioning that in the blog post too for Airflow 2.8 (https://airflow.apache.org/blog/airflow-2.7.0/)

@bolkedebruin
Copy link
Contributor Author

So I addressed most of the comments:

  • Added experimental flag
  • Improved and simplified accessor with help of @uranusjr - not all changes applied, I'd like to stay close to original upath as much as we can
  • Improved testing of serialization

Please review :-). I'd like to merge it as is.

Before 2.8 I'd like to add the possibility to pass on options to the underlying FS.

Naming:

I've kept it as is. I think we haven't settled on a name yet that really is better than what it is now. However "ObjectPath" comes to mind now :-). Keep voting.

Copy link
Member

@potiuk potiuk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM - one nit only.

@bolkedebruin bolkedebruin merged commit 2fc8d2a into apache:main Nov 18, 2023
@bolkedebruin bolkedebruin deleted the refactor_os branch November 18, 2023 20:38
@ephraimbuddy ephraimbuddy added the changelog:skip Changes that should be skipped from the changelog (CI, tests, etc..) label Nov 20, 2023
@ephraimbuddy ephraimbuddy added this to the Airflow 2.8.0 milestone Nov 20, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants