Refactor ObjectStorage #35612
|
It also means that common-io (currently released) will stop working because of the old import. I know it has not been usable - but still we likely do not want to yank or delete it. Possibly a good solution would be to add a PEP 562 dynamic import in the "store" package in Airflow - similarly to https://github.com/apache/airflow/blob/main/airflow/contrib/utils/__init__.py |
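A minimal sketch of what such a dynamic import could look like, using a module-level `__getattr__` (PEP 562). The module names `demo_old_location` and `demo_new_location`, the `_MOVED` mapping, and the placeholder class are hypothetical and built in memory only for illustration; this is not the code Airflow uses.

```python
import importlib
import sys
import types
import warnings

# Hypothetical "new" module that holds the class after the move.
new_mod = types.ModuleType("demo_new_location")


class ObjectStoragePath:  # placeholder class, not Airflow's implementation
    pass


new_mod.ObjectStoragePath = ObjectStoragePath
sys.modules["demo_new_location"] = new_mod

# Hypothetical "old" module that only forwards lookups to the new location.
old_mod = types.ModuleType("demo_old_location")
_MOVED = {"ObjectStoragePath": "demo_new_location"}


def _lazy_getattr(name):
    """Resolve moved names lazily and warn about the old import path."""
    target = _MOVED.get(name)
    if target is None:
        raise AttributeError(f"module 'demo_old_location' has no attribute {name!r}")
    warnings.warn(
        f"demo_old_location.{name} has moved to {target}.{name}",
        DeprecationWarning,
        stacklevel=2,
    )
    return getattr(importlib.import_module(target), name)


# PEP 562: a callable named __getattr__ in the module namespace is used
# as a fallback when normal attribute lookup on the module fails.
old_mod.__getattr__ = _lazy_getattr
sys.modules["demo_old_location"] = old_mod
```

With this in place, `from demo_old_location import ObjectStoragePath` keeps working and emits a `DeprecationWarning` pointing at the new location.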
|
I know; is that really worthwhile doing? It feels unclean and if we time a new release of the providers before 2.8 then who will install the old provider? You cannot even install it currently except when you call |
|
Only if we release new provider version and yank the others - this is another path we can take. |
|
Yup, this is why I had commented in the dev mailing list about not releasing the provider version. The chicken-egg situation with 2.8 Airflow core depending on that provider -- was only for FileTransferOperator as opposed to anything else. So actually we don't need to depend on it in the core, until we want to, to avoid chicken-egg. Since this (Airflow Object Storage) is such a new feature, I expect things to change for a bit since it is still evolving. |
|
We can just yank them once we release Airflow 2.8 (tentatively planned for Dec 14) and the provider |
Yeah. I also think making common.io preinstalled is not really needed for now to be honest |
|
@bolkedebruin Are you planning to expose
I would have loved for you to have separate PRs for:
That way it is easier to review, the diff currently is big :) |
Well, after this feedback I cannot put the Deriving from Path / upath.CloudPath pathlib.Path makes it very hard to inherit from and it even does that intentionally by (poorly motivated?) design (This probably changes in python 3.13). So, when I started out with ObjectStoragePath I decided to make it Enter Considerations
airflow.io.store.path --> airflow.io.path @jscheffl was already in favor of moving it up the hierarchy, I think it has a more natural fit here and unties it from |
I think this is the best way forward.
I agree. It isn't really required. |
Cool. I will separately remove it then. I am now working on simplifying and ultimately improving the security and robustness of our release processes - currently in #35586 and #35617 - but 2 or 3 more are coming so I will make sure to remove the comment and note about common.io becoming pre-installed before 2.8.0 (I am touching those areas and I currently keep the preinstalled providers list in two places but eventually I want to move the preinstalled flag to provider.yaml). |
Great. I think in the future the |
|
The way I’m thinking is |
Ah you mean allowing both? |
Sorry, late to the party. Agree on that |
+1 to what Jens wrote. Because Python makes a class available without its namespace once it is imported, ambiguity with pathlib.Path will be far too likely to cause confusion. It's rather likely that someone will still want to use Path from pathlib in the same module. I think |
jscheffl
left a comment
First review with a few comments. Mentally an approval, but I have not run full tests. Shall I do deep testing now or wait until the other comments have settled?
    return isinstance(other, _AirflowCloudAccessor) and self._store == other._store

class ObjectStoragePath(CloudPath):
In my previous comment I was thinking of shortening the name to StoragePath, but now, one step later, I realize this is really specific to all sorts of object storage? Does that mean I cannot use this IO utility (anymore, compared to the previous version?) to also do IO operations on local files with the same API?
You can do operations on local files, http (with an attach), dbfs (with an attach). It just means that such a Path will exhibit the semantics of object storage. This was also the case with the previous implementation.
We could extend the implementation with a registry - as upath does - to support different semantics. However, due to the nature of pathlib.Path and therefore upath, it just means repeating quite a lot of code and not being able to fully rely upon upstream implementations.
In that way ObjectStoragePath is more true to its behavior than StoragePath would be. An alternative could be OPath, but that is also a bit weird I think.
|
Yeah, worth marking it as experimental in 2.8 with an intention of a stable API in 2.9 |
|
(Edited; see below) I read the implementation and it seems the parsing logic in
    P = ObjectStoragePath
    P(P("s3://bucket/path"), P("file://storage/path"))
In pathlib, passing in an absolute path would simply overwrite everything passed in previous arguments and simply gives you back that absolute path, but from what I can tell the current implementation cannot do that, and returns a somewhat nonsensical result. I’ll provide some improvements maybe after this is merged or at least stabilises. Edited: It seems like this weirdness is inherited from UPath. I guess I’ll attempt to submit a fix upstream at some point. Another not directly related point, I’m thinking we can probably merge |
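The pathlib behavior referred to here can be checked with the standard library alone. This small example uses `PurePosixPath` with made-up segments; it only illustrates the pathlib rule and says nothing about how ObjectStoragePath or UPath resolve the same input.

```python
from pathlib import PurePosixPath

# A later absolute segment discards everything that came before it.
combined = PurePosixPath("bucket/path", "/storage/path")

# A relative segment is appended as usual.
appended = PurePosixPath("bucket/path", "storage/path")
```

Here `combined` is `/storage/path`, while `appended` is `bucket/path/storage/path`.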
In my previous implementation I checked for the same backing store and would raise an exception if they wouldn't be equal. I think we should do the same here, cause the behavior or
I like that, but it should be indeed a separate PR. |
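As a rough sketch of the "same backing store" check mentioned in this exchange: the class below is a hypothetical stand-in (the name `SketchPath` and its attributes are assumptions for illustration), not the previous or current Airflow implementation. It treats the URI scheme and bucket as the store identity and refuses to join paths from different stores.

```python
from urllib.parse import urlsplit


class SketchPath:
    """Hypothetical stand-in for a store-aware path; not Airflow's class."""

    def __init__(self, uri: str) -> None:
        parts = urlsplit(uri)
        # Store identity: (protocol, bucket/host), e.g. ("s3", "bucket").
        self.store = (parts.scheme, parts.netloc)
        self.key = parts.path

    def joinpath(self, other: "SketchPath") -> "SketchPath":
        # Refuse to combine paths that live on different backing stores.
        if other.store != self.store:
            raise ValueError(f"Cannot join {other.store} onto {self.store}")
        scheme, netloc = self.store
        key = f"{self.key.rstrip('/')}/{other.key.lstrip('/')}"
        return SketchPath(f"{scheme}://{netloc}{key}")
```

Joining `s3://bucket/path` with `s3://bucket/sub` yields the key `/path/sub`, while joining it with `file://storage/path` raises `ValueError`.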
I think we should settle on the API as much as we can - following |
Since attach() already caches store objects, I feel we shouldn't need to store all the extra information. We can simply take conn_id on Path, and rely on the internal caching to retrieve the correct information afterwards.
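The idea of relying on a cached attach step can be sketched as follows. Everything here is hypothetical: `attach_sketch`, `StoreSketch`, and `PathSketch` are illustrative names, and the real `attach()` may take different arguments and cache differently.

```python
from functools import lru_cache
from typing import Optional


class StoreSketch:
    """Hypothetical store object holding connection details."""

    def __init__(self, protocol: str, conn_id: Optional[str]) -> None:
        self.protocol = protocol
        self.conn_id = conn_id


@lru_cache(maxsize=None)
def attach_sketch(protocol: str, conn_id: Optional[str] = None) -> StoreSketch:
    # The same (protocol, conn_id) pair always returns the same cached store.
    return StoreSketch(protocol, conn_id)


class PathSketch:
    """Hypothetical path that keeps only conn_id and looks the store up lazily."""

    def __init__(self, protocol: str, key: str, conn_id: Optional[str] = None) -> None:
        self.protocol = protocol
        self.key = key
        self.conn_id = conn_id

    @property
    def store(self) -> StoreSketch:
        # No store data is kept on the path; the cache supplies it on demand.
        return attach_sketch(self.protocol, self.conn_id)
```

Two paths created with the same protocol and conn_id then share a single store object without either path holding a reference to it up front.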
|
I just pushed some recommendations to bolkedebruin#3 Regarding marking as experimental, I don’t feel it conflicts with settling the API as much as possible. We should do our best, but the timeline is a bit tight and some rough edges may not be easy to discover without feedback from wider users, which we lack a lot at the moment. From past experience it only requires adding a callout in documentation saying something is not stable (one example). |
add |
|
I am also in favor of marking it experimental - not because I believe it is unstable by nature, but because it signals that we need a bit of freedom to move some pieces of the API as we learn from user feedback. Otherwise we would need to follow a strict deprecation process. (Such a refactoring PR would not be possible if it were already released!) |
|
Yup, some examples:
and @ephraimbuddy can take care of mentioning that in the blog post too for Airflow 2.8 (https://airflow.apache.org/blog/airflow-2.7.0/) |
Refactor ObjectStoragePath to be a Path
|
So I addressed most of the comments:
Please review :-). I'd like to merge it as is. Before 2.8 I'd like to add the possibility to pass on options to the underlying FS. Naming: I've kept it as is. I think we haven't settled on a name yet that really is better than what it is now. However "ObjectPath" comes to mind now :-). Keep voting. |
potiuk
left a comment
LGTM - one nit only.
This refactors the object storage implementation to
inherit from UPath and therefore from pathlib.Path.
This makes its behavior semantically stricter and
lowers the maintenance burden.
Furthermore the refactoring includes moving
ObjectStoragePath up a level to
airflow.io.path, making it more intuitive to import.
Work is underway to allow storage_options to be passed
directly to the underlying filesystem.
cc: @jscheffl @Taragolis @uranusjr