Add BatchAdapter to simplify using PhysicalExprAdapter / Projector to map RecordBatch between schemas #19716
Conversation
> (should we undeprecate the schema adapter?)

> we can also do a 52.1.0 release too
I actually considered that and decided against it for a couple of reasons:
So given that we don't really want the trait, that we'd be deprecating half of the methods anyway, that the other half could use a refactor / breaking changes to simplify the APIs, and that it wouldn't really help most of the use cases (…)
I see no reason why this can't be included in a …
Speaking of which, here it is:
/// Factory for creating [`BatchAdapter`] instances to adapt record batches
This looks like a useful API for sure (and we hit the same thing when upgrading to 52.0.0 internally)
It looks almost, but not quite, the same as SchemaAdapterFactory / SchemaMapper. In other words, the API is different from SchemaAdapterFactory, but it looks to me like it could do the same thing.
To assist people upgrading, is there any way we can un-deprecate SchemaAdapterFactory? For example, maybe we could move SchemaAdapterFactory into physical-expr-adapter and leave a reference back to the old location?
That way, people who have code that uses the old interface could continue to do so with minimal disruption.
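For context on the pattern being discussed, here is a minimal sketch of a factory that, given a target (table) schema, produces an adapter that maps each incoming batch to it. All types and names here are hypothetical toy stand-ins for illustration, not DataFusion's actual `BatchAdapter` API:

```rust
use std::collections::HashMap;

/// A "schema" reduced to a list of column names (toy stand-in).
type Schema = Vec<String>;
/// A "batch" reduced to a map from column name to values (toy stand-in).
type Batch = HashMap<String, Vec<i64>>;

/// Adapts batches read with one (file) schema to a target (table) schema.
struct BatchAdapter {
    target: Schema,
}

impl BatchAdapter {
    /// Reorder columns to match the target schema, filling missing
    /// columns with a default value (here: zeros).
    fn adapt(&self, batch: &Batch, num_rows: usize) -> Vec<Vec<i64>> {
        self.target
            .iter()
            .map(|col| batch.get(col).cloned().unwrap_or_else(|| vec![0; num_rows]))
            .collect()
    }
}

/// Factory that creates an adapter for a given target schema.
struct BatchAdapterFactory;

impl BatchAdapterFactory {
    fn create(&self, target: Schema) -> BatchAdapter {
        BatchAdapter { target }
    }
}

fn main() {
    let adapter = BatchAdapterFactory.create(vec!["a".into(), "b".into()]);
    let mut batch = Batch::new();
    batch.insert("b".into(), vec![1, 2, 3]);
    // Column "a" is missing in the file batch, so it is filled with defaults.
    assert_eq!(adapter.adapt(&batch, 3), vec![vec![0, 0, 0], vec![1, 2, 3]]);
}
```

The factory indirection mirrors the old SchemaAdapterFactory shape in spirit only; the real API works on Arrow `RecordBatch`es and physical expressions.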
The deprecation was not a fully equal replacement. The mapper provided map_schema, which can be replaced by PhysicalExprAdapter, but also map_batch, which is challenging (if even possible) to support with the new API.
Yes agreed, I give my reasoning in #19716 (comment), but the TLDR is that it would be hard / impossible to replicate the exact semantics and APIs of SchemaAdapter, so any sort of un-deprecated version would only be half functional and probably cause more headaches than it's worth.
@alamb I'm curious why you need this / why implementing a custom PhysicalExprAdapter isn't enough?
Just one more comment on this: if this looks good maybe we can merge it and if we do need to re-implement some of SchemaAdapter we can do that in a followup?
Sounds good to me -- maybe we can also update the SchemaAdapter deprecation note to refer to this new structure too
datafusion/datasource/src/schema_adapter.rs (lines 80 to 83 in 75d2473)
@comphead FYI for Comet stuff.
Thanks @adriangb, I missed this PR somehow. Actually, in Comet we were experimenting with two main directions: …
I'll check this thoroughly today, thanks for thinking of it ahead of us ))
I'm a bit confused, isn't it pretty much the same thing? What we do in the Parquet opener is … It also doesn't really seem like something you should have to do, if you provide the casting rules you want via …
Unfortunately it is slightly more than just casting (applying default values, unifying schemas, etc.); we're doing some RecordBatch -> RecordBatch modification just after the scan. Hopefully we can do this better in the future, as this part is expensive.
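To make the "more than just casting" point concrete, a toy sketch of a batch-to-batch step that both casts column types and fills defaults for columns missing from the file. These types and function names are illustrative assumptions, not Comet's or DataFusion's actual code:

```rust
use std::collections::HashMap;

/// Toy column values: the file may store Int32 where the table wants Int64.
#[derive(Debug, Clone, PartialEq)]
enum Column {
    Int32(Vec<i32>),
    Int64(Vec<i64>),
}

/// The "just casting" part: widen a column to Int64 if needed.
fn cast_to_i64(col: Column) -> Vec<i64> {
    match col {
        Column::Int32(v) => v.into_iter().map(i64::from).collect(),
        Column::Int64(v) => v,
    }
}

/// The "more than casting" part: unify a file batch with a wider table
/// schema, casting present columns and filling absent ones with a default.
fn adapt_batch(
    table_cols: &[&str],
    file_batch: &HashMap<&str, Column>,
    num_rows: usize,
) -> Vec<Vec<i64>> {
    table_cols
        .iter()
        .map(|name| match file_batch.get(name) {
            Some(col) => cast_to_i64(col.clone()),
            None => vec![0; num_rows], // default for a column missing in the file
        })
        .collect()
}

fn main() {
    let mut file_batch = HashMap::new();
    file_batch.insert("a", Column::Int32(vec![1, 2]));
    // "a" is Int32 in the file and gets cast to Int64;
    // "b" exists only in the table schema and is filled with defaults.
    assert_eq!(
        adapt_batch(&["a", "b"], &file_batch, 2),
        vec![vec![1, 2], vec![0, 0]]
    );
}
```

Doing this per batch after the scan is the expensive part the comment refers to; pushing it into expressions lets the engine plan and optimize it instead.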
From my investigation in apache/datafusion-comet#3047 it seemed that something like …
Correct, I'll try to do a migration today based on this PR; with the adapter it is more promising. And you are right, we should also do more at the expression level than at the RecordBatch level, but that would require quite some investigation.
I may be missing something but I think the …
Sounds good, just started a discussion in apache/datafusion-comet#3047 (comment)
Should we backport to …
Sounds good I see no reason not to |
… map RecordBatch between schemas (apache#19716)
I've now seen this pattern a couple of times, in our own codebase, working on apache/datafusion-comet#3047.
I was going to add an example but I think adding an API to handle it for users is a better experience.
This should also make it a bit easier to migrate from SchemaAdapter. In fact, I think it's possible to implement a SchemaAdapter using this as the foundation plus some shim code. This won't be available in DF 51 to ease migration, but it's easy enough to backport (just copy the code in this PR) for users that would find that helpful.
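The "SchemaAdapter as foundation plus shim code" idea could be sketched roughly as a legacy-style trait delegating to a new-style adapter. The trait and struct names here are purely hypothetical illustrations of the shim pattern, not the real DataFusion interfaces:

```rust
/// Toy stand-in for a RecordBatch: a list of columns.
type Batch = Vec<Vec<i64>>;

/// Legacy-style interface, analogous in spirit to the old SchemaMapper's
/// map_batch (names are illustrative, not the real API).
trait LegacyMapper {
    fn map_batch(&self, batch: Batch) -> Batch;
}

/// New-style adapter: here reduced to a boxed batch -> batch function.
struct NewBatchAdapter {
    f: Box<dyn Fn(Batch) -> Batch>,
}

/// The shim: the legacy trait simply delegates to the new adapter.
impl LegacyMapper for NewBatchAdapter {
    fn map_batch(&self, batch: Batch) -> Batch {
        (self.f)(batch)
    }
}

fn main() {
    // A new-style adapter that reverses column order.
    let adapter = NewBatchAdapter {
        f: Box::new(|mut b: Batch| {
            b.reverse();
            b
        }),
    };
    // Code written against the legacy trait keeps working through the shim.
    let mapper: &dyn LegacyMapper = &adapter;
    assert_eq!(mapper.map_batch(vec![vec![1], vec![2]]), vec![vec![2], vec![1]]);
}
```

The point of the shim is that downstream code holding a `dyn LegacyMapper` needs no changes while the actual batch mapping moves to the new adapter.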