-
Notifications
You must be signed in to change notification settings - Fork 1.9k
Update DataFusion to arrow 6.0 #984
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| execute("SELECT c1, AVG(c2) FROM test WHERE c1 = 123 GROUP BY c1", 4).await?; | ||
|
|
||
| let expected = vec!["++", "||", "++", "++"]; | ||
| let expected = vec![ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Required due to apache/arrow-rs#656
| expr: col("c7", &schema).unwrap(), | ||
| options: SortOptions::default(), | ||
| }]; | ||
| let sort = vec
The full output is in basic.txt
and partition.txt
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FYI @tustvold
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds like a good plan if it works 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated this PR to use c12
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we need to add a stable sort option in arrow sort kernel?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure @jimexist -- I think it is fairly common that analytic systems don't really have the notion of "stable" sort (because the data doesn't have any well defined sort order in storage).
In DataFusion, for example, the order that the rows are produced (and how they are partitioned) depends on the DataSource (as well as they may be re-arranged by a repartition / coalesce operator). This stable sorting really only is useful for testing when the output has a single RecordBatch I think
It may be best not to get used to / rely on that stable sorting
| expr: col("c7", &schema).unwrap(), | ||
| options: SortOptions::default(), | ||
| }]; | ||
| let sort = vec![ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FYI @tustvold
datafusion/tests/sql.rs
Outdated
| first_value(cast(c4 as Int)) over (partition by c3), \ | ||
| last_value(cast(c4 as Int)) over (partition by c3), \ | ||
| nth_value(cast(c4 as Int), 2) over (partition by c3) \ | ||
| first_value(cast(c4 as Int)) over (partition by c3 order by c3, c4), \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This change is also required due to the change to use unstable sorting in apache/arrow-rs#552 but the test is non deterministic according to the sql spec (the query output depends on implementation details).
Specifically, it is computing first_value last_value and nth_value for partitions that have more than one value of the partition by value c3 but does not specify an order by clause to determine how those values should be sorted. :
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fyi @jimexist
6122563 to
281d6bc
Compare
| }) | ||
| } | ||
| DataType::Map(_, _) => { | ||
| unimplemented!("Ballista does not yet support Map data type") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Needed due to apache/arrow-rs#491 -- I am not enough of an expert to implement protobuf serialization in Ballista for a new DataType at this time but I suspect it is not very hard
8231f95 to
f089217
Compare
f089217 to
5971729
Compare
5971729 to
89d214b
Compare
5375d91 to
95a8c58
Compare
| null_bit_buffer, | ||
| 0, | ||
| vec![Buffer::from(buffer)], | ||
| let data = unsafe { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Due to apache/arrow-rs#822 (this can lead to unsafe behavior here if the buffers are not correctly created). The alternate is to use the try_new function and skip the (eventual) validation required.
Thoughts?
| let schema = test::aggr_test_schema(); | ||
| let sort = vec

Which issue does this PR close?
Closes #1144
Rationale for this change
Pickup improvements in upstream arrow-rs crate and allow projects downstream of datafusion to use new arrow-rs
This PR is intended for demonstration only; I don't intend to merge this PR as is, but I plan to create a real PR once arrow 6.0 has been released officially.What changes are included in this PR?
Are there any user-facing changes?
The biggest user facing change I think is that sort in arrow-rs is no longer stable, so thus sort in DataFusion will no longer be stable either