ARROW-16685: [Python] Failing docstring example in Table.join #13260
@amol- do you mind reviewing this fix?
Would it make sense to add the select for the other kinds of joins too? In theory all of them have unpredictable order.
The doctests seem to have been failing since this commit was merged on master: adb5b00. Is this an expected change, or was a bug introduced? @westonpace @sanjibansg
Might it be that it just needs a rebase onto master? It seems there were recent fixes to that example -> 9a7cc52
Maybe, but I don't think so: the last commit on master (f3af2b7) triggered the Sphinx build and doctest job (https://github.com/apache/arrow/runs/6654897048?check_suite_focus=true#step:6:5459), and it is also failing with the same error:
It may be because of the changed definition of the
Applying this diff fixes it locally: @sanjibansg can you confirm that adding the
Is this going to impact users? Will they face any change in behaviour when upgrading to 9.0.0, or is it just an internal change, given that users usually don't invoke
For the record, I understand that rows are in an unpredictable order after a join, but why are the columns in an unpredictable order?
Haven't checked the details of our HashJoinNode implementation, but hash joins usually start by picking the bigger of the two tables and joining the smaller to it. So the order of columns usually depends on which table is picked first.
Right, but the examples should be deterministic in any case. Also, the implementation could trivially reorder the output columns so that they are always in the same order regardless of table size. @westonpace Am I wrong?
Yes, I think it should work now, we need that
I opened #13269 for the
I was wondering exactly the same; I don't see a reason not to preserve the column order of the input DataFrames. Some observations from trying to reproduce it locally (which, strangely, didn't work at first in an interactive session):
Actually, that made me wonder whether this is an issue with an unordered set in either Python or C++. Indeed, for the inner join we are using Python's set, which I suppose might cause this nondeterministic behaviour? arrow/python/pyarrow/_exec_plan.pyx Lines 260 to 262 in 4847b85
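To illustrate the suspicion above, here is a minimal pure-Python sketch of how collecting column names through a `set` can lose their order, and one order-preserving alternative. It is not the code from `_exec_plan.pyx`; the helper `unique_in_order` and the column lists are hypothetical and only mirror the docstring example's shape.

```python
def unique_in_order(names):
    """Drop duplicate names while keeping first-seen order."""
    # dict preserves insertion order (guaranteed since Python 3.7).
    return list(dict.fromkeys(names))


# Hypothetical column lists for the two join inputs.
left_columns = ["id", "year"]
right_columns = ["id", "n_legs", "animal"]

# A set removes the duplicate "id" but makes no ordering guarantee. For
# strings, iteration order can differ between interpreter runs because
# string hashing is randomized per process unless PYTHONHASHSEED is fixed.
unordered = list(set(left_columns + right_columns))

# The dict-based helper always yields the declared order.
ordered = unique_in_order(left_columns + right_columns)
print(unordered)
print(ordered)
```

Within a single process the set order stays the same, so a failure of this kind may only show up when comparing separate runs; that could be one reason it is hard to reproduce in a single interactive session.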
Right, so it's a bug that needs fixing rather than something to work around in the doc examples.
Agreed. Will close this PR and change the title of the JIRA issue.
For future reference, the column output order of the hash join should be deterministic. There may be cases where we switch the sides or join things in a different order for performance reasons. However, if the execution engine does this, it should always restore the order before sending any batches to a sink.
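The idea of swapping sides internally and restoring the order before output can be sketched with a toy hash join in plain Python. This is an illustration under simplifying assumptions (rows as dicts, non-empty inputs, a single key column), not Arrow's hash join implementation; `inner_join` and the sample rows are hypothetical.

```python
def inner_join(left, right, key):
    """Toy inner hash join over lists of dict rows (illustration only).

    Assumes both inputs are non-empty and every row has the same columns.
    """
    # Declared output order: left columns first, then right columns minus the key.
    output_columns = list(left[0]) + [c for c in right[0] if c != key]

    # Build the hash table on the smaller input and probe with the larger one.
    build, probe = (left, right) if len(left) <= len(right) else (right, left)
    table = {}
    for row in build:
        table.setdefault(row[key], []).append(row)

    joined = []
    for probe_row in probe:
        for build_row in table.get(probe_row[key], []):
            # The merged column order depends on which side was the build side...
            merged = {**build_row, **probe_row}
            # ...so restore the declared order before emitting the row.
            joined.append({name: merged[name] for name in output_columns})
    return joined


left = [{"id": 1, "year": 2020}, {"id": 2, "year": 2022}, {"id": 3, "year": 2019}]
right = [
    {"id": 3, "n_legs": 5, "animal": "Brittle stars"},
    {"id": 4, "n_legs": 100, "animal": "Centipede"},
]
rows = inner_join(left, right, "id")
print(rows)
```

Whichever input ends up as the build side, the emitted rows carry the columns in the declared order, which is the property the comment above asks of the execution engine.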
Added a step in the example where the columns are selected. This way the order of the columns is fixed and the doctest should pass.