
Conversation

AlenkaF (Member) commented May 30, 2022

Added a step to the example that selects the columns explicitly. This fixes the column order, so the doctest should pass.
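
For context, a minimal sketch of the kind of change (the tables and column names here are hypothetical; the actual example lives in the pyarrow docs):

import pyarrow as pa

t1 = pa.table({"id": [1, 2, 3], "year": [2019, 2020, 2021]})
t2 = pa.table({"id": [3, 4], "n_legs": [5, 100], "animals": ["Brittle stars", "Centipede"]})

# Without the select, the column order of the joined table can vary
# between runs; selecting explicitly pins it down so the doctest
# output is stable.
joined = t1.join(t2, keys="id", join_type="inner")
print(joined.select(["id", "year", "n_legs", "animals"]))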

github-actions (bot) commented:

⚠️ Ticket has not been started in JIRA, please click 'Start Progress'.

AlenkaF (Member, Author) commented May 30, 2022

@amol- do you mind reviewing this fix?

amol- (Member) commented May 30, 2022

> @amol- do you mind reviewing this fix?

Would it make sense to add the select for the other kinds of joins too? In theory, all of them have an unpredictable order.

raulcd (Member) commented May 30, 2022

The current doctests seem to have been broken since this commit was merged on master: adb5b00

 _______________ [doctest] pyarrow._dataset.DirectoryPartitioning _______________
EXAMPLE LOCATION UNKNOWN, not showing all tests of that example
??? >>> print(partitioning.parse("/2009/11"))
Expected:
    ((year == 2009) and (month == 11))
Got:
    (year == 2009)

/opt/conda/envs/arrow/lib/python3.9/site-packages/pyarrow/_dataset.cpython-39-x86_64-linux-gnu.so:None: DocTestFailure
_________________ [doctest] pyarrow._dataset.HivePartitioning __________________
EXAMPLE LOCATION UNKNOWN, not showing all tests of that example
??? >>> print(partitioning.parse("/year=2009/month=11"))
Expected:
    ((year == 2009) and (month == 11))
Got:
    (year == 2009)

Is this an expected change or a bug that was introduced? @westonpace @sanjibansg?

amol- (Member) commented May 30, 2022

> The current doctests seem to have been broken since this commit was merged on master: adb5b00 [...]
> Is this an expected change or a bug that was introduced? @westonpace @sanjibansg?

Might it be that it just needs a rebase to master? It seems there were recent fixes to that example -> 9a7cc52

raulcd (Member) commented May 30, 2022

> Might it be that it just needs a rebase to master? It seems there were recent fixes to that example -> 9a7cc52

Maybe, but I don't think so: the last commit on master (f3af2b7) triggered a build for that specific "build sphinx and doctest" job (https://github.com/apache/arrow/runs/6654897048?check_suite_focus=true#step:6:5459), and it is also failing with the same error:


=================================== FAILURES ===================================
_______________ [doctest] pyarrow._dataset.DirectoryPartitioning _______________
EXAMPLE LOCATION UNKNOWN, not showing all tests of that example
??? >>> print(partitioning.parse("/2009/11"))
Expected:
    ((year == 2009) and (month == 11))
Got:
    (year == 2009)

/opt/conda/envs/arrow/lib/python3.9/site-packages/pyarrow/_dataset.cpython-39-x86_64-linux-gnu.so:None: DocTestFailure
_________________ [doctest] pyarrow._dataset.HivePartitioning __________________
EXAMPLE LOCATION UNKNOWN, not showing all tests of that example
??? >>> print(partitioning.parse("/year=2009/month=11"))
Expected:
    ((year == 2009) and (month == 11))
Got:
    (year == 2009)

sanjibansg (Contributor) commented May 30, 2022

> Maybe, but I don't think so: the last commit on master (f3af2b7) triggered a build for that specific "build sphinx and doctest" job, and it is also failing with the same error [...]

It may be because of the changed definition of the Parse() method; see the changed test case here: adb5b00#diff-063096b5f67bfb7437c260bf755259ba4251db3ff7e6aca6ff0cb554181b4461R607. We would need to change the example string for the Parse() method accordingly.

raulcd (Member) commented May 30, 2022

> It may be because of the changed definition of the Parse() method; see the changed test case here: adb5b00#diff-063096b5f67bfb7437c260bf755259ba4251db3ff7e6aca6ff0cb554181b4461R607. We would need to change the example string for the Parse() method accordingly.

Applying this diff fixes it locally:

$ git diff
diff --git a/python/pyarrow/_dataset.pyx b/python/pyarrow/_dataset.pyx
index 83c131c..fbde03f 100644
--- a/python/pyarrow/_dataset.pyx
+++ b/python/pyarrow/_dataset.pyx
@@ -1468,7 +1468,7 @@ cdef class DirectoryPartitioning(KeyValuePartitioning):
     >>> from pyarrow.dataset import DirectoryPartitioning
     >>> partitioning = DirectoryPartitioning(
     ...     pa.schema([("year", pa.int16()), ("month", pa.int8())]))
-    >>> print(partitioning.parse("/2009/11"))
+    >>> print(partitioning.parse("/2009/11/"))
     ((year == 2009) and (month == 11))
     """
 
@@ -1595,7 +1595,7 @@ cdef class HivePartitioning(KeyValuePartitioning):
     >>> from pyarrow.dataset import HivePartitioning
     >>> partitioning = HivePartitioning(
     ...     pa.schema([("year", pa.int16()), ("month", pa.int8())]))
-    >>> print(partitioning.parse("/year=2009/month=11"))
+    >>> print(partitioning.parse("/year=2009/month=11/"))
     ((year == 2009) and (month == 11))
 
     """

@sanjibansg can you confirm that adding the / is required now? We can probably apply this minor fix on this PR.

amol- (Member) commented May 30, 2022

> @sanjibansg can you confirm that adding the / is required now? We can probably apply this minor fix on this PR.

Is this going to impact users? Are they going to face any change in behaviour when upgrading to 9.0.0, or is it just an internal change, given that users usually don't invoke parse directly?

pitrou (Member) commented May 30, 2022

For the record, I understand that rows are in an unpredictable order after a join, but why are the columns in an unpredictable order?

amol- (Member) commented May 30, 2022

> For the record, I understand that rows are in an unpredictable order after a join, but why are the columns in an unpredictable order?

I haven't checked the details of our HashJoinNode implementation, but hash joins usually pick the bigger of the two tables and join the smaller one to it, so the order of columns usually depends on which table is picked first.
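
Not Arrow's actual implementation; a generic sketch of the build/probe pattern described above, with hypothetical data, showing how the output column order can follow the choice of build side:

def hash_join(left, right, key):
    # Build a hash table on the smaller input and probe with the larger;
    # which side becomes the build side depends on the input sizes.
    build, probe = (left, right) if len(left) <= len(right) else (right, left)
    index = {}
    for row in build:
        index.setdefault(row[key], []).append(row)
    out = []
    for probe_row in probe:
        for match in index.get(probe_row[key], []):
            # Build-side columns come first in the merged row, so the
            # output column order changes with the choice of build side.
            out.append({**match, **probe_row})
    return out

left = [{"id": 1, "year": 2020}, {"id": 2, "year": 2021}]
right = [{"id": 1, "n_legs": 5, "animals": "Brittle stars"}]
print(hash_join(left, right, "id"))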

pitrou (Member) commented May 30, 2022

> I haven't checked the details of our HashJoinNode implementation, but hash joins usually pick the bigger of the two tables and join the smaller one to it, so the order of columns usually depends on which table is picked first.

Right, but the examples should be deterministic in any case. Also, the implementation could trivially reorder the output columns so that they are always in the same order regardless of table size. @westonpace Am I wrong?

sanjibansg (Contributor):

> @sanjibansg can you confirm that adding the / is required now? We can probably apply this minor fix on this PR.

Yes, I think it should work now; we need that / at the end for the Parse() method to work in this case.
@amol- This change in the Parse() method was made to handle the different partitioning modes in a better way. The Parse() method now expects the complete path, like "2009/11/1_part.parquet". If it's directory partitioning, it needs the string "2009/11/", but if it's filename partitioning, it needs the string "1_part.parquet". So the Parse() method now internally extracts the required part of the string depending on the partitioning mode.
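
A minimal sketch of the new expectation, matching the diff above (same schema as the docstring example):

import pyarrow as pa
from pyarrow.dataset import DirectoryPartitioning

partitioning = DirectoryPartitioning(
    pa.schema([("year", pa.int16()), ("month", pa.int8())]))

# parse() now receives the complete path; the trailing "/" marks the
# last segment as a directory rather than a file name.
print(partitioning.parse("/2009/11/"))
# ((year == 2009) and (month == 11))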

jorisvandenbossche (Member):

I opened #13269 for the parse() failure.

jorisvandenbossche (Member) commented May 31, 2022

> For the record, I understand that rows are in an unpredictable order after a join, but why are the columns in an unpredictable order?

I was wondering exactly the same; I don't see a reason not to preserve the column order of the input DataFrames.

Some observations from trying to reproduce it locally (which, strangely, first didn't work in an interactive session):

  • Within a single Python session, the behaviour seems to be always the same (either always passing or always failing, which is why I first thought I couldn't reproduce it).
  • It is the column order within the set of columns of the right dataframe that is non-deterministic ("n_legs" and "animals" are the columns that might switch order, and they both come from the right dataframe). So it seems it is not related to which table is picked first.

jorisvandenbossche (Member):

Actually, that made me wonder if this isn't some issue with an unordered set, either in Python or in C++, and indeed for the inner join we are using Python's set, which I suppose might cause this non-deterministic behaviour:

elif join_type == "inner":
    c_join_type = CJoinType_INNER
    right_columns = set(right_columns) - set(right_keys)
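
A quick sketch of why that set difference is a plausible culprit (column names taken from the discussion; the order-preserving variant is just one possible fix, not necessarily what was adopted):

right_columns = ["id", "n_legs", "animals"]
right_keys = ["id"]

# A set discards the original list order, and the iteration order of a
# set of strings can differ between interpreter runs (hash randomization)
# while staying stable within a single session.
unordered = set(right_columns) - set(right_keys)

# An order-preserving alternative:
ordered = [c for c in right_columns if c not in set(right_keys)]
print(ordered)  # ['n_legs', 'animals']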

pitrou (Member) commented May 31, 2022

Right, so it's a bug that needs fixing rather than something to work around in the doc examples.

AlenkaF (Member, Author) commented May 31, 2022

Agree. Will close this PR and change the title of the JIRA issue.

AlenkaF closed this May 31, 2022
westonpace (Member) commented May 31, 2022

For future reference, the column output order of the hash join should be deterministic. There may be cases where we switch the sides or join things in a different order for performance reasons. However, if the execution engine does this, it should always restore the order before sending any batches to a sink.

AlenkaF deleted the ARROW-16685 branch September 16, 2022