
Conversation

AlenkaF (Member) commented May 30, 2022

Added a step to the example that selects the columns explicitly. This fixes the column order, so the doctest should pass.
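
For context, a minimal sketch of the kind of change (the tables and column names here are hypothetical; the actual example lives in the pyarrow docs):

import pyarrow as pa

t1 = pa.table({"id": [1, 2, 3], "year": [2019, 2020, 2021]})
t2 = pa.table({"id": [3, 4], "n_legs": [5, 100], "animals": ["Brittle stars", "Centipede"]})

# Without the select, the column order of the joined table can vary
# between runs; selecting explicitly pins it down so the doctest
# output is stable.
joined = t1.join(t2, keys="id", join_type="inner")
print(joined.select(["id", "year", "n_legs", "animals"]))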

github-actions (bot) commented:

⚠️ Ticket has not been started in JIRA, please click 'Start Progress'.

AlenkaF (Member, Author) commented May 30, 2022

@amol- do you mind reviewing this fix?

amol- (Member) commented May 30, 2022

> @amol- do you mind reviewing this fix?

Would it make sense to add the select for the other kinds of joins too? In theory, all of them have an unpredictable order.

raulcd (Member) commented May 30, 2022

The current doctests seem to have been broken since this commit was merged on master: adb5b00

 _______________ [doctest] pyarrow._dataset.DirectoryPartitioning _______________
EXAMPLE LOCATION UNKNOWN, not showing all tests of that example
??? >>> print(partitioning.parse("/2009/11"))
Expected:
    ((year == 2009) and (month == 11))
Got:
    (year == 2009)

/opt/conda/envs/arrow/lib/python3.9/site-packages/pyarrow/_dataset.cpython-39-x86_64-linux-gnu.so:None: DocTestFailure
_________________ [doctest] pyarrow._dataset.HivePartitioning __________________
EXAMPLE LOCATION UNKNOWN, not showing all tests of that example
??? >>> print(partitioning.parse("/year=2009/month=11"))
Expected:
    ((year == 2009) and (month == 11))
Got:
    (year == 2009)

Is this an expected change or a bug that was introduced? @westonpace @sanjibansg?

amol- (Member) commented May 30, 2022

> The current doctests seem to have been broken since this commit was merged on master: adb5b00 [...]
> Is this an expected change or a bug that was introduced? @westonpace @sanjibansg?

Might it be that it just needs a rebase to master? It seems there were recent fixes to that example -> 9a7cc52

raulcd (Member) commented May 30, 2022

> Might it be that it just needs a rebase to master? It seems there were recent fixes to that example -> 9a7cc52

Maybe, but I don't think so: the last commit on master (f3af2b7) triggered a build for that specific "build sphinx and doctest" job (https://github.com/apache/arrow/runs/6654897048?check_suite_focus=true#step:6:5459), and it is also failing with the same error:


=================================== FAILURES ===================================
_______________ [doctest] pyarrow._dataset.DirectoryPartitioning _______________
EXAMPLE LOCATION UNKNOWN, not showing all tests of that example
??? >>> print(partitioning.parse("/2009/11"))
Expected:
    ((year == 2009) and (month == 11))
Got:
    (year == 2009)

/opt/conda/envs/arrow/lib/python3.9/site-packages/pyarrow/_dataset.cpython-39-x86_64-linux-gnu.so:None: DocTestFailure
_________________ [doctest] pyarrow._dataset.HivePartitioning __________________
EXAMPLE LOCATION UNKNOWN, not showing all tests of that example
??? >>> print(partitioning.parse("/year=2009/month=11"))
Expected:
    ((year == 2009) and (month == 11))
Got:
    (year == 2009)

sanjibansg (Contributor) commented May 30, 2022

> Maybe, but I don't think so: the last commit on master (f3af2b7) triggered a build for that specific "build sphinx and doctest" job, and it is also failing with the same error [...]

It may be because of the changed definition of the Parse() method; see the changed test case here: adb5b00#diff-063096b5f67bfb7437c260bf755259ba4251db3ff7e6aca6ff0cb554181b4461R607. We would need to change the example string for the Parse() method accordingly.

raulcd (Member) commented May 30, 2022

> It may be because of the changed definition of the Parse() method; see the changed test case here: adb5b00#diff-063096b5f67bfb7437c260bf755259ba4251db3ff7e6aca6ff0cb554181b4461R607. We would need to change the example string for the Parse() method accordingly.

Applying this diff fixes it locally:

$ git diff
diff --git a/python/pyarrow/_dataset.pyx b/python/pyarrow/_dataset.pyx
index 83c131c..fbde03f 100644
--- a/python/pyarrow/_dataset.pyx
+++ b/python/pyarrow/_dataset.pyx
@@ -1468,7 +1468,7 @@ cdef class DirectoryPartitioning(KeyValuePartitioning):
     >>> from pyarrow.dataset import DirectoryPartitioning
     >>> partitioning = DirectoryPartitioning(
     ...     pa.schema([("year", pa.int16()), ("month", pa.int8())]))
-    >>> print(partitioning.parse("/2009/11"))
+    >>> print(partitioning.parse("/2009/11/"))
     ((year == 2009) and (month == 11))
     """
 
@@ -1595,7 +1595,7 @@ cdef class HivePartitioning(KeyValuePartitioning):
     >>> from pyarrow.dataset import HivePartitioning
     >>> partitioning = HivePartitioning(
     ...     pa.schema([("year", pa.int16()), ("month", pa.int8())]))
-    >>> print(partitioning.parse("/year=2009/month=11"))
+    >>> print(partitioning.parse("/year=2009/month=11/"))
     ((year == 2009) and (month == 11))
 
     """

@sanjibansg can you confirm that adding the / is required now? We can probably apply this minor fix on this PR.

amol- (Member) commented May 30, 2022

> @sanjibansg can you confirm that adding the / is required now? We can probably apply this minor fix on this PR.

Is this going to impact users? Are they going to face any change in behaviour when upgrading to 9.0.0, or is it just an internal change, given that users usually don't invoke parse directly?

pitrou (Member) commented May 30, 2022

For the record, I understand that rows are in an unpredictable order after a join, but why are the columns in an unpredictable order?

amol- (Member) commented May 30, 2022

> For the record, I understand that rows are in an unpredictable order after a join, but why are the columns in an unpredictable order?

I haven't checked the details of our HashJoinNode implementation, but hash joins usually pick the bigger of the two tables and join the smaller one to it, so the order of columns usually depends on which table is picked first.
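
Not Arrow's actual implementation; a generic sketch of the build/probe pattern described above, with hypothetical data, showing how the output column order can follow the choice of build side:

def hash_join(left, right, key):
    # Build a hash table on the smaller input and probe with the larger;
    # which side becomes the build side depends on the input sizes.
    build, probe = (left, right) if len(left) <= len(right) else (right, left)
    index = {}
    for row in build:
        index.setdefault(row[key], []).append(row)
    out = []
    for probe_row in probe:
        for match in index.get(probe_row[key], []):
            # Build-side columns come first in the merged row, so the
            # output column order changes with the choice of build side.
            out.append({**match, **probe_row})
    return out

left = [{"id": 1, "year": 2020}, {"id": 2, "year": 2021}]
right = [{"id": 1, "n_legs": 5, "animals": "Brittle stars"}]
print(hash_join(left, right, "id"))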

pitrou (Member) commented May 30, 2022

> I haven't checked the details of our HashJoinNode implementation, but hash joins usually pick the bigger of the two tables and join the smaller one to it, so the order of columns usually depends on which table is picked first.

Right, but the examples should be deterministic in any case. Also, the implementation could trivially reorder the output columns so that they are always in the same order regardless of table size. @westonpace Am I wrong?

sanjibansg (Contributor):

> @sanjibansg can you confirm that adding the / is required now? We can probably apply this minor fix on this PR.

Yes, I think it should work now; we need that / at the end for the Parse() method to work in this case.
@amol- This change in the Parse() method was made to handle the different partitioning modes in a better way. The Parse() method now expects the complete path, like "2009/11/1_part.parquet". If it's directory partitioning, it needs the string "2009/11/", but if it's filename partitioning, it needs the string "1_part.parquet". So the Parse() method now internally extracts the required part of the string depending on the partitioning mode.
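
A minimal sketch of the new expectation, matching the diff above (same schema as the docstring example):

import pyarrow as pa
from pyarrow.dataset import DirectoryPartitioning

partitioning = DirectoryPartitioning(
    pa.schema([("year", pa.int16()), ("month", pa.int8())]))

# parse() now receives the complete path; the trailing "/" marks the
# last segment as a directory rather than a file name.
print(partitioning.parse("/2009/11/"))
# ((year == 2009) and (month == 11))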

jorisvandenbossche (Member):

I opened #13269 for the parse() failure.

jorisvandenbossche (Member) commented May 31, 2022

> For the record, I understand that rows are in an unpredictable order after a join, but why are the columns in an unpredictable order?

I was wondering exactly the same; I don't see a reason not to preserve the column order of the input DataFrames.

Some observations from trying to reproduce it locally (which, strangely, first didn't work in an interactive session):

  • Within a single Python session, the behaviour seems to be always the same (either always passing or always failing, which is why I first thought I couldn't reproduce it).
  • It is the column order within the set of columns of the right dataframe that is non-deterministic ("n_legs" and "animals" are the columns that might switch order, and they both come from the right dataframe). So it seems it is not related to which table is picked first.

jorisvandenbossche (Member):

Actually, that made me wonder if this isn't some issue with an unordered set, either in Python or in C++, and indeed for the inner join we are using Python's set, which I suppose might cause this non-deterministic behaviour:

elif join_type == "inner":
    c_join_type = CJoinType_INNER
    right_columns = set(right_columns) - set(right_keys)
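
A quick sketch of why that set difference is a plausible culprit (column names taken from the discussion; the order-preserving variant is just one possible fix, not necessarily what was adopted):

right_columns = ["id", "n_legs", "animals"]
right_keys = ["id"]

# A set discards the original list order, and the iteration order of a
# set of strings can differ between interpreter runs (hash randomization)
# while staying stable within a single session.
unordered = set(right_columns) - set(right_keys)

# An order-preserving alternative:
ordered = [c for c in right_columns if c not in set(right_keys)]
print(ordered)  # ['n_legs', 'animals']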

pitrou (Member) commented May 31, 2022

Right, so it's a bug that needs fixing rather than something to work around in the doc examples.

AlenkaF (Member, Author) commented May 31, 2022

Agree. Will close this PR and change the title of the JIRA issue.

AlenkaF closed this May 31, 2022
westonpace (Member) commented May 31, 2022

For future reference, the column output order of the hash join should be deterministic. There may be cases where we switch the sides or join things in a different order for performance reasons. However, if the execution engine does this, it should always restore the order before sending any batches to a sink.

AlenkaF deleted the ARROW-16685 branch September 16, 2022