Conversation

@vibhatha vibhatha commented Mar 19, 2022

This PR includes the initial integration of Substrait into Python:

  • Adding a util API for consuming Substrait
  • Adding a C++ example for using Substrait with the util API
  • Adding Python bindings for Substrait using the util API
  • Adding CMake changes to integrate the engine module (experimental): comments and suggestions to improve this component are much appreciated
  • Adding a Python example to consume a Substrait plan (experimental)

@github-actions

⚠️ Ticket has not been started in JIRA, please click 'Start Progress'.

@lidavidm
Member

Hmm, did you mean to delete the testing submodule?

@vibhatha
Contributor Author

@lidavidm No, I accidentally checked in files from the testing submodule; I just wanted to remove those.

@vibhatha vibhatha force-pushed the arrow-15779 branch 3 times, most recently from 59d2e12 to 4f97960 Compare March 23, 2022 04:47
@vibhatha vibhatha changed the title ARROW-15779: [Python] Create python bindings for Substrait consumer [WIP] ARROW-15779: [Python] Create python bindings for Substrait consumer Mar 24, 2022
@vibhatha vibhatha marked this pull request as ready for review March 24, 2022 09:54
return Status::OK();
}

Future<> SubstraitSinkConsumer::Finish() {
Member

Isn't there already a sink that outputs to a reader? Why do we need a custom implementation here?

Contributor Author

I know there is a sink that outputs a std::shared_ptr<arrow::Table>. Could you please point me to that implementation? I might have missed it.

@westonpace (Member), Mar 25, 2022

My guess is that you are thinking of SinkNode which is very similar to this class.

Right now the Substrait consumer always uses a ConsumingSinkNode and thus it needs a "consumer factory".

Another potential implementation would be for the Substrait to take in a "sink node factory" instead (or we could have both implementations). That might be more flexible in the long term. In that case we could reuse SinkNode here.

So we have SinkNode which is a "node that shoves batches into a push generator" and we have SubstraitSinkConsumer which is a "consumer that shoves batches into a push generator".

Member

We might want to support both as well as a consumer is much easier to implement than a node.

Member

Just to be clear, that last comment was about potential modifications to the Substrait consumer (e.g. we might want the Substrait consumer to support both a consumer factory API and a sink node factory API)

Contributor Author

@westonpace good point. In this PR, should we continue with the ConsumingSink approach, and think about supporting a factory approach in a follow-up PR?

Member

Yes, let's keep this PR focused on the ConsumingSink approach. We can worry about other changes later.

Contributor Author

Should we create a JIRA for that one or should it be an open discussion before it becomes a ticket? Interested in that piece :)

Member

Please create a ticket. It can be our place for open discussion

Contributor Author

@westonpace created a ticket here for open discussion: https://issues.apache.org/jira/browse/ARROW-16036

I'd like to work on that one. I think the usability of this PR can be further improved with that integration.

ASSERT_OK_AND_ASSIGN(auto reader, engine::SubstraitExecutor::GetRecordBatchReader(
substrait_json, in_schema));
ASSERT_OK_AND_ASSIGN(auto table, Table::FromRecordBatchReader(reader.get()));
EXPECT_GT(table->num_rows(), 0);
Member

Shouldn't we know the number of expected rows?

Contributor Author

That means we would need to know the exact answer to the query, which depends on the data file, right? Assuming the file is static, we can set an expected count. What do you think?

Contributor Author

Alternatively, we can read the file directly using the Parquet API and check the values.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can probably just assume the # of rows is static.

Member

Ping?

Contributor Author

WIP

Contributor Author

WIP

Member

Do you plan to address this or not? It's not a big deal either way

Contributor Author

Yes, I am. Will update it today.

Contributor Author

cc @lidavidm I updated the test case and added a note that modifying the test file could cause a test failure. I agree that checking the exact count is better than checking for a non-zero value; thank you for pointing this out in the first place. In Python we verify this to the letter using the Parquet reader, but using the Parquet reader here seemed like overkill. So one way or the other, the objective is covered :)

@westonpace westonpace self-requested a review April 4, 2022 20:59
@westonpace (Member) left a comment

Thanks for working on this; I've added some thoughts on the C++ implementation.


Parameters
---------
plan : bytes
Member

(The docstring still doesn't describe this parameter fully.)

@westonpace (Member) left a comment

This is getting close. I think the overall approach looks good now. Just a few cleanups:

Also, is this comment resolved? https://github.com/apache/arrow/pull/12672/files#r836413387

std::string& substrait_json);

/// \brief Retrieve a RecordBatchReader from a Substrait plan in Buffer.
ARROW_ENGINE_EXPORT Result<std::shared_ptr<RecordBatchReader>> GetRecordBatchReader(
Member

Personally I'd prefer something like ExecuteJsonPlan and ExecuteSerializedPlan over overloads

Member

And again, I think any use of Buffer here in a parameter, method name, or docstring is confusing at best; we should clarify what exactly it's meant to be.

// Path is supposed to start with "/..."
file_path = "file://" + file_path;
#endif
std::cout << "File Path : >>>>" << file_path << std::endl;
Member

Remove this print?

Contributor Author

Yes, of course. I'm still debugging the Windows CI breakage; I'll remove it after the fix.

ASSERT_OK_AND_ASSIGN(auto reader, engine::SubstraitExecutor::GetRecordBatchReader(
substrait_json, in_schema));
ASSERT_OK_AND_ASSIGN(auto table, Table::FromRecordBatchReader(reader.get()));
EXPECT_GT(table->num_rows(), 0);
Member

Ping?

@lidavidm
Member

All tests are failing, seems like it needs to be rebased against Weston's recent refactoring?

@vibhatha
Contributor Author

> All tests are failing, seems like it needs to be rebased against Weston's recent refactoring?

Thank you. Let me do that.

@vibhatha
Contributor Author

cc @lidavidm @westonpace following this refactor

option(PYARROW_BUILD_ENGINE "Build the PyArrow Engine integration" OFF)

should be changed to

option(PYARROW_BUILD_SUBSTRAIT "Build the PyArrow Substrait integration" OFF)

@lidavidm
Member

There are still test failures. Can we just do the 'dumb' thing that seems to work?

const char* get_data_dir() {
  const auto result = std::getenv("PARQUET_TEST_DATA");
  if (!result || !result[0]) {
    throw ParquetTestException(
        "Please point the PARQUET_TEST_DATA environment "
        "variable to the test data directory");
  }
  return result;
}

std::string get_bad_data_dir() {
  // PARQUET_TEST_DATA should point to ARROW_HOME/cpp/submodules/parquet-testing/data
  // so need to reach one folder up to access the "bad_data" folder.
  std::string data_dir(get_data_dir());
  std::stringstream ss;
  ss << data_dir << "/../bad_data";
  return ss.str();
}

std::string get_data_file(const std::string& filename, bool is_good) {
  std::stringstream ss;
  if (is_good) {
    ss << get_data_dir();
  } else {
    ss << get_bad_data_dir();
  }
  ss << "/" << filename;
  return ss.str();
}

Or else looks like we have to fix arrow::internal::PlatformFilename under MinGW.

@lidavidm
Member

Also @vibhatha, it is possible to set up MinGW under a Windows VM so that you can debug more quickly…unless you only have an ARM machine?

@vibhatha
Contributor Author

@lidavidm Yes, I only have an ARM machine.
I didn't add a fix for Windows yet, just resolved the conflicts. But I'm trying it now.

@ursabot

ursabot commented May 21, 2022

Benchmark runs are scheduled for baseline = cc2265a and contender = c544a8b. c544a8b is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Failed ⬇️0.27% ⬆️0.0%] test-mac-arm
[Failed ⬇️0.37% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️0.79% ⬆️0.0%] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] c544a8bb ec2-t3-xlarge-us-east-2
[Failed] c544a8bb test-mac-arm
[Failed] c544a8bb ursa-i9-9960x
[Finished] c544a8bb ursa-thinkcentre-m75q
[Finished] cc2265a3 ec2-t3-xlarge-us-east-2
[Failed] cc2265a3 test-mac-arm
[Failed] cc2265a3 ursa-i9-9960x
[Finished] cc2265a3 ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
