Conversation

@vibhatha vibhatha commented Mar 19, 2022

This PR includes the initial integration of Substrait into Python:

  • Adding a util API for consuming Substrait
  • Adding a C++ example for using Substrait with the util API
  • Adding Python bindings for Substrait using the util API
  • Adding CMake changes to integrate the engine module (experimental): comments and suggestions to improve this component are much appreciated
  • Adding a Python example to consume a Substrait plan (experimental)

@github-actions

⚠️ Ticket has not been started in JIRA, please click 'Start Progress'.

@lidavidm
Member

Hmm, did you mean to delete the testing submodule?

@vibhatha
Contributor Author

@lidavidm No, I accidentally checked in files from the testing submodule; I just wanted to remove those.

@vibhatha vibhatha force-pushed the arrow-15779 branch 3 times, most recently from 59d2e12 to 4f97960 Compare March 23, 2022 04:47
@vibhatha vibhatha changed the title ARROW-15779: [Python] Create python bindings for Substrait consumer [WIP] ARROW-15779: [Python] Create python bindings for Substrait consumer Mar 24, 2022
@vibhatha vibhatha marked this pull request as ready for review March 24, 2022 09:54
return Status::OK();
}

Future<> SubstraitSinkConsumer::Finish() {
Member

Isn't there already a sink that outputs to a reader? Why do we need a custom implementation here?

Contributor Author

I know there is a sink that outputs a std::shared_ptr<arrow::Table>. Could you please point me to that implementation? I might have missed it.

@westonpace (Member), Mar 25, 2022

My guess is that you are thinking of SinkNode which is very similar to this class.

Right now the Substrait consumer always uses a ConsumingSinkNode and thus it needs a "consumer factory".

Another potential implementation would be for the Substrait to take in a "sink node factory" instead (or we could have both implementations). That might be more flexible in the long term. In that case we could reuse SinkNode here.

So we have SinkNode which is a "node that shoves batches into a push generator" and we have SubstraitSinkConsumer which is a "consumer that shoves batches into a push generator".

Member

We might want to support both as well as a consumer is much easier to implement than a node.

Member

Just to be clear, that last comment was about potential modifications to the Substrait consumer (e.g. we might want the Substrait consumer to support both a consumer factory API and a sink node factory API)

Contributor Author

@westonpace good point. In this PR, should we continue with the ConsumingSink approach, and think about supporting a factory approach in a follow-up PR?

Member

Yes, let's keep this PR focused on the ConsumingSink approach. We can worry about other changes later.

Contributor Author

Should we create a JIRA for that one or should it be an open discussion before it becomes a ticket? Interested in that piece :)

Member

Please create a ticket. It can be our place for open discussion

Contributor Author

@westonpace created a ticket here for open discussion: https://issues.apache.org/jira/browse/ARROW-16036

I'd like to work on that one. I think the usability of this PR can be further improved with that integration.

ASSERT_OK_AND_ASSIGN(auto reader, engine::SubstraitExecutor::GetRecordBatchReader(
substrait_json, in_schema));
ASSERT_OK_AND_ASSIGN(auto table, Table::FromRecordBatchReader(reader.get()));
EXPECT_GT(table->num_rows(), 0);
Member

Shouldn't we know the number of expected rows?

Contributor Author

That means we would need to know the exact answer to the query, which depends on the data file, right? Assuming the file is static, we can set an expected count. What do you think?

Contributor Author

Alternatively, we can read the file directly using the Parquet API and check the values.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can probably just assume the # of rows is static.

Member

Ping?

Contributor Author

WIP

Contributor Author

WIP

Member

Do you plan to address this or not? It's not a big deal either way

Contributor Author

Yes, I am. Will update it today.

Contributor Author

cc @lidavidm I updated the test case and added a note that modifying the test file could cause a test failure. I agree that checking the exact count is better than checking for a non-zero value; thank you for pointing this out in the first place. In Python we verify this to the letter using the Parquet reader, but using the Parquet reader here seemed like overkill. So one way or the other, the objective is covered :)

@westonpace westonpace self-requested a review April 4, 2022 20:59
@westonpace (Member) left a comment

Thanks for working on this; I've added some thoughts on the C++ implementation.


Parameters
---------
plan : bytes
Member

(The docstring still doesn't describe this parameter fully.)

@westonpace (Member) left a comment

This is getting close. I think the overall approach looks good now. Just a few cleanups:

Also, is this comment resolved? https://github.com/apache/arrow/pull/12672/files#r836413387

std::string& substrait_json);

/// \brief Retrieve a RecordBatchReader from a Substrait plan in Buffer.
ARROW_ENGINE_EXPORT Result<std::shared_ptr<RecordBatchReader>> GetRecordBatchReader(
Member

Personally I'd prefer something like ExecuteJsonPlan and ExecuteSerializedPlan over overloads

Member

And again, I think any use of Buffer here in a parameter, method name, or docstring is confusing at best; we should clarify what exactly it's meant to be.

// Path is supposed to start with "/..."
file_path = "file://" + file_path;
#endif
std::cout << "File Path : >>>>" << file_path << std::endl;
Member

Remove this print?

Contributor Author

Yes, of course. I'm still debugging the Windows CI breakage; I'll remove it after the fix.

ASSERT_OK_AND_ASSIGN(auto reader, engine::SubstraitExecutor::GetRecordBatchReader(
substrait_json, in_schema));
ASSERT_OK_AND_ASSIGN(auto table, Table::FromRecordBatchReader(reader.get()));
EXPECT_GT(table->num_rows(), 0);
Member

Ping?

@lidavidm
Member

All tests are failing, seems like it needs to be rebased against Weston's recent refactoring?

@vibhatha
Contributor Author

> All tests are failing, seems like it needs to be rebased against Weston's recent refactoring?

Thank you. Let me do that.

@vibhatha
Contributor Author

cc @lidavidm @westonpace following this refactor

option(PYARROW_BUILD_ENGINE "Build the PyArrow Engine integration" OFF)

should be changed to

option(PYARROW_BUILD_SUBSTRAIT "Build the PyArrow Substrait integration" OFF)

@lidavidm
Member

There are still test failures. Can we just do the 'dumb' thing that seems to work?

const char* get_data_dir() {
  const auto result = std::getenv("PARQUET_TEST_DATA");
  if (!result || !result[0]) {
    throw ParquetTestException(
        "Please point the PARQUET_TEST_DATA environment "
        "variable to the test data directory");
  }
  return result;
}

std::string get_bad_data_dir() {
  // PARQUET_TEST_DATA should point to ARROW_HOME/cpp/submodules/parquet-testing/data
  // so need to reach one folder up to access the "bad_data" folder.
  std::string data_dir(get_data_dir());
  std::stringstream ss;
  ss << data_dir << "/../bad_data";
  return ss.str();
}

std::string get_data_file(const std::string& filename, bool is_good) {
  std::stringstream ss;
  if (is_good) {
    ss << get_data_dir();
  } else {
    ss << get_bad_data_dir();
  }
  ss << "/" << filename;
  return ss.str();
}

Or else looks like we have to fix arrow::internal::PlatformFilename under MinGW.

@lidavidm
Member

Also @vibhatha, it is possible to set up MinGW under a Windows VM so that you can debug more quickly…unless you only have an ARM machine?

@vibhatha
Contributor Author

@lidavidm Yes, I only have an ARM machine.
I didn't add a fix for Windows yet, just resolved the conflicts. But I'm trying it now.

@ursabot

ursabot commented May 21, 2022

Benchmark runs are scheduled for baseline = cc2265a and contender = c544a8b. c544a8b is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Failed ⬇️0.27% ⬆️0.0%] test-mac-arm
[Failed ⬇️0.37% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️0.79% ⬆️0.0%] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] c544a8bb ec2-t3-xlarge-us-east-2
[Failed] c544a8bb test-mac-arm
[Failed] c544a8bb ursa-i9-9960x
[Finished] c544a8bb ursa-thinkcentre-m75q
[Finished] cc2265a3 ec2-t3-xlarge-us-east-2
[Failed] cc2265a3 test-mac-arm
[Failed] cc2265a3 ursa-i9-9960x
[Finished] cc2265a3 ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
