Skip to content

Conversation

@bkietz
Copy link
Member

@bkietz bkietz commented Feb 26, 2021

No description provided.

@github-actions
Copy link

Copy link
Member

@westonpace westonpace left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here's a quick look. I think my main concern is the potential race condition in Pop. The other comments are more of "I'm about to replace this with something else so lets not worry"

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll be replacing this with something better so I don't know how much we care to worry but this is not ideal. For example, with parquet, this would fetch metadata for every file in the scan before starting to read any individual file. It introduces more latency than necessary.

Also, I'm not sure how this will interact with parquet preloading.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, what I'm working on will work around this so maybe not stress at the moment but there's no back-pressure here. If the batch consumer is not fast enough and the dataset is larger than RAM the system will run out of RAM.

It's a bit odd because you are fixing ARROW-11800 here (though without pressure) and then I'll be breaking it again with my implementation (the first pass of my impl will have back pressure and parse loaded buffers a bit more serially)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why both Scan(visitor) and ToBatches? Couldn't you just do ToBatches().Visit(visitor)?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The difference is ToBatches().Visit(visitor) would invoke visitor exclusively on the thread which called Visit() whereas Scan(visitor) invokes visitor in the scan's thread pool. This method is speculative; I'm not sure we'd want to provide that but I included it as an example

@bkietz bkietz force-pushed the 11797-Provide-Scanner-methods-t branch 2 times, most recently from 522aca4 to 08aec2f Compare March 1, 2021 20:36
@lidavidm lidavidm force-pushed the 11797-Provide-Scanner-methods-t branch from 90818b2 to ba8c952 Compare April 7, 2021 18:56
@lidavidm
Copy link
Member

lidavidm commented Apr 7, 2021

I've updated this in case we change how to proceed with ARROW-7001:

  • Rebased against master
  • Added Python/R deprecation warnings for Scan(), since either way, we're planning to remove ScanTask in 5.0
  • Changed the iterator so that scan tasks are materialized and executed iteratively, to better handle larger-than-memory datasets
  • Removed Scan(visitor) to reduce API surface (we can add it if the need comes up)

Copy link
Member

@westonpace westonpace left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is very cool. Now I just need to provide the V2 :)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Style nit: member variables should be at the bottom of a struct:

https://google.github.io/styleguide/cppguide.html#Declaration_Order

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not entirely certain it is safe to modify iteration_error outside the mutex. What happens if Pop is accessing it at the same time?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is also not safe to do outside the mutex.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: best to relinquish the mutex before calling notify

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I'm just not seeing it but what causes the above loop to exit if there is an error scanning?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Eventually we'll run out of scan tasks/batches, since an error in getting the next scan task won't stop the current scan tasks from eventually completing. So really it's "all scan tasks drained (and maybe we didn't start all of them due to a failure)"

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: If I see ToBatches I expect RecordBatchVector. I had named it ScanBatches but I don't feel too strongly on this point.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You have a point. I didn't want to take ScanBatches since your method has a different signature. But maybe that could be ScanIndexedBatches or something.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, the issue I ran into was that ScanBatches was used by FileSystemDataset::Write and it needed the fragment info in order to have access to the fragment's partition expression. So at a minimum I needed to return "record batch & partition it came from".

I think there was some discussion (either on the ML or some JIRA/PR) about the benefit of keeping the fragment available as the user might want to know where the batch came from.

Can you modify the ScanBatches here to return a RecordBatch/Fragment pair? I could align my ScanBatches with that (PositionedRecordBatch is overkill).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider making a parameterized test instead so it is easier to trace failures if needed?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this just testing a scan of one scan task & one batch? It seems we would want to test more than that.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

MakeScanner generates a union dataset of 2 InMemoryDatasets each of which repeats the batch 16 times so we should have 32 scan tasks total.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At the very least we should create a JIRA to migrate these over to the new scan. Although I'm wondering if we want to just do that now because that would give us a lot more coverage of the non-deprecated path.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll do it now.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Does this comment still make sense?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nope! :)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe this is a naive question but why are there two versions of TakeRows?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, Ben split it into two functions (a pure-Arrow implementation, and the R binding) presumably for convenience. But maybe we can just port the implementation into the C++ library and expose it to Python as well?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is nice but probably not necessary, this was a (de facto) internal method. If it were me I'd just delete it, but since you've already done this, might as well keep it. Can you please just make a JIRA to delete this so we don't forget to clean this up after the release?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We've been using ARROW-11782 so I'll note that in this PR and update the issue.

@lidavidm lidavidm force-pushed the 11797-Provide-Scanner-methods-t branch 4 times, most recently from 8e494c9 to 52f146e Compare April 9, 2021 21:44
@lidavidm
Copy link
Member

lidavidm commented Apr 9, 2021

@westonpace @bkietz this should be ready now, if either of you wants to take another look.

@lidavidm lidavidm force-pushed the 11797-Provide-Scanner-methods-t branch from 70c7bb8 to 5e885cd Compare April 12, 2021 18:15
@lidavidm
Copy link
Member

I restored the Scan(Visitor) overload that Ben pointed out in ARROW-12288, updated to use TaggedRecordBatch.

@lidavidm lidavidm force-pushed the 11797-Provide-Scanner-methods-t branch 2 times, most recently from 2f09670 to 67ed5a8 Compare April 13, 2021 13:35
@lidavidm
Copy link
Member

Rebased (unfortunately, I had to squash all commits or else the rebase would've been a pain).

@lidavidm lidavidm force-pushed the 11797-Provide-Scanner-methods-t branch from 67ed5a8 to dc97bb1 Compare April 13, 2021 14:22
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a JIRA for pushing down the index predicate into the scan?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I filed ARROW-12369.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would you not want to skip empty arrays (where length == 0)?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, that also exposed a bug (or well, poor error message) if all indices were out-of-bounds.

@lidavidm lidavidm force-pushed the 11797-Provide-Scanner-methods-t branch from 65f72a2 to ad14ea5 Compare April 14, 2021 22:00
@lidavidm
Copy link
Member

I am not sure why the JNI test is having so much trouble but it passes locally under Docker.

@lidavidm lidavidm closed this in d575858 Apr 15, 2021
@westonpace
Copy link
Member

FYI, there are probably some things we could do to improve the JNI build (ARROW-11633)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants