-
Notifications
You must be signed in to change notification settings - Fork 4k
ARROW-12161: [C++][Dataset] Revert async CSV reader in datasets #10019
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
CC @lidavidm Sorry, I hadn't realized you were also working on this. This revert is a bit more extensive than yours as it removes some stuff that was put in just to get the async streaming reader working. |
lidavidm
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @westonpace. Unfortunately CI will likely take a while but I'll circle back and merge this tonight.
|
I kicked AppVeyor as the Windows arrow-dataset-file-csv-test failed seemingly without explanation: https://ci.appveyor.com/project/ApacheSoftwareFoundation/arrow/builds/38687464/job/pth19ssutpbagn0s but it may very well be a Windows-only issue (the test does pass locally for me on Linux) |
|
@lidavidm It does indeed seem to be a Windows only issue. Local builds failed twice for me so I'm currently building on my laptop to investigate. |
303fa2b to
be19df4
Compare
|
Ok, after getting lost in the weeds for a while I was able to confirm that this is very much related to ARROW-12220. Some fun facts...
Unfortunately, merging in ARROW-12220 did not fix the issue. So...more debugging and I was able to discover... It is triggered by a thread exit. The serial thread readers use a dedicated thread that is destroyed when the reader finishes. That thread must have done a huge allocation. Huge is defined as "larger than So, I can workaround it by changing the |
|
@nealrichardson @jonkeane Tagging the mimalloc crowd. |
|
Just to be clear, ARROW-12220 is still a separate and valid bug. I'm not implying that ARROW-12220 is a fault of mimalloc. I started writing the first half of this message assuming ARROW-12220 was the fix and then probably should've rewritten or just removed that part. |
|
And for the last confirmation I modified the test to use a smaller block size and the test passes: https://github.com/westonpace/arrow/runs/2339553058?check_suite_focus=true |
This reverts commit 8780ca4 in order to avoid microsoft/mimalloc#363 as discovered in #10019. Closes #10024 from lidavidm/arrow-11475 Authored-by: David Li <li.davidm96@gmail.com> Signed-off-by: David Li <li.davidm96@gmail.com>
|
@westonpace would you like to rebase this and check that Windows tests pass now that we've reverted mimalloc? |
326f062 to
2f55217
Compare
|
Rebased. I'll watch local CI. |
|
Appveyor & local Windows 2019 builds pass so that is promising. |
|
I'll give CI some more time but I'll merge this tonight and then rebase ARROW-11797. I've speculatively rebased the latter onto this branch at: https://github.com/lidavidm/arrow/tree/arrow-11797 so we can catch anything in CI; hopefully we'll have this all merged tonight or early tomorrow. |
|
Actually, it looks like between your fork's CI and AppVeyor here, all the relevant (C++, Python, R, etc.) CI jobs look good to go, with only Travis being queued. |
|
Ok, I don't think the Travis queue is clearing anytime soon. https://github.com/lidavidm/arrow/tree/arrow-11797 which is this + ARROW-11797 passes Actions/Travis/AppVeyor, so I'll merge this and rebase 11797. |
Reverts the streaming CSV reader and the async workaround introduced for it. It will be reintroduced, more cleanly, in ARROW-12355