@westonpace
Member

Reverts the streaming CSV reader and the async workaround introduced for it. It will be reintroduced, more cleanly, in ARROW-12355.

@westonpace
Member Author

CC @lidavidm Sorry, I hadn't realized you were also working on this. This revert is a bit more extensive than yours as it removes some stuff that was put in just to get the async streaming reader working.

Member

@lidavidm lidavidm left a comment

Thanks @westonpace. Unfortunately CI will likely take a while but I'll circle back and merge this tonight.

@lidavidm
Member

lidavidm commented Apr 13, 2021

I kicked AppVeyor as the Windows arrow-dataset-file-csv-test failed seemingly without explanation: https://ci.appveyor.com/project/ApacheSoftwareFoundation/arrow/builds/38687464/job/pth19ssutpbagn0s

but it may very well be a Windows-only issue (the test does pass locally for me on Linux)

@westonpace
Member Author

@lidavidm It does indeed seem to be a Windows-only issue. Local builds failed twice for me, so I'm currently building on my laptop to investigate.

@westonpace westonpace force-pushed the feature/revert-arrow-12161 branch from 303fa2b to be19df4 Compare April 13, 2021 23:47
@westonpace
Member Author

Ok, after getting lost in the weeds for a while I was able to confirm that this is very much related to ARROW-12220. Some fun facts...

  • Since threading has been removed, the teardown is a lot more deterministic, so we get the error 100% of the time
  • It only happens on the Windows build that has mimalloc turned on
  • There are no logs because the failure is not a segmentation fault but a mimalloc assertion
  • The assertion fires as the background generator is being destroyed (I verified this with the debugger, so the evidence is pretty concrete)

Unfortunately, merging in ARROW-12220 did not fix the issue. So...more debugging and I was able to discover...

microsoft/mimalloc#363

It is triggered by a thread exit. The serial thread readers use a dedicated thread that is destroyed when the reader finishes. That thread must have done a huge allocation. Huge is defined as "larger than 1<<21" (2 MiB). The test in question that triggers this uses a block size of 1<<22 (4 MiB).

So, I can work around it by changing the block_size in the test. We could conceivably limit the block size for users. I'm not fully aware of all the places we destroy threads (there aren't many, so we can maybe get away with it). We may want to reconsider mimalloc 2.0 until this is fixed.
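
To make the numbers above concrete, here is a minimal Python sketch of the size check and of the "allocate on a dedicated thread, then let the thread exit" pattern. It is only an illustration under stated assumptions: the threshold is taken from the description above, every name in it is hypothetical, and the Python buffer does not go through mimalloc, so this does not reproduce the assertion.

```python
import threading

# Assumption: "huge" means larger than 1<<21 bytes, as described above.
HUGE_THRESHOLD = 1 << 21
# Block size used by the failing test, per the description above.
TEST_BLOCK_SIZE = 1 << 22


def is_huge(nbytes: int) -> bool:
    """Return True if an allocation of nbytes counts as huge."""
    return nbytes > HUGE_THRESHOLD


def clamp_block_size(requested: int) -> int:
    """Hypothetical workaround: cap the block size at the threshold."""
    return min(requested, HUGE_THRESHOLD)


def read_block_on_dedicated_thread(block_size: int) -> int:
    """Allocate one block on a short-lived thread and let that thread exit.

    This mirrors the shape of the problem (allocation followed by thread
    exit); it is not the Arrow reader and does not use mimalloc.
    """
    result = {}

    def worker() -> None:
        buf = bytearray(block_size)  # stand-in for the reader's block buffer
        result["nbytes"] = len(buf)

    t = threading.Thread(target=worker)
    t.start()
    t.join()  # the thread is gone once the "reader" finishes
    return result["nbytes"]


if __name__ == "__main__":
    print(is_huge(TEST_BLOCK_SIZE))
    print(is_huge(clamp_block_size(TEST_BLOCK_SIZE)))
```

With these assumptions, the test's block size lands on the huge path while a clamped block size does not, which is consistent with the smaller-block-size workaround.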

@westonpace
Member Author

@nealrichardson @jonkeane Tagging the mimalloc crowd.

@westonpace
Member Author

Just to be clear, ARROW-12220 is still a separate and valid bug. I'm not implying that ARROW-12220 is a fault of mimalloc. I started writing the first half of this message assuming ARROW-12220 was the fix and then probably should've rewritten or just removed that part.

@westonpace
Member Author

And for the last confirmation I modified the test to use a smaller block size and the test passes: https://github.com/westonpace/arrow/runs/2339553058?check_suite_focus=true

lidavidm added a commit that referenced this pull request Apr 14, 2021
This reverts commit 8780ca4 in order to avoid microsoft/mimalloc#363 as discovered in #10019.

Closes #10024 from lidavidm/arrow-11475

Authored-by: David Li <li.davidm96@gmail.com>
Signed-off-by: David Li <li.davidm96@gmail.com>
@lidavidm
Member

@westonpace would you like to rebase this and check that Windows tests pass now that we've reverted mimalloc?

@westonpace westonpace force-pushed the feature/revert-arrow-12161 branch from 326f062 to 2f55217 Compare April 14, 2021 20:15
@westonpace
Member Author

Rebased. I'll watch local CI.

@westonpace
Member Author

AppVeyor and local Windows 2019 builds pass, so that is promising.

@lidavidm
Member

I'll give CI some more time but I'll merge this tonight and then rebase ARROW-11797. I've speculatively rebased the latter onto this branch at: https://github.com/lidavidm/arrow/tree/arrow-11797 so we can catch anything in CI; hopefully we'll have this all merged tonight or early tomorrow.

@lidavidm
Member

Actually, it looks like between your fork's CI and AppVeyor here, all the relevant (C++, Python, R, etc.) CI jobs look good to go, with only Travis being queued.

@lidavidm
Member

Ok, I don't think the Travis queue is clearing anytime soon. https://github.com/lidavidm/arrow/tree/arrow-11797, which is this + ARROW-11797, passes Actions/Travis/AppVeyor, so I'll merge this and rebase 11797.

@lidavidm lidavidm closed this in 05ec438 Apr 14, 2021