SIGSEGV during join on large table #39332

@tmaxwell-anthropic

Description

This Python script produces a segmentation fault in the join() call:

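import pyarrow as pa

# One 8 MiB string: 4 bytes * 2 Mi repetitions.
eight_mib = "xyzw" * (2048 * 1024)
# 128 such strings, about 1 GiB of string data in a single column.
gib = pa.array((eight_mib for i in range(128)), pa.string())
keys = pa.array(range(128), pa.int64())
left = pa.Table.from_pydict({"keys": keys, "gib": gib})

# Each key appears 4 times on the right, so every left row matches 4 times.
right_keys = pa.array(list(range(128)) * 4, pa.int64())
right = pa.Table.from_pydict({"keys": right_keys})

print("joining...")
left.join(right, "keys")  # segfaults here
print("joined.")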

The C++ call stack is:

#0  0x00007ffff43aba6a in arrow::compute::ExecBatchBuilder::AppendSelected(std::shared_ptr<arrow::ArrayData> const&, arrow::compute::ResizableArrayData*, int, unsigned short const*, arrow::MemoryPool*) () from /root/.pyenv/versions/3.11.6/lib/python3.11/site-packages/pyarrow/libarrow.so.1200
#1  0x00007ffff43acfa7 in arrow::compute::ExecBatchBuilder::AppendSelected(arrow::MemoryPool*, arrow::compute::ExecBatch const&, int, unsigned short const*, int, int const*) () from /root/.pyenv/versions/3.11.6/lib/python3.11/site-packages/pyarrow/libarrow.so.1200
#2  0x00007ffff67cbe44 in arrow::acero::JoinResultMaterialize::Append(arrow::compute::ExecBatch const&, int, unsigned short const*, unsigned int const*, unsigned int const*, int*) () from /root/.pyenv/versions/3.11.6/lib/python3.11/site-packages/pyarrow/libarrow_acero.so.1200
#3  0x00007ffff67e0423 in arrow::acero::JoinProbeProcessor::OnNextBatch(long, arrow::compute::ExecBatch const&, arrow::util::TempVectorStack*, std::vector<arrow::compute::KeyColumnArray, std::allocator<arrow::compute::KeyColumnArray> >*) ()
   from /root/.pyenv/versions/3.11.6/lib/python3.11/site-packages/pyarrow/libarrow_acero.so.1200
#4  0x00007ffff6802721 in arrow::acero::SwissJoin::ProbeSingleBatch(unsigned long, arrow::compute::ExecBatch) ()
   from /root/.pyenv/versions/3.11.6/lib/python3.11/site-packages/pyarrow/libarrow_acero.so.1200
#5  0x00007ffff6825c07 in std::_Function_handler<arrow::Status (unsigned long, long), arrow::acero::HashJoinNode::Init()::{lambda(unsigned long, long)#8}>::_M_invoke(std::_Any_data const&, unsigned long&&, long&&) () from /root/.pyenv/versions/3.11.6/lib/python3.11/site-packages/pyarrow/libarrow_acero.so.1200
#6  0x00007ffff67be225 in arrow::acero::TaskSchedulerImpl::ExecuteTask(unsigned long, int, long, bool*) ()
   from /root/.pyenv/versions/3.11.6/lib/python3.11/site-packages/pyarrow/libarrow_acero.so.1200
#7  0x00007ffff67d5814 in std::_Function_handler<arrow::Status (unsigned long), arrow::acero::TaskSchedulerImpl::ScheduleMore(unsigned long, int)::{lambda(unsigned long)#1}>::_M_invoke(std::_Any_data const&, unsigned long&&) ()
   from /root/.pyenv/versions/3.11.6/lib/python3.11/site-packages/pyarrow/libarrow_acero.so.1200
#8  0x00007ffff67baf47 in std::_Function_handler<arrow::Status (), arrow::acero::QueryContext::ScheduleTask(std::function<arrow::Status (unsigned long)>, std::basic_string_view<char, std::char_traits<char> >)::{lambda()#1}>::_M_invoke(std::_Any_data const&) ()
   from /root/.pyenv/versions/3.11.6/lib/python3.11/site-packages/pyarrow/libarrow_acero.so.1200
#9  0x00007ffff67f9260 in arrow::internal::FnOnce<void ()>::FnImpl<std::_Bind<arrow::detail::ContinueFuture (arrow::Future<arrow::internal::Empty>, std::function<arrow::Status ()>)> >::invoke() () from /root/.pyenv/versions/3.11.6/lib/python3.11/site-packages/pyarrow/libarrow_acero.so.1200
#10 0x00007ffff44d9505 in arrow::internal::FnOnce<void ()>::operator()() && ()
   from /root/.pyenv/versions/3.11.6/lib/python3.11/site-packages/pyarrow/libarrow.so.1200
#11 0x00007ffff44d5c38 in std::thread::_State_impl<std::thread::_Invoker<std::tuple<arrow::internal::ThreadPool::LaunchWorkersUnlocked(int)::{lambda()#1}> > >::_M_run() () from /root/.pyenv/versions/3.11.6/lib/python3.11/site-packages/pyarrow/libarrow.so.1200
#12 0x00007ffff543b4a0 in execute_native_thread_routine () from /root/.pyenv/versions/3.11.6/lib/python3.11/site-packages/pyarrow/libarrow.so.1200
#13 0x00007ffff7850ac3 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
#14 0x00007ffff78e2660 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81

Software versions: PyArrow 12.0.1, Python 3.11.6, Ubuntu 22.04.3.

If I change pa.string() to pa.large_string(), the join completes without crashing.

Component(s)

C++, Python
