Closed
Labels: Component: C++, Component: Python, Critical Fix (bugfixes for security vulnerabilities, crashes, or invalid data), Type: bug
Description
This Python script produces a segmentation fault in the join() call:
import pyarrow as pa
eight_mib = "xyzw" * (2048 * 1024)
gib = pa.array((eight_mib for i in range(128)), pa.string())
keys = pa.array(range(128), pa.int64())
left = pa.Table.from_pydict({"keys": keys, "gib": gib})
right_keys = pa.array(list(range(128)) * 4, pa.int64())
right = pa.Table.from_pydict({"keys": right_keys})
print("joining...")
left.join(right, "keys")
print("joined.")

The C++ call stack is:
#0 0x00007ffff43aba6a in arrow::compute::ExecBatchBuilder::AppendSelected(std::shared_ptr<arrow::ArrayData> const&, arrow::compute::ResizableArrayData*, int, unsigned short const*, arrow::MemoryPool*) () from /root/.pyenv/versions/3.11.6/lib/python3.11/site-packages/pyarrow/libarrow.so.1200
#1 0x00007ffff43acfa7 in arrow::compute::ExecBatchBuilder::AppendSelected(arrow::MemoryPool*, arrow::compute::ExecBatch const&, int, unsigned short const*, int, int const*) () from /root/.pyenv/versions/3.11.6/lib/python3.11/site-packages/pyarrow/libarrow.so.1200
#2 0x00007ffff67cbe44 in arrow::acero::JoinResultMaterialize::Append(arrow::compute::ExecBatch const&, int, unsigned short const*, unsigned int const*, unsigned int const*, int*) () from /root/.pyenv/versions/3.11.6/lib/python3.11/site-packages/pyarrow/libarrow_acero.so.1200
#3 0x00007ffff67e0423 in arrow::acero::JoinProbeProcessor::OnNextBatch(long, arrow::compute::ExecBatch const&, arrow::util::TempVectorStack*, std::vector<arrow::compute::KeyColumnArray, std::allocator<arrow::compute::KeyColumnArray> >*) ()
from /root/.pyenv/versions/3.11.6/lib/python3.11/site-packages/pyarrow/libarrow_acero.so.1200
#4 0x00007ffff6802721 in arrow::acero::SwissJoin::ProbeSingleBatch(unsigned long, arrow::compute::ExecBatch) ()
from /root/.pyenv/versions/3.11.6/lib/python3.11/site-packages/pyarrow/libarrow_acero.so.1200
#5 0x00007ffff6825c07 in std::_Function_handler<arrow::Status (unsigned long, long), arrow::acero::HashJoinNode::Init()::{lambda(unsigned long, long)#8}>::_M_invoke(std::_Any_data const&, unsigned long&&, long&&) () from /root/.pyenv/versions/3.11.6/lib/python3.11/site-packages/pyarrow/libarrow_acero.so.1200
#6 0x00007ffff67be225 in arrow::acero::TaskSchedulerImpl::ExecuteTask(unsigned long, int, long, bool*) ()
from /root/.pyenv/versions/3.11.6/lib/python3.11/site-packages/pyarrow/libarrow_acero.so.1200
#7 0x00007ffff67d5814 in std::_Function_handler<arrow::Status (unsigned long), arrow::acero::TaskSchedulerImpl::ScheduleMore(unsigned long, int)::{lambda(unsigned long)#1}>::_M_invoke(std::_Any_data const&, unsigned long&&) ()
from /root/.pyenv/versions/3.11.6/lib/python3.11/site-packages/pyarrow/libarrow_acero.so.1200
#8 0x00007ffff67baf47 in std::_Function_handler<arrow::Status (), arrow::acero::QueryContext::ScheduleTask(std::function<arrow::Status (unsigned long)>, std::basic_string_view<char, std::char_traits<char> >)::{lambda()#1}>::_M_invoke(std::_Any_data const&) ()
from /root/.pyenv/versions/3.11.6/lib/python3.11/site-packages/pyarrow/libarrow_acero.so.1200
#9 0x00007ffff67f9260 in arrow::internal::FnOnce<void ()>::FnImpl<std::_Bind<arrow::detail::ContinueFuture (arrow::Future<arrow::internal::Empty>, std::function<arrow::Status ()>)> >::invoke() () from /root/.pyenv/versions/3.11.6/lib/python3.11/site-packages/pyarrow/libarrow_acero.so.1200
#10 0x00007ffff44d9505 in arrow::internal::FnOnce<void ()>::operator()() && ()
from /root/.pyenv/versions/3.11.6/lib/python3.11/site-packages/pyarrow/libarrow.so.1200
#11 0x00007ffff44d5c38 in std::thread::_State_impl<std::thread::_Invoker<std::tuple<arrow::internal::ThreadPool::LaunchWorkersUnlocked(int)::{lambda()#1}> > >::_M_run() () from /root/.pyenv/versions/3.11.6/lib/python3.11/site-packages/pyarrow/libarrow.so.1200
#12 0x00007ffff543b4a0 in execute_native_thread_routine () from /root/.pyenv/versions/3.11.6/lib/python3.11/site-packages/pyarrow/libarrow.so.1200
#13 0x00007ffff7850ac3 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
#14 0x00007ffff78e2660 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81
Software versions: PyArrow 12.0.1, Python 3.11.6, Ubuntu 22.04.3.
If I change pa.string() to pa.large_string(), the join completes without crashing.
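
One possible explanation for why the type change matters is offset width. This is an assumption, not something the report or the backtrace confirms. pa.string() stores value offsets as signed 32-bit integers, while pa.large_string() uses 64-bit offsets. The sketch below only reproduces the size arithmetic of the script above in plain Python. It does not call PyArrow, and the variable names are illustrative.

```python
# Sizes taken from the reproduction script above.
# "xyzw" is ASCII, so one character is one byte in UTF-8.
value_bytes = len("xyzw") * (2048 * 1024)  # one string value: 8 MiB
left_rows = 128                            # rows in the left table
matches_per_key = 4                        # each key appears 4 times on the right

# Total string data in the left column and in the joined output column.
left_bytes = value_bytes * left_rows
joined_bytes = value_bytes * left_rows * matches_per_key

# Largest offset representable by each string type.
INT32_MAX = 2**31 - 1  # pa.string() offsets
INT64_MAX = 2**63 - 1  # pa.large_string() offsets

fits_string = joined_bytes <= INT32_MAX
fits_large_string = joined_bytes <= INT64_MAX

print(f"left column:   {left_bytes / 2**30:.0f} GiB")
print(f"joined column: {joined_bytes / 2**30:.0f} GiB")
print(f"fits in string offsets:       {fits_string}")
print(f"fits in large_string offsets: {fits_large_string}")
```

Under these assumptions the left column holds 1 GiB of string data, which fits in 32-bit offsets, while the joined column would hold 4 GiB, which does not. Whether the join actually builds a single array that large depends on how it batches its output, which this sketch does not model.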
Component(s)
C++, Python