[C++] Crashed at TempStack alloc when use Hashing32::HashBatch independently #40431

@ZhangHuiGui

Describe the bug, including details regarding any error messages, version, and platform.

This issue is similar to #40007, but it is not the same.
I want to use the Hashing32::HashBatch API to produce an array of hashes for a batch. Although Hashing32 and Hashing64 are used in the join-related code, they can also be used independently.

For example:

  auto arr = arrow::ArrayFromJSON(arrow::int32(), "[9,2,6]");
  const int batch_len = arr->length();
  arrow::compute::ExecBatch exec_batch({arr}, batch_len);
  auto ctx = arrow::compute::default_exec_context();
  arrow::util::TempVectorStack stack;
  ASSERT_OK(stack.Init(ctx->memory_pool(), batch_len * sizeof(uint32_t))); // Allocate only as much stack as this batch needs.

  std::vector<uint32_t> hashes(batch_len);
  std::vector<arrow::compute::KeyColumnArray> temp_column_arrays;
  ASSERT_OK(arrow::compute::Hashing32::HashBatch(
      exec_batch, hashes.data(), temp_column_arrays,
      ctx->cpu_info()->hardware_flags(), &stack, 0, batch_len));

The crash stack in HashBatch is:

arrow::compute::Hashing32::HashBatch
  arrow::compute::Hashing32::HashMultiColumn
      arrow::util::TempVectorHolder<unsigned int>::TempVectorHolder
        arrow::util::TempVectorStack::alloc
          ARROW_DCHECK(top_ <= buffer_size_); // top_=4176, buffer_size_=160

The cause is the following code:

constexpr uint32_t max_batch_size = util::MiniBatch::kMiniBatchLength;
auto hash_temp_buf = util::TempVectorHolder<uint32_t>(ctx->stack, max_batch_size);

The holder uses max_batch_size, which is 1024, as its num_elements. That is far more than the temp stack's initial buffer_size can hold.
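
To make the two numbers in the DCHECK concrete, here is a minimal sketch of the size arithmetic. The padding and guard constants (round up to 8 bytes, 64 bytes of padding, two 8-byte guard words) are assumptions about how TempVectorStack sizes its allocations in the affected version; they are not stated in this report. They do reproduce both reported values for a 3-row int32 batch. The helper names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Stated in this report: the holder is sized for a full mini-batch.
constexpr int64_t kMiniBatchLength = 1024;

// ASSUMED constants (not confirmed by the report).
constexpr int64_t kPadding = 64;                        // slack added to each allocation
constexpr int64_t kGuardBytes = 2 * sizeof(uint64_t);   // two guard words per allocation

// Round n up to the next multiple of 8 bytes.
constexpr int64_t RoundUp8(int64_t n) { return (n + 7) / 8 * 8; }

// Assumed padded size of a single allocation of n bytes.
constexpr int64_t PaddedAllocationSize(int64_t n) { return RoundUp8(n) + kPadding; }

// Assumed buffer_size_ after Init(pool, size).
constexpr int64_t BufferSizeAfterInit(int64_t size) {
  return PaddedAllocationSize(size) + kPadding + kGuardBytes;
}

// Assumed top_ after the first alloc(num_bytes) on an empty stack.
constexpr int64_t TopAfterFirstAlloc(int64_t num_bytes) {
  return PaddedAllocationSize(num_bytes) + kGuardBytes;
}

// Checks the sketch against the values printed by the failing DCHECK.
inline void CheckAgainstReport() {
  // Init with 3 * sizeof(uint32_t) = 12 bytes.
  assert(BufferSizeAfterInit(3 * sizeof(uint32_t)) == 160);
  // First holder asks for 1024 * sizeof(uint32_t) = 4096 bytes.
  assert(TopAfterFirstAlloc(kMiniBatchLength * sizeof(uint32_t)) == 4176);
}
```

Under these assumptions the first temp vector alone needs about 26 times the space the stack was initialized with, regardless of how many rows the batch has.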

I know that HashBatch is currently only used in hash join and related code. For joins, the input is already sliced at a higher level, which ensures that each input batch has at most kMiniBatchLength rows and that the stack is large enough.
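
The slicing that the join code does before calling HashBatch can be sketched roughly as follows. SplitIntoMiniBatches is a hypothetical helper written for illustration, not an Arrow function; only the value 1024 for kMiniBatchLength comes from this report.

```cpp
#include <algorithm>
#include <cstdint>
#include <utility>
#include <vector>

// Value stated in this report for util::MiniBatch::kMiniBatchLength.
constexpr int64_t kMiniBatchLength = 1024;

// Hypothetical helper: split [0, num_rows) into (start_row, length) slices,
// each holding at most kMiniBatchLength rows.
inline std::vector<std::pair<int64_t, int64_t>> SplitIntoMiniBatches(int64_t num_rows) {
  std::vector<std::pair<int64_t, int64_t>> slices;
  for (int64_t start = 0; start < num_rows; start += kMiniBatchLength) {
    const int64_t length = std::min<int64_t>(kMiniBatchLength, num_rows - start);
    slices.emplace_back(start, length);
  }
  return slices;
}
```

Each (start_row, length) pair would then be passed as the last two arguments of HashBatch, so no single call sees more than kMiniBatchLength rows.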

But it can also be used independently. So maybe we could use num_rows rather than util::MiniBatch::kMiniBatchLength in the HashBatch-related APIs?

Component(s)

C++
