Describe the bug
There might be a regression in v46.

After updating to 46.0.0, calling `PhysicalPlanNode::try_from_physical_plan` on a relatively simple plan overflows the stack and panics with `thread 'testing::should_not_overflow_stack' has overflowed its stack`.

This is a strange one to reproduce.
To Reproduce
Create a new DataFusion project (for some reason I can't reproduce it on datafusion main nor on the 46.0.0 tag):
Cargo.toml:

```toml
[package]
name = "bug_reproducer"
version = "0.1.0"
edition = "2024"

[dependencies]
tokio = { version = "1", features = ["full"] }
datafusion = { version = "46" }
datafusion-proto = { version = "46" }
```

and add the following test:

```rust
#[cfg(test)]
mod testing {
    use datafusion::prelude::*;
    use datafusion_proto::physical_plan::{AsExecutionPlan, DefaultPhysicalExtensionCodec};
    use datafusion_proto::protobuf::PhysicalPlanNode;

    #[tokio::test]
    async fn should_not_overflow_stack() {
        let ctx = SessionContext::new();
        let test_data = crate::common::example_test_data();

        ctx.register_parquet(
            "pt",
            &format!("{test_data}/alltypes_plain.parquet"),
            Default::default(),
        )
        .await
        .unwrap();

        let plan = ctx
            .sql("select id, string_col, timestamp_col from pt where id > 4 order by string_col")
            .await
            .unwrap()
            .create_physical_plan()
            .await
            .unwrap();

        // this call panics
        //
        // thread 'testing::should_not_overflow_stack' has overflowed its stack
        // fatal runtime error: stack overflow
        //
        let node: PhysicalPlanNode =
            PhysicalPlanNode::try_from_physical_plan(plan, &DefaultPhysicalExtensionCodec {})
                .unwrap();

        let plan = node
            .try_into_physical_plan(&ctx, &ctx.runtime_env(), &DefaultPhysicalExtensionCodec {})
            .unwrap();

        let _ = plan.execute(0, ctx.task_ctx()).unwrap();
    }
}
```
Running

```sh
cargo test
```

fails with

```
running 1 test

thread 'testing::should_not_overflow_stack' has overflowed its stack
fatal runtime error: stack overflow
```
It works ok with

```sh
export RUST_MIN_STACK=20971520
cargo test
```

or with

```sh
cargo test --release
```

Looking at the plan and doing some quick debugging:
```
SortPreservingMergeExec: [string_col@1 ASC NULLS LAST]
  SortExec: expr=[string_col@1 ASC NULLS LAST], preserve_partitioning=[true]
    CoalesceBatchesExec: target_batch_size=8192
      FilterExec: id@0 > 4
        RepartitionExec: partitioning=RoundRobinBatch(14), input_partitions=1
          DataSourceExec: file_groups={1 group: [[Users/marko/git/arrow-datafusion-fork/parquet-testing/data/alltypes_plain.parquet]]}, projection=[id, string_col, timestamp_col], file_type=parquet, predicate=id@0 > 4, pruning_predicate=id_null_count@1 != row_count@2 AND id_max@0 > 4, required_guarantees=[]
```
The last valid frame before the panic is at:
```rust
if let Some(exec) = plan.downcast_ref::<RepartitionExec>() {
    let input = protobuf::PhysicalPlanNode::try_from_physical_plan(
        exec.input().to_owned(),
        extension_codec,
    )?;

    let pb_partitioning =
        serialize_partitioning(exec.partitioning(), extension_codec)?;
```
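For context (not part of the original report): the plan above is only six operators deep, so the overflow suggests very large stack frames per recursive serialization call rather than deep recursion. A minimal sketch to confirm the depth, assuming `ExecutionPlan::children()` returns child references (`Vec<&Arc<dyn ExecutionPlan>>`) as in recent DataFusion releases:

```rust
use std::sync::Arc;

use datafusion::physical_plan::ExecutionPlan;

/// Hypothetical helper (not in the report): depth of a physical plan tree.
fn plan_depth(plan: &Arc<dyn ExecutionPlan>) -> usize {
    1 + plan
        .children()
        .into_iter()
        .map(plan_depth)
        .max()
        .unwrap_or(0)
}
```

For the plan in the reproducer this should report 6.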
Expected behavior
Not sure if much can be done; I'd expect the round trip to succeed for such a simple plan without having to increase the stack size.

Setting `RUST_MIN_STACK` is an option, but it would hurt usability if basic (Ballista) examples fail without it.
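As a code-level alternative to raising `RUST_MIN_STACK` for the whole process, a caller could run the serialization on a dedicated thread with an explicitly larger stack. A minimal sketch of that workaround idea (not an API of datafusion-proto); the 16 MiB figure is a guess and would need tuning:

```rust
use std::sync::Arc;

use datafusion::error::Result;
use datafusion::physical_plan::ExecutionPlan;
use datafusion_proto::physical_plan::{AsExecutionPlan, DefaultPhysicalExtensionCodec};
use datafusion_proto::protobuf::PhysicalPlanNode;

/// Hypothetical helper: serialize `plan` on a thread with a larger stack so
/// the recursion in `try_from_physical_plan` cannot overflow the smaller
/// default (test-)thread stack.
fn serialize_with_big_stack(plan: Arc<dyn ExecutionPlan>) -> Result<PhysicalPlanNode> {
    std::thread::Builder::new()
        .stack_size(16 * 1024 * 1024) // 16 MiB; adjust as needed
        .spawn(move || {
            PhysicalPlanNode::try_from_physical_plan(plan, &DefaultPhysicalExtensionCodec {})
        })
        .expect("failed to spawn serialization thread")
        .join()
        .expect("serialization thread panicked")
}
```

In the test above, `serialize_with_big_stack(plan)` could replace the direct `try_from_physical_plan` call; it only sidesteps the small default stack, it does not address the regression itself.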
Additional context
- reproduced on MacBook Pro (M4), Rust 1.85
- [ballista action run](https://github.com/milenkovicm/arrow-ballista/actions/runs/137362402980)

I'll note again that I haven't been able to reproduce it directly on main nor on the 46.0.0 tag (which puzzles me even more).