stack overflow on PhysicalPlanNode::try_from_physical_plan #15087

@milenkovicm

Describe the bug

There might be a regression in v46.

After updating to 46.0.0, calling PhysicalPlanNode::try_from_physical_plan on a relatively simple plan overflows the stack. The test aborts with thread 'testing::should_not_overflow_stack' has overflowed its stack.

This is a strange one to reproduce.

To Reproduce

Create a new DataFusion project (for some reason I can't reproduce it on DataFusion main or at the 46.0.0 tag):

[package]
name = "bug_reproducer"
version = "0.1.0"
edition = "2024"

[dependencies]
tokio = { version = "1", features = ["full"] }
datafusion = { version = "46" }
datafusion-proto = { version = "46" }

#[cfg(test)]
mod testing {
    use datafusion::prelude::*;
    use datafusion_proto::physical_plan::{AsExecutionPlan, DefaultPhysicalExtensionCodec};
    use datafusion_proto::protobuf::PhysicalPlanNode;

    #[tokio::test]
    async fn should_not_overflow_stack() {
        let ctx = SessionContext::new();

        let test_data = crate::common::example_test_data();

        ctx.register_parquet(
            "pt",
            &format!("{test_data}/alltypes_plain.parquet"),
            Default::default(),
        )
        .await
        .unwrap();

        let plan = ctx
            .sql("select id, string_col, timestamp_col from pt where id > 4 order by string_col")
            .await
            .unwrap()
            .create_physical_plan()
            .await
            .unwrap();
        // this call panics
        //
        // thread 'testing::should_not_overflow_stack' has overflowed its stack
        // fatal runtime error: stack overflow
        //
        let node: PhysicalPlanNode =
            PhysicalPlanNode::try_from_physical_plan(plan, &DefaultPhysicalExtensionCodec {})
                .unwrap();

        let plan = node
            .try_into_physical_plan(&ctx, &ctx.runtime_env(), &DefaultPhysicalExtensionCodec {})
            .unwrap();

        let _ = plan.execute(0, ctx.task_ctx()).unwrap();
    }
}

Run:

cargo test

It fails with:

running 1 test

thread 'testing::should_not_overflow_stack' has overflowed its stack
fatal runtime error: stack overflow

It works with:

export RUST_MIN_STACK=20971520
cargo test
cargo test --release
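
As an alternative to the environment variable, the recursive call could in principle be run on a thread with an explicitly sized stack. Below is a minimal sketch using only the standard library. `recurse` is a hypothetical stand-in for the recursive serialization call, and `run_with_stack` is an illustrative helper, not part of DataFusion. The 20 MiB figure simply mirrors the RUST_MIN_STACK value above.

```rust
use std::hint::black_box;
use std::thread;

// Hypothetical stand-in for a deeply recursive call such as plan serialization.
// The local buffer inflates each stack frame so the recursion uses real stack space.
fn recurse(depth: u64) -> u64 {
    let buf = [0u8; 1024];
    black_box(&buf); // keep the buffer from being optimized away
    if depth == 0 {
        0
    } else {
        1 + recurse(depth - 1)
    }
}

// Runs `f` on a fresh thread with the requested stack size and returns its result.
fn run_with_stack<T, F>(stack_bytes: usize, f: F) -> T
where
    T: Send + 'static,
    F: FnOnce() -> T + Send + 'static,
{
    thread::Builder::new()
        .stack_size(stack_bytes)
        .spawn(f)
        .expect("failed to spawn thread")
        .join()
        .expect("thread panicked")
}

fn main() {
    // 20 MiB, the same value as RUST_MIN_STACK=20971520.
    let depth = run_with_stack(20 * 1024 * 1024, || recurse(4_000));
    println!("reached depth {depth}");
}
```

This only sidesteps the symptom in the caller's code. It does not explain why such a shallow plan needs that much stack in the first place.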

Looking at the plan, after some quick debugging:

SortPreservingMergeExec: [string_col@1 ASC NULLS LAST]
  SortExec: expr=[string_col@1 ASC NULLS LAST], preserve_partitioning=[true]
    CoalesceBatchesExec: target_batch_size=8192
      FilterExec: id@0 > 4
        RepartitionExec: partitioning=RoundRobinBatch(14), input_partitions=1
          DataSourceExec: file_groups={1 group: [[Users/marko/git/arrow-datafusion-fork/parquet-testing/data/alltypes_plain.parquet]]}, projection=[id, string_col, timestamp_col], file_type=parquet, predicate=id@0 > 4, pruning_predicate=id_null_count@1 != row_count@2 AND id_max@0 > 4, required_guarantees=[]

The last valid frame before it panics is at:

        if let Some(exec) = plan.downcast_ref::<RepartitionExec>() {
            let input = protobuf::PhysicalPlanNode::try_from_physical_plan(
                exec.input().to_owned(),
                extension_codec,
            )?;

            let pb_partitioning =
                serialize_partitioning(exec.partitioning(), extension_codec)?;

Expected behavior

I'm not sure how much can be done, but I would expect the round trip to succeed without a stack increase for such a simple plan.

Setting RUST_MIN_STACK is an option, but it would hurt usability if basic (Ballista) examples failed without it.

Additional context

I'll note again that I haven't been able to reproduce it directly on main or at the 46.0.0 tag, which puzzles me even more.

Labels: bug
