
[C++][Acero] Apparent deadlock in Table.join_asof #46224

@erikhansenwong


Describe the bug, including details regarding any error messages, version, and platform.

On some calls to Table.join_asof, my Python process becomes unresponsive while using zero CPU. It appears to be a thread deadlock or something similar. I have created an example that reproduces the deadlock with high probability.

Here are the details of my setup:

  • Python 3.12.7
  • pyarrow==19.0.1
  • numpy==2.2.4
  • pandas==2.2.3
  • Ubuntu 22.04.5
  • CPU: 13th Gen Intel(R) Core(TM) i9-13980HX

I was also able to reproduce the deadlock with this example on a colleague's Mac laptop with Apple silicon, so I assume the hardware does not make a big difference.

On my laptop, the following script always deadlocks before the 300th iteration:

import numpy as np
import pandas as pd
import pyarrow as pa

n_left = 100
n_right = 200_000
left_start = pd.Timestamp("2025-04-07T07:45:55", tz="UTC")
right_start = pd.Timestamp("2025-04-07T00:00:00", tz="UTC")
time_end = pd.Timestamp("2025-04-07T12:05:59", tz="UTC")

tolerance_nanos = 60 * 1_000_000_000
np.random.seed(0)


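def get_timestamps(start, end, n):
    # Build n non-decreasing timestamps between `start` and `end`.
    seconds = (end - start).total_seconds()
    td = np.random.uniform(0, 1, n)
    # Zero out roughly half of the increments so that many timestamps repeat.
    td *= np.random.choice([0, 1], n)
    # Rescale so the increments sum to the full start-to-end span.
    td *= seconds / td.sum()
    td = td.cumsum()
    return start + pd.to_timedelta(td, "seconds")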


left_schema = pa.schema([pa.field("timestamp", pa.timestamp("ns", "UTC"))])
right_schema = pa.schema(
    [
        pa.field("timestamp", pa.timestamp("ns", "UTC")),
        pa.field("value", pa.float64()),
    ]
)

left = pa.table(
    {"timestamp": get_timestamps(left_start, time_end, n_left)},
    schema=left_schema,
)
right = pa.table(
    {
        "timestamp": get_timestamps(right_start, time_end, n_right),
        "value": np.random.normal(100, 5, n_right),
    },
    schema=right_schema,
)

for i in range(1000):
    print(f"{i:>5} | {pd.Timestamp.now()}")
    left.join_asof(
        right,
        on="timestamp",
        by=[],
        tolerance=tolerance_nanos,
    )
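Because the hang shows up as a process that sits idle forever, it may help to detect it automatically instead of watching the printed iteration counter. The following is only a sketch using the standard library: `finishes_within` is a hypothetical helper name, not part of pyarrow, and the timeout passed to it is an arbitrary assumption. It runs a callable in a daemon thread and reports whether it returned in time.

```python
import threading


def finishes_within(func, timeout_s):
    """Run func() in a daemon thread and report whether it returned in time.

    Hypothetical helper for illustration; not part of pyarrow.
    Returns True if func() finished within timeout_s seconds, else False.
    """
    worker = threading.Thread(target=func, daemon=True)
    worker.start()
    # join() returns after timeout_s seconds even if the worker is still stuck.
    worker.join(timeout_s)
    return not worker.is_alive()
```

In the loop above, the direct call could be wrapped as `finishes_within(lambda: left.join_asof(right, on="timestamp", by=[], tolerance=tolerance_nanos), 30.0)`, breaking out of the loop on the first `False`. The 30-second limit is a guess and should be well above the normal runtime of one join. A stuck worker thread is not cleaned up; being a daemon thread, it only stops blocking interpreter exit.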

Component(s)

Python
