
fix: ensure I/O cancels correctly when scan is dropped#5129

Merged
westonpace merged 2 commits into lance-format:main from westonpace:fix/avoid-io-deadlock-dropped-scan
Nov 4, 2025

Conversation

Member

@westonpace westonpace commented Nov 3, 2025

Previously the scheduler took a "poison the well" approach: when the scheduler was dropped, it would poison any remaining tasks so they returned an error.

Unfortunately, this approach is not easily plugged into DataFusion, which does not have any kind of asynchronous cancellation of the stream.

Instead, DataFusion encourages an "abort all tasks" approach when the stream is cancelled. This PR migrates things to the "abort all tasks" approach. It also uses SpawnedTask::spawn in filtered_read instead of tokio::task::spawn to create abort-on-drop, fire-and-forget tasks.

In addition, this PR connects the io_buffer_size property to the filtered read. This scanner property was previously being ignored; I needed it for a unit test.

@github-actions github-actions Bot added the bug Something isn't working label Nov 3, 2025

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.


Comment on lines +8418 to +8429
let start = Instant::now();
while start.elapsed() < Duration::from_secs(10) {
    if runtime.handle().metrics().num_alive_tasks() == 0 {
        break;
    }
    std::thread::sleep(Duration::from_millis(100));
}

assert!(
    runtime.handle().metrics().num_alive_tasks() == 0,
    "Tasks should have finished within 10 seconds but there are still {} tasks running",
    runtime.handle().metrics().num_alive_tasks()
);

P0: Unstable tokio metrics API breaks build

The new test uses runtime.handle().metrics().num_alive_tasks() to wait for tasks to finish. Handle::metrics is gated behind the tokio_unstable configuration flag, and the workspace Cargo.toml only enables rt-multi-thread, macros, fs, and sync for tokio. Because tokio_unstable is not enabled, this code will not compile, causing the entire test suite to fail to build. The metrics-based polling should either be conditional on tokio_unstable or replaced with a stable mechanism for detecting task completion.


Member Author


Some metrics are unstable. The ones in use here are stable.

Contributor

@wjones127 wjones127 left a comment


Great to see this!

@codecov-commenter

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 81.84%. Comparing base (6d43982) to head (779cb2b).
⚠️ Report is 16 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #5129      +/-   ##
==========================================
+ Coverage   81.73%   81.84%   +0.11%     
==========================================
  Files         340      341       +1     
  Lines      138875   140539    +1664     
  Branches   138875   140539    +1664     
==========================================
+ Hits       113503   115028    +1525     
- Misses      21631    21706      +75     
- Partials     3741     3805      +64     
Flag        Coverage            Δ
unittests   81.84% <100.00%>    (+0.11%) ⬆️


@westonpace westonpace merged commit 2341378 into lance-format:main Nov 4, 2025
26 of 27 checks passed
jackye1995 pushed a commit to jackye1995/lance that referenced this pull request Jan 21, 2026

Labels

bug Something isn't working


3 participants