@pitrou
Member

@pitrou pitrou commented Feb 18, 2021

In this model, a StopSource is instantiated by the consumer, which passes a corresponding StopToken to producer API(s).
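The model described above can be illustrated with a minimal sketch. The names `SketchStopSource`, `SketchStopToken`, and `ProduceItems` are hypothetical and are not the classes added by this PR; the sketch only assumes that source and token share one atomic flag.

```cpp
#include <atomic>
#include <cassert>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical stand-ins for illustration only; not Arrow's actual classes.
class SketchStopToken {
 public:
  explicit SketchStopToken(std::shared_ptr<std::atomic<bool>> flag)
      : flag_(std::move(flag)) {
    assert(flag_ != nullptr);
  }
  bool IsStopRequested() const { return flag_->load(); }

 private:
  std::shared_ptr<std::atomic<bool>> flag_;
};

class SketchStopSource {
 public:
  SketchStopSource() : flag_(std::make_shared<std::atomic<bool>>(false)) {}
  // Called by the consumer when it no longer wants the results.
  void RequestStop() { flag_->store(true); }
  // Tokens share the source's flag, so they observe the stop request.
  SketchStopToken token() const { return SketchStopToken(flag_); }

 private:
  std::shared_ptr<std::atomic<bool>> flag_;
};

// A producer API: it only receives a token and polls it between work units.
std::vector<int> ProduceItems(const SketchStopToken& token, int max_items) {
  std::vector<int> out;
  for (int i = 0; i < max_items; ++i) {
    if (token.IsStopRequested()) break;  // cooperative cancellation point
    out.push_back(i);
  }
  return out;
}
```

The consumer owns the source and decides when to stop; the producer can only observe the request, never issue it.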

@pitrou pitrou marked this pull request as draft February 18, 2021 19:27
@pitrou pitrou force-pushed the ARROW-8732-cancel-v2 branch 3 times, most recently from 27885ff to 5a05f3f Compare February 23, 2021 18:52
@pitrou pitrou force-pushed the ARROW-8732-cancel-v2 branch 3 times, most recently from 7e9c191 to 5c81fd2 Compare March 1, 2021 19:23
@pitrou pitrou force-pushed the ARROW-8732-cancel-v2 branch 8 times, most recently from a7d3d2f to e0d0092 Compare March 3, 2021 19:03
@pitrou pitrou marked this pull request as ready for review March 3, 2021 19:05
@pitrou pitrou changed the title from "ARROW-8732: [C++] Add basic cancellation API (v2)" to "ARROW-8732: [C++] Add basic cancellation API" Mar 3, 2021
@pitrou
Member Author

pitrou commented Mar 3, 2021

@ursabot crossbow submit -g python

@pitrou
Member Author

pitrou commented Mar 3, 2021

@github-actions crossbow submit -g python

@pitrou pitrou force-pushed the ARROW-8732-cancel-v2 branch from e0d0092 to 848e020 Compare March 3, 2021 19:39
@github-actions

github-actions bot commented Mar 3, 2021

Revision: 848e020cfaf234411d3a0167b58dc39c030823a4

Submitted crossbow builds: ursacomputing/crossbow @ actions-174

Task Status
test-conda-python-3.6 Github Actions
test-conda-python-3.6-pandas-0.23 Github Actions
test-conda-python-3.7 Github Actions
test-conda-python-3.7-dask-latest Github Actions
test-conda-python-3.7-hdfs-3.2 Github Actions
test-conda-python-3.7-kartothek-latest Github Actions
test-conda-python-3.7-kartothek-master Github Actions
test-conda-python-3.7-pandas-latest Github Actions
test-conda-python-3.7-pandas-master Github Actions
test-conda-python-3.7-spark-branch-3.0 Github Actions
test-conda-python-3.7-turbodbc-latest Github Actions
test-conda-python-3.7-turbodbc-master Github Actions
test-conda-python-3.8 Github Actions
test-conda-python-3.8-dask-master Github Actions
test-conda-python-3.8-hypothesis Github Actions
test-conda-python-3.8-jpype Github Actions
test-conda-python-3.8-pandas-latest Github Actions
test-conda-python-3.8-pandas-nightly Github Actions
test-conda-python-3.8-spark-master Github Actions
test-debian-10-python-3 Azure
test-fedora-33-python-3 Azure
test-ubuntu-18.04-python-3 Azure

@pitrou
Member Author

pitrou commented Mar 4, 2021

@westonpace @bkietz Feel free to review.

Member

@bkietz bkietz left a comment

This is looking great!

A few comments:

Member

Suggested change
  }
} else {
  DCHECK_EQ(impl_->requested_.load(), -1);
}

Member Author

@pitrou pitrou Mar 4, 2021

If you call Poll() twice, then the cancel_error corresponding to the signal will have been cached. So you can have both a cancel_error and a positive signal number.
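A rough sketch of the caching behaviour described here, under stated assumptions: `SignalStopState`, its members, and the error message are hypothetical, a `std::string` stands in for the status object, and locking is left out for brevity.

```cpp
#include <atomic>
#include <cassert>
#include <string>

// Hypothetical sketch; not the PR's actual implementation.
struct SignalStopState {
  // 0: no stop requested; > 0: number of the signal that requested the stop.
  std::atomic<int> requested{0};
  // Built lazily by Poll(), i.e. outside of the signal handler.
  std::string cancel_error;

  // Does nothing but an atomic store, so no allocation happens here.
  void RequestStopFromSignal(int signum) {
    assert(signum > 0);
    requested.store(signum);
  }

  // Returns an empty string if no stop was requested, else the cached error.
  const std::string& Poll() {
    const int signum = requested.load();
    if (signum > 0 && cancel_error.empty()) {
      // First Poll() after the signal: build the error and cache it.
      cancel_error = "Operation cancelled by signal " + std::to_string(signum);
    }
    return cancel_error;
  }
};
```

After the first `Poll()` following a signal, the state holds both a cached error and a positive signal number, which is why an assertion that the two are mutually exclusive would not hold in such a design.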

Comment on lines +846 to +845
Member

Seems we might prefer to rewrite this constructor to take an IOContext

Member Author

We must support the legacy TableReader::Make taking both a MemoryPool* and an IOContext.

@pitrou pitrou force-pushed the ARROW-8732-cancel-v2 branch 2 times, most recently from a92611f to b82c13f Compare March 4, 2021 18:03
@pitrou pitrou requested a review from bkietz March 4, 2021 18:48
Member Author

Note to self: can use self_->stop_token_ instead.

Member Author

@pitrou pitrou Mar 5, 2021

Note to self: simply re-lock outside of scope? (to let task be destroyed naturally)

Member

@westonpace westonpace left a comment

I learned a lot about signals just reading this 😄

Member

For an iterator, no. In the future, if this becomes a generator, possibly. The only way it could be useful is if the I/O could be cancelled somehow. Generally that's not possible. With some non-blocking I/O schemes you can at least give up on the user-land side of the I/O. For networked I/O it may be possible but I seem to recall you saying S3 had no such mechanism.

You could maybe use the stop token in the readahead to stop it early, but that doesn't seem too urgent. Without it, the readahead queue will fill up, then there will be no active references and everything will be cleaned up. It may be nice to add a unit test for that scenario, though. I wonder if I could get a consumer-side reference count of some kind and abort as soon as all consumer references are lost 🤔. I'll add a JIRA for it.

Member Author

Indeed, the AWS SDK doesn't support cancellation.

Agreed about the readahead part. This can be done later. Thanks!

Member

Nice. I was just thinking of writing something like this.

Member

I think you can get rid of this mutex. If you changed this to something like...

if (!impl_->requested_.fetch_or(-1)) {
  impl_->signum = signum;
}

signum would be a new (non-atomic) int member that replaces the mutex. Then RequestStop becomes...

if (!impl_->requested_.fetch_or(-1)) {
  impl_->cancel_error_ = std::move(st);
}

The main advantage is there would be no way to lose your thread when calling Poll.

Member Author

No, we need the mutex to protect cancel_error_ (which we cannot store atomically).
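One possible reading of this objection: in the lock-free variant, another thread could observe the flag as set and read `cancel_error_` while it is still being written. A minimal sketch of the mutex-protected alternative follows; `GuardedStopState` is a hypothetical name and a `std::string` stands in for the status object.

```cpp
#include <atomic>
#include <cassert>
#include <mutex>
#include <string>
#include <utility>

// Hypothetical sketch; not the PR's actual implementation.
class GuardedStopState {
 public:
  // The first request wins; later requests are ignored.
  void RequestStop(std::string error) {
    assert(!error.empty());
    std::lock_guard<std::mutex> lock(mutex_);
    if (!requested_.load()) {
      // The error is fully written before the flag becomes visible.
      cancel_error_ = std::move(error);
      requested_.store(true);
    }
  }

  // Cheap check that does not take the mutex.
  bool IsStopRequested() const { return requested_.load(); }

  // Returns a copy of the error, or an empty string if no stop was requested.
  std::string Poll() const {
    if (!requested_.load()) return std::string();
    std::lock_guard<std::mutex> lock(mutex_);
    return cancel_error_;
  }

 private:
  std::atomic<bool> requested_{false};
  mutable std::mutex mutex_;   // protects cancel_error_
  std::string cancel_error_;   // cannot be stored atomically
};
```

The atomic flag keeps the common "not cancelled" path lock-free, while the non-trivially-copyable error is only touched under the mutex.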

Member

If someone calls ResetSignalStopSource while you are handling the signal, wouldn't the handler's reference become the sole reference and, on exit, attempt to destroy the object (which would not be async-signal-safe)?

Admittedly, probably not a common occurrence.

Member Author

@pitrou pitrou Mar 8, 2021

Ouch, I think you're right. I'll try to find a way to do this differently...

Member

Is there a JIRA to expose StopSource to Python at some point? There are reasons other than signals why someone may want to cancel an operation. For example, a GUI-based application may have a cancel button. A web server may want to cancel if the TCP connection for some analysis request is lost.

Member Author

No JIRA yet. We can open one, or do it when needed.

In this model, a StopSource is instantiated by the consumer, which passes a corresponding StopToken to producer APIs.
@pitrou pitrou force-pushed the ARROW-8732-cancel-v2 branch from b82c13f to db498a8 Compare March 8, 2021 17:22
@pitrou
Member Author

pitrou commented Mar 8, 2021

I pushed an improvement for the deallocation-in-signal-handler problem, but it's still not 100% safe (though a failure is probably extremely unlikely). I think solving it entirely would need a lock-free doubly-linked list.

@pitrou
Member Author

pitrou commented Mar 8, 2021

@github-actions crossbow submit -g python

@westonpace
Member

Abseil (seems to be Google's equivalent of Folly) has an async-signal-safe spin lock. That might be easier than a doubly linked list. However, it doesn't seem to be easily vendored (just based on the fact that it has 10 abseil includes) :(

https://github.com/abseil/abseil-cpp/blob/master/absl/base/internal/spinlock.h

I'd agree that it seems too unlikely to worry about.

@pitrou
Member Author

pitrou commented Mar 8, 2021

How would a spinlock help the deallocation issue?

@westonpace
Member

The handler would obtain the lock before calling atomic_load on the pointer and reset the pointer before it releases the lock. Any call to ResetSignalStopSource would grab the lock before modifying the pointer. This should ensure that the pointer obtained by the load cannot go out of scope while the signal handler references it and thus would never have to free.

@pitrou
Member Author

pitrou commented Mar 8, 2021

Hmm, but resetting the pointer may still deallocate the underlying object... which is certainly not async-signal-safe.

@westonpace
Member

It could only deallocate it if the global reference was changed and all attempts to change the global reference should go through the same spinlock. The signal handler should always be given the "second reference" no matter what.

Or, since you're using spinlocks to guard all changes to the global reference, you could just change the global reference to a plain old pointer which might make it more obvious.

@pitrou
Member Author

pitrou commented Mar 8, 2021

Ah, I see: no need to take a strong reference inside the handler since the spinlock guarantees the global reference isn't mutated. That said, the signal-safe spinlock apparently requires one to block signals.
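The spinlock idea discussed above could look roughly like the following sketch. It is not what the PR merged and not Abseil's implementation: `SketchStopState`, `SwapCurrent`, and `HandleSignal` are hypothetical names, the global is a plain pointer guarded by a `std::atomic_flag` spinlock, and POSIX `sigprocmask` is assumed to be available (a multithreaded program would use `pthread_sigmask` instead).

```cpp
#include <atomic>
#include <cassert>
#include <signal.h>  // POSIX: sigset_t, sigfillset, sigprocmask

// Hypothetical state object; a real one would carry more than a signal number.
struct SketchStopState {
  std::atomic<int> requested{0};
};

std::atomic_flag g_lock = ATOMIC_FLAG_INIT;
SketchStopState* g_current = nullptr;  // plain pointer, guarded by g_lock

inline void SpinLock() {
  while (g_lock.test_and_set(std::memory_order_acquire)) {
  }
}
inline void SpinUnlock() { g_lock.clear(std::memory_order_release); }

// Normal (non-handler) code. Signals are blocked while the lock is held:
// otherwise a handler interrupting this thread would spin forever on a lock
// that its own thread owns.
SketchStopState* SwapCurrent(SketchStopState* next) {
  sigset_t all, old;
  sigfillset(&all);
  sigprocmask(SIG_SETMASK, &all, &old);
  SpinLock();
  SketchStopState* previous = g_current;
  g_current = next;
  SpinUnlock();
  sigprocmask(SIG_SETMASK, &old, nullptr);
  return previous;  // the caller frees it, outside of any signal handler
}

// Body of a signal handler. It never allocates or frees; it only uses the
// pointer while holding the lock, so the pointee cannot be swapped out from
// under it. Assumes other signals are masked while the handler runs.
void HandleSignal(int signum) {
  assert(signum > 0);
  SpinLock();
  if (g_current != nullptr) g_current->requested.store(signum);
  SpinUnlock();
}
```

Because every mutation of the global goes through the same lock, the handler needs no strong reference of its own, and any deallocation happens in the caller of `SwapCurrent`. The cost is the one noted in this thread: signals have to be blocked around each lock acquisition in normal code.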

@westonpace
Member

Yep. I think what you have, with the comment, is fine.

@pitrou pitrou closed this in 79ae4f6 Mar 9, 2021
@pitrou pitrou deleted the ARROW-8732-cancel-v2 branch March 9, 2021 16:08