Skip to content

Dart: Smoother handling of stage early-exit.#17228

Merged
cryptoe merged 1 commit intoapache:masterfrom
gianm:fix-dart-early-exit-stages
Oct 3, 2024
Merged

Dart: Smoother handling of stage early-exit.#17228
cryptoe merged 1 commit intoapache:masterfrom
gianm:fix-dart-early-exit-stages

Conversation

@gianm
Copy link
Copy Markdown
Contributor

@gianm gianm commented Oct 3, 2024

Stages can be instructed to exit before they finish, especially when a downstream stage includes a "LIMIT". This patch has improvements related to early-exiting stages.

Bug fix:

  • WorkerStageKernel: Don't allow fail() to set an exception if the stage is already in a terminal state (FINISHED or FAILED). If fail() is called while in a terminal state, log the exception, then throw it away. If it's a cancellation exception, don't even log it. This fixes a bug where a stage that exited early could transition to FINISHED and then to FAILED, causing the overall query to fail.

Performance:

  • DartWorkerManager previously sent stopWorker commands to workers even when "interrupt" was false. Now it only sends those commands when "interrupt" is true. The method javadoc already claimed this is what the method did, but the implementation did not match the javadoc. This reduces the number of RPCs by 1 per worker per query.

Quieter logging:

  • In ReadableByteChunksFrameChannel, skip logging exception from setError if the channel has been closed. Channels are closed when readers are done with them, so at that point, we wouldn't be interested in the errors.

  • In RunWorkOrder, skip calling notifyListener on failure of the main work, in the case when stop() has already been called. The stop() method will set its own error using CanceledFault. This enables callers to detect when a stage was canceled vs. failed for some other reason.

  • In WorkerStageKernel, skip logging cancellation errors in fail(). This is made possible by the previous change in RunWorkOrder.

Stages can be instructed to exit before they finish, especially when a
downstream stage includes a "LIMIT". This patch has improvements related
to early-exiting stages.

Bug fix:

- WorkerStageKernel: Don't allow fail() to set an exception if the stage is
  already in a terminal state (FINISHED or FAILED). If fail() is called while
  in a terminal state, log the exception, then throw it away. If it's a
  cancellation exception, don't even log it. This fixes a bug where a stage
  that exited early could transition to FINISHED and then to FAILED, causing
  the overall query to fail.

Performance:

- DartWorkerManager previously sent stopWorker commands to workers
  even when "interrupt" was false. Now it only sends those commands when
  "interrupt" is true. The method javadoc already claimed this is what the
  method did, but the implementation did not match the javadoc. This reduces
  the number of RPCs by 1 per worker per query.

Quieter logging:

- In ReadableByteChunksFrameChannel, skip logging exception from setError if
  the channel has been closed. Channels are closed when readers are done with
  them, so at that point, we wouldn't be interested in the errors.

- In RunWorkOrder, skip calling notifyListener on failure of the main work,
  in the case when stop() has already been called. The stop() method will
  set its own error using CanceledFault. This enables callers to detect
  when a stage was canceled vs. failed for some other reason.

- In WorkerStageKernel, skip logging cancellation errors in fail(). This is
  made possible by the previous change in RunWorkOrder.
@gianm gianm added the Bug label Oct 3, 2024
@github-actions github-actions Bot added Area - Batch Ingestion Area - MSQ For multi stage queries - https://github.com/apache/druid/issues/12262 labels Oct 3, 2024
@gianm gianm added this to the 31.0.0 milestone Oct 3, 2024
@cryptoe cryptoe merged commit db7cc46 into apache:master Oct 3, 2024
kfaraz pushed a commit to kfaraz/druid that referenced this pull request Oct 5, 2024
Stages can be instructed to exit before they finish, especially when a
downstream stage includes a "LIMIT". This patch has improvements related
to early-exiting stages.

Bug fix:

- WorkerStageKernel: Don't allow fail() to set an exception if the stage is
  already in a terminal state (FINISHED or FAILED). If fail() is called while
  in a terminal state, log the exception, then throw it away. If it's a
  cancellation exception, don't even log it. This fixes a bug where a stage
  that exited early could transition to FINISHED and then to FAILED, causing
  the overall query to fail.

Performance:

- DartWorkerManager previously sent stopWorker commands to workers
  even when "interrupt" was false. Now it only sends those commands when
  "interrupt" is true. The method javadoc already claimed this is what the
  method did, but the implementation did not match the javadoc. This reduces
  the number of RPCs by 1 per worker per query.

Quieter logging:

- In ReadableByteChunksFrameChannel, skip logging exception from setError if
  the channel has been closed. Channels are closed when readers are done with
  them, so at that point, we wouldn't be interested in the errors.

- In RunWorkOrder, skip calling notifyListener on failure of the main work,
  in the case when stop() has already been called. The stop() method will
  set its own error using CanceledFault. This enables callers to detect
  when a stage was canceled vs. failed for some other reason.

- In WorkerStageKernel, skip logging cancellation errors in fail(). This is
  made possible by the previous change in RunWorkOrder.
kfaraz added a commit that referenced this pull request Oct 5, 2024
…) (#17256)

* MSQ: Properly report errors that occur when starting up RunWorkOrder. (#17069)
* Dart: Smoother handling of stage early-exit. (#17228)
---------
Co-authored-by: Gian Merlino <gianmerlino@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Area - Batch Ingestion Area - MSQ For multi stage queries - https://github.com/apache/druid/issues/12262 Bug

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants