Change default handoffConditionTimeout to 15 minutes. #14539

Merged
gianm merged 4 commits into apache:master from gianm:conf-default-hct
Jul 13, 2023
Contributor

@gianm gianm commented Jul 6, 2023

Most of the time, when handoff is taking this long (15 minutes or more), it's because something is preventing Historicals from loading new data. In this case, we have two choices:

  1. Stop making progress on ingestion, wait for Historicals to load stuff,
    and keep the waiting-for-handoff segments available on realtime tasks.
    (handoffConditionTimeout = 0, the current default)

  2. Continue making progress on ingestion, by exiting the realtime tasks
    that were waiting for handoff. Once the Historicals get their act
    together, the segments will be loaded, as they are still there on
    deep storage. They will just not be continuously available.
    (handoffConditionTimeout > 0)

I believe most users would prefer [2], because [1] risks ingestion falling behind the stream, which causes many other problems. It can even cause data loss if the stream ages out data before we have a chance to ingest it.
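To make the two choices concrete, here is a rough sketch of the waiting behavior on a simulated clock. It is only an illustration, not Druid's actual task code: the function name, the polling interval, and the simulation horizon are all made up for the example, and it assumes the timeout is expressed in milliseconds with 0 meaning "wait indefinitely".

```python
from typing import Optional, Tuple

FIFTEEN_MINUTES_MS = 15 * 60 * 1000   # the new default proposed here
DAY_MS = 24 * 60 * 60 * 1000          # hypothetical simulation horizon


def simulate_handoff_wait(
    handoff_ready_at_ms: Optional[int],
    handoff_condition_timeout_ms: int,
    horizon_ms: int = DAY_MS,
    poll_ms: int = 60_000,
) -> Tuple[str, int]:
    """Simulate a realtime task waiting for Historicals to load its segments.

    handoff_ready_at_ms: simulated time at which Historicals finish loading,
        or None if they never do within the horizon.
    handoff_condition_timeout_ms: 0 means wait indefinitely (choice 1);
        a positive value means give up waiting after that long (choice 2).
    Returns (outcome, elapsed_ms).
    """
    elapsed_ms = 0
    while elapsed_ms <= horizon_ms:
        if handoff_ready_at_ms is not None and elapsed_ms >= handoff_ready_at_ms:
            return ("handed_off", elapsed_ms)
        if 0 < handoff_condition_timeout_ms <= elapsed_ms:
            # The task exits; segments remain on deep storage for later loading.
            return ("timed_out", elapsed_ms)
        elapsed_ms += poll_ms
    # Only reachable when the timeout is 0 (or longer than the horizon).
    return ("still_waiting", horizon_ms)


# Choice 1: Historicals are stuck and the task never stops waiting.
stuck_forever = simulate_handoff_wait(None, 0)
# Choice 2: Historicals are stuck, but the task exits after 15 minutes.
stuck_with_timeout = simulate_handoff_wait(None, FIFTEEN_MINUTES_MS)
```

In the first case the task is still holding its slot at the end of the simulated day, which is the situation where ingestion can fall behind the stream. In the second it exits after 15 minutes and ingestion can continue.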

Due to the way tuningConfigs are serialized -- defaults are baked into the serialized form that is written to the database -- this default change will not change anyone's existing supervisors. It will take effect for newly created supervisors.
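The effect of baking defaults into the serialized form can be modeled roughly as follows. This is a simplified sketch with plain dictionaries and JSON, not the real serialization code; the helper name and the two defaults tables are hypothetical, and the 900000 value assumes the timeout is stored in milliseconds.

```python
import json

# Hypothetical defaults tables standing in for the code before and after this change.
OLD_DEFAULTS = {"handoffConditionTimeout": 0}
NEW_DEFAULTS = {"handoffConditionTimeout": 900_000}  # 15 minutes in ms


def serialize_tuning_config(user_spec: dict, code_defaults: dict) -> str:
    """Fill in any missing fields from the defaults, then serialize everything.

    Because the filled-in defaults are written out, the stored form no longer
    records whether a value came from the user or from the defaults.
    """
    merged = {**code_defaults, **user_spec}
    return json.dumps(merged, sort_keys=True)


# A supervisor created before the change, with no explicit timeout.
stored_in_db = serialize_tuning_config({}, OLD_DEFAULTS)

# After upgrading, the stored spec is read back; the old value is already present,
# so the new default is never applied to it.
existing_after_upgrade = json.loads(
    serialize_tuning_config(json.loads(stored_in_db), NEW_DEFAULTS)
)

# A supervisor created after the change picks up the new default.
new_supervisor = json.loads(serialize_tuning_config({}, NEW_DEFAULTS))
```

Under this model, anyone who wants the new behavior on an existing supervisor would need to resubmit the spec with `handoffConditionTimeout` set explicitly.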

@abhishekagarwal87
Contributor

After thinking a bit about this change and discussing it with @kfaraz, I think it is probably OK to make it. Initially, I was worried about this leading to holes in the data and about recovery. But since only the handoff is affected, those holes will be temporary, because the segment has already been created.

Comment thread docs/development/extensions-core/kafka-supervisor-reference.md Outdated
Comment thread docs/development/extensions-core/kinesis-ingestion.md Outdated
Contributor

@ektravel ektravel left a comment

A couple of minor nits.

gianm and others added 2 commits July 13, 2023 07:58
Co-authored-by: Katya Macedo <38017980+ektravel@users.noreply.github.com>
Co-authored-by: Katya Macedo <38017980+ektravel@users.noreply.github.com>
@gianm gianm merged commit 95ca430 into apache:master Jul 13, 2023
@gianm gianm deleted the conf-default-hct branch July 13, 2023 20:17
sergioferragut pushed a commit to sergioferragut/druid that referenced this pull request Jul 21, 2023
* Change default handoffConditionTimeout to 15 minutes.

* Fix tests.

* Update docs/development/extensions-core/kafka-supervisor-reference.md

Co-authored-by: Katya Macedo <38017980+ektravel@users.noreply.github.com>

* Update docs/development/extensions-core/kinesis-ingestion.md

Co-authored-by: Katya Macedo <38017980+ektravel@users.noreply.github.com>

---------

Co-authored-by: Katya Macedo <38017980+ektravel@users.noreply.github.com>
@LakshSingla LakshSingla added this to the 28.0 milestone Oct 12, 2023