Skip to content

Change default segment loading to http#11760

Closed
Caroline1000 wants to merge 3 commits intoapache:masterfrom
Caroline1000:segment-loading20210929
Closed

Change default segment loading to http#11760
Caroline1000 wants to merge 3 commits intoapache:masterfrom
Caroline1000:segment-loading20210929

Conversation

@Caroline1000
Copy link
Copy Markdown
Contributor

@Caroline1000 Caroline1000 commented Sep 29, 2021

Description

We have observed more stability with http segment loading than curator segment loading in production clusters. For example, we have observed that problems with zookeeper can lead to the inability to query realtime data.

This PR has:

  • [X ] added documentation for new or modified features or behaviors.
  • [X ] been tested in a test Druid cluster.

Copy link
Copy Markdown
Contributor

@kfaraz kfaraz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the PR, @Caroline1000 !
Please add a description that explains the change and the reasons involved and update DruidCoordinatorConfigTest to verify the new default value.

@samarthjain
Copy link
Copy Markdown
Contributor

Is HTTP based loading ready for prime time? I am curious about any at-scale testing that has been done to verify HTTP based loading is performing as expected. Also, whether all major functional issues with it are fixed before we make it the default. I see at least one open bug right now.

@Caroline1000
Copy link
Copy Markdown
Contributor Author

@samarthjain +1 on fixing #11717. If I'm not mistaken, that issue was first observed when multiple load rules were changed across different tiers, so hopefully that makes the bug less likely to run into(?)

fwiw, I have seen http segment loading work without issue in many production environments (and actually have seen many problems related to curator loading)

@stale
Copy link
Copy Markdown

stale Bot commented Apr 16, 2022

This pull request has been marked as stale due to 60 days of inactivity. It will be closed in 4 weeks if no further activity occurs. If you think that's incorrect or this pull request should instead be reviewed, please simply write any comment. Even if closed, you can still revive the PR at any time or discuss it on the dev@druid.apache.org list. Thank you for your contributions.

@stale stale Bot added the stale label Apr 16, 2022
@didip
Copy link
Copy Markdown
Contributor

didip commented Jun 18, 2022

I am not sure if http is ready for prime time, the problem with http arises when jetty http server runs low on threads.

@stale
Copy link
Copy Markdown

stale Bot commented Jun 18, 2022

This issue is no longer marked as stale.

@stale stale Bot removed the stale label Jun 18, 2022
@imply-cheddar
Copy link
Copy Markdown
Contributor

ZK segment loading is broken right now. As of ~2 years ago, a PR was merged that breaks the order of segment loading and dropping via ZK, such that the assignment can enter into deadlocks when a cluster is mostly full. This wasn't widely an issue (personally, I only learned about it ~6 months ago) because the largest clusters (at least that I'm aware of) have all been using http segment assignment.

#11717 has been merged. While it is and was a bug, it was a corner case that we've only seen in development environments and never actually saw it in a production environment. Every cluster I touch, I move from ZK assignment to HTTP assignment because my experience is that HTTP assignment is more stable. I'm +1 on this directionally, but the PR does need the tests fixed as Kashif suggested before it can be approved.

@didip
Copy link
Copy Markdown
Contributor

didip commented Jun 23, 2022

Also, anyone else have this problem with http loading where Coordinator somehow cached the old Historical's IP addresses?

We saw this often in our Kubernetes deployments.

@kfaraz
Copy link
Copy Markdown
Contributor

kfaraz commented Jun 24, 2022

@didip , please create an issue for the http loading discrepancies if you have been facing them recently.

@capistrant
Copy link
Copy Markdown
Contributor

I am not sure if http is ready for prime time, the problem with http arises when jetty http server runs low on threads.

@didip

Isn't this addressed with a combination of:

  • using async IO on historical side to avoid holding threads while work is being done
  • the explicit recommendation of setting jetty thread count above aggregate count of connections from query servers (broker) to avoid queries consuming all jetty threads

Or are you saying there is a risk of exhaustion on the outgoing side from the coordinator?

@Caroline1000
Copy link
Copy Markdown
Contributor Author

closing now that #13092 has been merged

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants