Add QOS filtering of overlord requests #18033

Merged
gianm merged 13 commits into apache:master from cryptoe:overlord_heath_checks
Nov 19, 2025

Conversation

Contributor

@cryptoe cryptoe commented May 26, 2025

Adds QOS filtering for the overlord so that health check threads are not blocked.

@cryptoe cryptoe requested a review from kfaraz May 26, 2025 07:10
@cryptoe cryptoe force-pushed the overlord_heath_checks branch from 107aadb to 45deabe Compare May 26, 2025 07:13
Comment thread services/src/main/java/org/apache/druid/cli/CliOverlord.java Fixed
Comment thread services/src/test/java/org/apache/druid/cli/CliOverlordTest.java Fixed
Contributor

@kfaraz kfaraz left a comment

Thanks for the improvement, @cryptoe .
I have left some suggestions.

public class ServerConfig
{
public static final int DEFAULT_GZIP_INFLATE_BUFFER_SIZE = 4096;
public static final int DEFAULT_NUM_PACKING_THREADS = 30;
Contributor

What does "packing" signify here?

Contributor Author

It's the extra threads we add. I couldn't really find a better word.


threadPool.setDaemon(true);
jettyServerThreadPool = threadPool;
jettyServerThreadPool.setDaemon(true);
Contributor

The original code was more appropriate, since jettyServerThreadPool was assigned only after threadPool was fully configured.

Otherwise, the doMonitor method might emit erratic metrics.

The ideal fix here would be to make jettyServerThreadPool non-static, but that is not needed in this PR and it would require other cleanup too.
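
To illustrate the ordering concern, here is a minimal sketch of the "configure first, publish last" pattern. It is not the Druid or Jetty code: SketchThreadPool and PoolHolderSketch are hypothetical stand-ins, and the AtomicReference is an assumption made only to keep the example self-contained. It does not make the field non-static; it only shows a reader never observing a half-configured pool.

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical stand-in for a Jetty-style thread pool; not a real Jetty or Druid class.
final class SketchThreadPool
{
  private int maxThreads;
  private boolean daemon;

  void setMaxThreads(int maxThreads) { this.maxThreads = maxThreads; }
  void setDaemon(boolean daemon) { this.daemon = daemon; }
  int getMaxThreads() { return maxThreads; }
  boolean isDaemon() { return daemon; }
}

// Hypothetical holder: readers such as a metrics monitor see either null
// or a pool that has already been fully configured.
final class PoolHolderSketch
{
  private static final AtomicReference<SketchThreadPool> POOL = new AtomicReference<>();

  static void initialize(int maxThreads)
  {
    // Configure a local instance completely...
    final SketchThreadPool pool = new SketchThreadPool();
    pool.setMaxThreads(maxThreads);
    pool.setDaemon(true);
    // ...and only then make it visible to other threads.
    POOL.set(pool);
  }

  static int reportedMaxThreads()
  {
    final SketchThreadPool pool = POOL.get();
    return pool == null ? 0 : pool.getMaxThreads();
  }
}
```

A monitor that calls reportedMaxThreads() before initialize() gets 0 rather than a value read from a partially set-up pool.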

Contributor Author

@cryptoe cryptoe May 27, 2025

Fixed jettyServerThreadPool so that it is initialized in one go with the relevant threads.

The daemon setting should not affect the metrics emitted by the monitor.
Yes, there is a race here. Agreed that we can fix it in a separate PR.

Contributor

Are the changes to this file really needed in this PR?
I am not sure how it affects the main QoS change.

Comment on lines +511 to +512
final QueuedThreadPool queuedThreadPool = (QueuedThreadPool) server.getThreadPool();
final int maxThreads = queuedThreadPool.getMaxThreads();
Contributor

It might be nice to have the logic that computes the max threads factored out somewhere so that it can be reused here. It seems odd to need a handle to the actual ThreadPool object just to determine the max threads.

}
}

protected static boolean addQOSFiltering(ServletContextHandler root, int threadsForOvelordWork)
Contributor

Please add a short javadoc.

Suggested change
- protected static boolean addQOSFiltering(ServletContextHandler root, int threadsForOvelordWork)
+ protected static boolean addQosFiltering(ServletContextHandler root, int threadsForOvelordWork)

Comment on lines +471 to +472
* As QOS filtering is enabled on overlord requests, we need to update the QOS filter paths in
* {@link org.apache.druid.cli.CliOverlord#addQOSFiltering(ServletContextHandler, int)} when a new jersey resource is added.
Contributor

Suggested change
- * As QOS filtering is enabled on overlord requests, we need to update the QOS filter paths in
- * {@link org.apache.druid.cli.CliOverlord#addQOSFiltering(ServletContextHandler, int)} when a new jersey resource is added.
+ * Since QoS filtering is enabled for Overlord requests, update the QoS filter paths in
+ * {@link #addQosFiltering} whenever a new jersey resource is added here.

},
threadsForOvelordWork
);
JettyServerInitUtils.addFilters(root, Collections.singleton(filterHolder));
Contributor

If already bound using JettyBindings.addQosFilter in addOverlordJerseyResources, we wouldn't need to call this at all since JettyServerInitUtils.addQosFilters is already being called from OverlordJettyServerInitializer.initialize.

Comment on lines +578 to +584
JettyBindings.QosFilterHolder filterHolder = new JettyBindings.QosFilterHolder(
new String[]{
"/druid-internal/v1/*",
"/druid/indexer/v1/*"
},
threadsForOvelordWork
);
Contributor

Other modules seem to use JettyBindings.addQosFilter for this purpose. Please see comment on addOverlordJerseyResources method.

Comment on lines +589 to +591
"QOS filter is disabled for the overlord requests." +
"Set `druid.server.http.numThread` to a value greater than %d to enable QoSFilter.",
ServerConfig.DEFAULT_NUM_PACKING_THREADS
Contributor

Suggested change
- "QOS filter is disabled for the overlord requests." +
- "Set `druid.server.http.numThread` to a value greater than %d to enable QoSFilter.",
- ServerConfig.DEFAULT_NUM_PACKING_THREADS
+ "QoS filtering is disabled for Overlord requests. " +
+ "Set `druid.server.http.numThreads` to a value greater than [%d] to enable.",
+ ServerConfig.DEFAULT_NUM_PACKING_THREADS

Contributor Author

Thanks for the catch.

protected static boolean addQOSFiltering(ServletContextHandler root, int threadsForOvelordWork)
{
if (threadsForOvelordWork >= ServerConfig.DEFAULT_NUM_PACKING_THREADS) {
log.info("Enabling QOS filter on overlord requests with limit [%d].", threadsForOvelordWork);
Contributor

Suggested change
- log.info("Enabling QOS filter on overlord requests with limit [%d].", threadsForOvelordWork);
+ log.info("Enabling QoS filtering for Overlord requests with limit[%d].", threadsForOvelordWork);

* {@link org.apache.druid.cli.CliOverlord#addQOSFiltering(ServletContextHandler, int)} when a new jersey resource is added.
*/
private static class OverlordJettyServerInitializer implements JettyServerInitializer
private static void addOverlordJerseyResources(Binder binder)
Contributor

Going by what other modules such as LookupModule are doing, we should bind the qos filter in this method itself using JettyBindings.addQosFilter.

Contributor Author

Since we do not have the exact thread pool handy at binding time, we probably cannot do this.

Contributor Author

Okay, found a way to do this. Adjusting.

@github-actions

This pull request has been marked as stale due to 60 days of inactivity.
It will be closed in 4 weeks if no further activity occurs. If you think
that's incorrect or this pull request should instead be reviewed, please simply
write any comment. Even if closed, you can still revive the PR at any time or
discuss it on the dev@druid.apache.org list.
Thank you for your contributions.

@github-actions github-actions Bot added the stale label Jul 27, 2025
Contributor Author

cryptoe commented Jul 27, 2025

Will get to this soon.

@github-actions github-actions Bot removed the stale label Jul 28, 2025
@github-actions

This pull request has been marked as stale due to 60 days of inactivity.
It will be closed in 4 weeks if no further activity occurs. If you think
that's incorrect or this pull request should instead be reviewed, please simply
write any comment. Even if closed, you can still revive the PR at any time or
discuss it on the dev@druid.apache.org list.
Thank you for your contributions.

@github-actions github-actions Bot added the stale label Sep 27, 2025
@github-actions

This pull request/issue has been closed due to lack of activity. If you think that
is incorrect, or the pull request requires review, you can revive the PR at any time.

@github-actions github-actions Bot closed this Oct 25, 2025
Contributor Author

cryptoe commented Nov 17, 2025

reopen

@cryptoe cryptoe reopened this Nov 17, 2025
Comment thread services/src/main/java/org/apache/druid/cli/CliOverlord.java Fixed
@github-actions github-actions Bot removed the stale label Nov 18, 2025

final int threadsForOverlordWork = serverHttpNumThreads - THREADS_RESERVED_FOR_HEALTH_CHECK;

if (threadsForOverlordWork >= ServerConfig.DEFAULT_MIN_QOS_THRESHOLD) {
Contributor

@gianm gianm Nov 18, 2025

Why is this based on ServerConfig.DEFAULT_MIN_QOS_THRESHOLD (30)? If someone sets druid.server.http.numThreads = 25 then they won't have QoS; I don't understand why that is good.

Secondly, IMO it would be better to apply the QoS to the action API only. That way, if the OL is bogged down with handling actions, it can still respond to the APIs that the web console uses. Applying QoS to all OL APIs, like this PR currently does, would make it able to respond to health checks but not actually function in any other way. It seems to defeat the purpose of a health check.

So, how about having a config like druid.indexer.server.maxConcurrentActions where the default is based on serverHttpNumThreads? Perhaps max(1, serverHttpNumThreads - 4) or max(1, serverHttpNumThreads * 0.8). And allow admins to set it to druid.indexer.server.maxConcurrentActions = 0 which would disable the QoS.
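
To make the idea of limiting only the action API concrete, here is a minimal, library-free sketch. It is not the Jetty QoSFilter or the code in this PR: the class ActionQosSketch, its handle method, and the path prefix "/druid/indexer/v1/action" are assumptions chosen for illustration. It also rejects excess requests immediately, whereas a real servlet filter would more likely queue or suspend them.

```java
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

// Hypothetical sketch: cap concurrency for one path prefix and leave every
// other path (health checks, console APIs) unthrottled.
final class ActionQosSketch
{
  private final String limitedPrefix;
  private final Semaphore permits; // null means the limit is disabled

  ActionQosSketch(String limitedPrefix, int maxConcurrent)
  {
    this.limitedPrefix = limitedPrefix;
    // A limit of 0 disables QoS, mirroring the proposal above.
    this.permits = maxConcurrent > 0 ? new Semaphore(maxConcurrent) : null;
  }

  /** Runs the handler, or returns null if the path is limited and no permit is free. */
  <T> T handle(String path, Supplier<T> handler)
  {
    final boolean limited = permits != null && path.startsWith(limitedPrefix);
    if (!limited) {
      return handler.get();
    }
    if (!permits.tryAcquire()) {
      return null; // simplification: reject instead of queueing
    }
    try {
      return handler.get();
    }
    finally {
      permits.release();
    }
  }
}
```

With a limit of 1, a second action request that arrives while the first is still running is turned away, but a request to any other path still runs.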

Contributor Author

@cryptoe cryptoe Nov 19, 2025

I was optimizing the patch for larger clusters because that was my use case at the time, but on reflection you are correct: it should apply across the board.

Added a new config:
"druid.indexer.server.maxConcurrentActions"

Went with max(1, max(serverHttpNumThreads - 4, serverHttpNumThreads * 0.8)) so that large clusters are not impacted much.
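
For reference, the default above can be sketched as follows. This is an illustration, not the merged code: the class and method names are hypothetical, and rounding the 0.8 factor down (done here with integer arithmetic) is an assumption.

```java
// Hypothetical sketch of the default for druid.indexer.server.maxConcurrentActions.
final class MaxConcurrentActionsSketch
{
  static int defaultMaxConcurrentActions(int serverHttpNumThreads)
  {
    // Leave four threads free for non-action requests.
    final int leaveFourFree = serverHttpNumThreads - 4;
    // 80% of the pool, rounded down (assumption); integer math avoids floating-point surprises.
    final int eightyPercent = serverHttpNumThreads * 4 / 5;
    // Never go below one concurrent action.
    return Math.max(1, Math.max(leaveFourFree, eightyPercent));
  }
}
```

Under these assumptions, 25 threads give a limit of 21 and 100 threads give 96, while small pools fall back to the 80% term (10 threads give 8) and a single-thread pool still allows 1 action.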



final int serverHttpNumThreads = properties.containsKey(CliIndexerServerModule.SERVER_HTTP_NUM_THREADS_PROPERTY)
? Integer.parseInt(properties.getProperty(CliIndexerServerModule.SERVER_HTTP_NUM_THREADS_PROPERTY))

Check notice: Code scanning / CodeQL

Missing catch of NumberFormatException (Note): Potential uncaught 'java.lang.NumberFormatException'.

final int maxConcurrentActions;
if (properties.containsKey("druid.indexer.server.maxConcurrentActions")) {
maxConcurrentActions = Integer.parseInt(properties.getProperty("druid.indexer.server.maxConcurrentActions"));

Check notice: Code scanning / CodeQL

Missing catch of NumberFormatException (Note): Potential uncaught 'java.lang.NumberFormatException'.
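
One way to address this kind of notice is to parse the property through a small helper that turns a malformed value into a descriptive error. The sketch below is hypothetical and is not part of this PR: IntPropertySketch and getInt are illustrative names, and failing fast with an IllegalArgumentException is one possible policy among several.

```java
import java.util.Properties;

// Hypothetical helper: read an integer property with a default, and fail
// with a clear message when the configured value is not a valid integer.
final class IntPropertySketch
{
  static int getInt(Properties properties, String key, int defaultValue)
  {
    final String raw = properties.getProperty(key);
    if (raw == null) {
      return defaultValue;
    }
    try {
      return Integer.parseInt(raw.trim());
    }
    catch (NumberFormatException e) {
      // Name the offending key and value so a misconfiguration is easy to spot.
      throw new IllegalArgumentException(
          "Property [" + key + "] must be an integer, but was [" + raw + "]",
          e
      );
    }
  }
}
```

A call such as getInt(properties, "druid.indexer.server.maxConcurrentActions", fallback) then either returns a usable number or stops startup with an error naming the property.
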
@gianm gianm merged commit 713be4a into apache:master Nov 19, 2025
57 checks passed
@kgyrtkirk kgyrtkirk added this to the 36.0.0 milestone Jan 19, 2026

5 participants