-
Notifications
You must be signed in to change notification settings - Fork 478
Enable check of Fate queue size for recommending property change #5129
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Added two new properties that control the checking of the Fate transaction queue size in the background thread that resizes the Fate thread pool. This new logic will print a warning in the log that the Fate thread pool size property should be increased if the number of transactions in the queue is consistently larger than some multiple of the Fate transaction thread pool size.
|
This could easily be introduced in 2.1.4 I think, but it's not a bugfix. It could also be easily introduced in 3.1, but I'm not sure it's needed there as the major Fate changes are in 4.0. This PR is a compromise of my suggestion here and @keith-turner's response. |
| } | ||
| } | ||
| } | ||
| queueSizeHistory.add(workQueue.size()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking at how things work this queue size may always be zero (or something low, not sure what it will actually be). When this code places something on the queue it calls tryTransfer which should try to send something directly to a waiting thread and avoid queuing.
The threads that run fate ops are in an infinite loop and just poll the transfer queue. We could try to count the number of threads active in this block of code and use that to detect how busy fate is for this check. Could maybe use an atomic integer and leverage the existing try/finally in the code for the counting.
This reminds of another problem I ran into related to this code. A bit ago I was trying to use the fate thread pool metrics to understand what was going on with fate and was not getting anything useful out of those metrics. Looking into it determined that because the way this code works and puts task in an infinite loop on the pool that was the reason the metrics were not useful. If we added a counter of active fate threads for this PR it may be useful in follow on PR to turn off merics for the fate thread pool and add a gauge that exposes this counter. That metrics would be useful, could see how many fate threads are busy at any given time. Can not get that info from the current metrics.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I forgot that we changed the Fate implementation to used a LinkedTransferQueue and that the WorkFinder threads only put things on the queue when there is an available consumer thread. I could invert this check, instead of checking that the queue size is larger than the number of threads, I could use getWaitingConsumerCount. If the value is always zero, then log the warning to increase the threads.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I could use getWaitingConsumerCount. If the value is always zero, then log the warning to increase the threads.
That sounds good. That is neat, that gives us the idle fate thread count w/o doing any work.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That sounds good.
Updated in d1fdf91.
that gives us the idle fate thread count w/o doing any work.
yeah, but it's just a point in time approximation. If the property is 60m, then if it's zero for 120 intervals, then log the warning. I feel like that might be a good solution.
| private final AtomicBoolean keepRunning = new AtomicBoolean(true); | ||
| private final TransferQueue<FateId> workQueue; | ||
| private final Thread workFinder; | ||
| private final ConcurrentLinkedQueue<Integer> queueSizeHistory = new ConcurrentLinkedQueue<>(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| private final ConcurrentLinkedQueue<Integer> queueSizeHistory = new ConcurrentLinkedQueue<>(); | |
| private final ConcurrentLinkedQueue<Integer> idleCountHistory = new ConcurrentLinkedQueue<>(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Implemented in cf88581
| boolean needMoreThreads = true; | ||
| for (Integer i : queueSizeHistory) { | ||
| if (i > 0) { | ||
| needMoreThreads = false; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On a system that is consistently busy if sampling alot will probably catch this as non zero a few times, just catching threads that are just about to get work. Could count the non-zero and make needMoreThreads based on seeing 95% of sample events being zero.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Implemented in cf88581
Co-authored-by: Keith Turner <kturner@apache.org>
Added two new properties that control the checking of the Fate transaction queue size in the background thread that resizes the Fate thread pool. This new logic will print a warning in the log that the Fate thread pool size property should be increased if the number of transactions in the queue is consistently larger than some multiple of the Fate transaction thread pool size.