
Conversation

@dlmarion (Contributor) commented Dec 3, 2024

Added two new properties that control how the background thread that resizes the Fate thread pool checks the Fate transaction queue size. The new logic logs a warning that the Fate thread pool size property should be increased if the number of transactions in the queue is consistently larger than some multiple of the Fate thread pool size.

@dlmarion added this to the 4.0.0 milestone Dec 3, 2024
@dlmarion self-assigned this Dec 3, 2024
@dlmarion (Contributor Author) commented Dec 3, 2024

This could easily be introduced in 2.1.4, I think, but it's not a bugfix. It could also easily be introduced in 3.1, but I'm not sure it's needed there since the major Fate changes are in 4.0. This PR is a compromise between my suggestion here and @keith-turner's response.

@dlmarion requested a review from keith-turner December 6, 2024 14:10
}
}
}
queueSizeHistory.add(workQueue.size());
@keith-turner (Contributor) commented Dec 6, 2024


Looking at how things work, this queue size may always be zero (or something low; not sure what it will actually be). When this code places something on the queue it calls tryTransfer, which should try to hand the item directly to a waiting thread and avoid queuing.

The threads that run fate ops are in an infinite loop and just poll the transfer queue. We could try to count the number of threads active in this block of code and use that to detect how busy fate is for this check. Could maybe use an atomic integer and leverage the existing try/finally in the code for the counting.

This reminds me of another problem I ran into related to this code. A while ago I was trying to use the fate thread pool metrics to understand what was going on with fate and was not getting anything useful out of them. Looking into it, I determined that the metrics were not useful because of the way this code works, putting tasks on the pool that loop forever. If we added a counter of active fate threads for this PR, it may be useful in a follow-on PR to turn off metrics for the fate thread pool and add a gauge that exposes this counter. That metric would be useful: we could see how many fate threads are busy at any given time. We cannot get that info from the current metrics.
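
A minimal sketch of that counting idea, assuming hypothetical class and method names rather than the actual Fate code: the existing try/finally around each fate op would keep an AtomicInteger accurate even when the op throws, and a follow-on PR could expose it as a gauge.

import java.util.concurrent.atomic.AtomicInteger;

// Sketch only; names here are illustrative, not the PR's code.
public class FateThreadActivity {
  private final AtomicInteger activeFateThreads = new AtomicInteger();

  // Wraps the body that runs a single fate op, mirroring the existing try/finally.
  public void runFateOp(Runnable fateOp) {
    activeFateThreads.incrementAndGet();
    try {
      fateOp.run();
    } finally {
      activeFateThreads.decrementAndGet();
    }
  }

  // Could back a gauge metric showing how many fate threads are busy right now.
  public int getActiveCount() {
    return activeFateThreads.get();
  }
}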

@dlmarion (Contributor Author)


I forgot that we changed the Fate implementation to use a LinkedTransferQueue and that the WorkFinder threads only put things on the queue when there is an available consumer thread. I could invert this check: instead of checking that the queue size is larger than the number of threads, I could use getWaitingConsumerCount. If the value is always zero, then log the warning to increase the threads.
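
A rough sketch of that inverted check, with illustrative field and method names and a placeholder element type standing in for FateId: getWaitingConsumerCount() reports how many consumers are blocked waiting on the transfer queue, so a zero reading means no fate thread was idle at that instant.

import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.LinkedTransferQueue;

// Sketch only; names are illustrative, not the PR's code.
public class IdleFateThreadSampler {
  private final LinkedTransferQueue<Long> workQueue = new LinkedTransferQueue<>();
  private final ConcurrentLinkedQueue<Integer> idleCountHistory = new ConcurrentLinkedQueue<>();

  // Called periodically by the background thread that resizes the pool.
  public void sample() {
    // Zero means every fate thread was busy at this sample point.
    idleCountHistory.add(workQueue.getWaitingConsumerCount());
  }
}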

@keith-turner (Contributor)

I could use getWaitingConsumerCount. If the value is always zero, then log the warning to increase the threads.

That sounds good. That is neat, that gives us the idle fate thread count w/o doing any work.

@dlmarion (Contributor Author)


That sounds good.

Updated in d1fdf91.

that gives us the idle fate thread count w/o doing any work.

Yeah, but it's just a point-in-time approximation. If the property is 60m and it's zero for 120 intervals, then log the warning. I feel like that might be a good solution.
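
One way the interval bookkeeping could look, sketched with hypothetical names and an assumed fixed sample interval; the actual property names and defaults come from the PR, not this example.

import java.time.Duration;
import java.util.ArrayDeque;
import java.util.Deque;

// Sketch only; a 60m warning window sampled every 30s would keep 120 samples.
public class IdleHistoryWindow {
  private final int maxSamples;
  private final Deque<Integer> idleCountHistory = new ArrayDeque<>();

  public IdleHistoryWindow(Duration warnWindow, Duration sampleInterval) {
    this.maxSamples = (int) (warnWindow.toMillis() / sampleInterval.toMillis());
  }

  public void record(int idleConsumers) {
    idleCountHistory.addLast(idleConsumers);
    while (idleCountHistory.size() > maxSamples) {
      idleCountHistory.removeFirst(); // retain only the most recent window
    }
  }

  // Warn only once a full window of samples saw zero idle fate threads.
  public boolean shouldWarn() {
    return idleCountHistory.size() >= maxSamples
        && idleCountHistory.stream().allMatch(i -> i == 0);
  }
}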

@dlmarion requested a review from keith-turner December 6, 2024 17:06
private final AtomicBoolean keepRunning = new AtomicBoolean(true);
private final TransferQueue<FateId> workQueue;
private final Thread workFinder;
private final ConcurrentLinkedQueue<Integer> queueSizeHistory = new ConcurrentLinkedQueue<>();
@keith-turner (Contributor)


Suggested change:
- private final ConcurrentLinkedQueue<Integer> queueSizeHistory = new ConcurrentLinkedQueue<>();
+ private final ConcurrentLinkedQueue<Integer> idleCountHistory = new ConcurrentLinkedQueue<>();

@dlmarion (Contributor Author)


Implemented in cf88581

boolean needMoreThreads = true;
for (Integer i : queueSizeHistory) {
  if (i > 0) {
    needMoreThreads = false;
@keith-turner (Contributor)


On a system that is consistently busy, sampling a lot will probably catch this as non-zero a few times, just catching threads that are about to get work. Could count the non-zero samples and base needMoreThreads on seeing 95% of sample events being zero.
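
A possible shape for that tolerance, again with illustrative names: treat the pool as undersized only when enough samples exist and at least 95% of them recorded zero idle threads.

import java.util.Collection;

// Sketch only; the 95% figure mirrors the suggestion above, not the committed code.
static boolean needMoreThreads(Collection<Integer> idleCountHistory, int requiredSamples) {
  if (idleCountHistory.size() < requiredSamples) {
    return false; // not enough history yet to decide
  }
  long zeroSamples = idleCountHistory.stream().filter(i -> i == 0).count();
  return (double) zeroSamples / idleCountHistory.size() >= 0.95;
}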

@dlmarion (Contributor Author)


Implemented in cf88581

@dlmarion merged commit e69d8b3 into apache:main Dec 6, 2024
8 checks passed
@dlmarion deleted the fate-queue-check-warning branch December 6, 2024 21:01