Remove stray reference to fix OOM while merging sketches#13475
Remove stray reference to fix OOM while merging sketches#13475cryptoe merged 10 commits intoapache:masterfrom
Conversation
rohangarg
left a comment
There was a problem hiding this comment.
Is this fix UT testable? Or if there already exists a UT for this, can you please point me to it? Further, please run the functional test on this change which led to the discovery of this bug.
| })); | ||
| }); | ||
|
|
||
| partitionFuture.whenComplete((result, exception) -> { |
There was a problem hiding this comment.
in a corner case, it could happen that partitionFuture finishes before all futures are queued up in the executor. To avoid that one simple solution could be to have a CountDownLatch with value as workerCount. The whenComplete would then wait for latch to become 0 before cancelling all the futures.
Maybe we could also have a better solution with cancellation - will update on this thread if I find any utility.
There was a problem hiding this comment.
Is this a failsafe condition or would this be required? From my understanding, the successful completion of partitionFuture depends on the completion of the underlying futures, and in case of unsuccessful completion, we shouldn't have any issue with prematurely cancelling any underlying future. I might have missed something though.
There was a problem hiding this comment.
I also have a question in addition to Laksh's point, since we are attaching this callback in the same thread after queueing up the futures, would this trigger without all the futures present anyway?
There was a problem hiding this comment.
Yes, agree that since attaching this callback is in the same thread after queuing - this shouldn't trigger without futures. Although, that raises the question that if the future finishes before this callback is attached, will this callback execute while attaching it? if so, will that callback execution happen on the same thread attaching it?
There was a problem hiding this comment.
Also, is it necessary to cancel the pending/running futures? It would be incase we'll be retrying the stats collection stage - otherwise if we're going to fail the whole job, it is not necessary to fail them
There was a problem hiding this comment.
Yes, the future that is added as a callback should still get executed if the completable future is already finished executing, though the javadoc doesn't guarantee which thread would be executing it if an executor is also not passed as an argument.
It is also not a hard necessity that we cancel every pending future but it depends on number of workers which could be up to a 1000, which would waste a lot of time, but correctness would still be present.
Also, with regard to retrying, part of this will be updated with the worker retry PR, to only fetch the statistics which are needed even if a worker restarts, so after that we would be retrying the stats collection.
There was a problem hiding this comment.
Checked the code, I think it would be the thread which is attaching the callback which runs it incase the future is already finished.
Oh ok - thanks for the context! My point was more towards the fact that if we're going to fail the whole job and throw the whole executor on any failure, then we could skip the whole cancellation as well - but from what you told seems like it is not the case in the retry logic.
|
There are unit tests in WorkerSketchFetcherTest which test if tasks would be cancelled, but there aren't any tests to see if any references to large objects like key stats are leaked. I suppose we could test something like that with WeakReference? I have reproduced the bug and functionally tested this out locally, and these changes seems to fix the bug. |
| latch.await(); | ||
| return snapshotListenableFuture; | ||
| return Futures.immediateFuture(mock(ClusterByStatisticsSnapshot.class)); | ||
| }).when(workerClient).fetchClusterByStatisticsSnapshot(any(), any(), anyInt()); |
There was a problem hiding this comment.
- shouldn't it be
eq("1")in the first argument, since below we're adding a different mock foreq("2")? - Can we return a custom future from here which is never done and also captures the
cancelinvocation to increment the cancel counter - that could remove the need for spying executor service and creating new testing constructors for fetcher? - Why are we doing
latch.await()in every thread and using sleep in the main test thread? Didn't understand that part
There was a problem hiding this comment.
Changed the answers to be more concise and clear instead of layering mocks.
I didn't quite understand the second point, if the executorService future is cancelled, it does not call cancel on the custom future, since it is unaware of it entirely. It instead throws an InterruptedException if the task has already started (which is find and handled properly in our code), however, this our custom mock would not know that it was interrupted.
The sleep in the main thread was to resolve a rare test failure where the asserts were happening before the futures could be cancelled, causing the test to fail prematurely.
There was a problem hiding this comment.
Ah, yes I missed that this method is for a part of the computation and not the whole computation itself.
Although the same concept can be still applied and we can make the computation in this mocking long timed and wait for an interruption arisen from the cancel. I tried the following approach locally : c536a3f
The goal of the above is to minimize mocking/spying from the attributes of the class being UT-ed - but I'm ok with the any of the approaches for now since this change is planned for release 25.0
I think doing the functional test is ok for now, maybe in future one way to test it could be to generate multiple small sketches in a test and write them to file. Then we can setup the fetcher for those sketches and check if the fetcher runs stably - it will automatically OOM incase it ends up taking a lot of memory. |
|
UT's and IT's passed for MSQ. As the worker retry PR #13353 is redoing |
* Remove stray reference to fix OOM while merging sketches * Update future to add result from executor service * Update tests and address review comments * Address review comments * Moved mock * Close threadpool on teardown * Remove worker task cancel
…3529) * Remove stray reference to fix OOM while merging sketches * Update future to add result from executor service * Update tests and address review comments * Address review comments * Moved mock * Close threadpool on teardown * Remove worker task cancel
This fixes a potential OOM issue introduced in #13205. While fetching sketches, a reference to the future sketch is kept to cancel them, in case one of the fetches fail. However, this results in a reference to the sketch even after the merge is successful. This PR changes this to a set of futures from which completed futures are removed so that these sketches can be garbage collected.
This PR has: