Skip to content

reset keySerde when closing groupers to clear out heap dictionaries#16114

Merged
LakshSingla merged 1 commit intoapache:masterfrom
clintropolis:fix-leaky-spiller
Mar 13, 2024
Merged

reset keySerde when closing groupers to clear out heap dictionaries#16114
LakshSingla merged 1 commit intoapache:masterfrom
clintropolis:fix-leaky-spiller

Conversation

@clintropolis
Copy link
Copy Markdown
Member

Description

ConcurrentGrouper kind of misuses ThreadLocal to hold a SpillingGrouper, and never calls remove() on it, which can result in large amounts of heap being retained as weak references even after grouping is finished.

Its kind of difficult to rework this such that we use ThreadLocal and actually call remove(), since the ConcurrentGrouper is created on the 'qtp' threads, but the ThreadLocal is set from the processing threads, so removing them from the ThreadLocalMap of the processing threads is a bit tricksy.

Poking around in the debugger sort of shows the problem, with several 'closed' SpillingGrouper present in the ThreadLocalMap of processing threads:
Screenshot 2024-03-12 at 9 46 50 PM

The ThreadLocalMap stores these all as WeakReference, so they should eventually be reclaimed, but it might take several GC cycles, so OOM can still occur.

Rather than fixing usage of ThreadLocal to call remove, a much lower budget way to make this less painful is for Grouper.close() to more aggressively free stuff up, since ConcurrentGrouper does call close on all of the associated SpillingGrouper, including those stored in the ThreadLocalMap. The main thing using heap as far as I can tell is the dictionaries built by the KeySerde, so calling keySerde.reset() on all of the Grouper.close() implementations which have a KeySerde should free up a bunch of space that is no longer needed.

The RowBasedKeySerde implementation of reset was missing clearing out the array dictionaries, so I also added that.

I couldn't think of an easy way to write tests for this because its kind of stuffed down pretty deep, but if anyone has any ideas that aren't a ton of work I'm happy to add them.


This PR has:

  • been self-reviewed.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • been tested in a test Druid cluster.

@abhishekagarwal87
Copy link
Copy Markdown
Contributor

Nice find @clintropolis. Wouldn't JVM claim those weak references more aggressively under memory pressure?

@clintropolis
Copy link
Copy Markdown
Member Author

Nice find @clintropolis. Wouldn't JVM claim those weak references more aggressively under memory pressure?

So i think it often does get reclaimed or else this would probably be a much bigger problem, but as I understand it can still take multiple cycles, so I suspect it is possible in really tight cases that the oom can still occur. It seems .. complicated https://stackoverflow.com/questions/17104452/threadlocal-garbage-collection

@LakshSingla
Copy link
Copy Markdown
Contributor

LakshSingla commented Mar 13, 2024

so removing them from the ThreadLocalMap of the processing threads is a bit tricksy.

Can something like this be done:

  1. ConcurrentGrouper passes a Closeable to the SpillingGrouper: () -> threadLocalGrouper.remove();
  2. This closer gets invoked in the close() of the SpillingGrouper.

In this case,I suspect the qtp thread's entry should be removed 🤔

@clintropolis
Copy link
Copy Markdown
Member Author

clintropolis commented Mar 13, 2024

Can something like this be done:

  1. ConcurrentGrouper passes a Closeable to the SpillingGrouper: () -> threadLocalGrouper.remove();
  2. This closer gets invoked in the close() of the SpillingGrouper.

In this case,I suspect the qtp thread's entry should be removed 🤔

That's part of the problem, it isn't the qtp thread, its the processing pool thread that has the SpillingGrouper on the ThreadLocalMap, but the qtp thread is what makes the stuff and is what is is calling close on the ConcurrentGrouper

@LakshSingla LakshSingla merged commit aa2959b into apache:master Mar 13, 2024
@clintropolis clintropolis deleted the fix-leaky-spiller branch March 13, 2024 09:40
@LakshSingla
Copy link
Copy Markdown
Contributor

LakshSingla commented Mar 13, 2024

Thanks for the PR and the explanation!

@adarshsanjeev adarshsanjeev added this to the 30.0.0 milestone May 6, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants