Eagerly unmap during resize #7540

Closed
shikhar wants to merge 1 commit into apache:trunk from shikhar:patch-1

Contributor

shikhar commented Oct 17, 2019

The mmap field gets re-initialized here, so the old object will become garbage-collectible. We should be eagerly unmapping here too, not just for Windows support.
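
To make the idea concrete, here is a minimal sketch of a resize that releases the old mapping eagerly before remapping. It is a Python analog, not Kafka's code: the `Index` class, its fields, and the file name are hypothetical, and `mmap.close()` stands in for a forced unmap on the JVM.

```python
import mmap
import os
import tempfile


class Index:
    """Hypothetical stand-in for a memory-mapped index file (not Kafka's class)."""

    def __init__(self, path, size):
        self._file = open(path, "w+b")
        self._file.truncate(size)
        self.mmap = mmap.mmap(self._file.fileno(), size)

    def resize(self, new_size):
        # Flush and release the old mapping now, rather than leaving it
        # for the garbage collector once the field is reassigned.
        self.mmap.flush()
        self.mmap.close()
        self._file.truncate(new_size)
        self.mmap = mmap.mmap(self._file.fileno(), new_size)

    def close(self):
        self.mmap.close()
        self._file.close()


# Usage: shrink the file and confirm the old mapping is already gone.
path = os.path.join(tempfile.mkdtemp(), "example.index")
index = Index(path, 4096)
index.mmap[:4] = b"abcd"
old_mapping = index.mmap
index.resize(1024)
```

After `resize` returns, `old_mapping.closed` is true, so nothing is left pinning the previous mapping until some later collection cycle.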

Contributor Author

shikhar commented Oct 17, 2019

cc @ijuma since you were involved in such PRs

The motivation for this change is that we are seeing pauses on the brokers that look a lot like https://issues.apache.org/jira/browse/KAFKA-4614. That issue has a fix version of 0.10.2.0 (the PR was #2352). I also saw that you made an improvement in #5757, which has been in Kafka since 2.2.

We are on Kafka 2.2.

Contributor Author

shikhar commented Oct 18, 2019

After further investigation today, it looks like the pauses are not just during GC but also at other safepoints: time-to-safepoint is high, which seems like something that can easily bite with memory-mapped IO and page faults to disks in cloud environments.

Interesting threads
https://groups.google.com/forum/m/#!msg/mechanical-sympathy/htQ3Rc1JEKk/ThuVpe5kBgAJ
https://groups.google.com/forum/#!msg/mechanical-sympathy/tepoA7PRFRU/wyKeIyCjBwAJ

UPDATE: For reference, our safepoint issues and the high IO throttling to cloud disks went away with tweaks to the virtual memory sysctls:

vm.dirty_ratio=60
vm.dirty_background_ratio=5
vm.dirty_expire_centisecs=500
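
As a rough illustration of what these three values mean, the sketch below turns them into byte thresholds and a page age. The function name and the 32 GiB figure are assumptions for the example only. The kernel applies the ratios to its own notion of available memory (free plus reclaimable pages), not to total RAM, so real thresholds will differ.

```python
def writeback_thresholds(available_bytes, dirty_ratio,
                         dirty_background_ratio, dirty_expire_centisecs):
    """Translate the vm.dirty_* settings into approximate, readable numbers."""
    return {
        # Background flusher threads start writing dirty pages out here.
        "background_flush_starts_at_bytes":
            available_bytes * dirty_background_ratio // 100,
        # A process generating writes starts writing out dirty data itself here.
        "writers_throttled_at_bytes":
            available_bytes * dirty_ratio // 100,
        # Dirty pages older than this become eligible for writeout.
        "dirty_page_max_age_seconds": dirty_expire_centisecs / 100,
    }


GIB = 1024 ** 3
# Hypothetical machine with 32 GiB of available memory.
thresholds = writeback_thresholds(32 * GIB, dirty_ratio=60,
                                  dirty_background_ratio=5,
                                  dirty_expire_centisecs=500)
```

With these settings, background writeback starts at a small fraction of memory and dirty pages are flushed after about five seconds, while writers are only throttled once dirty data reaches a much higher share.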


mlex commented Jul 10, 2020

We are running into a similar issue and see lots of references to already deleted index files held by the Kafka process. Is there a reason why safeForceUnmap shouldn't be called inside the resize method? With the current code, trimToValidSize will always leave behind an mmap reference that's not unmapped (and will be dealt with by GC, which can lead to exactly the problem described in https://issues.apache.org/jira/browse/KAFKA-4614).

Member

ijuma commented Jul 17, 2021

The challenge is that we don't lock on operations like lookup on Linux. So, we have to ensure no reference to mmap is held before we can unmap.
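
The hazard can be sketched with a Python analog (all names here are hypothetical, and this is not Kafka's code). A reader that still holds the old mapping after an eager unmap fails, while readers that go through a shared lock always see a live mapping. In Python the stale access raises `ValueError`; on the JVM, touching an unmapped buffer can crash the process, which is why the reference must be provably unused first.

```python
import mmap
import threading


class LockedIndex:
    """Hypothetical index whose lookups and resizes share one lock."""

    def __init__(self, size):
        self._lock = threading.Lock()
        self._mmap = mmap.mmap(-1, size)  # anonymous mapping, no file needed

    def lookup(self, pos):
        # Holding the lock guarantees the mapping is not unmapped mid-read.
        with self._lock:
            return self._mmap[pos]

    def resize(self, new_size):
        with self._lock:
            old = self._mmap
            new = mmap.mmap(-1, new_size)
            n = min(len(old), new_size)
            new[:n] = old[:n]
            old.close()  # eager unmap, safe only because readers are locked out
            self._mmap = new


idx = LockedIndex(16)
idx._mmap[3] = 7
stale = idx._mmap  # a reference taken outside the lock, as an unlocked reader would
idx.resize(32)

try:
    stale[3]
    stale_read_failed = False
except ValueError:
    stale_read_failed = True
```

The locked `lookup` path keeps working after the resize, whereas the reference captured outside the lock is no longer usable.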

shikhar closed this Sep 19, 2024