WriteOutBytes improvements #16698
file.toPath(),
StandardOpenOption.READ,
StandardOpenOption.WRITE
return new LazilyAllocatingHeapWriteOutBytes(
Why not implement a new SegmentWriteOutMedium instead of changing the behavior of an existing one?
Because it has broad-reaching performance benefits? The alternative would be to create a new thing, update everything to use it, and then delete the old one, which is basically the same as changing it in place. The reason to keep the old one would be to maintain configurability for the old behavior, but I don't know why that would be useful here.
Yeah, I'm for the strategy in general, just kind of worried about the slightly larger memory footprint here without any form of escape hatch. This looks like it can grow up to 16k (up from the old 4k heap buffer in FileWriteOutBytes, which this PR replaces with a direct buffer). It's probably cool, I'm just being nervous.
I was mostly suggesting making a new thing, making it the default, and leaving the old thing around just in case, but I think I can be convinced that this is close enough not to matter.
Actually it is 32KB off heap instead of 4KB on heap. I wonder if that could be a problem with a large number of columns. Then again, with a large number of columns we also see the heap getting exhausted, so this might even be better in some cases?
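To put rough numbers on that concern, here is a back-of-the-envelope sketch. It assumes every open buffer is fully allocated at the same time; the count of 10,000 open buffers is a hypothetical figure chosen for illustration, not something measured in this PR, and `BufferFootprint` is not a Druid class.

```java
/** Hypothetical helper for estimating total buffer memory; not part of Druid. */
final class BufferFootprint {
  /** Total bytes reserved when each of openBuffers holds bufferBytes. */
  static long totalBytes(long openBuffers, long bufferBytes) {
    // multiplyExact throws ArithmeticException instead of silently overflowing
    return Math.multiplyExact(openBuffers, bufferBytes);
  }

  public static void main(String[] args) {
    long openBuffers = 10_000; // assumption: illustrative count only
    long mib = 1024L * 1024L;

    long heapBefore = totalBytes(openBuffers, 4 * 1024);   // 4KB heap buffer each
    long directAfter = totalBytes(openBuffers, 32 * 1024); // 32KB direct buffer each

    System.out.println("4KB heap buffers:    " + (heapBefore / mib) + " MiB on heap");
    System.out.println("32KB direct buffers: " + (directAfter / mib) + " MiB off heap");
  }
}
```

The per-buffer cost is 8x larger, but it moves from the heap to direct memory, which is why it may help in cases where the heap is the resource being exhausted.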
Can you please add tests to cover the code changes? E.g. is there a test already that makes sure that
rohangarg left a comment
Mostly LGTM!
Left a minor comment for understanding.
I would like to wait for the memory allocation thread to conclude before merging.
catch (Exception e) {
  throw new RuntimeException(e);
}
Assert.assertEquals(1, writeOutMedium.getNumLocallyCreated());
Should this count be 3 instead of 1? I think I'm misunderstanding the test.
That's the count on the thread, which is one. The total count asserted outside is 3.
@adarshsanjeev I submitted a PR (adarshsanjeev#1) to your branch that adds the old TmpFileSegmentWriteOutMedium back as a legacy option. Please merge that PR, which will cause this one to update. It should address the concern that the current changes offer no way to roll back to the old behavior if something does go sideways.
Have it as a legacy option
Reintroduce old TmpFileSegmentWriteOutMedium
@JsonTypeInfo(use = JsonTypeInfo.Id.NAME, property = "type", defaultImpl = TmpFileSegmentWriteOutMediumFactory.class)
@JsonSubTypes(value = {
    @JsonSubTypes.Type(name = "tmpFile", value = TmpFileSegmentWriteOutMediumFactory.class),
    @JsonSubTypes.Type(name = "legacyTmpFile", value = LegacyTmpFileSegmentWriteOutMediumFactory.class),
nit: I think putting "legacy" in the names of config options is usually not great, since other things could become legacy too, but I don't feel super strongly either way. We should maybe update the docs to indicate that this exists? https://druid.apache.org/docs/latest/configuration/#segmentwriteoutmediumfactory
This PR improves how WriteOutBytes and WriteOutMedium work. Analysis of how TmpFileSegmentWriteOutMedium is used shows that it is periodically used for very small writes, for which the overhead of creating a tmp file is very large. To improve performance in these cases, this PR modifies TmpFileSegmentWriteOutMedium to return a heap-based WriteOutBytes that falls back to creating a tmp file only when it actually fills up.
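The heap-first, file-on-overflow idea can be sketched as follows. This is a minimal illustration under stated assumptions, not the PR's `LazilyAllocatingHeapWriteOutBytes`: the class name `SpillingWriteBuffer`, its methods, and the fixed `heapLimit` threshold are all hypothetical. Only the `FileChannel` opened with `READ` and `WRITE` mirrors the diff shown in the review.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Hypothetical sketch: buffers on heap, creates a tmp file only on overflow. */
final class SpillingWriteBuffer implements AutoCloseable {
  private ByteBuffer heap;     // null once the data has spilled to disk
  private Path tmpFile;        // created lazily, only on overflow
  private FileChannel channel; // non-null once spilled
  private long size;

  SpillingWriteBuffer(int heapLimit) {
    // assumption: a fixed on-heap capacity decides when to spill
    this.heap = ByteBuffer.allocate(heapLimit);
  }

  void write(byte[] bytes) throws IOException {
    if (heap != null && bytes.length <= heap.remaining()) {
      heap.put(bytes); // small writes never touch the file system
    } else {
      spillIfNeeded();
      ByteBuffer src = ByteBuffer.wrap(bytes);
      while (src.hasRemaining()) {
        channel.write(src);
      }
    }
    size += bytes.length;
  }

  private void spillIfNeeded() throws IOException {
    if (channel != null) {
      return; // already writing to the tmp file
    }
    tmpFile = Files.createTempFile("spill", ".bin");
    channel = FileChannel.open(tmpFile, StandardOpenOption.READ, StandardOpenOption.WRITE);
    heap.flip(); // move the buffered bytes to the file first, preserving order
    while (heap.hasRemaining()) {
      channel.write(heap);
    }
    heap = null;
  }

  boolean hasSpilled() {
    return channel != null;
  }

  long size() {
    return size;
  }

  @Override
  public void close() throws IOException {
    if (channel != null) {
      channel.close();
      Files.deleteIfExists(tmpFile);
    }
  }
}
```

With a sketch like this, a caller that writes only a few bytes pays for one heap allocation and no file creation; the tmp file cost is incurred only by writers that exceed the threshold.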
This PR has: