Historical unloads damaged segments automatically when lazy on start #10688
Conversation
How about triggering an unload or force delete when a query fails due to segment corruption? After the unload, the coordinator will ask another historical to load this segment.
Hi @kaijianding. As you can see, there are few error messages when a query fails due to damaged segments, so it is hard to tell whether an exception was caused by damaged segments, a wrong query, or something else, and to find out which specific files are damaged, because a query usually reaches multiple segments. What's more, letting query failure trigger a …
@zhangyue19921010 I guess the exception from loadIndex can be a signal of corruption. When you distinguish the corruption exception, it is safe to unload this segment and let the coordinator choose another historical to load it (usually you have replicas). Compared to the restart behavior, I think the unloading behavior is more efficient and requires no human involvement.
Hi @kaijianding. This PR just aims to let Druid regain the ability, broken by lazyOnStart, to check the availability of all segments and reload corrupted segments during a historical restart. Are you proposing that Druid should be able to find and fix corrupted segments at runtime, regardless of lazyOnStart? That is actually a nice idea that would enhance the robustness of Druid and solve the corrupted-segment problem once and for all. If I understand correctly, I will keep digging and maybe open another PR to make it happen. What do you think? :)
Unfortunately there is no extra exception in …
Handling only the lazy-load case should be good enough. Maybe do something in this IOException catch and trigger SegmentManager.dropSegment(dataSegment) somehow.
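The IOException-catch idea above could look roughly like this. A minimal sketch with assumed names (`ColumnLoader`, `LoadFailCallback`, `lazily`), not the actual Druid implementation:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.function.Supplier;

public class LazyLoadFailSketch
{
  /** Hypothetical callback; in the PR this role is played by the new callback interface. */
  @FunctionalInterface
  public interface LoadFailCallback
  {
    void execute();
  }

  /** Stand-in for a column deserializer that may hit corrupt smoosh files. */
  @FunctionalInterface
  public interface ColumnLoader
  {
    String load() throws IOException;
  }

  /**
   * Wraps a lazily-loaded column so that an IOException at query time fires
   * the callback (which in Druid could end up triggering
   * SegmentManager.dropSegment(dataSegment)).
   */
  public static Supplier<String> lazily(ColumnLoader loader, LoadFailCallback onFail)
  {
    return () -> {
      try {
        return loader.load();
      }
      catch (IOException e) {
        onFail.execute();                  // signal: unload this segment
        throw new UncheckedIOException(e); // the current query still fails
      }
    };
  }
}
```

The query that hits the corruption still fails, but the callback gives the server a hook to drop the segment so the coordinator can reassign it.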
Force-pushed from 46303d4 to 7775658.
Hi @kaijianding, sorry for the late response. I just changed the strategy for the lazyOnStart action as you suggested.
Here are the full logs from the query failure and the triggered unload action: damage segment files -> lazy on start -> query fails -> segment unloaded automatically -> segment loaded per coordinator assignment -> query succeeds.
```java
package org.apache.druid.coordination;

public interface CommonCallback
```

Review comment: use a more meaningful name, like `SegmentLazyLoadFailCallback`.
```diff
 {
-  return loadIndex(inDir, false);
+  return loadIndex(inDir, false, CommonCallback.NOOP);
 }
```
```diff
 private ColumnHolder deserializeColumn(ObjectMapper mapper, ByteBuffer byteBuffer, SmooshedFileMapper smooshedFiles)
-    throws IOException
+    throws IOException, RuntimeException
```

Review comment: why `RuntimeException` here?
Reply: no need actually. Removed now.
```java
 * under the License.
 */

package org.apache.druid.coordination;
```

Review comment: I guess this interface should be in `org.apache.druid.segment`, beside the `IndexIO`.
```java
package org.apache.druid.coordination;

public interface CommonCallback
```

Review comment: why not just use `Runnable`? There is even a `Runnables.getNoopRunnable()`.
Reply: I think a meaningful name is better for code readability through the method arguments.
Reply: I don't feel super strongly about this; while I find it pretty obvious what `Runnable` does and don't think a new interface adds much here, I can see your point as well. I would agree that if we insist on using a separate interface, it should have a better name, more indicative of its function.
Reply: Changed the name to `SegmentLazyLoadFailCallback` as you guys suggested. Thanks for the review :)
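For reference, a named single-method interface like the one adopted is trivially small. A sketch of what the renamed callback could look like (details assumed, not copied from the final code):

```java
/**
 * Functionally identical to Runnable, but the name documents intent
 * wherever it appears in a method signature.
 */
@FunctionalInterface
public interface SegmentLazyLoadFailCallback
{
  /** Shared no-op instance for callers that don't care about lazy-load failures. */
  SegmentLazyLoadFailCallback NOOP = () -> {};

  /** Invoked when a lazily-loaded segment file turns out to be corrupt. */
  void execute();
}
```

Being a functional interface, it can still be supplied as a lambda at call sites, so the ergonomics match `Runnable` while the parameter name and type say what the hook is for.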
```diff
  * @throws SegmentLoadingException if the segment cannot be loaded
  */
-public boolean loadSegment(final DataSegment segment, boolean lazy) throws SegmentLoadingException
+public boolean loadSegment(final DataSegment segment, boolean lazy, CommonCallback loadFailed) throws SegmentLoadingException
```

Review comment: add a new `@param` in the method comment.
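The requested javadoc addition might look like the sketch below. The surrounding types are stubbed for self-containment, and the `@param` wording is an assumption, not the merged text:

```java
public class SegmentManagerSketch
{
  // Stubs standing in for the real Druid types.
  public static class DataSegment {}
  public static class SegmentLoadingException extends Exception {}
  public interface SegmentLazyLoadFailCallback { void execute(); }

  /**
   * Load a single segment.
   *
   * @param segment    segment to load
   * @param lazy       whether to load the segment lazily (defer loading columns until query time)
   * @param loadFailed callback to execute when a lazy load fails
   *
   * @return true if the segment was newly loaded, false if it was already loaded
   *
   * @throws SegmentLoadingException if the segment cannot be loaded
   */
  public boolean loadSegment(final DataSegment segment, boolean lazy, SegmentLazyLoadFailCallback loadFailed)
      throws SegmentLoadingException
  {
    return true; // stub; the real method delegates to the segment loader
  }
}
```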
And rename the PR title to match the changes.
```diff
 # resolving the SIGAR dependency.
 - >
-  MAVEN_OPTS='-Xmx800m' ${MVN} test -pl ${MAVEN_PROJECTS}
+  MAVEN_OPTS='-Xmx1100m' ${MVN} test -pl ${MAVEN_PROJECTS}
```

Reply: current UTs don't cover loading a segment in a lazy way. This PR adds a UT to test lazy loading, which needs extra memory.
Reply: 👍 Was just surprised how much more it needed; if Travis passes then I assume the change is fine. Was just curious, mostly.
@kaijianding @clintropolis Thanks for your review and merge!
…apache#10688) Squashed commits: ready to test; tested on dev cluster; tested; code review; add UTs; ut passed; opti imports; done; fix checkstyle; modify uts; modify logs; changing the package of SegmentLazyLoadFailCallback.java to org.apache.druid.segment; merge from master; modify import orders; modify docs; modify logs to rerun ci (×8). Co-authored-by: yuezhang <yuezhang@freewheel.tv>

Description
The feature mentioned in #6988 is very useful and can greatly improve the speed of historical node restarts, but it introduces some risk because it pushes the loading work to query time.

If segment files such as `meta.smoosh` or `data.smoosh` that a Historical already serves are damaged, a Historical with lazyOnStart enabled can restart successfully but will not detect the file corruption during restart. When queries hit these damaged files, they fail without exposing which specific files are damaged, as mentioned in #10608.

Tests
Here are the undamaged segment files.
First case: `meta.smoosh` is damaged. For example, remove the line `video_cro_network_name,0,997718245,1000412345` from that `meta.smoosh` file, then restart the Historical node with lazy on start enabled.
Second case: `data.smoosh` is damaged. For example, delete several bytes from that `00000.smoosh` file at random, then restart the Historical node with lazy on start enabled.
In all scenarios, the Historical can restart successfully, but queries will fail.
Solution
This PR gives a Historical with lazy on start the following ability:
Failed queries caused by damaged segment files trigger an unload action automatically. After the unload, the coordinator will ask another historical to load the segment.
Through this PR, the system heals itself and the administrator needs to do nothing.
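The self-healing sequence can be illustrated with a toy simulation. Everything here is an assumption for illustration: this is not Druid code, and the coordinator is reduced to a single reassignment step:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class SelfHealingSketch
{
  /** historical name -> segment ids it currently serves */
  public final Map<String, Set<String>> serving = new HashMap<>();
  /** "historical:segment" pairs whose on-disk files are damaged */
  public final Set<String> corrupt = new HashSet<>();

  /** Returns true if the query succeeds; on corruption, unloads and reassigns. */
  public boolean query(String historical, String segment)
  {
    if (corrupt.contains(historical + ":" + segment)) {
      serving.get(historical).remove(segment); // unload automatically
      reassign(segment, historical);           // coordinator asks another historical
      return false;                            // this query still fails
    }
    return serving.get(historical).contains(segment);
  }

  /** Minimal stand-in for the coordinator: load the segment on any other historical. */
  private void reassign(String segment, String exclude)
  {
    for (Map.Entry<String, Set<String>> e : serving.entrySet()) {
      if (!e.getKey().equals(exclude)) {
        e.getValue().add(segment);
        return;
      }
    }
  }
}
```

The first query against the damaged copy still fails, but every subsequent query is served by the replica, with no administrator involvement.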
This PR has:
Key changed/added classes in this PR
IndexIO.java