-
Notifications
You must be signed in to change notification settings - Fork 594
HDDS-8090. When getBlock from a datanode fails, retry other datanodes. #4357
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
private?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
private?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My understanding is that client will log the warning for the first and second failure, but the third failure it will simply throw the exception. I think it would be clear if the message is something like "Failed to get Block from " + d + ", will try another DataNode"
|
@jojochuang , thanks a lot for review this! Just have pushed a commit to address your comments and to add new tests. Unfortunately, the cannot completely fix TestHSync. Will continue the work in https://issues.apache.org/jira/browse/HDDS-8146 |
|
@adoroszlai , thanks for the ci.yml for running TestHSync! Will continue using it in https://issues.apache.org/jira/browse/HDDS-8146 |
👍 Note: I think it's better to create a separate branch for repeated tests, to avoid the need for a revert or force-push at the end. If you tweak the fix, just merge the fix branch into the repeat branch. Also, since CI runs in forks, too, no need to create PR until the fix is verified by repetitions. |
jojochuang
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @szetszwo !
Just a very minor comment otherwise this is goo to go.
| public void testGetBlockRetryAlNodes() { | ||
| final ArrayList<DatanodeDetails> allDNs = new ArrayList<>(dns); | ||
| Assertions.assertTrue(allDNs.size() > 1); | ||
| try (XceiverClientGrpc client = new XceiverClientGrpc(pipeline, conf) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unrelated, but looks like the existing tests cases don't close XceiverClientGrpc, potentially leaking resources.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, let me fix the existing problems.
| return Collections.singletonList(validator); | ||
| } | ||
|
|
||
| public static List<CheckedBiFunction> getValidatorList( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO a getXXX() method typically implies a O( 1 ) operation. It would be better off to rename it, such as toValidatorList() ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
... a getXXX() method typically implies a O( 1 ) operation. ...
This is not true for TreeMap.get(..), LinkedList.get(..), etc. But I am okay to do the rename.
|
@jojochuang , thanks for reviewing and merging this! @adoroszlai , thanks for the ci hints! |
* master: (262 commits) HDDS-8153. Integrate ContainerBalancer with MoveManager (apache#4391) HDDS-8090. When getBlock from a datanode fails, retry other datanodes. (apache#4357) HDDS-8163 Use try-with-resources to ensure close rockdb connection in SstFilteringService (apache#4402) HDDS-8065. Provide GNU long options (apache#4394) HDDS-7930. [addendum] input stream does not refresh expired block token. HDDS-7930. input stream does not refresh expired block token. (apache#4378) HDDS-7740. [Snapshot] Implement SnapshotDeletingService (apache#4244) HDDS-8076. Use container cache in Key listing API. (apache#4346) HDDS-8091. [addendum] Generate list of config tags from ConfigTag enum - Hadoop 3.1 compatibility fix (apache#4374) HDDS-8144. TestDefaultCertificateClient#testTimeBeforeExpiryGracePeriod fails as we approach DST. (apache#4382) HDDS-8151. Support fine grained lifetime for root CA certificate (apache#4386) HDDS-8150. RpcClientTest and ConfigurationSourceTest not run due to naming convention (apache#4388) HDDS-8131. Add Configuration for OM Ratis Log Purge Tuning Parameters. (apache#4371) HDDS-8133. Create ozone sh key checksum command (apache#4375) HDDS-8142. Check if no entries in Block DB for a container on container delete (apache#4379) HDDS-8118. Fail container delete on non empty chunks dir (apache#4367) HDDS-8028. JNI for RocksDB SST Dump tool (apache#4315) HDDS-8129. ContainerStateMachine allows two different tasks with the same container id running in parallel. (apache#4370) HDDS-8119. Remove loosely related AutoCloseable from SendContainerOutputStream (apache#4368) close db connection (apache#4366) ...
|
@szetszwo Sorry for the random comment, but from my understanding we already have a logic to retry other datanodes in for (DatanodeDetails dn : datanodeList) {
try {
if (LOG.isDebugEnabled()) {
LOG.debug("Executing command {} on datanode {}",
processForDebug(request), dn);
}
// In case the command gets retried on a 2nd datanode,
// sendCommandAsyncCall will create a new channel and async stub
// in case these don't exist for the specific datanode.
reply.addDatanode(dn);
responseProto = sendCommandAsync(request, dn).getResponse().get();
if (validators != null && !validators.isEmpty()) {
for (Validator validator : validators) {
validator.accept(request, responseProto);
}
}
...The call trace is This means that getBlock will be retried 27 times (3 from Please correct me if I'm mistaken. Thanks in advance. |
|
@ivandika3 , you are right there are multiple levels of retries. It seems the original retries (which are not added here) may not work as expected -- some tests were failing when a datanode cannot return the chunk. |
What changes were proposed in this pull request?
Similar to HDDS-8024, the client should retry other datanodes if it has failed to getBlock from a datanode.
What is the link to the Apache JIRA
https://issues.apache.org/jira/browse/HDDS-8090
How was this patch tested?
This is to fix TestHSync