HDDS-7618. Replication Commands should timeout if not processed on datanodes in time #4069
Conversation
@swamirishi @duongkame can you all please take a look.
adoroszlai
left a comment
Thanks @sodonnel for the patch. Looks good overall, few minor items suggested.
```java
ReplicationTask task = new ReplicationTask(containerID, sourceDatanodes);
task.setDeadline(deadline);
supervisor.addTask(task);
```
For HDDS-7620 I will likely need to add term to ReplicationTask to be able to check it before the task is executed. I was wondering if we should pass the entire command to ReplicationTask instead of duplicating more and more fields.
Yea let me look at this. There is a similar pattern in the ECReconstruction command, where we basically duplicate the entire command into another object which really only has getters the same as the command, so we might be able to make that go away too.
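A minimal sketch of the refactor discussed above, where the task delegates to the command instead of copying its fields. All class and method names here are illustrative, not the actual Ozone API:

```java
import java.util.List;

// Hypothetical stand-in for the SCM command object; in the real code
// this would be the existing ReplicateContainerCommand.
class ReplicateContainerCommand {
    private final long containerID;
    private final List<String> sourceDatanodes;
    private long deadlineMsSinceEpoch; // 0 means no deadline set

    ReplicateContainerCommand(long containerID, List<String> sources) {
        this.containerID = containerID;
        this.sourceDatanodes = sources;
    }

    long getContainerID() { return containerID; }
    List<String> getSourceDatanodes() { return sourceDatanodes; }
    void setDeadline(long ms) { this.deadlineMsSinceEpoch = ms; }
    long getDeadline() { return deadlineMsSinceEpoch; }
}

// The task wraps the command and delegates. Adding a new field later
// (e.g. term for HDDS-7620) then only touches the command class.
class ReplicationTask {
    private final ReplicateContainerCommand cmd;

    ReplicationTask(ReplicateContainerCommand cmd) { this.cmd = cmd; }

    long getContainerID() { return cmd.getContainerID(); }
    long getDeadline() { return cmd.getDeadline(); }
}

public class TaskSketch {
    public static void main(String[] args) {
        ReplicateContainerCommand cmd =
            new ReplicateContainerCommand(42L, List.of("dn1", "dn2"));
        cmd.setDeadline(123456789L);
        ReplicationTask task = new ReplicationTask(cmd);
        // Getters read through to the wrapped command.
        System.out.println(task.getContainerID() + " " + task.getDeadline());
    }
}
```

The design trade-off is that the task's API surface stays small while the command remains the single source of truth for replication parameters.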
```diff
-    supervisor.getReplicationRequestCount());
+    supervisor.getReplicationRequestCount())
+        .addGauge(Interns.info("numTimeoutReplications",
+            "Number of timeout replications in queue"),
```
Possible clarification:

```diff
-            "Number of timeout replications in queue"),
+            "Number of replication requests timed out before being processed"),
```
```java
if (deadlineMsSinceEpoch > 0 &&
    currentEpochMs > deadlineMsSinceEpoch) {
  return true;
}
return false;
```
Nit:

```diff
-if (deadlineMsSinceEpoch > 0 &&
-    currentEpochMs > deadlineMsSinceEpoch) {
-  return true;
-}
-return false;
+return deadlineMsSinceEpoch > 0 &&
+    currentEpochMs > deadlineMsSinceEpoch;
```
```java
if (!(val > 0) || (val > 1)) {
  throw new IllegalArgumentException(val
      + " must be greater than 0 and less than equal to 1");
}
```
Validation is better added to a separate method, annotated with @PostConstruct. This ensures validation is performed even if the instance is created and initialized via reflection (and also allows multiple config items to be cross-validated).
Some examples:
ozone/hadoop-hdds/client/src/main/java/org/apache/hadoop/hdds/scm/OzoneClientConfig.java
Lines 194 to 220 in d0e6824
```java
@PostConstruct
private void validate() {
  Preconditions.checkState(streamBufferSize > 0);
  Preconditions.checkState(streamBufferFlushSize > 0);
  Preconditions.checkState(streamBufferMaxSize > 0);
  Preconditions.checkArgument(bufferIncrement < streamBufferSize,
      "Buffer increment should be smaller than the size of the stream "
          + "buffer");
  Preconditions.checkState(streamBufferMaxSize % streamBufferFlushSize == 0,
      "expected max. buffer size (%s) to be a multiple of flush size (%s)",
      streamBufferMaxSize, streamBufferFlushSize);
  Preconditions.checkState(streamBufferFlushSize % streamBufferSize == 0,
      "expected flush size (%s) to be a multiple of buffer size (%s)",
      streamBufferFlushSize, streamBufferSize);
  if (bytesPerChecksum <
      OzoneConfigKeys.OZONE_CLIENT_BYTES_PER_CHECKSUM_MIN_SIZE) {
    LOG.warn("The checksum size ({}) is not allowed to be less than the " +
        "minimum size ({}), resetting to the minimum size.",
        bytesPerChecksum,
        OzoneConfigKeys.OZONE_CLIENT_BYTES_PER_CHECKSUM_MIN_SIZE);
    bytesPerChecksum =
        OzoneConfigKeys.OZONE_CLIENT_BYTES_PER_CHECKSUM_MIN_SIZE;
  }
}
```
Lines 171 to 179 in d0e6824
```java
@PostConstruct
public void validate() {
  if (replicationMaxStreams < 1) {
    LOG.warn(REPLICATION_STREAMS_LIMIT_KEY + " must be greater than zero " +
        "and was set to {}. Defaulting to {}",
        replicationMaxStreams, REPLICATION_MAX_STREAMS_DEFAULT);
    replicationMaxStreams = REPLICATION_MAX_STREAMS_DEFAULT;
  }
}
```
Ah, I had a feeling there should be something like this. I will add this in.
adoroszlai
left a comment
Thanks @sodonnel for updating the patch, LGTM.
What changes were proposed in this pull request?
Both the new and old replication managers send commands to the datanodes. If a command has not been processed on the datanode within the ReplicationManager event.timeout, RM assumes the command has failed for some reason and may send another one to the same or a different host.
It makes sense to drop any command not processed on the datanode slightly before ReplicationManager gives up on it. Especially with delete container commands, we don't want two or more deletes pending in the system for the same container when RM thinks there is only one.
To facilitate dropping the commands, we can add a deadline to all commands. For commands we want to enforce a deadline on, we set the deadline in SCM and check it on the DN side. The deadline is the epoch time by which the command should be processed; after that, the DN should ignore it. A deadline of zero is the default and means no deadline is set.
This change ensures that all commands sent to a datanode from RM have a deadline set to 0.9 * event.timeout. On the datanode side, we only enforce the deadline on ReplicateContainer, DeleteContainer and ECReconstruction commands.
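A rough sketch of the scheme described above (method and constant names are assumed for illustration, not the actual RM code): SCM stamps each command with now + 0.9 * event.timeout, and the datanode drops the command if that deadline has passed, treating zero as "no deadline".

```java
public class DeadlineSketch {
    // SCM side: deadline = now + factor * event.timeout (0.9 in this PR).
    static long computeDeadline(long nowMs, long eventTimeoutMs,
                                double factor) {
        return nowMs + (long) (eventTimeoutMs * factor);
    }

    // DN side: drop the command if a deadline is set and has passed.
    // A deadline of zero means no deadline was set, so never drop.
    static boolean shouldDrop(long deadlineMsSinceEpoch, long currentEpochMs) {
        return deadlineMsSinceEpoch > 0
            && currentEpochMs > deadlineMsSinceEpoch;
    }

    public static void main(String[] args) {
        long now = 1_000_000L;
        // With a 10-minute event.timeout: 1_000_000 + 0.9 * 600_000.
        long deadline = computeDeadline(now, 10 * 60 * 1000L, 0.9);
        System.out.println(deadline);                           // 1540000
        System.out.println(shouldDrop(deadline, deadline + 1)); // true
        System.out.println(shouldDrop(0L, Long.MAX_VALUE));     // false
    }
}
```

Enforcing the deadline at 0.9 of the timeout (rather than 1.0) leaves a margin so a command is never being executed on the DN at the same moment RM re-issues it.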
This has turned into quite a large change, but it is split into individual commits for each stage so they can be reviewed in isolation.
What is the link to the Apache JIRA
https://issues.apache.org/jira/browse/HDDS-7618
How was this patch tested?
Various new unit tests added.