-
Notifications
You must be signed in to change notification settings - Fork 3.4k
HBASE-25627: [Backport]HBase replication should have a metric to represent if the source is stuck getting initialized #3009
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
HBASE-25627: [Backport]HBase replication should have a metric to represent if the source is stuck getting initialized #3009
Conversation
|
💔 -1 overall
This message was automatically generated. |
wchevreuil
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this relevant/affecting branch-1 only? Shouldn't we target master branch as the primary branch for the changes, then backport it to all related lower branches?
| } | ||
|
|
||
| @Override | ||
| public void setPeerZkConnectionFailures(boolean success) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since this is a numeric metric, we should comply with the other numeric metrics and provide an incrPeerZkConnectionFailures method, rather than a setter.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@wchevreuil This was intentional since we want to reset it to zero once we get the success connection.
Do you suggest having a separate method for a reset?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This logic in a global source doesn't give the intended result here because peerZkConnectionFailures is reset to 0 if any source succeeds. That doesn't seem right.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, add a separate method for set, but set methods on counter metrics accept the integer value as the new value for the metric, not a weird boolean. Set and Increment is fine. Set(0) to reset.
...a/org/apache/hadoop/hbase/replication/regionserver/MetricsReplicationGlobalSourceSource.java
Outdated
Show resolved
Hide resolved
| while (this.isSourceActive() && this.peerClusterId == null) { | ||
| this.peerClusterId = replicationEndpoint.getPeerUUID(); | ||
| if (this.isSourceActive() && this.peerClusterId == null) { | ||
| metrics.setPeerZkConnectionFailures(false); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One of the motivations described in the jira is that there's no evidence of this condition in the logs. Since log files are normally the first source of info operators normally look after, is it possible to add a WARN reporting the peerId info is missing, possibly due to ZK connectivity issues?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I agree that WARN should be there as well but the metric helps to monitor and alert in this case. Since connection to peer's ZK blocks the whole replication, it should be good to monitor this as a metric.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Had a brief discussion offline with @sandeepvinayak on how to generalize this. I think the intent here is capture and flag any sources that are stuck during initialization and ZK connection failure is just a symptom. So capturing those ZK failure count may not add much value, instead we track number of uninitialized sources, (gauge) that'd be much more helpful.
So one way forward is to track number of such uninitialized sources at a global scope (that the monitoring tooling can flag if its > 0 for a time window) and then backport https://issues.apache.org/jira/browse/HBASE-22731 to branch-1. These two together should help us narrow down it to the right RS and root cause.
Thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree with @wchevreuil , metrics and logs are consumed differently by operators and a reasonable request to add a log is not solved by emitting a metric
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah, I have added a metric as well as a log.
cc209ea to
3fdf54e
Compare
| } | ||
|
|
||
| @Override | ||
| public void setPeerZkConnectionFailures(boolean success) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This logic in a global source doesn't give the intended result here because peerZkConnectionFailures is reset to 0 if any source succeeds. That doesn't seem right.
| while (this.isSourceActive() && this.peerClusterId == null) { | ||
| this.peerClusterId = replicationEndpoint.getPeerUUID(); | ||
| if (this.isSourceActive() && this.peerClusterId == null) { | ||
| metrics.setPeerZkConnectionFailures(false); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Had a brief discussion offline with @sandeepvinayak on how to generalize this. I think the intent here is capture and flag any sources that are stuck during initialization and ZK connection failure is just a symptom. So capturing those ZK failure count may not add much value, instead we track number of uninitialized sources, (gauge) that'd be much more helpful.
So one way forward is to track number of such uninitialized sources at a global scope (that the monitoring tooling can flag if its > 0 for a time window) and then backport https://issues.apache.org/jira/browse/HBASE-22731 to branch-1. These two together should help us narrow down it to the right RS and root cause.
Thoughts?
| public static final String SOURCE_FAILED_RECOVERY_QUEUES = "source.failedRecoverQueues"; | ||
| /* Used to track the age of oldest wal in ms since its creation time */ | ||
| String OLDEST_WAL_AGE = "source.oldestWalAge"; | ||
| public static final String SOURCE_PEER_ZK_CONNECTION_FAILURE = "source.peerZkConnectionFailure"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we just call this peerConnectionFailure? ZK may or may not be the reason, if not now, then in the future. What we want to count is connection failures, let the naming reflect that. (We need to care about metric names because it becomes part of operational compat.)
| void incrFailedRecoveryQueue(); | ||
| void setOldestWalAge(long age); | ||
| long getOldestWalAge(); | ||
| void setPeerZkConnectionFailures(boolean success); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is weird. It should be an increment function. See above incrFailedRecoveryQueue as example.
| void setOldestWalAge(long age); | ||
| long getOldestWalAge(); | ||
| void setPeerZkConnectionFailures(boolean success); | ||
| long getPeerZkConnectionFailures(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Getters are not needed by convention here, remove
| } | ||
|
|
||
| @Override | ||
| public void setPeerZkConnectionFailures(boolean success) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, add a separate method for set, but set methods on counter metrics accept the integer value as the new value for the metric, not a weird boolean. Set and Increment is fine. Set(0) to reset.
| while (this.isSourceActive() && this.peerClusterId == null) { | ||
| this.peerClusterId = replicationEndpoint.getPeerUUID(); | ||
| if (this.isSourceActive() && this.peerClusterId == null) { | ||
| metrics.setPeerZkConnectionFailures(false); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree with @wchevreuil , metrics and logs are consumed differently by operators and a reasonable request to add a log is not solved by emitting a metric
|
I also agree with @wchevreuil that starting with branch-1 for an issue that affects all branches is upside down. We should have a master patch and merge it before merging this. Sure, a separate PR may be needed for branch-1 because of code difference, that is fine, but patch application should proceed in the normal order, which is master -> branch-2 -> releasing branch-2s -> branch-1. |
|
@apurtell I believe our reviews overlapped. I proposed an alternative metric to track. It helps to track the number of uninitialized sources (stuck during initialization) that we can flag via monitoring right away (along with additional logging from back port of HBASE-22731). |
|
🎊 +1 overall
This message was automatically generated. |
|
🎊 +1 overall
This message was automatically generated. |
|
@bharathv @apurtell @wchevreuil I believe @bharathv suggestion to have a metric at source level instead makes sense to me. That metric will eventually catch the situation of peer connection failure as well. Once that is committed, I can change this one on |
|
@sandeepvinayak Mind refreshing this PR with the latest patch? Thanks. |
…he source is stuck getting initialized
3fdf54e to
d4071c6
Compare
|
💔 -1 overall
This message was automatically generated. |
…he source is stuck getting initialized
|
🎊 +1 overall
This message was automatically generated. |
|
fyi @bharathv |
bharathv
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll let other reviewers take a look or merge by EOD.
Please refer to jira for the description.
https://issues.apache.org/jira/browse/HBASE-25627