Add start_time column to sys.servers #13358
Add start_time column to sys.servers #13358abhishekagarwal87 merged 14 commits intoapache:masterfrom
Conversation
LakshSingla
left a comment
There was a problem hiding this comment.
Thanks for the contribution! The Java code changes LGTM barring a minor question.
| public DateTime getStartTime() | ||
| { | ||
| return startTime; | ||
| } |
There was a problem hiding this comment.
Why is startTime omitted while verifying the equality of this class?
There was a problem hiding this comment.
Since a DiscoveryDruidNode object is primarily identified by its DruidNode, role and the service map, I wanted to preserve the equality condition. Also, start_time might not be a concrete enough to decide the equality between two DiscoveryDruidNode objects, if the other field values are the same. What do you think?
There was a problem hiding this comment.
Thanks for the explanation! I think it's better to keep it the current way.
|
Oh damn! so cool! I can not wait to add an "uptime" column to the web console! |
| 'Current size', | ||
| 'Max size', | ||
| 'Usage', | ||
| 'Start Time', |
There was a problem hiding this comment.
Please capitalize as Start time
| }, | ||
| }, | ||
| { | ||
| Header: 'Start Time', |
There was a problem hiding this comment.
Ditto re: capitalization
| curr_size: s.currSize, | ||
| max_size: s.maxSize, | ||
| tls_port: -1, | ||
| start_time: s.start_time, |
There was a problem hiding this comment.
did you add start_time to the /druid/coordinator/v1/servers?simple response also? If so you should update https://github.com/apache/druid/blob/master/docs/operations/api-reference.md#L496 also, if not then update the line of code above.
There was a problem hiding this comment.
Good catch, removed it.
|
|
||
| const tableColumns: Record<CapabilitiesMode, string[]> = { | ||
| 'full': allColumns, | ||
| 'no-sql': allColumns, |
There was a problem hiding this comment.
ok, so since you are not updating /druid/coordinator/v1/servers?simple you should make sure that Start time is no in the no-sql column list. no-sql is the mode used when the user had no SQL access so it falls back to the old endpoint. You should set no-sql to:
[
'Service',
'Type',
'Tier',
'Host',
'Port',
'Current size',
'Max size',
'Usage',
'Detail',
ACTION_COLUMN_LABEL,
];
Also at this point you can inline the allColumns constant as it's reasons for existence was to set full and no-sql to the same thing.
vogievetsky
left a comment
There was a problem hiding this comment.
👍 for everything but the Java (I did not review the Java code - only TS + general idea). Thank you for promptly responding to feedback!
|
@vogievetsky Thanks for the review. Do you know what is going on with the Travis job failure: web console end-to-end test ? |
|
not sure, will have a look in a bit. |
|
I found an issue when coordinator starts up with |
|
Was that what was causing the e2e failures? |
@vogievetsky I believe so, since we run coordinator with asOverlord enabled in our builds. |
|
What is the status of this PR: is it good to merge if conflicts are resolved? |
|
@vogievetsky Sorry for the delay, I'll find some time this week to fix up this PR |
|
@vogievetsky @abhishekagarwal87 @LakshSingla Sorry for the delay in fixing up the conflicts for this PR, but it'd be great if you could take a quick look at this again. |
| created = true; | ||
| } else if (!Arrays.equals(oldBytes, bytes)) { | ||
| throw new IAE("Cannot reannounce different values under the same path"); | ||
| log.error("Ignoring attempt to announce different values under same path"); |
There was a problem hiding this comment.
what is the rationale behind this change?
There was a problem hiding this comment.
When the coordinator is run in overlord mode, Announcer.announce() is called twice since the Announcer module is part of the coordinator and overlord lifecycle modules. The second call is a no-op since the existing DiscoveryDruidNode bytes announced at the path is the same as the node bytes in the second call.
With this patch, the start time is now part of DiscoveryDruidNode and so in some cases, there could be a millisecond delay between the two announce calls. This causes the node objects to be different and the second announce call fails due to the validation check.
I couldn't find another scenario where it would be useful to fail the process in this condition and so I'm logging it here instead. Let me know if you have any thoughts on this approach.
There was a problem hiding this comment.
it's called 4 times. From my local logs of a run
2023-03-22T13:24:51,527 INFO [main] org.apache.druid.curator.discovery.CuratorDruidNodeAnnouncer - Announced self [{"druidNode":{"service":"druid/coordinator","host":"localhost","bindOnHost":false,"plaintextPort":8081,"port":-1,"tlsPort":-1,"enablePlaintextPort":true,"enableTlsPort":false},"nodeType":"coordinator","services":{}}].
2023-03-22T13:24:51,530 INFO [main] org.apache.druid.curator.discovery.CuratorDruidNodeAnnouncer - Announced self [{"druidNode":{"service":"druid/coordinator","host":"localhost","bindOnHost":false,"plaintextPort":8081,"port":-1,"tlsPort":-1,"enablePlaintextPort":true,"enableTlsPort":false},"nodeType":"overlord","services":{}}].
2023-03-22T13:24:51,530 INFO [main] org.apache.druid.curator.discovery.CuratorDruidNodeAnnouncer - Announced self [{"druidNode":{"service":"druid/coordinator","host":"localhost","bindOnHost":false,"plaintextPort":8081,"port":-1,"tlsPort":-1,"enablePlaintextPort":true,"enableTlsPort":false},"nodeType":"coordinator","services":{}}].
2023-03-22T13:24:51,530 INFO [main] org.apache.druid.curator.discovery.CuratorDruidNodeAnnouncer - Announced self [{"druidNode":{"service":"druid/coordinator","host":"localhost","bindOnHost":false,"plaintextPort":8081,"port":-1,"tlsPort":-1,"enablePlaintextPort":true,"enableTlsPort":false},"nodeType":"overlord","services":{}}].
I debugged this a bit and you are right about it being called from two modules. I think that we can also skip the duplicate announcement when Overlord is not in standalone mode. That should fix the problem you ran into. And we keep this assert in place. what do you think?
There was a problem hiding this comment.
I tried this change and it's working fine.
diff --git a/services/src/main/java/org/apache/druid/cli/CliOverlord.java b/services/src/main/java/org/apache/druid/cli/CliOverlord.java
index e4383c673c..79b90e63b5 100644
--- a/services/src/main/java/org/apache/druid/cli/CliOverlord.java
+++ b/services/src/main/java/org/apache/druid/cli/CliOverlord.java
@@ -267,13 +267,13 @@ public class CliOverlord extends ServerRunnable
if (standalone) {
LifecycleModule.register(binder, Server.class);
- }
- bindAnnouncer(
- binder,
- IndexingService.class,
- DiscoverySideEffectsProvider.create()
- );
+ bindAnnouncer(
+ binder,
+ IndexingService.class,
+ DiscoverySideEffectsProvider.create()
+ );
+ }
Jerseys.addResource(binder, SelfDiscoveryResource.class);
LifecycleModule.registerKey(binder, Key.get(SelfDiscoveryResource.class));
There was a problem hiding this comment.
Thanks, that seems like the better solution. I've tested out the changes and it works as expected.
|
@a2l007 - the PR is ready to go. I just had one question on a change that you made in this PR. |
|
thank you @a2l007. I merged this PR. This missed the cut for the 26 milestone. if you want this in 26 release, please create a backport PR against 26.0.0 branch. |
Adds a new column start_time to sys.servers that captures the time at which the server was added to the cluster.
Fixes #12090
Description
Adds a new column
start_timeto sys.servers that captures the time at which the server was added to the cluster.Key changed/added classes in this PR
DiscoveryDruidNodeSystemSchemaThis PR has: