Add start_time column to sys.servers by a2l007 · Pull Request #13358 · apache/druid

a2l007 · 2022-11-11T22:02:09Z

Fixes #12090

Description

Adds a new column start_time to sys.servers that captures the time at which the server was added to the cluster.

Key changed/added classes in this PR

DiscoveryDruidNode
SystemSchema

This PR has:

been self-reviewed.
added documentation for new or modified features or behaviors.
added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
been tested in a test Druid cluster.

LakshSingla

Thanks for the contribution! The Java code changes LGTM barring a minor question.

LakshSingla · 2022-11-15T09:02:08Z

+  public DateTime getStartTime()
+  {
+    return startTime;
+  }


Why is startTime omitted while verifying the equality of this class?

Since a DiscoveryDruidNode object is primarily identified by its DruidNode, role and the service map, I wanted to preserve the equality condition. Also, start_time might not be concrete enough to decide the equality between two DiscoveryDruidNode objects if the other field values are the same. What do you think?

Thanks for the explanation! I think it's better to keep it the current way.
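
To make the equality decision above concrete, here is a minimal sketch of a class whose equals() and hashCode() ignore the start time. NodeSketch and its fields are hypothetical stand-ins, not the actual DiscoveryDruidNode code, and java.time.Instant is used in place of Joda DateTime only to keep the example self-contained.

```java
import java.time.Instant;
import java.util.Objects;

// Hypothetical stand-in for a discovery node; not the actual Druid class.
final class NodeSketch
{
  private final String hostAndPort;
  private final String role;
  private final Instant startTime;

  NodeSketch(String hostAndPort, String role, Instant startTime)
  {
    this.hostAndPort = hostAndPort;
    this.role = role;
    this.startTime = startTime;
  }

  Instant getStartTime()
  {
    return startTime;
  }

  @Override
  public boolean equals(Object o)
  {
    if (this == o) {
      return true;
    }
    if (!(o instanceof NodeSketch)) {
      return false;
    }
    NodeSketch that = (NodeSketch) o;
    // startTime is deliberately left out of the comparison.
    return hostAndPort.equals(that.hostAndPort) && role.equals(that.role);
  }

  @Override
  public int hashCode()
  {
    // Uses the same fields as equals() so the two stay consistent.
    return Objects.hash(hostAndPort, role);
  }
}
```

With this shape, two nodes that differ only in start time still compare equal, which is the behavior the thread agreed to keep.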

vogievetsky · 2022-11-15T19:43:28Z

Oh damn! so cool! I can not wait to add an "uptime" column to the web console!
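
As a rough illustration of the "uptime" idea, the sketch below derives an uptime from a start_time value. It assumes start_time arrives as an ISO 8601 string; UptimeSketch and the sample timestamps are hypothetical and not part of this PR or the web console.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical helper; assumes start_time is an ISO 8601 timestamp string.
final class UptimeSketch
{
  // Returns the elapsed time between the server's start time and "now".
  static Duration uptime(String startTime, Instant now)
  {
    return Duration.between(Instant.parse(startTime), now);
  }
}
```

Passing the current time in as a parameter, rather than reading the clock inside the helper, keeps the calculation easy to test.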

vogievetsky · 2022-11-15T19:45:08Z

  'Current size',
  'Max size',
  'Usage',
+  'Start Time',


Please capitalize as Start time

vogievetsky · 2022-11-15T19:45:42Z

            },
          },
+          {
+            Header: 'Start Time',


Ditto re: capitalization

vogievetsky · 2022-11-15T19:49:16Z

        curr_size: s.currSize,
        max_size: s.maxSize,
        tls_port: -1,
+        start_time: s.start_time,


Did you also add start_time to the /druid/coordinator/v1/servers?simple response? If so, you should also update https://github.com/apache/druid/blob/master/docs/operations/api-reference.md#L496; if not, then update the line of code above.

Good catch, removed it.

vogievetsky · 2022-11-16T18:23:51Z


 const tableColumns: Record<CapabilitiesMode, string[]> = {
  'full': allColumns,
  'no-sql': allColumns,


OK, so since you are not updating /druid/coordinator/v1/servers?simple, you should make sure that Start time is not in the no-sql column list. no-sql is the mode used when the user has no SQL access, so it falls back to the old endpoint. You should set no-sql to:

[ 'Service', 'Type', 'Tier', 'Host', 'Port', 'Current size', 'Max size', 'Usage', 'Detail', ACTION_COLUMN_LABEL, ];

Also, at this point you can inline the allColumns constant, as its reason for existence was to set full and no-sql to the same thing.

vogievetsky

👍 for everything but the Java (I did not review the Java code - only TS + general idea). Thank you for promptly responding to feedback!

a2l007 · 2022-11-17T19:24:55Z

@vogievetsky Thanks for the review. Do you know what is going on with the Travis job failure: web console end-to-end test?
It builds fine locally, but Travis keeps failing that job.

vogievetsky · 2022-11-17T22:18:17Z

not sure, will have a look in a bit.

a2l007 · 2022-11-23T00:14:55Z

I found an issue when the coordinator starts up with druid.coordinator.asOverlord.enabled set to true. In this case, the coordinator and overlord services are announced twice, each time with a different instance of DiscoveryDruidNode. This breaks the announcer flow, as each of the DiscoveryDruidNode objects has a slightly different startTime, which causes a mismatch between the node bytes announced at the path.
Converting this PR to a draft until I find a better way to fix this.

vogievetsky · 2022-11-23T23:09:48Z

Was that what was causing the e2e failures?

a2l007 · 2022-11-28T17:23:02Z

Was that what was causing the e2e failures?

@vogievetsky I believe so, since we run coordinator with asOverlord enabled in our builds.

vogievetsky · 2023-03-20T18:07:07Z

What is the status of this PR: is it good to merge if conflicts are resolved?

a2l007 · 2023-03-20T18:39:11Z

@vogievetsky Sorry for the delay, I'll find some time this week to fix up this PR

a2l007 · 2023-03-24T22:53:19Z

@vogievetsky @abhishekagarwal87 @LakshSingla Sorry for the delay in fixing up the conflicts for this PR, but it'd be great if you could take a quick look at this again.

abhishekagarwal87 · 2023-03-29T13:29:51Z

          created = true;
        } else if (!Arrays.equals(oldBytes, bytes)) {
-          throw new IAE("Cannot reannounce different values under the same path");
+          log.error("Ignoring attempt to announce different values under same path");


what is the rationale behind this change?

When the coordinator is run in overlord mode, Announcer.announce() is called twice, since the Announcer module is part of both the coordinator and overlord lifecycle modules. Without this patch, the second call is a no-op because the DiscoveryDruidNode bytes already announced at the path are the same as the node bytes in the second call.
With this patch, the start time is now part of DiscoveryDruidNode, so in some cases there could be a millisecond delay between the two announce calls. This causes the node objects to be different, and the second announce call fails due to the validation check.
I couldn't find another scenario where it would be useful to fail the process in this condition, so I'm logging it here instead. Let me know if you have any thoughts on this approach.
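
The failure mode described here can be sketched with an in-memory stand-in for the announcer. AnnouncerSketch, its serialize() helper and the JSON layout are hypothetical simplifications, not Druid's Announcer, and IllegalArgumentException is used in place of Druid's IAE.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Hypothetical in-memory stand-in for the announcer; not Druid's Announcer class.
final class AnnouncerSketch
{
  private final Map<String, byte[]> announced = new HashMap<>();

  // Returns true if the path was newly created, false if the call was a no-op.
  boolean announce(String path, byte[] bytes)
  {
    byte[] oldBytes = announced.putIfAbsent(path, bytes);
    if (oldBytes == null) {
      return true;
    }
    if (!Arrays.equals(oldBytes, bytes)) {
      // Same path, different payload: this is the validation check that fails.
      throw new IllegalArgumentException("Cannot reannounce different values under the same path");
    }
    return false;
  }

  // Hypothetical serialization in which the payload embeds the start time.
  static byte[] serialize(String nodeType, long startTimeMillis)
  {
    String json = "{\"nodeType\":\"" + nodeType + "\",\"startTime\":" + startTimeMillis + "}";
    return json.getBytes(StandardCharsets.UTF_8);
  }
}
```

Announcing identical bytes twice is harmless, but once the payload carries a start time, two node instances created a few milliseconds apart serialize differently and the second announcement is rejected.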

It's called 4 times. From my local logs of a run:

2023-03-22T13:24:51,527 INFO [main] org.apache.druid.curator.discovery.CuratorDruidNodeAnnouncer - Announced self [{"druidNode":{"service":"druid/coordinator","host":"localhost","bindOnHost":false,"plaintextPort":8081,"port":-1,"tlsPort":-1,"enablePlaintextPort":true,"enableTlsPort":false},"nodeType":"coordinator","services":{}}].
2023-03-22T13:24:51,530 INFO [main] org.apache.druid.curator.discovery.CuratorDruidNodeAnnouncer - Announced self [{"druidNode":{"service":"druid/coordinator","host":"localhost","bindOnHost":false,"plaintextPort":8081,"port":-1,"tlsPort":-1,"enablePlaintextPort":true,"enableTlsPort":false},"nodeType":"overlord","services":{}}].
2023-03-22T13:24:51,530 INFO [main] org.apache.druid.curator.discovery.CuratorDruidNodeAnnouncer - Announced self [{"druidNode":{"service":"druid/coordinator","host":"localhost","bindOnHost":false,"plaintextPort":8081,"port":-1,"tlsPort":-1,"enablePlaintextPort":true,"enableTlsPort":false},"nodeType":"coordinator","services":{}}].
2023-03-22T13:24:51,530 INFO [main] org.apache.druid.curator.discovery.CuratorDruidNodeAnnouncer - Announced self [{"druidNode":{"service":"druid/coordinator","host":"localhost","bindOnHost":false,"plaintextPort":8081,"port":-1,"tlsPort":-1,"enablePlaintextPort":true,"enableTlsPort":false},"nodeType":"overlord","services":{}}].

I debugged this a bit and you are right about it being called from two modules. I think that we can also skip the duplicate announcement when Overlord is not in standalone mode. That should fix the problem you ran into, and we keep this assert in place. What do you think?

I tried this change and it's working fine.

diff --git a/services/src/main/java/org/apache/druid/cli/CliOverlord.java b/services/src/main/java/org/apache/druid/cli/CliOverlord.java
index e4383c673c..79b90e63b5 100644
--- a/services/src/main/java/org/apache/druid/cli/CliOverlord.java
+++ b/services/src/main/java/org/apache/druid/cli/CliOverlord.java
@@ -267,13 +267,13 @@ public class CliOverlord extends ServerRunnable
 if (standalone) {
   LifecycleModule.register(binder, Server.class);
-}
-bindAnnouncer(
-    binder,
-    IndexingService.class,
-    DiscoverySideEffectsProvider.create()
-);
+  bindAnnouncer(
+      binder,
+      IndexingService.class,
+      DiscoverySideEffectsProvider.create()
+  );
+}
 Jerseys.addResource(binder, SelfDiscoveryResource.class);
 LifecycleModule.registerKey(binder, Key.get(SelfDiscoveryResource.class));

Thanks, that seems like the better solution. I've tested out the changes and it works as expected.

abhishekagarwal87 · 2023-04-04T07:51:45Z

@a2l007 - the PR is ready to go. I just had one question on a change that you made in this PR.

abhishekagarwal87 · 2023-04-14T09:54:29Z

Thank you @a2l007. I merged this PR. It missed the cut for the 26 milestone. If you want this in the 26 release, please create a backport PR against the 26.0.0 branch.

Server start time

232631f

kfaraz added the Area - Operations label Nov 14, 2022

LakshSingla reviewed Nov 15, 2022

vogievetsky reviewed Nov 15, 2022

vogievetsky added the Area - Web Console label Nov 15, 2022

a2l007 added 3 commits November 15, 2022 16:41

Merge branch 'master' of github.com:apache/druid into servercreatedtime

d80dd80

Spelling fixes

737c77c

Changes to test

f9f4415

vogievetsky reviewed Nov 16, 2022

Fix services view for nosql

cdc0ed7

vogievetsky approved these changes Nov 17, 2022

abhishekagarwal87 approved these changes Nov 22, 2022

Merge branch 'master' of github.com:apache/druid into servercreatedtime

6def47d

a2l007 marked this pull request as draft November 23, 2022 00:15

a2l007 added 2 commits November 28, 2022 09:20

Log attempt to announce diff values under same path

f1c46d3

Merge branch 'master' of github.com:apache/druid into servercreatedtime

b3e5a9f

Checkstyle

4aaa647

Merge branch 'master' of github.com:apache/druid into servercreatedtime

3523071

github-actions Bot added the Area - Documentation label Mar 21, 2023

Default start time for int tests

1a9c5e9

a2l007 marked this pull request as ready for review March 22, 2023 20:30

abhishekagarwal87 reviewed Mar 29, 2023

techdocsmith reviewed Apr 6, 2023

Comment thread docs/querying/sql-metadata-tables.md Outdated

a2l007 added 3 commits April 10, 2023 16:37

Merge branch 'master' of github.com:apache/druid into servercreatedtime

ded8668

Fix doc comments

839f2a5

bind announcer only in standalone mode

e2ae733

abhishekagarwal87 added the Release Notes label Apr 14, 2023

abhishekagarwal87 merged commit e3c160f into apache:master Apr 14, 2023

abhishekagarwal87 added this to the 27.0 milestone Jul 19, 2023

vogievetsky mentioned this pull request Aug 3, 2023

DO NOT MERGE - 27.0.0 WIP release notes #14600

Closed

churromorales pushed a commit to churromorales/druid that referenced this pull request Sep 13, 2023

Add start_time column to sys.servers (apache#13358)

33092ed
