
Plumb task peon host/ports back out to the overlord. #2419

Merged
fjy merged 1 commit into apache:master from gianm:task-hostports on Feb 26, 2016

Conversation

@gianm
Contributor

@gianm gianm commented Feb 9, 2016

The intent is to get rid of the need for Curator service discovery to find tasks.

Motivation
Curator-based service discovery is annoying because it needs ZK, and also because it doesn't clean up after itself when a service goes away, requiring hacks like this in Tranquility: https://github.com/druid-io/tranquility/blob/v0.7.2/core/src/main/scala/com/metamx/tranquility/druid/DruidBeamMaker.scala#L229. This should also make life easier for the ingestion supervisors needed by #2220, as they will likely run on the overlord and will benefit from the overlord knowing where tasks are.

Intended usage
Tranquility would use this by implementing a resolver (similar to the DiscoResolver for Curator discovery) that polls the overlord's runningTasks endpoint instead of watching ZK.

Overlord-based ingestion supervisors (like the kafka one implied by #2220) would probably register a listener directly with the TaskRunner.

Implementation

  • Add TaskLocation class to represent the peon that tasks are running on
  • Add TaskRunner listeners that make it possible for callers to be notified when tasks move around (this happens when they get assigned to a peon for the first time, or potentially on restore)
  • Add getLocation to TaskRunnerWorkItem, mostly for the overlord servlet
  • Rework WorkerTaskMonitor to do management out of a single thread so it can handle status and location updates more simply.
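The TaskLocation value class described in the list above might look roughly like this — a minimal sketch, assuming a simple host/port pair with an "unknown" sentinel for tasks not yet assigned to a peon; field and method names are illustrative, not the exact merged code:

```java
// Sketch of a TaskLocation-style immutable value class (names are assumptions).
public final class TaskLocation
{
  private final String host;
  private final int port;

  // Sentinel for "location not yet known", e.g. before a peon is assigned.
  public static TaskLocation unknown()
  {
    return new TaskLocation(null, -1);
  }

  public TaskLocation(String host, int port)
  {
    this.host = host;
    this.port = port;
  }

  public String getHost() { return host; }

  public int getPort() { return port; }

  @Override
  public boolean equals(Object o)
  {
    if (this == o) return true;
    if (!(o instanceof TaskLocation)) return false;
    TaskLocation that = (TaskLocation) o;
    return port == that.port && java.util.Objects.equals(host, that.host);
  }

  @Override
  public int hashCode()
  {
    return java.util.Objects.hash(host, port);
  }

  @Override
  public String toString()
  {
    return "TaskLocation{host='" + host + "', port=" + port + "}";
  }
}
```

Value semantics (equals/hashCode) matter here because listeners and the overlord servlet want to compare locations to detect moves.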

Contributor

can you add some comments about what this does and how to extend it?

Contributor

why not have the listener take an executor?

Contributor Author

It works better when the runner takes it

@gianm gianm force-pushed the task-hostports branch 4 times, most recently from 999af33 to 8d882ae, on February 9, 2016 03:26
@drcrallen
Contributor

I do want to review this but it will take a bit to chew through.

Contributor

can this just use Worker?

Contributor Author

It's not really semantically a worker: one worker is going to have many tasks at many taskLocations (all at different ports from the parent worker).

Contributor

Then can they be DruidNodes?

Contributor

I guess chatPort makes that a no?

Contributor

is it possible to simply propagate DruidNode and have chatPort discoverable from the node data exposed at DruidNode?

Contributor

What about just using HostAndPort? (I'm trying to minimize the number of items that are added to the code)

Contributor

FWIW, in #2242 I am making DruidServerMetadata the source of truth for a Druid server's metadata. I think it's reasonable to make DruidServerMetadata contain a minimum set of metadata (e.g., host, port, name), and then let Worker, TaskLocation, and DruidServer extend it.

Contributor Author

HostAndPort doesn't have Jackson annotations. We could register a Jackson module for it, I suppose (or maybe it's already part of the GuavaModule?). I am also OK with replacing this with the stuff from #2242 when that PR is ready.

Contributor Author

It looks like HostAndPort serde is included in the GuavaModule, although it uses the ToStringSerializer, which will be kind of a pain to deserialize for people who aren't linking Guava in their app. So I am leaning towards keeping TaskLocation as its own class and potentially changing it after #2242.

Contributor

Ok

@drcrallen
Contributor

FYI, this also uses similar functionality to the remote task runner replacement in #2246

@fjy fjy added this to the 0.9.1 milestone Feb 9, 2016
Contributor

what happens if a task finishes in this block of code?

Contributor

ah, should be fine if unannounceTask is idempotent
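The idempotence requirement discussed here can be met by making the unannounce path a no-op on repeat calls. A hypothetical stdlib-only sketch (the class and method names are assumptions, not the actual WorkerTaskMonitor code) — ConcurrentMap.remove returns null the second time, so double-unannouncing a finished task is harmless:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Sketch: idempotent unannounce via ConcurrentMap.remove.
public class TaskAnnouncements
{
  private final ConcurrentMap<String, String> announcements = new ConcurrentHashMap<>();

  public void announceTask(String taskId, String location)
  {
    announcements.put(taskId, location);
  }

  // Returns true only the first time a given task is unannounced;
  // later calls find nothing to remove and do nothing.
  public boolean unannounceTask(String taskId)
  {
    return announcements.remove(taskId) != null;
  }
}
```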

@fjy
Contributor

fjy commented Feb 10, 2016

👍 even if my comments are not addressed

Contributor

nit: you could skip this announcement if details.location was already the same as location.

@himanshug
Contributor

@gianm I see that this PR allows you to know the host and port of peons from the runningTasks HTTP endpoint on the overlord, so in Tranquility you would use the same to find the tasks. However, realtime indexing tasks open a separate chat handler port, which you do not obtain from the TaskLocation. Are you planning to continue using service discovery to find the chat handler port, or should TaskLocation be updated to carry some kind of metadata object so it can hold additional information like the chat handler port?

@gianm
Contributor Author

gianm commented Feb 10, 2016

@himanshug that's a good point. I was thinking that the chat handler was still on the same port as the main servlet, but that's not true anymore since separateIngestionEndpoint was added.

Ideally this should be part of the location as well; will adjust that.

@gianm gianm closed this Feb 10, 2016
@gianm gianm reopened this Feb 10, 2016
@gianm
Contributor Author

gianm commented Feb 10, 2016

@himanshug reopened with chatPort included.

@gianm gianm force-pushed the task-hostports branch 2 times, most recently from b4aa259 to ab9ee21, on February 11, 2016 00:44
@gianm gianm closed this Feb 11, 2016
@gianm gianm reopened this Feb 11, 2016
@gianm gianm force-pushed the task-hostports branch 2 times, most recently from 0c6298e to 0a5956b, on February 24, 2016 22:38
@gianm
Contributor Author

gianm commented Feb 24, 2016

@guobingkun a task should only have one location at a time; if it is running in two places, then the RTR should kill the one it doesn't like.

Contributor

there is no guarantee of execution order or completion here (nor error reporting on error?)

For example, if an over-burdened executor is used that does not have a FIFO queue, location changes can be processed in no particular order compared to the call to notifyLocationChanged.
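The ordering concern raised here is exactly why listener callbacks are typically run on a caller-supplied single-threaded executor: a single-thread FIFO executor delivers notifications in submission order, which a multi-threaded pool does not guarantee. A small stdlib-only illustration (not the actual TaskRunner code; names are assumptions):

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Sketch: notifications submitted to a single-threaded executor reach the
// listener in FIFO order, so location changes cannot be observed out of order.
public class ListenerOrdering
{
  public static List<String> deliverInOrder(List<String> locations)
  {
    final List<String> seen = new CopyOnWriteArrayList<>();
    final ExecutorService exec = Executors.newSingleThreadExecutor();
    for (final String location : locations) {
      // Plays the role of notifyLocationChanged for one listener.
      exec.submit(() -> seen.add(location));
    }
    exec.shutdown();
    try {
      exec.awaitTermination(10, TimeUnit.SECONDS);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new RuntimeException(e);
    }
    return seen;
  }
}
```

With a multi-thread pool in place of newSingleThreadExecutor, the observed order could differ from the submission order, which is the hazard described above.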

Contributor Author

Added comments to registerListener on TaskRunner clarifying the intended usage.

@gianm gianm force-pushed the task-hostports branch 3 times, most recently from 2acbfef to e18f9ad, on February 24, 2016 23:12
- Add TaskLocation class
- Add registerListener to TaskRunner
- Add getLocation to TaskRunnerWorkItem
- Implement location tracking in existing TaskRunners
- Rework WorkerTaskMonitor to do management out of a single thread so it can
  handle status and location updates more simply.
@drcrallen
Contributor

Cool, 👍 but suggest removing #2419 (diff)

@gianm gianm closed this Feb 25, 2016
@gianm gianm reopened this Feb 25, 2016
@gianm
Contributor Author

gianm commented Feb 25, 2016

I think this failed due to #2430

(https://travis-ci.org/druid-io/druid/builds/111621322)

testSessionKilled(io.druid.curator.announcement.AnnouncerTest)  Time elapsed: 61.322 sec  <<< ERROR!
java.lang.Exception: test timed out after 60000 milliseconds
    at java.lang.Thread.sleep(Native Method)
    at io.druid.curator.announcement.AnnouncerTest.testSessionKilled(AnnouncerTest.java:176)

fjy added a commit that referenced this pull request Feb 26, 2016
Plumb task peon host/ports back out to the overlord.
@fjy fjy merged commit 143e85e into apache:master Feb 26, 2016
seoeun25 pushed a commit to seoeun25/incubator-druid that referenced this pull request Jan 10, 2020
@gianm gianm deleted the task-hostports branch September 23, 2022 19:28


6 participants