Skip to content

stop supervisor in case of metadata failures#3456

Merged
b-slim merged 1 commit intoapache:masterfrom
pjain1:kafka_indexing_fix
Oct 4, 2016
Merged

stop supervisor in case of metadata failures#3456
b-slim merged 1 commit intoapache:masterfrom
pjain1:kafka_indexing_fix

Conversation

@pjain1
Copy link
Copy Markdown
Member

@pjain1 pjain1 commented Sep 13, 2016

Possibly Fixes #3455

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure why not doing is gracefully by having true instead of false?

Copy link
Copy Markdown
Member Author

@pjain1 pjain1 Sep 15, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we stop it gracefully that means tasks that may have been started by this supervisor are allowed to publish what they have read which might be really tiny data as the metadata insert happens just after the supervisor starts.

Ideally supervisor.start() should be called only if metadata insert succeeds but it is not possible here.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I checked the code again and stop gracefully set to false means running tasks are not killed. So, false flag is meant to handle cases like leadership changes. Here, we would like to just kill the tasks so doing what @cheddar suggested would be correct.

@pjain1 pjain1 added the Bug label Sep 15, 2016
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I notice this comment here. I wonder if the supervisor shouldn't have a .verify() call on it to make sure the spec is legit, then persist to DB then start()?

Or, maybe just store in the DB first, start() and if start() doesn't work, fail out and remove from the DB again?

@pjain1
Copy link
Copy Markdown
Member Author

pjain1 commented Sep 15, 2016

updated the fix to start the supervisor first and then persist the spec if it succeeds otherwise write tombstone to prevent it being from picked up again.
@dclim does this make sense ?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wondering is this needs to have a retry for some kind of transitory exception or just fail the task. I think with metadata storage we have in place some retry mechanism.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is ok for now .. i don't think we have retry mechanism at least for this insert operation

@b-slim
Copy link
Copy Markdown
Contributor

b-slim commented Sep 15, 2016

fix seems legit, it will be nice if we can gracefully retry if the insert has a transient kind of issues. 👍

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does supervisor.start() failure guarantee that things are not left in a corrupted state and supervisor.stop(true) does not need to be called?

@dclim ?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@himanshug to be safe, we should probably make a call to consumer.close() in KafkaSupervisor.start() if the Kafka consumer was instantiated but an exception was thrown afterwards; @pjain1 do you mind adding this?

@dclim
Copy link
Copy Markdown
Contributor

dclim commented Sep 20, 2016

@pjain1 yes that seems good to me, sorry for missing your comment; logic looks good to me, maybe just ensure the Kafka consumer gets cleaned up if start() fails which is something I should've done before, thanks!

@pjain1
Copy link
Copy Markdown
Member Author

pjain1 commented Sep 20, 2016

@dclim added please verify

Copy link
Copy Markdown
Contributor

@dclim dclim Sep 20, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pjain1 Can you actually move the code below it:

      firstRunTime = DateTime.now().plus(ioConfig.getStartDelay());
      scheduledExec.scheduleAtFixedRate(
          buildRunTask(),
          ioConfig.getStartDelay().getMillis(),
          Math.max(ioConfig.getPeriod().getMillis(), MAX_RUN_FREQUENCY_MILLIS),
          TimeUnit.MILLISECONDS
      );

      started = true;
      log.info(
          "Started KafkaSupervisor[%s], first run in [%s], with spec: [%s]",
          dataSource,
          ioConfig.getStartDelay(),
          spec.toString()
      );

into the try-catch as well? And also you should probably do a (if consumer != null) {consumer.close();} to avoid an NPE

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@dclim
Copy link
Copy Markdown
Contributor

dclim commented Sep 20, 2016

@pjain1 sweet thanks, LGTM 👍

close kafka consumer in case supervisor start fails
@pjain1
Copy link
Copy Markdown
Member Author

pjain1 commented Sep 20, 2016

rebased with master

@pjain1 pjain1 added this to the 0.9.3 milestone Sep 25, 2016
@pjain1 pjain1 modified the milestones: 0.9.2, 0.9.3 Oct 4, 2016
@pjain1
Copy link
Copy Markdown
Member Author

pjain1 commented Oct 4, 2016

@dclim @b-slim @himanshug can we get this merged for 0.9.2-rc2

@b-slim
Copy link
Copy Markdown
Contributor

b-slim commented Oct 4, 2016

looks like we have 2 +1

@b-slim b-slim merged commit e419407 into apache:master Oct 4, 2016
@fjy
Copy link
Copy Markdown
Contributor

fjy commented Oct 4, 2016

@pjain1 backport plz

@pjain1
Copy link
Copy Markdown
Member Author

pjain1 commented Oct 4, 2016

@fjy #3532

@gianm gianm mentioned this pull request Oct 4, 2016
gianm pushed a commit that referenced this pull request Oct 5, 2016
* handle supervisor spec metadata failures
close kafka consumer in case supervisor start fails

* secure BrokerQueryResource endpoints
fundead pushed a commit to fundead/druid that referenced this pull request Dec 7, 2016
close kafka consumer in case supervisor start fails
@pjain1 pjain1 deleted the kafka_indexing_fix branch February 26, 2017 04:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants