
Add errors and state to stream supervisor status API endpoint#7428

Merged
clintropolis merged 28 commits into apache:master from justinborromeo:7217-Supervisor-Error-Endpoint-v4
Jun 1, 2019

Conversation

@justinborromeo
Contributor

Closes #7217

Based on the proposal and discussion in #7217.

The PR adds the following config values (which weren't discussed in the proposal):

|Property|Description|Values|Default|
|--------|-----------|------|-------|
|druid.supervisor.stream.healthinessThreshold|The number of successful iterations before the supervisor flips from an UNHEALTHY to a RUNNING state|An integer in [3, 2147483647]|3|
|druid.supervisor.stream.unhealthinessThreshold|The number of iterations failed before the supervisor flips from a RUNNING to an UNHEALTHY state|An integer in [3, 2147483647]|3|
|druid.supervisor.stream.taskHealthinessThreshold|The number of consecutive task successes before the supervisor flips from an UNHEALTHY_TASKS to a RUNNING state|An integer in [3, 2147483647]|3|
|druid.supervisor.stream.taskUnhealthinessThreshold|The number of consecutive task failures before the supervisor flips from a RUNNING to an UNHEALTHY_TASKS state|An integer in [3, 2147483647]|3|
|druid.supervisor.stream.storingStackTraces|Whether full stack traces of supervisor exceptions should be stored and returned by the supervisor /status endpoint|true/false|false|
|druid.supervisor.stream.maxStoredExceptionEvents|The maximum number of exception events that can be returned through the supervisor /status endpoint|An integer in [max(healthinessThreshold, unhealthinessThreshold), 2147483647]|max(healthinessThreshold, unhealthinessThreshold)|
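
For orientation, a minimal sketch of how these properties might appear in the Overlord's runtime.properties, using the names and defaults from the table above (the values shown are illustrative only):

```properties
# Illustrative values only; names and defaults are taken from the table above.
druid.supervisor.stream.healthinessThreshold=3
druid.supervisor.stream.unhealthinessThreshold=3
druid.supervisor.stream.taskHealthinessThreshold=3
druid.supervisor.stream.taskUnhealthinessThreshold=3
# Store full stack traces in the /status endpoint's error list (off by default).
druid.supervisor.stream.storingStackTraces=false
# Defaults to max(healthinessThreshold, unhealthinessThreshold).
druid.supervisor.stream.maxStoredExceptionEvents=3
```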

@jon-wei jon-wei self-assigned this Apr 10, 2019
|UNHEALTHY_SUPERVISOR|The supervisor has encountered non-transient errors on the past `druid.supervisor.stream.unhealthinessThreshold` iterations|1|
|UNHEALTHY_TASKS|The last `druid.supervisor.stream.taskUnhealthinessThreshold` tasks have all failed|2|
|UNABLE_TO_CONNECT_TO_STREAM|The supervisor is encountering connectivity issues with Kafka and has not successfully connected in the past|3|
|LOST_CONTACT_WITH_STREAM|The supervisor is encountering transient connectivity issues with Kafka but has successfully connected in the past|4|
Contributor

nit: For LOST_CONTACT_WITH_STREAM, are the connectivity issues necessarily transient? Maybe sometimes the connectivity never comes back without operator intervention.

Notes about states:

- Since it's possible that 2+ states can apply to a supervisor at the same time, each state is given a priority. The
active state with the highest priority will be returned in the status report.
Contributor

is priority 1 considered higher than priority 5, or the other way around?
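
For illustration, a minimal sketch of what the priority mechanism could look like, assuming (this is an assumption, not something settled in this thread) that a lower number means higher priority, with the numbers taken from the state table above:

```java
// Hypothetical sketch of priority-based state selection; the real enum lives in
// SeekableStreamSupervisorStateManager and may differ.
public enum State
{
  UNHEALTHY_SUPERVISOR(1),
  UNHEALTHY_TASKS(2),
  UNABLE_TO_CONNECT_TO_STREAM(3),
  LOST_CONTACT_WITH_STREAM(4),
  RUNNING(5); // assumed priority for the healthy state

  private final int priority;

  State(int priority)
  {
    this.priority = priority;
  }

  // Assumed semantics: the lower number wins, so the most severe active state is reported.
  public static State getHigherPriorityState(State a, State b)
  {
    return a.priority <= b.priority ? a : b;
  }
}
```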

}

private State supervisorState;
// Remove all throwableEvents that aren't in this set at the end of each run (transient)
Contributor

This comment seems misplaced

tasksHealthy = false;
}
}
if (tasksHealthy && currentRunState == State.UNHEALTHY_TASKS) {
Contributor

Can currentRunState ever be UNHEALTHY_TASKS here? It starts as RUNNING but I don't see anything above this setting it to UNHEALTHY_TASKS

Contributor Author

You're right. That should have been supervisorState, but it's a redundant check because it's already done on line 173.

import org.junit.BeforeClass;
import org.junit.Test;

public class CircularBufferTest
Contributor

There's also a ChangeRequestHistoryTest test suite that has a test for CircularBuffer, can you move that test here now that you've added a separate test suite?

throws ExecutionException, InterruptedException, TimeoutException, JsonProcessingException
{
possiblyRegisterListener();
stateManager.setStateIfNoSuccessfulRunYet(SeekableStreamSupervisorStateManager.State.CONNECTING_TO_STREAM);
Contributor

@jon-wei jon-wei Apr 18, 2019

Hm, I feel like it would be cleaner if the state manager handled the decision of whether to transition to a particular state based on whether a successful run has occurred or not (I don't think the caller should have to know that it needs to call either setStateIfNoSuccessfulRunYet or setState depending on the state)
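
Something like the following is a minimal sketch of that idea, with a hypothetical `isFirstRunOnly()` marker on the state (not the actual API in this PR):

```java
// Hypothetical: the state manager itself decides whether a "first run only" state applies,
// so callers never have to choose between setState and setStateIfNoSuccessfulRunYet.
public void setState(State newState)
{
  if (newState.isFirstRunOnly() && atLeastOneSuccessfulRun) {
    return; // ignore states that are only meaningful before the first successful run
  }
  this.supervisorState = newState;
}
```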


for (Map.Entry<Class, Queue<ExceptionEvent>> events : eventStore.getNonTransientRecentEvents().entrySet()) {
if (events.getValue().size() >= unhealthinessThreshold) {
if (events.getKey().equals(NonTransientStreamException.class)) {
Contributor

Is it possible that you once connected successfully to the stream, but later get a lot of NonTransientStreamException?

Contributor Author

Yes. Although it shouldn't happen in normal operation, I could imagine that there are some edge cases where that might occur (e.g. someone changing a Kafka cluster's permissions, causing an already successfully running supervisor to start throwing auth exceptions?). Do you think there's value in subdividing the LOST_CONTACT_WITH_STREAM state into LOST_CONTACT_WITH_STREAM_NON_TRANSIENT and LOST_CONTACT_WITH_STREAM_TRANSIENT states?

Contributor

Hm, I'm not sure that tying transience to the type of exception is the right approach (I think transience is more about the number of consecutive failures regardless of the exception type).

Since the error states are trying to convey whether the supervisor is having stream connectivity issues (UNABLE_TO_CONNECT_TO_STREAM, LOST_CONTACT_WITH_STREAM) or if it's some other kind of issue (UNHEALTHY_SUPERVISOR), I'm thinking it would be better to separate the exceptions into two categories:

  • Stream connection problems
  • Problems unrelated to stream connectivity

UNABLE_TO_CONNECT_TO_STREAM would be the state when you have stream connection exceptions over the configured threshold, and you've not successfully connected before

LOST_CONTACT_WITH_STREAM would occur when you have stream connection exceptions over the configured threshold, and you have successfully connected before

UNHEALTHY_SUPERVISOR would occur when you have non-stream connection exceptions over the threshold
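
A rough sketch of the decision logic being proposed here; the method and parameter names (`streamErrorCount`, `otherErrorCount`, `atLeastOneSuccessfulRun`) are illustrative rather than taken from the PR:

```java
// Hypothetical sketch of the two-category proposal: stream-connection errors vs. everything else.
private State evaluateErrorState(
    int streamErrorCount,
    int otherErrorCount,
    boolean atLeastOneSuccessfulRun,
    int unhealthinessThreshold
)
{
  if (streamErrorCount >= unhealthinessThreshold) {
    // Stream-connection exceptions over the threshold: which state depends on connection history.
    return atLeastOneSuccessfulRun ? State.LOST_CONTACT_WITH_STREAM : State.UNABLE_TO_CONNECT_TO_STREAM;
  }
  if (otherErrorCount >= unhealthinessThreshold) {
    return State.UNHEALTHY_SUPERVISOR;
  }
  return State.RUNNING;
}
```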

Contributor Author

Although I agree that your implementation would result in a cleaner API design, I still think the transience of an error is important to convey since it indicates severity. For example, a Kinesis supervisor that encounters LimitExceededExceptions (a temporary API limit exception) over the first threshold runs shouldn't be treated equally in severity to a supervisor that throws some sort of auth exception over the first threshold runs since only the former could possibly recover without user intervention. If we were to eliminate that concept, there'd be no difference from the API caller's perspective.

In the initial discussion on this change's proposal, @dclim said "If it transitions from RUNNING (healthy) to UNHEALTHY, assume that someone has hooked up a monitoring system to it and is going to get paged at 4am in the morning, so it better actually be UNHEALTHY, and not some transient error that is going to resolve in the next minute". If we're set on working under the assumption that someone's hooking up this endpoint to a pager, I think the transience of the error is very relevant.

Contributor

If we're set on working under the assumption that someone's hooking up this endpoint to a pager, I think the transience of the error is very relevant.

If the exceptions have different severities, then I think you could be more lenient on the failure thresholds for low severity errors. UNABLE_TO_CONNECT and LOST_CONTACT should accurately reflect the real state/history though, I don't think fudging that for low severity errors is the right approach.

Contributor

Another consideration might be to have the status report contain some error severity indication; if you only had low severity errors, the report could indicate that somehow, and a user that didn't care too much about low-sev errors could choose not to alert in such situations. Also, is the classification of low-sev vs high-sev errors very clear?

Contributor

@jon-wei jon-wei Apr 20, 2019

Splitting the states into transient vs non transient could be useful, but maybe better to indicate transience or severity in a separate field?

Contributor Author

I can add a field to ExceptionEvent that indicates transience then have the states set regardless of error transience. Is that approach cool?

Contributor

@justinborromeo That sounds good to me
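
For context, a minimal sketch of what carrying a transience flag on the event could look like; the field and constructor here are illustrative only (a later diff in this thread shows the actual class ended up with a StreamErrorTransience field):

```java
import org.joda.time.DateTime;

// Hypothetical sketch: each stored exception event records whether the error was considered transient.
public class ExceptionEvent
{
  private final String errorClass;
  private final String errorMessage;
  private final DateTime timestamp;
  private final boolean transientError; // assumed field; the PR later uses a StreamErrorTransience enum

  public ExceptionEvent(Throwable t, boolean transientError)
  {
    this.errorClass = t.getClass().getName();
    this.errorMessage = t.getMessage();
    this.timestamp = DateTime.now();
    this.transientError = transientError;
  }
}
```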

} else {
currentRunState = State.UNHEALTHY_TASKS;
}
} else {
Contributor

Since UNHEALTHY_SUPERVISOR is higher priority than UNHEALTHY_TASKS, I think you could skip this else block when currentRunState has been set to UNHEALTHY_SUPERVISOR

currentRunState = getHigherPriorityState(currentRunState, State.UNABLE_TO_CONNECT_TO_STREAM);
} else if (events.getKey().equals(TransientStreamException.class) ||
events.getKey().equals(PossiblyTransientStreamException.class)) {
currentRunState = getHigherPriorityState(currentRunState, State.LOST_CONTACT_WITH_STREAM);
Contributor

Similarly, is it possible that you've never successfully connected to the stream, but you've only gotten PossiblyTransient or Transient exceptions?

Contributor Author

@justinborromeo justinborromeo Apr 20, 2019

Yes. I was debating whether to give the supervisor the higher severity UNABLE_TO_CONNECT_TO_STREAM state in that case. I opted against doing so because I believe it's better to incorrectly label a non-transient issue as LOST_CONTACT than to incorrectly label a transient issue as UNABLE_TO_CONNECT (assuming someone's pager is hooked up to this endpoint). Do you think there's value in subdividing the UNABLE_TO_CONNECT_TO_STREAM state into UNABLE_TO_CONNECT_TO_STREAM_NON_TRANSIENT and UNABLE_TO_CONNECT_TO_STREAM_TRANSIENT states?

this.stateHistory = new CircularBuffer<>(Math.max(healthinessThreshold, unhealthinessThreshold));
}

public Optional<State> setStateAndCheckIfFirstRun(State newState)
Contributor

I think it's more readable if this is just called setState, the logic around some states only occurring on the first iteration of the supervisor could be documented more fully in a method javadoc or somewhere else state-related

dclim
dclim previously requested changes Apr 24, 2019

// The number of runs failed before the supervisor flips from a RUNNING to an UNHEALTHY state
@JsonProperty
@Min(3)
Contributor

Is there any particular reason to limit the minimum to 3 in any of these?

@Min(3)
private int unhealthinessThreshold = 3;

// The number of successful before the supervisor flips from an UNHEALTHY to a RUNNING state
Contributor

'successful runs'?

* @return {@link AppenderatorDriverAddResult}
*
* @throws IOException if there is an I/O error while allocating or writing to a segment
* @throws IOException if there is an I/O error while allocating or writing to a segmentq
Contributor

typo

private DateTime timestamp;
private StreamErrorTransience streamErrorTransience;

public ExceptionEvent()
Contributor

Is this needed?

Contributor Author

Jackson's deserializer complained that there's no default constructor, but I can replace this with a @JsonCreator constructor.
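
For example, roughly the following (a fragment only; the property names are assumed from the fields shown in the diff above, and the annotations are the standard com.fasterxml.jackson.annotation ones):

```java
// Hypothetical @JsonCreator constructor so Jackson can deserialize ExceptionEvent
// without requiring a default constructor.
@JsonCreator
public ExceptionEvent(
    @JsonProperty("timestamp") DateTime timestamp,
    @JsonProperty("streamErrorTransience") StreamErrorTransience streamErrorTransience
)
{
  this.timestamp = timestamp;
  this.streamErrorTransience = streamErrorTransience;
}
```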

private final DateTime offsetsLastUpdated;
private final boolean suspended;
private final SeekableStreamSupervisorStateManager.State state;
private final Queue<SeekableStreamSupervisorStateManager.ExceptionEvent> recentErrors;
Contributor

minor: I think typically we use List rather than Queue if there's no particular reason for it to be a queue. ExceptionEventStore.getRecentEvents() should probably return something more common like List or Collection. Also, while we're at it, it should probably copy the data into an immutable data type instead of passing the underlying ConcurrentLinkedQueue around, to avoid accidental modification.
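
For instance, when building the status report payload, something along these lines (a sketch; ImmutableList is Guava's, and getRecentEvents() is the method named in the comment above):

```java
// Hypothetical: snapshot the queue into an immutable List so the status report can't be used
// to mutate the underlying ConcurrentLinkedQueue.
List<SeekableStreamSupervisorStateManager.ExceptionEvent> recentErrors =
    ImmutableList.copyOf(eventStore.getRecentEvents());
```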

* checks if there's been at least one successful iteration if needed and sets supervisor state to an appropriate
* new state.
*/
public Optional<State> setState(State newState)
Contributor

Doesn't seem like anyone reads the return value

public void storeThrowableEvent(Throwable t)
{
if (t instanceof PossiblyTransientStreamException && atLeastOneSuccessfulRun) {
t = new TransientStreamException(t);
Contributor

How about wrap the underlying exception instead of wrapping PossiblyTransientStreamException in a TransientStreamException?
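
i.e. roughly the following (a sketch; it assumes PossiblyTransientStreamException carries the original error as its cause, and that TransientStreamException has the Throwable constructor used in the diff above):

```java
// Hypothetical: wrap the original cause rather than stacking wrapper on wrapper.
if (t instanceof PossiblyTransientStreamException && atLeastOneSuccessfulRun) {
  t = new TransientStreamException(t.getCause());
}
```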


public void markRunFinishedAndEvaluateHealth()
{
if (currentRunSuccessful) {
Contributor

More elegant as atLeastOneSuccessfulRun |= currentRunSuccessful;

}
}

supervisorStateHistory.add(currentRunState);
Contributor

Shouldn't this come after the UNHEALTHY_SUPERVISOR state check?


State currentRunState = State.RUNNING;

for (Map.Entry<Class, Queue<ExceptionEvent>> events : eventStore.getNonTransientRecentEvents().entrySet()) {
Contributor

I don't think this is clearing error conditions properly. Let's say that you have maxStoredExceptionEvents set to something very high, such that ExceptionEventStore.storeThrowable() never removes any items from recentEventsQueue or recentEventsMap. Then let's say you have unhealthinessThreshold set to 3 and you accumulate 3 TransientStreamException events - It'll drop into a LOST_CONTACT_WITH_STREAM state. Then next run, say there were no errors encountered on the run - it'll go back into a RUNNING state. But the next run if you get another TransientStreamException, it'll immediately drop back into a LOST_CONTACT_WITH_STREAM state. Again assuming maxStoredExceptionEvents is very high and is never hit, you could have a stream running successfully for hours, and then on the next TransientStreamException it goes into an unhealthy state without respecting unhealthinessThreshold. Does this behavior sound like what would happen?
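
One way to avoid the scenario described above would be to reset the accumulated events whenever a run completes cleanly, so the threshold always measures consecutive failing runs. A minimal sketch, using the currentRunSuccessful flag visible elsewhere in the diff plus a hypothetical reset method (this is not how the PR ultimately resolved it; the state manager was later reworked):

```java
// Hypothetical sketch: clear accumulated exception events after a clean run so that
// unhealthinessThreshold only counts consecutive failing runs.
public void markRunFinishedAndEvaluateHealth()
{
  if (currentRunSuccessful) {
    atLeastOneSuccessfulRun = true;
    eventStore.resetRecentEvents(); // assumed method; drops events from prior runs
  }
  // ... threshold evaluation over the (possibly reset) events would follow here ...
}
```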

@dclim
Contributor

dclim commented May 4, 2019

In 128edad I made some modifications to the implementation along two main lines:

  1. After some consideration, I decided to remove the whole concept of classifying exceptions by their transience, as suggested in #7428 (comment). I think it added more complexity than value, but more importantly, it could mislead users when we incorrectly classify an error as transient when in reality it will never recover without user intervention. Some examples: in Kafka, TimeoutException gets classified as 'transient' if we've previously had a successful run, but without knowing why the timeout is happening, how could you know if it would ever resolve? Is it because the network was congested momentarily, or is it because the Kafka broker got zapped by lightning and is now a smoldering pile of ashes? In Kinesis, the generic AmazonKinesisException gets classified as 'non-transient', but I can bet that there is, or will be in a future release, a subclass exception that is actually a transient failure that we haven't accounted for because it hasn't been written yet. The bottom line is that trying to classify exceptions is fragile, so it's better not to try.

  2. In resolving the issue mentioned in #7428 (comment), together with removing the transience concept from (1), SeekableStreamSupervisorStateManager was fairly heavily modified from the original implementation. Most of the other files remain largely the same. I added some missed state capture points in SeekableStreamSupervisor and removed some that were capturing failures in non-run-loop code blocks (e.g. I don't want the supervisor reporting an unhealthy state if someone repeatedly hits a status endpoint with a bad request but the main loop is fine).

@dclim dclim force-pushed the 7217-Supervisor-Error-Endpoint-v4 branch from 552cab1 to f686d2a on May 4, 2019 19:52
@dclim dclim dismissed their stale review May 6, 2019 20:14

Self-implemented proposed changes

Member

@clintropolis clintropolis left a comment

Had first pass, will do another soon 👍

|CREATING_TASKS (first iteration only)|The supervisor is creating tasks and discovering state|
|RUNNING|The supervisor has started tasks and is waiting for taskDuration to elapse|
|SUSPENDED|The supervisor has been suspended|
|SHUTTING_DOWN|Shutdown has been called but the supervisor hasn’t fully shutdown yet|
Member

I think this state should be STOPPING instead of SHUTTING_DOWN since it is tied to the supervisor 'stop' method, and importantly to avoid confusion with the deprecated supervisor 'shutdown' API call, which is now called 'terminate' and tombstones the supervisor.

Contributor

Sounds good

{
public enum State
{
// Error states - ordered from high to low priority
Member

Is there still a priority in this PR? It looks to me like UNABLE_TO_CONNECT_TO_STREAM or LOST_CONTACT_WITH_STREAM would come first depending on if a successful run had happened, then UNHEALTHY_SUPERVISOR, then UNHEALTHY_TASKS?

Contributor

Yeah the priority is more implicit now - there is no overlap between LOST_CONTACT_WITH_STREAM, UNABLE_TO_CONNECT_TO_STREAM, and UNHEALTHY_SUPERVISOR - it reports one of the first two if the last exception thrown was wrapped in a StreamException (which happens for exceptions from calls made to the Kafka/Kinesis client library), otherwise it's UNHEALTHY_SUPERVISOR. Any of those 3 take priority over UNHEALTHY_TASKS.
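
Restated as a sketch, the implicit ordering described above is roughly the following (names are assumed, and it presumes the relevant unhealthiness threshold has already been crossed):

```java
// Hypothetical restatement of the implicit priority: stream-related states first,
// then UNHEALTHY_SUPERVISOR, and UNHEALTHY_TASKS only if the supervisor loop itself is fine.
private State resolveReportedState(Throwable lastError, boolean hadSuccessfulRun, boolean tasksUnhealthy)
{
  if (lastError != null) {
    if (lastError instanceof StreamException) {
      // Exceptions from calls into the Kafka/Kinesis client libraries are wrapped in StreamException.
      return hadSuccessfulRun ? State.LOST_CONTACT_WITH_STREAM : State.UNABLE_TO_CONNECT_TO_STREAM;
    }
    return State.UNHEALTHY_SUPERVISOR;
  }
  return tasksUnhealthy ? State.UNHEALTHY_TASKS : State.RUNNING;
}
```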


|Property|Description|Default|
|--------|-----------|-------|
|druid.supervisor.stream.healthinessThreshold|The number of successful iterations before the supervisor flips from an UNHEALTHY to a RUNNING state|3|
Member

inconsistent terminology, this should be UNHEALTHY_SUPERVISOR?

Contributor

Not necessarily - it could also have been in a LOST_CONTACT_WITH_STREAM or UNABLE_TO_CONNECT_TO_STREAM state, which are also 'unhealthy' states. Maybe it shouldn't be capitalized, to avoid confusion?

Contributor

Similarly with the rest of the comments about UNHEALTHY -> UNHEALTHY_SUPERVISOR.

|Property|Description|Default|
|--------|-----------|-------|
|druid.supervisor.stream.healthinessThreshold|The number of successful iterations before the supervisor flips from an UNHEALTHY to a RUNNING state|3|
|druid.supervisor.stream.unhealthinessThreshold|The number of iterations failed before the supervisor flips from a RUNNING to an UNHEALTHY state|3|
Member

ditto UNHEALTHY_SUPERVISOR


|Property|Description|Default|
|--------|-----------|-------|
|druid.supervisor.stream.healthinessThreshold|The number of successful iterations before the supervisor flips from an UNHEALTHY to a RUNNING state|3|
Member

UNHEALTHY_SUPERVISOR?

|Property|Description|Default|
|--------|-----------|-------|
|druid.supervisor.stream.healthinessThreshold|The number of successful iterations before the supervisor flips from an UNHEALTHY to a RUNNING state|3|
|druid.supervisor.stream.unhealthinessThreshold|The number of iterations failed before the supervisor flips from a RUNNING to an UNHEALTHY state|3|
Member

UNHEALTHY_SUPERVISOR?

currentRunSuccessful = false;
}

public void storeCompletedTaskState(TaskState state)
Member

Is this storing anything?

Contributor

Depends on your definition of storing. Maybe something like trackCompletedTask() would be less confusing?

@JsonProperty
private boolean storingStackTraces = false;

// The number of runs failed before the supervisor flips from a RUNNING to an UNHEALTHY state
Member

UNHEALTHY -> UNHEALTHY_SUPERVISOR?

@JsonProperty
private int unhealthinessThreshold = 3;

// The number of successful runs before the supervisor flips from an UNHEALTHY to a RUNNING state
Member

UNHEALTHY -> UNHEALTHY_SUPERVISOR?

consumer lag per partition may be reported as negative values if the supervisor has not received a recent latest offset
response from Kafka. The aggregate lag value will always be >= 0.

The status report also contains the supervisor's state and a list of recently thrown exceptions (whose max size can be
Member

Since afaict UNHEALTHY_SUPERVISOR, UNHEALTHY_TASKS, UNABLE_TO_CONNECT_TO_STREAM, and LOST_CONTACT_WITH_STREAM don't all seem mutually exclusive, what exactly is the value in distinguishing them at all instead of just a single UNHEALTHY state? It seems like the end result is going to be hitting the API to get the errors so you can find out the 'why' and determine how to resolve the situation, no matter which unhealthy state, no? Am I missing something?

Contributor

Well, everything but UNHEALTHY_TASKS is mutually exclusive.

I think the value is in providing a bit more information to help in debugging, mainly in distinguishing between the 'unable to connect' and 'lost contact' cases. If you didn't distinguish this and just had a list of recent exceptions, how would you be able to tell if this supervisor ever worked and is possibly suffering a 'transient' issue, other than by looking through logs? The 'unhealthy supervisor' case is then a necessary third option to handle exceptions that don't fall into either of the first two categories because they're not stream-related. 'Unhealthy tasks' is more of a nice to have - that way monitoring systems don't have to additionally parse the response of the task API endpoints to figure out that a bunch of tasks are failing.

return partitions.stream().map(PartitionInfo::partition).collect(Collectors.toSet());
return wrapExceptions(() -> {
// use consumer.listTopics() instead of partitionsFor() to force a remote call so we can detect stream issues
Map<String, List<PartitionInfo>> topics = consumer.listTopics();
Member

I think this was explicitly changed to partitionsFor because of the overhead of listTopics if you have a ton of topics. See #6455

Contributor

Hah nice, thanks for pointing that PR out. In that case, I can go back to using partitionsFor
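
i.e. going back to something like this (a sketch; wrapExceptions is the helper visible in the diff above, and partitionsFor/partition are the standard Kafka consumer and PartitionInfo methods):

```java
// Hypothetical sketch: partitionsFor(topic) avoids the listTopics() overhead noted in #6455,
// while wrapExceptions() still turns client failures into StreamExceptions for state tracking.
return wrapExceptions(() -> {
  List<PartitionInfo> partitions = consumer.partitionsFor(topic);
  return partitions.stream().map(PartitionInfo::partition).collect(Collectors.toSet());
});
```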

Contributor

@jon-wei jon-wei left a comment

Latest code changes LGTM, had some comments on the docs


|State|Description|
|-----|-----------|
|UNHEALTHY_SUPERVISOR|The supervisor has encountered errors on the past `druid.supervisor.unhealthinessThreshold` iterations|
Contributor

Suggest adding some info about the difference between the basic states and the detailed implementation-specific states, and how the detailed implementation states here map to basic states

Contributor

👍

|SUSPENDED|The supervisor has been suspended|
|STOPPING|The supervisor is stopping|

States marked with "first iteration only" only occur on the supervisor's first iteration at startup or after suspension.
Contributor

Suggest adding a short high-level summary of the Kafka/Kinesis supervisor's runInternal() loop. The info is kind of there implicitly in the ordering of the states above, but I think a more explicit description of the per-iteration sequence would be useful

Contributor

👍

Member

@clintropolis clintropolis left a comment

lgtm 👍


|State|Description|
|-----|-----------|
|UNHEALTHY_SUPERVISOR|The supervisor has encountered errors on the past `druid.supervisor.unhealthinessThreshold` iterations|
Member

Should these be broken down by how implementation state maps to universal state, like the Kafka docs?

Contributor

Added, thanks for catching

@clintropolis clintropolis merged commit 8032c4a into apache:master Jun 1, 2019
@justinborromeo justinborromeo deleted the 7217-Supervisor-Error-Endpoint-v4 branch June 1, 2019 01:12
jihoonson pushed a commit to implydata/druid-public that referenced this pull request Jun 26, 2019
…#7428)

* Add state and error tracking for seekable stream supervisors

* Fixed nits in docs

* Made inner class static and updated spec test with jackson inject

* Review changes

* Remove redundant config param in supervisor

* Style

* Applied some of Jon's recommendations

* Add transience field

* write test

* implement code review changes except for reconsidering logic of markRunFinishedAndEvaluateHealth()

* remove transience reporting and fix SeekableStreamSupervisorStateManager impl

* move call to stateManager.markRunFinished() from RunNotice to runInternal() for tests

* remove stateHistory because it wasn't adding much value, some fixes, and add more tests

* fix tests

* code review changes and add HTTP health check status

* fix test failure

* refactor to split into a generic SupervisorStateManager and a specific SeekableStreamSupervisorStateManager

* fixup after merge

* code review changes - add additional docs

* cleanup KafkaIndexTaskTest

* add additional documentation for Kinesis indexing

* remove unused throws class
gianm pushed a commit to implydata/druid-public that referenced this pull request Jul 3, 2019
@clintropolis clintropolis added this to the 0.16.0 milestone Aug 8, 2019