Skip to content

HRTR: make pending task execution handling to go through all tasks on not finding worker slots#8697

Merged
himanshug merged 9 commits intoapache:masterfrom
himanshug:updates
Dec 12, 2019
Merged

HRTR: make pending task execution handling to go through all tasks on not finding worker slots#8697
himanshug merged 9 commits intoapache:masterfrom
himanshug:updates

Conversation

@himanshug
Copy link
Copy Markdown
Contributor

@himanshug himanshug commented Oct 18, 2019

Description

HRTR currently hangs on to a task try finding worker for it, with #7066 as I segregate tasks based on their type (to workaround problem mentioned in #8622 ) .. it is possible to find no worker for one task while other tasks could still find a worker, so it needs to go through all tasks and try to find workers for them.

HRTR is an experimental task runner. RTR based on zookeeper doesn't have this problem.

Additionally added few more endpoints to HttpRemoteTaskRunnerResource to introspect state of HttpRemoteTaskRunner for debugging.


This PR has:

  • been self-reviewed.
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths.
  • been tested in a test Druid cluster.

Key changed/added classes in this PR
  • HttpRemoteTaskRunner

@lgtm-com
Copy link
Copy Markdown

lgtm-com Bot commented Oct 18, 2019

This pull request introduces 1 alert when merging 8cf89ca6b42fa72397ced3f6c25e8de89b086015 into 1ca8595 - view on LGTM.com

new alerts:

  • 1 for Dereferenced variable may be null

@lgtm-com
Copy link
Copy Markdown

lgtm-com Bot commented Nov 6, 2019

This pull request introduces 1 alert when merging c8c1f40 into 7b77cf1 - view on LGTM.com

new alerts:

  • 1 for Dereferenced variable may be null

@himanshug himanshug added this to the 0.17.0 milestone Nov 19, 2019
@jon-wei jon-wei added the Bug label Dec 10, 2019
@himanshug
Copy link
Copy Markdown
Contributor Author

lgtm issue can be ignored in this case and I have been running this patch in prod for quite some time. in any case, we are probably only one using HRTR for now.

@jihoonson
Copy link
Copy Markdown
Contributor

lgtm issue can be ignored in this case and I have been running this patch in prod for quite some time. in any case, we are probably only one using HRTR for now.

Hi @himanshug, I'm reviewing this PR. The LGTM warning is about the below code block.

        try {
          if (immutableWorker == null) {
            throw new ISE("NULL immutableWorker");
          }

          // this will send HTTP request to worker for assigning task
          if (!runTaskOnWorker(taskItem, immutableWorker.getWorker().getHost())) {
            if (taskItem.getState() == HttpRemoteTaskRunnerWorkItem.State.PENDING_WORKER_ASSIGN) {
              taskItem.revertStateFromPendingWorkerAssignToPending();
            }
          }
        }
        catch (InterruptedException ex) {
          log.info("Got InterruptedException while assigning task[%s].", taskId);
          throw ex;
        }
        catch (Throwable th) {
          log.makeAlert(th, "Exception while trying to assign task")
             .addData("taskId", taskId)
             .emit();

          taskComplete(taskItem, null, TaskStatus.failure(taskId));
        }
        finally {
          synchronized (statusLock) {
            workersWithUnacknowledgedTask.remove(immutableWorker.getWorker().getHost());
            statusLock.notifyAll();
          }
        }

It seems like immutableWorker can be null in the finally clause and so the LGTM warning looks legit to me. Would you explain more why this is ok and can be ignored?

Copy link
Copy Markdown
Contributor

@jihoonson jihoonson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@himanshug thanks for the fix. Left a couple of comments.

* Must not be used outside of this class and {@link HttpRemoteTaskRunnerResource}
*/
@SuppressWarnings("GuardedBy") // Read on workersWithUnacknowledgedTask is safe
public Map<String, ImmutableWorkerInfo> getWorkersEligibleToRunTasks()
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks better to be package-private if it's not supposed to be used in general.

Copy link
Copy Markdown
Contributor Author

@himanshug himanshug Dec 11, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changed, here and in few other similar places

workers,
input -> !lazyWorkers.containsKey(input.getKey()) &&
!workersWithUnacknowledgedTask.containsKey(input.getKey()) &&
!blackListedWorkers.containsKey(input.getKey()) &&
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, this looks racy because each concurrent map is updated under different locks. But is this ok because this method is called periodically?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added comment to explain it.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks.

State(int index)
private int index;
private boolean isPending;
private RunnerTaskState runnerTaskState;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These variables can be final.

Copy link
Copy Markdown
Contributor Author

@himanshug himanshug Dec 11, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changed, for some reason I thought enum state was immutable without even explicitly doing so.. duh

{
if (log.isDebugEnabled()) {
log.debug(
new RuntimeException("Stacktrace..."),
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this to print the stack trace so that you can see where this method was called from when the log was printed?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, added a comment too.

.addData("taskId", taskId)
.emit();

taskComplete(taskItem, null, TaskStatus.failure(taskId));
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is called outside of statusLock.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes taskCompleted(..) has to be called outside of statusLock .. I have added comments please take a look.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, sorry. I misread codes that taskComplete requires to be called under statusLock. Thanks for adding comments.


if (taskItem.getTask() == null) {
log.makeAlert("No Task obj found in TaskItem for taskID[%s]. Failed.", taskId).emit();
taskComplete(taskItem, null, TaskStatus.failure(taskId));
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is called outside of statusLock.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same as above


if (ti == null || !ti.getState().isPending()) {
// happens if the task was shutdown or was picked up earlier and no more pending
iter.remove();
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like pendingTaskIds is updated only here. This will lead to processing the same taskId multiple times. HRTR will assign the task to some worker in the first loop, ignore it until the worker starts processing that task in the below if clause, and then finally remove the taskId here. Can we update pendingTaskIds when the task is successfully assigned?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

before leaving the lock , ti state is updated to PENDING_WORKER_ASSIGN so no other task executor thread can pick up.

pendingTaskIds variable declaration has a comment that this variable is exclusively manipulated by only external task submitter threads or task executor threads which I preferred.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, sorry I don't think I fully understand this logic yet.

before leaving the lock , ti state is updated to PENDING_WORKER_ASSIGN so no other task executor thread can pick up.

Do you mean the other threads can pick up even after the ti state is set to PENDING_WORKER_ASSIGN but will ignore it in the below if clause? Or, do they really not pick up because somehow pendingTaskIds is updated before other threads start picking up?

pendingTaskIds variable declaration has a comment that this variable is exclusively manipulated by only external task submitter threads or task executor threads which I preferred.

Yes, I think it makes sense and I prefer that too. But can pendingTaskIds be updated after this line? Like as the below code:

            // set state to PENDING_WORKER_ASSIGN before releasing the lock so that this task item is not picked
            // up by another task execution thread.
            ti.setState(HttpRemoteTaskRunnerWorkItem.State.PENDING_WORKER_ASSIGN);
            iter.remove(); // new line to remove it from pendingTaskIds immediately
            taskItem = ti;
            break;

I don't think this is enough since the task id should be added back to pendingTaskIds if the worker couldn't execute the assigned task, but seems possible.

Copy link
Copy Markdown
Contributor Author

@himanshug himanshug Dec 12, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean the other threads can pick up even after the ti state is set to PENDING_WORKER_ASSIGN but will ignore it in the below if clause? Or, do they really not pick up because somehow pendingTaskIds is updated before other threads start picking up?

yes, other executor threads would see that but ignore because of the if clause you noted.

But can pendingTaskIds be updated after this line? Like as the below code: ...

if we removed here then we might have to add it back just in case task couldn't run for whatever reason, adding back would be complex because we will find out that we need to add it back at some later time and also adding back to its old position in the list would be difficult.
as opposed to all that we remove it from pendintTaskIds only when task is known to be no longer in PENDING or PENDING_WORKER_ASSIGN HttpRemoteTaskRunnerWorkItem.State by whichever pendingTasksExecutor thread notices that .

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if you still have doubts, maybe describe a scenario where you think this doesn't work and I will try to explain how I think that scenario would play out :) .

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh I believe this would work. I was just curious whether it was intended or not. I'm ok with it if this was intended. Would you please add a comment to avoid confusion in the future?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

got it. sure, added some commentary to hopefully make it clearer.

@himanshug
Copy link
Copy Markdown
Contributor Author

@jihoonson thanks for reviewing and you are right about LGTM alert ... I completely missed that it was used in finally block as well. moved the null check out of that try-catch .

Copy link
Copy Markdown
Contributor

@jihoonson jihoonson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! It looks more clear now. +1 after CI.

// back if this thread couldn't run this task for any reason, which we will know at some later time
// and also we will need to add it back to its old position in the list. that becomes complex quickly.
// Instead we keep the PENDING_WORKER_ASSIGN to notify other task execution threads not to pick this one up.
// And, it is automatically removed by any of the task exeuction threads when they notice that
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo: exeuction -> execution

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed, thanks

@himanshug himanshug merged commit 4510118 into apache:master Dec 12, 2019
himanshug added a commit to himanshug/druid that referenced this pull request Dec 12, 2019
… not finding worker slots (apache#8697)

* HRTR: make pending task execution handling to go through all tasks on
not finding worker slots

* make HRTR methods package private that are meant to be used only in HttpRemoteTaskRunnerResource

* mark HttpRemoteTaskRunnerWorkItem.State global variables final

* hrtr: move immutableWorker NULL check outside of try-catch or finally block could have NPE

* add some explanatory comments

* add comment on explaining mechanics around hand off of pending tasks from submission to it getting picked up by a task execution thread

* fix spelling
jihoonson pushed a commit that referenced this pull request Dec 13, 2019
… not finding worker slots (#8697) (#9022)

* HRTR: make pending task execution handling to go through all tasks on
not finding worker slots

* make HRTR methods package private that are meant to be used only in HttpRemoteTaskRunnerResource

* mark HttpRemoteTaskRunnerWorkItem.State global variables final

* hrtr: move immutableWorker NULL check outside of try-catch or finally block could have NPE

* add some explanatory comments

* add comment on explaining mechanics around hand off of pending tasks from submission to it getting picked up by a task execution thread

* fix spelling
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants