Timeout for LockAcquireAction #4461
Conversation
What do you think about adding a timeout for `LockAcquireAction`?
Maybe we need to return …
Thanks @jihoonson.
```java
  );
}

@Test(expected = ISE.class)
```
It's recommended to use ExpectedException to check the exception is thrown from the right place. (#4292 (comment))
```java
  Assert.assertFalse(lockbox.tryLock(task, new Interval("2015-01-01/2015-01-02")).isPresent());
}

@Test(expected = InterruptedException.class)
```
It's recommended to use ExpectedException to check the exception is thrown from the right place. (#4292 (comment))
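To illustrate the point behind this recommendation: `@Test(expected = ...)` passes no matter where in the test body the exception is thrown, whereas the `ExpectedException` rule lets you pin down the throw site and the message. A minimal plain-Java sketch of the distinction (the `acquireLock` helper and its message are hypothetical stand-ins, not Druid code):

```java
// Sketch: why a class-level "expected" annotation is too coarse.
// Mimicking what JUnit's ExpectedException rule checks: the exception
// must come from exactly the call under test and carry the right message.
public class ExpectedExceptionSketch
{
  // Hypothetical stand-in for the code under test.
  static void acquireLock(boolean taskComplete)
  {
    if (taskComplete) {
      throw new IllegalStateException("Unable to grant lock to inactive Task");
    }
  }

  // True only if the expected exception with the expected message
  // is thrown from this specific call.
  static boolean throwsWithMessage(boolean taskComplete, String expectedFragment)
  {
    try {
      acquireLock(taskComplete);
      return false; // no exception at all
    }
    catch (IllegalStateException e) {
      return e.getMessage().contains(expectedFragment);
    }
  }

  public static void main(String[] args)
  {
    if (!throwsWithMessage(true, "Unable to grant lock")) {
      throw new AssertionError();
    }
    if (throwsWithMessage(false, "Unable to grant lock")) {
      throw new AssertionError();
    }
  }
}
```

With the `ExpectedException` rule, `exception.expect(...)` plus `exception.expectMessage(...)` placed immediately before the failing call gives the same guarantee inside a JUnit test.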
```java
public TaskLockbox(
    TaskStorage taskStorage,
    long lockTimeoutMillis
```
Should every lockAcquireAction have the same timeout? I think it would be valuable if we can change the timeout depending on task specs in the future. For example, a task can have a long timeout if it should acquire a lock for a long interval.
> Should every lockAcquireAction have the same timeout?

Yes for now; it can also be extended so that the client can send a timeout.
For now `serverConfig.getMaxIdleTime()` is the default timeout because even if the task gets a lock after this period, the Overlord cannot write the response to the closed socket.
> For now `serverConfig.getMaxIdleTime()` is the default timeout because even if the task gets a lock after this period, the Overlord cannot write the response to the closed socket.

Using maxIdleTime as a default value sounds good.

> Yes for now; it can also be extended so that the client can send a timeout.

I think we need to set different timeouts for each lock request in the very near future, because I'm working on prioritized locking (#4479, #1679) and this timeout feature will be great if tasks can set different timeouts according to their priorities.
Sure, I'll create a separate PR where the client can send a lock timeout; if not provided, the default `serverConfig.getMaxIdleTime()` will be used.
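The extension discussed above could look roughly like this. This is a hypothetical sketch (method and constant names are illustrative, not the actual Druid API): a request-level timeout overrides the server-wide default:

```java
import java.util.Optional;

// Hypothetical sketch of a per-request lock timeout falling back to a
// server-wide default (e.g. derived from druid.server.http.maxIdleTime).
public class LockTimeoutSketch
{
  // Stand-in for serverConfig.getMaxIdleTime().getMillis().
  static final long DEFAULT_TIMEOUT_MILLIS = 300_000L;

  // A lock request may carry its own timeout; empty means "use the default".
  // Prioritized locking could pick the request timeout from task priority.
  static long effectiveTimeout(Optional<Long> requestTimeoutMillis)
  {
    return requestTimeoutMillis.orElse(DEFAULT_TIMEOUT_MILLIS);
  }

  public static void main(String[] args)
  {
    if (effectiveTimeout(Optional.empty()) != 300_000L) {
      throw new AssertionError();
    }
    if (effectiveTimeout(Optional.of(5_000L)) != 5_000L) {
      throw new AssertionError();
    }
  }
}
```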
```java
)
{
  this.taskStorage = taskStorage;
  this.lockTimeoutMillis = serverConfig.getMaxIdleTime().getMillis();
```
maxIdleTime looks to be deprecated (http://druid.io/docs/latest/configuration/indexing-service.html).
serverConfig.getMaxIdleTime() corresponds to druid.server.http.maxIdleTime and is not deprecated.
```diff
 Optional<TaskLock> taskLock;
 while (!(taskLock = tryLock(task, interval)).isPresent()) {
-  lockReleaseCondition.await();
+  lockReleaseCondition.await(lockTimeoutMillis, TimeUnit.MILLISECONDS);
```
lockTimeoutMillis should be updated if this line returns early.
If this line returns early, that means the task got the lock; not sure why a final variable of this class should be updated?
A spurious wakeup can occur while awaiting (https://docs.oracle.com/javase/7/docs/api/java/util/concurrent/locks/Condition.html). If it wakes up early but couldn't get the lock yet, the remaining waiting time should be updated.
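The standard fix for this is the `awaitNanos` pattern from the `Condition` javadoc: `awaitNanos` returns the time remaining, so the loop passes it back in on the next iteration instead of restarting the full timeout after a spurious wakeup. A self-contained sketch (a simplified stand-in for the lockbox, not the actual Druid code):

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

// Sketch of timed lock acquisition that survives spurious wakeups:
// awaitNanos returns the remaining wait time, so we carry it across
// loop iterations rather than waiting the full timeout again.
public class TimedAcquireSketch
{
  private final ReentrantLock giant = new ReentrantLock();
  private final Condition lockReleaseCondition = giant.newCondition();
  private boolean lockAvailable = false;

  boolean acquire(long timeoutMillis) throws InterruptedException
  {
    giant.lock();
    try {
      long nanosLeft = TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
      while (!lockAvailable) {
        if (nanosLeft <= 0) {
          return false; // timed out
        }
        // Returns the remaining time; a spurious wakeup simply
        // re-enters the loop with less time left.
        nanosLeft = lockReleaseCondition.awaitNanos(nanosLeft);
      }
      lockAvailable = false; // take the lock
      return true;
    }
    finally {
      giant.unlock();
    }
  }

  void release()
  {
    giant.lock();
    try {
      lockAvailable = true;
      lockReleaseCondition.signalAll();
    }
    finally {
      giant.unlock();
    }
  }

  public static void main(String[] args)
  {
    try {
      TimedAcquireSketch lockbox = new TimedAcquireSketch();
      // Nobody releases the lock, so a short timeout should expire.
      if (lockbox.acquire(50)) {
        throw new AssertionError("expected timeout");
      }
      lockbox.release();
      if (!lockbox.acquire(50)) {
        throw new AssertionError("expected acquisition");
      }
    }
    catch (InterruptedException e) {
      throw new RuntimeException(e);
    }
  }
}
```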
```diff
   derbyConnector
 );
-taskLockbox = new TaskLockbox(taskStorage);
+taskLockbox = new TaskLockbox(taskStorage, 300000);
```
I think the timeout of 300 seconds is too long. Of course, waiting for 300 seconds is not usual, but if the lock is acquired after waiting for 300 seconds, it is likely to make unit tests fail eventually on Travis due to the job timeout. I think it would be better to make tests fail earlier rather than waiting for such a long time. This is the same for other tests.
Thanks @jihoonson, addressed comments.
```java
log.makeAlert("Same Task is trying to acquire lock for overlapping interval")
   .addData("task", task.getId())
   .addData("interval", interval);
throw new ISE(
```
Why did you remove the code throwing an exception here? I think it would be better to throw an exception immediately if a deadlock is found rather than waiting for the lock request to be expired.
@jihoonson Overlapping intervals cause a deadlock for LockAcquireAction, not for LockTryAcquireAction, so I think throwing an exception for tryLock is unnecessary. The other problem I could see is in SegmentAllocateAction (https://github.com/druid-io/druid/blob/master/indexing-service/src/main/java/io/druid/indexing/common/actions/SegmentAllocateAction.java?utf8=%E2%9C%93#L177): SegmentAllocateAction tries to acquire locks for the same task with different intervals, and throwing an exception for overlapping intervals in tryLock might cause segment allocations to fail.
For now I'm just alerting and logging, which might be useful if someone runs into the deadlock issue.
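To make the distinction above concrete: a request for an interval that overlaps (but does not equal) one the same task already holds can never be granted, so a blocking `lock` would wait forever, while `tryLock` just keeps returning absent and the caller decides what to do. A simplified sketch of the overlap check, using plain millisecond ranges as a hypothetical stand-in for Joda-Time `Interval`:

```java
// Sketch: why overlapping intervals from the same task deadlock a blocking
// lock() but are harmless for tryLock(). Intervals are [start, end) millis.
public class OverlapSketch
{
  static boolean overlaps(long aStart, long aEnd, long bStart, long bEnd)
  {
    return aStart < bEnd && bStart < aEnd;
  }

  // tryLock-style check: a task already holding [heldStart, heldEnd)
  // cannot be granted a different interval that overlaps it.
  static boolean tryLock(long heldStart, long heldEnd, long reqStart, long reqEnd)
  {
    boolean sameInterval = heldStart == reqStart && heldEnd == reqEnd;
    if (!sameInterval && overlaps(heldStart, heldEnd, reqStart, reqEnd)) {
      // A blocking lock() would wait here forever, since the task itself
      // holds the conflicting lock; tryLock just reports failure.
      return false;
    }
    return true;
  }

  public static void main(String[] args)
  {
    // Same interval: re-acquisition is fine.
    if (!tryLock(0, 100, 0, 100)) throw new AssertionError();
    // Overlapping, different interval: never grantable.
    if (tryLock(0, 100, 50, 150)) throw new AssertionError();
    // Disjoint interval: fine.
    if (!tryLock(0, 100, 100, 200)) throw new AssertionError();
  }
}
```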
```diff
 EasyMock.replay(serverConfig);

-ServiceEmitter emitter = EasyMock.createMock(ServiceEmitter.class);
+ServiceEmitter emitter = EasyMock.createMock(ServiceEmitter.class);
```
Please remove the unnecessary space.
```java
public void testLockAfterTaskComplete() throws InterruptedException
{
  Task task = NoopTask.create();
  exception.expect(IllegalStateException.class);
```
This should be ISE. Also, it would be better to check the exception message as well. Please refer to AppenderatorDriverFailTest as an example.
```java
public void testTryLockAfterTaskComplete() throws InterruptedException
{
  Task task = NoopTask.create();
  exception.expect(IllegalStateException.class);
```
This should be ISE. Also, it would be better to check the exception message as well. Please refer to AppenderatorDriverFailTest as an example.
```diff
 {
   final TaskConfig taskConfig = new TaskConfig(directory.getPath(), null, null, 50000, null, false, null, null);
-  final TaskLockbox taskLockbox = new TaskLockbox(taskStorage);
+  final TaskLockbox taskLockbox = new TaskLockbox(taskStorage, 300000);
```
Would you reduce the timeout here as well?
Ahh, forgot to reduce the timeout here. Done.
```diff
 INDEX_MERGER.persist(index, persistDir, indexSpec);

-final TaskLockbox tl = new TaskLockbox(ts);
+final TaskLockbox tl = new TaskLockbox(ts, 300000);
```
Would you reduce the timeout here as well?
```diff
 Preconditions.checkNotNull(emitter);

-taskLockbox = new TaskLockbox(taskStorage);
+taskLockbox = new TaskLockbox(taskStorage, 300000);
```
Would you reduce the timeout here as well?
@akashdw thank you for the update. The latest patch looks good to me.
Acquiring a task lock for overlapping intervals causes a deadlock. Also, the Overlord can run out of Jetty threads when the TaskActionClient times out and retries while acquiring the same lock. This PR introduces a timeout for `LockAcquireAction`.