Timeout for LockAcquireAction #4461
Conversation
What do you think about adding a timeout for `LockAcquireAction`?
Maybe we need to return …
Thanks @jihoonson.
```java
  );
}

@Test(expected = ISE.class)
```
It's recommended to use ExpectedException to check the exception is thrown from the right place. (#4292 (comment))
```java
  Assert.assertFalse(lockbox.tryLock(task, new Interval("2015-01-01/2015-01-02")).isPresent());
}

@Test(expected = InterruptedException.class)
```
It's recommended to use ExpectedException to check the exception is thrown from the right place. (#4292 (comment))
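To illustrate the point behind this recommendation: `@Test(expected = ...)` passes no matter where in the test body the exception is thrown, whereas the `ExpectedException` rule lets you pin down the throw site and the message. A minimal plain-Java sketch of the distinction (the `acquireLock` helper and its message are hypothetical stand-ins, not Druid code):

```java
// Sketch: why a class-level "expected" annotation is too coarse.
// Mimicking what JUnit's ExpectedException rule checks: the exception
// must come from exactly the call under test and carry the right message.
public class ExpectedExceptionSketch
{
  // Hypothetical stand-in for the code under test.
  static void acquireLock(boolean taskComplete)
  {
    if (taskComplete) {
      throw new IllegalStateException("Unable to grant lock to inactive Task");
    }
  }

  // True only if the expected exception with the expected message
  // is thrown from this specific call.
  static boolean throwsWithMessage(boolean taskComplete, String expectedFragment)
  {
    try {
      acquireLock(taskComplete);
      return false; // no exception at all
    }
    catch (IllegalStateException e) {
      return e.getMessage().contains(expectedFragment);
    }
  }

  public static void main(String[] args)
  {
    if (!throwsWithMessage(true, "Unable to grant lock")) {
      throw new AssertionError();
    }
    if (throwsWithMessage(false, "Unable to grant lock")) {
      throw new AssertionError();
    }
  }
}
```

With the `ExpectedException` rule, `exception.expect(...)` plus `exception.expectMessage(...)` placed immediately before the failing call gives the same guarantee inside a JUnit test.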
```java
public TaskLockbox(
    TaskStorage taskStorage,
    long lockTimeoutMillis
```
Should every lockAcquireAction have the same timeout? I think it would be valuable if we can change the timeout depending on task specs in the future. For example, a task can have a long timeout if it should acquire a lock for a long interval.
> Should every lockAcquireAction have the same timeout?

Yes for now; it can also be extended so that the client can send a timeout.
For now `serverConfig.getMaxIdleTime()` is the default timeout because even if the task gets a lock after this period, the Overlord cannot write the response to the closed socket.
> For now `serverConfig.getMaxIdleTime()` is the default timeout because even if the task gets a lock after this period, the Overlord cannot write the response to the closed socket.

Using maxIdleTime as a default value sounds good.

> Yes for now; it can also be extended so that the client can send a timeout.

I think we need to set different timeouts for each lock request in the very near future, because I'm working on prioritized locking (#4479, #1679) and this timeout feature will be great if tasks can set different timeouts according to their priorities.
Sure, I'll create a separate PR where the client can send a lock timeout; if not provided, the default `serverConfig.getMaxIdleTime()` will be used.
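The extension discussed above could look roughly like this. This is a hypothetical sketch (method and constant names are illustrative, not the actual Druid API): a request-level timeout overrides the server-wide default:

```java
import java.util.Optional;

// Hypothetical sketch of a per-request lock timeout falling back to a
// server-wide default (e.g. derived from druid.server.http.maxIdleTime).
public class LockTimeoutSketch
{
  // Stand-in for serverConfig.getMaxIdleTime().getMillis().
  static final long DEFAULT_TIMEOUT_MILLIS = 300_000L;

  // A lock request may carry its own timeout; empty means "use the default".
  // Prioritized locking could pick the request timeout from task priority.
  static long effectiveTimeout(Optional<Long> requestTimeoutMillis)
  {
    return requestTimeoutMillis.orElse(DEFAULT_TIMEOUT_MILLIS);
  }

  public static void main(String[] args)
  {
    if (effectiveTimeout(Optional.empty()) != 300_000L) {
      throw new AssertionError();
    }
    if (effectiveTimeout(Optional.of(5_000L)) != 5_000L) {
      throw new AssertionError();
    }
  }
}
```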
```java
)
{
  this.taskStorage = taskStorage;
  this.lockTimeoutMillis = serverConfig.getMaxIdleTime().getMillis();
```
maxIdleTime looks to be deprecated (http://druid.io/docs/latest/configuration/indexing-service.html).
serverConfig.getMaxIdleTime() corresponds to druid.server.http.maxIdleTime and is not deprecated.
```diff
 Optional<TaskLock> taskLock;
 while (!(taskLock = tryLock(task, interval)).isPresent()) {
-  lockReleaseCondition.await();
+  lockReleaseCondition.await(lockTimeoutMillis, TimeUnit.MILLISECONDS);
```
lockTimeoutMillis should be updated if this line returns early.
If this line returns early, that means the task got the lock; not sure why a final variable of this class should be updated?
A spurious wakeup can occur while awaiting (https://docs.oracle.com/javase/7/docs/api/java/util/concurrent/locks/Condition.html). If it wakes up early but couldn't get the lock yet, the remaining waiting time should be updated.
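The standard fix for this is the `awaitNanos` pattern from the `Condition` javadoc: `awaitNanos` returns the time remaining, so the loop passes it back in on the next iteration instead of restarting the full timeout after a spurious wakeup. A self-contained sketch (a simplified stand-in for the lockbox, not the actual Druid code):

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

// Sketch of timed lock acquisition that survives spurious wakeups:
// awaitNanos returns the remaining wait time, so we carry it across
// loop iterations rather than waiting the full timeout again.
public class TimedAcquireSketch
{
  private final ReentrantLock giant = new ReentrantLock();
  private final Condition lockReleaseCondition = giant.newCondition();
  private boolean lockAvailable = false;

  boolean acquire(long timeoutMillis) throws InterruptedException
  {
    giant.lock();
    try {
      long nanosLeft = TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
      while (!lockAvailable) {
        if (nanosLeft <= 0) {
          return false; // timed out
        }
        // Returns the remaining time; a spurious wakeup simply
        // re-enters the loop with less time left.
        nanosLeft = lockReleaseCondition.awaitNanos(nanosLeft);
      }
      lockAvailable = false; // take the lock
      return true;
    }
    finally {
      giant.unlock();
    }
  }

  void release()
  {
    giant.lock();
    try {
      lockAvailable = true;
      lockReleaseCondition.signalAll();
    }
    finally {
      giant.unlock();
    }
  }

  public static void main(String[] args)
  {
    try {
      TimedAcquireSketch lockbox = new TimedAcquireSketch();
      // Nobody releases the lock, so a short timeout should expire.
      if (lockbox.acquire(50)) {
        throw new AssertionError("expected timeout");
      }
      lockbox.release();
      if (!lockbox.acquire(50)) {
        throw new AssertionError("expected acquisition");
      }
    }
    catch (InterruptedException e) {
      throw new RuntimeException(e);
    }
  }
}
```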
```diff
   derbyConnector
 );
-taskLockbox = new TaskLockbox(taskStorage);
+taskLockbox = new TaskLockbox(taskStorage, 300000);
```
I think the timeout of 300 seconds is too long. Of course, waiting for 300 seconds is not usual, but if the lock is acquired after waiting for 300 seconds, it is likely to make unit tests fail eventually on Travis due to the job timeout. I think it would be better to make tests fail earlier rather than waiting for such a long time. This is the same for other tests.
Thanks @jihoonson, addressed comments.
```java
log.makeAlert("Same Task is trying to acquire lock for overlapping interval")
   .addData("task", task.getId())
   .addData("interval", interval);
throw new ISE(
```
Why did you remove the code throwing an exception here? I think it would be better to throw an exception immediately if a deadlock is found rather than waiting for the lock request to be expired.
@jihoonson Overlapping intervals cause a deadlock for LockAcquireAction, not for LockTryAcquireAction, so I think throwing an exception for tryLock is unnecessary. The other problem I could see is in SegmentAllocateAction (https://github.com/druid-io/druid/blob/master/indexing-service/src/main/java/io/druid/indexing/common/actions/SegmentAllocateAction.java?utf8=%E2%9C%93#L177): SegmentAllocateAction tries to acquire locks for the same task with different intervals, and throwing an exception for overlapping intervals in tryLock might cause segment allocations to fail.
For now I'm just alerting and logging, which might be useful if someone runs into the deadlock issue.
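To make the distinction above concrete: a request for an interval that overlaps (but does not equal) one the same task already holds can never be granted, so a blocking `lock` would wait forever, while `tryLock` just keeps returning absent and the caller decides what to do. A simplified sketch of the overlap check, using plain millisecond ranges as a hypothetical stand-in for Joda-Time `Interval`:

```java
// Sketch: why overlapping intervals from the same task deadlock a blocking
// lock() but are harmless for tryLock(). Intervals are [start, end) millis.
public class OverlapSketch
{
  static boolean overlaps(long aStart, long aEnd, long bStart, long bEnd)
  {
    return aStart < bEnd && bStart < aEnd;
  }

  // tryLock-style check: a task already holding [heldStart, heldEnd)
  // cannot be granted a different interval that overlaps it.
  static boolean tryLock(long heldStart, long heldEnd, long reqStart, long reqEnd)
  {
    boolean sameInterval = heldStart == reqStart && heldEnd == reqEnd;
    if (!sameInterval && overlaps(heldStart, heldEnd, reqStart, reqEnd)) {
      // A blocking lock() would wait here forever, since the task itself
      // holds the conflicting lock; tryLock just reports failure.
      return false;
    }
    return true;
  }

  public static void main(String[] args)
  {
    // Same interval: re-acquisition is fine.
    if (!tryLock(0, 100, 0, 100)) throw new AssertionError();
    // Overlapping, different interval: never grantable.
    if (tryLock(0, 100, 50, 150)) throw new AssertionError();
    // Disjoint interval: fine.
    if (!tryLock(0, 100, 100, 200)) throw new AssertionError();
  }
}
```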
```diff
 EasyMock.replay(serverConfig);

-ServiceEmitter emitter = EasyMock.createMock(ServiceEmitter.class);
+ServiceEmitter emitter = EasyMock.createMock(ServiceEmitter.class);
```
Please remove the unnecessary space.
```java
public void testLockAfterTaskComplete() throws InterruptedException
{
  Task task = NoopTask.create();
  exception.expect(IllegalStateException.class);
```
This should be ISE. Also, it would be better to check the exception message as well. Please refer to AppenderatorDriverFailTest as an example.
```java
public void testTryLockAfterTaskComplete() throws InterruptedException
{
  Task task = NoopTask.create();
  exception.expect(IllegalStateException.class);
```
This should be ISE. Also, it would be better to check the exception message as well. Please refer to AppenderatorDriverFailTest as an example.
```diff
 {
   final TaskConfig taskConfig = new TaskConfig(directory.getPath(), null, null, 50000, null, false, null, null);
-  final TaskLockbox taskLockbox = new TaskLockbox(taskStorage);
+  final TaskLockbox taskLockbox = new TaskLockbox(taskStorage, 300000);
```
Would you reduce the timeout here as well?
Ahh, forgot to reduce the timeout here. Done.
```diff
 INDEX_MERGER.persist(index, persistDir, indexSpec);

-final TaskLockbox tl = new TaskLockbox(ts);
+final TaskLockbox tl = new TaskLockbox(ts, 300000);
```
Would you reduce the timeout here as well?
```diff
 Preconditions.checkNotNull(emitter);

-taskLockbox = new TaskLockbox(taskStorage);
+taskLockbox = new TaskLockbox(taskStorage, 300000);
```
Would you reduce the timeout here as well?
@akashdw thank you for the update. The latest patch looks good to me.
Acquiring a task lock for overlapping intervals causes a deadlock. Also, the Overlord can run out of Jetty threads when the TaskActionClient times out and retries while acquiring the same lock. This PR introduces a timeout for `LockAcquireAction`.