Timeout for LockAcquireAction #4461

Merged
jihoonson merged 7 commits into apache:master from akashdw:timeout_for_aquirelock
Jul 11, 2017

@akashdw akashdw commented Jun 23, 2017

Acquiring a task lock for overlapping intervals causes a deadlock. Also, the Overlord can run out of Jetty threads when the TaskActionClient times out and retries while acquiring the same lock. This PR introduces a timeout for LockAcquireAction.

@akashdw akashdw force-pushed the timeout_for_aquirelock branch from 238a606 to 0151042 Compare June 23, 2017 23:32
@jihoonson
Contributor

What do you think about adding a timeout for tryLock() as well? I think it would be useful.

> Acquiring task lock for overlapping intervals causes a deadlock.

Maybe tryLock() and lock() should return the TaskLockPosse when they fail, so that we can tell whether the lock failure is due to a deadlock?

@akashdw akashdw closed this Jun 26, 2017
@akashdw akashdw reopened this Jun 26, 2017
Contributor Author

akashdw commented Jun 26, 2017

Thanks @jihoonson.
A timeout is not required for tryLock() because it is non-blocking and returns immediately.
I made changes to throw an exception and emit an alert in case of an overlapping interval.

Contributor

@jihoonson jihoonson left a comment

@akashdw thanks for the update. I left more comments.

);
}

@Test(expected = ISE.class)
Contributor

It's recommended to use ExpectedException to check that the exception is thrown from the right place (see the comment in #4292).

Assert.assertFalse(lockbox.tryLock(task, new Interval("2015-01-01/2015-01-02")).isPresent());
}

@Test(expected = InterruptedException.class)
Contributor

It's recommended to use ExpectedException to check that the exception is thrown from the right place (see the comment in #4292).

Contributor Author

👍


public TaskLockbox(
TaskStorage taskStorage,
long lockTimeoutMillis
Contributor

Should every lockAcquireAction have the same timeout? I think it would be valuable if we could change the timeout depending on task specs in the future. For example, a task could have a longer timeout if it needs to acquire a lock for a long interval.

Contributor Author

> Should every lockAcquireAction have the same timeout?

Yes for now; it can later be extended so that the client can send a timeout.
For now, serverConfig.getMaxIdleTime() is the default timeout because even if the task gets the lock after this period, the Overlord cannot write the response to the closed socket.

Contributor

> For now serverConfig.getMaxIdleTime() is the default timeout b/c even if task gets a lock after this period, overlord can not write the response in the closed socket.

Using maxIdleTime as a default value sounds good.

> Yes for now, it can also be extended where client can send a timeout.

I think we will need to set different timeouts for each lock request in the very near future, because I'm working on prioritized locking (#4479, #1679), and this timeout feature would be great if tasks could set different timeouts according to their priorities.

Contributor Author

Sure, I'll create a separate PR where the client can send a lock timeout; if it does not, the default serverConfig.getMaxIdleTime() will be used.

Contributor

Ok. Thanks!

Contributor

@akashdw I raised #4533.
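
For illustration only, the fallback described in this thread (use the timeout sent by the client if there is one, otherwise the server default) might be sketched as below. This is not the code from this PR or from #4533; the class `LockTimeoutSketch`, the method `effectiveTimeoutMillis`, and the use of `OptionalLong` are hypothetical names chosen for the example, and the default is passed in as a plain `long` rather than read from `serverConfig.getMaxIdleTime()`.

```java
import java.util.OptionalLong;

public class LockTimeoutSketch {
    /**
     * Picks the timeout for one lock request.
     * Assumption: a missing or non-positive client value means "use the default".
     */
    public static long effectiveTimeoutMillis(OptionalLong requestedMillis, long defaultMillis) {
        if (requestedMillis.isPresent() && requestedMillis.getAsLong() > 0) {
            return requestedMillis.getAsLong();
        }
        return defaultMillis;
    }

    public static void main(String[] args) {
        long defaultMillis = 300_000L; // stand-in for the server-side default
        System.out.println(effectiveTimeoutMillis(OptionalLong.empty(), defaultMillis));
        System.out.println(effectiveTimeoutMillis(OptionalLong.of(5_000L), defaultMillis));
    }
}
```

The sketch does not cap the client value at the default. Whether a request may exceed the HTTP idle time is a design question this thread leaves open.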

)
{
this.taskStorage = taskStorage;
this.lockTimeoutMillis = serverConfig.getMaxIdleTime().getMillis();
Contributor

Contributor Author

serverConfig.getMaxIdleTime() corresponds to druid.server.http.maxIdleTime and is not deprecated.

Contributor

Ah, right.

  Optional<TaskLock> taskLock;
  while (!(taskLock = tryLock(task, interval)).isPresent()) {
-   lockReleaseCondition.await();
+   lockReleaseCondition.await(lockTimeoutMillis, TimeUnit.MILLISECONDS);
Contributor

lockTimeoutMillis should be updated if this line returns early.

Contributor Author

If this line returns early, that means the task got the lock. I'm not sure why a final variable of this class should be updated?

Contributor

A spurious wakeup can occur while awaiting (https://docs.oracle.com/javase/7/docs/api/java/util/concurrent/locks/Condition.html). If the thread wakes up early but still cannot get the lock, the remaining waiting time should be updated.

Contributor Author

👍
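
As a rough sketch of the point made in this thread: `Condition.awaitNanos` returns an estimate of the time still remaining, so a wait loop can carry that value into the next iteration instead of restarting the full timeout after every early wakeup. The code below is an illustration, not the change that was merged here; `TimedAwaitSketch`, `acquire`, `signalRelease`, and the `BooleanSupplier` standing in for `tryLock(task, interval)` are all hypothetical.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;
import java.util.function.BooleanSupplier;

public class TimedAwaitSketch {
    private final ReentrantLock giant = new ReentrantLock();
    private final Condition lockReleaseCondition = giant.newCondition();

    /**
     * Retries tryAcquire until it succeeds or timeoutMillis has elapsed in total.
     * Returns true on success and false on timeout.
     */
    public boolean acquire(BooleanSupplier tryAcquire, long timeoutMillis) throws InterruptedException {
        long remainingNanos = TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
        giant.lock();
        try {
            while (!tryAcquire.getAsBoolean()) {
                if (remainingNanos <= 0) {
                    return false; // total budget used up
                }
                // awaitNanos returns the estimated time left, so a spurious or
                // early wakeup shortens the next wait instead of resetting it.
                remainingNanos = lockReleaseCondition.awaitNanos(remainingNanos);
            }
            return true;
        } finally {
            giant.unlock();
        }
    }

    /** Wakes all waiters so that they re-check their condition. */
    public void signalRelease() {
        giant.lock();
        try {
            lockReleaseCondition.signalAll();
        } finally {
            giant.unlock();
        }
    }
}
```

With this shape, a caller that never obtains the lock returns after roughly `timeoutMillis` regardless of how many times it was woken in between.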

derbyConnector
);
- taskLockbox = new TaskLockbox(taskStorage);
+ taskLockbox = new TaskLockbox(taskStorage, 300000);
Contributor

I think a timeout of 300 seconds is too long. Of course, waiting for 300 seconds is not usual, but if a lock is only acquired after waiting that long, the unit tests are likely to fail on Travis eventually due to the job timeout. I think it would be better to make the tests fail earlier rather than wait for such a long time. The same applies to the other tests.

Contributor Author

👍

Contributor Author

akashdw commented Jul 1, 2017

Thanks @jihoonson. Addressed the comments.

log.makeAlert("Same Task is trying to acquire lock for overlapping interval")
.addData("task", task.getId())
.addData("interval", interval);
throw new ISE(
Contributor

Why did you remove the code throwing an exception here? I think it would be better to throw an exception immediately if a deadlock is found rather than waiting for the lock request to expire.

Contributor Author

@jihoonson Overlapping intervals cause a deadlock for LockAcquireAction but not for LockTryAcquireAction, so I think throwing an exception from tryLock is unnecessary. The other problem I can see is in SegmentAllocateAction: https://github.com/druid-io/druid/blob/master/indexing-service/src/main/java/io/druid/indexing/common/actions/SegmentAllocateAction.java?utf8=%E2%9C%93#L177. SegmentAllocateAction tries to acquire locks for the same task with different intervals, and throwing an exception for an overlapping interval in tryLock might cause segment allocations to fail.
For now I'm just alerting and logging, which might be useful if someone runs into a deadlock issue.

Contributor

That makes sense.
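
To make the "overlapping interval" condition in this thread concrete, here is a small sketch of an overlap check against the intervals a task already holds. It is an assumption-laden illustration rather than TaskLockbox code: intervals are modeled as half-open millisecond ranges instead of Joda `Interval`, and `OverlapCheckSketch`, `Span`, `recordHeld`, and `overlapsHeld` are hypothetical names. What to do on a hit (alert and log, or throw) is left to the caller, as discussed above.

```java
import java.util.ArrayList;
import java.util.List;

public class OverlapCheckSketch {
    /** Half-open time range [startMillis, endMillis). */
    public static final class Span {
        final long startMillis;
        final long endMillis;

        public Span(long startMillis, long endMillis) {
            if (endMillis < startMillis) {
                throw new IllegalArgumentException("end before start");
            }
            this.startMillis = startMillis;
            this.endMillis = endMillis;
        }

        /** Two half-open ranges overlap when each starts before the other ends. */
        public boolean overlaps(Span other) {
            return startMillis < other.endMillis && other.startMillis < endMillis;
        }
    }

    // Spans already locked by one task (hypothetical bookkeeping).
    private final List<Span> held = new ArrayList<>();

    public void recordHeld(Span span) {
        held.add(span);
    }

    /** True if the requested span overlaps any span this task already holds. */
    public boolean overlapsHeld(Span requested) {
        for (Span span : held) {
            if (span.overlaps(requested)) {
                return true;
            }
        }
        return false;
    }
}
```

Because the ranges are half-open, two adjacent spans such as [0, 10) and [10, 20) do not count as overlapping.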

EasyMock.replay(serverConfig);

- ServiceEmitter emitter = EasyMock.createMock(ServiceEmitter.class);
+ ServiceEmitter emitter = EasyMock.createMock(ServiceEmitter.class);
Contributor

Please remove the unnecessary space.

public void testLockAfterTaskComplete() throws InterruptedException
{
Task task = NoopTask.create();
exception.expect(IllegalStateException.class);
Contributor

This should be ISE. Also, it would be better to check the exception message as well. Please refer to AppenderatorDriverFailTest as an example.

public void testTryLockAfterTaskComplete() throws InterruptedException
{
Task task = NoopTask.create();
exception.expect(IllegalStateException.class);
Contributor

This should be ISE. Also, it would be better to check the exception message as well. Please refer to AppenderatorDriverFailTest as an example.

{
final TaskConfig taskConfig = new TaskConfig(directory.getPath(), null, null, 50000, null, false, null, null);
- final TaskLockbox taskLockbox = new TaskLockbox(taskStorage);
+ final TaskLockbox taskLockbox = new TaskLockbox(taskStorage, 300000);
Contributor

Would you reduce the timeout here as well?

Contributor Author

Ah, I forgot to reduce the timeout here. Done.

INDEX_MERGER.persist(index, persistDir, indexSpec);

- final TaskLockbox tl = new TaskLockbox(ts);
+ final TaskLockbox tl = new TaskLockbox(ts, 300000);
Contributor

Would you reduce the timeout here as well?

Preconditions.checkNotNull(emitter);

- taskLockbox = new TaskLockbox(taskStorage);
+ taskLockbox = new TaskLockbox(taskStorage, 300000);
Contributor

Would you reduce the timeout here as well?

@jihoonson
Contributor

@akashdw thank you for the update. The latest patch looks good to me.

@jihoonson jihoonson merged commit 5f411f1 into apache:master Jul 11, 2017
@jon-wei jon-wei added this to the 0.11.0 milestone Oct 18, 2017
3 participants