Add TaskResourceCleaner; fix a couple of concurrency bugs in batch tasks by jihoonson · Pull Request #8236 · apache/druid

jihoonson · 2019-08-03T20:27:04Z

Description

There are a couple of bugs in CompactionTask and ParallelIndexSubTask. In CompactionTask, the current running indexTask should be killed when the compactionTask is killed. In ParallelIndexSubTask, I think this line should be in the synchronized block.

To reduce both the possibility of mistakes and duplicated code, I added a couple of helper methods to AbstractBatchIndexTask. I also removed the default implementation of stopGracefully() because it should now be called properly for all production tasks.

This PR has:

- been self-reviewed.
  - using the concurrency checklist (Remove this item if the PR doesn't have any relation to concurrency.)
- added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
- added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.

jon-wei · 2019-08-04T02:01:35Z

        final Firehose firehose = firehoseFactory.connect(dataSchema.getParser(), firehoseTempDir)
    ) {
-      this.appenderator = appenderator;
+      registerResourceCloserOnAbnormalExit(config -> appenderator.closeNow());


hm, I just saw that Appenderator has the following javadoc:

Concurrency: all methods defined in this class directly, including {@link #close()} and {@link #closeNow()}, i.e. all methods of the data appending and indexing lifecycle except {@link #drop}, must be called from a single thread.

Maybe we should move the appenderator.closeNow() calls to an interrupt handler block in the main runTask methods and let interrupts through the resource closer trigger that.

Some of the appenderators are created in try-with-resources statements, but AppenderatorImpl close() checks if it's already closed, so calling close() after closeNow() should be fine.

Good point. Will update the pr soon.

Took Closeable out of Appenderator. Callers of Appenderator should explicitly call close() or closeNow().
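
The interrupt-handler idea discussed in this thread can be illustrated with a minimal sketch. It is not the code from this PR: `FakeAppenderator`, `runAndStop()` and the sleep that stands in for indexing work are hypothetical names invented for illustration. The point it shows is that the stopping thread only interrupts the run thread, and the run thread itself calls `closeNow()`, so the single-thread contract quoted above is respected.

```java
import java.util.concurrent.CountDownLatch;

public class SingleThreadCloseSketch {
  /** Hypothetical stand-in for a resource that must only be used from one thread. */
  static final class FakeAppenderator {
    private boolean closed = false;   // only touched by the run thread
    private Thread closedBy = null;

    void closeNow() {
      if (closed) {
        return;                       // closing an already closed resource is a no-op
      }
      closed = true;
      closedBy = Thread.currentThread();
    }
  }

  /** Starts a run thread, stops it by interrupt, and reports whether closeNow() ran on the run thread. */
  public static boolean runAndStop() {
    FakeAppenderator appenderator = new FakeAppenderator();
    CountDownLatch started = new CountDownLatch(1);

    Thread runThread = new Thread(() -> {
      try {
        started.countDown();
        Thread.sleep(60_000);         // stands in for the indexing work
      } catch (InterruptedException e) {
        appenderator.closeNow();      // abnormal exit is handled on the owning thread
      }
    });

    try {
      runThread.start();
      started.await();
      runThread.interrupt();          // the stopping side only interrupts, it never touches the resource
      runThread.join();               // join() makes the run thread's writes visible here
    } catch (InterruptedException e) {
      throw new IllegalStateException(e);
    }
    return appenderator.closedBy == runThread;
  }

  public static void main(String[] args) {
    System.out.println("closed on run thread: " + runAndStop());
  }
}
```

A real task would also need to close the resource on the normal exit path; the sketch covers only the abnormal exit.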

leventov · 2019-08-05T17:59:15Z

+  private List<IndexTask> indexTaskSpecs;
+
+  @Nullable
+  private volatile IndexTask currentRunningTaskSpec = null;


https://github.com/code-review-checklists/java-concurrency#justify-volatile

Added javadoc.

The added Javadoc doesn't answer the question "Why the semantics of volatile field reads and writes (as defined in the Java Memory Model) are required for the field?" from the referenced checklist item.

It's actually likely that just adding volatile to the field is either not enough to achieve certain concurrency goal (which is also not clarified in the Javadoc), or not needed.

"by an HTTP thread when {@link #stopGracefully} is called." - please point to a specific class and method where does this happen, because it's not obvious.

The added Javadoc doesn't answer the question "Why the semantics of volatile field reads and writes (as defined in the Java Memory Model) are required for the field?" from the referenced checklist item.

I don't understand your comment. Do you want me to use the terms defined in the JMM?

It's actually likely that just adding volatile to the field is either not enough to achieve certain concurrency goal (which is also not clarified in the Javadoc), or not needed.

I don't think this kind of comment helps. So do you think volatile here is necessary or not?

Do you want me to use the terms defined in the JMM?

Yes, I think the term "happens-before" should usually appear in this type of explanatory comment.

So do you think volatile here is necessary or not?

I don't know because I wasn't able to figure out the place you referred to in "by an HTTP thread when {@link #stopGracefully} is called."

So if stopGracefully() may be called at any time concurrently with run(), I think the concurrency property worth ensuring is that the next task is never started once stopGracefully() has begun stopping the previous one. To achieve this, volatile is not enough. This property could be achieved the following way:

currentRunningTaskSpec is AtomicReference<Object>

The stopping callback:

Object currentRunningTask = currentRunningTaskSpec.getAndSet(SPECIAL_VALUE_STOPPED);
if (currentRunningTask != null) {
  ((IndexTask) currentRunningTask).stopGracefully(config);
}

The code in the loop in runTask() does something like

Object prevSpec = currentRunningTaskSpec.get();
if (prevSpec == SPECIAL_VALUE_STOPPED || !currentRunningTaskSpec.compareAndSet(prevSpec, eachSpec)) {
  log.info("Stopped concurrently");
  ...
}
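
For readers who want to try the construction, here is a self-contained sketch of the same idea. It is an illustration under assumptions, not the code that was merged: `SubTask`, `SequentialRunnerSketch`, `runAll()`, `stop()` and `demo()` are hypothetical names, and the `instanceof` check in `stop()` is an addition that keeps a second stop call from casting the sentinel.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicReference;

public class SequentialRunnerSketch {
  /** Hypothetical stand-in for IndexTask. */
  public interface SubTask {
    void run();
    void stopGracefully();
  }

  private static final Object STOPPED = new Object();   // sentinel, plays the role of SPECIAL_VALUE_STOPPED
  private final AtomicReference<Object> current = new AtomicReference<>();

  /** Runs the sub-tasks one by one and returns how many were started. */
  public int runAll(List<SubTask> specs) {
    int started = 0;
    for (SubTask each : specs) {
      Object prev = current.get();
      // Either a stop was already observed, or it raced with us and won the CAS.
      if (prev == STOPPED || !current.compareAndSet(prev, each)) {
        break;
      }
      started++;
      each.run();
    }
    return started;
  }

  /** The stopping callback: marks the runner as stopped and stops whatever was running. */
  public void stop() {
    Object prev = current.getAndSet(STOPPED);
    if (prev instanceof SubTask) {    // false for null and for the sentinel itself
      ((SubTask) prev).stopGracefully();
    }
  }

  /** Simulates a kill that arrives while the first sub-task is running. */
  public static String demo() {
    SequentialRunnerSketch runner = new SequentialRunnerSketch();
    boolean[] firstStopped = {false};
    boolean[] secondRan = {false};
    SubTask first = new SubTask() {
      @Override public void run() { runner.stop(); }
      @Override public void stopGracefully() { firstStopped[0] = true; }
    };
    SubTask second = new SubTask() {
      @Override public void run() { secondRan[0] = true; }
      @Override public void stopGracefully() { }
    };
    int started = runner.runAll(Arrays.asList(first, second));
    return "started=" + started + " firstStopped=" + firstStopped[0] + " secondRan=" + secondRan[0];
  }

  public static void main(String[] args) {
    System.out.println(demo());
  }
}
```

The demo drives the stop from inside the first sub-task so that the interleaving is deterministic; with a real second thread the same two atomic operations decide which side wins.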

That construction definitely looks better!

Side note, I don't think this applies here, but in general do you think it's a valid explanation to say: "this field is accessed by multiple threads, and I haven't conclusively proven that volatile is not necessary, so I've included it"?

Probably a lot of the volatiles in the codebase today are like this. I don't necessarily see a big problem with it. Even though some of them could probably be safely removed, the 'just-in-case' volatiles still help lower the cognitive overhead of documenting and dealing with the codebase. (Otherwise, you'd need to think through these arguments again every time you make a change.)

To me, the main problem is that volatile is often used to delude oneself or reviewers that some code is concurrently safe without working through the threading and concurrent control flow model of the code, the desired properties of the code, and the proof that the code has those properties. If "volatile = probably worse perf, but no concurrency defects" was the case, that wouldn't be too bad, indeed. Unfortunately, in reality, what we have is "volatile = high probability that there are some concurrency defects around the code".

This is why https://github.com/code-review-checklists/java-concurrency#justify-volatile demands to justify volatile in JMM terms rather than in vague phrases.

Thanks, that's a good point. Fixed as suggested.

Unfortunately, in reality, what we have is "volatile = high probability that there are some concurrency defects around the code".

Interesting point. I wonder if we can nail down what usages of volatile are especially likely to be buggy. My intuitions would be:

Not very risky: 'one-way' volatiles that are used to ensure safe publishing out of an 'owner' thread, but where the 'non-owner' threads aren't intended to call any mutation methods on the published object. This kind of pattern is sometimes used for publishing objects that monitoring or status-checking code will periodically inspect.

Not very risky: volatiles used for deferred initialization (updated once from null -> nonnull, but not updated ever again).

More risky: designs where a reference is updated multiple times by an 'owner' thread, and 'non-owner' threads may call some mutation methods on the object.

Even more risky: designs where there is no clear 'owner' thread.

(The original example of currentRunningTaskSpec would have seemed risky by the above intuitions: there is a clear owner thread, but the reference is updated more than once, and non-owners are going to be calling mutation methods)
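
As an illustration of the first "not very risky" category above, here is a minimal sketch of one-way publication. All names (`OneWayPublicationSketch`, `Progress`, `publish()`, `snapshot()`) are hypothetical and not taken from Druid. The owner thread replaces an immutable snapshot through a volatile field, and readers only inspect it, so the volatile write happens-before any read that observes the new reference.

```java
public class OneWayPublicationSketch {
  /** Immutable snapshot: readers cannot mutate what the owner published. */
  public static final class Progress {
    public final int done;
    public final int total;

    Progress(int done, int total) {
      this.done = done;
      this.total = total;
    }
  }

  // Written only by the owner thread, read by any monitoring thread.
  private volatile Progress latest = new Progress(0, 0);

  /** Called by the owner thread only. */
  public void publish(int done, int total) {
    latest = new Progress(done, total);
  }

  /** Safe to call from any thread; the returned object never changes. */
  public Progress snapshot() {
    return latest;
  }

  /** Publishes once, then reads the snapshot from a separate monitoring thread. */
  public static int demo() {
    OneWayPublicationSketch tracker = new OneWayPublicationSketch();
    tracker.publish(3, 10);
    int[] seen = new int[1];
    // Note: Thread.start() alone already orders the write above before this read;
    // the volatile matters for long-lived readers that poll while the owner keeps publishing.
    Thread monitor = new Thread(() -> seen[0] = tracker.snapshot().done);
    monitor.start();
    try {
      monitor.join();
    } catch (InterruptedException e) {
      throw new IllegalStateException(e);
    }
    return seen[0];
  }

  public static void main(String[] args) {
    System.out.println("monitor saw done=" + demo());
  }
}
```

The riskier categories start where readers are allowed to call mutating methods on the published object, or where the reference update has to be coordinated with another action.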

jon-wei · 2019-08-07T01:17:39Z

   *
   * @param taskConfig TaskConfig for this task
+   *
+   * @see org.apache.druid.indexing.worker.http.WorkerResource#doShutdown(String)


Could mention that this shutdown API will cause stopGracefully to be called either through lifecycle stop (ForkingTaskRunner on MM -> process shutdown -> SingleTaskBackgroundRunner on Peon) or directly from an HTTP thread (ThreadingTaskRunner for Indexer).

Added some description about it.

leventov · 2019-08-08T18:57:11Z

@@ -162,15 +162,21 @@ default int getPriority()



Could you please add an overview section to Task's class-level Javadoc, describing which methods of this interface could be called concurrently with which other methods and which could not? In other words, provide overview documentation of concurrent access for this interface.

Raised #8271.

leventov · 2019-08-08T18:59:20Z

-   * Its implementations should handle potential concurreny issues properly.
+   * terminated with extreme prejudice.
+   *
+   * This method can be called at any time while {@link #run} is being called when the task is killed.


Could you specify what should or should not happen if stopGracefully() is called concurrently with (e.g. "before") run()? Does it guarantee that the task won't even start?

Good point. I added the below to javadoc.

This method can be called at any time while run() is being called when the task is killed. If this task is not started yet, that is run() is not called yet, this method will be never called. Once this task is started, this method can be called even after run() returns. Implementations of this method may want to avoid unnecessary work if run() already returned.

leventov · 2019-08-09T16:27:52Z

+   * is not started yet, that is {@link #run} is not called yet, this method will be never called.
+   * Once this task is started, this method can be called even after {@link #run} returns. Implementations of this
+   * method may want to avoid unnecessary work if {@link #run} already returned.
   * Depending on the task executor type, one of the two cases below can happen when the task is killed.


It would be clearer if there was an empty line above this line and no empty line below this line.

leventov · 2019-08-09T16:32:30Z

   *
-   * This method can be called at any time while {@link #run} is being called when the task is killed.
+   * This method can be called at any time while {@link #run} is being called when the task is killed. If this task
+   * is not started yet, that is {@link #run} is not called yet, this method will be never called.


This text sort of doesn't make sense, because the words like "yet", "before", "after" are not definable in terms of JMM which doesn't deal with time. Yet, in practical terms, currently, stopGracefully() can be called "before" run() in CompactionTask on one of the eachSpecs. So this contract is violated.

Good point. I think the contract now should say "this method can be called at any time no matter when run() is executed".

This text sort of doesn't make sense, because the words like "yet", "before", "after" are not definable in terms of JMM which doesn't deal with time.

I don't know what it should be to address your comment. What is your suggestion?

I think the contract now should say "this method can be called at any time no matter when run() is executed".

I agree.

I don't know what it should be to address your comment. What is your suggestion?

I would just put the phrase "Regardless of when stopGracefully() is called w.r.t. run(), the implementation must not allow a resource leak or lingering executions (local or remote)."

Thanks, added.

Missed the first sentence: "this method can be called at any time no matter when run() is executed".

Other than that, I don't have more comments about this PR. I didn't review it in full but don't block it.

Thanks, added.
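
One way to satisfy the contract agreed on in this thread is sketched below. It is an assumption-laden illustration, not the TaskResourceCleaner added by this PR: `StopSafeCleanerSketch` and `registerCloser()` are hypothetical names. Registration and stopping share one lock, so a closer registered after the stop runs immediately, and every closer runs at most once regardless of ordering.

```java
import java.util.ArrayList;
import java.util.List;

public class StopSafeCleanerSketch {
  private final Object lock = new Object();
  private final List<Runnable> closers = new ArrayList<>();
  private boolean stopped = false;    // guarded by lock

  /**
   * Called by the run thread when it acquires a resource.
   * Returns false if the task was already stopped; the closer is then run right away
   * so that the freshly acquired resource does not leak.
   */
  public boolean registerCloser(Runnable closer) {
    synchronized (lock) {
      if (!stopped) {
        closers.add(closer);
        return true;
      }
    }
    closer.run();
    return false;
  }

  /** May be called at any time relative to run(), and any number of times. */
  public void stopGracefully() {
    List<Runnable> toRun;
    synchronized (lock) {
      if (stopped) {
        return;                       // later calls have nothing left to do
      }
      stopped = true;
      toRun = new ArrayList<>(closers);
      closers.clear();
    }
    // Run closers outside the lock; each must be safe to call from the stopping thread.
    for (Runnable closer : toRun) {
      closer.run();
    }
  }

  public static void main(String[] args) {
    StopSafeCleanerSketch cleaner = new StopSafeCleanerSketch();
    cleaner.stopGracefully();         // stop arrives "before" any resource exists
    boolean registered = cleaner.registerCloser(() -> System.out.println("closed immediately"));
    System.out.println("registered: " + registered);
  }
}
```

A run loop built on this would check the return value of `registerCloser()` and stop early when it is false.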

jon-wei

LGTM (one more minor comment)

jon-wei · 2019-08-12T20:51:06Z

  }

+  /**
+   * Run this task. Before running the task, ithecks the the current task is already stopped and


itchecks -> it checks

Thanks, fixed.

jihoonson added 3 commits August 3, 2019 13:04

Add TaskResourceCleaner; fix a couple of concurrency bugs in batch tasks

5a98275

kill runner when it's ready

77e2128

add comment

31faec2

jihoonson added Bug Area - Batch Ingestion labels Aug 3, 2019

jihoonson added 2 commits August 3, 2019 14:36

kill run thread

13fd909

fix test

befedf6

jon-wei reviewed Aug 4, 2019

leventov reviewed Aug 5, 2019

jihoonson added 5 commits August 5, 2019 16:07

Take closeable out of Appenderator

20a502d

add javadoc

6dbbf71

fix test

4404c2e

fix test

c13ebeb

update javadoc

a89b3fc

jon-wei reviewed Aug 7, 2019

jihoonson added 2 commits August 6, 2019 20:43

add javadoc about killed task

76f75fa

address comment

e430747

jihoonson mentioned this pull request Aug 8, 2019

Add support for parallel native indexing with shuffle for perfect rollup #8257

Merged

6 tasks

leventov reviewed Aug 8, 2019

jihoonson added 2 commits August 8, 2019 13:40

handling missing exceptions

cff9465

more clear javadoc for stopGracefully

2e19f5b

jihoonson mentioned this pull request Aug 8, 2019

Add an overview concurrent access documentation for Task class #8271

Open

leventov requested changes Aug 9, 2019

jihoonson added 2 commits August 12, 2019 13:21

update javadoc

6752994

Add missing statement in javadoc

97c0f12

jon-wei approved these changes Aug 12, 2019

typo

d3db7a6

jon-wei merged commit 312cdc2 into apache:master Aug 13, 2019

clintropolis added this to the 0.16.0 milestone Aug 23, 2019
