Change early publishing to early pushing in indexTask & refactor AppenderatorDriver #5297
jihoonson merged 10 commits into apache:master
Conversation
I think this bug is critical and this PR should be included in 0.12.0.
@jihoonson will new segments have
@himanshug, let me suppose a use case of kafka indexing service + compaction. The segments created by kafkaIndexTask will have the This can happen even for batch ingestion. For example, if some more data is appended to an existing dataSource, the partition id (
My understanding is that, if compaction produces a brand new partition set with a new version AND otherwise, yes ... partition set would be assumed complete with any number of segments available. However, IndexTask is more general than
Ah, I got your point. The new segments created by a compactionTask will have the However, if
 * commit metadata for this persist
 */
ListenableFuture<Object> persist(Collection<SegmentIdentifier> identifiers, Committer committer);
ListenableFuture<Object> persist(Collection<SegmentIdentifier> identifiers, @Nullable Committer committer);
Why would persist be called with a null committer when it would be a noop? Shouldn't the persist call throw an IAE when the committer is null?
Its behavior is different. With the noop committer, Appenderator commits intermediate state, which involves writing data to local disk. If the committer is null, Appenderator skips committing intermediate state and writes nothing. Since IndexTask is not restorable, committing intermediate state is unnecessary.
I mean IndexTask should never be calling persist, and allowIncrementalPersists should be set to false on all add(..) calls (which then shouldn't/wouldn't call persist(..)) from IndexTask.
IndexTask should be able to persist the intermediate data ingested so far to avoid the out-of-disk problem. I guess you were confused because of the wrong javadoc. See #5297 (comment).
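To make the distinction discussed above concrete, here is a minimal sketch of the two persist flavors, assuming the persist signature shown in the diff and the io.druid package layout of this era; the wrapper class and method names below are illustrative, not part of the patch.

```java
import io.druid.data.input.Committer;
import io.druid.segment.realtime.appenderator.Appenderator;
import io.druid.segment.realtime.appenderator.SegmentIdentifier;

import java.util.Collection;

class PersistSketch
{
  // Restorable streaming ingestion: persist the rows ingested so far AND the commit
  // metadata supplied by the committer, so the task can resume from that point later.
  static void persistDataAndMetadata(
      Appenderator appenderator,
      Collection<SegmentIdentifier> identifiers,
      Committer committer
  )
  {
    appenderator.persist(identifiers, committer);
  }

  // Non-restorable batch ingestion (IndexTask): persist the rows ingested so far to free
  // in-memory state, but pass a null committer so no commit metadata is written.
  // A noop committer would still trigger the intermediate-commit bookkeeping; null skips it.
  static void persistDataOnly(Appenderator appenderator, Collection<SegmentIdentifier> identifiers)
  {
    appenderator.persist(identifiers, null);
  }
}
```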
 * dropped from local storage</li>
 * </ul>
 */
public class InfiniteAppenderatorDriver extends AppenderatorDriver
How about calling this StreamAppenderatorDriver instead?
 * <li>PUBLISHED: Segment's metadata is published to metastore.</li>
 * </ul>
 */
public class FiniteAppenderatorDriver extends AppenderatorDriver
How about calling this BatchAppenderatorDriver instead?
That was the one I considered at first. :) Changed.
 * you pass in. It's wrapped in some extra metadata needed by the driver.
 */
public class AppenderatorDriver implements Closeable
public abstract class AppenderatorDriver implements Closeable
private final SegmentIdentifier segmentIdentifier;
private SegmentState state;
@Nullable private DataSegment dataSegment;
Could you include some comments about what these are used for?
It looks like you added dataSegment in this patch, and it's used to remember what DataSegment object was created after a push. I think it's worth a comment.
Added comments for this variable and the pushAndDrop() method.
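For reference, a sketch of what such comments could look like on the fields quoted above; the class name, comment wording, and the minimal enum are mine (based on this thread), not the exact code added in the patch.

```java
import io.druid.segment.realtime.appenderator.SegmentIdentifier;
import io.druid.timeline.DataSegment;

import javax.annotation.Nullable;

class SegmentWithStateSketch
{
  enum SegmentState { APPENDING, APPEND_FINISHED, PUBLISHED }

  // Identifier (dataSource, interval, version, shard spec) of the segment being handled.
  private final SegmentIdentifier segmentIdentifier;

  // Current lifecycle state of the segment (e.g. APPENDING while rows are still being added).
  private SegmentState state;

  // Remembers the DataSegment object created when this segment was pushed, so it can be
  // published later without pushing again. Null until the segment has been pushed.
  @Nullable
  private DataSegment dataSegment;

  SegmentWithStateSketch(SegmentIdentifier segmentIdentifier, SegmentState state)
  {
    this.segmentIdentifier = segmentIdentifier;
    this.state = state;
    this.dataSegment = null;
  }
}
```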
protected final Appenderator appenderator;
// sequenceName -> segmentsForSequence
// This map should be locked with itself before accessing it.
// Note: FiniteAppenderatorDriver currently doesn't need to lock this map because it doesn't do anything concurrently.
// However, it's desired to do some operations like indexing and pushing at the same time. Locking this map is also
    identifier.getInterval().getStartMillis(),
    k -> new LinkedList<>()
);
if (segmentWithState.getState() == SegmentState.APPENDING) {
Please keep the original comment from this moved code.
// always keep APPENDING segments for an interval start millis in the front
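For context, the moved code in question does roughly the following (a sketch of the behavior that comment documents; the nested types below are simplified stand-ins for the patch's types, just to keep the fragment compilable):

```java
import java.util.LinkedList;
import java.util.NavigableMap;
import java.util.TreeMap;

class SegmentOrderingSketch
{
  enum SegmentState { APPENDING, APPEND_FINISHED }

  static class SegmentWithState
  {
    final long intervalStartMillis;
    final SegmentState state;

    SegmentWithState(long intervalStartMillis, SegmentState state)
    {
      this.intervalStartMillis = intervalStartMillis;
      this.state = state;
    }
  }

  private final NavigableMap<Long, LinkedList<SegmentWithState>> intervalToSegmentStates = new TreeMap<>();

  void add(SegmentWithState segmentWithState)
  {
    final LinkedList<SegmentWithState> segmentStates = intervalToSegmentStates.computeIfAbsent(
        segmentWithState.intervalStartMillis,
        k -> new LinkedList<>()
    );
    // always keep APPENDING segments for an interval start millis in the front
    if (segmentWithState.state == SegmentState.APPENDING) {
      segmentStates.addFirst(segmentWithState);
    } else {
      segmentStates.addLast(segmentWithState);
    }
  }
}
```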
 */
ListenableFuture<SegmentsAndMetadata> dropInBackground(SegmentsAndMetadata segmentsAndMetadata)
{
  log.info("dropping segments[%s]", segmentsAndMetadata.getSegments());
Please format the log message a bit nicer (capitalization)
Merged recent changes from master.
{
  final Object metadata = appenderator.startJob();
  if (metadata != null) {
    throw new ISE("Metadata should be null because batch ingestion doesn't support committing intermediate states");
Can we rephrase to "Metadata should be null because FiniteAppenderatorDriver never persists it."?
.map(segmentIdentifier -> SegmentWithState.newSegment(
    segmentIdentifier,
    AppenderatorDriver.SegmentState.INACTIVE
    SegmentState.APPEND_FINISHED
Are changes to SegmentState backward compatible?
It appears that the possible values in SegmentState have changed with this patch. How will this work against the metadata persisted by previous versions of the code?
E.g., someone upgrades Druid, which stops all peons on a middle manager and restarts them with new code that will then see older values of SegmentState like ACTIVE, INACTIVE, etc.?
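One general way to keep such a rename backward compatible with previously persisted metadata, shown here purely as an illustration and not necessarily what this patch does, is to accept the legacy names during deserialization (the ACTIVE-to-APPENDING mapping below is an assumption; only the INACTIVE-to-APPEND_FINISHED rename is visible in the diff):

```java
import com.fasterxml.jackson.annotation.JsonCreator;

enum SegmentState
{
  APPENDING,
  APPEND_FINISHED,
  PUBLISHED;

  @JsonCreator
  public static SegmentState fromString(String name)
  {
    switch (name) {
      case "ACTIVE":      // legacy value possibly persisted by older versions (assumed mapping)
        return APPENDING;
      case "INACTIVE":    // legacy value renamed to APPEND_FINISHED in this patch
        return APPEND_FINISHED;
      default:
        return valueOf(name);
    }
  }
}
```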
 * <p>
 * The add, clear, persist, persistAll, and push methods should all be called from the same thread to keep the
 * metadata committed by Committer in sync.
 * If committer is not provided, any data are NOT persisted. If it's provided, the add, clear, persist, persistAll,
So, allowIncrementalPersists is ignored if committer is null?
Shouldn't it be the other way around: if allowIncrementalPersists is set to false then committer is ignored, but if allowIncrementalPersists is true then a non-null committer must be provided?
Also, "no data is persisted" might be clearer than "any data are NOT persisted".
Sorry, the Javadoc was wrong. It should be "no metadata is persisted". Fixed it now.
Committer is about persisting intermediate metadata, while allowIncrementalPersists is about persisting the data ingested so far. So, as you said, if allowIncrementalPersists is set to false then committer is ignored. But if allowIncrementalPersists is true, committer can still be null to avoid persisting metadata.
Ah, that makes sense now.
So, a null committer simply means commit metadata is not persisted.
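Summarizing the semantics settled on above as a sketch; the class and helper methods below are hypothetical placeholders, not the actual Appenderator/AppenderatorDriver API.

```java
import io.druid.data.input.Committer;

import javax.annotation.Nullable;

class IncrementalPersistSemanticsSketch
{
  void onAdd(boolean allowIncrementalPersists, @Nullable Committer committer)
  {
    if (!allowIncrementalPersists) {
      // No incremental persist happens at all, so the committer is effectively ignored.
      return;
    }
    if (committer == null) {
      // Rows ingested so far may still be persisted to disk, but no commit metadata is
      // written. This is what the non-restorable IndexTask wants.
      persistRowsOnly();
    } else {
      // Rows and the committer's metadata are persisted together, keeping them in sync.
      persistRowsAndMetadata(committer);
    }
  }

  private void persistRowsOnly()
  {
    // hypothetical placeholder
  }

  private void persistRowsAndMetadata(Committer committer)
  {
    // hypothetical placeholder
  }
}
```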
if (persistExecutor != null) {
  persistExecutor.shutdownNow();
  Preconditions.checkState(
      persistExecutor.awaitTermination(365, TimeUnit.DAYS),
We have two methods, close() and closeNow(), calling it. In closeNow() we want to finish ASAP and shouldn't wait, so you might want to add a flag to shutdownExecutors() indicating whether to wait in the close() flow.
Also, should we call xxx.shutdownNow() on all executors and then wait, so that all of them are stopping in parallel?
Also, the timeout here is too large. If something went wrong, some thread could be stuck here potentially indefinitely. I would wait for maybe 5 minutes and then print an error and move on.
This awaitTermination existed before. I changed nothing but just moved it into shutdownExecutors() because it is called in both close() and closeNow().

> Also, should we call xxx.shutdownNow() on all executors and then wait, so that all of them are stopping in parallel?

Maybe it's useful if it usually takes a long time. I didn't measure how long it takes.

> Also, the timeout here is too large. If something went wrong, some thread could be stuck here potentially indefinitely. I would wait for maybe 5 minutes and then print an error and move on.

It looks like this code has been there from the beginning. @gianm, any thoughts?
> Also, the timeout here is too large. If something went wrong, some thread could be stuck here potentially indefinitely. I would wait for maybe 5 minutes and then print an error and move on.
>
> It looks like this code has been there from the beginning. @gianm, any thoughts?

With regard to the huge timeout, when I wrote that I was expecting that some other system would kill tasks that are taking too long to shut down. But I guess it would be fine to print an error after a few minutes and give up.

> Also, should we call xxx.shutdownNow() on all executors and then wait, so that all of them are stopping in parallel?
>
> Maybe it's useful if it usually takes a long time. I didn't measure how long it takes.

Sure, why not call shutdownNow first and then wait. I don't think it should matter too much.
@jihoonson from the patch it looks like xxx.awaitTermination(..) was added here (also, I can't see it in current master: https://github.com/druid-io/druid/blob/0105cdbc19828009d21de57a30ba55794a518d30/server/src/main/java/io/druid/segment/realtime/appenderator/AppenderatorImpl.java#L840). What am I missing?
It is called from AppenderatorImpl.closeNow(), which should finish ASAP ... hence the suggestion to add a flag on whether to await or not, so that closeNow() wouldn't await while close() would.
It is called outside of shutdownExecutors(), but inside of close() (https://github.com/druid-io/druid/blob/0105cdbc19828009d21de57a30ba55794a518d30/server/src/main/java/io/druid/segment/realtime/appenderator/AppenderatorImpl.java#L698) and closeNow() (https://github.com/druid-io/druid/blob/0105cdbc19828009d21de57a30ba55794a518d30/server/src/main/java/io/druid/segment/realtime/appenderator/AppenderatorImpl.java#L749).
BTW, at first I moved this awaiting-termination code into shutdownExecutors(), because closeNow() doesn't wait for pushExecutor to be terminated in the original code. But that looks to be intended, per the javadoc:

> Do not unlock base persist dir as we are not waiting for push executor to shut down relying on current JVM to shutdown to not cause any locking problem if the task is restored.

I reverted this change and added a comment to avoid confusion.
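For reference, a minimal sketch of the earlier suggestions (a wait flag, requesting shutdown on all executors before waiting, and a bounded timeout); the class, field, and method names are illustrative and this is not the actual AppenderatorImpl code.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.TimeUnit;

class ExecutorShutdownSketch
{
  private final ExecutorService persistExecutor;
  private final ExecutorService pushExecutor;

  ExecutorShutdownSketch(ExecutorService persistExecutor, ExecutorService pushExecutor)
  {
    this.persistExecutor = persistExecutor;
    this.pushExecutor = pushExecutor;
  }

  // close() would call shutdownExecutors(true); closeNow() would call shutdownExecutors(false).
  void shutdownExecutors(boolean wait) throws InterruptedException
  {
    final List<ExecutorService> executors = Arrays.asList(persistExecutor, pushExecutor);

    // Request shutdown on every executor first so they all wind down in parallel.
    executors.forEach(ExecutorService::shutdownNow);

    if (wait) {
      for (ExecutorService executor : executors) {
        // Bounded timeout instead of 365 days: report the problem and move on rather
        // than blocking a stuck task indefinitely.
        if (!executor.awaitTermination(5, TimeUnit.MINUTES)) {
          System.err.println("Executor did not terminate within 5 minutes; giving up.");
        }
      }
    }
  }
}
```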
if (commitFile.exists()) {
  final Committed oldCommitted = objectMapper.readValue(commitFile, Committed.class);
  objectMapper.writeValue(commitFile, oldCommitted.without(identifier.getIdentifierAsString()));
  final Committed oldCommit = readCommit();
Did anything change here?
Nit: FWIW, I find the older code sufficiently readable... also, computeCommitFile() was called once (not that I care about that so much).
At first, I added readCommit() and writeCommit() to use in other places as well, but during refactoring I realized those methods are not needed. Do you want to revert this change? I think they are not so bad anyway.
No strong opinion here... I wasn't sure why the change was made.
{
  return () -> wrapCommitter(committerSupplier.get());
}
}
Not sure what changed here (is this a simple copy/paste from AppenderatorDriver?)... so I haven't gone through it. Please let me know if there are specific parts here that changed and should be reviewed.
Nothing was changed. I guess git recognizes it as changed because the lines moved.
Alright, I'm not going through this class then.
/**
 * Move a set of identifiers out from "active", making way for newer segments.
 */
public void moveSegmentOut(final String sequenceName, final List<SegmentIdentifier> identifiers)
Can you add a comment here saying this method only exists to support KafkaIndexTask.runLegacy(..) and should be removed along with it?
@jihoonson where/how have you tested the changes? Has this patch been verified on a cluster running the Kafka Indexing Service, with killing/restarting of a few Middle Managers?
@himanshug I've tested this PR on our cluster, which is running multiple Kafka ingestions and some batch ingestions. It has been working well so far. I haven't tested killing/restarting MMs yet. I'll keep you updated once the test is done.
@himanshug I tested the backward compatibility by stopping/restarting MMs and checking that the running tasks were restored properly. It looks to be working well.
gianm left a comment:
LGTM, thanks @jihoonson. Please fix the conflict.
@gianm thanks, fixed.
Restarted the failed Travis component.
This PR fixes a bug in indexTask. In #4238, indexTask was improved to support incrementally publishing segments, which means segments are published one by one during batch ingestion. This makes sense for normal ingestion. However, when it comes to reindexing, it can cause a problem where an early-published segment overshadows all old segments even before reindexing completes.
So, in this PR, indexTask is changed to push segments early instead. All pushed segments are published together at the end of the indexTask.
To do so, I also refactored AppenderatorDriver and added two child classes, FiniteAppenderatorDriver and InfiniteAppenderatorDriver, which are specialized for batch and realtime indexing, respectively. This is needed because the segment lifecycle is different in batch and realtime indexing. In batch indexing, the lifecycle of a segment is APPENDING -> PUSHED -> DROPPED -> PUBLISHED, while in realtime indexing it's APPENDING -> PUSHED -> PUBLISHED -> DROPPED. To reduce complexity, only some fundamental methods remain in AppenderatorDriver. All specialized methods are moved to either FiniteAppenderatorDriver or InfiniteAppenderatorDriver. This change is
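As a quick reference, the two lifecycle orderings described above can be summarized like this; the state names follow the description, and the comments are a summary of this PR's text rather than code from the patch.

```java
enum SegmentLifecycle
{
  APPENDING,  // rows are still being added to the segment
  PUSHED,     // segment files have been pushed to deep storage
  DROPPED,    // the segment's local data has been removed from the task
  PUBLISHED;  // the segment's metadata has been committed to the metadata store

  // Batch (FiniteAppenderatorDriver):      APPENDING -> PUSHED -> DROPPED -> PUBLISHED
  //   Segments are pushed early, but nothing is published until the end of the task,
  //   so a partially reindexed dataSource never overshadows the old segments.
  // Realtime (InfiniteAppenderatorDriver): APPENDING -> PUSHED -> PUBLISHED -> DROPPED
  //   Segments are published as soon as they are pushed, and dropped from local storage afterwards.
}
```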