Fix three bugs with segment publishing. #6155
Conversation
1. In AppenderatorImpl: always use a unique path if requested, even if the segment was already pushed. This is important because, if we don't, it causes the issue mentioned in apache#6124 (KafkaIndexTask can delete published segments on restart).
2. In IndexerSQLMetadataStorageCoordinator: fix a bug that could cause it to return a "not published" result, instead of throwing an exception, when one metadata update failure was followed by some other exception. This is done by resetting the AtomicBoolean that tracks which case we're in, each time the callback runs.
3. In BaseAppenderatorDriver: only kill segments if we get an affirmative false publish result; skip killing if we just got some exception. The reason is that we want to avoid killing segments that are in an unknown state.

Two other changes clarify the contracts a bit and hopefully prevent future bugs:

1. Return SegmentPublishResult from TransactionalSegmentPublisher, to make it more similar to announceHistoricalSegments.
2. Make it explicit, at multiple levels of javadocs, that a "false" publish result must mean the publish _definitely_ did not happen; unknown states must be exceptions. This helps BaseAppenderatorDriver do the right thing.
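The kill decision in point 3 can be sketched as follows. This is a minimal, self-contained illustration, not BaseAppenderatorDriver's actual code: publish and killSegments here are hypothetical stand-ins for the driver's real collaborators.

```java
import java.util.List;

// Sketch of point 3: cleanup runs only on an affirmative "false" publish
// result; an exception leaves the segments alone because their state is
// unknown. All names below are illustrative, not Druid's real API.
class DriverSketch {
    int killed = 0; // exposed so the behavior is easy to observe

    boolean publishAndCleanup(List<String> segments) {
        final boolean published;
        try {
            published = publish(segments);
        } catch (RuntimeException e) {
            // Publish state is unknown: do NOT kill the segments, since
            // they may actually have been published. Just propagate.
            throw e;
        }
        if (!published) {
            // Affirmative false: the publish definitely did not happen,
            // so the pushed segment files are safe to delete.
            killSegments(segments);
        }
        return published;
    }

    boolean publish(List<String> segments) {
        return false; // stand-in: pretend the metadata compare-and-swap failed cleanly
    }

    void killSegments(List<String> segments) {
        killed += segments.size();
    }
}
```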
👍

Please check this:
    final TransactionStatus transactionStatus
    ) throws Exception
    {
      definitelyNotUpdated.set(false);
Would you add a comment that this overwrites definitelyNotUpdated on retrying?
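For context, here is a minimal sketch of the reset-on-retry pattern under discussion. The retry loop and tryMetadataUpdate are simplified assumptions standing in for the real transactional callback in IndexerSQLMetadataStorageCoordinator; only the AtomicBoolean handling mirrors the change.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Simplified sketch of point 2: reset the flag each time the callback runs.
class RetrySketch {
    private final AtomicBoolean definitelyNotUpdated = new AtomicBoolean(false);

    boolean publishWithRetries(int maxAttempts) {
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            // Overwrites definitelyNotUpdated on each retry: a stale "true"
            // from an earlier attempt must not make a later, unrelated
            // exception look like a clean "not published" result.
            definitelyNotUpdated.set(false);
            try {
                if (tryMetadataUpdate()) {
                    return true;
                }
                // Affirmative compare-and-swap failure on this attempt.
                definitelyNotUpdated.set(true);
            } catch (RuntimeException e) {
                // Unknown state on this attempt; loop and retry.
            }
        }
        if (definitelyNotUpdated.get()) {
            return false; // the publish definitely did not happen
        }
        throw new IllegalStateException("Publish state unknown after retries");
    }

    boolean tryMetadataUpdate() {
        return false; // stand-in: always loses the compare-and-swap
    }
}
```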
      log.info("Our segments really do exist, awaiting handoff.");
    } else {
    -  throw new ISE("Failed to publish segments[%s]", segmentsAndMetadata.getSegments());
    +  throw new ISE("Failed to publish segments.");
Is this change to avoid overly large logs? I feel this log sometimes helps.
I was thinking it's not necessary, since they will get logged in the catch statement via:

    log.warn(e, "Failed publish, not removing segments: %s", segmentsAndMetadata.getSegments());

    -if (txnFailure.get()) {
    -  return new SegmentPublishResult(ImmutableSet.of(), false);
    +if (definitelyNotUpdated.get()) {
    +  return SegmentPublishResult.fail();
What do you think about adding an exception to SegmentPublishResult on failure, so that callers can figure out why it failed?
I think it's not necessary, since there is supposed to be only one reason: compare-and-swap failure with the metadata update.
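To make the contract being discussed concrete, here is a simplified model of a SegmentPublishResult-style type. This is a sketch, not Druid's actual class: the real one holds DataSegment instances, and plain Strings stand in here.

```java
import java.util.Collections;
import java.util.Set;

// Simplified model of the publish-result contract under discussion.
final class PublishResultSketch {
    private final Set<String> segments;
    private final boolean success;

    private PublishResultSketch(Set<String> segments, boolean success) {
        this.segments = segments;
        this.success = success;
    }

    static PublishResultSketch ok(Set<String> segments) {
        return new PublishResultSketch(segments, true);
    }

    // fail() carries no exception because, per the discussion above, there
    // is only one expected failure mode: losing the metadata
    // compare-and-swap. success == false must mean the publish DEFINITELY
    // did not happen; unknown outcomes are thrown as exceptions instead.
    static PublishResultSketch fail() {
        return new PublishResultSketch(Collections.emptySet(), false);
    }

    boolean isSuccess() { return success; }
    Set<String> getSegments() { return segments; }
}
```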
This is happening because I referenced it in a javadoc. Apparently that's not good enough for the plugin. I removed the reference.

@gianm please check the build failure.

@jihoonson thanks, I pushed again.

👍

There are still some build failures.

OMG, sorry, I'll check more thoroughly before I push again.

Hmm, now some unit tests are failing, and they look legitimate.
* Fix three bugs with segment publishing.
* Remove javadoc-only import.
* Updates.
* Fix test.
* Fix tests.
…che#6187)

* [Backport] Fix three bugs with segment publishing. (apache#6155)
* Fix KafkaIndexTask