Fix record validation in SeekableStreamIndexTaskRunner #7246

Merged: fjy merged 2 commits into apache:master from jihoonson:fix-seekable-stream on Mar 13, 2019

Conversation

@jihoonson
Contributor

Fix #7239.

In verifyInitialRecordAndSkipExclusivePartition(), currOffset is now used to verify the record offset. I also fixed a bug where the default compareTo method of SequenceOffsetType was used; OrderedSequenceNumber should be used instead. I also changed SequenceOffsetType to no longer extend Comparable, to prevent this problem in the future.
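
As a minimal, hypothetical illustration of that compareTo pitfall (not code from this patch): a String-typed sequence offset orders lexicographically under its default Comparable, which mis-orders numeric sequence numbers such as Kinesis's, whereas an OrderedSequenceNumber-style wrapper compares the numeric value.

```java
import java.math.BigInteger;

// Hypothetical demo (not Druid code): why relying on SequenceOffsetType's
// default compareTo is unsafe when offsets are numeric strings.
public class SequenceNumberOrderingDemo
{
  public static void main(String[] args)
  {
    final String earlier = "99";
    final String later = "100";

    // Lexicographic String#compareTo: positive, i.e. "99" sorts after "100".
    System.out.println("String compareTo:  " + earlier.compareTo(later));

    // Numeric comparison, as an OrderedSequenceNumber-style wrapper would do:
    // negative, i.e. 99 correctly precedes 100.
    System.out.println("Numeric compareTo: " + new BigInteger(earlier).compareTo(new BigInteger(later)));
  }
}
```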

@clintropolis
Member

LGTM 👍

@jihoonson
Contributor Author

Oops, I noticed that the unit test for Kinesis indexing is missing. I'll add it.

@jihoonson
Contributor Author

Added a test for the Kinesis index task.

@fjy merged commit 32e86ea into apache:master on Mar 13, 2019
```diff
   // check exclusive starting sequence
   if (isStartingSequenceOffsetsExclusive() && exclusiveStartingPartitions.contains(record.getPartitionId())) {
-    log.info("Skipping starting sequenceNumber for partition [%s] marked exclusive", record.getPartitionId());
+    log.warn("Skipping starting sequenceNumber for partition [%s] marked exclusive", record.getPartitionId());
```
Contributor

I don't think this needs to be a warning. It looks like it happens by design in Kinesis for any task after the first one that first reads a particular partition.

Contributor Author

Hmm, I don't remember why I changed this... Will revert.

Contributor

Thanks

```java
      // Check only for the first record among the record batch.
      if (initialOffsetsSnapshot.contains(record.getPartitionId())) {
        final SequenceOffsetType currOffset = currOffsets.get(record.getPartitionId());
        if (currOffset != null) {
```
Contributor

When is currOffset null? It seems to defeat the purpose of this check if we can get a record to check, and then don't check it because we don't know what the current offset is supposed to be.

Contributor Author

Hmm, maybe it's better to throw an error if it's null. Will raise a PR.
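
A minimal sketch of what that follow-up could look like (hypothetical, not the actual follow-up patch; ISE is Druid's IllegalStateException helper):

```java
final SequenceOffsetType currOffset = currOffsets.get(record.getPartitionId());
if (currOffset == null) {
  // Fail fast rather than silently skipping validation: a missing current
  // offset for a partition we are actively reading points to an internal bug.
  throw new ISE("currOffset is null for partition[%s]", record.getPartitionId());
}
```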

Contributor

Thanks

```java
    )
    {
      if (intialSequenceSnapshot.containsKey(record.getPartitionId())) {
        if (record.getSequenceNumber().compareTo(intialSequenceSnapshot.get(record.getPartitionId())) < 0) {
```
Contributor

What was the issue with using intialSequenceSnapshot in the original code? Did it have the wrong offsets for some reason (like, later offsets than we should be reading)?

Contributor Author

I think checking against intialSequenceSnapshot is wrong. Before this PR, intialSequenceSnapshot contained the start offsets of the current sequence. Comparing the offsets of the read record with intialSequenceSnapshot means it would allow rewinding as long as the rewound offsets are still larger than intialSequenceSnapshot, which I don't think should be allowed.

The bug reported in #7239 happens while checkpointing with multiple replicas. During the checkpoint, the supervisor pauses all replica tasks and finds the max offsets of the current sequence, S. It then sets those max offsets as the end offsets for all replicas. Here, if finish = false in setEndOffsets(), intialSequenceSnapshot was updated to the given end offsets, which are the start offsets of the next sequence, S'. However, each replica can still consume some more offsets of sequence S after being resumed, until it reaches the end offsets of S. This caused an exception here, because the offset of the record belongs to sequence S and so is smaller than the start offsets of S'.
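
A concrete walkthrough with made-up Kafka-style offsets (the numbers are illustrative only, not from the issue):

- Sequence S starts at offset 100; this replica has read up to currOffsets = 150 for the partition.
- Checkpoint: the supervisor pauses all replicas, finds the max offset across them (say 160), and calls setEndOffsets(160, finish = false). intialSequenceSnapshot is updated to 160, the start offset of S'.
- The replica resumes and reads the record at offset 151, which still belongs to S.
- Old check: 151 < 160 (intialSequenceSnapshot), so the task throws even though the read is valid.
- New check: 151 >= 150 (currOffsets), so the record is accepted.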

Contributor

> Here, if finish = false in setEndOffsets(), intialSequenceSnapshot was updated to the given end offsets, which are the start offsets of the next sequence, S'. However, each replica can still consume some more offsets of sequence S after being resumed, until it reaches the end offsets of S. This caused an exception here, because the offset of the record belongs to sequence S and so is smaller than the start offsets of S'.

It sounds like this part is the heart of the bug: the code didn't allow for continuing to read a few more messages of a prior sequence S before the messages for a new sequence S' started showing up. And it sounds like the fix is to compare against the currOffsets we think we should be reading right now, rather than the start of the sequence. Thanks for explaining.
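
Sketched as code under the same reading of the thread (a paraphrase, not the verbatim patch; createSequenceNumber() is the runner's factory for OrderedSequenceNumber per the PR description, and ISE is Druid's IllegalStateException helper):

```java
// Validate the first record of a batch against the offset we expect to read
// next (currOffsets), using the stream-specific OrderedSequenceNumber ordering.
final SequenceOffsetType currOffset = currOffsets.get(record.getPartitionId());
final OrderedSequenceNumber<SequenceOffsetType> recordSequenceNumber =
    createSequenceNumber(record.getSequenceNumber());
if (recordSequenceNumber.compareTo(createSequenceNumber(currOffset)) < 0) {
  throw new ISE(
      "sequenceNumber[%s] of record is smaller than current offset[%s] for partition[%s]",
      record.getSequenceNumber(),
      currOffset,
      record.getPartitionId()
  );
}
```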

gianm pushed a commit to implydata/druid-public that referenced this pull request Mar 14, 2019
* Fix record validation in SeekableStreamIndexTaskRunner

* add kinesis test

Development

Successfully merging this pull request may close these issues:

Kafka tasks fail after resuming for incremental handoff
