In the Kafka indexing service, the overlord runs a sanity check that the start offsets of the partitions of the segments being published match the ones stored in the metastore, which guarantees that all segments are published in order. Because of this check, some tasks might fail in the following scenario.
- The supervisor created a task, `T1`, with a start offset `O1`.
- Somehow, the supervisor couldn't send an endOffset `O2` to `T1` within `taskDuration`. Instead, it sent an endOffset `O3` to `T1` after `taskDuration * 10`. (In our case, the supervisor couldn't send it because of too-frequent HTTP connection refused errors.)
- `T1` started to merge, push, and publish segments.
- The supervisor created a new task, `T2`, with a start offset `O3`.
- After `taskDuration`, it sent an endOffset `O4` to `T2`.
- `T2` started to merge, push, and publish segments.
- Since `T1` had run for a much longer time, it had many more segments to publish than `T2`. As a result, `T2` tried to publish before `T1` completed publishing.
- `T2` failed to publish because of the sanity check when updating the metastore.
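The failure in the steps above can be reproduced with a toy model of the check. This is only a minimal sketch, not Druid's actual code: `MetadataStore`, `publish`, the single partition `0`, and the numeric offsets are all hypothetical stand-ins chosen for illustration. It assumes the check behaves like a compare-and-swap on the committed offsets.

```python
class MetadataStore:
    """Toy stand-in for the metastore (hypothetical, not Druid's API).

    It holds the last committed end offset per partition.
    """

    def __init__(self, committed):
        self.committed = dict(committed)

    def publish(self, start_offsets, end_offsets):
        # Sanity check: the publishing task's start offsets must equal
        # the offsets currently stored, otherwise the publish is rejected.
        if start_offsets != self.committed:
            return False
        # The check passed, so the task's end offsets become the new state.
        self.committed = dict(end_offsets)
        return True


# Hypothetical offsets for one partition (id 0).
O1, O3, O4 = 0, 1000, 1100

store = MetadataStore({0: O1})

# T2 (O3 -> O4) reaches the publish step before T1 (O1 -> O3) has finished.
t2_first = store.publish({0: O3}, {0: O4})

# T1 publishes afterwards; its start offset matches the stored one.
t1 = store.publish({0: O1}, {0: O3})

# The same T2 publish would have passed had it come after T1.
t2_after_t1 = store.publish({0: O3}, {0: O4})

print(t2_first, t1, t2_after_t1)
```

The last call is there only to show that the check depends on ordering. It is not meant to suggest that the task retries in practice.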
So, I think the supervisor should be able to guarantee the segment publishing order across all running tasks, as shown below.
T1: indexing ===> publishing ===> handoff
T2: indexing ===> publishing ===> handoff
T3: indexing ===> publishing ===> handoff
...
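One way to picture the ordering in the diagram above is a gate that lets a task publish only when every task created before it has already published. The sketch below is an assumption about how such a guarantee could look, not a proposal for the actual supervisor code. `PublishOrderGate` and its methods are hypothetical names.

```python
from collections import deque


class PublishOrderGate:
    """Hypothetical helper: allows publishing strictly in task creation order."""

    def __init__(self):
        self._queue = deque()

    def register(self, task_id):
        # Called when the supervisor creates a task.
        self._queue.append(task_id)

    def may_publish(self, task_id):
        # Only the oldest task that has not published yet may publish.
        return bool(self._queue) and self._queue[0] == task_id

    def mark_published(self, task_id):
        if not self.may_publish(task_id):
            raise RuntimeError(f"{task_id} tried to publish out of order")
        self._queue.popleft()


gate = PublishOrderGate()
for task in ("T1", "T2", "T3"):
    gate.register(task)

# T2 finishes indexing first and asks to publish before T1.
attempts = ["T2", "T1", "T3", "T2", "T3"]
order = []
for task in attempts:
    if gate.may_publish(task):
        gate.mark_published(task)
        order.append(task)

print(order)
```

In this sketch a task that is not at the head of the queue simply has to ask again later. A real implementation would also need to decide what happens when the head task fails or times out, which is outside what the sketch covers.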