Fix recurring bug "Inconsistency between stored metadata" during auto-scaling#19034
kfaraz merged 5 commits into apache:master from
Conversation
    * <p>
    * Since both of these are in-memory structures, a change in Overlord leadership
    * might cause duplicate scaling actions and/or intermittent task failures due
    * to {@code "Inconsistency between stored metadata and target"}.
It already has a new exception description :)
Pull request overview
This PR fixes a race condition during aggressive auto-scaling that causes tasks to fail with "Inconsistency between stored metadata and target state" errors. The root cause was that partitionOffsets were being cleared before scaling, causing new task groups to initialize with stale offsets from the metadata store instead of the latest checkpointed offsets from pending task groups.
Changes:
- Modified `clearPartitionAssignmentsForScaling()` (renamed from `clearAllocationInfo()`) to preserve `partitionOffsets` during auto-scaling so subsequent tasks know where previous tasks left off
- Simplified conditional logic in `IndexerSQLMetadataStorageCoordinator` and improved error messages to be more user-friendly
- Updated tests to verify the fix and removed reflection-based testing in favor of the `@VisibleForTesting` annotation
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| indexing-service/src/main/java/org/apache/druid/indexing/seekablestream/supervisor/SeekableStreamSupervisor.java | Renamed clearAllocationInfo() to clearPartitionAssignmentsForScaling(), removed clearing of partitionOffsets, improved documentation, and made method public with @VisibleForTesting |
| server/src/main/java/org/apache/druid/metadata/IndexerSQLMetadataStorageCoordinator.java | Simplified conditional logic for metadata state validation and improved error messages to be more descriptive and user-friendly |
| server/src/test/java/org/apache/druid/metadata/IndexerSQLMetadataStorageCoordinatorTest.java | Updated test assertions to match new error messages |
| extensions-core/kafka-indexing-service/src/test/java/org/apache/druid/indexing/kafka/supervisor/KafkaSupervisorTest.java | Removed reflection-based test method, updated to call clearPartitionAssignmentsForScaling() directly, and added assertion to verify partitionOffsets are preserved |
    /**
     * Checks the duration of {@link #activelyReadingTaskGroups}, requests them
     * to checkpoint themselves if they have exceeded the specified run duration
     * or if early stop has been requested. If checkpoint is successfull, the
Typo in the javadoc: "successfull" should be "successful".
    - * or if early stop has been requested. If checkpoint is successfull, the
    + * or if early stop has been requested. If checkpoint is successful, the
    if (startMetadataMatchesExisting) {
      // Proceed with the commit
    } else if (startMetadataGreaterThanExisting) {
      // Offsets stored in startMetadata is greater than the last commited metadata.
Typo in comment: "commited" should be "committed".
    - // Offsets stored in startMetadata is greater than the last commited metadata.
    + // Offsets stored in startMetadata is greater than the last committed metadata.
capistrant left a comment
Nice investigative work @kfaraz. One comment of me pondering the wording of the error messages, which you can do with as you please. I see that the Copilot bot left some nits about spelling in the javadocs that you could batch-commit before merging.
    "Stored metadata state[%s] has already been updated by other tasks and"
    + " has diverged from the expected start metadata state[%s]."
    + " This task will be replaced by the supervisor with a new task using updated start offsets."
    + " Reset the supervisor if the issue persists.",
nit: I wonder if we are better off suggesting the idea of resetting versus directing to do it. The new message is much better than the old, though, since you offer some reasoning and mention resetting only if the issue persists, rather than the old message that just says to try resetting it 😜. The same applies to all the error messages in this file and is really just semantics. Trying to reduce the Druid liability (not that there is such a thing in this sense) if an operator gets mad because they reset without knowing what it does and blames the error message.
Thanks for calling this out. Yeah, I guess it's safer to continue with the suggestive tone for the time being. 😛
Thanks a lot for the reviews, @Fly-Style, @capistrant!

Merging, as the failure is unrelated.
…er group is pending (#19091)

Follow up to #19034

Changes
---------
- Add method `SeekableStreamSupervisor.isAnotherTaskGroupPublishingToPartitions()`
- Use this method to check if a task needs to wait before publishing its own offsets
- Update `SegmentTransactionalAppendAction` and `SegmentTransactionalInsertAction` to return a retryable error response only if there is a pending publish that conflicts with the current action
- Fix behaviour of scale down on task rollover in `SeekableStreamSupervisor`
- Fix bug in `SeekableStreamSupervisorIOConfig`
- Fix bug in `CostBasedAutoScaler` to avoid spurious scale downs
- Validate metrics in `CostBasedAutoScaler` before proceeding with scaling action
- Add new tests in `CostBasedAutoScalerIntegrationTest`
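The "wait before publishing" check in the follow-up can be illustrated with a rough sketch. The class name, data structures, and method body below are hypothetical simplifications for illustration only, not the actual Druid implementation; only the method name `isAnotherTaskGroupPublishingToPartitions()` comes from the PR description. The idea is that a task group must defer its own publish if any *other* pending-completion group is still publishing to one of its partitions.

```java
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of a publish-conflict check; real Druid code differs.
public class PublishConflictSketch
{
    // taskGroupId -> partitions that the pending-completion group is publishing
    private final Map<Integer, Set<Integer>> pendingCompletionPartitions;

    public PublishConflictSketch(Map<Integer, Set<Integer>> pendingCompletionPartitions)
    {
        this.pendingCompletionPartitions = pendingCompletionPartitions;
    }

    // A group must wait to publish if any other pending group shares a partition with it.
    public boolean isAnotherTaskGroupPublishingToPartitions(int groupId, Set<Integer> partitions)
    {
        for (Map.Entry<Integer, Set<Integer>> entry : pendingCompletionPartitions.entrySet()) {
            if (entry.getKey() == groupId) {
                continue; // a group never conflicts with itself
            }
            for (int partition : entry.getValue()) {
                if (partitions.contains(partition)) {
                    return true; // another group is still publishing to this partition
                }
            }
        }
        return false;
    }
}
```

Under this sketch, a conflicting publish would be surfaced as a retryable error (as `SegmentTransactionalAppendAction` does in the follow-up) rather than a hard failure, so the waiting task simply retries once the other group finishes.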
Description
During aggressive auto-scaling, tasks frequently fail with the error "Inconsistency between stored metadata and target state". This is typically a self-healing issue, as the supervisor re-launches the failed tasks with updated offsets, but it still adds operational overhead and often causes ingestion lag.
The root cause behind this failure seems to be the following race condition:
1. `changeTaskCount()` is called
2. `checkTaskDuration()` tries to checkpoint the actively reading tasks and moves them to pending completion
3. `checkTaskDuration()` also updates the `partitionOffsets` with the latest result of the checkpointing
4. `clearAllocationInfo()` clears `partitionOffsets`

The bug does not occur if task group A is able to finish publishing the offsets before task group B has been created.
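The race above and its fix can be shown with a minimal, self-contained sketch. This is not the actual `SeekableStreamSupervisor` code; the class and its state model are hypothetical simplifications, with only the method names `clearAllocationInfo()` / `clearPartitionAssignmentsForScaling()` and the `partitionOffsets` field taken from the PR. The point is that the scaling cleanup resets task-group bookkeeping but deliberately keeps the checkpointed offsets, so task groups created after scaling start from where the previous ones left off instead of falling back to stale offsets in the metadata store.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified model of the supervisor state involved in the race.
// Hypothetical sketch; NOT the actual Druid SeekableStreamSupervisor.
public class ScalingStateSketch
{
    // taskGroupId -> description of its partition assignment (simplified)
    private final Map<Integer, String> activelyReadingTaskGroups = new HashMap<>();
    // partitionId -> latest checkpointed offset
    private final Map<Integer, Long> partitionOffsets = new HashMap<>();

    void checkpoint(int partition, long offset)
    {
        // checkTaskDuration() records the checkpoint result in partitionOffsets
        partitionOffsets.put(partition, offset);
    }

    // Old behavior (clearAllocationInfo): wiped partitionOffsets too, so new
    // task groups initialized with stale offsets from the metadata store.
    void clearAllocationInfoOld()
    {
        activelyReadingTaskGroups.clear();
        partitionOffsets.clear(); // <-- the bug
    }

    // New behavior (clearPartitionAssignmentsForScaling): keep partitionOffsets.
    void clearPartitionAssignmentsForScaling()
    {
        activelyReadingTaskGroups.clear();
        // partitionOffsets intentionally preserved
    }

    Long offsetFor(int partition)
    {
        return partitionOffsets.get(partition);
    }

    public static void main(String[] args)
    {
        ScalingStateSketch supervisor = new ScalingStateSketch();
        supervisor.activelyReadingTaskGroups.put(0, "partitions 0,1");
        supervisor.checkpoint(0, 150L);

        supervisor.clearPartitionAssignmentsForScaling();

        // Tasks launched after scaling still see the checkpointed offset.
        System.out.println(supervisor.offsetFor(0)); // prints 150
    }
}
```

With the old `clearAllocationInfoOld()` behavior, `offsetFor(0)` would return `null` after cleanup, which is the in-memory analogue of the new task group reading stale offsets and later tripping the "Inconsistency between stored metadata and target state" check.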
Changes
- Preserve `partitionOffsets` before auto-scaling so that subsequent tasks know where the previous tasks had left off.
- Simplified conditional logic and improved error messages in `IndexerSQLMetadataStorageCoordinator`.
- Updated `KafkaSupervisorTest` to not use reflection and updated the test to verify that `partitionOffsets` are preserved.

Note
This bug may still occur if Overlord leadership changes right before the scaling event.
But there is currently no way to handle that, since `partitionOffsets` is an in-memory data structure and is not meant to be persisted.