
Fix recurring bug "Inconsistency between stored metadata" during auto-scaling #19034

Merged
kfaraz merged 5 commits into apache:master from kfaraz:fix_supervisor_scaling_bug
Feb 20, 2026

Conversation

@kfaraz
Contributor

@kfaraz kfaraz commented Feb 19, 2026

Description

During aggressive auto-scaling, tasks frequently fail with the error "Inconsistency between stored metadata and target state". The issue is typically self-healing, since the supervisor re-launches the failed tasks with updated offsets, but it still adds operational overhead and often causes ingestion lag.

java.util.concurrent.ExecutionException: org.apache.druid.java.util.common.ISE:
  Failed to publish segments because of
[Inconsistency between stored metadata state[KafkaDataSourceMetadata{}] and target state[KafkaDataSourceMetadata{}].

The root cause behind this failure seems to be the following race condition:

  • Scaling event is triggered.
  • changeTaskCount() is called.
  • checkTaskDuration() tries to checkpoint the actively reading tasks and moves them to pending completion.
  • checkTaskDuration() also updates partitionOffsets with the latest result of the checkpointing.
  • ⚠️ clearAllocationInfo() clears partitionOffsets.
  • New task group B is created and is assigned a partition P1 which an old task group A (still pending completion) was also reading from.
  • ⚠️ (race) Task group B is initialized with offsets present in the metadata store. But these do not reflect the latest checkpoint since task group A is yet to publish.
  • Task group A publishes offsets and updates the metadata store.
  • ❌ Task group B tries to publish and fails since the committed offsets have now diverged.

The bug does not occur if task group A is able to finish publishing the offsets before task group B has been created.
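
To make the race concrete, here is a minimal, runnable sketch of the offset lookup. All names are hypothetical stand-ins except partitionOffsets; this is not the actual Druid code.

import java.util.HashMap;
import java.util.Map;

class OffsetRaceSketch
{
  // In-memory map of partition -> latest checkpointed offset. This is the
  // structure that clearAllocationInfo() used to wipe before scaling.
  private final Map<Integer, Long> partitionOffsets = new HashMap<>();

  // Offsets last committed to the metadata store. These stay stale until a
  // pending-completion task group finishes publishing.
  private final Map<Integer, Long> storedMetadataOffsets = new HashMap<>();

  Long startingOffsetFor(int partition)
  {
    // A surviving checkpointed offset lets the new task group start exactly
    // where the old group will publish up to; otherwise we fall back to the
    // stale metadata store value, which sets up the failed publish.
    return partitionOffsets.getOrDefault(partition, storedMetadataOffsets.get(partition));
  }

  public static void main(String[] args)
  {
    OffsetRaceSketch sketch = new OffsetRaceSketch();
    sketch.storedMetadataOffsets.put(1, 100L);  // last committed offset for P1
    sketch.partitionOffsets.put(1, 250L);       // latest checkpoint of task group A

    sketch.partitionOffsets.clear();            // the old clearAllocationInfo() behavior
    // Task group B now starts P1 at 100 instead of 250; once group A publishes
    // up to 250, B's commit no longer matches the stored metadata.
    System.out.println("Task group B starts P1 at offset " + sketch.startingOffsetFor(1));
  }
}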

Changes

  • Do not clear partitionOffsets before auto-scaling so that subsequent tasks know where the previous tasks left off (see the sketch after this list).
  • Simplify the condition in IndexerSQLMetadataStorageCoordinator.
  • Add some comments and javadocs.
  • Update error messages to be more user-friendly.
  • Fix KafkaSupervisorTest to not use reflection and update the test to verify the clearing of partitionOffsets.
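
For illustration, a hedged sketch of the first change, based on the descriptions in this PR. The partitionGroups field and its shape are assumptions, not the actual Druid source.

import com.google.common.annotations.VisibleForTesting;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

class ScalingSupervisorSketch
{
  // Assumed shape: group id -> partitions assigned to that task group.
  private final Map<Integer, Set<Integer>> partitionGroups = new ConcurrentHashMap<>();
  // Partition -> latest checkpointed offset (the map this PR preserves).
  private final Map<Integer, Long> partitionOffsets = new ConcurrentHashMap<>();

  @VisibleForTesting
  public void clearPartitionAssignmentsForScaling()
  {
    // Forget which task group owns which partitions so the scaling action
    // can redistribute them across the new number of task groups.
    partitionGroups.clear();
    // Deliberately do NOT clear partitionOffsets: newly created task groups
    // must start from the latest checkpointed offsets rather than from the
    // (possibly stale) offsets committed to the metadata store.
  }
}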

Note

This bug may still occur if Overlord leadership changes right before the scaling event.
But there is currently no way to handle that since partitionOffsets is an in-memory data structure and is not meant to be persisted.


This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

* <p>
* Since both of these are in-memory structures, a change in Overlord leadership
* might cause duplicate scaling actions and/or intermittent task failures due
* to {@code "Inconsistency between stored metadata and target"}.
Contributor


It already has a new exception description :)


Copilot AI left a comment


Pull request overview

This PR fixes a race condition during aggressive auto-scaling that causes tasks to fail with "Inconsistency between stored metadata and target state" errors. The root cause was that partitionOffsets were being cleared before scaling, causing new task groups to initialize with stale offsets from the metadata store instead of the latest checkpointed offsets from pending task groups.

Changes:

  • Modified clearPartitionAssignmentsForScaling() (renamed from clearAllocationInfo()) to preserve partitionOffsets during auto-scaling so subsequent tasks know where previous tasks left off
  • Simplified conditional logic in IndexerSQLMetadataStorageCoordinator and improved error messages to be more user-friendly
  • Updated tests to verify the fix and removed reflection-based testing in favor of @VisibleForTesting annotation

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

  • indexing-service/src/main/java/org/apache/druid/indexing/seekablestream/supervisor/SeekableStreamSupervisor.java: Renamed clearAllocationInfo() to clearPartitionAssignmentsForScaling(), removed the clearing of partitionOffsets, improved documentation, and made the method public with @VisibleForTesting.
  • server/src/main/java/org/apache/druid/metadata/IndexerSQLMetadataStorageCoordinator.java: Simplified the conditional logic for metadata state validation and made the error messages more descriptive and user-friendly.
  • server/src/test/java/org/apache/druid/metadata/IndexerSQLMetadataStorageCoordinatorTest.java: Updated test assertions to match the new error messages.
  • extensions-core/kafka-indexing-service/src/test/java/org/apache/druid/indexing/kafka/supervisor/KafkaSupervisorTest.java: Removed the reflection-based test method, updated the test to call clearPartitionAssignmentsForScaling() directly, and added an assertion to verify that partitionOffsets are preserved.


/**
* Checks the duration of {@link #activelyReadingTaskGroups}, requests them
* to checkpoint themselves if they have exceeded the specified run duration
* or if early stop has been requested. If checkpoint is successfull, the

Copilot AI Feb 19, 2026


Typo in the javadoc: "successfull" should be "successful".

Suggested change
* or if early stop has been requested. If checkpoint is successfull, the
* or if early stop has been requested. If checkpoint is successful, the

if (startMetadataMatchesExisting) {
// Proceed with the commit
} else if (startMetadataGreaterThanExisting) {
// Offsets stored in startMetadata is greater than the last commited metadata.

Copilot AI Feb 19, 2026


Typo in comment: "commited" should be "committed".

Suggested change
// Offsets stored in startMetadata is greater than the last commited metadata.
// Offsets stored in startMetadata is greater than the last committed metadata.
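
Setting the typo aside, here is a hedged sketch of the simplified three-way branch this snippet belongs to. The two booleans come from the snippet above; the result names and their mapping are illustrative assumptions, not the coordinator's actual API.

// Illustrative only; not the actual IndexerSQLMetadataStorageCoordinator code.
class MetadataValidationSketch
{
  enum Result { SUCCESS, RETRYABLE_FAILURE, FAILURE }

  static Result resolve(boolean startMetadataMatchesExisting, boolean startMetadataGreaterThanExisting)
  {
    if (startMetadataMatchesExisting) {
      // Stored metadata matches the task's expected start state: commit.
      return Result.SUCCESS;
    } else if (startMetadataGreaterThanExisting) {
      // Offsets in startMetadata are ahead of the last committed metadata,
      // e.g. a retry of a publish that partially succeeded: allow a retry.
      return Result.RETRYABLE_FAILURE;
    } else {
      // Stored metadata has diverged, e.g. another task group published
      // first; the supervisor replaces the task with updated start offsets.
      return Result.FAILURE;
    }
  }
}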

Contributor

@Fly-Style Fly-Style left a comment


Looks good to me!

Contributor

@capistrant capistrant left a comment


nice investigative work @kfaraz. One comment where I ponder the wording of the error messages, which you can do with as you please. I see that the Copilot bot left some spelling nits in the javadocs that you could batch-commit before merging.

"Stored metadata state[%s] has already been updated by other tasks and"
+ " has diverged from the expected start metadata state[%s]."
+ " This task will be replaced by the supervisor with a new task using updated start offsets."
+ " Reset the supervisor if the issue persists.",
Contributor


nit: I wonder if we are better off suggesting the idea of resetting versus directing the operator to do it. The new message is much better than the old one, though, since you offer some reasoning and mention resetting only if the issue persists, rather than the old message that just says to try resetting 😜. The same applies to all the error messages in this file and is really just semantics. I'm trying to reduce the Druid liability (not that there is such a thing in this sense) if an operator gets mad because they reset without knowing what it does and blames the error message.

Contributor Author

@kfaraz kfaraz Feb 19, 2026


Thanks for calling this out. Yeah, I guess it's safer to continue with the suggestive tone for the time being. 😛

@kfaraz
Contributor Author

kfaraz commented Feb 19, 2026

Thanks a lot for the reviews, @Fly-Style , @capistrant !

@kfaraz
Contributor Author

kfaraz commented Feb 20, 2026

Merging, as the one failing check is unrelated.

@kfaraz kfaraz merged commit 97696de into apache:master Feb 20, 2026
36 of 37 checks passed
@kfaraz kfaraz deleted the fix_supervisor_scaling_bug branch February 20, 2026 06:56
kfaraz added a commit that referenced this pull request Mar 9, 2026
…er group is pending (#19091)

Follow up to #19034 

Changes
---------
- Add method `SeekableStreamSupervisor.isAnotherTaskGroupPublishingToPartitions()` (see the sketch below)
- Use this method to check if a task needs to wait before publishing its own offsets
- Update `SegmentTransactionalAppendAction` and `SegmentTransactionalInsertAction` to return
a retryable error response only if there is a pending publish that conflicts with the current action
- Fix behaviour of scale down on task rollover in `SeekableStreamSupervisor`
- Fix bug in `SeekableStreamSupervisorIOConfig`
- Fix bug in `CostBasedAutoScaler` to avoid spurious scale downs
- Validate metrics in `CostBasedAutoScaler` before proceeding with scaling action
- Add new tests in `CostBasedAutoScalerIntegrationTest`
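
As a hedged illustration of the first bullet: only the method name comes from the commit message; the surrounding class, the field, and its shape are assumptions, not the actual Druid source.

import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

class PublishGuardSketch
{
  // Assumed shape: group id -> partition sets of task groups that have
  // stopped reading but have not yet finished publishing their segments.
  private final Map<Integer, List<Set<Integer>>> pendingCompletionPartitions = new ConcurrentHashMap<>();

  // Returns true if some pending-completion task group is still publishing
  // segments for any of the given partitions, in which case the caller
  // should wait before publishing its own offsets.
  boolean isAnotherTaskGroupPublishingToPartitions(Set<Integer> partitions)
  {
    return pendingCompletionPartitions.values().stream()
        .flatMap(List::stream)
        .anyMatch(pending -> !Collections.disjoint(pending, partitions));
  }
}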
@cecemei cecemei added this to the 37.0.0 milestone Apr 8, 2026