
Fix negative Kafka partition lag caused by inconsistent current/latest offsets#18750

Merged
FrankChen021 merged 3 commits into apache:master from wuguowei1994:hotfix/negative-lag
Mar 17, 2026

Conversation

@wuguowei1994
Contributor

@wuguowei1994 wuguowei1994 commented Nov 18, 2025

Motivation

We operate a Druid deployment with more than 500 nodes.

In real-time ingestion scenarios, a monitoring process queries the cluster every minute to retrieve the ingest/kafka/partitionLag metric. If the lag remains unhealthy for more than five minutes, alerts are triggered.

In our production environment, this metric periodically becomes negative, even when the cluster is fully healthy. These false alerts create unnecessary operational load and frequently wake the on-call team during off-hours. At the same time, we cannot suppress negative-lag alerts entirely, since in some situations negative lag can indicate real ingestion problems.

For a large-scale, 24×7 real-time ingestion pipeline, accurate and consistent lag metrics are essential to avoid unnecessary nighttime wake-ups while still ensuring that real issues are detected promptly.


Problem Description

(screenshot: negative values in the ingest/kafka/partitionLag metric)

In the current implementation, the Druid supervisor maintains two volatile data structures:

  • The latest Kafka end_offset for each partition
  • The latest task-reported current_offset for each partition

The supervisor updates these values periodically (every 30 seconds) in two steps:

  1. Query all tasks in parallel to update current_offset.
    This step waits for all HTTP requests to complete; each request has a two-minute timeout.
  2. Query the Kafka cluster to refresh end_offset.

Meanwhile, a separate periodic task (running every minute) computes:

lag = end_offset - current_offset

Because the two updates are not atomic, intermediate inconsistent states may occur.
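The race can be sketched as follows. This is a simplified illustration of the pattern described above, not the actual Druid supervisor code; the class, field, and method names are invented for the example. The key point is that the two maps are written in separate steps, so a reader can pair a fresh current_offset with a stale end_offset.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative sketch of the racy pattern: two maps updated in separate
// steps, with a lag reader that observes them independently.
public class LagRaceSketch
{
  // Updated in Step 1, one task at a time, as HTTP responses arrive.
  static final Map<Integer, Long> currentOffsets = new ConcurrentHashMap<>();
  // Updated in Step 2, only after ALL tasks have responded.
  static final Map<Integer, Long> endOffsets = new ConcurrentHashMap<>();

  static long lagFor(int partition)
  {
    // Reads the two maps independently: no atomicity across them.
    return endOffsets.getOrDefault(partition, 0L)
           - currentOffsets.getOrDefault(partition, 0L);
  }

  public static void main(String[] args)
  {
    // Initial consistent state: end = 10000, current = 0, lag = 10000.
    endOffsets.put(0, 10000L);
    currentOffsets.put(0, 0L);
    System.out.println(lagFor(0)); // 10000

    // Step 1 finishes for this partition (current = 20000), but Step 2 is
    // still blocked on one slow task, so end_offset stays at 10000.
    currentOffsets.put(0, 20000L);
    System.out.println(lagFor(0)); // -10000: the inconsistent window
  }
}
```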

Intermediate State Leading to Negative Lag

If one task becomes heavily loaded or experiences other delays during Step 1, it may take significantly longer to return its offset. In this situation, the supervisor continues waiting for that slow task while the other tasks have already responded.

During this waiting period:

  • Many current_offset values have already been updated to new values.
  • The end_offset values remain stale because Step 2 has not executed yet.

If a monitoring request arrives in this intermediate window, the supervisor computes lag using:

  • Partially updated current_offset
  • Stale end_offset

This produces negative lag values.

This issue repeats as long as at least one task remains slow. Large clusters with many partitions and many Kafka-indexing tasks are more likely to experience this scenario.


Example Scenario

  1. Initial state: end_offset = 10000, current_offset = 0.

  2. After consumption: latest Kafka end_offset = 30000, and all tasks have consumed up to 20000.

  3. During Step 1, 49 tasks respond quickly, and their current_offset is updated to 20000.
    One task is slow, causing Step 1 to remain in the awaiting state.

  4. The in-memory end_offset stays at the old value 10000.

  5. If a metric query occurs at this point, the supervisor calculates:

    10000 - 20000 = -10000
    
  6. Because the periodic update logic repeats, this situation can persist across multiple cycles.


Proposed Changes

Replace the two volatile structures storing current_offset and end_offset with AtomicReference containers that hold both values as a single immutable state object. The supervisor will update these references as atomic units, ensuring that lag computation always observes a consistent snapshot.

This eliminates inconsistent intermediate states and prevents negative lag due to partial updates.
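A minimal sketch of this approach follows. The actual class added by the PR is OffsetSnapshot; the names and structure below are simplified for illustration. Both maps are frozen into one immutable object and swapped in with a single AtomicReference.set, so a single ref.get always yields a mutually consistent pair.

```java
import java.util.Map;
import java.util.concurrent.atomic.AtomicReference;

// Simplified sketch of the atomic-snapshot fix (illustrative names; the
// real implementation is the PR's OffsetSnapshot class).
public class SnapshotSketch
{
  static final class Snapshot
  {
    final Map<Integer, Long> currentOffsets;
    final Map<Integer, Long> endOffsets;

    Snapshot(Map<Integer, Long> current, Map<Integer, Long> end)
    {
      // Defensive immutable copies: the snapshot can never change after publish.
      this.currentOffsets = Map.copyOf(current);
      this.endOffsets = Map.copyOf(end);
    }
  }

  static final AtomicReference<Snapshot> ref =
      new AtomicReference<>(new Snapshot(Map.of(), Map.of()));

  // Writer: publish both maps as one atomic unit.
  static void publish(Map<Integer, Long> current, Map<Integer, Long> end)
  {
    ref.set(new Snapshot(current, end));
  }

  // Reader: one volatile read yields a consistent current/end pair.
  static long lagFor(int partition)
  {
    Snapshot s = ref.get();
    return s.endOffsets.getOrDefault(partition, 0L)
           - s.currentOffsets.getOrDefault(partition, 0L);
  }

  public static void main(String[] args)
  {
    publish(Map.of(0, 20000L), Map.of(0, 30000L));
    System.out.println(lagFor(0)); // 10000, never a partial-update artifact
  }
}
```

A reader can still see a snapshot that is slightly stale as a whole, but never a mix of values from two different update cycles, which is what produced the negative lag.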


Rationale

  • Ensures consistent reads between related fields.
  • No behavioral changes other than removing negative lag caused by inconsistent state.

Operational Impact

  • Improved accuracy of Kafka lag metrics in large clusters.
  • Reduces false alerts in monitoring systems.

Test Plan

  • This change does not add a new feature; we only need to ensure that existing tests still pass.
  • All current tests pass successfully.

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

@wuguowei1994 wuguowei1994 force-pushed the hotfix/negative-lag branch 2 times, most recently from 89b59bc to d1f5e4e on November 19, 2025 02:22
@wuguowei1994
Contributor Author

wuguowei1994 commented Nov 19, 2025

Our team has been suffering from these negative-lag alerts recently, and they’ve been repeatedly waking us up at night. It’s become difficult for us to get a good night’s sleep.

@kfaraz
Contributor

kfaraz commented Nov 19, 2025

@wuguowei1994 , thanks for creating a PR in Apache Druid! One of the contributors will take a look at the changes soon.

@abhishekrb19
Contributor

@wuguowei1994 could you sync the latest master to your branch? I think this patch might fix the test failures.

@wuguowei1994 wuguowei1994 force-pushed the hotfix/negative-lag branch 3 times, most recently from 7b8667f to 14a115a on November 22, 2025 02:38
@wuguowei1994
Contributor Author

wuguowei1994 commented Nov 26, 2025

ERROR [main] org.apache.druid.testing.utils.DruidClusterAdminClient - Error while waiting for [http://localhost:30400] to be ready
java.util.concurrent.ExecutionException: org.jboss.netty.channel.ChannelException: Faulty channel in resource pool

@clintropolis Thanks for rerunning the jobs. Before the rerun there were two failed tasks, and now there’s only one left. The error still looks like an environment issue… really frustrating.

This whole code submission experience also shows how inactive the project has become. A lot of people report issues in the Druid Slack, but almost no one responds anymore. It’s no surprise so many companies are migrating from Druid to ClickHouse.

@clintropolis
Member

@clintropolis Thanks for rerunning the jobs. Before the rerun there were two failed tasks, and now there’s only one left. The error still looks like an environment issue… really frustrating.

Yea, we are in the middle of a bit of a migration/overhaul of our integration test framework and processes, hopefully this will be more well behaved in the future, since part of the reason for this change is to address flakiness as well as make it much easier to write and debug these tests.

This whole code submission experience also shows how inactive the project has become. A lot of people report issues in the Druid Slack, but almost no one responds anymore. It’s no surprise so many companies are migrating from Druid to ClickHouse.

Apologies for the perception - while I can assure you that there are quite a lot of active and interesting projects happening and that Druid is still very much being actively developed, that doesn't fix your experience here or the optics in Slack. The unfortunate reality is that there are always a lot more things to do than people to do them, and while we try our best, sometimes things are slow to get a committer's attention. All that said, thanks again for the contribution and for trying to make things better. I do hear you - I'll see if I can nudge some of the other committers to be a bit more responsive, and try to do the same myself.

@wuguowei1994 wuguowei1994 force-pushed the hotfix/negative-lag branch 3 times, most recently from 3e773a2 to 74c3ac4 on November 26, 2025 12:13
@wuguowei1994
Contributor Author

@clintropolis Take a look?

@kfaraz
Contributor

kfaraz commented Dec 3, 2025

@wuguowei1994 , thanks for your patience!
We will try to get this PR reviewed soon.

Contributor

@cecemei cecemei left a comment


Thanks for making the PR, and the description really explains the problem well. Kafka ingest lag is an important metric and we also monitor it closely. I always feel a bit intimidated by this part of the code base, but your PR has helped me understand it much better.

@wuguowei1994
Contributor Author

wuguowei1994 commented Dec 4, 2025

@cecemei
I've gone through your comments thoroughly — they're great! Give me a bit of time to improve the code.

Contributor

@kfaraz kfaraz left a comment


Thanks for the changes, @wuguowei1994 ! I have left some suggestions.

While the changes in this PR make sense by reporting the lag more consistently (updating the two offsets in lockstep), I was wondering if it wouldn't be simpler to just report zero lag in case the lag turns out to be negative.

A negative record lag does mean that the task has already caught up to the last offsets that we had fetched from the topic and ingested some more records beyond that. And the lag metric is really just meant to indicate if the tasks are keeping up.

For other purposes, we have the message gap and the time lag metrics.

In fact, the negative lag could even be a feature to identify if some tasks are particularly slow in returning their offsets. 😛 , and we could probably have alerts set up if the negative lag goes below a specific threshold.
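The zero-clamp alternative suggested here can be sketched in a couple of lines. This is only an illustration of the suggestion, not what the merged PR implements; the class and method names are invented.

```java
// Illustrative sketch of the alternative: report zero lag whenever the
// computed lag goes negative (i.e. the task has already caught up past
// the last end offsets fetched from the stream).
public class ClampSketch
{
  static long reportedLag(long endOffset, long currentOffset)
  {
    // Clamp: treat "already caught up" as zero lag instead of negative.
    return Math.max(0L, endOffset - currentOffset);
  }

  public static void main(String[] args)
  {
    System.out.println(reportedLag(10000L, 20000L)); // 0 instead of -10000
    System.out.println(reportedLag(30000L, 20000L)); // 10000, unchanged
  }
}
```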


@cecemei , I think you also raised a concern regarding the possibility that the lag reported might now be higher than the actual lag, since we always fetch the stream offsets only after we have received updates from all the tasks.

I think the current code is also susceptible to reporting stale lag (higher or lower).

Say if a task were slow to return its latest ingested offsets, we would be delayed in fetching the latest offsets from the stream.
So, in that period, we would be reporting stale lag (which could have been higher or lower than the actual lag, a special case of which would be negative lag) and then as soon as we fetched the latest offsets from the stream, the reported lag would fix itself.

@cecemei
Contributor

cecemei commented Dec 4, 2025

Thanks for the changes, @wuguowei1994 ! I have left some suggestions.

While the changes in this PR make sense by reporting the lag more consistently (updating the two offsets in lockstep), I was wondering if it wouldn't be simpler to just report zero lag in case the lag turns out to be negative.

A negative record lag does mean that the task has already caught up to the last offsets that we had fetched from the topic and ingested some more records beyond that. And the lag metric is really just meant to indicate if the tasks are keeping up.

For other purposes, we have the message gap and the time lag metrics.

In fact, the negative lag could even be a feature to identify if some tasks are particularly slow in returning their offsets. 😛 , and we could probably have alerts set up if the negative lag goes below a specific threshold.

@cecemei , I think you also raised a concern regarding the possibility that the lag reported might now be higher than the actual lag, since we always fetch the stream offsets only after we have received updates from all the tasks.

I think the current code is also susceptible to reporting stale lag (higher or lower).

Say if a task were slow to return its latest ingested offsets, we would be delayed in fetching the latest offsets from the stream. So, in that period, we would be reporting stale lag (which could have been higher or lower than the actual lag, a special case of which would be negative lag) and then as soon as we fetched the latest offsets from the stream, the reported lag would fix itself.

Yes, I suspect we might be seeing a slightly higher lag after this change (compared with what was reported before). The accuracy of the lag would not be affected by when latestSequenceFromStream gets updated (more random), but only by how long fetching the offsets takes (delay inside the system and interaction with Kafka). I do think it provides more consistency than before, and the trend of the lag is more important, so it's an improvement.

For future reference, we could maybe calculate the lag on a per partition basis.

@wuguowei1994
Contributor Author

In fact, the negative lag could even be a feature to identify if some tasks are particularly slow in returning their offsets. 😛 , and we could probably have alerts set up if the negative lag goes below a specific threshold.

@kfaraz
Thanks for the clarification — that makes sense. In our case, though, we’ve noticed that negative lag in our large cluster can sometimes persist for over five minutes.

We’ve talked about this internally, and if it only happens occasionally (for example, under a minute), adjusting the alert thresholds would absolutely work for us. But when it lasts longer, it tends to indicate something worth investigating.

We’ve also seen a few situations where negative lag actually pointed to issues in the upstream Kafka cluster, so that’s part of why we’re a bit cautious here. If we keep the current Druid behavior and treat negative lag as normal consumption, there’s a chance we might overlook real problems.

So overall, having clear and reliable metrics to signal the health of the cluster would be really helpful for us.

@wuguowei1994
Contributor Author

@kfaraz @cecemei
Both of your suggestions for improving the code are excellent, and I’m genuinely happy to keep refining it.
Give me a bit of time to rework the design.
Thank you both!

@wuguowei1994 wuguowei1994 force-pushed the hotfix/negative-lag branch 2 times, most recently from 1b89158 to 62366fd on December 9, 2025 10:36
@wuguowei1994
Contributor Author

wuguowei1994 commented Dec 15, 2025

@cecemei Sorry, I've been dealing with some personal matters recently, so my reply is a bit late. I've gone through all your reviews; they're excellent, thank you! It was also a great learning opportunity for me.

Take a look again?

Contributor

@cecemei cecemei left a comment


lgtm with one very minor nit. Also, next time, maybe try not to force-push commits but just add commits, so that we have the entire history and it's easier for the reviewer to pick up from the last review.

please wait for @kfaraz 's review on this as well. thanks for contributing!

@wuguowei1994
Contributor Author

@cecemei Fixed the issue with UT and merged the latest changes from master. Thank you very much for your support throughout the entire process!

@wuguowei1994 wuguowei1994 requested a review from kfaraz December 18, 2025 11:44
@wuguowei1994
Contributor Author

@clintropolis @kfaraz Take a look, please?

@wuguowei1994
Contributor Author

@gianm Could you help merge the code? We really need this bug fixed.

Contributor

Copilot AI left a comment


Pull request overview

Introduces an immutable, atomically-published offset snapshot to prevent inconsistent reads of “current” vs “latest/end” offsets, eliminating transient negative Kafka partition lag metrics.

Changes:

  • Added OffsetSnapshot as an immutable container for (highest ingested offsets, latest stream offsets) with null-value filtering.
  • Updated KafkaSupervisor to store and read offsets via an AtomicReference<OffsetSnapshot<...>> for consistent lag calculations.
  • Added unit tests validating OffsetSnapshot behavior (null/empty inputs, copying, filtering, immutability-by-copy).

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 7 comments.

Files reviewed:

  • indexing-service/src/main/java/org/apache/druid/indexing/seekablestream/supervisor/OffsetSnapshot.java: New immutable snapshot type to publish current+latest offsets together.
  • indexing-service/src/test/java/org/apache/druid/indexing/seekablestream/supervisor/OffsetSnapshotTest.java: New unit tests for snapshot creation/copying/filtering behavior.
  • extensions-core/kafka-indexing-service/src/main/java/org/apache/druid/indexing/kafka/supervisor/KafkaSupervisor.java: Uses an atomic snapshot for lag computations and offset reporting.


@FrankChen021
Member

@wuguowei1994 Could you resolve the comments from copilot?

@wuguowei1994
Contributor Author

@FrankChen021 It is Chinese New Year right now; please give me some time.

@FrankChen021
Member

Sure, no hurry

@wuguowei1994
Contributor Author

Our company has an urgent project that needs to go live within the next two weeks. I plan to refine and complete this round of changes after the 15th of this month. Thank you!

@FrankChen021
Member

@wuguowei1994 let me know once you complete the fix.

@wuguowei1994
Contributor Author

@FrankChen021 All the issues identified by AI have been fixed—could you take a look?

@FrankChen021 FrankChen021 merged commit 1d64896 into apache:master Mar 17, 2026
37 checks passed
@github-actions github-actions Bot added this to the 37.0.0 milestone Mar 17, 2026