
Fix negative Kafka partition lag caused by inconsistent current/latest offsets#18750

Merged
FrankChen021 merged 3 commits into apache:master from wuguowei1994:hotfix/negative-lag
Mar 17, 2026

Conversation

@wuguowei1994
Contributor

@wuguowei1994 wuguowei1994 commented Nov 18, 2025

Motivation

We operate a Druid deployment with more than 500 nodes.

In real-time ingestion scenarios, a monitoring process queries the cluster every minute to retrieve the ingest/kafka/partitionLag metric. If the lag remains unhealthy for more than five minutes, alerts are triggered.

In our production environment, this metric periodically becomes negative, even when the cluster is fully healthy. These false alerts create unnecessary operational load and frequently wake the on-call team during off-hours. At the same time, we cannot suppress negative-lag alerts entirely, since in some situations negative lag can indicate real ingestion problems.

For a large-scale, 24×7 real-time ingestion pipeline, accurate and consistent lag metrics are essential to avoid unnecessary nighttime wake-ups while still ensuring that real issues are detected promptly.


Problem Description

(screenshot: negative values in the ingest/kafka/partitionLag metric)

In the current implementation, the Druid supervisor maintains two volatile data structures:

  • The latest Kafka end_offset for each partition
  • The latest task-reported current_offset for each partition

The supervisor updates these values periodically (every 30 seconds) in two steps:

  1. Query all tasks in parallel to update current_offset.
    This step waits for all HTTP requests to complete; each request has a two-minute timeout.
  2. Query the Kafka cluster to refresh end_offset.

Meanwhile, a separate periodic task (running every minute) computes:

lag = end_offset - current_offset

Because the two updates are not atomic, intermediate inconsistent states may occur.
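The race can be sketched as follows. This is a simplified illustration of the pattern described above, not the actual Druid supervisor code; the class, field, and method names are invented for the example. The key point is that the two maps are written in separate steps, so a reader can pair a fresh current_offset with a stale end_offset.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative sketch of the racy pattern: two maps updated in separate
// steps, with a lag reader that observes them independently.
public class LagRaceSketch
{
  // Updated in Step 1, one task at a time, as HTTP responses arrive.
  static final Map<Integer, Long> currentOffsets = new ConcurrentHashMap<>();
  // Updated in Step 2, only after ALL tasks have responded.
  static final Map<Integer, Long> endOffsets = new ConcurrentHashMap<>();

  static long lagFor(int partition)
  {
    // Reads the two maps independently: no atomicity across them.
    return endOffsets.getOrDefault(partition, 0L)
           - currentOffsets.getOrDefault(partition, 0L);
  }

  public static void main(String[] args)
  {
    // Initial consistent state: end = 10000, current = 0, lag = 10000.
    endOffsets.put(0, 10000L);
    currentOffsets.put(0, 0L);
    System.out.println(lagFor(0)); // 10000

    // Step 1 finishes for this partition (current = 20000), but Step 2 is
    // still blocked on one slow task, so end_offset stays at 10000.
    currentOffsets.put(0, 20000L);
    System.out.println(lagFor(0)); // -10000: the inconsistent window
  }
}
```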

Intermediate State Leading to Negative Lag

If one task becomes heavily loaded or experiences other delays during Step 1, it may take significantly longer to return its offset. In this situation, the supervisor continues waiting for that slow task while the other tasks have already responded.

During this waiting period:

  • Many current_offset values have already been updated to new values.
  • The end_offset values remain stale because Step 2 has not executed yet.

If a monitoring request arrives in this intermediate window, the supervisor computes lag using:

  • Partially updated current_offset
  • Stale end_offset

This produces negative lag values.

This issue repeats as long as at least one task remains slow. Large clusters with many partitions and many Kafka-indexing tasks are more likely to experience this scenario.


Example Scenario

  1. Initial state: end_offset = 10000, current_offset = 0.

  2. After consumption: latest Kafka end_offset = 30000, and all tasks have consumed up to 20000.

  3. During Step 1, 49 tasks respond quickly, and their current_offset is updated to 20000.
    One task is slow, causing Step 1 to remain in the awaiting state.

  4. The in-memory end_offset stays at the old value 10000.

  5. If a metric query occurs at this point, the supervisor calculates:

    10000 - 20000 = -10000
    
  6. Because the periodic update logic repeats, this situation can persist across multiple cycles.


Proposed Changes

Replace the two volatile structures storing current_offset and end_offset with AtomicReference containers that hold both values as a single immutable state object. The supervisor will update these references as atomic units, ensuring that lag computation always observes a consistent snapshot.

This eliminates inconsistent intermediate states and prevents negative lag due to partial updates.
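A minimal sketch of this approach follows. The actual class added by the PR is OffsetSnapshot; the names and structure below are simplified for illustration. Both maps are frozen into one immutable object and swapped in with a single AtomicReference.set, so a single ref.get always yields a mutually consistent pair.

```java
import java.util.Map;
import java.util.concurrent.atomic.AtomicReference;

// Simplified sketch of the atomic-snapshot fix (illustrative names; the
// real implementation is the PR's OffsetSnapshot class).
public class SnapshotSketch
{
  static final class Snapshot
  {
    final Map<Integer, Long> currentOffsets;
    final Map<Integer, Long> endOffsets;

    Snapshot(Map<Integer, Long> current, Map<Integer, Long> end)
    {
      // Defensive immutable copies: the snapshot can never change after publish.
      this.currentOffsets = Map.copyOf(current);
      this.endOffsets = Map.copyOf(end);
    }
  }

  static final AtomicReference<Snapshot> ref =
      new AtomicReference<>(new Snapshot(Map.of(), Map.of()));

  // Writer: publish both maps as one atomic unit.
  static void publish(Map<Integer, Long> current, Map<Integer, Long> end)
  {
    ref.set(new Snapshot(current, end));
  }

  // Reader: one volatile read yields a consistent current/end pair.
  static long lagFor(int partition)
  {
    Snapshot s = ref.get();
    return s.endOffsets.getOrDefault(partition, 0L)
           - s.currentOffsets.getOrDefault(partition, 0L);
  }

  public static void main(String[] args)
  {
    publish(Map.of(0, 20000L), Map.of(0, 30000L));
    System.out.println(lagFor(0)); // 10000, never a partial-update artifact
  }
}
```

A reader can still see a snapshot that is slightly stale as a whole, but never a mix of values from two different update cycles, which is what produced the negative lag.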


Rationale

  • Ensures consistent reads between related fields.
  • No behavioral changes other than removing negative lag caused by inconsistent state.

Operational Impact

  • Improved accuracy of Kafka lag metrics in large clusters.
  • Reduces false alerts in monitoring systems.

Test Plan

  • This change does not add a new feature; we only need to ensure that existing tests still pass.
  • All current tests pass successfully.

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

@wuguowei1994 wuguowei1994 force-pushed the hotfix/negative-lag branch 2 times, most recently from 89b59bc to d1f5e4e on November 19, 2025 02:22
@wuguowei1994
Contributor Author

wuguowei1994 commented Nov 19, 2025

Our team has been suffering from these negative-lag alerts recently, and they’ve been repeatedly waking us up at night. It’s become difficult for us to get a good night’s sleep.

@kfaraz
Contributor

kfaraz commented Nov 19, 2025

@wuguowei1994 , thanks for creating a PR in Apache Druid! One of the contributors will take a look at the changes soon.

@abhishekrb19
Contributor

@wuguowei1994 could you sync the latest master to your branch? I think this patch might fix the test failures.

@wuguowei1994 wuguowei1994 force-pushed the hotfix/negative-lag branch 3 times, most recently from 7b8667f to 14a115a on November 22, 2025 02:38
@wuguowei1994
Contributor Author

wuguowei1994 commented Nov 26, 2025

ERROR [main] org.apache.druid.testing.utils.DruidClusterAdminClient - Error while waiting for [http://localhost:30400] to be ready
java.util.concurrent.ExecutionException: org.jboss.netty.channel.ChannelException: Faulty channel in resource pool

@clintropolis Thanks for rerunning the jobs. Before the rerun there were two failed tasks, and now there’s only one left. The error still looks like an environment issue… really frustrating.

This whole code submission experience also shows how inactive the project has become. A lot of people report issues in the Druid Slack, but almost no one responds anymore. It’s no surprise so many companies are migrating from Druid to ClickHouse.

@clintropolis
Member

@clintropolis Thanks for rerunning the jobs. Before the rerun there were two failed tasks, and now there’s only one left. The error still looks like an environment issue… really frustrating.

Yea, we are in the middle of a bit of a migration/overhaul of our integration test framework and processes, hopefully this will be more well behaved in the future, since part of the reason for this change is to address flakiness as well as make it much easier to write and debug these tests.

This whole code submission experience also shows how inactive the project has become. A lot of people report issues in the Druid Slack, but almost no one responds anymore. It’s no surprise so many companies are migrating from Druid to ClickHouse.

Apologies for the perception - while I can assure you that there are quite a lot of active and interesting projects happening and that Druid is still very much being actively developed, that doesn't fix your experience here or the optics in Slack. The unfortunate reality is that there are always a lot more things to do than people to do them, and while we try our best, sometimes things are slow to get a committer's attention. All that said, thanks again for the contribution and for trying to make things better. I do hear you - I'll see if I can nudge some of the other committers to be a bit more responsive, and try to do the same myself.

@wuguowei1994 wuguowei1994 force-pushed the hotfix/negative-lag branch 3 times, most recently from 3e773a2 to 74c3ac4 on November 26, 2025 12:13
@wuguowei1994
Contributor Author

@clintropolis Take a look?

@kfaraz
Contributor

kfaraz commented Dec 3, 2025

@wuguowei1994 , thanks for your patience!
We will try to get this PR reviewed soon.

Contributor

@cecemei cecemei left a comment


Thanks for making the PR, and the description really explains the problem well. Kafka ingest lag is an important metric and we also monitor it closely. I always feel a bit intimidated by this part of the code base, but your PR has helped me understand it much better.

@wuguowei1994
Contributor Author

wuguowei1994 commented Dec 4, 2025

@cecemei
I've gone through your comments thoroughly — they're great! Give me a bit of time to improve the code.

Contributor

@kfaraz kfaraz left a comment


Thanks for the changes, @wuguowei1994 ! I have left some suggestions.

While the changes in this PR make sense by reporting the lag more consistently (updating the two offsets in lockstep), I was wondering if it wouldn't be simpler to just report zero lag in case the lag turns out to be negative.

A negative record lag does mean that the task has already caught up to the last offsets that we had fetched from the topic and ingested some more records beyond that. And the lag metric is really just meant to indicate if the tasks are keeping up.

For other purposes, we have the message gap and the time lag metrics.

In fact, the negative lag could even be a feature to identify if some tasks are particularly slow in returning their offsets. 😛 , and we could probably have alerts set up if the negative lag goes below a specific threshold.
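The zero-clamp alternative suggested here can be sketched in a couple of lines. This is only an illustration of the suggestion, not what the merged PR implements; the class and method names are invented.

```java
// Illustrative sketch of the alternative: report zero lag whenever the
// computed lag goes negative (i.e. the task has already caught up past
// the last end offsets fetched from the stream).
public class ClampSketch
{
  static long reportedLag(long endOffset, long currentOffset)
  {
    // Clamp: treat "already caught up" as zero lag instead of negative.
    return Math.max(0L, endOffset - currentOffset);
  }

  public static void main(String[] args)
  {
    System.out.println(reportedLag(10000L, 20000L)); // 0 instead of -10000
    System.out.println(reportedLag(30000L, 20000L)); // 10000, unchanged
  }
}
```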


@cecemei , I think you also raised a concern regarding the possibility that the lag reported might now be higher than the actual lag, since we always fetch the stream offsets only after we have received updates from all the tasks.

I think the current code is also susceptible to reporting stale lag (higher or lower).

Say if a task were slow to return its latest ingested offsets, we would be delayed in fetching the latest offsets from the stream.
So, in that period, we would be reporting stale lag (which could have been higher or lower than the actual lag, a special case of which would be negative lag) and then as soon as we fetched the latest offsets from the stream, the reported lag would fix itself.

@cecemei
Contributor

cecemei commented Dec 4, 2025

Thanks for the changes, @wuguowei1994 ! I have left some suggestions.

While the changes in this PR make sense by reporting the lag more consistently (updating the two offsets in lockstep), I was wondering if it wouldn't be simpler to just report zero lag in case the lag turns out to be negative.

A negative record lag does mean that the task has already caught up to the last offsets that we had fetched from the topic and ingested some more records beyond that. And the lag metric is really just meant to indicate if the tasks are keeping up.

For other purposes, we have the message gap and the time lag metrics.

In fact, the negative lag could even be a feature to identify if some tasks are particularly slow in returning their offsets. 😛 , and we could probably have alerts set up if the negative lag goes below a specific threshold.

@cecemei , I think you also raised a concern regarding the possibility that the lag reported might now be higher than the actual lag, since we always fetch the stream offsets only after we have received updates from all the tasks.

I think the current code is also susceptible to reporting stale lag (higher or lower).

Say if a task were slow to return its latest ingested offsets, we would be delayed in fetching the latest offsets from the stream. So, in that period, we would be reporting stale lag (which could have been higher or lower than the actual lag, a special case of which would be negative lag) and then as soon as we fetched the latest offsets from the stream, the reported lag would fix itself.

Yes, I suspect we might be seeing a slightly higher lag after this change (compared with what was reported before). The accuracy of the lag would not be affected by when latestSequenceFromStream gets updated (more random), but only by how long fetching the offsets takes (delay inside the system and interaction with Kafka). I do think it provides more consistency than before, and the trend of the lag is more important, so it's an improvement.

For future reference, we could maybe calculate the lag on a per partition basis.

@wuguowei1994
Contributor Author

In fact, the negative lag could even be a feature to identify if some tasks are particularly slow in returning their offsets. 😛 , and we could probably have alerts set up if the negative lag goes below a specific threshold.

@kfaraz
Thanks for the clarification — that makes sense. In our case, though, we’ve noticed that negative lag in our large cluster can sometimes persist for over five minutes.

We’ve talked about this internally, and if it only happens occasionally (for example, under a minute), adjusting the alert thresholds would absolutely work for us. But when it lasts longer, it tends to indicate something worth investigating.

We’ve also seen a few situations where negative lag actually pointed to issues in the upstream Kafka cluster, so that’s part of why we’re a bit cautious here. If we keep the current Druid behavior and treat negative lag as normal consumption, there’s a chance we might overlook real problems.

So overall, having clear and reliable metrics to signal the health of the cluster would be really helpful for us.

@wuguowei1994
Contributor Author

@kfaraz @cecemei
Both of your suggestions for improving the code are excellent, and I’m genuinely happy to keep refining it.
Give me a bit of time to rework the design.
Thank you both!

@wuguowei1994 wuguowei1994 force-pushed the hotfix/negative-lag branch 2 times, most recently from 1b89158 to 62366fd on December 9, 2025 10:36
@wuguowei1994
Contributor Author

wuguowei1994 commented Dec 15, 2025

@cecemei Sorry, I've been dealing with some personal matters recently, so my reply is a bit late. I've gone through all your reviews; they're excellent, thank you! It was also a great learning opportunity for me.

Take a look again?

Contributor

@cecemei cecemei left a comment


lgtm with one very minor nit. Also, next time, maybe try not to force-push commits but just add commits, so that we have the entire history and it's easier for the reviewer to pick up from the last review.

please wait for @kfaraz 's review on this as well. thanks for contributing!

@wuguowei1994
Contributor Author

@cecemei Fixed the issue with UT and merged the latest changes from master. Thank you very much for your support throughout the entire process!

@wuguowei1994 wuguowei1994 requested a review from kfaraz December 18, 2025 11:44
@wuguowei1994
Contributor Author

@clintropolis @kfaraz Take a look, please?

@wuguowei1994
Contributor Author

@gianm Could you help merge the code? We really need this bug fixed.

Contributor

Copilot AI left a comment


Pull request overview

Introduces an immutable, atomically-published offset snapshot to prevent inconsistent reads of “current” vs “latest/end” offsets, eliminating transient negative Kafka partition lag metrics.

Changes:

  • Added OffsetSnapshot as an immutable container for (highest ingested offsets, latest stream offsets) with null-value filtering.
  • Updated KafkaSupervisor to store and read offsets via an AtomicReference<OffsetSnapshot<...>> for consistent lag calculations.
  • Added unit tests validating OffsetSnapshot behavior (null/empty inputs, copying, filtering, immutability-by-copy).

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 7 comments.

Files reviewed:

  • indexing-service/src/main/java/org/apache/druid/indexing/seekablestream/supervisor/OffsetSnapshot.java: New immutable snapshot type to publish current+latest offsets together.
  • indexing-service/src/test/java/org/apache/druid/indexing/seekablestream/supervisor/OffsetSnapshotTest.java: New unit tests for snapshot creation/copying/filtering behavior.
  • extensions-core/kafka-indexing-service/src/main/java/org/apache/druid/indexing/kafka/supervisor/KafkaSupervisor.java: Uses an atomic snapshot for lag computations and offset reporting.


@FrankChen021
Member

@wuguowei1994 Could you resolve the comments from copilot?

@wuguowei1994
Contributor Author

@FrankChen021 It is Chinese New Year right now; please give me some time.

@FrankChen021
Member

Sure, no hurry

@wuguowei1994
Contributor Author

Our company has an urgent project that needs to go live within the next two weeks. I plan to refine and complete this round of changes after the 15th of this month. Thank you!

@FrankChen021
Member

@wuguowei1994 let me know once you complete the fix.

@wuguowei1994
Contributor Author

@FrankChen021 All the issues identified by AI have been fixed—could you take a look?

@FrankChen021 FrankChen021 merged commit 1d64896 into apache:master Mar 17, 2026
37 checks passed
@github-actions github-actions Bot added this to the 37.0.0 milestone Mar 17, 2026