@rdhabalia
Contributor

Motivation

This will fix #6054. The problem occurs on a topic with batched messages whose subscription has multiple consumers: if some of those consumers receive more messages than their available permits (due to unequal batch distribution, which will be fixed by #7266) and are slow, totalAvailablePermits becomes negative for that subscription and the broker stops dispatching messages to the consumers that still have permits.

# slow consumers with negative permits that stopped consuming messages
          "msgRateOut": 0,
          "msgThroughputOut": 0,
          "bytesOutCounter": 2993076140,
          "msgOutCounter": 325043,
          "msgRateRedeliver": 0,
          "chuckedMessageRate": 0,
          "chunkedMessageRate": 0,
          "consumerName": "bee1d",
          "availablePermits": -43,
          "unackedMessages": 1089,
          "msgRateOut": 0,
          "msgThroughputOut": 0,
          "bytesOutCounter": 3021328186,
          "msgOutCounter": 326508,
          "msgRateRedeliver": 0,
          "chuckedMessageRate": 0,
          "chunkedMessageRate": 0,
          "consumerName": "57f72",
          "availablePermits": -8,
          "unackedMessages": 1233,

# consumer with permits that doesn't receive messages from the broker
          "msgRateOut": 0,
          "msgThroughputOut": 0,
          "bytesOutCounter": 2960672336,
          "msgOutCounter": 321930,
          "msgRateRedeliver": 0,
          "chuckedMessageRate": 0,
          "chunkedMessageRate": 0,
          "consumerName": "fdc38",
          "availablePermits": 70,
          "unackedMessages": 1213,
          "avgMessagesPerEntry": 11,
          "blockedConsumerOnUnackedMsgs": false,
          "lastAckedTimestamp": 1619577828969,
          "lastConsumedTimestamp": 1619577832806,
          "metadata": {
            "instance_id": "0",
            "application": "pulsar-function",
            "instance_hostname": "fab08.xyz.com",
            "id": "amplitude/processing/siege-filter"
          },
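
To make the permit arithmetic concrete, here is a minimal sketch of how a consumer can end up with negative permits. It assumes, for illustration, that permits are counted per message while the broker hands over whole entries, so an entry carrying a batch larger than the remaining permits overdraws them. The class and method names are hypothetical and are not Pulsar APIs; the batch size of 11 is borrowed from the avgMessagesPerEntry value in the stats above, and the starting permit count is made up.

```java
// Illustrative sketch only: hypothetical names, not Pulsar's broker code.
final class BatchPermitSketch {

    // Assumption: permits are counted per message, but an entry (one batch) is sent whole,
    // so the whole batch size is subtracted even when fewer permits remain.
    static int permitsAfterEntry(int availablePermits, int messagesInEntry) {
        return availablePermits - messagesInEntry;
    }

    public static void main(String[] args) {
        int permits = 5;          // made-up starting value
        int messagesInEntry = 11; // batch size taken from avgMessagesPerEntry in the stats
        System.out.println(permitsAfterEntry(permits, messagesInEntry)); // prints -6
    }
}
```

If the consumer is slow to ask for more permits, that deficit persists and drags down the subscription-wide sum.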

Modification

The broker should dispatch messages if any consumer of the subscription has permits available, so that a stuck or slow consumer doesn't block the other healthy consumers.
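
As a rough sketch of the idea, the two gates below contrast a check based only on the subscription-wide permit sum with one that also accepts a single consumer holding permits. This is a simplified model under stated assumptions, not the dispatcher's actual code: the class, the method names, and the permit values are all hypothetical, and only the `Math.max(total, firstAvailable)` shape is borrowed from the change in this PR.

```java
import java.util.List;

// Illustrative sketch only: hypothetical names, not Pulsar's dispatcher API.
final class DispatchGateSketch {

    // Sum of per-consumer permits, standing in for totalAvailablePermits.
    static int totalPermits(List<Integer> consumerPermits) {
        int total = 0;
        for (int permits : consumerPermits) {
            total += permits;
        }
        return total;
    }

    // Permits of the first consumer that has any, or 0 if none does.
    static int firstAvailablePermits(List<Integer> consumerPermits) {
        for (int permits : consumerPermits) {
            if (permits > 0) {
                return permits;
            }
        }
        return 0;
    }

    // Before: the subscription-wide sum must be positive.
    static boolean gateBefore(List<Integer> consumerPermits) {
        return totalPermits(consumerPermits) > 0
                && firstAvailablePermits(consumerPermits) > 0;
    }

    // After: one consumer with permits is enough, even if the sum is negative.
    static boolean gateAfter(List<Integer> consumerPermits) {
        int first = firstAvailablePermits(consumerPermits);
        return Math.max(totalPermits(consumerPermits), first) > 0 && first > 0;
    }

    public static void main(String[] args) {
        // Made-up values: two slow consumers in deficit, one healthy consumer.
        List<Integer> permits = List.of(-60, -25, 70);
        System.out.println(gateBefore(permits)); // prints false (sum is -15)
        System.out.println(gateAfter(permits));  // prints true (one consumer has 70)
    }
}
```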

Result

It should fix #6054

cc @devinbost

@lhotari
Member

lhotari commented Apr 28, 2021

Thank you @rdhabalia for working on a fix.
Would you be able to review #10413 since that is also working towards fixing issues in the same area?

- while (entriesToDispatch > 0 && totalAvailablePermits > 0 && isAtleastOneConsumerAvailable()) {
+ int firstAvailableConsumerPermits = getFirstAvailableConsumerPermits();
+ int currentTotalAvailablePermits = Math.max(totalAvailablePermits, firstAvailableConsumerPermits);
+ while (entriesToDispatch > 0 && currentTotalAvailablePermits > 0 && firstAvailableConsumerPermits > 0) {
Member

Isn't there a difference in behavior here, now that firstAvailableConsumerPermits > 0 is evaluated once before the while loop, whereas previously isAtleastOneConsumerAvailable() was evaluated each time the while loop checked its condition?

Contributor Author

Thanks for catching it. Yes, I forgot to push the commit. Fixed now.
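
The concern raised in this thread can be illustrated with a small sketch: a loop that tests a value captured once before the loop keeps running after the underlying permits are exhausted, while a loop that re-reads the permits on every iteration stops in time. The names are hypothetical and the model is deliberately simplified (each dispatched entry is assumed to cost exactly one permit); it is not the dispatcher's actual loop.

```java
// Illustrative sketch only: hypothetical names, not Pulsar's dispatcher.
final class LoopConditionSketch {

    // The permit count is copied once; the loop condition then tests a stale copy.
    static int dispatchWithStaleCheck(int entriesToDispatch, int permits) {
        int dispatched = 0;
        int firstAvailableConsumerPermits = permits; // evaluated once, never refreshed
        while (entriesToDispatch > 0 && firstAvailableConsumerPermits > 0) {
            permits--; // the real permits drop, but the copy above does not
            entriesToDispatch--;
            dispatched++;
        }
        return dispatched;
    }

    // The live permit count is re-checked on every iteration.
    static int dispatchWithFreshCheck(int entriesToDispatch, int permits) {
        int dispatched = 0;
        while (entriesToDispatch > 0 && permits > 0) {
            permits--;
            entriesToDispatch--;
            dispatched++;
        }
        return dispatched;
    }

    public static void main(String[] args) {
        // Five entries to send, but only two permits available.
        System.out.println(dispatchWithStaleCheck(5, 2)); // prints 5 (over-dispatches)
        System.out.println(dispatchWithFreshCheck(5, 2)); // prints 2
    }
}
```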

@rdhabalia rdhabalia force-pushed the batch_permit branch 3 times, most recently from f70352b to 4fad5f5 Compare April 28, 2021 15:54
Member

@lhotari lhotari left a comment

LGTM

@devinbost
Contributor

/pulsarbot run-failure-checks

@devinbost
Contributor

/pulsarbot run-failure-checks

@devinbost
Contributor

@rdhabalia Please update the description of this issue so it doesn't automatically close #6054 since we determined that this PR is only a partial fix for that issue. I'll update the issue with our latest findings.

@devinbost
Contributor

/pulsarbot run-failure-checks

@devinbost
Contributor

BTW, to any reviewers, I have tested this branch on a custom build of Pulsar, and it works as expected.

@rdhabalia rdhabalia merged commit 3550f2e into apache:master May 11, 2021
@merlimat merlimat added the type/bug label May 12, 2021
eolivelli pushed a commit that referenced this pull request May 13, 2021
* [pulsar-broker] Dispatch messaages to consumer with permits

* move test

(cherry picked from commit 3550f2e)

Labels

area/broker, type/bug

Development

Successfully merging this pull request may close these issues.

Catastrophic frequent random subscription freezes, especially on high-traffic topics.

4 participants