[branch-2.10] [fix][broker] Fix isolated group not work#21097

Closed
horizonzy wants to merge 1026 commits into apache:master from
horizonzy:branch-2.10-fix-isolated-group-not-work

Conversation

@horizonzy

Motivation

This is a cherry-pick commit for #21096.

Documentation

- [ ] `doc`
- [ ] `doc-required`
- [ ] `doc-not-needed`
- [ ] `doc-complete`

Matching PR in forked repository

PR in forked repository:

zymap and others added 30 commits February 6, 2023 14:47
…tent topic timeout (apache#19454)

Co-authored-by: Tao Jiuming <95597048+tjiuming@users.noreply.github.com>
…ache#12615)" (apache#19439)

This reverts commit 62e2547.

### Motivation

The motivation for apache#12615 relies on an incorrect understanding of Netty's threading model. `ctx.executor()` is the channel's event loop thread, the same thread used to process inbound messages. The `waitingForPingResponse` variable is only ever updated and read from that event loop, so there is no need to make the variable `volatile`.
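To illustrate the threading argument, here is a minimal sketch using a plain single-threaded executor in place of Netty's actual `EventLoop`; the class and member names below are hypothetical, not Pulsar's real code:

```java
import java.util.concurrent.*;

// Hypothetical sketch: a field confined to a single-threaded executor,
// mirroring how Netty pins a channel's handlers to one event loop thread.
// Because every read and write happens on that one thread, the Java memory
// model guarantees visibility without `volatile`.
class PingState {
    private final ExecutorService eventLoop = Executors.newSingleThreadExecutor(r -> {
        Thread t = new Thread(r, "event-loop");
        t.setDaemon(true);
        return t;
    });

    private boolean waitingForPingResponse; // no volatile: only touched on eventLoop

    void sendPing()       { eventLoop.execute(() -> waitingForPingResponse = true); }
    void onPingResponse() { eventLoop.execute(() -> waitingForPingResponse = false); }

    // Reads also go through the event loop, preserving single-thread confinement.
    Future<Boolean> isWaiting() { return eventLoop.submit(() -> waitingForPingResponse); }
}
```

The `volatile` keyword would only be needed if another thread read or wrote the field directly, which is exactly what the confinement above rules out.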

### Modifications

* Remove `volatile` keyword for `waitingForPingResponse`

### Verifying this change

Read through all references to the variable.

### Documentation

- [x] `doc-not-needed`

### Matching PR in forked repository

PR in forked repository: Skipping for this trivial PR.

(cherry picked from commit fb28d83)
…ot aware rack info problem. (apache#18672)

(cherry picked from commit 43335fb)
… txn race condition. (apache#19201)

Fixes apache#19200

A transaction that lasts a long time and is never aborted prevents the transaction buffer's `maxReadPosition` from moving forward, so no new snapshot is taken. With an old snapshot, the transaction buffer has to read a large number of entries during recovery. In the worst cases, topics were unavailable for 30 minutes.

The fix is to avoid concurrent execution.

(cherry picked from commit 96f4161)
When a Pulsar topic is unloaded from a broker, certain metrics related to that topic will appear to remain active for the broker for 5 minutes. This is confusing for troubleshooting because it makes the topic appear to be owned by multiple brokers for a short period of time. See below for a way to reproduce this behavior.

In order to solve this "zombie" metric problem, I propose we remove the timestamps that get exported with each Prometheus metric served by the broker.

Since we introduced Prometheus metrics in apache#294, we have exported a timestamp along with most metrics. This is an optional, valid part of the spec defined [here](https://prometheus.io/docs/instrumenting/exposition_formats/#comments-help-text-and-type-information). However, after our adoption of Prometheus metrics, the Prometheus project released version 2.0 with a significant improvement to its concept of staleness. In short, before 2.0, a metric that was in the last scrape but not the next one (this often happens for topics that are unloaded) will essentially inherit the most recent value for the last 5 minute window. If there isn't one in the past 5 minutes, the metric becomes "stale" and isn't reported. Starting in 2.0, there was new logic to consider a value stale the very first time that it is not reported in a scrape. Importantly, this new behavior is only available if you do not export timestamps with metrics, as documented here: https://prometheus.io/docs/prometheus/latest/querying/basics/#staleness. We want to use the new behavior because it gives better insight into all topic metrics, which are subject to move between brokers at any time.

This presentation https://www.youtube.com/watch?v=GcTzd2CLH7I and slide deck https://promcon.io/2017-munich/slides/staleness-in-prometheus-2-0.pdf document the feature in detail. This blog post was also helpful: https://www.robustperception.io/staleness-and-promql/.

Additional motivation comes from mailing list threads like this one https://groups.google.com/g/prometheus-users/c/8OFAwp1OEcY. It says:

> Note, however, that adding timestamps is an extremely niche use case. Most of the users who think they need it should actually not do it.
>
> The main usecases within that tiny niche are federation and mirroring the data from another monitoring system.

The Prometheus Go client also indicates a similar motivation: https://pkg.go.dev/github.com/prometheus/client_golang/prometheus#NewMetricWithTimestamp.

The OpenMetrics project also recommends against exporting timestamps: https://github.com/OpenObservability/OpenMetrics/blob/main/specification/OpenMetrics.md#exposing-timestamps.

As such, I think we are not a niche use case, and we should not add timestamps to our metrics.
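Concretely, the change drops the trailing millisecond timestamp from each exposed sample. The metric value, topic name, and timestamp below are made up for illustration:

```
# Before: a timestamp is exported, so Prometheus falls back to the pre-2.0
# 5-minute lookback staleness rule
pulsar_in_messages_total{topic="persistent://public/default/t"} 42 1693000000000

# After: no timestamp, so Prometheus 2.x marks the series stale on the first
# scrape that omits it
pulsar_in_messages_total{topic="persistent://public/default/t"} 42
```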

1. Run any 2.x version of Prometheus (I used 2.31.0) along with the following scrape config:
```yaml
  - job_name: broker
    honor_timestamps: true
    scrape_interval: 30s
    scrape_timeout: 10s
    metrics_path: /metrics
    scheme: http
    follow_redirects: true
    static_configs:
      - targets: ["localhost:8080"]
```
2. Start pulsar standalone on the same machine. I used a recently compiled version of master.
3. Publish messages to a topic.
4. Observe the `pulsar_in_messages_total` metric for the topic in the Prometheus UI (localhost:9090)
5. Stop the producer.
6. Unload the topic from the broker.
7. Optionally, `curl` the metrics endpoint to verify that the topic’s `pulsar_in_messages_total` metric is no longer reported.
8. Watch the metrics get reported in prometheus for 5 additional minutes.

When you set `honor_timestamps: false`, the metric stops getting reported right after the topic is unloaded, which is the desired behavior.

* Remove all timestamps from metrics
* Fix affected tests and test files (some of those tests were in the proxy and the function worker, but no code was changed for those modules)

This change is accompanied by updated tests.

This is technically a breaking change to the metrics, though I would consider it a bug fix at this point. I will discuss it on the mailing list to ensure it gets proper visibility.

Given how frequently Pulsar changes which metrics are exposed between each scrape, I think this is an important fix that should be cherry picked to older release branches. Technically, we can avoid cherry picking this change if we advise users to set `honor_timestamps: false`. However, I think it is better to just remove them.

- [x] `doc-not-needed`

(cherry picked from commit 0bbc4e1)
…19412)

Signed-off-by: Zixuan Liu <nodeces@gmail.com>
(cherry picked from commit 016e7f0)
Fixes: apache#19478

### Motivation

See issue for additional context. Essentially, we are doing a shallow clone when we needed a deep clone. The consequence is leaked labels, annotations, and tolerations.

### Modifications

* Add a `deepClone` method to the `BasicKubernetesManifestCustomizer.RuntimeOpts` class. Note that this method is not technically a deep clone of the k8s objects. However, based on the way we "merge" these objects, it is sufficient to copy references to them.
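A minimal sketch of the bug and the fix; the class below is a hypothetical, heavily simplified stand-in (the real `RuntimeOpts` also carries annotations and tolerations):

```java
import java.util.*;

// Hypothetical, simplified stand-in for RuntimeOpts: a shallow clone copies
// the reference to the nested map, so a later "merge" into the clone mutates
// the original and leaks labels across deployments.
class RuntimeOpts {
    Map<String, String> labels = new HashMap<>();

    RuntimeOpts shallowClone() {
        RuntimeOpts copy = new RuntimeOpts();
        copy.labels = this.labels;                 // shared reference: leaks
        return copy;
    }

    RuntimeOpts deepClone() {
        RuntimeOpts copy = new RuntimeOpts();
        copy.labels = new HashMap<>(this.labels);  // fresh map: merge stays isolated
        return copy;
    }
}
```

Copying one level deep is enough here because the merge step only mutates the top-level collections, not the k8s objects stored inside them.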

### Verifying this change

Added a test that fails before the change and passes afterwards.

### Documentation

- [x] `doc-not-needed`

This is an internal bug fix. No docs needed.

### Matching PR in forked repository

PR in forked repository: michaeljmarshall#27

(cherry picked from commit 0205148)
Fixes apache#19431

`authenticationData` is already `volatile`. We use `originalAuthData` when set, so we should match the style. In apache#19431, I proposed that we find a way to not use `volatile`. I still think this might be a "better" approach, but it will be a larger change, and since we already use `volatile` for `authenticationData`, I think this is the right change for now.

It's possible that this is not a bug, given that the `originalAuthData` does not change frequently. However, we always want to use up to date values for authorization.

* Add `volatile` keyword to `ServerCnx#originalAuthData`.

This change is a trivial rework / code cleanup without any test coverage.

- [x] `doc-not-needed`

PR in forked repository: skipping test in fork.

(cherry picked from commit c4c1744)
(cherry picked from commit e1d9941)
I broke all release branches when I cherry picked 2847dd1 to them. This change takes some of the underlying logic from apache#19409, without taking the async logic.

* Make changes to `ServerCnx` to make tests pass

Tests are currently failing, so passing tests will show that this solution is correct.

- [x] `doc-not-needed`

(cherry picked from commit 8246da2)
(cherry picked from commit 15e4198)
shibd and others added 27 commits July 18, 2023 15:50
…egistered if there has no message was sent (apache#20888)

Motivation: In the replication scenario, we want to produce messages on the native cluster and consume them on the remote cluster. The producer and consumer both use the same schema, but the consumer cannot be registered if no messages have been sent to the topic yet. The root cause is that, on the remote cluster, a producer has already been registered with the `AUTO_PRODUCE_BYTES` schema, so there is no schema against which to check compatibility.

Modifications: If there is no schema and only the replicator producer was registered, skip the compatibility check.
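A sketch of the decision (hypothetical names; Pulsar's actual schema-compatibility code path is more involved):

```java
import java.util.*;

// Hypothetical sketch: allow a consumer to register its schema when the topic
// has no schema yet and the only registered producers are replicators using
// AUTO_PRODUCE_BYTES (which carry no schema to check against).
class SchemaCompatibility {
    static class Producer {
        final boolean isReplicator;
        Producer(boolean isReplicator) { this.isReplicator = isReplicator; }
    }

    static boolean skipCompatibilityCheck(Object existingSchema, List<Producer> producers) {
        boolean onlyReplicators = producers.stream().allMatch(p -> p.isReplicator);
        return existingSchema == null && onlyReplicators;
    }
}
```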
(cherry picked from commit 9be0b52)
- The task `trim ledgers` runs in the thread `BkMainThreadPool.choose(ledgerName)`
- The task `write entries to BK` runs in the thread `BkMainThreadPool.choose(ledgerId)`

So the two tasks above may run concurrently.

The task `trim ledgers` work as the flow below:
- find the ledgers that no longer need to be read; the result is `{Ledgers before the slowest read}`.
- check if the `{Ledgers before the slowest read}` is out of retention policy, the result is `{Ledgers to be deleted}`.
  - if the create time of the ledger is earlier than the earliest retention time, mark it for deletion
  - if, after deleting this ledger, the remaining ledgers still exceed the retention size, mark it for deletion
- delete the `{Ledgers to be deleted}`

**(Highlight)** There is a scenario that causes the `trim ledgers` task to delete a non-contiguous set of ledgers, resulting in a gap in the messages consumers receive:
- context:
  - ledgers: `[{id=1, size=100}, {id=2,size=100}]`
  - retention size: 150
  - no cursor there
- Check `ledger 1`, skip by retention check `(200 - 100) < 150`
- One in-flight writing is finished, the `calculateTotalSizeWrited()` would return `300` now.
- Check `ledger 2`, retention check `(300 - 100) > 150`, mark `ledger-2` for deletion.
- Delete the `ledger 2`.
- Create a new consumer. It will receive messages from `[ledger-1, ledger-3]`, but `ledger-2` will be skipped.

The fix: once the retention constraint has been met, break the loop, so only a contiguous prefix of ledgers is ever deleted.
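The fix can be sketched as an oldest-first scan that stops at the first retained ledger (hypothetical and heavily simplified; the retention check mirrors the scenario above):

```java
import java.util.*;

// Hypothetical, simplified retention pass: ledgers are scanned oldest-first,
// and the scan breaks at the first ledger that must be retained. The deleted
// set is therefore always a contiguous prefix, so consumers never see a gap.
class RetentionTrimmer {
    static List<Long> ledgersToDelete(List<long[]> ledgers /* {id, size} */,
                                      long totalSize, long retentionSize) {
        List<Long> toDelete = new ArrayList<>();
        long remaining = totalSize;
        for (long[] ledger : ledgers) {
            if (remaining - ledger[1] < retentionSize) {
                break; // retention constraint met: stop instead of skipping ahead
            }
            toDelete.add(ledger[0]);
            remaining -= ledger[1];
        }
        return toDelete;
    }
}
```

With the scenario above (`ledgers = [{1,100},{2,100}]`, total size 200, retention 150), `ledger-1` fails the retention check and the loop stops, so `ledger-2` is never considered on its own.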

(cherry picked from commit 782e91f)
The C++ and Python clients are no longer maintained in the main repo.
Fixes: apache#20997

Update the expired certs to get tests passing.

* Update all certs. See `README.md` in files for detailed steps.

This change is covered by tests.

- [x] `doc-not-needed`

(cherry picked from commit d6734b7)
…r. (apache#20880)

Main Issue: apache#20851
### Motivation
When the protocol version does not allow us to send `TcClientConnectRequest` to the broker, we should add a log to make this easier to debug.

### Modifications

Add a warning log.
(cherry picked from commit d06cda6)
(cherry picked from commit c644849)
…pache#21035)

Motivation: After [PIP-118: reconnect broker when ZooKeeper session expires](apache#13341), the broker no longer shuts down after losing its connection to the local metadata store in the default configuration. However, before the ZK client reconnects, BK online/offline events are lost, leaving incorrect BK info in memory. You can reproduce the issue with the test `BkEnsemblesChaosTest.testBookieInfoIsCorrectEvenIfLostNotificationDueToZKClientReconnect` (it reproduces roughly 90% of the time; run it again if the issue does not occur).

Modifications: Refresh BK info in memory after the ZK client is reconnected.
(cherry picked from commit db20035)
…ageId read reaches lastReadId (apache#20988)

(cherry picked from commit 9e2195c)
@github-actions

@horizonzy Please add the following content to your PR description and select a checkbox:

- [ ] `doc` <!-- Your PR contains doc changes -->
- [ ] `doc-required` <!-- Your PR changes impact docs and you will update later -->
- [ ] `doc-not-needed` <!-- Your PR changes do not impact docs -->
- [ ] `doc-complete` <!-- Docs have been already added -->

@horizonzy horizonzy closed this Aug 30, 2023