@BewareMyPower BewareMyPower commented Jul 29, 2025

Motivation

I recently observed a topic taking more than 30 seconds to load in a production environment after its namespace bundle ownership was transferred. From the logs and a heap dump, the conclusion is that the topic load was stuck before calling asyncOpen in createPersistentTopic0.

After revisiting the long code path of topic loading, I found several inefficient steps, and the existing pulsar_topic_load_times metric is inaccurate. This metric only measures the time from the beginning of createPersistentTopic0. However, other time-consuming tasks run before it is called.

return getTopicPoliciesBypassSystemTopic(topicName, TopicPoliciesService.GetType.LOCAL_ONLY)

It waits until the topic policies of this topic are available before inserting a topic future into BrokerService#topics. In extreme cases, there could be many concurrent TopicPoliciesService#getTopicPoliciesAsync calls.

In loadOrCreatePersistentTopic, the ownership is checked via checkTopicNsOwnership before the topic load semaphore is acquired.

This violates the semantics of the maxConcurrentTopicLoadRequest config.
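To illustrate the semaphore semantics, here is a minimal sketch in which the ownership check only runs after a permit has been acquired, so it counts against the concurrency limit. `BoundedTopicLoader` and its methods are hypothetical names for illustration, not the broker's code, and rejecting a request when no permit is free is a simplification chosen for this sketch.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

// Hypothetical helper: every step of a load, ownership check included,
// runs while holding one permit.
final class BoundedTopicLoader {
    private final Semaphore permits;

    BoundedTopicLoader(int maxConcurrentLoads) {
        this.permits = new Semaphore(maxConcurrentLoads);
    }

    <T> CompletableFuture<T> load(Supplier<CompletableFuture<Void>> ownershipCheck,
                                  Supplier<CompletableFuture<T>> createTopic) {
        // Simplification: reject instead of queueing when the limit is reached.
        if (!permits.tryAcquire()) {
            return CompletableFuture.failedFuture(
                    new IllegalStateException("too many concurrent topic loads"));
        }
        CompletableFuture<T> result;
        try {
            // The ownership check starts only after the permit is held.
            result = ownershipCheck.get().thenCompose(ignored -> createTopic.get());
        } catch (RuntimeException e) {
            result = CompletableFuture.failedFuture(e);
        }
        // Release the permit whether the load succeeded or failed.
        return result.whenComplete((topic, error) -> permits.release());
    }

    int availablePermits() {
        return permits.availablePermits();
    }
}
```

With this ordering, a slow ownership check holds a permit, so the number of in-flight checks can never exceed the configured limit.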

After acquiring the semaphore, it calls checkOwnershipAndCreatePersistentTopic, which validates the ownership again via NamespaceService#isServiceUnitActiveAsync:

pulsar.getNamespaceService().isServiceUnitActiveAsync(topicName)

Actually, the implementation of isServiceUnitActiveAsync is the same as that of checkTopicNsOwnership. The only difference is that the former returns a boolean future, while the latter returns a failed future if the ownership check fails.

Even after that, it may fetch the topic properties before createPersistentTopic0:
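The two return styles can be contrasted with a small sketch. The class and method names below are hypothetical stand-ins, not the Pulsar methods themselves; the point is only that a failure-style check can be derived from a boolean-style one.

```java
import java.util.concurrent.CompletableFuture;

// Hypothetical stand-ins for the two styles of ownership check.
final class OwnershipChecks {
    // Boolean style: completes normally with true or false.
    static CompletableFuture<Boolean> isActive(boolean owned) {
        return CompletableFuture.completedFuture(owned);
    }

    // Failure style: completes normally when owned, exceptionally otherwise.
    static CompletableFuture<Void> checkOwnership(boolean owned) {
        return isActive(owned).thenAccept(active -> {
            if (!active) {
                throw new IllegalStateException("bundle not served by this broker");
            }
        });
    }
}
```

Callers of the failure style can chain the next step directly with thenCompose and let the error propagate, without branching on a boolean.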

propertiesFuture = fetchTopicPropertiesAsync(topicName);

Modifications

Major changes:

  • Remove the getTopicPoliciesBypassSystemTopic call before inserting a topic future. This method is only used in getManagedLedgerConfig.
    • NOTE: this breaks some tests that assume BrokerService#getTopic will succeed even when the topic's bundle is not owned. These tests passed only because creating the system topic reader triggers acquisition of the bundles in the same namespace.
  • Remove the duplicated ownership validation before acquiring the topic load semaphore
  • Avoid calling getManagedLedgerConfig and fetchPartitionedTopicMetadataAsync repeatedly by executing these tasks before other tasks that depend on them.
  • Perform the validations concurrently, including checkMaxTopicsPerNamespace, checkTopicAlreadyMigrated, validateTopicConsistency.

Although many of these tasks are metadata store accesses served through MetadataCache, executing them concurrently is still more efficient.

For observability:
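As a rough illustration of running independent validations concurrently, the sketch below starts every check before waiting on any of them. `ConcurrentValidation` and `validateAll` are hypothetical names, and the suppliers stand in for checks such as those listed above.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.function.Supplier;

// Hypothetical helper: start all asynchronous validations, then wait for all.
final class ConcurrentValidation {
    static CompletableFuture<Void> validateAll(
            List<Supplier<CompletableFuture<Void>>> validations) {
        CompletableFuture<?>[] started = validations.stream()
                .map(Supplier::get) // each check is started here, before any wait
                .toArray(CompletableFuture[]::new);
        // Completes when every check has completed; fails if any check failed.
        return CompletableFuture.allOf(started);
    }
}
```

Compared with chaining the checks through thenCompose, the total latency is bounded by the slowest check instead of the sum of all of them.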

  • Include all tasks that run after acquiring the topic load semaphore in the pulsar_topic_load_times metric.
  • Add a log of the latency of getting topic policies when loading a topic. In my experience, a reader loop that uses hasMessageAvailable and readNext can perform poorly when CPU pressure is high. This read loop is also too heavy.
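One way to measure such a latency is to wrap the asynchronous step and record the elapsed time when it completes. The sketch below is an assumption about how this could be done, not the code in this PR; `LoadTimer`, `timed`, and the `recordMillis` callback are hypothetical.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.function.LongConsumer;
import java.util.function.Supplier;

// Hypothetical helper: times an asynchronous task from the moment it starts.
final class LoadTimer {
    static <T> CompletableFuture<T> timed(Supplier<CompletableFuture<T>> task,
                                          LongConsumer recordMillis) {
        final long startNanos = System.nanoTime();
        // Record on both success and failure so slow failures are visible too.
        return task.get().whenComplete((value, error) -> recordMillis.accept(
                TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - startNanos)));
    }
}
```

Starting the timer right after the semaphore is acquired, instead of at the start of the final creation step, makes the recorded value cover every task in between.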

Other changes and refactoring:

  • Pass the TopicName instance across the whole flow to avoid unnecessary conversions between TopicName and String.
  • Move the isTransactionInternalName check to the beginning.
  • Replace isServiceUnitActiveAsync with checkTopicNsOwnership and remove isServiceUnitActiveAsync so that callers can no longer use it. Having both methods makes the code hard to read.
  • Add a common method failTopicFuture to invalidate the topic cache for failures during topic loading
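The idea behind a shared failure helper can be sketched as follows. This is a simplified model under stated assumptions, not the actual BrokerService code: `TopicCache` is hypothetical, topics are modeled as strings, and only the name failTopicFuture comes from the description above.

```java
import java.util.Optional;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical, simplified model of a topic future cache.
final class TopicCache {
    private final ConcurrentHashMap<String, CompletableFuture<Optional<String>>> topics =
            new ConcurrentHashMap<>();

    // Returns the cached future for the topic, creating one if absent.
    CompletableFuture<Optional<String>> register(String topicName) {
        return topics.computeIfAbsent(topicName, name -> new CompletableFuture<>());
    }

    // Fails the future and evicts it, so a later lookup retries the load.
    void failTopicFuture(String topicName,
                         CompletableFuture<Optional<String>> future,
                         Throwable cause) {
        future.completeExceptionally(cause);
        // Remove only if the entry still maps to this exact future.
        topics.remove(topicName, future);
    }

    boolean contains(String topicName) {
        return topics.containsKey(topicName);
    }
}
```

Routing every failure path through one method keeps a failed future from lingering in the cache and being returned to later callers.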

Documentation

  • doc
  • doc-required
  • doc-not-needed
  • doc-complete

Matching PR in forked repository

PR in forked repository: BewareMyPower#50

@BewareMyPower BewareMyPower added this to the 4.1.0 milestone Jul 29, 2025
@BewareMyPower BewareMyPower self-assigned this Jul 29, 2025
@BewareMyPower BewareMyPower added the type/enhancement, area/broker, and release/4.0.7 labels Jul 29, 2025
@BewareMyPower BewareMyPower marked this pull request as draft July 29, 2025 12:56
@github-actions github-actions bot added the doc-not-needed label Jul 29, 2025
@BewareMyPower BewareMyPower force-pushed the bewaremypower/topic-policies-temp-stuck branch from 0214bd2 to 4ffe9e8 Compare July 30, 2025 03:30
@coderzc coderzc modified the milestones: 4.1.0, 4.2.0 Sep 1, 2025
@BewareMyPower
Contributor Author

I will write a PIP for the topic initialization optimization later and close this PR.

@BewareMyPower
Contributor Author

Regarding the duplicated ownership check method, I will split it into a smaller PR.

@BewareMyPower
Contributor Author

I'm splitting this PR into multiple small PRs. The first one is #24780.

@BewareMyPower
Contributor Author

2nd PR: #24785
