Skip to content

Remove consumer.listTopics() method in case when too many topics in kafka causes the FullGC in Overlord#6455

Merged
b-slim merged 7 commits intoapache:masterfrom
elloooooo:feature-improve-get-topic
Oct 22, 2018
Merged

Remove consumer.listTopics() method in case when too many topics in kafka causes the FullGC in Overlord#6455
b-slim merged 7 commits intoapache:masterfrom
elloooooo:feature-improve-get-topic

Conversation

@elloooooo
Copy link
Copy Markdown
Contributor

When there are too many topics in kafka cluster, the method consumer.listTopics() will leads to Full GC in Overlord.

@elloooooo elloooooo changed the title Remove consumer.listTopics() method in case when too many topic in kafka causes the FullGC in Overlord Remove consumer.listTopics() method in case when too many topics in kafka causes the FullGC in Overlord Oct 12, 2018
Copy link
Copy Markdown
Member

@clintropolis clintropolis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This does seem like it would be more efficient than pulling the list of all topics 👍, but Travis failures look perhaps related to this PR, maybe something to do with the removal of the lock?

}

List<PartitionInfo> partitions = topics.get(ioConfig.getTopic());
List<PartitionInfo> partitions = consumer.partitionsFor(ioConfig.getTopic());
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why did the lock and exception handling get removed? The lock seems needed, and checking the docs partitionsFor can throw a handful of exceptions too:

Throws:
WakeupException - if wakeup() is called before or while this function is called
InterruptException - if the calling thread is interrupted before or while this function is called
AuthorizationException - if not authorized to the specified topic
TimeoutException - if the topic metadata could not be fetched before expiration of the configured request timeout
KafkaException - for any other unrecoverable errors

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for reviewing this. I aggree that the lock and exception handing should not be removed. I will fix later.

@b-slim
Copy link
Copy Markdown
Contributor

b-slim commented Oct 13, 2018

Dumb question why do we even need to list all topics if the user supply the input topic? Can this operation be removed.

@b-slim
Copy link
Copy Markdown
Contributor

b-slim commented Oct 13, 2018

@elloooooo am wondering how big the topic list you guys had to see this issue in practice thanks!

@elloooooo
Copy link
Copy Markdown
Contributor Author

@b-slim Someone used to create 20000+ topics at one of our kafka clusters which supports creating topic when needed. Actually his code has bugs.

log.warn("No such topic [%s] found, list of discovered topics [%s]", ioConfig.getTopic(), topics.keySet());
}
int numPartitions = (partitions != null ? partitions.size() : 0);
int numPartitions = partitions.size();
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like partitions can be null. Should be int numPartitions = (partitions != null ? partitions.size() : 0);.

Copy link
Copy Markdown
Contributor Author

@elloooooo elloooooo Oct 15, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the partitions can be null only when consumer.partitionsFor(ioConfig.getTopic()) throw exceptions. And the exception has been catched and the function get return. So when partitions is null, there is no chance run int numPartitions = partitions.size();, is it?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the code of KafkaConsumer.partitionsFor().

    public List<PartitionInfo> partitionsFor(String topic) {
        acquire();
        try {
            Cluster cluster = this.metadata.fetch();
            List<PartitionInfo> parts = cluster.partitionsForTopic(topic);
            if (parts != null)
                return parts;

            Map<String, List<PartitionInfo>> topicMetadata = fetcher.getTopicMetadata(
                    new MetadataRequest.Builder(Collections.singletonList(topic)), requestTimeoutMs);
            return topicMetadata.get(topic);
        } finally {
            release();
        }
    }

Since topicMetadata is a map, topicMetadata.get(topic); can return null.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah~ I didn't notice that. Thank you. I will fix it.

@jihoonson
Copy link
Copy Markdown
Contributor

@elloooooo thanks for the quick fix! The Travis fail looks legit. Would you please fix it?

[ERROR] Failed to execute goal org.apache.maven.plugins:maven-compiler-plugin:3.2:testCompile (default-testCompile) on project druid-sql: Compilation failure: Compilation failure: 
[ERROR] /home/travis/build/apache/incubator-druid/sql/src/test/java/org/apache/druid/sql/calcite/util/CalciteTests.java:[588,9] incompatible types: java.lang.String cannot be converted to org.apache.druid.discovery.NodeType
[ERROR] /home/travis/build/apache/incubator-druid/sql/src/test/java/org/apache/druid/sql/calcite/util/CalciteTests.java:[585,49] cannot find symbol
[ERROR]   symbol: constructor (org.apache.druid.java.util.http.client.HttpClient,org.apache.druid.discovery.DruidNodeDiscoveryProvider,java.lang.String,java.lang.String,org.apache.druid.curator.discovery.ServerDiscoverySelector)

@jihoonson
Copy link
Copy Markdown
Contributor

@elloooooo NVM. Looks like it's caused by the bug which #6466 fixes.

@jihoonson
Copy link
Copy Markdown
Contributor

Would you please check the CI failure?

Tests run: 89, Failures: 17, Errors: 8, Skipped: 0, Time elapsed: 57.429 sec <<< FAILURE! - in org.apache.druid.indexing.kafka.supervisor.KafkaSupervisorTest
testResetNoTasks[numThreads = 1](org.apache.druid.indexing.kafka.supervisor.KafkaSupervisorTest)  Time elapsed: 0.155 sec  <<< FAILURE!
java.lang.AssertionError: 
  Unexpected method call IndexerMetadataStorageCoordinator.getDataSourceMetadata("testDS"):

@elloooooo
Copy link
Copy Markdown
Contributor Author

@jihoonson I can't figure it out why the travis-ci failed.
Actually, there is no error like this in my local test.

Running org.apache.druid.curator.announcement.AnnouncerTest
No output has been received in the last 10m0s, this potentially indicates a stalled build or something wrong with the build itself.

Could you give me some suggestions?

@b-slim
Copy link
Copy Markdown
Contributor

b-slim commented Oct 18, 2018

can you try to run druid-server tests on your local machine? seems to be a locking issue, but not sure. i have restarted the build anyway.

@gianm
Copy link
Copy Markdown
Contributor

gianm commented Oct 18, 2018

Hmm, they got stuck again but I restarted them again.

private void addSomeEvents(int numEventsPerPartition) throws Exception
{
//create topic manually
AdminUtils.createTopic(zkUtils, topic, NUM_PARTITIONS, 1, new Properties(), RackAwareMode.Enforced$.MODULE$);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, would you tell me why this is needed?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When the value of setting auto.craete.topics.enable is ture(default), consumer.partitionsFor(SOME_TOPIC) will create the SOME_TOPIC if the topic doesn't exist. This leads some UT like testXXXNoTasks() failed. So I disable it but this lead some UTs which need sending some events to kafka fail. So I create the topic in kafka manually before sending data.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Haha, it looks like there was something wrong in Github. Got it. Thanks.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry about that ~ I will try to delete the duplicate comment~

@jihoonson
Copy link
Copy Markdown
Contributor

I suspect a race problem of KafkaSupervisorTest. All tests in KafkaSupervisorTest basically executes the below code.

    supervisor.start();
    supervisor.runInternal();
    verifyAll();

However, supervisor.start() also internally calls supervisor.runInternal(). If this method is called twice or more, it can cause the unexpected method call errors reported above. This is not related to this PR anyway.

I left one question. Please take a look.

Copy link
Copy Markdown
Contributor

@jihoonson jihoonson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

@b-slim b-slim merged commit 1e82b62 into apache:master Oct 22, 2018
@elloooooo elloooooo deleted the feature-improve-get-topic branch October 26, 2018 02:02
gianm pushed a commit to implydata/druid-public that referenced this pull request Nov 16, 2018
…afka causes the FullGC in Overlord (apache#6455)

* remove consumer.listTopics() method

* add consumerLock and exception handling for consumer.partitionFor() and remove some useless checks

* add check in case consumer.partitionsFor() returns null

* fix CI failure

* fix failed UT

* Revert "fix CI failure"

This reverts commit f839d09.

* revert unless commit and re-commit the useful part to fix failed UT
@jon-wei jon-wei added this to the 0.14.0 milestone Feb 20, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants