
use producer to publish message for metrics collected by pulsar #228

Merged: BewareMyPower merged 5 commits into streamnative:master from dockerzhang:producer-publish on Nov 13, 2020

Conversation

@dockerzhang (Contributor)

This solves #226, and the patch improves producing performance by about 5 times.

@BewareMyPower (Collaborator) commented Nov 12, 2020

The main issue is that a Pulsar consumer cannot consume messages produced by a Kafka producer. It looks like the original implementation does some extra work.

The following is just an experiment. Let's change doPublishMessages to

                        // >1. the old way
                        persistentTopic.publishMessage(
                                headerAndPayload,
                                MessagePublishContext.get(
                                        offsetFuture, persistentTopic, System.nanoTime()));
                        // >2. the new way
                        topicManager.registerProducerInPersistentTopic(topic.toString(), persistentTopic);
                        topicManager.getReferenceProducer(topic.toString()).publishMessage(0, 0,
                                headerAndPayload, size, false);
                        offsetFuture.complete(Long.valueOf(size));

so that the same message is produced twice.

And change testKafkaProducePulsarConsumeMessageOrder to receive each message twice:

            msg = consumer.receive(1000, TimeUnit.MILLISECONDS);
            assertNotNull(msg);
            msg = consumer.receive(1000, TimeUnit.MILLISECONDS);
            assertNotNull(msg);

Then the unit test passed even if >1 and >2 were swapped. So maybe we need to figure out what extra work persistentTopic.publishMessage does.

By the way, the NPE at Producer$MessagePublishContext.completed(Producer.java:377) seems not to be related to this issue.

@BewareMyPower (Collaborator)

It looks like the original implementation, with a stats update added, could solve the problem:

                        topicManager.getReferenceProducer(topic.toString())
                                .getTopic().incrementPublishCount(size, headerAndPayload.readableBytes());
                        persistentTopic.publishMessage(
                                headerAndPayload,
                                MessagePublishContext.get(
                                        offsetFuture, persistentTopic, System.nanoTime()));

In addition, once the problem is solved, a unit test should be added to verify that the stats have been updated (a minimal sketch follows the list below):

  1. Use a Kafka producer to produce some messages;
  2. Use the Pulsar admin client to get the topic stats and verify bytesInCounter and msgInCounter.
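
A minimal sketch of such a test, assuming TestNG and an embedded broker like the existing KoP tests; newKafkaProducerProperties() is a hypothetical helper that configures the KoP endpoint and string serializers, and admin is the PulsarAdmin client from the test base:

    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.pulsar.common.policies.data.TopicStats;
    import org.testng.annotations.Test;
    import static org.testng.Assert.assertEquals;
    import static org.testng.Assert.assertTrue;

    @Test(timeOut = 20000)
    public void testPublishStatsUpdated() throws Exception {
        final String kafkaTopic = "test-publish-stats";
        final String pulsarTopic = "persistent://public/default/" + kafkaTopic;
        final int numMessages = 10;

        // 1. Use a Kafka producer to produce some messages (send synchronously).
        try (KafkaProducer<String, String> producer =
                new KafkaProducer<>(newKafkaProducerProperties())) {
            for (int i = 0; i < numMessages; i++) {
                producer.send(new ProducerRecord<>(kafkaTopic, "msg-" + i)).get();
            }
        }

        // 2. Use the Pulsar admin client to verify the publish stats were updated.
        TopicStats stats = admin.topics().getStats(pulsarTopic);
        assertEquals(stats.msgInCounter, numMessages);
        assertTrue(stats.bytesInCounter > 0);
    }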

@BewareMyPower (Collaborator)

Also, I found a possible reason. It may still be related to the NPE at Producer$MessagePublishContext.completed(Producer.java:377).

When service.Producer#publishMessage is used, it eventually calls the same method, PersistentTopic#publishMessage:

    public void publishMessage(ByteBuf headersAndPayload, PublishContext publishContext) {
        pendingWriteOps.incrementAndGet(); // [1] pendingWriteOps++ here; it must be decremented later
        /* ... */
        switch (status) {
            case NotDup:
                // this is a PersistentTopic instance that has implemented addComplete/addFailed methods
                ledger.asyncAddEntry(headersAndPayload, this, publishContext);
                break;
            /* other cases... */
        }
    }

The only difference is the publishContext. The original implementation uses io.streamnative.pulsar.handlers.kop.MessagePublishContext, whose completed method only updates the publish latency and completes the future.
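
For reference, a minimal sketch of that behavior (assumed, not a verbatim copy of the KoP source; offsetFuture, topic, and startTimeNs are the fields passed to MessagePublishContext.get above, and MessageIdUtils.getOffset is KoP's position-to-offset helper):

    @Override
    public void completed(Exception exception, long ledgerId, long entryId) {
        if (exception != null) {
            offsetFuture.completeExceptionally(exception);
        } else {
            // only update the publish latency stats...
            topic.recordAddLatency(System.nanoTime() - startTimeNs, TimeUnit.NANOSECONDS);
            // ...and complete the future with the offset parsed from (ledgerId, entryId)
            offsetFuture.complete(MessageIdUtils.getOffset(ledgerId, entryId));
        }
    }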

Producer#publishMessage, by contrast, uses Producer.MessagePublishContext, whose completed method is:

        public void completed(Exception exception, long ledgerId, long entryId) {
            if (exception != null) {
                /* ... */
            } else {
                /* ... */
                this.ledgerId = ledgerId;
                this.entryId = entryId;
                // NPE happens here, producer.cnx.ctx() == null
                producer.cnx.ctx().channel().eventLoop().execute(this);
            }
        }

And see PersistentTopic#addComplete:

    public void addComplete(Position pos, Object ctx) {
        PublishContext publishContext = (PublishContext) ctx;
        PositionImpl position = (PositionImpl) pos;

        messageDeduplication.recordMessagePersisted(publishContext, position);
        // NPE thrown from Producer.MessagePublishContext#completed, so the following
        // pendingWriteOps-- would be skipped.
        publishContext.completed(null, position.getLedgerId(), position.getEntryId());
        // [2] pendingWriteOps-- here, matching the ++ in publishMessage
        decrementPendingWriteOpsAndCheck();
    }

I'm not sure whether the reference count leak is the cause, but the NPE here is a real issue. See the example stack trace:

java.lang.NullPointerException: null
	at org.apache.pulsar.broker.service.Producer$MessagePublishContext.completed(Producer.java:377) ~[pulsar-broker-2.6.2.jar:2.6.2]
	at org.apache.pulsar.broker.service.persistent.PersistentTopic.addComplete(PersistentTopic.java:370) ~[pulsar-broker-2.6.2.jar:2.6.2]
	at org.apache.bookkeeper.mledger.impl.OpAddEntry.safeRun(OpAddEntry.java:192) ~[managed-ledger-2.6.2.jar:2.6.2]

@BewareMyPower (Collaborator)

Add an explanation for

> improves producing performance by about 5 times.

After discussing with @dockerzhang, the reason is that the original commit used Producer#publishMessage, whose default PublishContext doesn't parse the MessageId but simply sets the offset to 0. (The parsed offset is not used anywhere right now; see #230.)
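
For context, the offset parsing that the default PublishContext skips amounts to packing the persisted position into a single long. An illustrative version (the bit split is an assumption modeled on KoP's MessageIdUtils, not a verbatim copy):

    // Pack (ledgerId, entryId) into one Kafka-style long offset; the 28-bit
    // entry field is an assumption for illustration.
    public static long getOffset(long ledgerId, long entryId) {
        return (ledgerId << 28) | entryId;
    }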

BewareMyPower merged commit b8528a3 into streamnative:master on Nov 13, 2020.

Linked issue (may be closed by this PR): [BUG] the producer metrics of kop topic is not changed when producing
