
Conversation

@dlg99 (Contributor) commented Mar 6, 2021

Motivation

Provide a way to use a Kafka Connect sink as a Pulsar sink, for cases such as:

  • a company has a custom Kafka sink and wants to try out Pulsar
  • no corresponding Pulsar sink exists
  • etc.

Modifications

Added "kafka producer" that uses kafka sink to dump data to the 3rd system bypassing kafka.
Added configuration options (kafkaConnectorSinkClass, kafkaConnectorConfigProperties) for the pulsar-kafka sink
Split pulsar-io/kafka module into pulsar-io/kafka (builds jar) and pulsar-io/kafka-nar (builds nar)
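
A rough, illustrative sketch of the wrapping idea (the class and method names below are assumptions for illustration, not the PR's actual implementation):

import java.util.Collections;
import java.util.Map;
import org.apache.kafka.connect.data.Schema;
import org.apache.kafka.connect.sink.SinkConnector;
import org.apache.kafka.connect.sink.SinkRecord;
import org.apache.kafka.connect.sink.SinkTask;

// Hypothetical wrapper: load the class named by kafkaConnectorSinkClass,
// start its task, and feed it Pulsar payloads as Kafka Connect SinkRecords.
public class KafkaConnectSinkSketch {
    private SinkConnector connector;
    private SinkTask task;

    public void open(String sinkClass, Map<String, String> connectorConfig) throws Exception {
        connector = (SinkConnector) Class.forName(sinkClass).getDeclaredConstructor().newInstance();
        connector.start(connectorConfig); // kafkaConnectorConfigProperties
        task = (SinkTask) connector.taskClass().getDeclaredConstructor().newInstance();
        task.start(connector.taskConfigs(1).get(0));
    }

    public void write(String topic, byte[] value, long offset) {
        // No Kafka broker involved: the record goes straight to the Connect sink task.
        SinkRecord record = new SinkRecord(topic, 0,
                Schema.STRING_SCHEMA, null,
                Schema.BYTES_SCHEMA, value,
                offset);
        task.put(Collections.singletonList(record));
    }

    public void close() {
        task.stop();
        connector.stop();
    }
}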

Verifying this change

  • Make sure that the change passes the CI checks.

This change added tests and can be verified as follows:

Added a unit test.
Tested locally as follows:

Ran Pulsar standalone.
Built pulsar-io/kafka-nar with mvn clean package -DskipTests -P packageKafkaConnect to include Kafka's connect-file sink in the nar.
Ran the test nar as

bin/pulsar-admin sinks localrun -a ~/pulsar-io-kafka-nar-2.8.0-SNAPSHOT.nar --name kwrap --namespace public/default/ktest --parallelism 1 -i my-topic --sink-config-file ~/sink.yaml

with

$ cat ~/sink.yaml
configs:
  "topic": "test"
  "offsetStorageTopic": "kafka-connect-sink-offset"
  "pulsarServiceUrl": "pulsar://localhost:6650/"
  "kafkaConnectorSinkClass": "org.apache.kafka.connect.file.FileStreamSinkConnector"
  "defaultKeySchema": "STRING_SCHEMA"
  "defaultValueSchema": "BYTES_SCHEMA"
  "kafkaConnectorConfigProperties":
    "linger.ms": "10000"
    "batch.size": "3"
    "file": "/tmp/sink_test.out"

Produced a message as

bin/pulsar-client produce my-topic --messages "hello-pulsar"

and got

$ cat /tmp/sink_test.out
[B@242e2d8b

which is as expected ([B@... comes from FileSink writing the byte array as a string; this is the current expected behavior).
Schema support is still needed in Pulsar's Kafka sink to handle non-byte[] data; TBD.

Does this pull request potentially affect one of the following parts:

If yes was chosen, please highlight the changes

  • Dependencies (does it add or upgrade a dependency): (no)
  • The public API: (no)
  • The schema: (no)
  • The default values of configurations: (no)
  • The wire protocol: (no)
  • The rest endpoints: (no)
  • The admin cli options: (no)
  • Anything that affects deployment: (no)

Documentation

  • Does this pull request introduce a new feature? (yes)
  • If yes, how is the feature documented? (not documented yet)
  • If a feature is not documented yet in this PR, please create a followup issue for adding the documentation
    TBD following review

@sijie sijie requested review from jerrypeng and srkukarni March 8, 2021 03:05
@sijie sijie added this to the 2.8.0 milestone Mar 8, 2021
@sijie (Member) commented Mar 8, 2021

@jerrypeng @srkukarni @freeznet @nlu90 Can you review this pull request?

@sijie (Member) left a comment

@dlg99

I think you are putting the implementation in the wrong package. pulsar-io/kafka is used for the Pulsar source/sink that interact with Kafka.

The kafka-connect-adaptor module is used for adapting Kafka Connect sinks/sources. We should put the Kafka sink connector adapter in pulsar-io/kafka-connect-adaptor.


<file><source>${basedir}/../../pulsar-io/cassandra/target/pulsar-io-cassandra-${project.version}.nar</source></file>
<file><source>${basedir}/../../pulsar-io/twitter/target/pulsar-io-twitter-${project.version}.nar</source></file>
<file><source>${basedir}/../../pulsar-io/kafka/target/pulsar-io-kafka-${project.version}.nar</source></file>
Member

I think this is a breaking change, no?

Contributor Author

@sijie I might be missing some context: what could it break? Only the nar file name/source path changed.
A similar change is #9808; am I missing context that would help me understand why that one was safe and this one is not?
I relied on the integration tests to catch potential breakage. Please let me know if there is something the tests won't catch, and how I can confirm that everything works in this case.

Member

I missed #9808. There is no breaking change here.

#9808 is fine because the adaptor is used both as a standalone connector and as a dependency of the Debezium connector. This goes to my major comment: I don't think you should put the Kafka sink connector wrapper in the Kafka connector. It should go into the pulsar-io-kafka-connect-adaptor module.

Contributor Author

@sijie moved to another module


<build>
<plugins>
<plugin>
Member

Isn't this a breaking change?

Contributor Author

see above

.forEach(kv -> props.put(kv.getKey(), kv.getValue()));

producer = new KafkaProducer<>(beforeCreateProducer(props));
// todo: schemas from config
Member

Please create a Github issue and link the issue here.

Contributor Author

fixed


@Override
public KeyValue<Schema, Schema> extractKeyValueSchemas(Record<byte[]> record) {
Schema keySchema = Schema.STRING_SCHEMA;
Member

Any reason why you use STRING_SCHEMA and not BYTES_SCHEMA?

Contributor Author

@sijie This is the type of the original key (which is always a String in Pulsar).
KafkaProducer does serialization on send, using the configured serializers.
KafkaSinkWrappingProducer simply passes the key and value as-is to the sink, so the String key is passed as a String.
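
For illustration, a plausible shape of that method under the current String-key / byte[]-value assumption (the BYTES_SCHEMA default for the value is an assumption for this sketch, not quoted from the PR):

@Override
public KeyValue<Schema, Schema> extractKeyValueSchemas(Record<byte[]> record) {
    // Pulsar record keys are Strings, so the key schema stays STRING_SCHEMA;
    // the payload arrives as byte[], so BYTES_SCHEMA is the natural value default.
    Schema keySchema = Schema.STRING_SCHEMA;
    Schema valueSchema = Schema.BYTES_SCHEMA;
    return new KeyValue<>(keySchema, valueSchema);
}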

Member

Pulsar already supports bytes through keyBytes. We should use that instead of using STRING.

Contributor Author

The method accepts a Record&lt;byte[]&gt;, where keyBytes does not exist.

Contributor

I believe that this is a preliminary PR; currently we are only supporting String -> byte[].

The next step will be to fully support Schema and KeyValue.

public Future<RecordMetadata> send(ProducerRecord<K, V> producerRecord) {
sinkContext.throwIfNeeded();
task.put(Lists.newArrayList(toSinkRecord(producerRecord)));
return CompletableFuture.completedFuture(null);
Contributor

Is it possible to return a CompletableFuture that completes when the SinkRecord has been successfully processed?

Otherwise Pulsar thinks that the record has been correctly "sent" (processed), but this is not true.

The same comment applies to the other methods below.

Contributor Author

@eolivelli I'll add flush() immediately after task.put() to guarantee that for now.
The proper way would be to implement batching support (with policies based on time/number of pending records), but I'd prefer to postpone that as a "performance improvement" until we agree on functional completeness and/or hear feedback that it is needed.

Contributor Author

Added batch/linger support; that should cover this. A rough sketch of the flush logic is below.
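
Roughly, the batching logic could look like the following sketch (pendingRecords and currentOffsets() are assumed names, not taken from the PR):

private void flushIfNeeded(boolean force) {
    // Called by the linger.ms timer (force = true) and after each put()
    // once the pending batch reaches batch.size.
    if (force || pendingRecords.size() >= batchSize) {
        task.flush(currentOffsets()); // hand the accumulated records to the Connect sink task
        pendingRecords.clear();
    }
}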

@dlg99 dlg99 marked this pull request as draft March 8, 2021 23:24
@dlg99 dlg99 marked this pull request as ready for review March 11, 2021 00:13
@dlg99 (Contributor Author) commented Mar 11, 2021

@sijie , @eolivelli - can you take another look please?

private final PulsarKafkaSinkContext sinkContext;
private final PulsarKafkaSinkTaskContext taskContext;
private final int batchSize;
private final ScheduledExecutorService scheduledExecutor = Executors.newSingleThreadScheduledExecutor();
Contributor

Can we give this a name?

this.batchSize = getBatchSize(props);

long lingerMs = getLingerMs(props);
scheduledExecutor.scheduleAtFixedRate(() -> this.flushIfNeeded(true), lingerMs, lingerMs, TimeUnit.MILLISECONDS);
Contributor

We are not shutting down this scheduledExecutor.
Can you please shut it down in the close method?
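
For example, close() could stop the timer before tearing down the task (a sketch; the field names are assumed from the quoted snippet):

@Override
public void close() {
    // Stop the linger timer first so no flush races with shutdown.
    scheduledExecutor.shutdown();
    try {
        scheduledExecutor.awaitTermination(10, TimeUnit.SECONDS);
    } catch (InterruptedException ie) {
        Thread.currentThread().interrupt();
    }
    task.stop();
}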

}

@Override
public List<PartitionInfo> partitionsFor(String topic) {
Contributor

Do we need to implement this method?
If we are not calling it from our KafkaSink, then we could simply throw UnsupportedOperationException.

@Override
public Map<MetricName, ? extends Metric> metrics() {
sinkContext.throwIfNeeded();
return null;
Contributor

What about returning an empty map or throwing UnsupportedOperationException?
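
For instance, keeping the quoted signature but returning an empty map (a sketch):

@Override
public Map<MetricName, ? extends Metric> metrics() {
    sinkContext.throwIfNeeded();
    // An empty map is safer than null for callers that iterate over the metrics.
    return Collections.emptyMap();
}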

final Schema keySchema;
final Schema valueSchema;

public ProducerRecordWithSchema(String topic, Integer partition, Long timestamp, K key, V value,
Contributor

Do we need to implement all of these constructors?
Probably we need only one.

return snapshot;
}

private ByteBuffer topicPartitionAsKey(TopicPartition topicPartition) {
Contributor

static ?

Contributor Author

It uses topicNamespace. I could make it static and pass topicNamespace as a parameter, but I don't see what that would improve.

Contributor

Sorry. I missed that

super.internalSetup();
super.producerBaseSetup();

file = Paths.get(System.getProperty("java.io.tmpdir"), UUID.randomUUID().toString());
Contributor

What about Files.createTempDirectory?
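
For example, the JDK helper could replace the hand-built path (assuming the test only needs a single temporary output file):

file = Files.createTempFile("pulsar-kafka-sink-test", ".out");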

@Override
protected void cleanup() throws Exception {
if (file != null) {
//Files.delete(file);
Contributor

Please uncomment this line.

status.incrementAndGet();
} else {
System.out.println(exception.toString());
exception.printStackTrace();
Contributor

Nit: use a logger?

public void write(Record<byte[]> sourceRecord) {
KeyValue<K, V> keyValue = extractKeyValue(sourceRecord);
ProducerRecord<K, V> record = new ProducerRecord<>(kafkaSinkConfig.getTopic(), keyValue.getKey(), keyValue.getValue());
KeyValue<Schema, Schema> keyValueSchemas = extractKeyValueSchemas(sourceRecord);
Contributor

Using a KeyValue&lt;Schema, Schema&gt; here may be misleading; what about using a "Pair"?

I mean that KeyValue is usually the class we use as the content of the Record; here you are using it as a simple Pair, to return two objects from the extractKeyValueSchemas method.

@eolivelli (Contributor) left a comment

LGTM

@sijie please take a look

@dlg99 (Contributor Author) commented Mar 12, 2021

/pulsarbot run-failure-checks

@sijie (Member) left a comment

I am really confused by what this PR is trying to do. If the goal is wrapping a Kafka sink as a Pulsar sink, I would expect to see a class called KafkaSinkConnector, similar to what we did to wrap the Kafka source into KafkaSourceConnector (https://github.com/apache/pulsar/blob/master/pulsar-io/kafka-connect-adaptor/src/main/java/org/apache/pulsar/io/kafka/connect/KafkaConnectSource.java).

I see many other unrelated changes, e.g. adding a new kafka-nar module (which is not necessary), or adding support for writing records with schemas. I am lost reviewing this pull request. Please clarify what you are trying to do here.

import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;
import org.apache.kafka.common.*;
Member

avoid importing *


private static long getLingerMs(Properties props) {
long lingerMs = 2147483647L; // as in kafka
final String lingerPropName = "linger.ms";
Member

Kafka has constants for these settings. We should use the Kafka constants instead of defining those strings again.
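
For example, Kafka's ProducerConfig already defines these keys:

import org.apache.kafka.clients.producer.ProducerConfig;

// Kafka's own constants for the settings re-declared above:
final String lingerPropName = ProducerConfig.LINGER_MS_CONFIG;      // "linger.ms"
final String batchSizePropName = ProducerConfig.BATCH_SIZE_CONFIG;  // "batch.size"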


@Override
public void raiseError(Exception e) {
log.warn("raiseError called", lastException);
Member

Please improve the log statement. The log statement here provides useless information.

</parent>

<artifactId>pulsar-io-kafka-nar</artifactId>
<name>Pulsar IO :: Kafka NAR</name>
Member

I feel this change is unrelated to this PR. pulsar-io-kafka is a connector, not a library shared by other modules. Introducing a new module will increase the build time, so I suggest removing this change.

public void write(Record<byte[]> sourceRecord) {
KeyValue<K, V> keyValue = extractKeyValue(sourceRecord);
ProducerRecord<K, V> record = new ProducerRecord<>(kafkaSinkConfig.getTopic(), keyValue.getKey(), keyValue.getValue());
Pair<Schema, Schema> keyValueSchemas = extractKeyValueSchemas(sourceRecord);
Member

I don't think this change is related to the adaptor change. Can you explain why you put the change here?

@dlg99 (Contributor Author) commented Mar 15, 2021

@sijie The goal is to use a Kafka Connect sink inside a Pulsar sink. There is already Pulsar's KafkaBytesSink (in the pulsar-io/kafka module) that writes data to Kafka.
Instead of adding another sink and later dealing with potentially duplicate changes (e.g. around Schema), this change wraps the Kafka Connect sink in the Kafka producer API and makes the existing Pulsar sink use it.
A Kafka Connect sink accepts (through the SinkRecord) and uses Schemas to send data to the third-party system, so there is no need to serialize the data beforehand.
Schemas will also be used once @eolivelli's changes unblock the implementation of Sink&lt;Object&gt;.

The nar module split does not affect anything functionally, but it

  • avoids building the nar file when I simply want to build the module / run tests (mvn clean install)
  • simplifies the dependency definitions in the added profile. There is a default profile (as before) and one that specifies the Kafka Connect module to include in the nar.

@dlg99 (Contributor Author) commented Mar 16, 2021

Abandoned in favor of #9927.

@dlg99 dlg99 closed this Mar 16, 2021
@dlg99 dlg99 deleted the kafka-connect branch October 14, 2021 23:29