prevent npe on mismatch between number of kafka partitions and task count #5139

Merged

pjain1 merged 1 commit into apache:master from pjain1:fix_npe_lag on Dec 20, 2017
Conversation

@pjain1 (Member) commented Dec 5, 2017

2017-12-05T20:15:35,610 WARN [KafkaSupervisor-<datasource>-Reporting-0] io.druid.indexing.kafka.supervisor.KafkaSupervisor - Lag metric: Kafka partitions [16, 1, 51, 36, 21, 6, 56, 41, 26, 11, 46, 31] do not match task partitions [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 22, 23, 24, 25, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59]
2017-12-05T20:15:35,610 WARN [KafkaSupervisor-<datasource>-Reporting-0] io.druid.indexing.kafka.supervisor.KafkaSupervisor - Unable to compute Kafka lag
java.lang.NullPointerException
	at java.util.HashMap.merge(HashMap.java:1224) ~[?:1.8.0_131]
	at java.util.stream.Collectors.lambda$toMap$58(Collectors.java:1320) ~[?:1.8.0_131]
	at java.util.stream.ReduceOps$3ReducingSink.accept(ReduceOps.java:169) ~[?:1.8.0_131]
	at java.util.HashMap$EntrySpliterator.forEachRemaining(HashMap.java:1691) ~[?:1.8.0_131]
	at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481) ~[?:1.8.0_131]
	at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471) ~[?:1.8.0_131]
	at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708) ~[?:1.8.0_131]
	at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234) ~[?:1.8.0_131]
	at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:499) ~[?:1.8.0_131]
	at io.druid.indexing.kafka.supervisor.KafkaSupervisor.getLagPerPartition(KafkaSupervisor.java:2113) ~[druid-kafka-indexing-service-0.11.1-1512178916-1232c9d-1815.jar:0.11.1-1512178916-1232c9d-1815]
	at io.druid.indexing.kafka.supervisor.KafkaSupervisor.lambda$emitLag$19(KafkaSupervisor.java:2143) ~[druid-kafka-indexing-service-0.11.1-1512178916-1232c9d-1815.jar:0.11.1-1512178916-1232c9d-1815]
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [?:1.8.0_131]
	at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) [?:1.8.0_131]
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) [?:1.8.0_131]
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) [?:1.8.0_131]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_131]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_131]
	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_131]
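The failure mode in the stack trace above can be reproduced in isolation: `Collectors.toMap` accumulates into a `HashMap` via `HashMap.merge`, which throws `NullPointerException` when the value function returns null — which is what happens when a task partition has no matching entry in `latestOffsetsFromKafka`. A minimal standalone sketch (hypothetical names, not the actual Druid code):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.stream.Collectors;

public class ToMapNpeDemo {
    public static void main(String[] args) {
        // Offsets Kafka reports: only partition 0 exists.
        Map<Integer, Long> latestOffsetsFromKafka = new HashMap<>();
        latestOffsetsFromKafka.put(0, 100L);

        // Offsets the tasks report: partition 1 has no Kafka counterpart.
        Map<Integer, Long> currentOffsets = new HashMap<>();
        currentOffsets.put(0, 90L);
        currentOffsets.put(1, 50L);

        boolean threw = false;
        try {
            // The value function returns null for partition 1;
            // HashMap.merge rejects null values with an NPE.
            currentOffsets.entrySet().stream().collect(
                Collectors.toMap(
                    Map.Entry::getKey,
                    e -> latestOffsetsFromKafka.get(e.getKey())));
        } catch (NullPointerException expected) {
            threw = true; // same NPE as in the supervisor log
        }
        System.out.println("NPE thrown: " + threw); // prints "NPE thrown: true"
    }
}
```

This is why the fix has to either filter out the mismatched partitions before collecting, or map them to a non-null sentinel value.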

@pjain1 added this to the 0.12.0 milestone Dec 5, 2017
@pjain1 added the Bug label Dec 5, 2017
@himanshug (Contributor)

👍

      && e.getValue() != null
          ? latestOffsetsFromKafka.get(e.getKey()) - e.getValue()
  -       : null
  +       : Integer.MIN_VALUE
Contributor
Maybe the below is better, because we don't have to add unnecessary lag values?

private Map<Integer, Long> getLagPerPartition(Map<Integer, Long> currentOffsets)
  {
    if (latestOffsetsFromKafka == null) {
      return ImmutableMap.of();
    }

    return currentOffsets
        .entrySet()
        .stream()
        .filter(e -> latestOffsetsFromKafka.get(e.getKey()) != null && e.getValue() != null)
        .collect(
            Collectors.toMap(
                Map.Entry::getKey,
                e -> latestOffsetsFromKafka.get(e.getKey()) - e.getValue()
            )
        );
  }

Member Author
I didn't want to filter, so that it is visible that there is a mismatch between the task count and the number of available Kafka partitions. Currently, whenever total lag is calculated, x -> Math.max(x, 0) is applied, so setting the lag to Integer.MIN_VALUE won't add to the total.
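The sentinel argument can be checked in a few lines (a sketch of the clamping behavior described above, not the actual Druid aggregation code): with Integer.MIN_VALUE as the lag for a mismatched partition, the per-partition entry stays visible in the emitted map, but the clamped total is unaffected.

```java
import java.util.HashMap;
import java.util.Map;

public class LagSentinelDemo {
    public static void main(String[] args) {
        // Per-partition lag as the patched supervisor would report it:
        // partition 2 has no matching Kafka partition, so it carries the sentinel.
        Map<Integer, Long> lag = new HashMap<>();
        lag.put(0, 10L);
        lag.put(1, 25L);
        lag.put(2, (long) Integer.MIN_VALUE); // mismatched partition

        // Total lag clamps each value with Math.max(x, 0),
        // so the sentinel contributes 0 to the sum.
        long totalLag = lag.values().stream()
                           .mapToLong(x -> Math.max(x, 0))
                           .sum();
        System.out.println(totalLag); // prints 35
    }
}
```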

Member Author
However, if you prefer this approach, we can do that as well.

Contributor
Both look good to me. If you think the current patch is better for making the mismatch between the task count and the number of partitions visible, please go for it.

@pjain1 merged commit c56a980 into apache:master Dec 20, 2017
@pjain1 deleted the fix_npe_lag branch December 20, 2017 22:23
seoeun25 added a commit to seoeun25/incubator-druid that referenced this pull request Jan 10, 2020
* Kafka Index Task that supports Incremental handoffs apache#4815

* prevent NPE from supressing actual exception (apache#5146)

* prevent npe on mismatch between number of kafka partitions and task count (apache#5139)

* Throw away rows with timestamps beyond long bounds in kafka indexing (apache#5215) (apache#5232)

* Fix state check bug in Kafka Index Task (apache#5204) (apache#5248)