Skip to content

KAFKA-6145: Pt 2.5 Compute overall task lag per client#8252

Merged
vvcephei merged 7 commits intoapache:trunkfrom
ableegoldman:KIP-441-compute-lag-from-offsets
Mar 21, 2020
Merged

KAFKA-6145: Pt 2.5 Compute overall task lag per client#8252
vvcephei merged 7 commits intoapache:trunkfrom
ableegoldman:KIP-441-compute-lag-from-offsets

Conversation

@ableegoldman
Copy link
Copy Markdown
Member

@ableegoldman ableegoldman commented Mar 7, 2020

Once we have encoded the offset sums per task for each client, we can compute the overall lag during assign by fetching the end offsets for all changelog and subtracting.

If the listOffsets request fails, we simply return a "completely sticky" assignment, ie all active tasks are given to previous owners regardless of balance.

Builds (but does not yet use) the statefulTasksToRankedCandidates map with the ranking:
Rank -1: active running task
Rank 0: standby or restoring task whose overall lag is within acceptableRecoveryLag
Rank 1: tasks whose lag is unknown (eg during version probing)
Rank 1+: all other tasks are ranked according to their actual total lag

@ableegoldman ableegoldman force-pushed the KIP-441-compute-lag-from-offsets branch 10 times, most recently from d180129 to 9fb5fb6 Compare March 18, 2020 00:32
@ableegoldman ableegoldman force-pushed the KIP-441-compute-lag-from-offsets branch 5 times, most recently from c46b133 to d83f64b Compare March 19, 2020 18:32
@vvcephei
Copy link
Copy Markdown
Contributor

test this please

@ableegoldman ableegoldman force-pushed the KIP-441-compute-lag-from-offsets branch from 118c3a9 to c89ac5c Compare March 20, 2020 03:17
@ableegoldman
Copy link
Copy Markdown
Member Author

Finished adding tests for the new behavior, ready for another pass @cadonna @vvcephei

@vvcephei
Copy link
Copy Markdown
Contributor

test this please

@vvcephei
Copy link
Copy Markdown
Contributor

ok to test

@vvcephei
Copy link
Copy Markdown
Contributor

test this please

@ableegoldman ableegoldman force-pushed the KIP-441-compute-lag-from-offsets branch 2 times, most recently from bd0884c to 0bada7e Compare March 20, 2020 19:23
fetch end offsets
add lag to each clienttate
build ranks
add ClientState tests
add tests for building client rankings
fall back to prev assignment if fetching offsets fails
@ableegoldman ableegoldman force-pushed the KIP-441-compute-lag-from-offsets branch from 0bada7e to 4200590 Compare March 20, 2020 19:24
@ableegoldman ableegoldman force-pushed the KIP-441-compute-lag-from-offsets branch from 4200590 to 3cf72e0 Compare March 20, 2020 19:26
@vvcephei
Copy link
Copy Markdown
Contributor

test this please

1 similar comment
@vvcephei
Copy link
Copy Markdown
Contributor

test this please

Copy link
Copy Markdown
Contributor

@vvcephei vvcephei left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, @ableegoldman ! Looks good overall, just a few remarks.

Comment thread clients/src/main/java/org/apache/kafka/clients/admin/Admin.java Outdated
Comment on lines +1242 to +1264
static Map<TopicPartition, ListOffsetsResultInfo> fetchEndOffsetsWithoutTimeout(final Collection<TopicPartition> partitions,
final Admin adminClient) {
return fetchEndOffsets(partitions, adminClient, null);
}

public static Map<TopicPartition, ListOffsetsResultInfo> fetchEndOffsets(final Collection<TopicPartition> partitions,
final Admin adminClient,
final Duration timeout) {
final Map<TopicPartition, ListOffsetsResultInfo> endOffsets;
try {
final KafkaFuture<Map<TopicPartition, ListOffsetsResultInfo>> future = adminClient.listOffsets(
partitions.stream().collect(Collectors.toMap(Function.identity(), tp -> OffsetSpec.latest())))
.all();
if (timeout == null) {
endOffsets = future.get();
} else {
endOffsets = future.get(timeout.toMillis(), TimeUnit.MILLISECONDS);
}
} catch (final TimeoutException | RuntimeException | InterruptedException | ExecutionException e) {
throw new StreamsException("Unable to obtain end offsets from kafka", e);
}
return endOffsets;
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The fact that this is static and has nothing in particular to do with the KafkaStreams class indicates that it probably belongs in a util class for use by KafkaStreams and StreamsPartitionAssignor.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ack

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See #8328

}

@Test
public void shouldViolateBalanceToPreserveActiveTaskStickiness() {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this test pass on the current code base, or is it a consequence of this PR?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's testing the newly added #preservePreviousTaskAssignment: this is sort of a temporary hack to force the StickyTaskAssignor to return a "completely sticky" active task assignment, in the case the admin listOffsets request fails or times out.

The default StickyTaskAssignor behavior is unchanged in this PR, to leave the code in a stable state and avoid adding a huge refactoring of StickyTaskAssignorTest to an already large PR.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, thanks. The subtlety was a bit lost on me.

@vvcephei
Copy link
Copy Markdown
Contributor

test this please

1 similar comment
@vvcephei
Copy link
Copy Markdown
Contributor

test this please

@vvcephei
Copy link
Copy Markdown
Contributor

Test failures unrelated:
kafka.api.ConsumerBounceTest.testRollingBrokerRestartsWithSmallerMaxGroupSizeConfigDisruptsBigGroup
org.apache.kafka.streams.integration.StandbyTaskEOSIntegrationTest.surviveWithOneTaskAsStandby

@vvcephei vvcephei merged commit 6cf27c9 into apache:trunk Mar 21, 2020
@ableegoldman ableegoldman deleted the KIP-441-compute-lag-from-offsets branch June 26, 2020 22:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants