MINOR: Add system test for optimization upgrades#5912

Merged
guozhangwang merged 5 commits into apache:trunk from
bbejeck:MINOR_create_system_test_for_rolling_upgrade_with_optimization
Nov 27, 2018
Conversation

@bbejeck
Member

@bbejeck bbejeck commented Nov 14, 2018

This is a new system test verifying the optimization of an existing topology. The test takes the following steps:

  1. Start a Kafka Streams application that uses a selectKey then performs 3 groupByKey() operations and 1 join, creating four repartition topics
  2. Verify all instances start and process data
  3. Stop all instances and verify they have stopped
  4. For each stopped instance, update the TOPOLOGY_OPTIMIZATION config to all, then restart the instance, verify it has started successfully, and verify Kafka Streams reduced the number of repartition topics from 4 to 1
  5. Verify that each instance is processing data from the aggregation, reduce, and join operations
  6. Stop all instances and verify the shutdown is complete.

For testing I ran two passes of the system test with 25 repeats for a total of 50 test runs.

All test runs passed

First 25 system test runs

Second 25 system test runs

Committer Checklist (excluded from commit message)

  • Verify design and implementation
  • Verify test coverage and CI build status
  • Verify documentation (including upgrade notes)

Member Author

monitor.wait_for uses grep without the -E switch, so I need to escape the | in the pattern.
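The escaping issue can be illustrated with a short sketch (the operation names come from this test; the pattern-building code is a hypothetical illustration, not the PR's actual code):

```python
# Basic grep (no -E) treats "|" as a literal character, so alternation in the
# pattern handed to monitor.wait_for must be spelled "\|" instead of "|".
operations = ["AGGREGATED", "REDUCED", "JOINED"]

basic_grep_pattern = r"\|".join(operations)  # works with plain grep
extended_pattern = "|".join(operations)      # would require grep -E

print(basic_grep_pattern)  # -> AGGREGATED\|REDUCED\|JOINED
```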

Member Author

@bbejeck bbejeck Nov 14, 2018

With 5 sub-topologies (1 source input, 4 repartition topic inputs) and 6 partitions, we end up with 30 tasks. So it's very probable that each instance only has 2 of the 3 tested aggregations, i.e. (AGGREGATED, REDUCED), (AGGREGATED, JOINED), or (REDUCED, JOINED), so I just verify that the STDOUT log has something from the pattern. The False parameter is used to signal verification with the operation_pattern as a whole.

Contributor

Can we use named parameters (especially for the boolean param) for clarity?
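A minimal sketch of the suggestion (the function and parameter names here are hypothetical stand-ins, not the test's actual API):

```python
def verify_processing(processors, verify_individual_operation):
    """Toy stand-in for the test's verification helper (not the real API)."""
    mode = "individual" if verify_individual_operation else "whole-pattern"
    return (mode, len(processors))

# Positional boolean: a reader can't tell what False means at the call site.
verify_processing(["p1", "p2", "p3"], False)

# Named parameter: the intent is self-documenting.
result = verify_processing(["p1", "p2", "p3"], verify_individual_operation=False)
print(result)  # -> ('whole-pattern', 3)
```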

Member Author

Now there are 2 sub-topologies (1 source input, 1 repartition topic input) and 6 partitions, so we end up with 12 tasks. Each streams instance should have at least one task for AGGREGATED, REDUCED, and JOINED, so we use the True parameter to indicate a test for the existence of each operation in the STDOUT of each streams instance.
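The task arithmetic across the two phases of the test can be sketched as follows (a toy calculation, not code from the test):

```python
# Streams creates one task per (sub-topology, partition) pair.
partitions = 6

# Before optimization: 1 source input + 4 repartition-topic inputs.
pre_optimization = 5 * partitions   # 30 tasks spread over the instances

# After optimization: 1 source input + 1 merged repartition-topic input.
post_optimization = 2 * partitions  # 12 tasks

print(pre_optimization, post_optimization)  # -> 30 12
```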

Contributor

cool. (again, do you mind using named params?)

Member Author

Testing for the existence of each term by itself in the STDOUT file

Member Author

Testing for the entire pattern

Member Author

In a tiny percentage of test runs, one streams instance ends up with all input source tasks, i.e. (0_0, 0_1, 0_2, 0_3), so none of the expected operations are processed on that node. So we check the task assignment, and if it's all input source tasks, we skip checking this node.

I added this check after noticing some test flakiness.
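The skip condition described above can be sketched like this (task ids use Streams' `<sub-topology>_<partition>` form; the helper name is made up for illustration):

```python
def has_only_source_tasks(task_ids):
    # Streams task ids look like "0_3" (sub-topology 0, partition 3).
    # Sub-topology 0 reads the input source topic, so an assignment of
    # ("0_0", ..., "0_3") means this instance runs none of the
    # AGGREGATED / REDUCED / JOINED operations and should be skipped.
    return all(task.split("_")[0] == "0" for task in task_ids)

print(has_only_source_tasks(["0_0", "0_1", "0_2", "0_3"]))  # -> True
print(has_only_source_tasks(["0_0", "1_2", "4_5"]))         # -> False
```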

Contributor

Hmm.. I'm wondering why it is still possible, as we should have achieved balance across sub-topologies right? Or are there any potential edge cases you are aware of while working on that PR @bbejeck ?

Member Author

I think as long as we favor stickiness over load balancing this is always a possibility. One thing to note: I only observed one instance getting all tasks from one sub-topology after the first phase of the test, meaning stickiness is a factor, and it seemed to be a tiny percentage. I put the check in to eliminate test flakiness.

I have some additional thoughts on putting back the check we had in place to make sure that adding a task from the same sub-topology only happens when all clients are over capacity. But I'd like to do that in a separate PR and write an independent system test for it.

WDYT?

Contributor

Interesting! I didn't realize we attempt to balance subtopologies over the instances... Is this important for some reason?

Member Author

Checking the task assignment for this processor node.

@bbejeck
Member Author

bbejeck commented Nov 14, 2018

ping @guozhangwang, @mjsax, and @vvcephei for reviews

@mjsax mjsax added the streams label Nov 15, 2018
Member

@mjsax mjsax left a comment

Not sure if I fully understand the test setup, i.e., what the Python part does. Will revisit tomorrow again.

Member

nit: StreamsUpgradeTest -> StreamsOptimizedTest

Member Author

ack

Member

nit: simplify to final String propFileName = args[0];

Member Author

ack

Member

Add null check for each?

Member Author

ack

Member

Why this?

Also, there is a peek() in the other PR, that cannot be switched on/off. I like the idea, just want to get clarification to get a unique strategy we apply for system tests.

Member Author

I needed this Sysout for debugging when writing the test, but I don't need to verify the output, hence the guard, but since it's no longer needed I'll remove it.

Member

nit: line too long

Member Author

ack

Member

as above

Member Author

It's not guarded as the output is needed for verification in the test. The one above was strictly for debugging purposes, but I've taken it out.

Member

one more :)

Member Author

It's not guarded as the output is needed for verification in the test. The one above was strictly for debugging purposes, but I've taken it out.

Member

Why do we need this? This info is logged already

Member Author

yep, taking out

Member

nit: why starting the name with _ ?

Member Author

@bbejeck bbejeck Nov 16, 2018

Trying to mark it as a private method, but thinking about it more, that doesn't make too much sense; I'll remove the _.

Member

Is this a naming convention? Or a Python feature (i.e., does a leading _ make a method private in Python)?
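To answer the question in passing: a single leading underscore in Python is only a naming convention signaling "internal use"; the interpreter does not enforce privacy. A minimal sketch (the class name is hypothetical):

```python
class StreamsTestHelper:
    """Hypothetical class; only the naming convention is the point here."""

    def _internal_check(self):
        # A single leading underscore marks a method as internal by
        # convention only; Python does not enforce access control
        # (only a double leading underscore triggers name mangling).
        return 42

helper = StreamsTestHelper()
print(helper._internal_check())  # -> 42; still callable from outside
```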

Member

Why is this called index? It seems like a retry to me. Also, why do we need a retry? Can you use a monitor instead of a direct ssh_capture?

Member Author

updated variable name.

I'll try using monitor, but I went with ssh_capture as it allows me to set my own regex; monitor.wait_for uses grep without the -E flag, so I wasn't sure I could be as specific as I needed.

Contributor

@guozhangwang guozhangwang left a comment

Thanks @bbejeck. I just have some minor comments.

One meta question: why can we still observe imbalance where tasks of the same sub-topology are not distributed evenly? Do you have any ideas?

Contributor

Since we already set the default serdes to String, String, do we still need the Produced / Consumed when constructing the topology?

Member Author

It's a habit for me to always put those in. If you want me to remove them I will.

Contributor

No need to remove, just curious if there are any issues I don't know about that force you to add it :)

Contributor

nit: I'd suggest printing the list to sysout as well for debugging, since the number of matched ones may not be sufficient.

Member Author

ack


@guozhangwang
Contributor
Contributor

@bbejeck as a side note, we believe that when doing this upgrade there may be a minor amount of data loss, as some repartition topics that get merged may have data not yet processed. Could you add some logs to illustrate what percentage of data may be lost? (We do not need to make it a verification phase, just print it out like "produced XXX, consumed YYY, lost XXX - YYY".)
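The suggested log line could look something like the following sketch (the counts are made up for illustration; the real test would pull them from the producer and the application's processed-record totals):

```python
# Hypothetical message counts standing in for the real producer/consumer totals.
produced = 100_000
consumed = 99_250
lost = produced - consumed

print(f"produced {produced}, consumed {consumed}, "
      f"lost {lost} ({lost / produced:.2%})")
# -> produced 100000, consumed 99250, lost 750 (0.75%)
```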

@bbejeck
Member Author

bbejeck commented Nov 16, 2018

@guozhangwang @mjsax updated per comments

@bbejeck
Member Author

bbejeck commented Nov 16, 2018

@bbejeck as a side note, we believe that when doing this upgrade there may be a minor amount of data loss, as some repartition topics that get merged may have data not yet processed. Could you add some logs to illustrate what percentage of data may be lost? (We do not need to make it a verification phase, just print it out like "produced XXX, consumed YYY, lost XXX - YYY".)

This could be a little tricky as I'm using the VerifiableProducer with no message limit. So this PR does not drag on, can I do a separate PR adding a separate test?

@bbejeck
Member Author

bbejeck commented Nov 16, 2018

Member Author

Changed this, as "current active tasks" always shows up in the log files, whereas "Committed active tasks" may or may not be in the log file.

@guozhangwang
Contributor

This could be a little tricky as I'm using the VerifiableProducer with no message limit. So this PR does not drag on, can I do a separate PR adding a separate test?

Sounds good.

@guozhangwang
Contributor

LGTM. @mjsax feel free to merge after you've made another pass.

@guozhangwang
Contributor

One meta question: why can we still observe imbalance where tasks of the same sub-topology are not distributed evenly? Do you have any ideas?

@bbejeck any ideas?

@bbejeck
Member Author

bbejeck commented Nov 17, 2018

One meta question: why can we still observe imbalance where tasks of the same sub-topology are not distributed evenly? Do you have any ideas?
@bbejeck any ideas?

Here's my original response #5912 (comment)

Thinking about this some more, I also think setting group.initial.rebalance.delay.ms to a higher value will help mitigate the issue. Right now, if one client gets task assignments before the other clients, it creates the potential for task imbalance. But if all the group members are present at the same time, we should see an improvement in task assignment balance. I'll experiment with this system test and see if it makes a difference.

@guozhangwang
Contributor

Thinking about this some more, I also think setting group.initial.rebalance.delay.ms to a higher value will help mitigate the issue. Right now, if one client gets task assignments before the other clients, it creates the potential for task imbalance. But if all the group members are present at the same time, we should see an improvement in task assignment balance. I'll experiment with this system test and see if it makes a difference.

Thanks for the explanation. It makes sense to me that if the imbalance only happens in the second phase, it is indeed possible, because maybe not all instances participated in the first rebalance. I agree that initial.rebalance.delay.ms would help, but it is a broker-side config and, as we discussed before, not a very well-designed configuration; we may consider deprecating it as we improve the rebalance protocol in the future. So I think we do not need to spend more time experimenting with it (again, just to clarify: I agree with you that it would mitigate the issue, I just think we do not need to spend more time validating that in a system test :)

@bbejeck
Member Author

bbejeck commented Nov 19, 2018

retest this please

@mjsax
Member
Member

mjsax commented Nov 21, 2018

retest this please

@vvcephei
Contributor
Contributor

Hey @bbejeck ,

Sorry my review is late, but since the tests are still failing, maybe I can sneak in two comments (above, about named parameters in Python)?

Regardless, it LGTM. Thanks!

@bbejeck bbejeck force-pushed the MINOR_create_system_test_for_rolling_upgrade_with_optimization branch from e372f97 to 7e36e54 on November 27, 2018 15:37
@bbejeck
Member Author

bbejeck commented Nov 27, 2018

@vvcephei updated per comments and kicked off new system test https://jenkins.confluent.io/job/system-test-kafka-branch-builder/2104/

EDIT: rebased from trunk as well

@guozhangwang
Contributor

System test succeeded, merging to trunk now.

@guozhangwang guozhangwang merged commit dfd5454 into apache:trunk Nov 27, 2018
pengxiaolong pushed a commit to pengxiaolong/kafka that referenced this pull request Jun 14, 2019
This is a new system test verifying the optimization of an existing topology. The test takes the following steps:

1. Start a Kafka Streams application that uses a selectKey then performs 3 groupByKey() operations and 1 join, creating four repartition topics
2. Verify all instances start and process data
3. Stop all instances and verify they have stopped
4. For each stopped instance, update the TOPOLOGY_OPTIMIZATION config to all, then restart the instance, verify it has started successfully, and verify Kafka Streams reduced the number of repartition topics from 4 to 1
5. Verify that each instance is processing data from the aggregation, reduce, and join operations
6. Stop all instances and verify the shutdown is complete.

For testing I ran two passes of the system test with 25 repeats for a total of 50 test runs.

All test runs passed

Reviewers: Matthias J. Sax <matthias@confluent.io>, Bill Bejeck <bill@confluent.io>, Guozhang Wang <wangguoz@gmail.com>
@bbejeck bbejeck deleted the MINOR_create_system_test_for_rolling_upgrade_with_optimization branch July 10, 2024 12:56