Skip to content

KAFKA-7834: Extend collected logs in system test services to include heap dumps#6158

Closed
kkonstantine wants to merge 4 commits intoapache:trunkfrom
kkonstantine:KAFKA-7834
Closed

KAFKA-7834: Extend collected logs in system test services to include heap dumps#6158
kkonstantine wants to merge 4 commits intoapache:trunkfrom
kkonstantine:KAFKA-7834

Conversation

@kkonstantine
Copy link
Copy Markdown
Contributor

  • Enable heap dumps on OOM with -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=<file.bin> in the major services in system tests
  • Collect the heap dump from the predefined location as part of the result logs for each service
  • Change Connect service to delete the whole root directory instead of individual expected files
  • Tested by running the full suite of system tests

Committer Checklist (excluded from commit message)

  • Verify design and implementation
  • Verify test coverage and CI build status
  • Verify documentation (including upgrade notes)

Comment thread tests/kafkatest/services/connect.py Outdated
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are these all under PERSISTENT_ROOT? if yes, can we just handle PERSISTENT_ROOT and not worry about individual files?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the other_files are not necessarily under PERSISTENT_ROOT
What is under it for sure is:
self.CONFIG_FILE, self.LOG4J_CONFIG_FILE, self.PID_FILE, self.LOG_FILE, self.STDOUT_FILE, self.STDERR_FILE, self.EXTERNAL_CONFIGS_FILE which I removed from explicit listing and I'm removing with PERSISTENT_ROOT altogether.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, I get that though I think that's just tech debt. for any test related files, we really shouldn't be using anything other than PERSISTENT_ROOT so that we can at least attempt to ensure each test/service gets a clean workspace

Comment thread tests/kafkatest/services/connect.py Outdated
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we seem to duplicate this pattern of setting constants, is there any opportunity to relocate and reduce to one copy of these constants (given they should be same across all our java services)?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With each service setting its security configs in KAFKA_OPTS using their own way, this seems difficult at the moment. I'd prefer if I could avoid to extend this PR to a general refactoring at this point.

Comment thread tests/kafkatest/services/connect.py Outdated
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is the logic for this, including the different ways env_opts is being encoded?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I ended up adding the heap jvm arguments in KAFKA_OPTS instead of KAFKA_HEAP_OPTS because the latter has defaults for sizes if it's empty. However, if I needed to add values to an env variable, I can't just append them to it's current value, because the value is wrapped with \"

That's why I'm adding this method - unused here - that can be used by tests that inherit this class. Let me know if you think I should remove.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, in general the construction of commands could be cleaned up quite a bit for extensibility -- i think we tend to forget about the importance/value of doing this when writing tests in the Kafka repo, but these classes should be reusable and extensible. this approach seems pretty error-prone, although at least it should consistently fail if the quoting gets messed up since you'll have invalid extra args on the command line.

Comment thread tests/kafkatest/services/connect.py Outdated
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what's happening with the various strip('\"') calls? why do we need this process? can we refactor to something that makes it clearer what's happening?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I mention above, we often wrap the value of an env var in escaped double quotes. To append contents, we first need to remove those. If they don't exist, strip is a no-op.

Comment thread vagrant/base.sh Outdated
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this seems unrelated to this PR? and the bucket probably shouldn't change unless we're changing the target bucket -- we want only 1, permanent bucket if possible

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is temporary. Was added with the commit named: TEMP commit to unblock system test branch builder in order to fix an issue with the jenkins builder and be able to run the tests. Should be removed before merging

@kkonstantine kkonstantine force-pushed the KAFKA-7834 branch 7 times, most recently from 38d4110 to 7775110 Compare January 24, 2019 01:48
Copy link
Copy Markdown
Contributor

@ewencp ewencp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. The JIRA is marked to be backported to 1.0, but it looks like this only backports cleanly to 2.0. Additionally, 2.1.1 is currently in flight, and although this is only test code and I don't think too risky, I don't want to merge until we get the next RC out the door. So I may hold off on the actual merge here until that passes.

Comment thread tests/kafkatest/services/connect.py Outdated
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, I get that though I think that's just tech debt. for any test related files, we really shouldn't be using anything other than PERSISTENT_ROOT so that we can at least attempt to ensure each test/service gets a clean workspace

Comment thread tests/kafkatest/services/connect.py Outdated
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, in general the construction of commands could be cleaned up quite a bit for extensibility -- i think we tend to forget about the importance/value of doing this when writing tests in the Kafka repo, but these classes should be reusable and extensible. this approach seems pretty error-prone, although at least it should consistently fail if the quoting gets messed up since you'll have invalid extra args on the command line.

@ewencp
Copy link
Copy Markdown
Contributor

ewencp commented Feb 5, 2019

Actually, after further discussion, will merge to trunk, 2.2, and 2.0 so we can get debugging information out for some failing tests. Will follow up on 2.1 after it won't get in the way of 2.1.1 release, and we'll follow up separately with 1.x versions if we find the additional info to be valuable, but hopefully tests shouldn't be that unstable on long term maintenance branches.

@kkonstantine
Copy link
Copy Markdown
Contributor Author

Excellent. Thanks @ewencp! I'll keep track progress with respect to 2.1.1

ewencp pushed a commit that referenced this pull request Feb 5, 2019
…heap dumps

* Enable heap dumps on OOM with -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=<file.bin> in the major services in system tests
* Collect the heap dump from the predefined location as part of the result logs for each service
* Change Connect service to delete the whole root directory instead of individual expected files
* Tested by running the full suite of system tests

Author: Konstantine Karantasis <konstantine@confluent.io>

Reviewers: Ewen Cheslack-Postava <ewen@confluent.io>

Closes #6158 from kkonstantine/KAFKA-7834

(cherry picked from commit 83c435f)
Signed-off-by: Ewen Cheslack-Postava <me@ewencp.org>
ewencp pushed a commit that referenced this pull request Feb 5, 2019
…heap dumps

* Enable heap dumps on OOM with -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=<file.bin> in the major services in system tests
* Collect the heap dump from the predefined location as part of the result logs for each service
* Change Connect service to delete the whole root directory instead of individual expected files
* Tested by running the full suite of system tests

Author: Konstantine Karantasis <konstantine@confluent.io>

Reviewers: Ewen Cheslack-Postava <ewen@confluent.io>

Closes #6158 from kkonstantine/KAFKA-7834

(cherry picked from commit 83c435f)
Signed-off-by: Ewen Cheslack-Postava <me@ewencp.org>
@ewencp ewencp closed this in 83c435f Feb 5, 2019
@kkonstantine kkonstantine deleted the KAFKA-7834 branch February 5, 2019 05:21
@ewencp
Copy link
Copy Markdown
Contributor

ewencp commented Feb 20, 2019

Now that 2.1.1 has finally landed, cherry-picking this into 2.1 as well. Seems like it has been fine for awhile on other branches anyway so probably we could have risked it even earlier.

ewencp pushed a commit that referenced this pull request Feb 20, 2019
…heap dumps

* Enable heap dumps on OOM with -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=<file.bin> in the major services in system tests
* Collect the heap dump from the predefined location as part of the result logs for each service
* Change Connect service to delete the whole root directory instead of individual expected files
* Tested by running the full suite of system tests

Author: Konstantine Karantasis <konstantine@confluent.io>

Reviewers: Ewen Cheslack-Postava <ewen@confluent.io>

Closes #6158 from kkonstantine/KAFKA-7834
pengxiaolong pushed a commit to pengxiaolong/kafka that referenced this pull request Jun 14, 2019
…heap dumps

* Enable heap dumps on OOM with -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=<file.bin> in the major services in system tests
* Collect the heap dump from the predefined location as part of the result logs for each service
* Change Connect service to delete the whole root directory instead of individual expected files
* Tested by running the full suite of system tests

Author: Konstantine Karantasis <konstantine@confluent.io>

Reviewers: Ewen Cheslack-Postava <ewen@confluent.io>

Closes apache#6158 from kkonstantine/KAFKA-7834
@kkonstantine kkonstantine added connect tests Test fixes (including flaky tests) labels Oct 16, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

connect tests Test fixes (including flaky tests)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants