Skip to content

KAFKA-15764: Missing Tests for Transactions#14702

Merged
jolshan merged 6 commits intoapache:trunkfrom
jolshan:kafka-15764
Dec 15, 2023
Merged

KAFKA-15764: Missing Tests for Transactions#14702
jolshan merged 6 commits intoapache:trunkfrom
jolshan:kafka-15764

Conversation

@jolshan
Copy link
Copy Markdown
Member

@jolshan jolshan commented Nov 4, 2023

I ran this test 40 times without KAFKA-15653 with and without compression enabled.
With compression it failed 39/40 times and without it passed 40/40 times.

With the KAFKA-15653 and compression it passed 40/40 times locally

@jolshan jolshan force-pushed the kafka-15764 branch 2 times, most recently from 7e64548 to 1c7a52b Compare November 6, 2023 18:45
@jolshan jolshan marked this pull request as ready for review November 8, 2023 18:40
props.put(KafkaConfig.AutoLeaderRebalanceEnableProp, false.toString)
props.put(KafkaConfig.GroupInitialRebalanceDelayMsProp, "0")
props.put(KafkaConfig.TransactionsAbortTimedOutTransactionCleanupIntervalMsProp, "200")
props.put(KafkaConfig.NumNetworkThreadsProp, 2.toString)
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I need to see if changing these values causes issues for the other tests.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I ran the other tests about 20 times locally and didn't see issues.

@jolshan jolshan changed the title DRAFT: KAFKA-15764: Missing Tests for Transactions KAFKA-15764: Missing Tests for Transactions Nov 8, 2023
Comment thread core/src/test/scala/integration/kafka/api/TransactionsTest.scala Outdated
transactionalCompressionProducers += createTransactionalProducer("transactional-compression-producer-" + i.toString, compressionType = "snappy")
}

// KAFKA-15653 is triggered more easily with replication factor 1
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you know why?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This made sense to me before, but I'm trying to remember why. 😅

Is it possible that when we are building and sending the response we could receive the callback request and that means it will be handled on a different thread? Vs when acks=all, we free the thread after sending out the verification, so it can be available to receive the callback?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure I follow that. Whether there are any replicas or not, wouldn't the request thread will be freed after sending out the verification? One way I could see the case being more likely is if we added additional request threads. I don't think there's any affinity in the callback to the original request thread, is there? So more available request threads probably means a greater chance a separate thread would handle the callback. Does that make sense?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry I think I got confused about when the response is being sent. I will say that adding more request threads didn't work and I had to lower them to get it to reproduce frequently?

I can try it the other way around, but iirc, there were a lot of combinations that were not as consistent as it is now.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My take is that this test should probably just test compression of transactional data with the default settings. I don't think it necessarily needs to hit the case from KAFKA-15653. It seems like it doesn't do so reliably anyway. Are there lower level tests where we can validate callback safety?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't see it fail much at all with the default settings. It was like once in 60 times or less which was not super useful. There are lower level tests for the callback using the correct requestLocal.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, what I'm saying is that this test seems more useful as a general validation of transactional produce with compression. From that perspective, the default settings are the most useful to test. For KAFKA-15653, the lower level tests seem sufficient.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I set back the default settings and it seems to trigger consistently again anyway. Maybe I just tricked myself into thinking I needed the other configs.

Copy link
Copy Markdown
Contributor

@hachikuji hachikuji left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@jolshan
Copy link
Copy Markdown
Member Author

jolshan commented Dec 6, 2023

I've restarted the build every day since the approval and have yet to get a clean build 😵‍💫

@jolshan
Copy link
Copy Markdown
Member Author

jolshan commented Dec 15, 2023

I'm convinced this PR is cursed. Still failing builds. :(

@jolshan
Copy link
Copy Markdown
Member Author

jolshan commented Dec 15, 2023

Given that this failed on a storage test (storage tests shouldn't be affected here), and that it was able to build on some of the many other runs I tried, I am going to merge this after 12 rebuilds. 😵

@jolshan jolshan merged commit ed7ad6d into apache:trunk Dec 15, 2023
lucasbru added a commit to lucasbru/kafka that referenced this pull request Dec 17, 2023
@lucasbru
Copy link
Copy Markdown
Member

@jolshan This PR might indeed be cursed. We have been seeing a lot of failures of TransactionsWithTieredStoreTest.testTransactionsWithCompression on trunk, and it seems to start with this PR. I see how this PR can influence the test via the change in TestUtils. The bad part is that it seems to kill the Gradle Executors completely.

#15029

https://ge.apache.org/scans/tests?search.relativeStartTime=P28D&search.rootProjectNames=kafka&search.timeZoneId=Europe%2FBerlin&tests.container=org.apache.kafka.tiered.storage.integration.TransactionsWithTieredStoreTest&tests.test=testTransactionsWithCompression(String)%5B2%5D

lucasbru added a commit that referenced this pull request Dec 18, 2023
This reverts commit ed7ad6d.

We have been seeing a lot of failures of TransactionsWithTieredStoreTest.testTransactionsWithCompression on trunk, and it seems to start with this PR. I see how this PR can influence the test via the change in TestUtils. The bad part is that sometimes seems to kill the Gradle Executors completely. So I'd suggest reverting the change before investigating further to stabilize CI.

Reviewers: Bruno Cadonna <cadonna@apache.org>
yyu1993 pushed a commit to yyu1993/kafka that referenced this pull request Feb 15, 2024
I ran this test 40 times without KAFKA-15653 with and without compression enabled.
With compression it failed 39/40 times and without it passed 40/40 times.

With the KAFKA-15653 and compression it passed 40/40 times locally

Reviewers: Jason Gustafson <jason@confluent.io>
yyu1993 pushed a commit to yyu1993/kafka that referenced this pull request Feb 15, 2024
…pache#15029)

This reverts commit ed7ad6d.

We have been seeing a lot of failures of TransactionsWithTieredStoreTest.testTransactionsWithCompression on trunk, and it seems to start with this PR. I see how this PR can influence the test via the change in TestUtils. The bad part is that sometimes seems to kill the Gradle Executors completely. So I'd suggest reverting the change before investigating further to stabilize CI.

Reviewers: Bruno Cadonna <cadonna@apache.org>
AnatolyPopov pushed a commit to aiven/kafka that referenced this pull request Feb 16, 2024
I ran this test 40 times without KAFKA-15653 with and without compression enabled.
With compression it failed 39/40 times and without it passed 40/40 times.

With the KAFKA-15653 and compression it passed 40/40 times locally

Reviewers: Jason Gustafson <jason@confluent.io>
AnatolyPopov pushed a commit to aiven/kafka that referenced this pull request Feb 16, 2024
…pache#15029)

This reverts commit ed7ad6d.

We have been seeing a lot of failures of TransactionsWithTieredStoreTest.testTransactionsWithCompression on trunk, and it seems to start with this PR. I see how this PR can influence the test via the change in TestUtils. The bad part is that sometimes seems to kill the Gradle Executors completely. So I'd suggest reverting the change before investigating further to stabilize CI.

Reviewers: Bruno Cadonna <cadonna@apache.org>
clolov pushed a commit to clolov/kafka that referenced this pull request Apr 5, 2024
I ran this test 40 times without KAFKA-15653 with and without compression enabled.
With compression it failed 39/40 times and without it passed 40/40 times.

With the KAFKA-15653 and compression it passed 40/40 times locally

Reviewers: Jason Gustafson <jason@confluent.io>
clolov pushed a commit to clolov/kafka that referenced this pull request Apr 5, 2024
…pache#15029)

This reverts commit ed7ad6d.

We have been seeing a lot of failures of TransactionsWithTieredStoreTest.testTransactionsWithCompression on trunk, and it seems to start with this PR. I see how this PR can influence the test via the change in TestUtils. The bad part is that sometimes seems to kill the Gradle Executors completely. So I'd suggest reverting the change before investigating further to stabilize CI.

Reviewers: Bruno Cadonna <cadonna@apache.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants