
Remove flaky arm64 test job #10953

Merged
jihoonson merged 1 commit into apache:master from suneet-s:arm64
Mar 8, 2021

Conversation

@suneet-s
Contributor

@suneet-s suneet-s commented Mar 5, 2021

This removes a flaky test job that was added in #10562

The travis job was added to test building Druid on Arm64 architecture. No tests are actually run as part of the job.

However, this job appears to fail about half of the time. My limited googling has not yielded any promising results. Since this impacts dev productivity, I propose we remove this job until we find out why it fails so often and fix it appropriately.

@suneet-s suneet-s added the Area - Dev and Flaky test labels Mar 5, 2021
@suneet-s
Contributor Author

suneet-s commented Mar 5, 2021

@nishantmonu51 @martin-g @himanshug FYI since this is reverting a change that you all participated in. Any concerns with this?

@martin-g
Member

martin-g commented Mar 5, 2021 via email

@himanshug
Contributor

himanshug commented Mar 7, 2021

thanks, reducing transient failures is good, so it is ok to [temporarily] remove it since no issues have been filed specifically for things not working on arm64. so, +1

that said, I would let @martin-g take a crack at fixing this as there might be something systemic wrong and build failure might actually be a true positive.

let us merge this towards the end of next week if things stay the same.

@zhangyue19921010
Contributor

zhangyue19921010 commented Mar 8, 2021

Nice catch. In my experience, this job 24 is a little bit flaky. This job often fails with /home/travis/.travis/functions: line 109: 6122 Killed. I am not sure what is happening yet, but a retry generally passes.
So +1 to remove this job temporarily until we find out why this test fails so often and fix it appropriately.

@martin-g
Member

martin-g commented Mar 8, 2021

According to https://www.howtobuildsoftware.com/index.php/how-do/b5CN/travis-ci-home-travis-buildsh-line-41-pid-killed-exit-code-137 error 137 means exhaustion of available system resources. Most of the time it is memory related.

It is interesting that all the failures are in the build of the last module - distribution.
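
For context on the 137 status discussed here: shells conventionally report 128 plus the signal number when a process is terminated by a signal, so 137 corresponds to signal 9 (SIGKILL), the signal the Linux out-of-memory killer sends. SIGKILL alone does not prove memory exhaustion, but it is consistent with it. A minimal sketch for decoding such a status, assuming bash; the function name signal_from_exit_code is hypothetical and not part of Druid or TravisCI:

```shell
#!/usr/bin/env bash
# Hypothetical helper: map a shell exit status to the signal that caused it.
# Statuses above 128 conventionally mean "terminated by signal (status - 128)".
signal_from_exit_code() {
  local status="$1"
  if [ "$status" -gt 128 ]; then
    # kill -l <number> prints the signal name without the SIG prefix.
    kill -l "$((status - 128))"
  else
    echo "none"
  fi
}

signal_from_exit_code 137
```

Running the same function on the exit status of a failed build step would show whether the step was killed by a signal or exited on its own.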

@zhangyue19921010
Contributor

> According to https://www.howtobuildsoftware.com/index.php/how-do/b5CN/travis-ci-home-travis-buildsh-line-41-pid-killed-exit-code-137 error 137 means exhaustion of available system resources. Most of the time it is memory related.
>
> It is interesting that all the failures are in the build of the last module - distribution.

Maybe we can move this job into Tests - phase 1? The resources of phase 1 may be more sufficient than phase 2?

@martin-g
Member

martin-g commented Mar 8, 2021

I haven't used stages before in TravisCI. I don't see anything in .travis.yml that configures resources for the stages.
But we can try it!
Another option is to add allow_failures for ARM64 until we have a better idea of why the job is killed.

@martin-g
Member

martin-g commented Mar 8, 2021

OK, I see how TravisCI stages work! IMO it would be even better to move the ARM64 job into a third/new stage so that it does not affect the other jobs.

@martin-g
Member

martin-g commented Mar 8, 2021

https://docs.travis-ci.com/user/common-build-problems/#my-build-script-is-killed-without-any-error - the max memory per job is 3 GB.
Is it an option to decrease -Xmx3000m in the install step to some smaller value?

install: MAVEN_OPTS='-Xmx3000m' travis_wait 15 ${MVN} clean install -q -ff -pl '!distribution' ${MAVEN_SKIP} ${MAVEN_SKIP_TESTS} -T1C && ${MVN} install -q -ff -pl 'distribution' ${MAVEN_SKIP} ${MAVEN_SKIP_TESTS}
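
One way to approach the -Xmx question is to derive the heap size from the job's memory limit instead of hard-coding it, leaving headroom for non-heap JVM memory (metaspace, thread stacks, code cache) and for other processes in the job. A rough sketch, assuming bash; the function name heap_mb_for_limit and the 75% share are illustrative assumptions, not values taken from the Druid build, and the 3072 MB input simply restates the 3 GB limit cited above:

```shell
#!/usr/bin/env bash
# Hypothetical helper: compute a JVM max heap (in MB) as a percentage of a memory limit.
heap_mb_for_limit() {
  local limit_mb="$1"   # total memory available to the job, in MB
  local percent="$2"    # share of that memory to give to the JVM heap
  echo $(( limit_mb * percent / 100 ))
}

# Assumed values: a 3 GB (3072 MB) limit with 75% of it given to the heap.
heap_mb="$(heap_mb_for_limit 3072 75)"
export MAVEN_OPTS="-Xmx${heap_mb}m"
echo "MAVEN_OPTS=${MAVEN_OPTS}"
```

Whether any particular share is small enough for the distribution module to build reliably would still have to be confirmed by repeated CI runs.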

@martin-g
Member

martin-g commented Mar 8, 2021

I've created #10958.
It uses the new AWS Graviton2-based ARM64 nodes at TravisCI. Hopefully they will be more stable than the old ARM64 nodes.

@suneet-s
Contributor Author

suneet-s commented Mar 8, 2021

> I've created #10958.
> It uses the new AWS Graviton2-based ARM64 nodes at TravisCI. Hopefully they will be more stable than the old ARM64 nodes.

Thanks for the fix @martin-g! Since this job is still failing, I think it would be better to remove this job till we have a fix with some confidence that it will work. This way we can think through the fix fully instead of trying to rush the fix. I'll be sure to review your change as soon as it is ready so we can bring this test job back.

> Maybe we can move this job into Tests - phase 1? The resources of phase 1 may be more sufficient than phase 2?

@zhangyue19921010 This job used to be in phase 1, but would fail and prevent all the integration tests from running. I moved it to phase 2 so that a committer wouldn't need to manually start every job in phase 2 if the phase 1 job is flaky.

@jihoonson
Contributor

jihoonson commented Mar 8, 2021

> https://docs.travis-ci.com/user/common-build-problems/#my-build-script-is-killed-without-any-error - the max memory per job is 3 GB.
> Is it an option to decrease -Xmx3000m to some smaller value

This kind of memory issue in CI requires trial-and-error experimentation. You could try adjusting the max memory to fit in the container. Please check https://docs.travis-ci.com/user/reference/overview/ first to see how much memory the container has for the build environment in use.

> Thanks for the fix @martin-g! Since this job is still failing, I think it would be better to remove this job till we have a fix with some confidence that it will work. This way we can think through the fix fully instead of trying to rush the fix. I'll be sure to review your change as soon as it is ready so we can bring this test job back.

Given that it could take some time to fix this issue, +1 for temporarily disabling this particular test.

@jihoonson
Contributor

Merging this PR as it blocks other PRs from getting merged.

@jihoonson jihoonson merged commit 756ac6e into apache:master Mar 8, 2021
@jihoonson jihoonson added this to the 0.21.0 milestone Apr 13, 2021
jihoonson pushed a commit to jihoonson/druid that referenced this pull request Apr 13, 2021
jihoonson added a commit that referenced this pull request Apr 13, 2021
Co-authored-by: Suneet Saldanha <suneet@apache.org>
2bethere pushed a commit to 2bethere/apache-druid that referenced this pull request Apr 18, 2022
... because at the moment they took 1h which is a little bit above the TravisCI limit of 50mins per job and because @clintropolis requested to add one more module - sql
2bethere pushed a commit to 2bethere/apache-druid that referenced this pull request Apr 18, 2022

Labels

Area - Dev, Development Blocker, Flaky test
