Issue #10953: ARM64 - Build and test on AWS Graviton2 node #10958

Closed
martin-g wants to merge 2 commits into apache:master from martin-g:feature/arm64-on-graviton2

Conversation

@martin-g
Member

@martin-g martin-g commented Mar 8, 2021

@jihoonson
Contributor

Please check the CI failure for the ARM build. It seems legit.

[INFO] Downloading binary from https://github.com/sass/node-sass/releases/download/v4.13.1/linux-arm64-64_binding.node
[ERROR] Cannot download "https://github.com/sass/node-sass/releases/download/v4.13.1/linux-arm64-64_binding.node": 
[ERROR] 
[ERROR] HTTP error 404 Not Found
[ERROR] 
[ERROR] Hint: If github.com is not accessible in your location
[ERROR]       try setting a proxy via HTTP_PROXY, e.g. 
[ERROR] 
[ERROR]       export HTTP_PROXY=http://example.com:1234
[ERROR] 
[ERROR] or configure npm proxy via
[ERROR] 
[ERROR]       npm config set proxy http://example.com:8080
...
[ERROR] gyp verb check python checking for Python executable "python2" in the PATH
[ERROR] gyp verb `which` failed Error: not found: python2
[ERROR] gyp verb `which` failed     at getNotFoundError (/home/travis/build/apache/druid/web-console/node_modules/which/which.js:13:12)
[ERROR] gyp verb `which` failed     at F (/home/travis/build/apache/druid/web-console/node_modules/which/which.js:68:19)
[ERROR] gyp verb `which` failed     at E (/home/travis/build/apache/druid/web-console/node_modules/which/which.js:80:29)
[ERROR] gyp verb `which` failed     at /home/travis/build/apache/druid/web-console/node_modules/which/which.js:89:16
[ERROR] gyp verb `which` failed     at /home/travis/build/apache/druid/web-console/node_modules/isexe/index.js:42:5
[ERROR] gyp verb `which` failed     at /home/travis/build/apache/druid/web-console/node_modules/isexe/mode.js:8:5
[ERROR] gyp verb `which` failed     at FSReqWrap.oncomplete (fs.js:154:21)
...

@martin-g
Member Author

There is no node-sass binary for Linux Aarch64 - https://github.com/sass/node-sass/releases/tag/v4.14.1
It is interesting that this didn't fail before (on the non-Graviton2 ARM64 TravisCI nodes)?!

@martin-g
Member Author

martin-g commented Mar 15, 2021

I've reworked the PR to not build the Web Console, so node-sass is no longer a problem, but now it fails because of the Hyperic Sigar library. This library has no native binaries for aarch64. I really wonder how the old builds passed at all.
Hyperic Sigar has not been maintained since 2016, so there is no chance of ARM64 support being added. There are 3 pull requests for it, all stale.
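For illustration, code that depends on a library with native binaries could check the JVM's `os.arch` property before loading it. This is a minimal sketch only: the class and method names are hypothetical and are not taken from Druid or Sigar.

```java
import java.util.Locale;

public class ArchCheck {
    // Hypothetical helper: true when the given os.arch value denotes 64-bit ARM.
    // Linux JVMs typically report "aarch64"; some platforms report "arm64".
    public static boolean isArm64(String osArch) {
        String arch = osArch.toLowerCase(Locale.ROOT);
        return arch.equals("aarch64") || arch.equals("arm64");
    }

    public static void main(String[] args) {
        String arch = System.getProperty("os.arch");
        System.out.println(arch + " -> ARM64: " + isArm64(arch));
    }
}
```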

@martin-g
Member Author

With the changes in this PR I am able to run the full build and tests on my Linux ARM64 VM!
But at TravisCI org.apache.druid.segment.virtual.ExpressionVectorSelectorsTest fails with OutOfMemoryError in the heap space.
According to Async Profiler, most memory allocations are caused by org.apache.druid.java.util.common.StringUtils#format(). I think this method is overused - most of the time simple String concatenation could be used. String#format() is both slow and memory intensive.
I've simplified this test a bit to not use StringUtils#format(), but this is not enough - it still fails with OOME.
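To make the trade-off concrete, here is a small sketch using the JDK's `String.format` directly (not Druid's `StringUtils`). The class name, method names and the label pattern are made up for the example. Both methods build the same string, but the first has to parse the format pattern on every call.

```java
import java.util.Locale;

public class FormatVsConcat {
    // Builds the label with String.format, which parses the pattern on each call.
    public static String withFormat(String column, int row) {
        return String.format(Locale.ENGLISH, "%s[%d]", column, row);
    }

    // Builds the same label with plain concatenation, with no pattern parsing.
    public static String withConcat(String column, int row) {
        return column + "[" + row + "]";
    }

    public static void main(String[] args) {
        System.out.println(withFormat("long1", 7)); // long1[7]
        System.out.println(withConcat("long1", 7)); // long1[7]
    }
}
```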

@martin-g
Member Author

On my ARM64 VM I am able to run org.apache.druid.segment.virtual.ExpressionVectorSelectorsTest even with -Xmx384m.
At TravisCI the tests run with -Xmx3g - I have the feeling that there might be a memory leak from the previous tests.
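One way to confirm which heap limit a test JVM actually received is to print `Runtime.getRuntime().maxMemory()` from inside it. A minimal sketch, with a hypothetical class name; the value reported is approximate and depends on the JVM and garbage collector:

```java
public class HeapLimit {
    // Returns the maximum heap the current JVM will attempt to use, in MiB.
    public static long maxHeapMiB() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    public static void main(String[] args) {
        System.out.println("Max heap: " + maxHeapMiB() + " MiB");
    }
}
```

Running it as `java -Xmx384m HeapLimit` and again with `-Xmx3g` shows whether the flag reached the process.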

@martin-g
Member Author

The TravisCI job for testing on ARM64 has been reworked and passes successfully now: https://travis-ci.com/github/apache/druid/builds/221285346
The new job builds and tests just the main sub-modules because:

  • running all modules does not finish in 50 mins (Travis max time for a job)
  • sql module fails because SqlResourceTest#testTooManyRequests() succeeds for all 3 requests. This is because TravisCI ARM64 nodes have reduced CPU power and can run up to 2 threads
  • the distribution job is not executed on ARM64 because of the flaky failures at Remove flaky arm64 test job #10953
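The second point can be illustrated with a toy concurrency limiter. This is not how Druid's SqlResource or its test is implemented; `RequestLimiter` and its methods are hypothetical. It only shows that a "too many requests" rejection needs requests to overlap: if a slow node ends up running them one after another, every request is admitted.

```java
import java.util.concurrent.Semaphore;

public class RequestLimiter {
    private final Semaphore slots;

    public RequestLimiter(int maxConcurrent) {
        this.slots = new Semaphore(maxConcurrent);
    }

    // Returns true if the request is admitted, false if the limit is reached.
    public boolean tryStart() {
        return slots.tryAcquire();
    }

    // Marks an admitted request as finished and frees its slot.
    public void finish() {
        slots.release();
    }

    // Requests run back to back: each one finishes before the next starts.
    public static int rejectedWhenSequential(int limit, int requests) {
        RequestLimiter limiter = new RequestLimiter(limit);
        int rejected = 0;
        for (int i = 0; i < requests; i++) {
            if (limiter.tryStart()) {
                limiter.finish();
            } else {
                rejected++;
            }
        }
        return rejected;
    }

    // Requests overlap: none finishes before the last one has started.
    public static int rejectedWhenOverlapping(int limit, int requests) {
        RequestLimiter limiter = new RequestLimiter(limit);
        int rejected = 0;
        for (int i = 0; i < requests; i++) {
            if (!limiter.tryStart()) {
                rejected++;
            }
        }
        return rejected;
    }

    public static void main(String[] args) {
        System.out.println(rejectedWhenSequential(2, 3));  // 0
        System.out.println(rejectedWhenOverlapping(2, 3)); // 1
    }
}
```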

@martin-g
Member Author

martin-g commented Apr 2, 2021

Any feedback?

@suneet-s suneet-s closed this Apr 8, 2021
@suneet-s suneet-s reopened this Apr 8, 2021
@suneet-s
Contributor

suneet-s commented Apr 8, 2021

Closed and re-opened to see if it would re-trigger travis. It appears that it hasn't :(

@martin-g could you merge master back in to this PR so we can see what the test run on Travis would look like? Since this test job only runs unit tests, we still can't validate that Druid would work correctly on ARM64 machines. Was this the intention of this PR? Or did you just want to validate that Druid compiles on these types of machines?

@martin-g
Member Author

martin-g commented Apr 8, 2021

@suneet-s I've rebased the branch onto the latest master and force-pushed, but this also didn't trigger a build at Travis for some reason.
I see that the last build for this PR was 13 days ago - https://travis-ci.com/github/apache/druid/pull_requests

I just pushed an empty commit (non-force) but again nothing ...

I needed to reduce the built projects for ARM64 because otherwise I have to create a separate job for each "feature" (processing, indexing, server, ...).
Another option is to add something like https://github.com/apache/velocity-engine/blob/master/.travis.yml.disabled#L21-L23 but then the build matrix will explode and the build time will double.

Please let me know if you think it would be good to add more test jobs for the "features"/modules that are not covered yet.

@martin-g martin-g closed this Apr 12, 2021
@martin-g martin-g reopened this Apr 12, 2021
@martin-g
Member Author

@suneet-s I've created a new PR (#11094) but it didn't trigger a Travis build either. It might be related to the fact that my personal account at TravisCI is out of free credits :-/

@martin-g
Member Author

I've created a new GitHub user (martin-g2) and a new PR (#11095) but still no TravisCI build ...

@suneet-s
Contributor

@martin-g I think you don't need any travis credits to run the tests. The tests should be running on Apache's account. I don't have any suggestions on how to trick travis into kicking off the tests other than what you've already tried. Let me think about it some more, and if I can come up with a way, I'll comment on this PR

@martin-g
Member Author

Closing in favor of #11109
@suneet-s The problem was that .travis.yml got corrupted due to wrong indentation. I had created a diff from master to my branch and opened a new PR from it. This also didn't help, and then I realized that it must be something YAML related. Please review #11109

@martin-g martin-g closed this Apr 13, 2021
@martin-g martin-g deleted the feature/arm64-on-graviton2 branch April 13, 2021 07:02