Fix failing test step on AWS #678
trz42 merged 9 commits into EESSI:2023.06-software.eessi.io from casparvl:fix_memory_detection_testsuite_aws
bot: build repo:eessi.io-2023.06-software arch:zen2
Updates by the bot instance
New job on instance
Interactively, I got the correct output, from file
I'll try to manually submit some batch jobs on AWS to figure this out. It's tricky that we don't see this interactively... My bet is that it is due to the replacement with

Very strange, this small test job just works correctly:
With output:
bot: build repo:eessi.io-2023.06-software arch:zen2
…sible that inside the container, the /sys directory is different
The difference is that we are in a container... so we should use the info from the mounted directories of the host, not from the container's
bot: build repo:eessi.io-2023.06-software arch:zen2
bot: build repo:eessi.io-2023.06-software arch:zen2
…it from the containers' /proc is fine. So let's do that
bot: build repo:eessi.io-2023.06-software arch:zen2
bot: build repo:eessi.io-2023.06-software arch:zen3
bot: build repo:eessi.io-2023.06-software arch:zen2
bot: build repo:eessi.io-2023.06-software arch:zen4
bot: build repo:eessi.io-2023.06-software arch:x86_64/intel/haswell
trz42 left a comment
Looks fine overall. Added a little question about whether the script could be made a little 'smarter' (using /sys or /hostsys) and a suggestion to explain a bit what this /hostsys is (and where it comes from).
cgroup_v1_mem_limit="/hostsys/fs/cgroup/memory/$(</proc/self/cpuset)/memory.limit_in_bytes"
cgroup_v2_mem_limit="/hostsys/fs/cgroup/$(</proc/self/cpuset)/memory.max"
This path probably only makes sense if run in a very specific environment (e.g., testing software built for EESSI). While this is fine, how about checking whether /sys or /hostsys is available and using that?
If there were a comment that explains what /hostsys is and how it is made available, it might make debugging a little easier.
Yes, we bind-mount this additional path in bot/test.sh. You're absolutely right about the commenting part: I'll make that clear.
Regarding a fallback on /sys, I'm not sure if we want to do that. If /hostsys isn't there, it means the bind-mount failed or was not executed. I'd prefer a hard error there rather than a silent success that may extract the wrong amount of memory.
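For illustration, the hard-error behaviour argued for above could look roughly like the sketch below. This is not the script from this PR: the function name `resolve_mem_limit_file` and its two optional arguments (the sysfs root and the cpuset file) are hypothetical, added only so the logic can be tried outside the container. It uses `cat` rather than the bash-only `$(<file)` form so it also runs under plain `sh`.

```shell
# Sketch only: resolve the cgroup memory-limit file below the bind-mounted
# host /sys. The function name and its arguments are hypothetical.
resolve_mem_limit_file() {
    local hostsys="${1:-/hostsys}"
    local cpuset_file="${2:-/proc/self/cpuset}"
    local cpuset v1 v2

    # Hard error if the bind mount is missing: no silent fallback to /sys,
    # which inside the container may report the wrong limit.
    if [ ! -d "${hostsys}" ]; then
        echo "ERROR: ${hostsys} not found; was the host /sys bind-mounted?" >&2
        return 1
    fi

    cpuset="$(cat "${cpuset_file}")"
    v1="${hostsys}/fs/cgroup/memory/${cpuset}/memory.limit_in_bytes"
    v2="${hostsys}/fs/cgroup/${cpuset}/memory.max"

    # Prefer the cgroup v1 file if present, otherwise try cgroup v2.
    if [ -f "${v1}" ]; then
        echo "${v1}"
    elif [ -f "${v2}" ]; then
        echo "${v2}"
    else
        echo "ERROR: no cgroup memory limit file found (tried ${v1} and ${v2})" >&2
        return 1
    fi
}
```

Called without arguments, it uses /hostsys and /proc/self/cpuset, and a non-zero return code makes the missing bind mount visible instead of quietly reading the container's own /sys.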
Ok, added the description now
trz42 left a comment
Thanks for the update. Makes sense that we use /hostsys and don't fall back to /sys.
Currently, the test step on AWS fails because we fail to get a memory limit from the cgroup. I'll add some more verbose output as a first step to debugging this.
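As a rough idea of what such verbose output might print (a sketch under assumptions, not the code added in this PR), the helper below reports the cpuset and, for each candidate cgroup file, whether it exists and what it contains. The name `print_cgroup_debug` and its optional arguments are hypothetical; the two candidate paths follow the cgroup v1 and v2 layouts used elsewhere in this PR.

```shell
# Sketch only: print diagnostics for cgroup memory-limit detection.
# The function name and its arguments are hypothetical.
print_cgroup_debug() {
    local sysroot="${1:-/sys}"
    local cpuset_file="${2:-/proc/self/cpuset}"
    local cpuset candidate

    # Show the cpuset first: it determines both candidate paths.
    cpuset="$(cat "${cpuset_file}" 2>/dev/null)"
    echo "cpuset (${cpuset_file}): '${cpuset}'"

    for candidate in \
        "${sysroot}/fs/cgroup/memory/${cpuset}/memory.limit_in_bytes" \
        "${sysroot}/fs/cgroup/${cpuset}/memory.max"; do
        if [ -f "${candidate}" ]; then
            echo "found:   ${candidate} -> $(cat "${candidate}")"
        else
            echo "missing: ${candidate}"
        fi
    done
}
```

Running this both interactively and from a batch job makes it easier to see which of the two differs: the cpuset value or the files visible under the sysfs root.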