
Updates for EESSI compat layer 2022.11 #160

Closed
trz42 wants to merge 10 commits into EESSI:main from trz42:eessi-hpc.org-2022.11
Contributor

@trz42 trz42 commented Nov 3, 2022

This includes a number of updates to the current version (2021.12):

File bootstrap-prefix.sh:

  • The script bootstrap-prefix.sh has been synced with the upstream version (https://github.com/gentoo/prefix.git) up to a recent update on Nov 2 2022. The script here still includes a few changes that we may want to revisit, i.e. check whether they are still necessary or can be removed so that we can just use the upstream script.
  • Changes that are still present (diff gentoo-prefix/scripts/bootstrap-prefix.sh EESSI-compatibility-layer/bootstrap-prefix.sh):
590c590,593
<       local theshell=${SHELL##*/}
---
>       # from gentoo/prefix
>       #local theshell=${SHELL##*/}
>       # from EESSI/compatibility-layer
>       local theshell=$(echo "${SHELL##*/}" | cut -f1 -d ' ')
2267,2268c2270,2271
<       DISTFILES_G_O="http://distfiles.prefix.bitzolder.nl"
<       DISTFILES_PFX="http://distfiles.prefix.bitzolder.nl/prefix"
---
>       DISTFILES_G_O="https://distfiles.prefix.bitzolder.nl"
>       DISTFILES_PFX="https://distfiles.prefix.bitzolder.nl/prefix"
2825c2828
<                       [[ ${TODO} == 'noninteractive' ]] && ans=yes ||
---
>                       [[ ${TODO} == 'noninteractive' ]] && ans=no ||

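To illustrate the first EESSI-specific change in the diff above: `${SHELL##*/}` strips everything up to the last slash, and the extra `cut` keeps only the first space-separated word. A minimal sketch follows; the function name `shell_name` and the sample values are made up for illustration and are not part of bootstrap-prefix.sh.

```shell
# Hypothetical helper, not part of bootstrap-prefix.sh: mirrors the EESSI
# variant of the "theshell" assignment for an arbitrary SHELL-like value.
shell_name() {
    shell_value="$1"
    # ${shell_value##*/} removes the longest prefix ending in "/";
    # cut then keeps only the first space-separated field.
    echo "${shell_value##*/}" | cut -f1 -d ' '
}

# A plain path gives the same result with or without the cut.
shell_name "/bin/bash"
# With a trailing argument, the expansion alone would leave "bash --login";
# the cut reduces it to "bash".
shell_name "/bin/bash --login"
```

Whether a SHELL value with trailing arguments still occurs in the build environment is one of the things to check when deciding if this change can be dropped in favour of the upstream line.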
File ansible/playbooks/roles/compatibility_layer/defaults/main.yml:

  • The version has been bumped to 2022.11. (eessi_version: "2022.11")
  • The gentoo overlay currently uses a fork (https://github.com/trz42/gentoo-overlay.git). This can be reverted after gentoo-overlay#84 ("Overlay for compat layer EESSI 2023.02") has been merged. (custom_overlays.url: https://github.com/trz42/gentoo-overlay.git)
  • The gentoo commit being used (see gentoo_git_commit: cec3214ef5d5661e28c9d2c5b5750b27c27c5435) has been updated to a more recent one (from Nov 3 2022). Some information about the history of the commits used was added; it can be removed if deemed unnecessary.
  • The default gcc version has been increased to 10.4.0 (see prefix_default_gcc: 10.4.0). (Requirement in upstream bootstrap-prefix.sh script.)
  • However, the version of gcc being used had to be restricted to anything before 10.4.1 because some 10.4.1_p* ebuilds were added to https://github.com/gentoo/gentoo/sys-devel/gcc recently. While those installed fine during the bootstrap stages 2/3, it was not immediately clear how to set them as default with gcc-config. Simply setting them via prefix_default_gcc (see item on default gcc) didn't work. (see line >=sys-devel/gcc-10.4.1 for prefix_mask_packages:)
  • The setting for prefix_singularity_command was changed to clear LD_LIBRARY_PATH. The bootstrap-prefix.sh in 2021.12 had a typo in the check whether LD_LIBRARY_PATH is set, and hence carried on even if it was set. The updated bootstrap-prefix.sh checks the correct variable.
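The LD_LIBRARY_PATH item can be illustrated with a small sketch. It assumes that the bootstrap script refuses to continue when the variable is non-empty; the function `check_ld_library_path` is a hypothetical stand-in for that check, not code taken from bootstrap-prefix.sh, and a subshell with an emptied variable stands in for `singularity exec --env LD_LIBRARY_PATH= ...`.

```shell
# Hypothetical stand-in for the environment check (assumption: the real
# script stops when LD_LIBRARY_PATH is non-empty).
check_ld_library_path() {
    if [ -n "${LD_LIBRARY_PATH:-}" ]; then
        echo "LD_LIBRARY_PATH is set, refusing to continue" >&2
        return 1
    fi
    return 0
}

# With the variable set in the child environment, the check fails.
( LD_LIBRARY_PATH=/opt/lib; export LD_LIBRARY_PATH; check_ld_library_path ) \
    || echo "check failed"

# With the variable emptied for the child only, the check passes.
( LD_LIBRARY_PATH=; export LD_LIBRARY_PATH; check_ld_library_path ) \
    && echo "check passed"
```

Note that an empty value and an unset variable are treated the same by a non-emptiness test like this one, which is why passing an empty value would be sufficient under this assumption.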

File ansible/playbooks/roles/compatibility_layer/tasks/install_packages.yml:

  • Task to change file ownership needed root permissions. (Could be an issue with testing setup. Might be good to verify if it is needed in other environments.)

truib added 7 commits October 29, 2022 19:38
Main changes:
- new version 2022.11
- using trz42/gentoo-overlay until new sets have been added to EESSI/gentoo-overlay
- using a recent commit to gentoo/gentoo.git plus adding some comments about history of used commits
- setting a new version of the default gcc to 10.4.0 (from 9.4.0)
- masking sys-devel/gcc greater or equal to 10.4.1 (avoiding the issue that a newer package is installed in bootstrap stages 2/3 and then the default gcc cannot be set with gcc-config)
- adding clearance of LD_LIBRARY_PATH variable to singularity command (otherwise the bootstrap-prefix.sh script fails if LD_LIBRARY_PATH is set)
- change `become` to true for task changing ownership of installed files
- using recent bootstrap-prefix.sh script from gentoo/prefix.git with three changes from EESSI (setting of theshell in bootstrap_startscript, using https for setting DISTFILES_{G_O,PFX}, setting ans to no in line 2839 {question about stable packages keywords})
Changes by commit c43a5bf0d00d3cb2a0452c35b5c42bf4fc4cfc9f
to gentoo/prefix/scripts/bootstrap-prefix.sh

amadio commented Nov 3, 2022

> The default gcc version has been increased to 10.4.0 (see prefix_default_gcc: 10.4.0). (Requirement in upstream bootstrap-prefix.sh script.)

You know you don't need to stick to that after installing, right? You can and should update to a newer version of GCC if you want. Only the bootstrap process is a bit finicky, but once you bootstrap, you are usually free from the constraints. You can also install multiple versions of GCC (e.g. GCC 12 for CPU, GCC 11 for CUDA), which you can use later as gcc-$version on the command line even if they are not the default.

Contributor Author

trz42 commented Nov 3, 2022

> > The default gcc version has been increased to 10.4.0 (see prefix_default_gcc: 10.4.0). (Requirement in upstream bootstrap-prefix.sh script.)
>
> You know you don't need to stick to that after installing, right? You can and should update to a newer version of GCC if you want. Only the bootstrap process is a bit finicky, but once you bootstrap, you are usually free from the constraints. You can also install multiple versions of GCC (e.g. GCC 12 for CPU, GCC 11 for CUDA), which you can use later as gcc-$version on the command line even if they are not the default.

Good to know. I think for EESSI we only need one version, which is then used to build up toolchains (incl. compilers) in the software layer. In the past we ran into issues with too recent versions of GCC in that step. Hence, we'd rather restrict ourselves to the oldest version of GCC in the compat layer. That might change in the future when we move to more recent toolchains in the software layer.


amadio commented Nov 3, 2022

I see. I suggest trying to use the toolchain from the compat layer to build the software layer, to avoid having to add ld wrappers, etc., since the toolchain is already configured to work well with the non-standard glibc and linker. It will also let you control binutils with eselect-binutils, etc. I read the diffs, and the reason to use <GCC-11 is that GCC-11 needs C++11 support to build, which may not be available on old OSs like CentOS 7. Other than that, you should be able to remove the restriction, as I believe other compilation problems have been solved.

  - name: eessi
  source: git
- url: https://github.com/EESSI/gentoo-overlay.git
+ url: https://github.com/trz42/gentoo-overlay.git
Contributor

Just pointing this out, so we don't forget to revert this back to EESSI after EESSI/gentoo-overlay#84 is merged...

Contributor

@trz42 How do you deal with the branch aspect? Or did you just merge the eessi-2022.11 branch you used for EESSI/gentoo-overlay#84 into your main branch?

Contributor

To answer my own question: yes, you've just updated the main branch in your fork with EESSI/gentoo-overlay#84, OK.

Contributor

boegel left a comment

It looks like the install_prefix.yml task will also need a change to fix the ansible-lint check?

Jinja templates should only be at the end of 'name'

  # stick to GCC 9.x; using a too recent compiler in the compat layer complicates stuff in the software layer,
  # see for example https://github.com/EESSI/software-layer/issues/151
- >=sys-devel/gcc-10
+ >=sys-devel/gcc-10.4.1
Contributor

Comment above should be updated accordingly?
But maybe keep the pointer to EESSI/software-layer#151 since that can help explain why we prefer sticking to an older GCC version.

@trz42 Did you try to build GCC/9.3.0 on top of this compat layer with GCC 10.x as system compiler?

Contributor Author

Good catch about the comment. We should keep the pointer, yes.

Yes, I've built GCC/9.3.0 and then also software on top of it. See trz42/software-layer#42

Contributor Author

Updated comment.

# trying to set 10.4.1_p20221006 fails
# we mask sys-devel/gcc below to not install anything newer than 10.4.0
# gentoo_git_commit: c2d8ce0e1b6206a225a9f2547bbc65c79218756c
# 2022.11 (Nov 3 2022) second iteration made for PR
Contributor

@trz42 I'm not sure we should keep this whole history, but the problems with the patched 10.4.1 versions would be good to highlight somehow. I think sticking to 10.4.0 makes sense.

The 10.4.1 versions are test builds I think... Maybe @amadio can clarify?

Contributor Author

We can prune the comments. For me they add some context right now while we get going again. If we update more frequently or even build compat layers with a bot, they might not provide much useful context, or there may be other ways to document the history.

Contributor Author

Pruned some comments.


GCC ebuilds with _p in the version are versions of GCC with extra patches by Gentoo, usually to solve bugs, so you can consider them similar to the other versions. For example sys-devel/gcc-13.0.0_pre20221030 was added to fix https://bugs.gentoo.org/879049. Also, the only reason that GCC 10.x is used for bootstrapping is because GCC 11 requires C++11 to build, and that itself requires a new enough compiler that may not be available on old systems. You don't have to stick to older versions of GCC in EESSI. You can just leave the masks free and choose GCC by adding a slotted version to your set, like sys-devel/gcc:11 if you want GCC 11.x but don't care which minor version. It's quite safe to keep it updated within the same major version (i.e. that won't break packages built on top of the compat layer).

  local: "{{ playbook_dir }}/../../bootstrap-prefix.sh"
  remote: /tmp/bootstrap-prefix.sh
- prefix_singularity_command: "singularity exec -B {{ gentoo_prefix_path }}:{{ gentoo_prefix_path }}"
+ prefix_singularity_command: "singularity exec --env LD_LIBRARY_PATH= -B {{ gentoo_prefix_path }}:{{ gentoo_prefix_path }}"
Contributor

@trz42 We should add a comment above this line to clarify why LD_LIBRARY_PATH is explicitly set to empty?

Contributor Author

Because otherwise the bootstrap script simply stops running. It's explained in the PR.

Contributor Author

Added a comment.

  path: "{{ gentoo_prefix_path }}"
  recurse: true
- become: false
+ become: true
Contributor

Here too we should clarify why true (or false) is needed; this become bit is a bit cryptic, I think...

Contributor Author

If the ansible script is run as a normal user, it cannot change ownership of files it doesn't own. That's the purpose of the task ... to change ownership to the user running the task.

Contributor Author

Added a comment.
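The reasoning in this thread can be checked with a small sketch: an unprivileged user can only change the ownership of files it already owns, so any file under the prefix that belongs to someone else means the task needs privilege escalation. The function name `has_foreign_files` and the scratch directory are hypothetical and only serve as an illustration.

```shell
# Hypothetical helper: succeeds (exit 0) if at least one entry under the
# given directory is NOT owned by the current user, i.e. a recursive chown
# to the current user would need root.
has_foreign_files() {
    dir="$1"
    # find's -user test also accepts a numeric user ID.
    first_foreign=$(find "$dir" ! -user "$(id -u)" 2>/dev/null | head -n 1)
    [ -n "$first_foreign" ]
}

# Example on a scratch directory created by the current user.
scratch=$(mktemp -d)
touch "$scratch/file"
if has_foreign_files "$scratch"; then
    echo "ownership change needs root (become: true)"
else
    echo "all files already belong to uid $(id -u)"
fi
rm -rf "$scratch"
```

Running such a check against the actual prefix directory in other environments would be one way to verify whether `become: true` is really needed there.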


- - include_tasks: install_prefix.yml
+ - name: Include task install_prefix.yml
+   ansible.builtin.include_tasks: install_prefix.yml
Contributor

I guess this was changed for a similar reason as community.general.portage above?

Contributor Author

That was changed to make ansible-lint happy.

Contributor Author

trz42 commented Nov 3, 2022

I'd rather reconfigure the ansible-lint check to ignore the Jinja template issue. The script has run fine a dozen times. The "ERROR" only states that templates should be at the end, so it doesn't matter. I'm also curious why we need to fix this here. Maybe it is better to do a code quality hackathon than to impose unrelated code improvements on PRs.
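A minimal sketch of what such a reconfiguration could look like, written as a shell helper that generates the configuration file. `skip_list` is an ansible-lint configuration option; the rule identifier `name[template]` is assumed to be the one behind the "Jinja templates should only be at the end of 'name'" message, and the helper name `write_lint_config` and the scratch directory are hypothetical.

```shell
# Hypothetical helper: write a minimal ansible-lint configuration into the
# given directory. The rule ID name[template] is an assumption about which
# rule produces the message discussed above.
write_lint_config() {
    target_dir="$1"
    cat > "$target_dir/.ansible-lint" <<'EOF'
---
skip_list:
  - name[template]
EOF
}

# Example: generate the file in a scratch directory and show it.
demo_dir=$(mktemp -d)
write_lint_config "$demo_dir"
cat "$demo_dir/.ansible-lint"
rm -rf "$demo_dir"
```

Skipping the rule in the configuration keeps the CI check green without touching task names in an unrelated PR, at the cost of hiding the same finding everywhere else in the repository.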

Contributor Author

trz42 commented Nov 7, 2022

Some possibly interesting findings from building the compat layer:

  • Wasn't able to build direnv easily, so removed it from the overlay sets. nss and stop built without any issues. rpm wasn't added as it didn't seem needed for GPU support.
  • Building a compat layer from scratch was failing with the latest gentoo commits (Nov 05, Nov 03); it only worked again after going back to a commit from Oct 28 ... on x86_64.
  • On aarch64 I ran into the issue that community.general.portage is not a known Ansible module action, so that change will have to be reverted until a proper fix is found. Previously, only portage was used, but ansible-lint complained with something like:
fqcn[action]: Use FQCN for module actions, such `<namespace>.<collection>.portage`. (warning)
[1281](https://github.com/EESSI/compatibility-layer/actions/runs/3385280194/jobs/5623244261#step:4:1282)
ansible/playbooks/roles/compatibility_layer/tasks/add_overlay.yml:22 Action `portage` is not FQCN.
[1282](https://github.com/EESSI/compatibility-layer/actions/runs/3385280194/jobs/5623244261#step:4:1283)
  • A bit curious what could be the issue: Are the containers we use to build the compat layer for x86_64 and aarch64 different wrt the ansible installation?

Contributor Author

trz42 commented Apr 3, 2023

This PR has become obsolete with new developments in early 2023.

@trz42 trz42 closed this Apr 3, 2023