Gromacs EESSI on top of CSCS GROMACS library test #156
Closed
Commits (18):
- db3758e: Initial try of mpi hello world test
- fc7e8a0 (casparvl): Merge branch 'master' of github.com:EESSI/software-layer
- 29f4cd2 (casparvl): Merge remote-tracking branch 'upstream/main'
- 13441c9: Merge branch 'EESSI:main' into master
- f5a351c (casparvl): Added updated hooks, utils and settings
- 794ffb4: Added GROMACS check written on top of CSCS GROMACS libtest
- f4f7c93: Removed unneeded files
- b05b011: Added tags. Still do: tags that represent computation footprint. Diff…
- 396c4d1: More clear skip messages. Added config for magic castle cluster
- 051edac: Would like to use remote CPU detection, but its not working right now…
- 3176bb2: Probably not the most elegant, but for now: just rais error and quit …
- 9f5a9e5: update setttings files, remove system specific files
- b01511f: Remove settings file, since we already have the magic castle settings…
- cd76266: Now that ReFrame supports describing processor topology in the config…
- 36f9622: removed MPI hello world test, it was just there because I wanted to r…
- 38f0009 (casparvl): Merge branch 'EESSI:main' into gromacs_cscs
- f198d96: Split the auto-assign hook into two separate ones. For GROMACS test o…
- ae91ff9: Add an example config file for a system with GPUs
New file (an example ReFrame configuration for a system with GPUs, added in commit ae91ff9):

```python
from os import environ

username = environ.get('USER')

# This is an example configuration file
site_configuration = {
    'systems': [
        {
            'name': 'example',
            'descr': 'Example cluster',
            'modules_system': 'lmod',
            'hostnames': ['int*', 'tcn*'],
            'stagedir': '/tmp/reframe_output/staging',
            'partitions': [
                {
                    'name': 'cpu',
                    'scheduler': 'slurm',
                    'launcher': 'mpirun',
                    'access': ['-p cpu'],
                    'environs': ['builtin'],
                    'max_jobs': 4,
                    'processor': {
                        'num_cpus': 128,
                        'num_sockets': 2,
                        'num_cpus_per_socket': 64,
                        'arch': 'znver2',
                    },
                    'descr': 'CPU partition'
                },
                {
                    'name': 'gpu',
                    'scheduler': 'slurm',
                    'launcher': 'mpirun',
                    'access': ['-p gpu'],
                    'environs': ['builtin'],
                    'max_jobs': 4,
                    'processor': {
                        'num_cpus': 72,
                        'num_sockets': 2,
                        'num_cpus_per_socket': 36,
                        'arch': 'icelake',
                    },
                    'devices': [
                        {
                            'type': 'gpu',
                            'num_devices': 4,
                        }
                    ],
                    'descr': 'GPU partition'
                },
            ]
        },
    ],
    'environments': [
        {
            'name': 'builtin',
            'cc': 'cc',
            'cxx': '',
            'ftn': '',
        },
    ],
    'logging': [
        {
            'level': 'debug',
            'handlers': [
                {
                    'type': 'stream',
                    'name': 'stdout',
                    'level': 'info',
                    'format': '%(message)s'
                },
                {
                    'type': 'file',
                    'name': 'reframe.log',
                    'level': 'debug',
                    'format': '[%(asctime)s] %(levelname)s: %(check_info)s: %(message)s',  # noqa: E501
                    'append': False
                }
            ],
            'handlers_perflog': [
                {
                    'type': 'filelog',
                    'prefix': '%(check_system)s/%(check_partition)s',
                    'level': 'info',
                    'format': (
                        '%(check_job_completion_time)s|reframe %(version)s|'
                        '%(check_info)s|jobid=%(check_jobid)s|'
                        '%(check_perf_var)s=%(check_perf_value)s|'
                        'ref=%(check_perf_ref)s '
                        '(l=%(check_perf_lower_thres)s, '
                        'u=%(check_perf_upper_thres)s)|'
                        '%(check_perf_unit)s'
                    ),
                    'append': True
                }
            ]
        }
    ],
}
```
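As a quick sanity check on a config like the one above, the socket counts in each `processor` entry should multiply out to `num_cpus`. The sketch below copies just those topology values into plain dicts (a standalone illustration, not code from this PR):

```python
# Trimmed copies of the 'processor' entries from the example config above.
cpu_partition = {'num_cpus': 128, 'num_sockets': 2, 'num_cpus_per_socket': 64}
gpu_partition = {'num_cpus': 72, 'num_sockets': 2, 'num_cpus_per_socket': 36}

for part in (cpu_partition, gpu_partition):
    # ReFrame does not enforce this relation, but inconsistent values here
    # would silently skew task-assignment logic that relies on num_cpus.
    assert part['num_cpus'] == part['num_sockets'] * part['num_cpus_per_socket']
```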
New file (ReFrame configuration for the Magic Castle cluster used in the EESSI hackathon):

```python
# This is an example configuration file
site_configuration = {
    'systems': [
        {
            'name': 'Magic Castle',
            'descr': 'The Magic Castle instance as it was used in the EESSI hackathon in dec 2021, on AWS',
            'modules_system': 'lmod',
            'hostnames': ['login', 'node'],
            'partitions': [
                {
                    'name': 'cpu',
                    'scheduler': 'slurm',
                    'launcher': 'mpirun',
                    # By default, the Magic Castle cluster only allocates a small amount of memory.
                    # Thus, we request the full memory explicitly.
                    'access': ['-p cpubase_bycore_b1 --exclusive --mem=94515M'],
                    'environs': ['builtin'],
                    'max_jobs': 4,
                    'processor': {
                        'num_cpus': 36,
                    },
                    'descr': 'normal CPU partition'
                },
            ]
        },
    ],
    'environments': [
        {
            'name': 'builtin',
            'cc': 'cc',
            'cxx': '',
            'ftn': '',
        },
    ],
    'logging': [
        {
            'level': 'debug',
            'handlers': [
                {
                    'type': 'stream',
                    'name': 'stdout',
                    'level': 'info',
                    'format': '%(message)s'
                },
                {
                    'type': 'file',
                    'name': 'reframe.log',
                    'level': 'debug',
                    'format': '[%(asctime)s] %(levelname)s: %(check_info)s: %(message)s',  # noqa: E501
                    'append': False
                }
            ],
            'handlers_perflog': [
                {
                    'type': 'filelog',
                    'prefix': '%(check_system)s/%(check_partition)s',
                    'level': 'info',
                    'format': (
                        '%(check_job_completion_time)s|reframe %(version)s|'
                        '%(check_info)s|jobid=%(check_jobid)s|'
                        '%(check_perf_var)s=%(check_perf_value)s|'
                        'ref=%(check_perf_ref)s '
                        '(l=%(check_perf_lower_thres)s, '
                        'u=%(check_perf_upper_thres)s)|'
                        '%(check_perf_unit)s'
                    ),
                    'append': True
                }
            ]
        }
    ],
    'general': [
        {
            'remote_detect': True,
        }
    ],
}
```
New file (the GROMACS test, written on top of the `gromacs_check` from the CSCS hpctestlib):

```python
# Copyright 2016-2021 Swiss National Supercomputing Centre (CSCS/ETH Zurich)
# ReFrame Project Developers. See the top-level LICENSE file for details.
#
# SPDX-License-Identifier: BSD-3-Clause

import reframe as rfm
from reframe.utility import find_modules

from hpctestlib.sciapps.gromacs.benchmarks import gromacs_check
import eessi_utils.hooks as hooks
import eessi_utils.utils as utils


@rfm.simple_test
class GROMACS_EESSI(gromacs_check):

    scale = parameter([
        ('singlenode', 1),
        ('n_small', 2),
        ('n_medium', 8),
        ('n_large', 16)])
    module_info = parameter(find_modules('GROMACS', environ_mapping={r'.*': 'builtin'}))

    omp_num_threads = 1
    executable_opts += ['-dlb yes', '-ntomp %s' % omp_num_threads, '-npme -1']
    variables = {
        'OMP_NUM_THREADS': '%s' % omp_num_threads,
    }

    time_limit = '30m'

    @run_after('init')
    def apply_module_info(self):
        self.s, self.e, self.m = self.module_info
        self.valid_systems = [self.s]
        self.modules = [self.m]
        self.valid_prog_environs = [self.e]

    @run_after('init')
    def set_test_scale(self):
        scale_variant, self.num_nodes = self.scale
        self.tags.add(scale_variant)

    # Set the correct tags for monitoring & CI
    @run_after('init')
    def set_test_purpose(self):
        # Run all tests from the testlib for monitoring
        self.tags.add('monitoring')
        # Select one test for CI
        if self.benchmark_info[0] == 'HECBioSim/hEGFRDimer':
            self.tags.add('CI')

    # Skip test variants that run non-bonded interactions on GPU
    # when the current partition has no GPU nodes
    @run_after('setup')
    def skip_nb_impl_gpu_on_cpu_nodes(self):
        self.skip_if(
            self.nb_impl == 'gpu' and not utils.is_gpu_present(self),
            "Skipping test variant with non-bonded interactions on GPUs, as this partition (%s) does not have GPU nodes" % self.current_partition.name
        )

    # Skip test variants that run non-bonded interactions on GPU
    # when the current GROMACS module is not a GPU build
    @run_after('setup')
    def skip_nb_impl_gpu_on_non_cuda_builds(self):
        self.skip_if(
            self.nb_impl == 'gpu' and not utils.is_cuda_required(self),
            "Skipping test variant with non-bonded interactions on GPUs, as this GROMACS was not built with GPU support"
        )

    # Skip testing GPU-based modules on CPU-based nodes
    @run_after('setup')
    def skip_gpu_test_on_cpu_nodes(self):
        hooks.skip_gpu_test_on_cpu_nodes(self)

    # Assign num_tasks, num_tasks_per_node and num_cpus_per_task automatically,
    # based on the current partition's num_cpus and number of GPUs.
    # Only when running nb_impl on GPU do we want one task per GPU.
    @run_after('setup')
    def set_num_tasks(self):
        if self.nb_impl == 'gpu':
            hooks.assign_one_task_per_gpu(test=self, num_nodes=self.num_nodes)
        else:
            hooks.assign_one_task_per_cpu(test=self, num_nodes=self.num_nodes)
```
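The `scale` parameter generates one test variant per tuple; each variant gets a tag (usable for selection with `reframe -t <tag>`) and a node count that later feeds the task-assignment hooks. A minimal standalone sketch of how `set_test_scale` unpacks each tuple:

```python
# The same tuples as in the 'scale' parameter of the test above.
scale = [
    ('singlenode', 1),
    ('n_small', 2),
    ('n_medium', 8),
    ('n_large', 16),
]

# Each ReFrame variant sees exactly one tuple; here we unpack them all
# to show the resulting tag -> num_nodes mapping.
variants = {}
for scale_variant, num_nodes in scale:
    variants[scale_variant] = num_nodes

print(variants['n_medium'])  # node count used by the 'n_medium' variant
```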
New file (the hooks module, imported by the test above as `eessi_utils.hooks`):

```python
import reframe as rfm

import eessi_utils.utils as utils

processor_info_missing = '''This test requires the number of CPUs to be known for the partition it runs on.
Check that processor information is either autodetected
(see https://reframe-hpc.readthedocs.io/en/stable/configure.html#proc-autodetection),
or manually set in the ReFrame configuration file
(see https://reframe-hpc.readthedocs.io/en/stable/config_reference.html?highlight=processor%20info#processor-info).
'''


def skip_cpu_test_on_gpu_nodes(test: rfm.RegressionTest):
    '''Skip the test if GPUs are present, but no CUDA is required'''
    skip = utils.is_gpu_present(test) and not utils.is_cuda_required(test)
    if skip:
        test.skip_if(True, "GPU is present on this partition (%s), skipping CPU-based test" % test.current_partition.name)


def skip_gpu_test_on_cpu_nodes(test: rfm.RegressionTest):
    '''Skip the test if CUDA is required, but no GPU is present'''
    skip = utils.is_cuda_required(test) and not utils.is_gpu_present(test)
    if skip:
        test.skip_if(True, "Test requires CUDA, but no GPU is present in this partition (%s). Skipping test..." % test.current_partition.name)


def assign_one_task_per_cpu(test: rfm.RegressionTest, num_nodes: int):
    '''Sets num_tasks_per_node and num_cpus_per_task such that the test runs one task per core'''
    if test.current_partition.processor.num_cpus is None:
        raise AttributeError(processor_info_missing)
    test.num_tasks_per_node = test.current_partition.processor.num_cpus
    test.num_cpus_per_task = 1
    test.num_tasks = num_nodes * test.num_tasks_per_node


def assign_one_task_per_gpu(test: rfm.RegressionTest, num_nodes: int):
    '''Sets num_tasks_per_node to the number of GPUs, and num_cpus_per_task to the number of CPUs available per GPU in this partition'''
    if test.current_partition.processor.num_cpus is None:
        raise AttributeError(processor_info_missing)
    test.num_tasks_per_node = utils.get_num_gpus(test)
    test.num_cpus_per_task = test.current_partition.processor.num_cpus // test.num_tasks_per_node
    test.num_tasks = num_nodes * test.num_tasks_per_node


def auto_assign_num_tasks_MPI(test: rfm.RegressionTest, num_nodes: int):
    '''Automatically sets num_tasks, num_tasks_per_node and num_cpus_per_task,
    based on the current partition's num_cpus, its number of GPUs, and
    num_nodes. For GPU tests, one task per GPU is set, and num_cpus_per_task is
    based on the ratio of CPU cores to GPUs. For CPU tests, one task per CPU is
    set, and num_cpus_per_task is set to 1. The total task count is determined
    by the number of nodes to be used in the test. The behaviour of this
    function is (usually) sensible for pure MPI tests.'''
    if utils.is_cuda_required(test):
        assign_one_task_per_gpu(test, num_nodes)
    else:
        assign_one_task_per_cpu(test, num_nodes)
```
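To illustrate the arithmetic in `assign_one_task_per_gpu`: for the GPU partition of the example config (72 CPUs, 4 GPUs), a 2-node run works out as below. This is a standalone sketch with the values inlined, not using the ReFrame test and partition objects:

```python
# Values from the GPU partition in the example config above.
num_cpus = 72    # processor.num_cpus
num_gpus = 4     # devices[0].num_devices
num_nodes = 2    # hypothetical test scale

num_tasks_per_node = num_gpus              # one MPI task per GPU
num_cpus_per_task = num_cpus // num_gpus   # CPU cores shared evenly over GPUs
num_tasks = num_nodes * num_tasks_per_node

print(num_tasks_per_node, num_cpus_per_task, num_tasks)  # 4 18 8
```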
New file (the utils module, imported by the test above as `eessi_utils.utils`):

```python
import re

import reframe as rfm


gpu_dev_name = 'gpu'


def _get_gpu_list(test: rfm.RegressionTest):
    return [dev.num_devices for dev in test.current_partition.devices
            if dev.device_type == gpu_dev_name]


def get_num_gpus(test: rfm.RegressionTest) -> int:
    '''Returns the number of GPUs for the current partition'''
    gpu_list = _get_gpu_list(test)
    # If multiple devices are called 'gpu' in the current partition,
    # we don't know for which to return the device count...
    if len(gpu_list) != 1:
        raise ValueError(f"Multiple different devices exist with the name "
                         f"'{gpu_dev_name}' for partition '{test.current_partition.name}'. "
                         f"Cannot determine the number of GPUs available for the test. "
                         f"Please check the definition of partition '{test.current_partition.name}' "
                         f"in your ReFrame config file.")

    return gpu_list[0]


def is_gpu_present(test: rfm.RegressionTest) -> bool:
    '''Checks if GPUs are present in the current partition'''
    return len(_get_gpu_list(test)) >= 1


def is_cuda_required(test: rfm.RegressionTest) -> bool:
    '''Checks whether CUDA seems to be required by the current module'''
    requires_cuda = False
    for module in test.modules:
        if re.search("(?i)cuda", module):
            requires_cuda = True
    return requires_cuda
```
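The `(?i)cuda` pattern in `is_cuda_required` matches case-insensitively anywhere in a module name, so CUDA builds are detected from the module name alone. A quick standalone check (the module names below are illustrative, not taken from the PR):

```python
import re

# Illustrative module names; real names would come from Lmod / `module avail`.
modules = [
    'GROMACS/2021.3-foss-2021a',
    'GROMACS/2021.3-foss-2021a-CUDA-11.3.1',
]

def requires_cuda(module_names):
    # Same check as is_cuda_required above: case-insensitive 'cuda' substring
    return any(re.search('(?i)cuda', m) for m in module_names)

print(requires_cuda(modules[:1]), requires_cuda(modules))  # False True
```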