Hi!
I am trying to reproduce the simple MPI example here, but actually running an MPI program, since the example here only runs hostname. I have two examples locally - one an application we are working on, and a second "hello world" example that I fell back to when I hit some issues (and it reproduced them). Here is what my job looks like:
name: "projects/llnl-flux/locations/us-central1/jobs/hello-world-mpi-005"
uid: "hello-world-mpi-00-3f853428-1bba-44c60"
task_groups {
name: "projects/xxxxxxxxxxxxxxxlocations/us-central1/jobs/hello-world-mpi-005/taskGroups/group0"
task_spec {
runnables {
barrier {
name: "wait-for-setup"
}
}
runnables {
script {
text: "bash /mnt/share/hello-world-mpi/setup.sh"
}
}
runnables {
barrier {
name: "wait-for-setup"
}
}
runnables {
script {
text: "bash /mnt/share/hello-world-mpi/run.sh"
}
}
compute_resource {
cpu_milli: 1000
memory_mib: 1000
}
max_run_duration {
seconds: 3600
}
max_retry_count: 2
volumes {
gcs {
remote_path: "netmark-experiment-bucket"
}
mount_path: "/mnt/share"
}
}
task_count: 4
parallelism: 4
task_count_per_node: 1
require_hosts_file: true
permissive_ssh: true
}
allocation_policy {
location {
allowed_locations: "regions/us-central1"
allowed_locations: "zones/us-central1-a"
allowed_locations: "zones/us-central1-b"
allowed_locations: "zones/us-central1-c"
allowed_locations: "zones/us-central1-f"
}
instances {
policy {
machine_type: "c2-standard-16"
boot_disk {
image: "projects/cloud-hpc-image-public/global/images/family/hpc-centos-7"
}
}
}
service_account {
email: "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
}
labels {
key: "batch-job-id"
value: "hello-world-mpi-005"
}
}
labels {
key: "type"
value: "script"
}
labels {
key: "mount"
value: "bucket"
}
labels {
key: "env"
value: "testing"
}
status {
state: QUEUED
run_duration {
}
}
create_time {
seconds: 1684889759
nanos: 883261744
}
update_time {
seconds: 1684889759
nanos: 883261744
}
logs_policy {
destination: CLOUD_LOGGING
}
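For reference, I'm generating that spec with the Python SDK (google-cloud-batch). Here's a trimmed sketch of the job construction, not my full script - I've shortened the allowed_locations list and left out the service account, and build_job is just an illustrative helper name - but the fields match the dump above:

from google.cloud import batch_v1

def build_job() -> batch_v1.Job:
    # Four runnables: barrier, setup, barrier, then the MPI run
    runnables = [
        batch_v1.Runnable(barrier=batch_v1.Runnable.Barrier(name="wait-for-setup")),
        batch_v1.Runnable(
            script=batch_v1.Runnable.Script(text="bash /mnt/share/hello-world-mpi/setup.sh")
        ),
        batch_v1.Runnable(barrier=batch_v1.Runnable.Barrier(name="wait-for-setup")),
        batch_v1.Runnable(
            script=batch_v1.Runnable.Script(text="bash /mnt/share/hello-world-mpi/run.sh")
        ),
    ]
    task_spec = batch_v1.TaskSpec(
        runnables=runnables,
        compute_resource=batch_v1.ComputeResource(cpu_milli=1000, memory_mib=1000),
        max_run_duration="3600s",  # proto-plus accepts the duration as a string
        max_retry_count=2,
        volumes=[
            batch_v1.Volume(
                gcs=batch_v1.GCS(remote_path="netmark-experiment-bucket"),
                mount_path="/mnt/share",
            )
        ],
    )
    task_group = batch_v1.TaskGroup(
        task_spec=task_spec,
        task_count=4,
        parallelism=4,
        task_count_per_node=1,   # one task (MPI rank) per VM
        require_hosts_file=True,  # provides $BATCH_HOSTS_FILE for mpirun
        permissive_ssh=True,      # passwordless ssh between the VMs
    )
    allocation_policy = batch_v1.AllocationPolicy(
        location=batch_v1.AllocationPolicy.LocationPolicy(
            allowed_locations=["regions/us-central1", "zones/us-central1-a"]  # (trimmed)
        ),
        instances=[
            batch_v1.AllocationPolicy.InstancePolicyOrTemplate(
                policy=batch_v1.AllocationPolicy.InstancePolicy(
                    machine_type="c2-standard-16",
                    boot_disk=batch_v1.AllocationPolicy.Disk(
                        image="projects/cloud-hpc-image-public/global/images/family/hpc-centos-7"
                    ),
                )
            )
        ],
        # service_account elided here
    )
    # name, uid, status, and the timestamps in the dump above are set by the service
    return batch_v1.Job(
        task_groups=[task_group],
        allocation_policy=allocation_policy,
        labels={"type": "script", "mount": "bucket", "env": "testing"},
        logs_policy=batch_v1.LogsPolicy(
            destination=batch_v1.LogsPolicy.Destination.CLOUD_LOGGING
        ),
    )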
And here are the setup.sh and run.sh scripts:
setup.sh
#!/bin/bash

export DEBIAN_FRONTEND=noninteractive

# Stagger the tasks so they don't race on the shared setup
sleep $BATCH_TASK_INDEX

# Note that for this family / image, we are root (do not need sudo)
yum update -y && yum install -y cmake gcc tuned ethtool

# This ONLY works on the hpc-* image family images
google_mpi_tuning --nosmt
# google_install_mpi --intel_mpi
google_install_intelmpi --impi_2021

# This is where they are installed to
# ls /opt/intel/mpi/latest/
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/intel/mpi/latest/lib:/opt/intel/mpi/latest/lib/release
export PATH=/opt/intel/mpi/latest/bin:$PATH

outdir=/mnt/share/hello-world-mpi
mkdir -p ${outdir}
cd ${outdir}

# Only the first task downloads and builds the example
if [ "${BATCH_TASK_INDEX}" = "0" ]; then
    wget -O /tmp/ompi.tar.gz https://docs.it4i.cz/src/ompi/ompi.tar.gz
    cd /tmp
    tar -xzvf ompi.tar.gz
    rm ompi/Makefile
    cp -R ./ompi/* ${outdir}/
    cd ${outdir}/
    ls
    mpicc -g -lmpi -lmpifort hello_c.c \
        -I/opt/intel/mpi/latest/include -I/opt/intel/mpi/2021.8.0/include \
        -L/opt/intel/mpi/2021.8.0/lib/release -L/opt/intel/mpi/2021.8.0/lib \
        -o hello_c
fi
and run.sh
#!/bin/bash

export PATH=/opt/intel/mpi/latest/bin:$PATH
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/intel/mpi/latest/lib:/opt/intel/mpi/latest/lib/release

# Sanity check that the Intel MPI compiler wrapper is where we expect
find /opt/intel -name mpicc

# Only the first task launches mpirun; the hosts file lists all four nodes
if [ "${BATCH_TASK_INDEX}" = "0" ]; then
    cd /mnt/share/hello-world-mpi
    ls
    mpirun -hostfile $BATCH_HOSTS_FILE -n 4 -ppn 1 -- /mnt/share/hello-world-mpi/hello_c
fi
It looks like it's compiling OK - I see hello_c - but in both examples the error I've hit with mpirun is something related to hydra and an argument?

It's been really challenging figuring out how all this works - e.g., it took me a hot minute to realize that these google install commands for MPI are only available on that specific image family, and then it's taken 10+ jobs to find the paths / binaries of various things (I'm past my 50th run and still don't have a working example!) 😆 I have a lot of feedback I'm planning to share, but I'd like to get at least one reasonable example working first (and I'd be happy to share it)! Thanks for the help - looking forward to getting this working! For my execution I'm using the Python SDK, so I don't have any config beyond what I posted above.
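In case it's useful, the submission side is just a few lines (again a minimal sketch, where build_job() is the helper sketched earlier):

from google.cloud import batch_v1

client = batch_v1.BatchServiceClient()
job = client.create_job(
    parent="projects/llnl-flux/locations/us-central1",
    job=build_job(),  # the batch_v1.Job constructed above
    job_id="hello-world-mpi-005",
)
print(job.uid, job.status.state)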