
CESM with singularity on multiple nodes

Vru

Vru
New Member
I am trying to run CESM with a Singularity container on a machine with, for example, 512 processors (spread over several nodes).

This requires starting the application by calling the MPI launcher from the host, i.e., mpirun -n 512 singularity ... cesm.exe

Obviously that does not work with ./case.submit.

My problem is that execution fails when calling cesm.exe directly (whereas it succeeds when using ./case.submit or python .case.run, which is only possible without the container).

I create the case, set up and build CESM, and before running I execute ./check_case, which I thought would be sufficient to prepare the run, but I must be missing something: what else is required before calling cesm.exe from the run directory?

Does anyone have experience running CESM with containers on multiple nodes?
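
For concreteness, the manual launch I am attempting looks roughly like this (the image path and the Singularity subcommand are placeholders, not my exact setup):

    # from the case RUNDIR on the host; image path and task count are illustrative
    cd /path/to/case/run
    mpirun -n 512 singularity exec /path/to/cesm.sif ./cesm.exe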
 

jrvb

Rob von Behren
New Member
Hi @Vru -

I've been working to get a multi-machine containerized CESM up and running as well. I haven't been successful yet, but I believe the issues I'm hitting at the moment stem from interactions between the container software and my particular host, not from communication between the containers. Here are some things I found which might help you move forward:

* The cesm.exe binary needs to be accessible to all of the containers. The simplest way to do this is to have a network drive (e.g., an NFS mount) that you mirror into each of the containers and use that for your test cases.

* The MPICH installation in the CESM container needs SSH access to the other hosts in order to start up MPI processes. Since the container doesn't include sshd, you'll need to add it and save the result as a new container. You'll also need to do a bit of messing around to make sure /sbin/sshd is started in each of the containers, most likely listening on a different port so you don't interfere with the sshd running in the host OS. Finally, you'll need to set up a password-less SSH private key in ~user/.ssh/id_rsa and add the public key to ~user/.ssh/authorized_keys so the containers can ssh to one another when mpirun is called (a rough sketch follows below).

* You probably need to set up port forwarding for the containers so MPICH will have open ports to use for its communication between the containers. Depending on your environment, you may also need to change the firewall rules for the host OSes to make sure they can talk to each other on these ports as well.

That's clearly a very rough sketch, but hopefully it helps a bit! I'm planning to put together a Dockerfile in the next few days which uses escomp/cesm-2.2:latest as a base and does some of this setup, so I'll share a pointer to that when I've got something working.
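
To illustrate the sshd points above, the setup I have in mind looks roughly like the following (base image, port, user, and paths are placeholders, and the exact steps depend on your base image and on MPICH using the Hydra process manager):

    # inside the container build: add sshd and a passwordless key pair
    apt-get update && apt-get install -y openssh-server    # or yum/dnf, depending on base image
    ssh-keygen -t rsa -N "" -f /home/user/.ssh/id_rsa
    cat /home/user/.ssh/id_rsa.pub >> /home/user/.ssh/authorized_keys

    # at runtime, start sshd in each container on a non-default port (e.g. 2222)
    /usr/sbin/sshd -p 2222

    # then tell MPICH's Hydra launcher to use that port when starting the run
    mpirun -launcher ssh -launcher-exec "ssh -p 2222" -f hostfile -n 512 ./cesm.exe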
 

owhughes

Owen Hughes
New Member
Hi @jrvb!
I was wondering if you've made any progress on getting a multi-machine CESM container working. I've been working on containerizing a number of Earth system models and dynamical cores (namely E3SM and MPAS), and I would be curious to see how you're going about this. Full disclosure: I know very little about the internals of batch submission systems.
 

jrvb

Rob von Behren
New Member
Hi @owhughes -

I wound up putting my multi-machine containerized version on hold for a couple of reasons:

1. After more testing, the speedup I was able to get by adding nodes was fairly poor in the environment I'm using. (I'm running things on Google Cloud VMs, and while the networks are fairly fast, they still have higher latency than InfiniBand, and the communication overhead starts to dominate past ~10 nodes.)

2. We shifted to doing wider ensembles of shorter-length simulations, where single-machine runs work well anyway.

3. For the single-node case, I've found it more convenient to just use full machine images rather than Docker containers.

Sorry I don't have anything more helpful for you!

Best,

-Rob
 

fluidnumerics-joe

Joe Schoonover
New Member
(quoting @jrvb's reply above)
Hey @jrvb ,
What compute nodes are you using on Google Cloud? I've had reasonable scaling experience (with other OGCMs) on the c2-standard-60 instances with compact placement enabled.

Are you running under Docker or using Singularity? I'm interested to see where I can pitch in to help improve performance for CESM in multi-VM environments on Google Cloud.
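
For anyone following along, compact placement on Google Cloud is set up roughly as follows; the policy name, region, zone, and instance name below are placeholders, so check the current gcloud docs before relying on the exact flags:

    # create a compact placement policy
    gcloud compute resource-policies create group-placement cesm-placement \
        --collocation=collocated --region=us-central1

    # attach it when creating each c2-standard-60 instance
    gcloud compute instances create cesm-node-1 \
        --machine-type=c2-standard-60 \
        --resource-policies=cesm-placement \
        --maintenance-policy=TERMINATE \
        --zone=us-central1-a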
 

jrvb

Rob von Behren
New Member
Hi @fluidnumerics-joe -

I've also been using c2-standard-60, and used a compact placement policy for my scaling tests. No Docker or other container software; I just ran on regular VM instances. It's entirely possible that I had a suboptimal configuration, though.

Some details about my scalability tests:

* machines were all c2-standard-60 with hyperthreading disabled (so 30 cores per node)
* compact placement policy
* applied the suggested config changes from the GoogleCloudPlatform/hpc-tools repository on GitHub (albeit an older version, as I did these tests in Jan 2021)
* Intel Fortran compiler and MPI stack
* disk-backed Filestore NFS server for sharing CESM binaries, input files, and outputs
* vanilla CESM configuration: --compset F2000climo --res f19_g17 (case creation sketched below)
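
For completeness, that configuration corresponds to a case created roughly like this (case name, machine entry, and paths are placeholders):

    # placeholders for source tree, case directory, and machine name
    cd /path/to/cesm/cime/scripts
    ./create_newcase --case /path/to/cases/f2000.scaling \
        --compset F2000climo --res f19_g17 --machine my-gcp-cluster
    cd /path/to/cases/f2000.scaling
    ./case.setup
    ./case.build
    ./case.submit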

Here is what I found:

nodes    throughput (simulated years/day)
1        3.07
2        5.29
4        8.35
6        10.08
8        10.43
12       9.15
16       7.25
22       6.01

The best I was able to get for this configuration was about a 3.4x speedup, with 8x the compute cost.

Some other misc things I've tried:

* Switching to Lustre for the shared file system gave a ~10% performance bump but didn't change the scaling behavior
* Running an FW2000climo simulation scaled slightly better, with closer to a 4x speedup at 8 nodes
* Using larger instances (nd-224 or something, IIRC) scaled about the same as adding CPUs via more nodes, which definitely suggests something suboptimal in my test setup. :)

I'd love to hear what sort of scaling you've been able to get!

Best,

-Rob
 

fluidnumerics-joe

Joe Schoonover
New Member
(quoting @jrvb's reply above)
Hey @jrvb,
Here's a video I published on WRF early last year:

In short, we see about 50% scaling efficiency for the CONUS 2.5 km benchmark once we hit 1920 ranks (32 x c2-standard-60).

Since then, using the Intel oneAPI ifort compiler with Open MPI, we've gotten about a 30% performance bump in all of our runs. I'm working on firming up some new scaling-efficiency metrics for WRF, but we did find that using hyperthreads and keeping more MPI ranks per node provided better cost efficiency.

I'm also looking into running on the new EPYC Milan systems (c2d); these are the "compute optimized" versions of the n2d instances on Google Cloud. For those, AMD recommends the AOCC compiler chain.

All of the image baking done to help reproduce results is maintained in the FluidNumerics/rcc-apps repository on GitHub (Create VM images for HPC applications on Google Cloud Platform).

Do you have a repository where you're working on the CESM installation? If we can get it into rcc-apps, I can rig up "continuous benchmarking" to see what I can do to help improve scaling on Google Cloud.

On using Lustre: it's best if CESM has options for parallel file IO; most of the gains in IO performance will come from enabling this in software.
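
For reference, CESM's IO does go through the PIO library, and the usual knobs are case XML variables; a minimal sketch, with purely illustrative values (the available PIO_TYPENAME options depend on how the model was built):

    # run from the case directory; values are illustrative only
    ./xmlchange PIO_TYPENAME=pnetcdf    # parallel-netcdf backend, if it was built in
    ./xmlchange PIO_NUMTASKS=32         # number of dedicated IO tasks
    ./xmlchange PIO_STRIDE=8            # spacing of IO tasks across the MPI ranks
    ./xmlquery PIO_TYPENAME,PIO_NUMTASKS,PIO_STRIDE    # confirm the settings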
 

dobbins

Brian Dobbins
CSEG and Liaisons
Staff member
(quoting @Vru's original post above)

Hi Vru

I've not been watching this forum, so I'm a bit late. But to add a few notes to the discussion:

We've run with a multi-node container on our system, but development of this has been on the back burner for a while now (this is our 'cesm-hpc' container, which includes the Intel compilers, improving performance a fair bit).

The problem is that CESM's config files are all internal to the container and have no knowledge of the host system, so the goal is to introduce a 'minimal' configuration via a few environment variables (for queuing systems, procs per node, etc.), so that porting doesn't involve adding libraries, compilers, XML settings, and so on, just a few simple variables to describe your system.

Additionally, as noted above, you need 'case.submit' to be 'container-aware', and your host system has to be able to talk to the container MPI's launching mechanism, likely PMI / PMIx.

You should be able to run if you're doing that 'by hand', and I'd be interested in seeing your error logs if it's still not working for you.
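
When launching by hand, it also helps to compare against what CIME itself would have run; a quick sketch from the case directory (assuming a reasonably recent CIME):

    # show the environment and the mpirun command that ./case.submit would have used
    ./preview_run

    # confirm where cesm.exe expects to be run from
    ./xmlquery RUNDIR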

- Brian
 

dobbins

Brian Dobbins
CSEG and Liaisons
Staff member
(quoting @jrvb's note above about adding sshd to the container so MPICH can launch ranks on other hosts)

Hi Rob!

Good to see you guys are still doing some stuff with this. On the above note, if you have a 'host' MPI (I'm not sure whether you do on your GCP setup), you probably don't need to do it via your own hostfile and launch from within a container with SSH enabled. This is fairly new to me too, but in a discussion with Ralph Castain on the Open MPI list, he clarified that the 'wire-up' phase of Open MPI just uses PMI/PMIx, and thus even if a container internally uses MPICH, it's self-consistent in terms of the executable/libraries, so there's no risk of MPI compatibility issues. So if you have some sort of MPI launcher across multiple GCP nodes, that should be sufficient. If you don't, then this doesn't apply, and your way is also fine. (This is more relevant for clusters that already have an existing MPI, vs. cloud nodes that don't.)
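
In other words, on a cluster with a host launcher the pattern is roughly the following (a sketch assuming Slurm and an MPICH inside the container built with PMI2 support; the image path is a placeholder):

    # the host-side Slurm launcher wires up the container's MPI via PMI,
    # so no sshd is needed inside the container
    srun --mpi=pmi2 -n 512 singularity exec /path/to/cesm.sif ./cesm.exe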

Cheers,
- Brian
 

dobbins

Brian Dobbins
CSEG and Liaisons
Staff member
(quoting @owhughes's question above)

Owen, I'm sending you a note. If you'd like to chat, I've run MPAS and E3SM on our cloud infrastructure, which mirrors our container setup, absent the batch system. I've not touched our containers in a while, but I hope to get back to them in ~April or so. I bet we can compare notes and ideas.

- Brian
 

dobbins

Brian Dobbins
CSEG and Liaisons
Staff member
(quoting @fluidnumerics-joe's reply above)

Hi Joe,

For the AMD nodes, our experience with CESM suggests sticking with the Intel compiler, but make sure you force it to use AVX2 optimizations ('-march=core-avx2 -mtune=core-avx2'): it defaults to a runtime check of CPU capabilities, and since the CPU isn't an Intel one, it falls back to the lowest optimization path. We tried AOCC a while back, and it still wasn't up to par, but maybe it's made great strides recently?
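
Depending on your CIME version, those flags go either in config_compilers.xml or in the newer cmake macros; as a rough sketch of the latter (the macros path is a placeholder, and the syntax should be checked against your CIME tree):

    # append the AVX2 flags to the Intel Fortran flags used by the CESM build
    echo 'string(APPEND FFLAGS " -march=core-avx2 -mtune=core-avx2")' >> /path/to/cime/config/cmake_macros/intel.cmake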

Cheers,
- Brian
 