
CESM with singularity on multiple nodes


Vru
New Member
I am trying to run CESM with a Singularity container on a machine with, for example, 512 processors spread across several nodes.

This requires starting the application by calling the MPI launcher from the host, i.e., mpirun -n 512 singularity ... cesm.exe
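As a rough sketch, the host-side launch described above might look like the following (the image name, bind path, and run directory are placeholders, not the actual setup; this also assumes the host MPI is ABI-compatible with the MPI inside the container):

```shell
# Launch 512 MPI ranks from the host; each rank starts the model inside the container.
# NOTE: cesm-2.2.sif and /path/to/rundir are hypothetical -- substitute your own
# image and run directory.
mpirun -n 512 \
  singularity exec \
    --bind /path/to/rundir:/path/to/rundir \
    cesm-2.2.sif \
    ./cesm.exe
```

This "hybrid" launch only works when the host-side mpirun and the container's MPI library speak the same wire protocol and ABI, which is one of the usual failure points.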

Obviously that does not work with ./case.submit

My problem is that execution fails when calling cesm.exe directly (whereas it succeeds when using case.submit or python .case.run, which is only possible without the container).

I create the case, then set up and build CESM, and before running I execute ./check_case, which I thought would be sufficient to prepare the run. I must be missing something: what else is required before calling cesm.exe from the run directory?
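For reference, a sketch of the case-preparation sequence described above (the case path, compset, and grid are placeholders). In recent CIME versions, ./preview_run prints the exact launch command and environment the run scripts would use, which can help when trying to reproduce the launch by hand:

```shell
# Hypothetical case setup; --case, --compset and --res values are placeholders.
./create_newcase --case ~/cases/mycase --compset X --res f19_g17
cd ~/cases/mycase
./case.setup
./case.build
./check_case      # verifies namelists and input data are in place
./preview_run     # prints the mpirun command and env the run script would use
```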

Does anyone have experience running CESM with containers on multiple nodes?


Rob von Behren
New Member
Hi @Vru -

I've been working to get a multi-machine containerized CESM up and running as well. I haven't been successful yet, but I believe the issues I'm hitting at the moment are due to interactions between the container software and my particular host, and not an issue with communication between the containers. Here are some things I found which might help you move forward:

* The cesm.exe binary needs to be accessible to all of the containers. The simplest way to do this is to have a network drive (e.g., an NFS mount) that you mirror into each of the containers and use for your test cases.
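As a sketch of that mirroring (the mount point and image name are hypothetical): bind the shared path into the container at the same location on every host, so paths inside each container line up with the shared run directory:

```shell
# Assume /shared is the NFS mount present on every host.
# Binding it at the same path inside the container keeps rundir paths consistent
# across all ranks.
singularity exec --bind /shared:/shared cesm-2.2.sif ls /shared/rundir
```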

* The MPICH installation in the CESM container needs ssh access to the other hosts in order to start up MPI processes. Since the container doesn't include sshd, you'll need to add it and save the result as a new container. You'll also need to do a bit of messing around to make sure /sbin/sshd is started in each of the containers, most likely listening on a different port so it doesn't interfere with the sshd running in the host OS. Finally, you'll need to set up a passphrase-less ssh private key in ~user/.ssh/id_rsa and add the public key to ~user/.ssh/authorized_keys so the containers can ssh to one another when mpirun is called.
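A minimal sketch of that sshd setup, assuming a Debian/Ubuntu-style base image; port 2222 and the "node*" hostname pattern are assumptions, and package names vary by base image:

```shell
# Inside the container build: install the ssh server.
apt-get update && apt-get install -y openssh-server

# Generate a passphrase-less key pair and authorize it
# (run as the user that will invoke mpirun).
ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys

# Start sshd on an alternate port so it doesn't collide with the host's sshd.
/usr/sbin/sshd -p 2222

# Point ssh at the alternate port via ~/.ssh/config, so mpirun's ssh-based
# launcher picks it up automatically ("node*" is a placeholder host pattern).
cat >> ~/.ssh/config <<'EOF'
Host node*
    Port 2222
    StrictHostKeyChecking no
EOF
```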

* You probably need to set up port forwarding for the containers so MPICH will have open ports to use for its communication between the containers. Depending on your environment, you may also need to change the firewall rules on the host OSes to make sure they can talk to each other on these ports as well.
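A sketch of pinning MPICH to a known port range and opening it between hosts; the 10000:10100 range, the subnet, and the exact environment variable name are assumptions, so check the documentation for your MPICH version:

```shell
# Restrict MPICH's listening sockets to a fixed range so firewall rules can
# match them. (MPIR_CVAR_CH3_PORT_RANGE is the knob in MPICH 3.x ch3 builds;
# older releases used MPICH_PORT_RANGE -- verify against your version.)
export MPIR_CVAR_CH3_PORT_RANGE=10000:10100

# On each host OS, allow that range between cluster nodes (iptables shown;
# 10.0.0.0/24 is a placeholder subnet, and cloud environments usually need an
# equivalent VPC/firewall rule instead).
sudo iptables -A INPUT -p tcp --dport 10000:10100 -s 10.0.0.0/24 -j ACCEPT
```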

That's clearly a very rough sketch, but hopefully it helps a bit! I'm planning to put together a Dockerfile in the next few days which uses escomp/cesm-2.2:latest as a base and does some of this setup, so I'll share a pointer to that when I've got something working.


Owen Hughes
New Member
Hi @jrvb !
I was wondering if you've made any progress on getting a multi-machine CESM container working. I've been working on containerizing a number of Earth System Models and dynamical cores (namely E3SM and MPAS), and I would be curious to see how you're going about this. Full disclosure: I know very little about the internals of batch submission systems.


Rob von Behren
New Member
Hi @owhughes -

I wound up putting my multi-machine containerized version on hold for a couple of reasons:

1. After more testing, the speedup I was able to get by adding nodes was fairly poor in the environment I'm using. (I'm running things on Google Cloud VMs, and while the networks are fairly fast, they still have higher latency than InfiniBand, so the communication overhead starts to dominate past ~10 nodes.)

2. We shifted to doing wider ensembles of shorter-length simulations, where single-machine runs work well anyway.

3. For the single-node case, I've found it more convenient to just use full machine images rather than Docker containers.

Sorry I don't have anything more helpful for you!