Hi all,
I am currently trying to resolve a resubmit issue with CESM2.1.3. My HPC uses SLURM, and case submissions are sent to compute nodes (e.g. sbatch mpirun -np {total tasks}). I can successfully run cases, but if I use resubmissions, the case fails because SLURM submissions must come from a login node, not a compute node. My understanding is that the first submission works because I run it manually from a login node; subsequent resubmits, however, are launched from a compute node after the run completes, which is what triggers the error. I can get around this by adding the --resubmit-immediate argument to case.submit, but I'd like to avoid queuing up many dependency jobs for long runs, as the HPC caps the number of jobs I can have queued.
In older versions of CESM (e.g. 1.2.2), I got around this by modifying the post-run .csh script to SSH back into a login node before running the $CASE.run script. However, since CESM2 uses revised infrastructure, I don't see an analogous workaround.
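For reference, the 1.2.2-era workaround looked roughly like the sketch below (hostname `login01` and paths are placeholders for my system, not anything CESM defines); the resubmit command was wrapped in an SSH call back to a login node instead of being invoked directly on the compute node:

```shell
# Sketch of the old post-run workaround (placeholders: login01, $CASEROOT).
# Instead of resubmitting directly from the compute node...
#   cd $CASEROOT && ./$CASE.run
# ...SSH back to a login node, where sbatch is permitted:
ssh login01 "cd $CASEROOT && ./$CASE.run"
```

I'm looking for the equivalent hook in the CESM2/CIME workflow, since the resubmit logic no longer lives in a user-editable .csh script.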
Any guidance on this issue would be much appreciated!
Jack