Case resubmission error due to compute/login node restrictions

jgvirgin@uwaterloo_ca

Jack Virgin
New Member
Hi all,

I am trying to resolve a resubmit issue with CESM2.1.3. My HPC uses SLURM, and case submissions are sent to compute nodes (e.g. sbatch mpirun -np {total tasks}). I can run cases successfully, but resubmissions fail because SLURM submissions must come from a login node, not a compute node. My understanding is that the first job works because I submit it manually from a login node, whereas each subsequent resubmit is issued from a compute node at the end of the run, which triggers the error. I can get around this by adding the --resubmit-immediate argument to case.submit, but I'd like to avoid queuing up many dependency jobs for long runs, as the HPC caps the number of jobs I can have queued.

In older versions of CESM (e.g. 1.2.2), I got around this by modifying the post-run .csh script to SSH back into a login node before running the $CASE.run script. However, since CESM2 uses revised infrastructure, I don't see an analogous workaround.

Any guidance on this issue would be much appreciated!

Jack
 

jedwards

CSEG and Liaisons
Staff member
Hi Jack,

There are two ways to resolve this issue. One is to edit config_batch.xml and prefix the batch submit command with an ssh to the login node, so that the resubmit is issued from there:
<batch_submit>ssh login1 cd $CASEROOT ; sbatch</batch_submit>
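As a rough sketch of where that line sits, the fragment below shows a batch_system entry in config_batch.xml with the modified batch_submit. The machine name mymachine and the host login1 are placeholders, not values from this thread; substitute the MACH name of your own port and a login node that your compute nodes can reach. It also assumes that non-interactive (key- or host-based) SSH from compute nodes to that login node is permitted on your system, which is worth confirming with your HPC admins.

```xml
<!-- Sketch only: "mymachine" and "login1" are hypothetical placeholders. -->
<batch_system MACH="mymachine" type="slurm">
  <!-- Submit via a login node so that resubmits issued at the end of a run
       on a compute node are accepted by SLURM.
       Assumes passwordless SSH from compute nodes to login1. -->
  <batch_submit>ssh login1 cd $CASEROOT ; sbatch</batch_submit>
  <!-- Any existing children of this entry (directives, queues, ...) stay unchanged. -->
</batch_system>
```

If the case has already been created, the batch settings it uses may have been copied into the case directory at setup time, so a change like this may only take effect for newly created cases.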

The other, and the one I prefer, is to use the resubmit-immediate option to case.submit:

--resubmit-immediate  This queues all of the resubmissions immediately after
                      the first job is queued. These rely on the queue system to
                      handle dependencies.