Hi all,
I am currently trying to resolve a resubmit issue with CESM2.1.3. My HPC uses SLURM, and case submissions are sent to compute nodes (e.g. sbatch mpirun -np {total tasks}). I can successfully run cases, but if I use resubmissions, the case fails because SLURM submissions must come from a login node, not a compute node. My understanding is that the first submission works because I run it manually from a login node; subsequent resubmits, however, are launched from a compute node after the run completes, which is what triggers the error. I can get around this by adding the --resubmit-immediate argument to case.submit, but I'd like to avoid queuing up many dependency jobs for long runs, as the HPC caps the number of jobs I can have queued.
In older versions of CESM (e.g. 1.2.2), I got around this by modifying the post-run .csh script to SSH back into a login node before running the $CASE.run script. However, since CESM2 uses revised infrastructure, I don't see an analogous workaround.
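For reference, the 1.2.2-era workaround looked roughly like the sketch below (hostname `login01` and paths are placeholders for my system, not anything CESM defines); the resubmit command was wrapped in an SSH call back to a login node instead of being invoked directly on the compute node:

```shell
# Sketch of the old post-run workaround (placeholders: login01, $CASEROOT).
# Instead of resubmitting directly from the compute node...
#   cd $CASEROOT && ./$CASE.run
# ...SSH back to a login node, where sbatch is permitted:
ssh login01 "cd $CASEROOT && ./$CASE.run"
```

I'm looking for the equivalent hook in the CESM2/CIME workflow, since the resubmit logic no longer lives in a user-editable .csh script.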
Any guidance on this issue would be much appreciated!
Jack