Case resubmit failure

Hello everyone,

I am trying to conduct a multi-year simulation and I would like to use the "Resubmit" flag to let the model run continuously. However, whenever the first run finishes, the next run fails immediately with an out-of-memory error and the following message:

slurmstepd: error: Detected 1 oom-kill event(s) in step 865636.0 cgroup. Some of your processes may have been killed by the cgroup out-of-memory handler.
srun: error: h1c38: task 1: Out Of Memory
srun: Terminating job step 865636.0

I can manually resubmit continuation runs successfully, but that can be cumbersome when conducting a long simulation. I am using CESM1.5b on Hera, but I also ran into a similar problem on Cheyenne using CESM 1.2.2.1 a while back. Being able to use the resubmit flag on Cheyenne wasn't a big deal at the time, so I never really followed up on it. Regardless, this is likely something I am doing incorrectly that Slurm doesn't like, but I am unsure what it could be. Does anyone have any suggestions?

Thank you for your help,
Chris
 