Case resubmit failure

Hello everyone,

I am trying to conduct a multi-year simulation and I would like to use the "Resubmit" flag to let the model run continuously. However, whenever the first run finishes, the model immediately fails during the next iteration with an out-of-memory error and the following message:

slurmstepd: error: Detected 1 oom-kill event(s) in step 865636.0 cgroup. Some of your processes may have been killed by the cgroup out-of-memory handler.
srun: error: h1c38: task 1: Out Of Memory
srun: Terminating job step 865636.0

I can manually resubmit continuation runs successfully, but that can be cumbersome when conducting a long simulation. I am using CESM1.5b on Hera, but I also ran into a similar problem on Cheyenne using CESM 1.2.2.1 a while back. Being able to use the resubmit flag on Cheyenne wasn't a big deal at the time, so I never really followed up on it. Regardless, this is likely something that I am doing incorrectly that Slurm doesn't like, but I am unsure what it could be. Does anyone have any suggestions?

Thank you for your help,
Chris
 