christopher_maloney@colorado_edu
New Member
Hello everyone,
I am trying to conduct a multi-year simulation and I would like to use the "Resubmit" flag to let the model run continuously. However, whenever the first run finishes, the model immediately fails during the next iteration with a memory failure and the following message:
slurmstepd: error: Detected 1 oom-kill event(s) in step 865636.0 cgroup. Some of your processes may have been killed by the cgroup out-of-memory handler.
srun: error: h1c38: task 1: Out Of Memory
srun: Terminating job step 865636.0
I can manually resubmit continuation runs successfully, but that can be cumbersome when conducting a long simulation. I am using CESM1.5b on Hera, but I also ran into a similar problem on Cheyenne using CESM 1.2.2.1 a while back. Being able to use the resubmit flag on Cheyenne wasn't a big deal at the time, so I never really followed up on it. Regardless, this is likely something that I am doing incorrectly that Slurm doesn't like, but I am unsure what it could be. Does anyone have any suggestions?
Thank you for your help,
Chris