christopher_maloney@colorado_edu
New Member
Hi all,
A few days ago I posted this question on a different thread, but I realize now that this thread is probably the better place for it.
I am running into a 'Resubmit' memory issue after porting CESM1.5b to Hera, which uses the Slurm scheduler. The problem occurs when I set the 'RESUBMIT' flag to any value greater than zero in my case's env_run.xml. When I do that, the first iteration of the continuation run completes successfully; however, the next iteration immediately fails with the following error message:
slurmstepd: error: Detected 2 oom-kill event(s) in step 903765.0 cgroup. Some of your processes may have been killed by the cgroup out-of-memory handler.
srun: error: h3c44: task 1: Out Of Memory
srun: Terminating job step 903765.0
slurmstepd: error: *** STEP 903765.0 ON h3c44 CANCELLED AT 2019-11-20T22:57:54 ***
forrtl: error (78): process killed (SIGTERM)
Image PC Routine Line Source
cesm.exe 00000000023D939E Unknown Unknown Unknown
libpthread-2.17.s 00002B8D9A9305D0 Unknown Unknown Unknown
libmpi.so.12 00002B8D999BCC9E PMPIDI_CH3I_Progr Unknown Unknown
libmpi.so.12.0 00002B8D99B241E0 Unknown Unknown Unknown
libmpi.so.12.0 00002B8D999A4CD8 Unknown Unknown Unknown
libmpi.so.12.0 00002B8D9999625D Unknown Unknown Unknown
libmpi.so.12 00002B8D9999A23D MPI_Barrier Unknown Unknown
libmpifort.so.12. 00002B8D9953572C pmpi_barrier Unknown Unknown
cesm.exe 000000000042C3AF cesm_comp_mod_mp_ 1694 cesm_comp_mod.F90
cesm.exe 0000000000434C63 MAIN__ 62 cesm_driver.F90
cesm.exe 000000000041D2DE Unknown Unknown Unknown
libc-2.17.so 00002B8D9AE61495 __libc_start_main Unknown Unknown
cesm.exe 000000000041D1E9 Unknown Unknown Unknown
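For reference, the resubmit setup itself is just the standard CIME workflow; a minimal sketch of how I set the flag and submit the case is below (assuming the usual xmlchange / case.submit commands, run from the case directory):

cd $CASEROOT
./xmlchange RESUBMIT=2          # ask for two more segments after the first one completes
./xmlchange CONTINUE_RUN=FALSE  # first segment is the initial run; I believe CIME flips this to TRUE on resubmit
./case.submit                   # submits case.run plus the dependent short-term archive job to Slurm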
In response, I increased the memory requested for the short-term archiving job, which runs immediately after the main compute job finishes. That allows the next iteration of the continuation run to start successfully; however, after about a minute the job fails again with the same memory error. I am not exactly sure when or how the resubmit process does its work, so I am wondering: could this be a dependency issue, where the resubmit scripts start the new iteration before the archiving job has completed? Or could the resubmit scripts be launching the next iteration of the compute job on the single node where the archiving scripts run? I would really appreciate any suggestions, as this error is making long simulations rather difficult to run.
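If it helps with diagnosis, these are the plain Slurm commands (nothing CESM-specific; the format string is just one example) that I can use to check whether the resubmitted run job is actually chained behind the archive job, and what memory it requested:

squeue -u $USER -o "%.12i %.25j %.8T %.10M %E"                        # %E shows each job's dependency
scontrol show job <jobid> | grep -E 'Dependency|MinMemory|NumNodes'   # dependency and memory request for a given job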