Resubmit memory failure

Hi all,

A few days ago I posted this question in a different thread, but I now realize this forum is the more appropriate place for it.

I am running into a 'Resubmit' memory issue after porting CESM1.5b to Hera, which uses the Slurm scheduler. The problem occurs when I set the 'Resubmit' flag to any value greater than zero in my case's env_run.xml file. When I do that, the first iteration of the continuation job runs successfully; however, the next iteration immediately fails with the following error message:

slurmstepd: error: Detected 2 oom-kill event(s) in step 903765.0 cgroup. Some of your processes may have been killed by the cgroup out-of-memory handler.
srun: error: h3c44: task 1: Out Of Memory
srun: Terminating job step 903765.0
slurmstepd: error: *** STEP 903765.0 ON h3c44 CANCELLED AT 2019-11-20T22:57:54 ***
forrtl: error (78): process killed (SIGTERM)
Image PC Routine Line Source
cesm.exe 00000000023D939E Unknown Unknown Unknown
libpthread-2.17.s 00002B8D9A9305D0 Unknown Unknown Unknown
libmpi.so.12 00002B8D999BCC9E PMPIDI_CH3I_Progr Unknown Unknown
libmpi.so.12.0 00002B8D99B241E0 Unknown Unknown Unknown
libmpi.so.12.0 00002B8D999A4CD8 Unknown Unknown Unknown
libmpi.so.12.0 00002B8D9999625D Unknown Unknown Unknown
libmpi.so.12 00002B8D9999A23D MPI_Barrier Unknown Unknown
libmpifort.so.12. 00002B8D9953572C pmpi_barrier Unknown Unknown
cesm.exe 000000000042C3AF cesm_comp_mod_mp_ 1694 cesm_comp_mod.F90
cesm.exe 0000000000434C63 MAIN__ 62 cesm_driver.F90
cesm.exe 000000000041D2DE Unknown Unknown Unknown
libc-2.17.so 00002B8D9AE61495 __libc_start_main Unknown Unknown
cesm.exe 000000000041D1E9 Unknown Unknown Unknown

In response to this issue I increased the memory requested for the short-term archiving job, which runs immediately after the main compute job finishes. Doing this allows the next iteration of the continuation run to start successfully; however, after about a minute the job again fails with the same memory error. I am not exactly sure when or how the resubmit step does its work, but could this be a dependency issue where the resubmit scripts try to start the new iteration before the archiving job has finished? Or could the resubmit scripts be launching the next iteration of the compute job on the single node where the archiving scripts run? I'd really appreciate any suggestions, as this error makes running long simulations rather difficult.
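
For reference, one way to check whether this is a dependency problem is to look at what Slurm has queued right after the compute job finishes; a sketch (the format codes are standard squeue options, and the job names here only assume the usual case.run / case.st_archive naming):

# list my jobs with their state and any dependencies still holding them
squeue -u $USER -o "%.12i %.25j %.10T %.20E"

If the resubmitted run job shows no dependency on the st_archive job, the two could indeed end up running at the same time.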
 
I'd like to provide an update on this issue, as I have finally had time to port the newer CESM2.1 to Hera. The resubmit option still fails for continuation runs with CESM2.1. If I set "DOUT_S" to FALSE, the model successfully resubmits the case for another run, but if "DOUT_S" is TRUE, the resubmit fails. This tells me that something is wrong with "st_archive". I've been poking around the code, but it's unclear to me what to do: do I need to change the Python scripts related to short-term archiving, or modify something in the "config_xx.xml" scripts? Hera uses the Slurm workload manager.
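
For anyone trying to reproduce the isolation test above, it comes down to toggling the archiving flag from the case directory with the standard CIME commands (a sketch; DOUT_S is the same env_run.xml variable referenced above):

# resubmission succeeds with short-term archiving off
./xmlchange DOUT_S=FALSE
./case.submit

# resubmission fails with short-term archiving on
./xmlchange DOUT_S=TRUE
./case.submit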

Any assistance on this would be greatly appreciated.

Cheers,
Chris
 

jedwards

CSEG and Liaisons
Hi,

I have not seen this problem with resubmit on other Slurm systems. A workaround you may consider is to use the --resubmit-immediate option to ./case.submit; this will submit all RESUBMIT jobs at once instead of trying to resubmit from the compute node, which may be the problem here.
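
In case the exact usage is helpful (assuming RESUBMIT is already set in env_run.xml):

./case.submit --resubmit-immediate

This queues all of the resubmissions up front and chains them through the batch system's dependency handling, rather than having each run submit the next one from a compute node.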

Another possible solution is to ssh back to a login node before resubmitting - look at the setup for stampede-skx in the config_batch.xml file for an example.
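
For reference, that option follows a pattern roughly like the sketch below, modeled on the stampede-skx block; the MACH name and login host are placeholders for Hera, so check the exact layout against config_batch.xml in your CIME version:

<batch_system MACH="hera" type="slurm">
  <!-- ssh back to a login node so sbatch is not run from a compute node -->
  <batch_submit>ssh hera-login-host cd $CASEROOT ; sbatch</batch_submit>
</batch_system>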
 

Hi Jim,

Sorry for the delayed response; for some reason I wasn't notified about the comment this time. Thank you for the suggestions. I was able to get a continuation run to resubmit successfully multiple times by using the "--resubmit-immediate" option. This will greatly help with simulation efficiency!

Cheers,
Chris
 