
resubmit error

xliu

Jon
Member
What would normally cause cesm.exe to show an MPI error and exit after the first run has finished? I have STOP_N=30 and RESUBMIT=2, but only the first submission works.
Thanks,
 

nusbaume

Jesse Nusbaumer
CSEG and Liaisons
Staff member
Hi Jon,

I am not sure I fully understand the question. Are you saying that even though you set RESUBMIT=2, the model doesn't actually re-submit and thus only runs once? If so, then my first guess is that the model isn't producing restart files correctly, which can occur if the model fails before it gets to the restart file-writing stage, or if the variables REST_N, REST_OPTION, or REST_DATE are set such that they are much greater than the equivalent STOP_XXX variables.

If possible, can you send along your env_run.xml and cesm.log.XXX files? That might help us better understand what is causing the issue (if what I described is indeed the problem).

Thanks, and have a great day!

Jesse
 

xliu

Jon
Member
Thanks Jesse. Yes, when I set RESUBMIT > 0, it only runs one round. Only the results of the first run show up in the archive folder.

env_run.xml:
<!--"Run initialization type, valid values: startup,hybrid,branch (char) " -->
<entry id="RUN_TYPE" value="startup" />

<!--"Run start date (yyyy-mm-dd). Only used for startup or hybrid runs (char) " -->
<entry id="RUN_STARTDATE" value="0001-01-01" />

<!--"start time-of-day (integer) " -->
<entry id="START_TOD" value="0" />

<!--"Reference case for hybrid or branch runs (char*256) " -->
<entry id="RUN_REFCASE" value="case.std" />

<!--"Reference date for hybrid or branch runs (yyyy-mm-dd) (char*10) " -->
<entry id="RUN_REFDATE" value="0001-01-01" />

<!--"Reference time of day (seconds) for hybrid or branch runs (sssss) (char) " -->
<entry id="RUN_REFTOD" value="00000" />

<!--"allow same branch casename as reference casename, valid values: TRUE,FALSE (logical) " -->
<entry id="BRNCH_RETAIN_CASENAME" value="FALSE" />

<!--"flag for automatically prestaging the refcase restart dataset, valid values: TRUE,FALSE (logical) " -->
<entry id="GET_REFCASE" value="FALSE" />

<!-- ====================================== -->

<!--"sets the run length with STOP_N and STOP_DATE (must be nyear(s) for _GLC compsets for restarts to work properly), valid values: none,never,nsteps,nstep,nseconds,nsecond,nminutes,nminute,nhours,nhour,ndays,nday,nmonths,nmonth,nyears,nyear,date,ifdays0,end (char) " -->
<entry id="STOP_OPTION" value="nmonths" />

<!--"sets the run length with STOP_OPTION and STOP_DATE (integer) " -->
<entry id="STOP_N" value="3" />

<!--"date in yyyymmdd format, sets the run length with STOP_OPTION and STOP_N (integer) " -->
<entry id="STOP_DATE" value="-999" />

<!-- ====================================== -->

<!--"sets frequency of model restart writes (same options as STOP_OPTION) (must be nyear(s) for _GLC compsets) (char) " -->
<entry id="REST_OPTION" value="$STOP_OPTION" />

<!--"sets model restart writes with REST_OPTION and REST_DATE (char) " -->
<entry id="REST_N" value="$STOP_N" />

<!--"date in yyyymmdd format, sets model restart write date with REST_OPTION and REST_N (char) " -->
<entry id="REST_DATE" value="$STOP_DATE" />

<!--"A setting of TRUE implies a continuation run, valid values: TRUE,FALSE (logical) " -->
<entry id="CONTINUE_RUN" value="FALSE" />

<!--"If RESUBMIT is greater than 0, then case will automatically resubmit (integer) " -->
<entry id="RESUBMIT" value="2" />
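
As a quick cross-check of the settings listed above, here is a minimal Python sketch (not part of CESM; the helper names `load_entries` and `resolve` are hypothetical) that reads `<entry id=... value=... />` lines and follows `$VAR` references, so the restart frequency can be compared with the run length. It uses a regular expression rather than an XML parser only because the excerpt pasted here has no root element.

```python
import re

# Matches lines such as: <entry id="STOP_N" value="3" />
ENTRY_RE = re.compile(r'<entry\s+id="([^"]+)"\s+value="([^"]*)"\s*/>')

def load_entries(text):
    """Return {id: raw value} for every <entry id=... value=... /> found."""
    return dict(ENTRY_RE.findall(text))

def resolve(entries, name, _depth=0):
    """Follow $OTHER_ID references until a literal value is reached."""
    value = entries[name]
    if value.startswith("$") and _depth < 10:
        return resolve(entries, value[1:], _depth + 1)
    return value

# Values copied from the env_run.xml excerpt in this post.
sample = '''
<entry id="STOP_OPTION" value="nmonths" />
<entry id="STOP_N" value="3" />
<entry id="STOP_DATE" value="-999" />
<entry id="REST_OPTION" value="$STOP_OPTION" />
<entry id="REST_N" value="$STOP_N" />
<entry id="REST_DATE" value="$STOP_DATE" />
<entry id="RESUBMIT" value="2" />
'''

entries = load_entries(sample)
rest = (resolve(entries, "REST_OPTION"), resolve(entries, "REST_N"))
stop = (resolve(entries, "STOP_OPTION"), resolve(entries, "STOP_N"))
restart_matches_stop = rest == stop
print("restart:", rest, "stop:", stop, "match:", restart_matches_stop)
```

With these values the restart settings resolve to the same option and count as the stop settings, so a restart file should be written at the end of each run segment.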

cesm.log:
--------------------------------------------------------------------------
[[42351,1],7]: A high-performance Open MPI point-to-point messaging module
was unable to find any relevant network interfaces:

Module: OpenFabrics (openib)
Host: ip-10-70-132-13

Another transport will be used instead, although this may result in
lower performance.
NOTE: You can disable this warning by setting the MCA parameter
btl_base_warn_component_unused to 0.
--------------------------------------------------------------------------
(seq_comm_setcomm) initialize ID ( 1 GLOBAL ) pelist = 0 62 1 ( npes = 63) ( nthreads = 1)
(seq_comm_setcomm) initialize ID ( 2 CPL ) pelist = 1 56 1 ( npes = 56) ( nthreads = 1)
(seq_comm_setcomm) initialize ID ( 17 ATM ) pelist = 0 55 1 ( npes = 56) ( nthreads = 1)
(seq_comm_joincomm) initialize ID ( 18 CPLATM ) join IDs = 2 17 ( npes = 57) ( nthreads = 1)
.
.
(seq_comm_setcomm) initialize ID ( 29 WAV ) pelist = 7 62 1 ( npes = 56) ( nthreads = 1)
(seq_comm_joincomm) initialize ID ( 30 CPLWAV ) join IDs = 2 29 ( npes = 62) ( nthreads = 1)
(seq_comm_jcommarr) initialize ID ( 9 ALLWAVID ) join multiple comp IDs ( npes = 56) ( nthreads = 1)
(seq_comm_joincomm) initialize ID ( 16 CPLALLWAVID ) join IDs = 2 9 ( npes = 62) ( nthreads = 1)
[ip-10-70-132-10:44533] *** An error occurred in MPI_Irecv
[ip-10-70-132-10:44533] *** reported by process [2775515137,0]
[ip-10-70-132-10:44533] *** on communicator MPI_COMM_WORLD
[ip-10-70-132-10:44533] *** MPI_ERR_COUNT: invalid count argument
[ip-10-70-132-10:44533] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[ip-10-70-132-10:44533] *** and potentially your MPI job)
[ip-10-70-132-10:44527] 62 more processes have sent help message help-mpi-btl-base.txt / btl:no-nics
[ip-10-70-132-10:44527] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
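
To pull the fatal error out of a long cesm.log, a small Python sketch along these lines may help. It assumes the Open MPI message layout seen in the log above, and the function name `summarize_mpi_errors` is hypothetical; other MPI implementations format their errors differently.

```python
import re

# Patterns based on the Open MPI abort lines shown in the log above.
MPI_CALL_RE = re.compile(r"\*\*\* An error occurred in (MPI_\w+)")
MPI_CLASS_RE = re.compile(r"\*\*\* (MPI_ERR_\w+): (.*)")

def summarize_mpi_errors(log_text):
    """Collect the failing MPI calls and error classes found in a log."""
    return {
        "calls": MPI_CALL_RE.findall(log_text),
        "classes": MPI_CLASS_RE.findall(log_text),
    }

# Lines copied from the cesm.log excerpt in this post.
sample = """\
[ip-10-70-132-10:44533] *** An error occurred in MPI_Irecv
[ip-10-70-132-10:44533] *** reported by process [2775515137,0]
[ip-10-70-132-10:44533] *** on communicator MPI_COMM_WORLD
[ip-10-70-132-10:44533] *** MPI_ERR_COUNT: invalid count argument
[ip-10-70-132-10:44533] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
"""

summary = summarize_mpi_errors(sample)
print(summary)
```

The openib "no relevant network interfaces" block is only a warning, so the sketch ignores it and reports just the call and error class that caused the abort.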
 

nusbaume

Jesse Nusbaumer
CSEG and Liaisons
Staff member
Hi Jon,

Thanks for the file info! It looks like your restart settings are correct, but the model is failing when re-submitted. Can you send the run logs of the model components (atm.log, lnd.log, etc.)? In particular, if one of them contains a specific error, that might point to what is actually failing. The same goes for any sort of run.<casename> file generated by the batch system.

Also, if you need to re-generate these files then you can simply set CONTINUE_RUN to TRUE in env_run.xml and then run case.submit again.

Thanks!

Jesse
 

xliu

Jon
Member
There are no other log files. It ran for only about 1 second, then stopped.
I guess it's an MPI task setting issue, but I'm curious why the first run works fine and then the error shows up.
Thanks
 

nusbaume

Jesse Nusbaumer
CSEG and Liaisons
Staff member
Hi Jon,

The MPI settings are exactly the same between an initial run and its restart (assuming they weren't modified by the user). The fact that this error seemed to occur before any non-coupler model components ran, and with the same configuration as a simulation that ran successfully, makes me believe that it might be a system error, as opposed to an error with CESM itself. What happens if you just set CONTINUE_RUN to TRUE and then re-submit manually?

Also, I am moving this thread to the "Infrastructure" forum, as it is watched by people with more expertise in MPI and coupling than I have, and who might be able to provide additional advice.

Thanks, and have a great day!

Jesse
 

jedwards

CSEG and Liaisons
Staff member
In order for the CESM resubmit feature to work, you must be able to submit tasks from the system's compute nodes. The error from Open MPI about
> unable to find any relevant network interfaces:
suggests that there is a problem with doing this. As an alternative, try passing the --resubmit-immediate flag to ./case.submit. This will submit all of the model runs at once and use the queueing system to hold each one until the previous run has completed. You might also want to consult with your systems support staff about the nature of that error.
 

xliu

Jon
Member
Thanks Jesse. CONTINUE_RUN changes to TRUE automatically after the first run. I have to xmlchange it back to FALSE for the restart; otherwise it shows an error.
 

xliu

Jon
Member
Thanks Jim. It's hard to say where the issue comes from. Hardware? Each time I run, it stops at a different step: sometimes after the first run, sometimes after a few runs.

...
st_archive.sh: short-term archiving completed successfully
RESUBMIT is now 1
-------------------------------------------------------------------------
CESM BUILDNML SCRIPT STARTING
- To prestage restarts, untar a restart.tar file into /home/ssm-1/My_Projects/ucar_CESM/CESM1_2_2_1/projects/test2/run
infile is /home/ssm-1/My_Projects/ucar_CESM/CESM1_2_2_1/projects/test2/Buildconf/cplconf/cesm_namelist
CESM BUILDNML SCRIPT HAS FINISHED SUCCESSFULLY
-------------------------------------------------------------------------
-------------------------------------------------------------------------
CESM PRESTAGE SCRIPT STARTING
- Case input data directory, DIN_LOC_ROOT, is /home/ssm-1/My_Projects/ucar_CESM/CESM1_2_2_1/projects/inputdata
- Checking the existence of input datasets in DIN_LOC_ROOT
CESM PRESTAGE SCRIPT HAS FINISHED SUCCESSFULLY
-------------------------------------------------------------------------
Thu May 6 20:39:43 UTC 2021 -- CSM EXECUTION BEGINS HERE
Thu May 6 20:39:43 UTC 2021 -- CSM EXECUTION HAS FINISHED
grep: cpl.log.210506-203918: No such file or directory
Model did not complete - see /home/ssm-1/My_Projects/ucar_CESM/CESM1_2_2_1/projects/test2/run/cesm.log.210506-203918
ccsm_postrun error: problem sourcing tempres..
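
One way to spot this kind of immediate failure automatically is to compare the two "CSM EXECUTION" timestamps. The Python sketch below assumes the timestamp layout printed in the output above (English day and month names, UTC); the function name `execution_seconds` is hypothetical.

```python
from datetime import datetime

# Timestamp layout as printed above, e.g. "Thu May 6 20:39:43 UTC 2021".
STAMP_FORMAT = "%a %b %d %H:%M:%S UTC %Y"

def execution_seconds(log_text):
    """Return seconds between the BEGINS and FINISHED lines, or None."""
    begin = end = None
    for line in log_text.splitlines():
        stamp, sep, message = line.partition(" -- ")
        if not sep:
            continue  # not a timestamped status line
        if message.strip() == "CSM EXECUTION BEGINS HERE":
            begin = datetime.strptime(stamp.strip(), STAMP_FORMAT)
        elif message.strip() == "CSM EXECUTION HAS FINISHED":
            end = datetime.strptime(stamp.strip(), STAMP_FORMAT)
    if begin is None or end is None:
        return None
    return (end - begin).total_seconds()

# Lines copied from the output in this post.
sample = """Thu May 6 20:39:43 UTC 2021 -- CSM EXECUTION BEGINS HERE
Thu May 6 20:39:43 UTC 2021 -- CSM EXECUTION HAS FINISHED"""

elapsed = execution_seconds(sample)
print(elapsed)
```

An elapsed time of zero seconds, as here, indicates that the executable aborted at launch rather than partway through the simulation, which is consistent with the missing cpl.log file.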
 