What sequence of model run and st_archive jobs to expect?

gus@ldeo_columbia_edu · Feb 6, 2020

I am running a CESM-2 case where I set:
RESUBMIT: 4
CONTINUE_RUN: TRUE

I ran ./case.submit, which produced this terminal output:

**********
Submitting job script qsub -q economy -l walltime=12:00:00 -A UCLB0023 -v ARGS_FOR_SCRIPT='--resubmit' .case.run
Submitted job id is 856447.chadmin1.ib0.cheyenne.ucar.edu
Submitting job script qsub -q economy -l walltime=0:20:00 -A UCLB0023 -W depend=afterok:856447.chadmin1.ib0.cheyenne.ucar.edu -v ARGS_FOR_SCRIPT='--resubmit' case.st_archive
Submitted job id is 856448.chadmin1.ib0.cheyenne.ucar.edu
Submitted job case.run with id 856447.chadmin1.ib0.cheyenne.ucar.edu
Submitted job case.st_archive with id 856448.chadmin1.ib0.cheyenne.ucar.edu
**********
I was puzzled that qsub passed the --resubmit flag to BOTH the main job (.case.run) and the short term archiving job (case.st_archive).
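Apart from the --resubmit flag, the second qsub line above also carries -W depend=afterok:856447..., which is the PBS option that keeps case.st_archive on hold until the run job has exited successfully. A minimal sketch of how such a pair of command lines fits together, using only the options that appear in the output above; the helper name qsub_command is made up for illustration and is not CIME's own code:

```python
def qsub_command(script, walltime, after_ok=None,
                 queue="economy", account="UCLB0023"):
    """Build a qsub argument list like the ones printed by ./case.submit above.

    Hypothetical helper for illustration only, not part of CIME.
    """
    cmd = ["qsub", "-q", queue, "-l", f"walltime={walltime}", "-A", account]
    if after_ok is not None:
        # PBS holds this job (state H in qstat) until job `after_ok`
        # has terminated without errors.
        cmd += ["-W", f"depend=afterok:{after_ok}"]
    cmd += ["-v", "ARGS_FOR_SCRIPT=--resubmit", script]
    return cmd


run_id = "856447.chadmin1.ib0.cheyenne.ucar.edu"
run_cmd = qsub_command(".case.run", "12:00:00")
archive_cmd = qsub_command("case.st_archive", "0:20:00", after_ok=run_id)

print(" ".join(run_cmd))
print(" ".join(archive_cmd))
```

Only the archive job carries the dependency. The run job has none, so the two jobs of one pair cannot overlap regardless of which of them handles the resubmission.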

QUESTION: Should I expect the jobs to run in a fully sequential fashion, alternating between the main model run and the corresponding st_archive?
Or do the main model run job and the st_archive job resubmit themselves independently of each other?

In other words, I was hoping the sequence would be tight like this:
1) run year 1, holding st_archive of year 1 until run year 1 ends (qstat shows that this is the case)
2) run st_archive of year 1, at the end submit run year 2 (qstat doesn't tell whether it will happen this way)
3) run year 2, holding st_archive of year 2 until run year 2 ends
4) run st_archive of year 2, at the end submit run year 3
and so on until RESUBMIT=0.
That would avoid overlapping the run and the archiving of different years.

However, the presence of the --resubmit flag on the qsub command line of both types of job
suggests that each one resubmits itself independently of the other.
In that case, my impression is that the st_archive of year 1 could still be running when the main run of year 2 starts
and begins producing history files in the RUNDIR. Incomplete history files that are still being written could then be archived,
or only a partial set of the year 2 history files could be archived, among similar inconsistencies.

I tried to answer this question myself.
I read the Python scripts that actually implement .case.run and case.st_archive.
However, they are somewhat cryptic, and I didn't really come to any conclusion.

Thank you,
Gus Correa

fischer · Feb 6, 2020

Hi Gus,

The sequence works the way you hope. The next year's run will not be submitted to the queue until st_archive has
finished.

Chris
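
The ordering described in this answer can be written down as a toy model. This is only a sketch of the sequence, not CIME code: the function name simulate_chain is made up, and it assumes that RESUBMIT counts the resubmissions after the first run, so RESUBMIT=4 gives five run/archive pairs.

```python
def simulate_chain(resubmit):
    """Return the order in which jobs execute in a toy run/st_archive chain.

    Assumption (not taken from CIME source): the next run is submitted only
    once the st_archive of the current segment has finished, and RESUBMIT is
    decremented at each resubmission until it reaches 0.
    """
    events = []
    segment = 1
    while True:
        events.append(("case.run", segment))         # model run of this segment
        events.append(("case.st_archive", segment))  # starts only after the run ends
        if resubmit == 0:
            break                                    # nothing left to resubmit
        resubmit -= 1
        segment += 1                                 # next pair is submitted now
    return events


for job, segment in simulate_chain(4):
    print(f"segment {segment}: {job}")
```

In this model the archive of segment N always finishes before the run of segment N+1 starts, which is the non-overlapping behaviour asked about in the question.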

gus@ldeo_columbia_edu · Feb 6, 2020

Thank you, Chris!
That is good news!

Actually, before I read your answer, I was babysitting the jobs with qstat,
and the time sequence did match what you said.
I include some samples below, for reference.

Thank you,
Gus Correa

*************
Job 856447 is the model run of year 1. It is running (R, time 1) and finishes at 6:13 (E, time 2).
Job 856448 is the st_archive of year 1. It switches from hold (H, time 1) to queued (Q, time 2) at that same time,
then it runs alone (time 3).
Job 865383 is the model run of year 2. It is queued (Q, time 4), and eventually runs (R, time 5).
Job 865384 is the st_archive of year 2. It is put on hold (H, time 4), and stays on hold (H, time 5).

***
time 1
                                                            Req'd  Req'd   Elap
Job ID          Username Queue    Jobname    SessID NDS TSK Memory Time  S Time
--------------- -------- -------- ---------- ------ --- --- ------ ----- - -----
856447.chadmin1 gus      economy  wa6_ic1.00  63628  99 356     -- 12:00 R 06:12
856448.chadmin1 gus      economy  wa6_ic1.00     --   1   1     -- 00:20 H    --

***
time 2

                                                            Req'd  Req'd   Elap
Job ID          Username Queue    Jobname    SessID NDS TSK Memory Time  S Time
--------------- -------- -------- ---------- ------ --- --- ------ ----- - -----
856447.chadmin1 gus      economy  wa6_ic1.00  63628  99 356     -- 12:00 E 06:13
856448.chadmin1 gus      economy  wa6_ic1.00     --   1   1     -- 00:20 Q    --

***
time 3

                                                            Req'd  Req'd   Elap
Job ID          Username Queue    Jobname    SessID NDS TSK Memory Time  S Time
--------------- -------- -------- ---------- ------ --- --- ------ ----- - -----
856448.chadmin1 gus      economy  wa6_ic1.00  53390   1   1     -- 00:20 R 00:00

***
time 4
                                                            Req'd  Req'd   Elap
Job ID          Username Queue    Jobname    SessID NDS TSK Memory Time  S Time
--------------- -------- -------- ---------- ------ --- --- ------ ----- - -----
865383.chadmin1 gus      economy  wa6_ic1.00     --  99 356     -- 12:00 Q    --
865384.chadmin1 gus      economy  wa6_ic1.00     --   1   1     -- 00:20 H    --

***
time 5

                                                            Req'd  Req'd   Elap
Job ID          Username Queue    Jobname    SessID NDS TSK Memory Time  S Time
--------------- -------- -------- ---------- ------ --- --- ------ ----- - -----
865383.chadmin1 gus      economy  wa6_ic1.00  58001  99 356     -- 12:00 R 00:00
865384.chadmin1 gus      economy  wa6_ic1.00     --   1   1     -- 00:20 H    --
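
The job states in listings like the ones above can also be pulled out programmatically instead of by eye. A minimal sketch, assuming the 11-column layout shown in this thread, where the state letter is the tenth whitespace-separated field; the function name job_states is made up for illustration:

```python
def job_states(qstat_text):
    """Map job ID -> state letter (R, H, Q, E, ...) from a qstat listing.

    Assumes the column layout shown in this thread: data rows have exactly
    11 whitespace-separated fields and begin with a numeric job ID.
    """
    states = {}
    for line in qstat_text.splitlines():
        fields = line.split()
        # Header and separator rows either have a different field count
        # or do not start with a digit, so they are skipped.
        if len(fields) == 11 and fields[0][0].isdigit():
            states[fields[0]] = fields[9]
    return states


# The "time 1" listing from this post.
TIME_1 = """\
Req'd Req'd Elap
Job ID Username Queue Jobname SessID NDS TSK Memory Time S Time
--------------- -------- -------- ---------- ------ --- --- ------ ----- - -----
856447.chadmin1 gus economy wa6_ic1.00 63628 99 356 -- 12:00 R 06:12
856448.chadmin1 gus economy wa6_ic1.00 -- 1 1 -- 00:20 H --
"""

states = job_states(TIME_1)
print(states)
```

For the "time 1" listing this reports the run job as R and the st_archive job as H, matching the description above.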
