gus@ldeo_columbia_edu
I am running a CESM-2 case where I set:
RESUBMIT: 4
CONTINUE_RUN: TRUE
I ran ./case.submit, which produced this terminal output:
**********
Submitting job script qsub -q economy -l walltime=12:00:00 -A UCLB0023 -v ARGS_FOR_SCRIPT='--resubmit' .case.run
Submitted job id is 856447.chadmin1.ib0.cheyenne.ucar.edu
Submitting job script qsub -q economy -l walltime=0:20:00 -A UCLB0023 -W depend=afterok:856447.chadmin1.ib0.cheyenne.ucar.edu -v ARGS_FOR_SCRIPT='--resubmit' case.st_archive
Submitted job id is 856448.chadmin1.ib0.cheyenne.ucar.edu
Submitted job case.run with id 856447.chadmin1.ib0.cheyenne.ucar.edu
Submitted job case.st_archive with id 856448.chadmin1.ib0.cheyenne.ucar.edu
**********
I was puzzled that qsub passed the --resubmit flag to BOTH the main job (.case.run) and the short term archiving job (case.st_archive).
QUESTION: Should I expect the jobs to run in a fully sequential fashion, alternating the main model run, and the corresponding st_archive?
Or do the main model run job and the st_archive job resubmit themselves independently of each other?
In other words, I was hoping the sequence would be tight like this:
1) run year 1, holding st_archive of year 1 until run year 1 ends (qstat shows that this is the case)
2) run st_archive of year 1, at the end submit run year 2 (qstat doesn't tell whether it will happen this way)
3) run year 2, holding st_archive of year 2 until run year 2 ends
4) run st_archive of year 2, at the end submit run year 3
and so on until RESUBMIT=0.
That would avoid overlapping the run and the archiving of different years.
However, the presence of the --resubmit flag on the qsub command line of both types of job
suggests that each one resubmits itself independently of the other.
In that case, my impression is that the st_archive of year 1 could still be running when the main run of year 2 starts
and begins producing history files in the RUNDIR. The risk is that incomplete history files that are still being written could be archived,
that only a partial/small set of the year 2 history files would be archived, and similar inconsistencies.
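To make the two possibilities concrete, here is a toy timeline sketch in Python. It is not CIME code and says nothing about how case.submit actually behaves; the function names and the durations (10 hours per run, 15 minutes per archive) are made up purely for illustration. It only shows which archive jobs would overlap which run jobs under each assumption.

```python
# Toy model only: this is NOT how CIME implements resubmission.
# All names and durations here are hypothetical.

def sequential_chain(n_segments, run_hours, archive_hours):
    """Assumption A: each st_archive job submits the next run when it ends."""
    events, t = [], 0.0
    for year in range(1, n_segments + 1):
        run = (t, t + run_hours)
        archive = (run[1], run[1] + archive_hours)
        events.append((year, run, archive))
        t = archive[1]  # next run starts only after archiving is done
    return events

def independent_chains(n_segments, run_hours, archive_hours):
    """Assumption B: each run resubmits the next run by itself;
    the archive of year N only waits for the run of year N."""
    events, t = [], 0.0
    for year in range(1, n_segments + 1):
        run = (t, t + run_hours)
        archive = (run[1], run[1] + archive_hours)
        events.append((year, run, archive))
        t = run[1]  # next run starts as soon as this run ends
    return events

def overlaps(events):
    """Return (archive_year, run_year) pairs whose time intervals overlap."""
    pairs = []
    for year_a, _, (a_start, a_end) in events:
        for year_r, (r_start, r_end), _ in events:
            if year_r != year_a and a_start < r_end and r_start < a_end:
                pairs.append((year_a, year_r))
    return pairs

# RESUBMIT=4 means 5 segments in total (the first run plus 4 resubmissions).
print("sequential: ", overlaps(sequential_chain(5, 10.0, 0.25)))
print("independent:", overlaps(independent_chains(5, 10.0, 0.25)))
```

With these made-up durations, the first scheme shows no overlap at all, while in the second the archive of each year overlaps the run of the following year. That second pattern is the situation I am worried about.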
I tried to answer this question myself.
I read the Python scripts that actually implement .case.run and case.st_archive.
However, they are somewhat cryptic, and I didn't really come to any conclusion.
Thank you,
Gus Correa