
How to resubmit the job automatically

I want to resubmit the job automatically because T42 CAM3 runs too slowly. I want to run 5 years, but each submission runs only about 3 months and then stops. I followed the buildscript on the NCAR webpage, but my job still stops at the end of the 3-month run. My run script is listed below. Could you help me point out what is wrong with it? Thanks very much.



...................
## Create the namelist
cd $blddir || echo "cd $blddir failed" && exit 1
echo "Building the namelist."
$cfgdir/build-namelist -s
-case $case
-runtype initial
-o $rundir/namelist
-namelist "&camexp nelapse=$nelapse,mss_irt=0 /"
|| echo "build-namelist failed" && exit 1


## Run CAM
cd $rundir
touch output.txt
echo "Beg:: "`date` `perl -e 'print time();'` >> output.txt
poe time $blddir/cam < namelist >>& output.txt
set year = `grep restart file output.txt | tail -1 | sed s/^.*.r.// | sed s/-.*\$//`
#cat namelist | sed s/nsrest.*=.*0/nsrest = 1/ > temp.namelist
#mv temp.namelist namelist
##zt
#if ( $year < 200 ) then
if ( $year < 5 ) then
##zt
echo Year is ${year}, resubmitting.
llsubmit run-cam.csh
else
touch $case.done
# setenv CASE $case
# /fs/cgd/home0/jmccaa/scripts/mkclimo.csh
endif
echo "End:: "`date` `perl -e 'print time();'` >> output.txt
chmod +x $rundir/run-cam.csh
exit 0
~
 

pjr

Member
Why don't you post the last 50 or so lines of the output from this
script, rather than the script itself? Maybe it will tell us what went wrong.

Phil
 
Thanks for your reply. My original "run-cam.csh" file is listed below.
Hope to hear from you, thanks.



#! /usr/bin/csh -f

#-----------------------------------------------------------------------
## IBM
##------------
##
## This is an example script to build and run the default CAM configuration
## on an IBM SP. The default configuration is T42L26, Eulerian dynamics,
## CLM2 land model, and CSIM4 ice model.
##
## Setting LoadLeveler options for batch queue submission.
## @ class is the queue in which to run the job.
## To see a list of available queues, type "llclass" interactively.
## @ node is the number of nodes. @tasks_per_node should be set to 1 because
## of the hybrid OpenMP/MPI configuration of CAM. The number of nodes should
## be a power of 2, up to a max of 16 for T42.
## @ output and @error are the names of file written to the directory from
## which the script is submitted containing STDOUT and STDERR respectively
## @ job_type = parallel declares that multiple nodes will be used.
## @ network.MPI: Has to do with network connection between nodes. Best to leave alone.
## @ node_usage = not_shared acquires dedicated access to nodes for the job.
## @ queue tells load leveler to submit the job

#@ class = com_reg
#@ account_no = XXXXXX
#@ node = 2
#@ tasks_per_node = 1
#@ output = out.$(jobid)
#@ error = out.$(jobid)
#@ job_type = parallel
#@ network.MPI = csss,not_shared,us
#@ node_usage = not_shared
#@ wall_clock_limit = 21600
#@ queue

## POE Environment. Set these for interactive jobs. They're ignored by LoadLeveler
## MP_NODES is the number of nodes. The number chosen should be a power of 2, up to a max of 16 for T42.
setenv MP_NODES 2
setenv MP_TASKS_PER_NODE 1
setenv MP_EUILIB us
setenv MP_RMPOOL 1
# TH: bug fix suggested by Brian Eaton 1/24/03
unsetenv MP_PROCS

# must be set equal to (CPUs-per-node / tasks_per_node)
setenv OMP_NUM_THREADS 4

## suggestion from Jim Edwards to reintroduce XLSMPOPTS on 11/13/03
setenv XLSMPOPTS "stack=256000000"
setenv AIXTHREAD_SCOPE S
setenv MALLOCMULTIHEAP true
setenv OMP_DYNAMIC false
## Do our best to get sufficient stack memory
limit stacksize unlimited

## netCDF stuff
setenv INC_NETCDF /usr/local/include
setenv LIB_NETCDF /usr/local/lib64/r4i4

## ROOT OF CAM DISTRIBUTION - probably needs to be customized.
## Contains the source code for the CAM distribution.
## (the root directory contains the subdirectory "models")
#set camroot = /fs/cgd/data0/$LOGNAME/cam2_0
set camroot = /home/blackforest/taoz/cam3.0/cam1

## ROOT OF CAM DATA DISTRIBUTION - needs to be customized unless running at NCAR.
## Contains the initial and boundary data for the CAM distribution.
## (the root directory contains the subdirectories "atm" and "lnd")
#setenv CSMDATA /fs/cgd/csm/inputdata
setenv CSMDATA /home/blackforest/taoz/cam3.0/inputdata

## Set compile options
# $dycore is the dynamical core: sld, eul, or fv.
# $resolution: for sld or eul: 128x256, 64x128,32x64,or 8x16; for fv: 2x2.5, 4x5, or 10x15
# $usr_src specifies location of user-modified files: 'none', or valid directory.
# $caseid if for notes to myself - keep it short!!!
#set dycore = sld
#set dycore = fv
set dycore = eul
#set usr_src = /fs/cgd/data0/$LOGNAME/t31_icealbedos
#set usr_src = /fs/cgd/home0/$LOGNAME/empty
set usr_src = /home/blackforest/taoz/cam3.0/empty
#set caseid = ialb
set caseid = test1

##zt added
## Default namelist settings:
## $case is the case identifier for this run. It will be placed in the namelist.
## $nelapse is the number of timesteps to integrate, or number of days if negative.
if ( $dycore == 'fv' ) then
set resolution = 2x2.5
# set resolution = 1x1.25
# set resolution = 4x5
set nelapse = -490
# set nelapse = -578
# set nelapse = -245
else
set resolution = 64x128
# set resolution = 128x256
# set resolution = 48x96
##zt
# set nelapse = -490
##zt
set nelapse = -1095

endif
set nlev = 26
set case = ${dycore}${resolution}_$caseid
#
## Default namelist settings:
## $case is the case identifier for this run. It will be placed in the namelist.
## $runtype is the run type: initial, restart, or branch.
## $nelapse is the number of timesteps to integrate, or number of days if negative.
#set case = camrun
#set runtype = initial
#set nelapse = -1
#
##zt added

## $wrkdir is a working directory where the model will be built and run.
## $blddir is the directory where model will be compiled.
## $rundir is the directory where the model will be run.
## $cfgdir is the directory containing the CAM configuration scripts.
set wrkdir = /ptmp/$LOGNAME
set blddir = $wrkdir/$case/bld
set rundir = $wrkdir/$case
set cfgdir = $camroot/models/atm/cam/bld

## Ensure that run and build directories exist
mkdir -p $rundir || echo "cannot create $rundir" && exit 1
mkdir -p $blddir || echo "cannot create $blddir" && exit 1

## If an executable doesn't exist, build one.
if ( ! -x $blddir/cam ) then
cd $blddir || echo "cd $blddir failed" && exit 1

## Control case
$cfgdir/configure
-dyn $dycore
-res $resolution
-usr_src $usr_src
-nlev $nlev
|| echo "configure failed" && exit 1

# $cfgdir/configure || echo "configure failed" && exit 1
#
echo "building CAM in $blddir ..."
rm -f Depends
gmake -j4 >&! MAKE.out || echo "CAM build failed: see $blddir/MAKE.out" && exit 1
else
echo Found $blddir/cam - not building a new one.
endif

## Create the namelist
cd $blddir || echo "cd $blddir failed" && exit 1
echo "Building the namelist."
##zt
#$cfgdir/build-namelist -s -case $case -runtype $runtype -o $rundir/namelist
# -namelist "&camexp nelapse=$nelapse /" || echo "build-namelist failed" && exit 1
#
$cfgdir/build-namelist -s
-case $case
-runtype initial
-o $rundir/namelist
-namelist "&camexp nelapse=$nelapse,mss_irt=0 /"
|| echo "build-namelist failed" && exit 1

##zt

## Run CAM
#cd $rundir || echo "cd $rundir failed" && exit 1
#echo "running CAM in $rundir"
#poe $blddir/cam < namelist || echo "CAM run failed" && exit 1
#exit 0
#
## Run CAM
cd $rundir
touch output.txt
echo "Beg:: "`date` `perl -e 'print time();'` >> output.txt
poe time $blddir/cam < namelist >>& output.txt
set year = `grep restart file output.txt | tail -1 | sed s/^.*.r.// | sed s/-.*\$//`
#cat namelist | sed s/nsrest.*=.*0/nsrest = 1/ > temp.namelist
#mv temp.namelist namelist
##zt
#if ( $year < 200 ) then
if ( $year < 5 ) then
##zt
echo Year is ${year}, resubmitting.
llsubmit run-cam.csh
else
touch $case.done
# setenv CASE $case
# /fs/cgd/home0/jmccaa/scripts/mkclimo.csh
endif
echo "End:: "`date` `perl -e 'print time();'` >> output.txt
chmod +x $rundir/run-cam.csh
exit 0
 

pjr

Member
No!

I don't want to see the script. I want to see the output from the script.
This will be stored in the "log" file (probably called "out.xxxxxx", where xxxxxx is some random sequence) that contains the results from the run
after it is complete. It should tell us why the resubmit failed.

Phil
 
Thanks, the output is listed below.

..............
nstep, te 7909 3330742628.80873442 1.13602371851603201 -0.113399101904875966E-03 98456.7838964076800
NSTEP = 7909 8.928216917231319E-05 7.364670377060318E-06 252.211 9.84568E+04 2.394036043617633E+01 0.88 0.26
nstep, te 7910 3330734994.95552540 0.970720070401827506 -0.968983384116682420E-04 98456.7602150869934
NSTEP = 7910 8.928537203024047E-05 7.366188885522735E-06 252.211 9.84567E+04 2.393348839756835E+01 0.87 0.26
nstep, te 7911 3330720000.87083054 0.401526928544044481 -0.400808820938823419E-04 98456.6983940480568
NSTEP = 7911 8.928861336679507E-05 7.366817985557253E-06 252.211 9.84567E+04 2.393162146424636E+01 0.87 0.26
nstep, te 7912 3330709770.46544361 -0.497100987235705061 0.496212069389963731E-04 98456.6749507322238
NSTEP = 7912 8.929195606508984E-05 7.368006315395193E-06 252.211 9.84566E+04 2.392546291998017E+01 0.87 0.26
nstep, te 7913 3330694273.66299772 -0.534442988832791621 0.533487607022528417E-04 98456.6176564534835
NSTEP = 7913 8.929524301614946E-05 7.369224362964630E-06 252.210 9.84566E+04 2.392080086778448E+01 0.87 0.25
nstep, te 7914 3330671024.25287676 -1.00088129202524811 0.999092571321882986E-04 98456.5708260479441
time: 0551-010 The process was stopped abnormally. Try again.

Real 21105.65
User 72356.70
System 437.02
time: 0551-010 The process was stopped abnormally. Try again.

Real 21104.02
User 76253.06
System 434.07
 

jmccaa

New Member
It appears your job is timing out of the queue before it completes. In this case, control will never be passed back to your run script, so it won't ever have a chance to resubmit itself. So you need to submit with a smaller value of nelapse in the namelist. But your script wouldn't work anyway -- it has numerous syntax errors. For instance, this line
tz0616 said:
if ( $year < 5 ) then
is not going to work as written. I find the csh man page to be a good resource for this sort of stuff.
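
As a rough way to pick a value of nelapse that finishes inside the queue limit, the timing in the posted log can be turned into an estimate. The Python sketch below is only illustrative: it takes the 7914 completed steps and roughly 21105 s of real time from the output above and the wall_clock_limit of 21600 s from the script, and it assumes a 1200 s model timestep (not stated in this thread) and an arbitrary 20% safety margin. The function name is made up for this example.

```python
import math

SECONDS_PER_DAY = 86400


def max_days_per_submission(steps_done, real_seconds, wall_clock_limit,
                            timestep_seconds=1200, safety_margin=0.2):
    """Estimate how many whole model days fit in one batch submission.

    timestep_seconds and safety_margin are assumptions for illustration;
    check the timestep actually used by your configuration.
    """
    seconds_per_step = real_seconds / steps_done
    usable_seconds = wall_clock_limit * (1.0 - safety_margin)
    affordable_steps = usable_seconds / seconds_per_step
    steps_per_day = SECONDS_PER_DAY / timestep_seconds
    return math.floor(affordable_steps / steps_per_day)


# Numbers taken from the posted log and script.
days = max_days_per_submission(steps_done=7914, real_seconds=21105.65,
                               wall_clock_limit=21600)
# nelapse is negative when it counts days rather than timesteps.
print(f"suggested setting: nelapse = -{days}")
```

Under the same timestep assumption, the 7914 steps in the log correspond to about 110 model days, far short of the 1095 days requested with nelapse = -1095, so the job is killed by the queue before the script reaches its resubmit logic.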

Jim
 
Thanks, Jim.
I only want to run several years (for example, 5 years), not 200 years.

From your reply, I understand that too large a value of nelapse leaves the
job no chance to resubmit itself, but I do not understand why you say my
run script would not work anyway and has numerous syntax errors, because I followed the example run script on the NCAR webpage:
http://www.ccsm.ucar.edu/models/atm-cam/sims/cam3.0/eul48x96_ialb/buildscript

Could you help me point out which line is wrong?
Thanks very much.
 

jmccaa

New Member
tz0616 said:

This is not a run script. It's a build script that creates a run script. Therefore it uses backslashes to preserve a variety of control characters. If you want an example run script, you should examine the one distributed with the model in the directory models/atm/cam/bld/run-ibm.csh. You're of course welcome to use the buildscript for ideas, but it is not something for which there is user support, so you may have to put some effort into understanding what it is doing. In particular, beware everything after the line
cat >! $rundir/run-cam.csh
 
Exactly. The buildscript cannot be run on "blackforest". I submitted a run script based on "run-ibm.csh" but modified it following the buildscript, because I want to learn how to resubmit the job automatically.

In this case, I still have some questions about resubmitting.

To resubmit manually, I first set runtype to "initial"; after the job stops,
I change runtype to "restart" and submit it again.

For automatic resubmission, I am not sure how the system can restart from the stopping point, because I cannot see where runtype is changed
from "initial" to "restart". Please take a look at my original run script posted above.
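
One way the switch from "initial" to "restart" could happen without editing the script by hand is hinted at by the commented-out sed lines in the run script: after the first segment, rewrite nsrest from 0 to 1 in the namelist, then read the model year from the last restart file named in the log to decide whether to resubmit. The Python sketch below mirrors that logic for illustration only. It assumes nsrest = 0 means an initial run and nsrest = 1 a restart (as the sed line implies); the sample namelist, log line, restart file name, and function names are all hypothetical. Note also that the script as posted reruns build-namelist with -runtype initial on every submission, which would overwrite any such change.

```python
import re


def switch_to_restart(namelist_text):
    """Return the namelist text with 'nsrest = 0' rewritten to 'nsrest = 1'."""
    return re.sub(r"nsrest\s*=\s*0\b", "nsrest = 1", namelist_text)


def last_restart_year(log_text):
    """Return the year from the last 'restart file' log line, or None.

    Follows the shape of the script's pipeline: grep | tail -1 | sed | sed.
    """
    lines = [ln for ln in log_text.splitlines() if "restart file" in ln]
    if not lines:
        # An empty value here would likely break a csh 'if ( $year < 5 )'.
        return None
    tail = re.sub(r"^.*\.r\.", "", lines[-1])  # drop everything through '.r.'
    year = re.sub(r"-.*$", "", tail)           # drop the rest after the year
    return int(year) if year.isdigit() else None


def should_resubmit(year, target_years=5):
    """Resubmit only when a year was found and the target is not yet reached."""
    return year is not None and year < target_years


# Hypothetical inputs for illustration.
namelist = "&camexp\n nelapse = -90\n nsrest = 0\n/\n"
log = "wrote restart file /ptmp/user/case/case.cam2.r.0003-01-01-00000\n"

print(switch_to_restart(namelist))
year = last_restart_year(log)
print(year, should_resubmit(year))
```

Returning None when no restart line is found keeps the "no year found" case explicit, so a failed or truncated run does not trigger a resubmission by accident.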

Thanks again for your valuable suggestions.
 