bsub script for resubmission?

whannah

Member
I'm working with the SAM cloud-resolving model by Marat Khairoutdinov on Yellowstone, and I'd like my run script to get the exit code from the current job and decide whether to resubmit or not. Here's what I have so far:
Code:
#!/bin/tcsh
#
# LSF batch script to run an MPI application
#
#BSUB -P P35081334
#BSUB -W 02:00 # wall-clock time (hrs:mins)
#BSUB -n 16 # number of tasks in job
#BSUB -R "span[ptile=16]" # run 16 MPI tasks per node
#BSUB -J BUBBLE_500_64x64_08km_1.0k_3g # job name
#BSUB -o BUBBLE_500_64x64_08km_1.0k_3g.out.%J # output file name in which %J is replaced by the job ID
#BSUB -e BUBBLE_500_64x64_08km_1.0k_3g.err.%J # error file name in which %J is replaced by the job ID
#BSUB -q regular # queue
set case = BUBBLE
set subcase = advsel
set jobfile = $case/resub.$subcase
set prmfile = $case/prm.$subcase
set prmloc = $case/prm
setenv LID "`date +%y%m%d-%H%M%S`"
#--------------------------------------------------------------
#run the executable
#--------------------------------------------------------------
mpirun.lsf ./SAM_ADV_MPDATA_RAD_CAM_MICRO_SAM1MOM_64x64_B08km >&! sam.log.$LID
echo
echo sam.log.$LID
echo
#--------------------------------------------------------------
# Resubmit the job if not finished
#--------------------------------------------------------------
set exitstatus = $?
echo SAM stopped with exit status $exitstatus
if [ $exitstatus -eq 0 ]
then

echo It appears the previous run ended properly and job not yet finished.
echo Resubmitting $jobfile
cat $prmfile | sed s/nrestart.*=.*0/nrestart = 1/ > temp.namelist
mv temp.namelist $prmfile
cp $prmfile $prmloc
bsub < sam_run
fi
#--------------------------------------------------------------
#--------------------------------------------------------------
The variable $exitstatus does not have the right value in the test runs that I've done so far. It has the value 0 when I know that the model actually exited with exit code 9.

So my question is: is there a different syntax for obtaining the exit code, rather than $?, which I think was meant for a different system? I don't know where to look any of this up for LSF.

Thanks,
Walter
 

jedwards

CSEG and Liaisons
Staff member
Rather than looking for a return code, we look in the model log to determine whether it has successfully completed.
Code:
cd $RUNDIR
set CESMLogFile = `ls -1t cesm.log* | head -1` 
if ("$CESMLogFile" == "") then
  echo "Model did not complete - no cesm.log file present - exiting"
  exit -1
endif
set CPLLogFile = `echo $CESMLogFile | sed -e 's/cesm/cpl/'`
if (! -e "$CPLLogFile") then
  echo "Model did not complete - no cpl.log file corresponding to most recent CESM log ($RUNDIR/$CESMLogFile)"
  exit -1
endif
grep 'SUCCESSFUL TERMINATION' $CPLLogFile  || echo "Model did not complete - see $RUNDIR/$CESMLogFile" && echo "run FAILED $sdate" >>& $CASEROOT/CaseStatus && exit -1

echo "run SUCCESSFUL $sdate" >>& $CASEROOT/CaseStatus
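The same log-based check can be sketched in Bourne shell for a resubmission decision. This is only an illustration under assumptions: the marker string is borrowed from the CESM coupler log above and may not match what SAM prints, and the function name run_succeeded and the throwaway log files are hypothetical.

```shell
#!/bin/sh
# Assumed success marker; replace with the line your model prints on a clean finish.
MARKER='SUCCESSFUL TERMINATION'

# Return 0 if the log file exists and contains the marker, non-zero otherwise.
run_succeeded() {
    [ -f "$1" ] && grep -q "$MARKER" "$1"
}

# Demo with throwaway logs (made-up contents, not real model output).
tmpdir=$(mktemp -d)
printf 'step 1\nSUCCESSFUL TERMINATION\n' > "$tmpdir/good.log"
printf 'step 1\nmodel aborted\n' > "$tmpdir/bad.log"

for log in "$tmpdir/good.log" "$tmpdir/bad.log" "$tmpdir/missing.log"; do
    if run_succeeded "$log"; then
        echo "$(basename "$log"): would resubmit"
    else
        echo "$(basename "$log"): would stop"
    fi
done
rm -rf "$tmpdir"
```

A missing log is treated the same as a failed run, so the job is not resubmitted unless the marker is actually found.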
 

santos

Member
I agree with Jim that checking the log may be a more certain way of doing this.

If you would rather check the return code, put "set exitstatus=$?" immediately after the mpirun.lsf command. "$?" is a special variable that always holds the return code from the very last command. The way that your script is written now, it will use the return code from the preceding "echo" command, which will pretty much always be 0.
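A small Bourne shell sketch of why the placement matters. The function run_model is a hypothetical stand-in for the mpirun.lsf line, and it simply returns 9 to mimic the failing model run. In tcsh the same value is also available as $status, and the same rule applies: it must be read on the line right after the command.

```shell
#!/bin/sh
# Keep going after a failing command so its status can be inspected.
set +e

# Hypothetical stand-in for the model run: always exits with status 9.
run_model() {
    return 9
}

# Pattern from the original script: an echo runs before the status is read.
run_model
echo "log written"
late_status=$?     # status of echo, not of run_model

# Corrected pattern: save the status on the very next line.
run_model
saved_status=$?
echo "log written"

echo "late capture:  $late_status"
echo "early capture: $saved_status"
```

Once the status is saved in its own variable, any number of echo or logging commands can follow without disturbing it.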
 
