
errors occur in case.submit

xgao304

Member
I am able to build the CESM case successfully. However, for the case.submit, I got the following error message:

> Submitting job script sbatch  .case.run --resubmit
> ERROR: Command: 'sbatch  .case.run --resubmit' failed with error 'sbatch: error: Invalid numeric value "/bin/tcsh" for core_spec.' from dir '/net/fs05/d1/xgao/cesm/cases/test2000'

Here is the basic info when I run preview_run:

> CASE INFO:
>   nodes: 2
>   total tasks: 64
>   tasks per node: 32
>   thread count: 1
>
> BATCH INFO:
>   FOR JOB: case.run
>     ENV:
>       module command is /usr/bin/modulecmd python load pgi/17.3 netcdf/4 openmpi/1.10.7
>     SUBMIT CMD:
>       sbatch  .case.run --resubmit
>
>   FOR JOB: case.st_archive
>     ENV:
>       module command is /usr/bin/modulecmd python load pgi/17.3 netcdf/4 openmpi/1.10.7
>     SUBMIT CMD:
>       sbatch --dependency=afterok:0 case.st_archive --resubmit
>
> MPIRUN:
>   mpirun  -n 64 /net/fs05/d1/xgao/cesm/cases/test2000/bld/cesm.exe  >> cesm.log.$LID 2>&1
Here is my config_batch.xml file:

>  
>  
>     sbatch
>    
>       -S {{ shell }} 
>    
>    
>       queue
>    
>  
Any information about how to solve the problem is appreciated.
 

jedwards

CSEG and Liaisons
Staff member
If you need the directive statement, it should be bash. You may not need it at all.
 

xgao304

Member
I set the model to run for 12 months, but it only ran for three months and I got the following error message:
> -----------------------
> 2018-07-15 20:53:41 MODEL EXECUTION BEGINS HERE
> run command is mpirun  -n 64 /net/fs05/d1/xgao/cesm/cases/test2000/bld/cesm.exe  >> cesm.log.$LID 2>&1
> slurmstepd: error: *** JOB 74369 ON c062 CANCELLED AT 2018-07-15T22:54:00 DUE TO TIME LIMIT ***
> -----------------------
I am not sure whether this is related to the "queue" setting in my config_batch.xml, shown below. The confusion is that our queue system does not need walltimemin, walltimemax, nodemin, or nodemax to be set at all. We usually just need to set the run time as follows:
#SBATCH --time=1-12:00:00       # format is DAYS-HOURS:MINUTES:SECONDS
But I know the "queue" setting is required. In that case, how should I specify the "queue" line in config_batch.xml?
-----------------
>  
>     sbatch
>    
>       queue
>    
>   --------------
Thanks,
Xiang
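
For reference, the numbers in this post can be checked with a short Python sketch. It parses the DAYS-HOURS:MINUTES:SECONDS form quoted in the post and compares it with the two timestamps in the log. The helper name parse_slurm_time is hypothetical and handles only the two forms that appear in this thread; Slurm accepts other forms as well. The log shows the job was cancelled roughly two hours after model execution began, which suggests (but does not prove) that a wall-clock limit of about two hours was in effect rather than the intended 1-12:00:00.

```python
from datetime import datetime, timedelta


def parse_slurm_time(spec: str) -> timedelta:
    """Parse 'DAYS-HOURS:MINUTES:SECONDS' or 'HOURS:MINUTES:SECONDS'.

    Hypothetical helper for illustration; other Slurm time forms are not handled.
    """
    days = 0
    if "-" in spec:
        day_part, spec = spec.split("-", 1)
        days = int(day_part)
    hours, minutes, seconds = (int(part) for part in spec.split(":"))
    return timedelta(days=days, hours=hours, minutes=minutes, seconds=seconds)


# Wall-clock time the poster usually requests: 1 day 12 hours = 36 hours.
requested = parse_slurm_time("1-12:00:00")

# Timestamps copied from the log in the post.
started = datetime.fromisoformat("2018-07-15 20:53:41")
cancelled = datetime.fromisoformat("2018-07-15T22:54:00")
elapsed = cancelled - started

print(requested.total_seconds() / 3600)  # 36.0
print(elapsed)                           # 2:00:19
```

Model execution starts a little after the batch job itself, so the job's real elapsed time would have been slightly longer than the 2:00:19 computed here.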



 

xgao304

Member
I have added the args you suggested and the model is running. I am not sure yet whether it will solve the problem, but I do have some related questions:

1. When I submit the job, I always get a companion "dependency" job related to st_archive, which seems to sit in the Slurm queue:
-----------------------
submit_jobs case.run
Submit job case.run
Submitting job script sbatch -t 72:00:00 -p edr .case.run --resubmit
Submitted job id is 74619
Submit job case.st_archive
Submitting job script sbatch -t 0:20:00 -p edr  --dependency=afterok:74619 case.st_archive --resubmit
Submitted job id is 74620
Submitted job case.run with id 74619
Submitted job case.st_archive with id 74620
------------
And the queue looks something like this:
     74619       edr test2000       xgao  R 2018-07-16T15:59 0:43       2-23:59:17       2:64     c[092-093]
     74620       edr test2000       xgao PD N/A              0:00       20:00            1:32     (Dependency)
I know I can set "$DOUT_S = FALSE". But if I do want archiving, how can I avoid this "dependency" job?

2. For config_batch.xml, I saw that some machines set job_QUEUE with flag="-p" in submit_args, while some machines set it in directives as "--partition=lr3". What is the difference between an arg flag and a directive? Thanks.
 

jedwards

CSEG and Liaisons
Staff member
If you have DOUT_S=TRUE, archiving is run as a separate dependent job. That's the way it works. You can set DOUT_S=FALSE and run case.st_archive by hand if you really want to.

Changes to submit_args can be made at any time, while changes to directives call for rerunning case.setup and rebuilding before you can submit again. submit_args are usually preferred, but some options, such as the partition option you see there, can only be set as directives. This is machine dependent.
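
To illustrate the distinction between submit args and directives, here is a minimal Python sketch. It is not CIME's actual implementation, and the function names command_line and directive_lines are hypothetical. The idea it shows is that submit args end up on the scheduler command line each time the job is submitted, whereas directives are written into the job script as #SBATCH lines, which is presumably why changing them requires the script to be regenerated. The values are the ones quoted earlier in this thread.

```python
import shlex


def command_line(submit_cmd, args, script, script_args=()):
    """Assemble the submit command from (flag, value) pairs (hypothetical helper)."""
    parts = [submit_cmd]
    for flag, value in args:
        parts += [flag, value]
    parts.append(script)
    parts += list(script_args)
    return shlex.join(parts)


def directive_lines(directives, prefix="#SBATCH"):
    """Render directives as header lines of the job script (hypothetical helper)."""
    return [f"{prefix} {directive}" for directive in directives]


# Submit args: passed on the command line at submission time.
cmd = command_line(
    "sbatch",
    [("-t", "72:00:00"), ("-p", "edr")],
    ".case.run",
    ["--resubmit"],
)
print(cmd)  # sbatch -t 72:00:00 -p edr .case.run --resubmit

# Directives: stored inside the generated script itself.
header = directive_lines(["--partition=lr3"])
print(header)  # ['#SBATCH --partition=lr3']
```

With Slurm, an option given on the sbatch command line generally takes precedence over the same option given as an #SBATCH line in the script, so the two mechanisms can coexist.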
 
I ran into the same type of problem when I execute ./case.submit. I get the following error:

Finished creating component namelists
Checking that inputdata is available as part of case submission
Loading input file list: 'Buildconf/rtm.input_data_list'
Loading input file list: 'Buildconf/cam.input_data_list'
Loading input file list: 'Buildconf/cice.input_data_list'
Loading input file list: 'Buildconf/pop.input_data_list'
Loading input file list: 'Buildconf/clm.input_data_list'
Loading input file list: 'Buildconf/cpl.input_data_list'
Check case OK
submit_jobs case.run
Submit job case.run
Submitting job script qsub  -v .case.run --resubmit
ERROR: Command: 'qsub  -v .case.run --resubmit' failed with error 'usage: qsub [-a date_time] [-A account_string] [-b secs]      [-c [ none | { enabled | periodic | shutdown |      depth= | dir= | interval=}... ]      [-C directive_prefix] [-d path] [-D path]      [-e path] [-h] [-I] [-j oe] [-k {oe}] [-l resource_list] [-m n|{abe}]      [-M user_list] [-N jobname] [-o path] [-p priority] [-P proxy_user] [-q queue]       [-r y|n] [-S path] [-t number_to_submit] [-T type]  [-u user_list] [-w] path      [-W additional_attributes] [-v variable_list] [-V ] [-x] [-X] [-z] [script]' from dir 

When I do not use the PBS batch system to submit my case, it runs properly. Here is the batch system setting:

>     qstat
>     qsub
>     qdel
>     -v
>     #PBS
>     --dependency=afterok:jobid
>     --dependency=afterany:jobid
>     :
>     batch

But I have seen online that the command qsub -v ARGS_FOR_SCRIPTS="--resubmit" .case.run executes successfully.
I don't know where the error is. Do you have any advice? Any information about how to solve the problem is appreciated.
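
One plausible reading of the qsub failure, offered as an assumption rather than a confirmed diagnosis: the usage text lists -v as taking a variable_list, so in 'qsub -v .case.run --resubmit' the script name is consumed as the argument to -v and --resubmit is left over as an unrecognized option. The Python sketch below uses the standard getopt module as a stand-in for qsub's own option parsing (the real parser may differ in detail) and models only the -v option. The helper name parse is hypothetical.

```python
import getopt

# Only -v (which takes an argument) is modelled; real qsub has many more options.
SHORT_OPTS = "v:"


def parse(argv):
    """Return (options, positional_args), or the error message if parsing fails."""
    try:
        return getopt.getopt(argv, SHORT_OPTS)
    except getopt.GetoptError as err:
        return str(err)


# Form that failed in the post: -v swallows the script name.
failing = parse(["-v", ".case.run", "--resubmit"])

# Form the poster found online: --resubmit travels inside the variable list.
working = parse(["-v", "ARGS_FOR_SCRIPTS=--resubmit", ".case.run"])

print(failing)
print(working)
```

In the second form the script name stays a positional argument and the resubmit flag is passed through the ARGS_FOR_SCRIPTS variable, which matches the command the poster reports as working.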
 