errors occur in case.submit

xgao304@...

I was able to build the CESM case successfully. However, case.submit fails with the following error message:

Submitting job script sbatch  .case.run --resubmit

ERROR: Command: 'sbatch  .case.run --resubmit' failed with error 'sbatch: error: Invalid numeric value "/bin/tcsh" for core_spec.' from dir '/net/fs05/d1/xgao/cesm/cases/test2000'

Here is the output of preview_run:


CASE INFO:
>   nodes: 2
>   total tasks: 64
>   tasks per node: 32
>   thread count: 1
>
> BATCH INFO:
>   FOR JOB: case.run
>     ENV:
>       module command is /usr/bin/modulecmd python load pgi/17.3 netcdf/4 openmpi/1.10.7
>     SUBMIT CMD:
>       sbatch  .case.run --resubmit
>
>   FOR JOB: case.st_archive
>     ENV:
>       module command is /usr/bin/modulecmd python load pgi/17.3 netcdf/4 openmpi/1.10.7
>     SUBMIT CMD:
>       sbatch --dependency=afterok:0 case.st_archive --resubmit
>
> MPIRUN:
>   mpirun  -n 64 /net/fs05/d1/xgao/cesm/cases/test2000/bld/cesm.exe  >> cesm.log.$LID 2>&1


Here is my config_batch.xml file:


>   <!-- svante is SLURM -->
>   <batch_system MACH="svante" type="slurm">
>     <batch_submit>sbatch</batch_submit>
>     <directives>
>       <directive default="/bin/tcsh" > -S {{ shell }}  </directive>
>     </directives>
>     <queues>
>       <queue walltimemin="0" walltimemax="24:00:00" nodemin="0" nodemax="312" default="true">queue</queue>
>     </queues>
>   </batch_system>


Any information about how to solve the problem is appreciated.

jedwards

If you need the directive statement, the shell should be bash. You may not need the directive at all.
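
For context, in sbatch `-S` is the short form of `--core-spec`, which expects a number, so a `-S /bin/tcsh` directive in the job script produces the "Invalid numeric value" error quoted above. (In PBS, `-S` selects the shell, which is presumably where the directive came from.) A minimal sketch of the svante entry with the directive dropped, assuming nothing else on the machine depends on it:

```xml
<!-- svante is SLURM; sketch only, the -S directive from the original entry is removed -->
<batch_system MACH="svante" type="slurm">
  <batch_submit>sbatch</batch_submit>
  <queues>
    <!-- "queue" is the placeholder name from the original post, not a confirmed partition -->
    <queue walltimemin="0" walltimemax="24:00:00" nodemin="0" nodemax="312" default="true">queue</queue>
  </queues>
</batch_system>
```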

xgao304@...

I set the model to run for 12 months, but it only ran for three months and I got the following error message:


> -----------------------
> 2018-07-15 20:53:41 MODEL EXECUTION BEGINS HERE
> run command is mpirun  -n 64 /net/fs05/d1/xgao/cesm/cases/test2000/bld/cesm.exe  >> cesm.log.$LID 2>&1
> slurmstepd: error: *** JOB 74369 ON c062 CANCELLED AT 2018-07-15T22:54:00 DUE TO TIME LIMIT ***
> -----------------------

I am not sure whether this is related to the "queue" setting in my config_batch.xml, shown below. What confuses me is that our queue system does not require walltimemin, walltimemax, nodemin, or nodemax at all. We usually just set the run time as follows:
#SBATCH --time=1-12:00:00       # format is DAYS-HOURS:MINUTES:SECONDS

But I know the "queue" setting is required. In that case, how should I specify the "queue" line in config_batch.xml?

-----------------
>   <batch_system MACH="svante" type="slurm">
>     <batch_submit>sbatch</batch_submit>
>     <queues>
>       <queue walltimemin="0" walltimemax="24:00:00" nodemin="0" nodemax="312" default="true">queue</queue>
>     </queues>
>   </batch_system>

--------------

Thanks,

Xiang




chenyh1991@...

I have run into a similar problem.

When I execute ./case.submit, I get the following error:

Finished creating component namelists

Checking that inputdata is available as part of case submission

Loading input file list: 'Buildconf/rtm.input_data_list'

Loading input file list: 'Buildconf/cam.input_data_list'

Loading input file list: 'Buildconf/cice.input_data_list'

Loading input file list: 'Buildconf/pop.input_data_list'

Loading input file list: 'Buildconf/clm.input_data_list'

Loading input file list: 'Buildconf/cpl.input_data_list'

Check case OK

submit_jobs case.run

Submit job case.run

Submitting job script qsub  -v .case.run --resubmit

ERROR: Command: 'qsub  -v .case.run --resubmit' failed with error 'usage: qsub [-a date_time] [-A account_string] [-b secs]

      [-c [ none | { enabled | periodic | shutdown |

      depth=<int> | dir=<path> | interval=<minutes>}... ]

      [-C directive_prefix] [-d path] [-D path]

      [-e path] [-h] [-I] [-j oe] [-k {oe}] [-l resource_list] [-m n|{abe}]

      [-M user_list] [-N jobname] [-o path] [-p priority] [-P proxy_user] [-q queue] 

      [-r y|n] [-S path] [-t number_to_submit] [-T type]  [-u user_list] [-w] path

      [-W additional_attributes] [-v variable_list] [-V ] [-x] [-X] [-z] [script]' from dir 

When I run the case without submitting it through the PBS batch system, it runs properly. Here is my batch system configuration:

  <batch_system type="pbs" >
    <batch_query args="">qstat</batch_query>
    <batch_submit>qsub</batch_submit>
    <batch_cancel>qdel</batch_cancel>
    <batch_redirect>-v</batch_redirect>
    <batch_directive>#PBS</batch_directive>
    <depend_string> --dependency=afterok:jobid</depend_string>
    <depend_allow_string> --dependency=afterany:jobid</depend_allow_string>
    <depend_separator>:</depend_separator>
    <queues>
      <queue walltimemax="00:59:00" nodemin="1" nodemax="624" default="true">batch</queue>
    </queues>
  </batch_system>

However, I have seen online that the command qsub -v ARGS_FOR_SCRIPTS="--resubmit" .case.run executes successfully.

I don't know where the error is. Do you have any advice?

Any information about how to solve the problem is appreciated
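
Two things in the entry above look inconsistent with PBS and may be worth checking. First, the failing command shows `-v` immediately followed by `.case.run`, so qsub takes the script name as its variable list and is left with `--resubmit` as an unrecognized argument. Second, `--dependency=afterok:jobid` is Slurm syntax; PBS expects `-W depend=afterok:jobid`. The sketch below is one possible correction, not a confirmed fix. It assumes the CIME version in use supports a `batch_env` element (the piece that would produce the `-v ARGS_FOR_SCRIPTS=...` form quoted above); if it does not, this will not apply as written.

```xml
<batch_system type="pbs">
  <batch_query args="">qstat</batch_query>
  <batch_submit>qsub</batch_submit>
  <batch_cancel>qdel</batch_cancel>
  <!-- Assumption: batch_env, not batch_redirect, is the element that carries the -v flag -->
  <batch_env>-v</batch_env>
  <batch_directive>#PBS</batch_directive>
  <!-- PBS-style dependency syntax in place of the Slurm-style flag in the original entry -->
  <depend_string> -W depend=afterok:jobid</depend_string>
  <depend_allow_string> -W depend=afterany:jobid</depend_allow_string>
  <depend_separator>:</depend_separator>
  <queues>
    <queue walltimemax="00:59:00" nodemin="1" nodemax="624" default="true">batch</queue>
  </queues>
</batch_system>
```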

 

jedwards

I think you are missing this in config_batch.xml:

<submit_args> <arg flag="--time" name="$JOB_WALLCLOCK_TIME"/> <arg flag="-p" name="$JOB_QUEUE"/>   </submit_args>
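
One way the suggested `submit_args` could sit inside the svante entry is sketched below. The partition name `edr` and the 72-hour ceiling are assumptions taken from the sbatch line in the follow-up post, not settings confirmed for the machine.

```xml
<batch_system MACH="svante" type="slurm">
  <batch_submit>sbatch</batch_submit>
  <!-- Passed on the sbatch command line; values are read from the case XML variables -->
  <submit_args>
    <arg flag="--time" name="$JOB_WALLCLOCK_TIME"/>
    <arg flag="-p" name="$JOB_QUEUE"/>
  </submit_args>
  <queues>
    <!-- Assumed partition name and wall-time limit; adjust to the site's actual values -->
    <queue walltimemin="0" walltimemax="72:00:00" nodemin="0" nodemax="312" default="true">edr</queue>
  </queues>
</batch_system>
```

With an entry like this, the wall time requested from Slurm follows JOB_WALLCLOCK_TIME. The earlier job was cancelled about two hours after it started, which would be consistent with no time limit being passed and a site default applying instead, though that default is an assumption about svante.
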

xgao304@...

I have added the args you suggested and the model is running. I am not sure yet whether this solves the problem, but I do have some related questions:

1. When I submit the job, I always get a companion "dependency" job for st_archive, which seems to sit in the Slurm queue:

-----------------------

submit_jobs case.run
Submit job case.run
Submitting job script sbatch -t 72:00:00 -p edr .case.run --resubmit
Submitted job id is 74619
Submit job case.st_archive
Submitting job script sbatch -t 0:20:00 -p edr  --dependency=afterok:74619 case.st_archive --resubmit
Submitted job id is 74620
Submitted job case.run with id 74619
Submitted job case.st_archive with id 74620

------------

And the queue system will look something like this:

     74619       edr test2000       xgao  R 2018-07-16T15:59 0:43       2-23:59:17       2:64     c[092-093]

     74620       edr test2000       xgao PD N/A              0:00       20:00            1:32     (Dependency)

I know I can set DOUT_S=FALSE, but if I do want archiving, how can I avoid this "dependency" job?

2. In config_batch.xml, I saw that some machines set JOB_QUEUE with flag="-p" in submit_args, while others set it in directives as "<directive>--partition=lr3</directive>". What is the difference between an arg flag and a directive?

 

Thanks.

 

jedwards

If you have DOUT_S=TRUE, archiving is run as a separate dependent job. That's the way it works. You can set DOUT_S=FALSE and run case.st_archive by hand if you really want to.

Changes to submit_args can be made at any time, while changes to directives require rerunning case.setup and rebuilding before you can submit again. submit_args are usually preferred, but some options, such as the partition option you see there, can only be set as directives. This is machine dependent.
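
The two mechanisms can be sketched side by side, assuming a Slurm machine. The partition name `lr3` comes from the question above, the enclosing `batch_system` element is only there to make the fragment well-formed, and a real entry would normally use just one of the two options.

```xml
<batch_system type="slurm">
  <!-- Option 1: command-line argument, added to the sbatch command at submit time -->
  <submit_args>
    <arg flag="-p" name="$JOB_QUEUE"/>
  </submit_args>

  <!-- Option 2: directive, written into the generated job script after the batch directive prefix -->
  <directives>
    <directive>--partition=lr3</directive>
  </directives>
</batch_system>
```

Because option 2 ends up inside the generated script, changing it means regenerating that script, which matches the note above about rerunning case.setup.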
