What version of the code are you using?
CESM 2.1.5
Have you made any changes to files in the source tree?
no
Describe every step you took leading up to the problem:
export CASE=run06_00_00
export CIME_MODEL=cesm
export SRCROOT=/home/djw/Downloads/NCAR/CESM_2.1.5
export CIMEROOT=/home/djw/Downloads/NCAR/CESM_2.1.5/cime
export CASEROOT=/dssgfs01/working/djw/NCAR/CASES
export BLDRUN_DIR=/dssgfs01/scratch/djw/NCAR/CASES
export SHORT_ARCH=/dssgfs01/scratch/djw/NCAR/SHORT_ARCH
export CESMDATA=/dssgfs01/working/djw/NCAR/DATA_IN
export CLIMDATA=/dssgfs01/working/djw/NCAR/CLIM_IN
cd $CASEROOT ; rm -rf $CASE ; cd $BLDRUN_DIR ; rm -rf $CASE ; cd $SHORT_ARCH ; rm -rf $CASE
cd $CIMEROOT/scripts
./create_newcase --case $CASEROOT/$CASE --res f09_g17 --compset B1850 | tee 01_newcase.out
mv 01_newcase.out $CASEROOT/$CASE
cd $CASEROOT/$CASE
./xmlquery RUN_TYPE,RUN_REFCASE,RUN_REFDATE,RUN_STARTDATE,RUN_REFDIR,STOP_OPTION,STOP_N,REST_OPTION,REST_N,CONTINUE_RUN,RESUBMIT,RESUBMIT_SETS_CONTINUE_RUN,JOB_WALLCLOCK_TIME,JOB_QUEUE,USER_REQUESTED_WALLTIME,USER_REQUESTED_QUEUE,DOUT_S,BATCH_COMMAND_FLAGS | tee 02_xmlquery.out
./case.setup | tee 03_setup.out
./preview_run | tee 04_preview.out
./case.build --skip-provenance-check | tee 05_build.out
./case.submit | tee 06_submit.out
If this is a port to a new machine: Please attach any files you added or changed for the machine port (e.g., config_compilers.xml, config_machines.xml, and config_batch.xml) and tell us the compiler version you are using on this machine.
Please attach any log files showing error messages or other useful information.
See attached xml files
Describe your problem or question:
Some of my runs with the CESM model have been very slow: wall clock time per model day increased by a factor of 10. To track down the problem I want to use the sbatch 'exclude' option to see whether one or more of the nodes have problems.
I have tried to do this in two ways. First by adding an 'exclude' directive to the config_batch.xml file (included below). Secondly by removing the directive and using the xmlchange command:
./xmlchange BATCH_COMMAND_FLAGS="--exclude=compute[001-022] "
before running case.setup.
Without either 'exclude' option the program compiles and submits correctly. However if either version of the 'exclude' command is used, the submission ends with the line:
ERROR: Command: 'sbatch --time 24:00:00 -p compute .case.run --resubmit' failed with error 'sbatch: error: Batch job submission failed: Invalid node name specified' from dir '/dssgfs01/working/djw/NCAR/CASES/run06_00_00'
As a check that I have the format of the command correct, I have tried running the main program with the enclosed batch script. The script works and the job runs normally, but there is no transfer to the SHORT_ARCH directory at the end of the run, and if I supplied my own archive script the next submission would fail in the same way.
Unfortunately I cannot find (within the CIMEROOT, CASEROOT or BLDRUN_DIR directories) a copy of the batch file that CIME submitted. No job number was generated so I cannot get it from slurm. As a result there seems to be no way that I can find out what was wrong with the node name.
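For what it is worth, the error message above names '.case.run' as the file handed to sbatch, from the case directory. Assuming that hidden file is still present there, a minimal sketch for listing its '#SBATCH' lines follows. The function name show_sbatch_directives is hypothetical and is not part of CIME or Slurm.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not a CIME or Slurm tool): print the #SBATCH
# directive lines of a batch script, with their line numbers.
# The default file name .case.run is taken from the sbatch error message;
# whether that file is the exact script that was submitted is an assumption.
show_sbatch_directives() {
    local script=${1:-.case.run}
    grep -n '^#SBATCH' "$script"
}
```

It could be called as show_sbatch_directives "$CASEROOT/$CASE/.case.run" to see whether an 'exclude' line is present and how the node range is written in it.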
So my question is: have I written the exclude commands in a way that CIME cannot process correctly, or should I be using another method?
Note that when submitting my own batch file I found that the node numbers had to include three digits. Is it possible that CIME changes the above exclude commands to "--exclude=compute[1-22]"?
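To make the difference between the two spellings concrete, here is a small bash sketch that expands a range into individual node names while keeping the zero padding of the lower bound. The helper expand_nodes is hypothetical and only illustrates the naming; it says nothing about what CIME or Slurm actually do with the range.

```shell
#!/usr/bin/env bash
# Hypothetical helper: expand a node range such as compute[001-022]
# into one name per line, keeping the padding width of the first bound.
expand_nodes() {
    local prefix=$1 first=$2 last=$3
    local width=${#first}   # padding width comes from the first bound
    local i
    # 10# forces base 10 so that bounds like 008 or 022 are not read as octal
    for ((i = 10#$first; i <= 10#$last; i++)); do
        printf '%s%0*d\n' "$prefix" "$width" "$i"
    done
}

expand_nodes compute 001 003   # padded names: compute001 ... compute003
expand_nodes compute 1 3       # unpadded names: compute1 ... compute3
```

If the nodes are registered as compute001 and so on, the unpadded names from the second call would not match any of them, which would be consistent with an 'Invalid node name specified' error.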
Thanks for any help.
David.