
Adding an --exclude directive to slurm submissions

djw

David Webb
New Member
What version of the code are you using?
CESM 2.1.5


Have you made any changes to files in the source tree?
no


Describe every step you took leading up to the problem:
export CASE=run06_00_00
export CIME_MODEL=cesm
export SRCROOT=/home/djw/Downloads/NCAR/CESM_2.1.5
export CIMEROOT=/home/djw/Downloads/NCAR/CESM_2.1.5/cime
export CASEROOT=/dssgfs01/working/djw/NCAR/CASES
export BLDRUN_DIR=/dssgfs01/scratch/djw/NCAR/CASES
export SHORT_ARCH=/dssgfs01/scratch/djw/NCAR/SHORT_ARCH
export CESMDATA=/dssgfs01/working/djw/NCAR/DATA_IN
export CLIMDATA=/dssgfs01/working/djw/NCAR/CLIM_IN

cd $CASEROOT ; rm -rf $CASE ; cd $BLDRUN_DIR ; rm -rf $CASE ; cd $SHORT_ARCH ; rm -rf $CASE

cd $CIMEROOT/scripts
./create_newcase --case $CASEROOT/$CASE --res f09_g17 --compset B1850 | tee 01_newcase.out
mv 01_newcase.out $CASEROOT/$CASE

cd $CASEROOT/$CASE
./xmlquery RUN_TYPE,RUN_REFCASE,RUN_REFDATE,RUN_STARTDATE,RUN_REFDIR,STOP_OPTION,STOP_N,REST_OPTION,REST_N,CONTINUE_RUN,RESUBMIT,RESUBMIT_SETS_CONTINUE_RUN,JOB_WALLCLOCK_TIME,JOB_QUEUE,USER_REQUESTED_WALLTIME,USER_REQUESTED_QUEUE,DOUT_S,BATCH_COMMAND_FLAGS | tee 02_xmlquery.out

./case.setup | tee 03_setup.out
./preview_run | tee 04_preview.out
./case.build --skip-provenance-check | tee 05_build.out
./case.submit | tee 06_submit.out

If this is a port to a new machine: Please attach any files you added or changed for the machine port (e.g., config_compilers.xml, config_machines.xml, and config_batch.xml) and tell us the compiler version you are using on this machine.
Please attach any log files showing error messages or other useful information.


See attached xml files

Describe your problem or question:

Some of my runs with the CESM model have been very slow: wall clock time per simulated day increased by a factor of 10. In trying to track down the problem I want to use the sbatch 'exclude' option to see if one or more of the nodes have problems.

I have tried to do this in two ways. First by adding an 'exclude' directive to the config_batch.xml file (included below). Secondly by removing the directive and using the xmlchange command:
./xmlchange BATCH_COMMAND_FLAGS="--exclude=compute[001-022] "
before running case.setup.

Without either 'exclude' option the program compiles and submits correctly. However if either version of the 'exclude' command is used, the submission ends with the line:
ERROR: Command: 'sbatch --time 24:00:00 -p compute .case.run --resubmit' failed with error 'sbatch: error: Batch job submission failed: Invalid node name specified' from dir '/dssgfs01/working/djw/NCAR/CASES/run06_00_00'

As a check that I have the format of the command correct, I have tried running the main program with the enclosed batch script. The script works and the job runs normally, but there is no transfer to the SHORT_ARCH directory at the end of the run, and if I supplied my own archive script, the next submission would fail in the same way.

Unfortunately I cannot find (within the CIMEROOT, CASEROOT or BLDRUN_DIR directories) a copy of the batch file that CIME submitted. No job number was generated so I cannot get it from slurm. As a result there seems to be no way that I can find out what was wrong with the node name.

So my question is: have I written the exclude commands in ways that CIME cannot process correctly, or should I be using another method?

Note that when submitting my own batch file I found that the node numbers had to include three digits. Is it possible that CIME changes the above exclude commands to "--exclude=compute[1-22]"?
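To illustrate why the padding matters, here is a minimal sketch of how a bracketed range expands into node names. It assumes the nodes are named compute001, compute002 and so on, as described above; expand_nodes is a hypothetical helper written for this post and is not part of Slurm or CIME.

```shell
# expand_nodes: hypothetical helper, not a Slurm or CIME command.
# $1 = name prefix, $2 = lower bound, $3 = upper bound, written as in the range
expand_nodes() {
    awk -v p="$1" -v a="$2" -v b="$3" -v w="${#2}" 'BEGIN {
        fmt = "%s%0" w "d\n"    # pad every number to the width of the lower bound
        for (i = a + 0; i <= b + 0; i++) printf fmt, p, i
    }'
}

expand_nodes compute 001 003   # prints compute001, compute002, compute003
expand_nodes compute 01 03     # prints compute01, compute02, compute03
```

The second call produces names with two digits, which would not match any node on a cluster whose hosts are all named with three digits. That would be consistent with the "Invalid node name specified" error.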

Thanks for any help.

David.
 

Attachments

  • config_batch.xml.txt
    3.2 KB · Views: 1
  • config_compilers.xml.txt
    3 KB · Views: 0
  • config_machines.xml.txt
    5.6 KB · Views: 0
  • 06_submit.out.txt
    5.6 KB · Views: 1
  • 06_batch_run06.bat.txt
    602 bytes · Views: 1

jedwards

CSEG and Liaisons
Staff member
The batch file that cime submits is .case.run in your run directory. You can also
use the ./preview_run command to examine the submission arguments.
 

djw

David Webb
New Member
Thanks for getting back, but there is no .case.run file in the run directory or in the associated bld directory (under $BLDRUN_DIR/$CASE/). There is one in the $CASEROOT/$CASE directory, alongside case.submit etc., but it is a Python script that uses a function 'case.case_run', defined somewhere else in CIME, to submit the job. I've checked my successful cases and found the same. [I think I found and tried to decode 'case.case_run' a few weeks ago but, not being a Python expert, eventually gave up.]

The ./preview_run command output includes the following for the main job:
SUBMIT CMD:
sbatch --time 24:00:00 -p compute .case.run --resubmit

MPIRUN (job=case.run):
mpiexec.hydra -np 450 -prepend-rank /dssgfs01/scratch/djw/NCAR/CASES/run06_00_00/bld/cesm.exe >> cesm.log.$LID 2>&1
These are similar to the commands I used in the successful batch job, except that in this case sbatch is handed a Python script, which is presumably then run internally by sbatch to generate the full set of instructions.

I expect the Python script generates a #SBATCH line with the exclude option similar to that included in my own batch script. It is this line that probably causes the problem.

To see if I could get the output from the script, I tried running ./.case.run from the command line. It initially produced output similar to some of that generated by ./case.submit, but then hung soon after printing "MODEL EXECUTION BEGINS HERE". I was worried it was trying to run the model on the login node, so after a minute I killed it.

Maybe I'll just have to learn some more python. D.
 

djw

David Webb
New Member
I see my ideas in the first paragraph are a bit wrong. .case.run is not submitting the batch job, and neither is it a normal text file containing instructions to Slurm. Slurm itself has to process the script to generate a useful set of commands.
 

jedwards

CSEG and Liaisons
Staff member
The file .case.run in the case directory contains the slurm code that you are looking for.
You can also try moving your --exclude code from the directives section to the arguments section of config_batch.xml
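
A minimal sketch of one way to look at just the scheduler directives in that file. show_sbatch is a hypothetical helper name, and the default path assumes you are already in the case directory ($CASEROOT/$CASE in the original post).

```shell
# show_sbatch: hypothetical helper that prints only the #SBATCH lines
# $1 = script to inspect (defaults to .case.run in the current directory)
show_sbatch() {
    grep '^#SBATCH' "${1:-.case.run}"
}

# Typical use, from the case directory:
#   cd "$CASEROOT/$CASE" && show_sbatch
```

Any --exclude directive that CIME wrote from config_batch.xml should appear in this output exactly as Slurm will read it.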
 

djw

David Webb
New Member
Thanks for your note - which started me looking at the files again and helped solve the problem.

For the information of anyone else who reaches this point, I have uploaded a copy of the file ".case.run" as "06.case.run.txt". You will see that it is a Python script, calling other Python scripts from a CIME library to do the heavy lifting, but it does start with a set of Slurm "#SBATCH" lines.

My problem arises from the line:
#SBATCH --exclude=compute[01-22]
which should read
#SBATCH --exclude=compute[001-022]
Tracing back, this is due to the same error in the file "config_batch.xml", a copy of which was uploaded in message #1. I should have spotted the error earlier.
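
For anyone who wants to catch this kind of slip before submitting, here is a rough sketch that compares the number of digits in the --exclude range with the number of digits in one known node name. check_exclude_padding is a hypothetical helper, it assumes a single directive of the form "#SBATCH --exclude=prefix[NN-MM]", and the node name has to be supplied by hand (for example from the output of sinfo -N).

```shell
# check_exclude_padding: hypothetical helper; it only compares digit counts.
# $1 = script containing "#SBATCH --exclude=prefix[NN-MM]"
# $2 = one real node name, for example compute001
check_exclude_padding() {
    # Lower bound of the range, exactly as written in the directive
    range_digits=$(sed -n 's/^#SBATCH --exclude=[^[]*\[\([0-9]*\)-[0-9]*].*/\1/p' "$1")
    # Numeric suffix of the node name
    node_digits=$(printf '%s\n' "$2" | sed 's/^[^0-9]*//')
    if [ "${#range_digits}" -eq "${#node_digits}" ]; then
        echo "padding matches"
    else
        echo "padding mismatch: range uses ${#range_digits} digits, node name uses ${#node_digits}"
        return 1
    fi
}

# Typical use, from the case directory:
#   check_exclude_padding .case.run compute001
```

With the faulty directive shown above, this would report two digits in the range against three in the node name.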

Anyway many thanks for your help. D.
 

Attachments

  • 06.case.run.txt
    2.8 KB · Views: 0