
How to add a batch scheduler (SGE)

MarkR_UoLeeds

Mark Richardson
New Member
Referring to CESM 2.1.3 and 2.2.
Our HPC system at the University of Leeds uses SGE (Son of Grid Engine), but as this is now a little-used scheduler there is no entry for it in config_batch.xml.
I am adding an entry for myself (in my ~/.cime directory) so that I can test what happens when there is a batch system, rather than using "none" and forcing --run-unsupported to create a case.
I tried to fake it by saying PBS was available, but of course the clever system tells me there is no PBS and stops.
I am looking at the other batch entries and trying to figure out what I need for SGE (and unfortunately I am not an expert SGE user either).
Can anyone give me advice on the most important settings (and maybe a minimum) needed to create job scripts automatically?
 

jedwards

CSEG and Liaisons
Staff member
If you want to open a PR to cime with at least your config_machines.xml and config_batch.xml entries, I can review it and perhaps make more informed comments. It looks like SGE is very close to PBS, so this should be straightforward. It looks like the batch_directive symbol is '#$' and the query, submit, and cancel commands are the same as for PBS.
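For orientation, a minimal SGE block in config_batch.xml might look something like the sketch below. This is an untested guess modelled on the existing PBS entry; the element names are the ones CIME already uses, but the jobid_pattern and the directives are assumptions that would need checking against real qsub output on your system.

<batch_system type="sge">
  <batch_query args="-u $USER">qstat</batch_query>
  <batch_submit>qsub</batch_submit>
  <batch_cancel>qdel</batch_cancel>
  <batch_directive>#$</batch_directive>
  <!-- qsub usually prints "Your job NNNN ... has been submitted" -->
  <jobid_pattern>Your job (\d+)</jobid_pattern>
  <directives>
    <directive>-N {{ job_id }}</directive>
    <directive>-V</directive>
  </directives>
</batch_system>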
 

MarkR_UoLeeds

Mark Richardson
New Member
Hello Jedwards,
I think you mean I ought to create a branch (somewhere, ESCOMP GitHub or ESMCI/cime?) and then point you to that branch?
Sorry if that seems fundamental, but so far I have only been following the getting-started notes, so I did the my_cesm_sandbox thing.
Yes, some SGE CLI commands overlap with PBS (qstat, qdel, qsub), but the directives have a different form (as you surmised, '#$' sentinels).
I can post a "desired" job script here and the "sge" definition, but it seems you prefer to inspect it through GitHub.
The local HPC-specific config_batch.xml and config_machines.xml are in my ~/.cime, so they are outside version control at the moment.
This is all we have to go on for SGE: Batch jobs — ARC Documentation. However, I have more advanced job scripts from previous non-CESM work.
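A rough sketch of the kind of "desired" SGE job script I mean, in the ARC style (the job name, resource values, and payload here are only placeholders, not what CIME would actually generate):

#!/bin/bash
#$ -cwd                # run from the current working directory
#$ -V                  # export the submission environment
#$ -N cesm_case        # placeholder job name
#$ -l h_rt=06:00:00    # wallclock limit
#$ -pe ib 40           # parallel environment: 40 cores
mpirun -np 40 ./cesm.exe   # placeholder payload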
I was able to create a case with the batch scheduler set to "none", but when I tried adding an SGE entry the newcase process failed, as somewhere down the line it rejects sge as an option.

Thanks for any guidance.
 

jedwards

CSEG and Liaisons
Staff member
You can create a git fork of esmci/cime by selecting the fork button in the top right of the github page. Then you would create a branch in your fork:
cd cesm/cime
git remote add myfork path/to/myfork
git checkout maint-5.6 (for cesm2.1.x) or master (for cesm2.2.x)
git checkout -b myforkname
[ merge the files in .cime into cime/config/cesm/machines ]
git commit
git push myfork myforkname

Then open a PR to esmci/cime; this gives a nice interface to compare and discuss your changes.
 

MarkR_UoLeeds

Mark Richardson
New Member
Okay, so now I am able to build executables, but the SGE work needs more guidance, e.g. I have rarely used job dependencies with SGE so I am unfamiliar with their use. I have these errors from preview_run:

$ ./preview_run
CASE INFO:
nodes: 6
total tasks: 240
tasks per node: 40
thread count: 1

BATCH INFO:
ERROR: depend_string is missing jobid for prerequisite jobs

and this is what is in the config_batch.xml

config_batch.xml: <depend_string> -hold_jid {{ wc_job_list }} </depend_string>
config_batch.xml: <depend_separator> , </depend_separator>

I am not sure how to form "wc_job_list". In SGE jargon "wc_" means wildcard, so I presume it is a comma-separated list of strings. I am also expecting a problem with the job_ok and job_nok concept for allowing a subsequent job to start. Sorry.
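Looking at the existing PBS and Slurm entries, they appear to use a literal jobid token that CIME substitutes with the prerequisite job's ID, which seems to be what the error message is complaining about. So perhaps the SGE version needs to be written like this instead (untested):

<depend_string>-hold_jid jobid</depend_string>
<depend_separator>,</depend_separator>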
 

MarkR_UoLeeds

Mark Richardson
New Member
I made some progress, but I need more information about how to put all the SGE options in a reference XML place so that the tools that process them can formulate a correct job script.
I tried scripts_regression_tests.py but do not know how to interpret (or even find) the output. I notice several failures whizz past in the console, but I am not sure where to look for the logged output.

Typically a user will set the options in the job submission file using the sentinel #$, and that seems to work for some of the options.
Also, our system allows people to use "any available" cores with a setting like "#$ -pe ib 40", for example. Other people want dedicated nodes (in particular when using OpenMP) and would select the number of nodes: "#$ -l nodes=2". I have managed that in config_batch.xml with "#$ -l nodes={{ num_nodes }}".
However, I do not know how to let users choose freely between those two options. Perhaps I restrict CESM2 to run only on dedicated nodes, but the queue time might then be too long.
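One option I am considering (untested, and assuming {{ num_nodes }} and {{ total_tasks }} behave as they do in the other machine entries) is to keep only the dedicated-node form in config_batch.xml, with the shared-core form "-pe ib {{ total_tasks }}" as a possible alternative:

<directives>
  <directive>-l nodes={{ num_nodes }}</directive>
  <directive>-V</directive>
</directives>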

Back to my original question: how can I interpret scripts_regression_tests.py output and is there a specific test for the batch XML setup?
 

jedwards

CSEG and Liaisons
Staff member
Experience shows that sharing a node when running CESM is usually a bad idea; there are exceptions, such as mpi-serial or single-task runs.
To test that you are getting the correct settings in a script you can create a test like:
cd cime/scripts
./create_test SMS.f19_g16.X --no-run
cd testdir
./preview_run # this shows the mpiexec and submit commands
# examine the file .case.test; this shows the sentinel settings

# Refine the settings in config_batch.xml
./case.setup --reset
#repeat the above sequence

If you are getting a lot of errors from scripts_regression_tests.py, save the console output to a file using a redirect or tee.
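Something like the following should capture everything, assuming the script is in its usual place under the cime checkout:

cd cime/scripts/tests
./scripts_regression_tests.py 2>&1 | tee srt.log   # both stdout and stderr end up in srt.log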
 

MarkR_UoLeeds

Mark Richardson
New Member
Well, the header seems okay. I will change back to nodes when I am finished testing, due to excessive queue time - the system is a free-for-all with no special debug/dev queue.

#!/usr/bin/env python
#$ -N test.SMS.f19_g16.X.arc4_intel.20210707_162141_ao2xhk
#$ -V
#$ -pe ib 40
#$ -l node_type=40core-192G

What I am missing is the resubmission info. As SGE is an ageing batch scheduler, I cannot find anyone locally skilled enough to advise. preview_run gives:
SUBMIT CMD:
qsub -q 40core-192G.q -l h_rt=24:00:00 -v ARGS_FOR_SCRIPT='' .case.test

I do not know what ARGS_FOR_SCRIPT does, or whether it is correct for it to be an empty string.
Also, h_rt has set itself to the queue maximum rather than a short 20-minute job. I would rather not have to xmlchange JOB_WALLCLOCK_TIME for every case, although I see I can do that, and I could change the MPI size too.
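For a one-off short test I assume something like this in the case directory would do it (the 20-minute value is just an example):

./xmlchange JOB_WALLCLOCK_TIME=00:20:00    # per-case override of the wallclock limit
./preview_run                              # check that the submit command picked it up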

Thanks for the help so far.
 

jedwards

CSEG and Liaisons
Staff member
Typically you want the default wallclock to be the maximum for the queue because that's how production runs are done.
If you use ./create_test you can change the wallclock time for the entire suite of tests with the --wallclock option.
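For example (the 20-minute value is only illustrative):

./create_test SMS.f19_g16.X --no-run --wallclock 00:20:00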

What do you mean by resubmission info? Can you run the qsub command from the compute nodes? If so then it should work fine.

You can tune the number of MPI tasks in config_pes.xml for the case you are running on your machine; without knowing your system, the best we can do is guess. The PFS test is designed for performance-tuning a compset, so you might try it.
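If you would rather experiment in an individual case than edit config_pes.xml, something like this should also work (the task count is only a guess for illustration):

./xmlchange NTASKS=80    # sets NTASKS for all components
./case.setup --reset     # re-run setup after changing the task count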
 