
SGE Serial vs Parallel Job Queue Selection

douglowe

Douglas Lowe
New Member
We're trying to port CESM2 to our local HPC system, which is running SGE.

We have added an entry to the config_batch.xml file:
<batch_system type="sge" >
  <batch_query args="-j">qstat</batch_query>
  <batch_submit>qsub </batch_submit>
  <batch_cancel>qdel</batch_cancel>
  <batch_env>-v</batch_env>
  <batch_directive>#$ </batch_directive>
  <jobid_pattern>(\d+)</jobid_pattern>
  <depend_string> -hold_jid jobid</depend_string>
  <depend_separator> , </depend_separator>
  <walltime_format>%H:%M:%S</walltime_format>
  <batch_mail_flag>-M</batch_mail_flag>
  <batch_mail_type_flag>-m</batch_mail_type_flag>
  <batch_mail_type>, bea, b, e, a, n, bes</batch_mail_type>
  <submit_args>
    <arg flag="-q" name="$JOB_QUEUE"/>
    <arg flag="-P" name="$PROJECT"/>
    <arg flag="-l h_rt=" name="$JOB_WALLCLOCK_TIME"/>
  </submit_args>
  <directives>
    <directive> -N {{ job_id }}</directive>
    <directive> -V </directive>
    <directive> -pe smp.pe {{ tasks_per_node }} </directive>
  </directives>
</batch_system>

<batch_system MACH="csf3" type="sge">
  <queues>
    <queue walltimemax="01:00:00" nodemax="1" default="true">short</queue>
  </queues>
</batch_system>

Unfortunately, this fails when running 'case.submit' with the following error:
ERROR: Command: 'qsub -q short -l h_rt=0:20:00 -hold_jid 4512947 -v ARGS_FOR_SCRIPT='--resubmit' case.st_archive' failed with error
Unable to run job: Parallel job within smp.pe: number of slots must be at least 2.
Job has been rejected.

I think what we need to do is add a conditional to the '-pe smp.pe' directive, so that it is only used when the number of tasks is greater than 1 (as an aside, can someone tell me of a better task indicator to use than 'tasks_per_node', as this is going to fail us when we use more than one node?). But I can't work out how I could add such a conditional.

What can we do to fix this problem? And should this fix be in the config_batch.xml file, or another configuration file?
 

Yuan Sun

Yuan Sun
Member
Hi jedwards, Douglas and I are using CESM 2.1.3. We have already edited config_machines.xml and config_compilers.xml, and both './case.setup' and './case.build' run successfully. However, './case.submit' fails to submit a job to the SGE batch system.
 

MarkR_UoLeeds

Mark Richardson
New Member
Hello Yuan Sun, I have seen that you reached out to CEMAC. I can answer your question but have several meetings today. I will let you know as soon as I can.
Dr. Mark Richardson
Technical Head of CEMAC
 

MarkR_UoLeeds

Mark Richardson
New Member
Actually, I have a text file of notes. It might be out of date, but it is a good starting point. You have to make CIME "SGE aware". In
" (your system paths)/cesm_2.1.3/cime/config/cesm/machines/config_batch.xml "
you have to add two entries:

(1)
<batch_system type="sge" >
  <batch_query args="-j">qstat</batch_query>
  <batch_submit>qsub </batch_submit>
  <batch_cancel>qdel</batch_cancel>
  <batch_env>-v</batch_env>
  <batch_directive>#$ </batch_directive>
  <jobid_pattern>(\d+)</jobid_pattern>
  <depend_string> -hold_jid jobid</depend_string>
  <depend_separator> , </depend_separator>
  <walltime_format>%H:%M:%S</walltime_format>
  <batch_mail_flag>-M</batch_mail_flag>
  <batch_mail_type_flag>-m</batch_mail_type_flag>
  <batch_mail_type>, bea, b, e, a, n, bes</batch_mail_type>
  <submit_args>
    <arg flag="-q" name="$JOB_QUEUE"/>
    <arg flag="-P" name="$PROJECT"/>
    <arg flag="-l h_rt=" name="$JOB_WALLCLOCK_TIME"/>
    <arg flag="-ar" name="$RESERVATION"/>
  </submit_args>
  <directives>
    <directive> -N {{ job_id }}</directive>
    <directive> -V </directive>
    <directive> -l nodes={{ num_nodes }},ppn={{ tasks_per_node }},tpp={{ thread_count }}</directive>
  </directives>
</batch_system>


Then, for the specific machine (ours is arc4), replace XXXX and QQQQ with queue names.
(2)
<batch_system MACH="arc4" type="sge">
  <submit_args>
    <arg flag="-ar" name="$RESERVATION"/>
  </submit_args>
  <queues>
    <queue walltimemax="48:00:00" nodemin="1" nodemax="149" default="true" >40core-192G.q</queue>
    <queue walltimemax="48:00:00" nodemax="1">XXXX</queue>
    <queue walltimemax="48:00:00" nodemax="4">QQQQ</queue>
  </queues>
</batch_system>

There could be subtle differences depending on how your sysadmin installed SGE.
 

Yuan Sun

Yuan Sun
Member
Hi Mark,

1. When config_batch.xml is set with "<directive> -l nodes={{ num_nodes }},ppn={{ tasks_per_node }},tpp={{ thread_count }}</directive>":

ERROR: Command: 'qsub -q short -l h_rt=00:20:00 -v ARGS_FOR_SCRIPT='--resubmit' .case.run' failed with error 'Unable to run job: unknown resource "nodes"'

2. When config_batch.xml is set with "<directive> -pe smp.pe {{ tasks_per_node }} </directive>":

ERROR: Command: 'qsub -q short -l h_rt=0:20:00 -hold_jid 4512947 -v ARGS_FOR_SCRIPT='--resubmit' case.st_archive' failed with error
Unable to run job: Parallel job within smp.pe: number of slots must be at least 2.
Job has been rejected.


I guess UoM's HPC system (csf3) configures parallel jobs differently; see https://ri.itservices.manchester.ac.uk/csf3/batch/parallel-jobs/
However, I find it difficult to understand. Do you have any suggestions?

Thanks,
Yuan
 

jedwards

CSEG and Liaisons
Staff member
Are you checking the .case.run file for the directive output? I think that this "<directive> -pe smp.pe {{ tasks_per_node }} </directive>"
should be "<directive> -pe smp.pe {{ total_tasks }} </directive>"
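
To make the difference between the two placeholders concrete, here is a minimal Python sketch of how a `{{ name }}` directive template might be filled in. It is not CIME's actual implementation: the function `render_directive` and the layout of 2 nodes with 24 tasks each are assumptions made purely for illustration.

```python
import re


def render_directive(template: str, values: dict) -> str:
    """Replace each {{ name }} placeholder in template with values[name].

    Hypothetical helper for illustration; CIME's own rendering may differ.
    """
    def substitute(match):
        return str(values[match.group(1)])

    return re.sub(r"\{\{\s*(\w+)\s*\}\}", substitute, template).strip()


# Assumed layout: 2 nodes with 24 MPI tasks on each node.
values = {"num_nodes": 2, "tasks_per_node": 24, "total_tasks": 48}

per_node = render_directive(" -pe smp.pe {{ tasks_per_node }} ", values)
total = render_directive(" -pe smp.pe {{ total_tasks }} ", values)

print(per_node)  # slot count taken from one node only
print(total)     # slot count for the whole job
```

If a job requests only a single task, both placeholders render as 1, so a parallel environment that requires at least 2 slots would presumably reject the job with either directive.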
 

Yuan Sun

Yuan Sun
Member
Hi Mike,

Still the same error. I decided to give up using config_batch.xml and set the batch system to 'none'. Instead, we wrote a script to submit jobs.

>>touch myjobscript.sh
>>vim myjobscript.sh
>>chmod +x myjobscript.sh
>>qsub myjobscript.sh

It works. Anyway, thank you for offering suggestions.

Best,
Yuan
 

MarkR_UoLeeds

Mark Richardson
New Member
That is disappointing, but if it suits you then carry on. If you want to resume the config_batch.xml edits, let me know.
 

douglowe

Douglas Lowe
New Member
Hi jedwards & MarkR_UoLeeds - thank you for your comments & suggestions so far!

I think the problem we have is caused by our local SGE configuration, rather than it being a general SGE issue.

For serial jobs we must not specify a parallel environment or queue, i.e. the '-pe' argument must not be passed to qsub.

For parallel jobs we have to specify a parallel environment, i.e. '-pe smp.pe [#cores]' for a small parallel task, or '-pe mpi-24-ib.pe [#cores]' for a multinode parallel task.

For other workflow engines we've used an if statement to deal with this, based on how many cores a particular task requires (see, for example, toil/src/toil/batchSystems/gridengine.py at commit 28758cf380d2f4f3935436e7522b3a624f3a0ec7 in douglowe/toil). Would it be possible to do something similar for the CESM2 workflow system?
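
As a minimal sketch of the kind of if statement I mean (written as standalone Python, not as CIME or toil code), the logic would be something like the following. The function names and the `cores_per_node=24` threshold are assumptions for illustration; the threshold is only guessed from the name of the 'mpi-24-ib.pe' environment.

```python
from typing import List


def pe_arguments(cores: int, cores_per_node: int = 24) -> List[str]:
    """Return the qsub parallel-environment arguments for a task.

    cores_per_node is an assumed node size, not a documented value.
    """
    if cores < 1:
        raise ValueError("cores must be at least 1")
    if cores == 1:
        # Serial job: no '-pe' argument at all.
        return []
    if cores <= cores_per_node:
        # Small parallel task that fits on one node.
        return ["-pe", "smp.pe", str(cores)]
    # Multinode parallel task.
    return ["-pe", "mpi-24-ib.pe", str(cores)]


def build_qsub_command(script: str, cores: int, walltime: str) -> List[str]:
    """Assemble a qsub argument list; hypothetical helper, runs nothing."""
    return ["qsub", "-l", f"h_rt={walltime}", *pe_arguments(cores), script]


# A single-core archive job gets no '-pe'; a 48-core run gets the MPI one.
print(build_qsub_command("case.st_archive", 1, "00:20:00"))
print(build_qsub_command(".case.run", 48, "00:20:00"))
```

The point is only that the '-pe' part is omitted entirely when one core is requested, rather than being rendered with a slot count of 1.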
 