CESM2 porting to Niagara on SciNet (Part of Digital Research Alliance of Canada), batch resubmission must occur from login node (config_batch.xml)

nstant

Noah Stanton
New Member
To whom it may concern,

I am porting CESM2 to the Niagara supercomputer on SciNet. I have managed to complete builds and submit single jobs (successfully), but the resubmission tool is not working.

I receive a "Jobs can only be submitted from the login node" SBATCH error (the scheduler is Slurm).

I have tried modifying the batch_submit in config_batch.xml as follows:

<batch_submit>ssh -t nia-login01 "cd $PROJECT/cesm2_1_3_OUT/$CASE; sbatch"</batch_submit>

This fails because the quotes need to go around the whole submission command, i.e.:

ssh nia-login01 "cd $PROJECT/cesm2_1_3_OUT/$CASE; sbatch --time 12:00:00 --mail-user nstant@my.yorku.ca --mail-type all .case.run --resubmit"

But with what I currently have, it does this:

ssh nia-login01 "cd $PROJECT/cesm2_1_3_OUT/$CASE; sbatch" --time 12:00:00 --mail-user nstant@my.yorku.ca --mail-type all .case.run --resubmit

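Note the position of the closing quote ("): it comes right after sbatch, so the remaining arguments fall outside the quoted command.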

Is there another workaround, either modifying the batch_submit command or changing config_batch.xml in some other way, that allows resubmission of jobs to occur on the login node (with the job itself computing on the compute nodes)? I am currently working with the sysadmin to find a workaround as well.

Thank you in advance!

Sincerely,

Noah Stanton
 

fischer

CSEG and Liaisons
Staff member
Hi Noah,

Here's an example of what we had to do for stampede.

<batch_system MACH="stampede2-skx" type="slurm" >
<batch_submit>ssh stampede2.tacc.utexas.edu cd $CASEROOT ; sbatch</batch_submit>
<submit_args>
<arg flag="--time" name="$JOB_WALLCLOCK_TIME"/>
<arg flag="-p" name="$JOB_QUEUE"/>
<arg flag="--account" name="$PROJECT"/>
</submit_args>
<queues>
<queue walltimemax="48:00:00" nodemin="1" nodemax="256" default="true">skx-normal</queue>
<queue walltimemax="02:00:00" nodemin="1" nodemax="4" >skx-dev</queue>
</queues>
</batch_system>


Chris
 

nstant

Noah Stanton
New Member
Hello Chris,

Thank you for the quick response. I tried something similar, but not with that exact syntax, so I will try to emulate your example.

We (the digital alliance team and I) are trying another workaround in the meantime. I will report back on what works and doesn't work.

Noah S
 

nstant

Noah Stanton
New Member
Hello Chris,

I modified config_batch.xml to have:

<batch_submit>ssh niagara.computecanada.ca cd $CASEROOT ; sbatch</batch_submit>

This then runs, but I believe it fails at the resubmission step (at least I can now see a record of this in the log files, which I could not before).

So log file says this:

ERROR: RUN FAIL: Command 'mpirun -np 240 /project/n/ntandon/nstant/cesm2_1_3_OUT/BWma.f19_g17.test24/bld/cesm.exe >> cesm.log.$LID 2>&1 ' failed
See log file for details: /scratch/n/ntandon/nstant/cesm2_1_3_RUN/BWma.f19_g17.test24/run/cesm.log.9703935.230627-165841

Then I check the log file and these are the 2 (or 3) errors I get (note any ## represent digits):

[nia####.scinet.local:######] pml_ucx.c:208 Error: Failed to create UCP worker

[nia####.scinet.local:######] Error: coll_hcoll_module.c:311 - mca_coll_hcoll_comm_query() Hcol library init failed

And this was at the top of the log:

WARNING: No preset parameters were found for the device that Open MPI
detected:

Local host: nia1951
Device name: mlx5_0
Device vendor ID: 0x02c9
Device vendor part ID: 4123

Default device parameters will be used, which may result in lower
performance. You can edit any of the files specified by the
btl_openib_device_param_files MCA parameter to set values for your
device.

NOTE: You can turn off this warning by setting the MCA parameter
btl_openib_warn_no_device_params_found to 0.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
WARNING: There was an error initializing an OpenFabrics device.

Local host: nia1823
Local device: mlx5_0

So it appears this has to do with OpenMPI, maybe I am not defining the libraries correctly. I might try compiling with intelmpi instead and see if that helps.
Any suggestions? Thank you again for all of the help. I suspect it may be something with the LD_LIBRARY_PATH.

Noah S
 

nstant

Noah Stanton
New Member
Hey Chris,

We are running into this issue still actually, ignore the reply above this one.
ERROR: Command: 'ssh niagara.computecanada.ca cd $CASEROOT ; sbatch --time 03:00:00 .case.run --resubmit' failed with error 'SBATCH ERROR:
Job submission must be done from a login node
SBATCH: 1 error was found.

So I believe that this is because the command should actually be sent as:

ssh niagara.computecanada.ca "cd $CASEROOT ; sbatch --time 03:00:00 .case.run --resubmit"

not

ssh niagara.computecanada.ca cd $CASEROOT ; sbatch --time 03:00:00 .case.run --resubmit
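The difference between these two forms can be reproduced without a cluster. The following is only a sketch: `fake_ssh` and `fake_sbatch` are hypothetical stand-ins that print where a command would run, `login-node` is a placeholder host name, and `/path/to/case` is a placeholder for $CASEROOT. It assumes the submit command string is interpreted by a local shell before ssh sees it.

```shell
#!/bin/sh
# Hypothetical stand-ins: they only report where a command would run.
fake_ssh() {
    host=$1
    shift
    echo "remote($host): $*"
}
fake_sbatch() {
    echo "local: sbatch $*"
}

CASEROOT=/path/to/case   # placeholder value for illustration

# Unquoted: the local shell ends the ssh command at ';',
# so sbatch runs on the machine that issued the command.
unquoted=$(fake_ssh login-node cd $CASEROOT ; fake_sbatch --time 03:00:00 .case.run --resubmit)

# Quoted: the whole string, including ';', is handed to ssh as one argument.
quoted=$(fake_ssh login-node "cd $CASEROOT ; sbatch --time 03:00:00 .case.run --resubmit")

printf '%s\n' "$unquoted"
printf '%s\n' "$quoted"
```

In the unquoted case only `cd /path/to/case` is reported as remote and the sbatch line is reported as local, which is consistent with the "Job submission must be done from a login node" error when the command is issued from a compute node.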

It is easy to get the quotes around "cd $CASEROOT ; sbatch", but obviously this is incomplete. Is there a way to get the closing quote to occur after --resubmit?

This is called during case.submit and any arg flag I add always ends up before the --resubmit.

Thanks for your help.

Noah S
 

nstant

Noah Stanton
New Member
I just want to update this thread, as I have been able to successfully port the model with the help of the Digital Research Alliance of Canada and Dr. Kushner's group from the University of Toronto.

Rather than using ssh to reach the login node, the best option on Niagara is to leave <batch_submit>sbatch</batch_submit> as is and just add the flag "--resubmit-immediate" to ./case.submit. This avoids any need to contact the login nodes for resubmission, but obviously puts many jobs in the queue.
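As a rough sketch of that workflow, run from the case directory on a login node: the wrapper name `submit_chain` and the `DRY_RUN` switch are hypothetical conveniences added here, and the resubmission count of 3 in the usage line is arbitrary. `./xmlchange RESUBMIT=...` and `./case.submit --resubmit-immediate` are the standard case scripts; my understanding is that the flag queues all resubmissions up front and relies on the scheduler to handle the job dependencies.

```shell
#!/bin/sh
# Hypothetical wrapper: set the number of resubmissions, then queue the
# whole chain at once so no running job needs to call sbatch itself.
submit_chain() {
    n_resubmit=$1
    if [ "${DRY_RUN:-0}" = "1" ]; then
        # Only show the commands that would be run.
        echo "./xmlchange RESUBMIT=$n_resubmit"
        echo "./case.submit --resubmit-immediate"
    else
        ./xmlchange RESUBMIT="$n_resubmit" && ./case.submit --resubmit-immediate
    fi
}
```

For example, `DRY_RUN=1 submit_chain 3` prints the two commands without running them, and `submit_chain 3` executes them in the current case directory.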

Once I have run the ensemble tests and cleaned up the config files, I will post the Niagara-specific files in this thread.

Thank you for your help Chris!

Noah S
 