
CPU count per node can not be satisfied when executing ./case.submit

Mikasa

sky
Member
After I execute the submit command ./case.submit for the B1850 case, the following error occurs:
Creating component namelists
Finished creating component namelists
Check case OK
submit_jobs case.run
Submit job case.run
Submitting job script sbatch -t 01:00:00 -p cpu .case.run --resubmit
ERROR: Command: 'sbatch -t 01:00:00 -p cpu .case.run --resubmit' failed with error 'sbatch: error: CPU count per node can not be satisfied
sbatch: error: Batch job submission failed: Requested node configuration is not available' from dir '/cases/case04'
The HPC has 40 cores per node and the batch system is SLURM, so I set
MAX_TASKS_PER_NODE=40

MAX_MPITASKS_PER_NODE=40
in ~/.cime/config_machines.xml. I have also tried setting MAX_TASKS_PER_NODE and MAX_MPITASKS_PER_NODE to 32 or 20, but the error still appears.
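For context, the relevant entries in my ~/.cime/config_machines.xml look roughly like this (an illustrative fragment only; the machine name "pi" matches my config_batch.xml below, and a real <machine> entry contains many more elements):

```xml
<!-- Illustrative fragment; not the complete machine definition -->
<machine MACH="pi">
  <MAX_TASKS_PER_NODE>40</MAX_TASKS_PER_NODE>
  <MAX_MPITASKS_PER_NODE>40</MAX_MPITASKS_PER_NODE>
</machine>
```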

The output of ./preview_run shows:
CASE INFO:
nodes: 6
total tasks: 240
tasks per node: 40
thread count: 1

BATCH INFO:
FOR JOB: case.run
ENV:
Setting Environment OMP_STACKSIZE=256M
Setting Environment NETCDF_C_PATH=/lustre/opt/cascadelake/linux-centos7-cascadelake/intel-19.0.4/netcdf-c-4.7.3-45qqhnbk676cblst47yl3mltzbfgdu4j
Setting Environment NETCDF_FORTRAN_PATH=/lustre/opt/cascadelake/linux-centos7-cascadelake/intel-19.0.4/netcdf-fortran-4.5.2-qf7eyugntntno4oxsc4gmm52rhy7p3dt
Setting Environment HDF5_PATH=/lustre/opt/cascadelake/linux-centos7-cascadelake/intel-19.0.4/hdf5-1.10.6-rbc5csp26yx6ybkjgjkpqe3ni4lm3nro
Setting Environment OMP_NUM_THREADS=1

SUBMIT CMD:
sbatch -t 01:00:00 -p cpu .case.run --resubmit

MPIRUN (job=case.run):
srun -n 240 /lustre/home/acct-ioomj/ioomj-stu3/cesm/scratch/case04/bld/cesm.exe >> cesm.log.$LID 2>&1

FOR JOB: case.st_archive
ENV:
Setting Environment OMP_STACKSIZE=256M
Setting Environment NETCDF_C_PATH=/lustre/opt/cascadelake/linux-centos7-cascadelake/intel-19.0.4/netcdf-c-4.7.3-45qqhnbk676cblst47yl3mltzbfgdu4j
Setting Environment NETCDF_FORTRAN_PATH=/lustre/opt/cascadelake/linux-centos7-cascadelake/intel-19.0.4/netcdf-fortran-4.5.2-qf7eyugntntno4oxsc4gmm52rhy7p3dt
Setting Environment HDF5_PATH=/lustre/opt/cascadelake/linux-centos7-cascadelake/intel-19.0.4/hdf5-1.10.6-rbc5csp26yx6ybkjgjkpqe3ni4lm3nro
Setting Environment OMP_NUM_THREADS=1

SUBMIT CMD:
sbatch -t 0:20:00 -p cpu --dependency=afterok:0 case.st_archive --resubmit

I have checked that the HPC has more idle nodes than the 6 requested.

So how can I solve this problem?

Thanks for your help and patience.
 

jedwards

CSEG and Liaisons
Staff member
I don't see any problem here, you may need to check with your system administrators. Are you specifying the correct queue?
 

Mikasa

sky
Member
I don't see any problem here, you may need to check with your system administrators. Are you specifying the correct queue?
Hello, thanks for your prompt reply! The following script, which requests the same queue, node count, and total cores as the CESM case above, runs successfully on the HPC.
#!/bin/bash

#SBATCH --job-name=hostname
#SBATCH --partition=cpu
#SBATCH -N 6
#SBATCH --ntasks-per-node=40
#SBATCH --output=%j.out
#SBATCH --error=%j.err

/bin/hostname
So the system administrator says the problem lies in the ./case.submit script.
I have checked my config_batch.xml and config_machines.xml and found no problem. The specified queue is "cpu", which is correct.
 

Mikasa

sky
Member
Check the #SBATCH lines in file .case.run in your case directory.
The #SBATCH lines in .case.run are:
#!/usr/bin/env python
# Batch system directives
#SBATCH --job-name=run.case06
#SBATCH --nodes=6
#SBATCH --ntasks-per-node=40
#SBATCH --output=run.case06
#SBATCH --exclusive
#SBATCH --job-name=run.case06
#SBATCH --nodes=6
#SBATCH --ntasks-per-node=40
#SBATCH --output=run.case06
#SBATCH --exclusive
#SBATCH --mem=0
It is strange that every directive appears twice.
 

jedwards

CSEG and Liaisons
Staff member
It is strange and it indicates an error someplace. Check config_batch.xml, are you setting these twice?
You can try modifying the file by hand, removing the repeated lines, to see whether that solves the problem before you track down the source of the repetition.
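As a quick way to find the duplicates before editing, something like this should work (a sketch; it just prints any #SBATCH directive that occurs more than once in .case.run):

```shell
# List #SBATCH directives that appear more than once in .case.run
grep '^#SBATCH' .case.run | sort | uniq -d
```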
 

Mikasa

sky
Member
It is strange and it indicates an error someplace. Check config_batch.xml, are you setting these twice?
You can try modifying by hand and removing the repeated lines and see if that solves the problem, before you find the source of the repetition.
I created a completely new case and after ./case.setup the repetition still occurred. Here is my config_batch.xml:
<?xml version="1.0"?>
<config_batch version="2.0">
  <batch_system MACH="pi" type="slurm">
    <batch_query per_job_arg="-j">squeue</batch_query>
    <batch_submit>sbatch</batch_submit>
    <batch_cancel>scancel</batch_cancel>
    <batch_directive>#SBATCH</batch_directive>
    <jobid_pattern>(\d+)$</jobid_pattern>
    <depend_string> --dependency=afterok:jobid</depend_string>
    <depend_allow_string> --dependency=afterany:jobid</depend_allow_string>
    <depend_separator>,</depend_separator>
    <walltime_format>%H:%M:%S</walltime_format>
    <batch_mail_flag>--mail-user</batch_mail_flag>
    <batch_mail_type_flag>--mail-type</batch_mail_type_flag>
    <batch_mail_type>none, all, begin, end, fail</batch_mail_type>
    <submit_args>
      <arg flag="-t" name="$JOB_WALLCLOCK_TIME"/>
      <arg flag="-p" name="$JOB_QUEUE"/>
    </submit_args>
    <directives>
      <directive> --job-name={{ job_id }}</directive>
      <directive> --nodes={{ num_nodes }}</directive>
      <directive> --ntasks-per-node={{ tasks_per_node }}</directive>
      <directive> --output={{ job_id }} </directive>
      <directive> --exclusive </directive>
      <directive> --mem=0 </directive>
    </directives>
    <queues>
      <queue walltimemax="24:00:00" default="true">cpu</queue>
    </queues>
  </batch_system>
</config_batch>
There are no repeated lines.
 

jedwards

CSEG and Liaisons
Staff member
You do not need to repeat the lines defined in the generic slurm section, only the lines that differ from the default.
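For example, a trimmed machine-specific config_batch.xml might look something like this (an illustrative sketch, assuming --exclusive and --mem=0 are not already part of the generic slurm defaults and are the only directives you actually need to add):

```xml
<?xml version="1.0"?>
<config_batch version="2.0">
  <batch_system MACH="pi" type="slurm">
    <!-- Only the entries that differ from the generic slurm section -->
    <directives>
      <directive> --exclusive </directive>
      <directive> --mem=0 </directive>
    </directives>
    <queues>
      <queue walltimemax="24:00:00" default="true">cpu</queue>
    </queues>
  </batch_system>
</config_batch>
```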
 

Mikasa

sky
Member
You do not need to repeat the lines defined in the generic slurm section, only the lines that differ from the default.
Wonderful! I deleted the lines duplicated from the generic slurm section and the submission succeeded!
Back in December 2021, I submitted a case successfully with the same repeated #SBATCH lines as above. That is really confusing.
Anyway, the problem has been solved. Thank you very much!
I have another two questions:
  1. I just copied the whole generic template for config_machines.xml and config_compilers.xml into the .cime/ directory and then made some modifications, but they don't seem to cause any problem. Why?
  2. Since the B1850 compset requests 6 nodes automatically, is there any way to manually specify more nodes (such as 30 nodes) to accelerate the calculation? I plan to run a CMIP6-related compset for 150 years and want to save time.
I appreciate your answer very much!
 