
Questions on NTASKS, ROOTPE, and submission

xiangli

Xiang Li
Member
Hi all,

I have several questions on pelayout and task submission.

1) The default setting of NTASKS in the env_mach_pes.xml for a B1850 case is like this:

1706643672918.png

By running ./pelayout, I have:

1706644531890.png

It looks like one node corresponds to 32 tasks. Can I, or should I, change this correspondence (1 node ~ 32 tasks), considering that each node has 92 CPUs on my supercomputer?

Can I uniformly set all NTASKS to 32? In other words, how should I set these NTASKS values so that the model runs more efficiently?

2) With respect to ROOTPE, why are there 4 and 2 nodes for OCN and ICE, respectively, but 0 nodes for the others?

1706644234897.png

Can I change the "-4" and "-2" to "1" or "0"?

3) If I am going to submit the CESM job to another partition, should I add a "#SBATCH -p" line at the top of the case.submit file, like this:

1706644386736.png

However, it did not work because the partition did not change. How can I change the partition correctly?

Thanks,
Xiang
 

jedwards

CSEG and Liaisons
Staff member
The relationship between nodes and tasks is defined in config_machines.xml and used by pelayout. If
pelayout is getting it wrong, then you should review your machine definition.

The case.submit script is not itself submitted; case.submit prepares the case for submission and submits the hidden .case.run or .case.test
scripts. If the header of these scripts is not correct, it is because it is not defined correctly in config_batch.xml.
Did you read the CIME porting guide?
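For reference, the node-to-task relationship that pelayout uses normally comes from the machine entry in config_machines.xml. A rough sketch (the MACH name is a placeholder, and 92 is just the per-node CPU count mentioned above, not a recommendation):

Code:
<machine MACH="yourmachine">
  ...
  <!-- maximum PEs (MPI tasks x OpenMP threads) allowed on one node -->
  <MAX_TASKS_PER_NODE>92</MAX_TASKS_PER_NODE>
  <!-- MPI tasks placed per node; this is what converts NTASKS into a node count -->
  <MAX_MPITASKS_PER_NODE>92</MAX_MPITASKS_PER_NODE>
  ...
</machine>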
 

xiangli

Xiang Li
Member
Hi Jim,

I read this guide in detail and modified my config_batch.xml, config_machines.xml, and config_compilers.xml.

I am currently testing my configuration by running ./scripts_regression_tests.py. The output looks generally good. However, the test jobs submitted by this script did not start, even though they were given priority and there were adequate computing resources. Any possible reasons for that?

1707322389953.png

1707322480117.png

Looking forward to your suggestions.

Thanks,
Xiang
 

jedwards

CSEG and Liaisons
Staff member
This is a system issue, and you may need to consult your system administrator. But it looks as if you have set a runtime of 90 days for those jobs, and that can't be right.
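If the scheduler is Slurm, one way to confirm the limit actually attached to the pending jobs (a generic sketch, nothing CESM-specific) is:

Code:
# %l prints each job's time limit, %M the time used so far, %R the pending reason
squeue -u $USER -o "%.10i %.12P %.30j %.10M %.12l %.6D %R"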
 

xiangli

Xiang Li
Member
Hi Jim,

I have been actively communicating with our system administrator about this issue, but we have not been able to figure it out yet.

I set the max wall time to 120 hours, but I cannot tell why the runtime becomes 90 days after submission.

Here is how I set my configuration. I would appreciate it if you could take a look and provide some hints on the possible fault.

config_batch.xml:

1707323879670.png

config_machines.xml:

1707323945326.png

1707323968232.png

Thanks,
Xiang
 

jedwards

CSEG and Liaisons
Staff member
I think that the walltime_format field should just be 00:00:00.
Try playing with different values of the walltime; I'm pretty sure that's where the problem is.
You can create a single test like:
cd cime/scripts
./create_test SMS.f19_g17.X
then cd to the test directory and try different values of wallclock time with
./xmlchange JOB_WALLCLOCK_TIME=00:10:00 (for example)
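After changing the value, you can check what actually ends up in the batch header by regenerating and inspecting the hidden run script. A sketch (depending on your config_batch.xml, the walltime may appear as a header directive or only as a submit argument):

Code:
./xmlchange JOB_WALLCLOCK_TIME=00:10:00
./case.setup --reset
head -n 20 .case.run    # look for the scheduler directive lines that will be submitted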
 

xiangli

Xiang Li
Member
Hi Jim,

This was what I did:

1707339457287.png

I used a much smaller walltimemax, and the JOB_WALLCLOCK_TIME in env_workflow.xml changed correspondingly.

1707339511373.png

However, the time limit of this test job was still 90 days. I tried 00:00:00 for walltime_format as well. That made no difference.

1707339595712.png

Actually, the time limit for a bash job was also 90 days. The bash job was created by running this:

1707339689649.png

Therefore, I think the 90-day time limit may have nothing to do with the CESM configuration.

My test job was still not able to run. Happy to hear your opinion!

Thanks,
Xiang
 

jedwards

CSEG and Liaisons
Staff member
In config_batch.xml you need to add some submit_args; the following is for the machine perlmutter, and yours should be similar.

Code:
<submit_args>                                                                                                                 
      <arg flag="--time" name="$JOB_WALLCLOCK_TIME" />                                                                           
      <arg flag="-q" name="$JOB_QUEUE" />                                                                                         
      <arg flag="--account" name="$PROJECT" />                                                                                   
    </submit_args>
 

xiangli

Xiang Li
Member
Hi Jim,

It turned out that there was an error when I added one or all of these submit_args. In contrast, the case could be successfully created, built, and submitted if I did not add the submit_args.

1707342182620.png

Here are the most recent updates.

Without adding the submit_args, my config_batch.xml looks like this:

1707421769259.png

I did 3 kinds of tests.

1) ./create_test SMS.f19_g17.X

This test was successful. The case could be created, built, and submitted, and it ran for 2 minutes.

Here is CaseStatus:

1707421940693.png

Here is TestStatus:

1707421999590.png

2) I also submitted several B1850 cases, but they did not start to run. Some of them were pending on resources, but I checked and the resources were adequate. We plan to run B1850 cases for research.

1707422118651.png

1707422132071.png

3) ./scripts_regression_tests.py

There were some test runs submitted by this script, and some of them finished running, with some output:

1707422283864.png

However, some test runs are still pending, and I'm not sure whether they will start to run. Resources should be adequate.

1707422356957.png


1707422369491.png

Looking forward to your comments and suggestions.

Thanks,
Xiang
 

jedwards

CSEG and Liaisons
Staff member
It looks like, rather than correcting the error you introduced with the submit_args, you just abandoned that approach. Since you didn't provide any information other than the error, I can't really be sure what the problem was, but it looks to me like you ordered the batch_system section incorrectly. submit_args should immediately precede the <directives> entry and follow any <batch_...> fields provided.
 

xiangli

Xiang Li
Member
Hi Jim,

I reordered the section like this:

1707427299606.png

But there was an error when creating the case, as I mentioned previously:

1707427346115.png

I also tried several other orderings, which made no difference.

Any suggestions would be appreciated.

Thanks,
Xiang
 

jedwards

CSEG and Liaisons
Staff member
The error says that the order of fields in this file matters. submit_args should follow the batch_mail_type and be followed by the directives and queues.
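In other words, a skeleton of the batch_system block in the expected order might look like the following. This is only a sketch; the element values are placeholders (including the queue name), and you should keep whatever your file already defines:

Code:
<batch_system MACH="yourmachine" type="slurm">
  <batch_submit>sbatch</batch_submit>
  <batch_mail_flag>--mail-user</batch_mail_flag>
  <batch_mail_type_flag>--mail-type</batch_mail_type_flag>
  <batch_mail_type>none, all, begin, end, fail</batch_mail_type>
  <submit_args>
    <arg flag="--time" name="$JOB_WALLCLOCK_TIME" />
    <arg flag="-q" name="$JOB_QUEUE" />
    <arg flag="--account" name="$PROJECT" />
  </submit_args>
  <directives>
    <directive>--nodes={{ num_nodes }}</directive>
    <directive>--ntasks-per-node={{ tasks_per_node }}</directive>
  </directives>
  <queues>
    <queue walltimemax="05:00:00" nodemax="5" default="true">hulab</queue>
  </queues>
</batch_system>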
 

xiangli

Xiang Li
Member
Hi Jim,

Thanks! I adjusted the order and the ./create_test SMS.f19_g17.X test ran successfully! Here is my config_batch.xml:

1707506076803.png

However, the B1850 case was still not able to start running. I am requesting 4 nodes with 8 tasks per node; resources should be enough.

It should be noted that the TIME LIMIT was successfully changed!

1707506172521.png

1707506282590.png

The ./scripts_regression_tests.py test was not able to finish, perhaps because two test jobs were always pending:

1707506393321.png

There were only 2 FAILs in the output, and all the others were OK.

1707506459521.png

1707506531284.png

Looking forward to your comments and suggestions.

Thanks,
Xiang
 

jedwards

CSEG and Liaisons
Staff member
The scripts regression tests print the nature of the failure later in the output. In your B case you
have NTASKS_ATM=32 but ROOTPE_OCN=16, so they are overlapping and cannot progress.
Change ROOTPE_OCN to 32 and, likewise, change ROOTPE_ICE to 16.

I'm confused by your having 92 CPUs per node but only using 32 of them. In config_machines.xml this is set in the variables MAX_TASKS_PER_NODE and MAX_MPITASKS_PER_NODE.
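To apply just the two changes suggested above, from the case directory (then re-check the layout):

Code:
./xmlchange ROOTPE_OCN=32
./xmlchange ROOTPE_ICE=16
./pelayout    # OCN should now start on the PE right after ATM's last task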
 

xiangli

Xiang Li
Member
Hi Jim,

As you see, the partition hulab only has 5 nodes with 92 CPUs per node, but some of the CPUs in each node may already be allocated.

Here is the default setting of NTASKS and ROOTPE, which requests 6 nodes (but we only have 5 in our partition).

1707509484103.png

1707509507006.png

I tried to reduce ROOTPE to reduce the number of nodes. If I halve the ROOTPE for OCN and ICE, I should also halve the NTASKS for all components, right?

Yes, currently, MAX_TASKS_PER_NODE and MAX_MPITASKS_PER_NODE are set to 8.

Thanks,
Xiang
 

jedwards

CSEG and Liaisons
Staff member
We generally use systems with dedicated nodes; shared-node systems introduce a huge complication, and frankly I just don't have any experience using them.
 

xiangli

Xiang Li
Member
Hi Jim,

Our system administrator has installed all the prerequisites following this list:

1707940278974.png

But we got this error when testing:

Code:
xl468@dcc-hulab-01 /hpc/group/hulab/xl468/cesm2.1/my_cesm_sandbox/cime/scripts $ module load CESM/prereqs
OpenMPI/4.1.6
NetCDF/c-4.9.2
NetCDF/fortran-4.6.1
cmake/3.28.3
OpenBLAS 3.23
Subversion/1.14.3
CESM/prereqs

Loading CESM/prereqs
  Loading requirement: OpenMPI/4.1.6 NetCDF/c-4.9.2 NetCDF-F/fortran-4.6.1 cmake/3.28.3 OpenBLAS/3.23 Subversion/1.14.3

xl468@dcc-hulab-01 /hpc/group/hulab/xl468/cesm2.1/my_cesm_sandbox/cime/scripts $ module list
Currently Loaded Modulefiles:
 1) OpenMPI/4.1.6   2) NetCDF/c-4.9.2   3) NetCDF-F/fortran-4.6.1   4) cmake/3.28.3   5) OpenBLAS/3.23   6) Subversion/1.14.3   7) CESM/prereqs

Key:
auto-loaded

xl468@dcc-hulab-01 /hpc/group/hulab/xl468/cesm2.1/my_cesm_sandbox/cime/scripts $ ./create_test SMS.f19_g17.X
Testnames: ['SMS.f19_g17.X.duke_gnu']
No project info available
Creating test directory /hpc/group/hulab/xl468/cesm2.1/scratch/SMS.f19_g17.X.duke_gnu.20240214_145312_ldx4wk
RUNNING TESTS:
  SMS.f19_g17.X.duke_gnu
Starting CREATE_NEWCASE for test SMS.f19_g17.X.duke_gnu with 1 procs
Finished CREATE_NEWCASE for test SMS.f19_g17.X.duke_gnu in 1.407000 seconds (PASS)
Starting XML for test SMS.f19_g17.X.duke_gnu with 1 procs
Finished XML for test SMS.f19_g17.X.duke_gnu in 0.313794 seconds (PASS)
Starting SETUP for test SMS.f19_g17.X.duke_gnu with 1 procs
Finished SETUP for test SMS.f19_g17.X.duke_gnu in 1.512333 seconds (PASS)
Starting SHAREDLIB_BUILD for test SMS.f19_g17.X.duke_gnu with 1 procs
Finished SHAREDLIB_BUILD for test SMS.f19_g17.X.duke_gnu in 4.181885 seconds (FAIL). [COMPLETED 1 of 1]
    Case dir: /hpc/group/hulab/xl468/cesm2.1/scratch/SMS.f19_g17.X.duke_gnu.20240214_145312_ldx4wk
    Errors were:
    b'Building test for SMS in directory /hpc/group/hulab/xl468/cesm2.1/scratch/SMS.f19_g17.X.duke_gnu.20240214_145312_ldx4wk\nERROR: /hpc/group/hulab/xl468/cesm2.1/my_cesm_sandbox/cime/src/build_scripts/buildlib.gptl FAILED, cat /hpc/group/hulab/xl468/cesm2.1/scratch/SMS.f19_g17.X.duke_gnu.20240214_145312_ldx4wk/bld/gptl.bldlog.240214-145317'
Due to presence of batch system, create_test will exit before tests are complete.
To force create_test to wait for full completion, use --wait
At test-scheduler close, state is:
FAIL SMS.f19_g17.X.duke_gnu (phase SHAREDLIB_BUILD)
    Case dir: /hpc/group/hulab/xl468/cesm2.1/scratch/SMS.f19_g17.X.duke_gnu.20240214_145312_ldx4wk
test-scheduler took 7.832472324371338 seconds

xl468@dcc-hulab-01 /hpc/group/hulab/xl468/cesm2.1/my_cesm_sandbox/cime/scripts $ cat /hpc/group/hulab/xl468/cesm2.1/scratch/SMS.f19_g17.X.duke_gnu.20240214_145312_ldx4wk/bld/gptl.bldlog.240214-145317
make -f /hpc/group/hulab/xl468/cesm2.1/my_cesm_sandbox/cime/src/share/timing/Makefile install -C /hpc/group/hulab/xl468/cesm2.1/scratch/SMS.f19_g17.X.duke_gnu.20240214_145312_ldx4wk/bld/gnu/openmpi/nodebug/nothreads/gptl MACFILE=/hpc/group/hulab/xl468/cesm2.1/scratch/SMS.f19_g17.X.duke_gnu.20240214_145312_ldx4wk/Macros.make MODEL=gptl GPTL_DIR=/hpc/group/hulab/xl468/cesm2.1/my_cesm_sandbox/cime/src/share/timing GPTL_LIBDIR=/hpc/group/hulab/xl468/cesm2.1/scratch/SMS.f19_g17.X.duke_gnu.20240214_145312_ldx4wk/bld/gnu/openmpi/nodebug/nothreads/gptl SHAREDPATH=/hpc/group/hulab/xl468/cesm2.1/scratch/SMS.f19_g17.X.duke_gnu.20240214_145312_ldx4wk/bld/gnu/openmpi/nodebug/nothreads
make: Entering directory '/hpc/group/hulab/xl468/cesm2.1/scratch/SMS.f19_g17.X.duke_gnu.20240214_145312_ldx4wk/bld/gnu/openmpi/nodebug/nothreads/gptl'
mpicc -c -I/hpc/group/hulab/xl468/cesm2.1/my_cesm_sandbox/cime/src/share/timing -std=gnu99 -O -DFORTRANUNDERSCORE -DNO_R16 -DCPRGNU -DHAVE_MPI /hpc/group/hulab/xl468/cesm2.1/my_cesm_sandbox/cime/src/share/timing/gptl.c
mpicc -c -I/hpc/group/hulab/xl468/cesm2.1/my_cesm_sandbox/cime/src/share/timing -std=gnu99 -O -DFORTRANUNDERSCORE -DNO_R16 -DCPRGNU -DHAVE_MPI /hpc/group/hulab/xl468/cesm2.1/my_cesm_sandbox/cime/src/share/timing/GPTLutil.c
mpicc -c -I/hpc/group/hulab/xl468/cesm2.1/my_cesm_sandbox/cime/src/share/timing -std=gnu99 -O -DFORTRANUNDERSCORE -DNO_R16 -DCPRGNU -DHAVE_MPI /hpc/group/hulab/xl468/cesm2.1/my_cesm_sandbox/cime/src/share/timing/GPTLget_memusage.c
mpicc -c -I/hpc/group/hulab/xl468/cesm2.1/my_cesm_sandbox/cime/src/share/timing -std=gnu99 -O -DFORTRANUNDERSCORE -DNO_R16 -DCPRGNU -DHAVE_MPI /hpc/group/hulab/xl468/cesm2.1/my_cesm_sandbox/cime/src/share/timing/GPTLprint_memusage.c
mpicc -c -I/hpc/group/hulab/xl468/cesm2.1/my_cesm_sandbox/cime/src/share/timing -std=gnu99 -O -DFORTRANUNDERSCORE -DNO_R16 -DCPRGNU -DHAVE_MPI /hpc/group/hulab/xl468/cesm2.1/my_cesm_sandbox/cime/src/share/timing/gptl_papi.c
mpicc -c -I/hpc/group/hulab/xl468/cesm2.1/my_cesm_sandbox/cime/src/share/timing -std=gnu99 -O -DFORTRANUNDERSCORE -DNO_R16 -DCPRGNU -DHAVE_MPI /hpc/group/hulab/xl468/cesm2.1/my_cesm_sandbox/cime/src/share/timing/f_wrappers.c
mpifort -c -I/hpc/group/hulab/xl468/cesm2.1/my_cesm_sandbox/cime/src/share/timing -fconvert=big-endian -ffree-line-length-none -ffixed-line-length-none -O -DFORTRANUNDERSCORE -DNO_R16 -DCPRGNU -DHAVE_MPI -ffree-form /hpc/group/hulab/xl468/cesm2.1/my_cesm_sandbox/cime/src/share/timing/perf_utils.F90
make: Leaving directory '/hpc/group/hulab/xl468/cesm2.1/scratch/SMS.f19_g17.X.duke_gnu.20240214_145312_ldx4wk/bld/gnu/openmpi/nodebug/nothreads/gptl'
/hpc/group/hulab/xl468/cesm2.1/my_cesm_sandbox/cime/src/share/timing/gptl.c: In function ‘GPTLpr_summary_file’:
/hpc/group/hulab/xl468/cesm2.1/my_cesm_sandbox/cime/src/share/timing/gptl.c:3090:8: warning: cast from pointer to integer of different size [-Wpointer-to-int-cast]
 3090 |   if (((int) comm) == 0)
      |       ^
/hpc/group/hulab/xl468/cesm2.1/my_cesm_sandbox/cime/src/share/timing/perf_utils.F90:282:18:

  282 |    call MPI_BCAST(vec,lsize,MPI_INTEGER,0,comm,ierr)
      |                  1
......
  314 |    call MPI_BCAST(vec,lsize,MPI_LOGICAL,0,comm,ierr)
      |                  2
Error: Type mismatch between actual argument at (1) and actual argument at (2) (INTEGER(4)/LOGICAL(4)).
make: *** [/hpc/group/hulab/xl468/cesm2.1/my_cesm_sandbox/cime/src/share/timing/Makefile:63: perf_utils.o] Error 1
ERROR: /hpc/group/hulab/xl468/cesm2.1/my_cesm_sandbox/cime/src/share/timing/gptl.c: In function GPTLpr_summary_file :
/hpc/group/hulab/xl468/cesm2.1/my_cesm_sandbox/cime/src/share/timing/gptl.c:3090:8: warning: cast from pointer to integer of different size [-Wpointer-to-int-cast]
 3090 |   if (((int) comm) == 0)
      |       ^
/hpc/group/hulab/xl468/cesm2.1/my_cesm_sandbox/cime/src/share/timing/perf_utils.F90:282:18:

  282 |    call MPI_BCAST(vec,lsize,MPI_INTEGER,0,comm,ierr)
      |                  1
......
  314 |    call MPI_BCAST(vec,lsize,MPI_LOGICAL,0,comm,ierr)
      |                  2
Error: Type mismatch between actual argument at (1) and actual argument at (2) (INTEGER(4)/LOGICAL(4)).
make: *** [/hpc/group/hulab/xl468/cesm2.1/my_cesm_sandbox/cime/src/share/timing/Makefile:63: perf_utils.o] Error 1
xl468@dcc-hulab-01 /hpc/group/hulab/xl468/cesm2.1/my_cesm_sandbox/cime/scripts $

Here is our config_compilers.xml:

1707940523666.png

Any suggestions would be appreciated!

Thanks,
Xiang
 

jedwards

CSEG and Liaisons
Staff member
For recent gnu compiler versions you will need to add the flags
-fallow-argument-mismatch and -fallow-invalid-boz to the FCFLAGS.
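Assuming the FCFLAGS mentioned above correspond to the <FFLAGS> entry in the XML-based config_compilers.xml shown earlier, one place the flags could go is an append block for the gnu compiler on your machine. This is a sketch, so adapt it to whatever your file already contains:

Code:
<compiler COMPILER="gnu" MACH="yourmachine">
  <FFLAGS>
    <!-- gfortran 10+ turns the MPI_BCAST argument-type mismatch into an error
         unless these flags are present -->
    <append> -fallow-argument-mismatch -fallow-invalid-boz </append>
  </FFLAGS>
</compiler>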
 