How to set up multiple instances (= ensemble)

Young-chan · Nov 16, 2021

The CESM package is now installed and built on our computing system.
I created a case with the "F2000climo" compset, and both "case.setup" and "case.build" completed successfully.
For this case, I tried to run multiple instances by modifying "env_mach_pes.xml" as shown below.
I set up this file to run ten instances of CAM (a 10-member CAM ensemble).
The case built without problems, but the run failed after "case.submit". The related errors are shown below.

I don't know how to solve this problem.

Does anybody know how to set up multiple instances of CAM while keeping a single instance of each other component?

Thank you for your help in advance.

Youngchan

env_mach_pes.xml
---------------------------------------------------------------------------------------------------------------
<?xml version="1.0"?>
<file id="env_mach_pes.xml" version="2.0">
<comment>none</comment>
<group id="mach_pes_last">
<entry id="COST_PES" value="96">
<type>integer</type>
<desc>pes or cores used relative to MAX_MPITASKS_PER_NODE for accounting (0 means TOTALPES is valid)</desc>
</entry>
<entry id="TOTALPES" value="90">
<type>integer</type>
<desc>total number of physical cores used (setup automatically - DO NOT EDIT)</desc>
</entry>
<entry id="MAX_TASKS_PER_NODE" value="24">
<type>integer</type>
<desc>maximum number of tasks/ threads allowed per node </desc>
</entry>
<entry id="MAX_MPITASKS_PER_NODE" value="24">
<type>integer</type>
<desc>pes or cores per node for mpitasks </desc>
</entry>
<entry id="COSTPES_PER_NODE" value="$MAX_MPITASKS_PER_NODE">
<type>integer</type>
<desc>pes or cores per node for accounting purposes </desc>
</entry>
</group>
<group id="mach_pes">
<entry id="ALLOCATE_SPARE_NODES" value="FALSE">
<type>logical</type>
<valid_values>TRUE,FALSE</valid_values>
<desc>Allocate some spare nodes to handle node failures. The system will pick a reasonable number</desc>
</entry>
<entry id="FORCE_SPARE_NODES" value="-999">
<type>integer</type>
<desc>Force this exact number of spare nodes to be allocated</desc>
</entry>
<entry id="NTASKS">
<type>integer</type>
<values>
<value compclass="ATM">90</value>
<value compclass="CPL">90</value>
<value compclass="OCN">90</value>
<value compclass="WAV">90</value>
<value compclass="GLC">90</value>
<value compclass="ICE">90</value>
<value compclass="ROF">90</value>
<value compclass="LND">90</value>
<value compclass="ESP">1</value>
</values>
<desc>number of tasks for each component</desc>
</entry>
<entry id="NTASKS_PER_INST">
<type>integer</type>
<values>
<value compclass="ATM">9</value>
<value compclass="OCN">90</value>
<value compclass="WAV">90</value>
<value compclass="GLC">90</value>
<value compclass="ICE">90</value>
<value compclass="ROF">90</value>
<value compclass="LND">90</value>
<value compclass="ESP">1</value>
</values>
<desc>Number of tasks per instance for each component. DO NOT EDIT: Set automatically by case.setup based on NTASKS, NINST and MULTI_DRIVER</desc>
</entry>
<entry id="ROOTPE">
<type>integer</type>
<values>
<value compclass="ATM">0</value>
<value compclass="CPL">0</value>
<value compclass="OCN">0</value>
<value compclass="WAV">0</value>
<value compclass="GLC">0</value>
<value compclass="ICE">0</value>
<value compclass="ROF">0</value>
<value compclass="LND">0</value>
<value compclass="ESP">0</value>
</values>
<desc>ROOTPE (mpi task in MPI_COMM_WORLD) for each component</desc>
</entry>
<entry id="MULTI_DRIVER" value="FALSE">
<type>logical</type>
<valid_values>TRUE,FALSE</valid_values>
<desc>MULTI_DRIVER mode provides a separate driver/coupler component for each
ensemble member. All components must have an equal number of members. If
MULTI_DRIVER mode is False prognostic components must have the same number
of members but data or stub components may also have 1 member. </desc>
</entry>
<entry id="NINST">
<type>integer</type>
<values>
<value compclass="ATM">10</value>
<value compclass="OCN">1</value>
<value compclass="WAV">1</value>
<value compclass="GLC">1</value>
<value compclass="ICE">1</value>
<value compclass="ROF">1</value>
<value compclass="LND">1</value>
<value compclass="ESP">1</value>
</values>
<desc>Number of instances for each component. If MULTI_DRIVER is True
the NINST_MAX value will be used.
</desc>
</entry>
<entry id="PSTRID">
<type>integer</type>
<values>
<value compclass="ATM">1</value>
<value compclass="CPL">1</value>
<value compclass="OCN">1</value>
<value compclass="WAV">1</value>
<value compclass="GLC">1</value>
<value compclass="ICE">1</value>
<value compclass="ROF">1</value>
<value compclass="LND">1</value>
<value compclass="ESP">1</value>
</values>
<desc>The mpi global processors stride associated with the mpi tasks for the a component</desc>
</entry>
</group>
</file>

Error message from cesm.log.1846.elsa00.211117-013027
---------------------------------------------------------------------------------------------------------------

ERROR: (seq_mct_drv) ERROR: lnd_prognostic but num_inst_lnd not num_inst_max
ERROR: (seq_mct_drv) ERROR: lnd_prognostic but num_inst_lnd not num_inst_max
ERROR: (seq_mct_drv) ERROR: lnd_prognostic but num_inst_lnd not num_inst_max

:
:
[cli_13]: aborting job:
application called MPI_Abort(MPI_COMM_WORLD, 1001) - process 13
Image PC Routine Line Source
cesm.exe 00000000029A14B6 Unknown Unknown Unknown
cesm.exe 000000000264C04E shr_abort_mod_mp_ 114 shr_abort_mod.F90
cesm.exe 0000000000425C2F cime_comp_mod_mp_ 1652 cime_comp_mod.F90
cesm.exe 0000000000430719 MAIN__ 114 cime_driver.F90
cesm.exe 0000000000413C1E Unknown Unknown Unknown
libc-2.17.so 00002AEFB961A445 __libc_start_main Unknown Unknown
cesm.exe 0000000000413B29 Unknown Unknown Unknown
[cli_15]: aborting job:
application called MPI_Abort(MPI_COMM_WORLD, 1001) - process 15
[cli_14]: aborting job:
application called MPI_Abort(MPI_COMM_WORLD, 1001) - process 14
[cli_16]: aborting job:
application called MPI_Abort(MPI_COMM_WORLD, 1001) - process 16
[cli_23]: aborting job:
application called MPI_Abort(MPI_COMM_WORLD, 1001) - process 23
Image PC Routine Line Source
cesm.exe 00000000029A14B6 Unknown Unknown Unknown
cesm.exe 000000000264C04E shr_abort_mod_mp_ 114 shr_abort_mod.F90
cesm.exe 0000000000425C2F cime_comp_mod_mp_ 1652 cime_comp_mod.F90
cesm.exe 0000000000430719 MAIN__ 114 cime_driver.F90
cesm.exe 0000000000413C1E Unknown Unknown Unknown
libc-2.17.so 00002ABAE46D5445 __libc_start_main Unknown Unknown
cesm.exe 0000000000413B29 Unknown Unknown Unknown
[cli_0]: aborting job:
application called MPI_Abort(MPI_COMM_WORLD, 1001) - process 0

===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= PID 34639 RUNNING AT elsa14
= EXIT CODE: 233
= CLEANING UP REMAINING PROCESSES
= YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
[proxy:0:0@elsa17] HYD_pmcd_pmip_control_cmd_cb (pm/pmiserv/pmip_cb.c:909): assert (!closed) failed
[proxy:0:0@elsa17] HYDT_dmxu_poll_wait_for_event (tools/demux/demux_poll.c:76): callback returned error status
[proxy:0:0@elsa17] main (pm/pmiserv/pmip.c:206): demux engine error waiting for event
[proxy:0:1@elsa16] HYD_pmcd_pmip_control_cmd_cb (pm/pmiserv/pmip_cb.c:909): assert (!closed) failed
[proxy:0:1@elsa16] HYDT_dmxu_poll_wait_for_event (tools/demux/demux_poll.c:76): callback returned error status
[proxy:0:1@elsa16] main (pm/pmiserv/pmip.c:206): demux engine error waiting for event
[proxy:0:2@elsa15] HYD_pmcd_pmip_control_cmd_cb (pm/pmiserv/pmip_cb.c:909): assert (!closed) failed
[proxy:0:2@elsa15] HYDT_dmxu_poll_wait_for_event (tools/demux/demux_poll.c:76): callback returned error status
[proxy:0:2@elsa15] main (pm/pmiserv/pmip.c:206): demux engine error waiting for event
[mpiexec@elsa17] HYDT_bscu_wait_for_completion (tools/bootstrap/utils/bscu_wait.c:76): one of the processes terminated badly; aborting
[mpiexec@elsa17] HYDT_bsci_wait_for_completion (tools/bootstrap/src/bsci_wait.c:23): launcher returned error waiting for completion
[mpiexec@elsa17] HYD_pmci_wait_for_completion (pm/pmiserv/pmiserv_pmci.c:218): launcher returned error waiting for completion
[mpiexec@elsa17] main (ui/mpich/mpiexec.c:344): process manager error waiting for completion

erik · Nov 17, 2021

Hello. The answer to your question is in the description of MULTI_DRIVER, which you included above:

XML:

MULTI_DRIVER mode provides a separate driver/coupler component for each
ensemble member. All components must have an equal number of members. If
MULTI_DRIVER mode is False prognostic components must have the same number
of members but data or stub components may also have 1 member.

Since you are running an F compset, you need the same number of instances for the LND, ROF, GLC, and ICE models (the prognostic components) as for ATM. Since OCN is a data component, it is fine as it is. And if your particular version uses the stub glacier or river models, those can remain as they are as well.
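
The rule above can be checked before submitting. The following is only a minimal sketch in Python, not part of CIME: it parses a trimmed copy of the NINST entry posted above and lists the components whose instance count differs from the maximum. The helper names (`read_ninst`, `mismatched`) and the `PROGNOSTIC` set are hypothetical; the set is taken from the reply above and will differ for other compsets.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the NINST entry from the env_mach_pes.xml posted above.
NINST_XML = """
<entry id="NINST">
  <type>integer</type>
  <values>
    <value compclass="ATM">10</value>
    <value compclass="OCN">1</value>
    <value compclass="WAV">1</value>
    <value compclass="GLC">1</value>
    <value compclass="ICE">1</value>
    <value compclass="ROF">1</value>
    <value compclass="LND">1</value>
    <value compclass="ESP">1</value>
  </values>
</entry>
"""

# Assumption taken from the reply above: in this F compset these component
# classes are prognostic. Adjust the set for your own compset.
PROGNOSTIC = {"ATM", "LND", "ROF", "GLC", "ICE"}


def read_ninst(xml_text):
    """Return {compclass: instance count} from a NINST <entry> element."""
    entry = ET.fromstring(xml_text)
    return {v.get("compclass"): int(v.text) for v in entry.iter("value")}


def mismatched(ninst, prognostic):
    """List prognostic components whose instance count differs from the maximum."""
    ninst_max = max(ninst.values())
    return sorted(c for c in prognostic if ninst.get(c, 1) != ninst_max)


ninst = read_ninst(NINST_XML)
bad = mismatched(ninst, PROGNOSTIC)
print("NINST max:", max(ninst.values()))
print("prognostic components needing NINST = max:", bad)
```

For the file posted above this flags GLC, ICE, LND and ROF. Rather than editing the XML by hand, the values would normally be set from the case directory with xmlchange (for example `./xmlchange NINST_LND=10`), followed by rerunning case.setup and case.build; check the exact options against your CIME version.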

Young-chan: Young-chan Noh, New Member
erik: Erik Kluzek, CSEG and Liaisons