Optimize ensemble configuration with NUOPC MULTI_DRIVER=TRUE?

Molly Wieringa · Aug 5, 2024

What version of the code are you using?
cesm2.3_beta17

Have you made any changes to files in the source tree?
No changes to files in the source tree

Describe every step you took leading up to the problem:
1. create a new case using predefined compset information
${cesmroot}/cime/scripts/create_newcase --res ${resolution} \
--machine ${machine} \
--compset ${compset} \
--case ${caseroot} \
--project ${project} \
--output-root $SCRATCH \
--run-unsupported || exit 1
2. xml file changes
./xmlchange DOUT_S=TRUE
./xmlchange DOUT_S_SAVE_INTERIM_RESTART_FILES=TRUE
./xmlchange EXEROOT=$exeroot
./xmlchange STOP_OPTION=$stop_option
./xmlchange STOP_N=$stop_n
./xmlchange REST_OPTION=$rest_option
./xmlchange REST_N=$rest_n
./xmlchange JOB_QUEUE='main'
./xmlchange JOB_WALLCLOCK_TIME=$job_time
./xmlchange RUN_STARTDATE=$startdate
./xmlchange RUN_TYPE=$runtype
./xmlchange MULTI_DRIVER=TRUE
./xmlchange PIO_TYPENAME=netcdf

setenv MAX_TASKS_PER_NODE `./xmlquery MAX_TASKS_PER_NODE --value`
@ ptile = $MAX_TASKS_PER_NODE
@ nthreads = 1
@ nodes_per_instance = 1
@ comptasks = $ptile * $nodes_per_instance * $num_inst
./xmlchange ROOTPE=0,NTHRDS=$nthreads,NTASKS=$comptasks,NINST=$num_inst

3. set up the case
./case.setup || exit 9

4. user_nl_{component} changes
I made changes to user_nl_datm_streams, user_nl_cice, user_nl_docn and user_nl_drof_streams for each ensemble member. I do not believe them to be relevant to my issue.

5. build the case
qcmd -A ${project} -- ./case.build --skip-provenance-check || exit 10

Describe your problem or question:
I am running ensembles of active sea ice, slab ocean, and prescribed data atmosphere and runoff for a set of data assimilation experiments. The remaining components are stub. In previous versions of these experiments using CESM2.1, I had configured ensembles to use a single driver/coupler component and 30 instances of each active component (MULTI_DRIVER=FALSE). This allowed me to tune the computing resources assigned to each component model of the ensemble and cut down on computing tasks. In the CESM2.3 version, it appears that the NUOPC coupler requires MULTI_DRIVER=TRUE. This change makes these ensemble jobs prohibitively expensive: each component's task allocation is based on the max number of tasks per node and the number of ensemble members, so the number of nodes assigned to each job blows up very quickly.

If I'm reading my SAM statements correctly, the outcome is that while the CESM2.1 versions of the ensemble took ~3,000 core-hours on Cheyenne to run a single year, the CESM2.3 version takes ~600,000 core-hours on Derecho to do the same. The only configuration difference (aside from the model version) is that CESM2.3 has 30 prescribed atmospheres; the CESM2.1 ensembles all had the same prescribed atmosphere. Is there any way to optimize the CESM2.3 multi-driver version to more closely match the computing expense of the original CESM2.1 versions? I would appreciate any guidance on tailoring computing resources in MULTI_DRIVER=TRUE cases.

katec · Aug 7, 2024

Hi, I'm going to move this post over to the CIME/Infrastructure board because I think you will get your best answers there.

jedwards · Aug 7, 2024

I think the issue may be a change in how you need to interpret the NTASKS variable.
With MULTI_DRIVER=TRUE, NTASKS is the number of tasks per instance, not the total across all instances. If that doesn't answer your question, please
point me to your case directory and I'll see what I can find.
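
To make the per-instance reading concrete, here is a rough arithmetic sketch of the task formula from step 2 under both interpretations. It is written in POSIX sh (not the csh of the original script) so it can be run on its own without a case directory. The value of 128 tasks per node is an assumption for illustration; the original script reads it from `./xmlquery MAX_TASKS_PER_NODE`. The 30 instances come from the ensemble size mentioned in the post, and the assumption that the job requests NTASKS times NINST tasks in total follows from the per-instance reading above rather than from anything confirmed in this thread.

```sh
#!/bin/sh
# Illustrative values only (assumptions, not read from a real case).
ptile=128               # assumed MAX_TASKS_PER_NODE
nodes_per_instance=1    # as in the original script
num_inst=30             # ensemble size mentioned in the post

# Formula from the original script: NTASKS scaled by the instance count.
ntasks_old=$((ptile * nodes_per_instance * num_inst))
# If NTASKS is per instance, the total request is NTASKS * NINST.
total_old=$((ntasks_old * num_inst))

# Per-instance interpretation: leave num_inst out of NTASKS.
ntasks_new=$((ptile * nodes_per_instance))
total_new=$((ntasks_new * num_inst))

echo "original:     NTASKS=$ntasks_old total_tasks=$total_old nodes=$((total_old / ptile))"
echo "per-instance: NTASKS=$ntasks_new total_tasks=$total_new nodes=$((total_new / ptile))"
```

Under these assumptions the original formula asks for 3840 tasks per instance (115200 tasks, or 900 nodes, in total), while the per-instance formula asks for 128 tasks per instance (3840 tasks, or 30 nodes). In the csh script this would correspond to dropping `* $num_inst` from the `@ comptasks` line, and `nodes_per_instance` could likely be lowered further for lightweight data components, though that has not been verified here.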
