
Porting CESM1_1_2 LENS in local cluster

Rei

New Member
Hi,

I am trying to port CESM1_1_2 (the LENS version) to my local cluster. I have managed to build the model, but the runs crash each time during initialization of the atm component, right after this line in ccsm.log:
/gpfs01/work/reichemke/inputdata/atm/cam/chem/trop_mozart/dvel/season_wes.nc
BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 0 PID 23089 RUNNING AT cfl204.chemfarm
= KILLED BY SIGNAL: 11 (Segmentation fault)

My env_mach_specific is:
intel/2019u5, impi/2019u5, mkl/2019u5 and netcdf/4.4.1.1
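The module section of the file looks roughly like this (just a sketch; the module names follow our local cluster's naming and the modules init path may differ on other systems):

#! /bin/csh -f
source /usr/share/Modules/init/csh
module purge
module load intel/2019u5
module load impi/2019u5
module load mkl/2019u5
module load netcdf/4.4.1.1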

I tried setting DEBUG to TRUE, but it made no difference.
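For reference, I set it with xmlchange and then rebuilt, roughly like this (the clean-build and build scripts are the ones generated in my case directory):

./xmlchange -file env_build.xml -id DEBUG -val TRUE
./$CASE.clean_build
./$CASE.build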

Also, I tried the 2017 versions of intel, impi, and mkl, but the run still fails at the same place with a different error:
control_cb (../../pm/pmiserv/pmiserv_cb.c:798): connection to proxy 20 at host cfl001.chemfarm failed
HYDT_dmxu_poll_wait_for_event (../../tools/demux/demux_poll.c:76): callback returned error status
HYD_pmci_wait_for_completion (../../pm/pmiserv/pmiserv_pmci.c:501): error waiting for event
main (../../ui/mpich/mpiexec.c:1147): process manager error waiting for completion

Another warning/error is related to reading the .nc files from inputdata, e.g. under:
inputdata/atm/cam/chem/trop_mozart_aero/emis
inputdata/atm/cam/topo/
with the following messages:
NetCDF: Invalid dimension ID or name
NetCDF: Variable not found

Any suggestions?

Thank you!!
 

jedwards

CSEG and Liaisons
Staff member
The NetCDF messages are expected and can be safely ignored. It appears that you are running out of memory.
Changing DEBUG to TRUE should add the -g and -O0 options to the compile lines; you might also add the flag -debug minimal to
the Intel compiler flags. I would try that again and make sure that the source files are recompiled with these flags.
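For example, you could append the flag in the Macros file for your machine (the exact variable names depend on how your Macros file is laid out; this is only a sketch for the Intel section):

FFLAGS += -debug minimal
CFLAGS += -debug minimal

and then force a clean rebuild so the new flags actually take effect:

./$CASE.clean_build
./$CASE.build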
 

Rei

New Member
Thank you for the prompt response. Following your advice I increased the memory per node in the PBS line, and with that the model seems to finish the run. This happens with or without DEBUG TRUE and the -debug minimal flag (a sketch of the PBS line is further below). However, the model runs very slowly: with the same configuration I used on Cheyenne, the model is approximately 30 times slower. Looking at the 'timing' output:

component     comp_pes  root_pe  tasks x threads  instances (stride)
---------     --------  -------  ---------------  ------------------
cpl = cpl         1440        0     720 x 2        1 (1 )
glc = sglc           2        0       1 x 2        1 (1 )
lnd = clm          288        0     144 x 2        1 (1 )
rof = rtm          288        0     144 x 2        1 (1 )
ice = cice         576      144     576 x 1        1 (1 )
atm = cam         1440        0     720 x 2        1 (1 )
ocn = pop2         288      720     144 x 2        1 (1 )

total pes active : 1728
pes per node : 36
pe count for cost estimate : 1728

Overall Metrics:
Model Cost: 43137.08 pe-hrs/simulated_year

For the same configuration on Cheyenne I would have:
Model Cost: 1527.65 pe-hrs/simulated_year

All nodes are always available and running at 100%.
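For reference, the batch directive I changed to raise the per-node memory looks roughly like this (our scheduler accepts PBS Pro select syntax; the node count matches the 1728-pe / 36-pes-per-node layout above, and the memory value is a placeholder for what our nodes have):

#PBS -l select=48:ncpus=36:mpiprocs=36:mem=180gb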

Any idea how to speed up the run time?
 

jedwards

CSEG and Liaisons
Staff member
Are your nodes and network comparable to Cheyenne's? Do you have reason to expect the same performance? Perhaps you should discuss this with your hardware support staff.
 