
Porting CESM1_1_2 LENS in local cluster

Rei

New Member
Hi,

I am trying to port CESM1_1_2 (the LENS version) to my local cluster. I have managed to build the model, but the runs crash each time during initialization of the atm component, right after this line in ccsm.log:
/gpfs01/work/reichemke/inputdata/atm/cam/chem/trop_mozart/dvel/season_wes.nc
BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 0 PID 23089 RUNNING AT cfl204.chemfarm
= KILLED BY SIGNAL: 11 (Segmentation fault)

My env_mach_specific is:
intel/2019u5, impi/2019u5, mkl/2019u5 and netcdf/4.4.1.1
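The module section of the file looks roughly like this (just a sketch; the module names follow our local cluster's naming and the modules init path may differ on other systems):

#! /bin/csh -f
source /usr/share/Modules/init/csh
module purge
module load intel/2019u5
module load impi/2019u5
module load mkl/2019u5
module load netcdf/4.4.1.1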

I tried setting DEBUG to TRUE, but it made no difference.
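For reference, I set it with xmlchange and then rebuilt, roughly like this (the clean-build and build scripts are the ones generated in my case directory):

./xmlchange -file env_build.xml -id DEBUG -val TRUE
./$CASE.clean_build
./$CASE.build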

Also, I tried the 2017 versions of intel, impi, and mkl, but the run still fails at the same place with a different error:
control_cb (../../pm/pmiserv/pmiserv_cb.c:798): connection to proxy 20 at host cfl001.chemfarm failed
HYDT_dmxu_poll_wait_for_event (../../tools/demux/demux_poll.c:76): callback returned error status
HYD_pmci_wait_for_completion (../../pm/pmiserv/pmiserv_pmci.c:501): error waiting for event
main (../../ui/mpich/mpiexec.c:1147): process manager error waiting for completion

Another warning/error is related to reading the .nc files from inputdata, e.g. under:
inputdata/atm/cam/chem/trop_mozart_aero/emis
inputdata/atm/cam/topo/
with the following messages:
NetCDF: Invalid dimension ID or name
NetCDF: Variable not found

Any suggestions?

Thank you!!
 

jedwards

CSEG and Liaisons
Staff member
The NetCDF messages are expected and can be safely ignored. It appears that you are running out of memory.
Changing DEBUG to TRUE should add the -g and -O0 options to the compile lines; you might also add the flag -debug minimal to
the Intel compiler flags. I would try that again and make sure that the source files are recompiled with these flags.
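For example, you could append the flag in the Macros file for your machine (the exact variable names depend on how your Macros file is laid out; this is only a sketch for the Intel section):

FFLAGS += -debug minimal
CFLAGS += -debug minimal

and then force a clean rebuild so the new flags actually take effect:

./$CASE.clean_build
./$CASE.build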
 

Rei

New Member
Thank you for the prompt response. Following your advice I increased the memory per node in the PBS line, and with that the model seems to finish the run. This happens with or without DEBUG TRUE and the -debug minimal flag (a sketch of the PBS line is further below). However, the model runs very slowly: with the same configuration I used on Cheyenne, the model is approximately 30 times slower. Looking at the 'timing' output:

component     comp_pes  root_pe  tasks x threads  instances (stride)
---------     --------  -------  ---------------  ------------------
cpl = cpl         1440        0     720 x 2        1 (1 )
glc = sglc           2        0       1 x 2        1 (1 )
lnd = clm          288        0     144 x 2        1 (1 )
rof = rtm          288        0     144 x 2        1 (1 )
ice = cice         576      144     576 x 1        1 (1 )
atm = cam         1440        0     720 x 2        1 (1 )
ocn = pop2         288      720     144 x 2        1 (1 )

total pes active : 1728
pes per node : 36
pe count for cost estimate : 1728

Overall Metrics:
Model Cost: 43137.08 pe-hrs/simulated_year

For the same configuration on Cheyenne I would have:
Model Cost: 1527.65 pe-hrs/simulated_year

All nodes are always available and running at 100%.
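For reference, the batch directive I changed to raise the per-node memory looks roughly like this (our scheduler accepts PBS Pro select syntax; the node count matches the 1728-pe / 36-pes-per-node layout above, and the memory value is a placeholder for what our nodes have):

#PBS -l select=48:ncpus=36:mpiprocs=36:mem=180gb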

Any idea how to speed up the run time?
 

jedwards

CSEG and Liaisons
Staff member
Are your nodes and network comparable to Cheyenne's? Do you have reason to expect the same performance? Perhaps you should discuss this with your hardware support staff.
 