
CESM2.2.0-rc.01 POP hangs on some pe counts with f09_g17 B1850

tcraig

Member
I have ported CESM2.2.0-rc.01 to a Cray XC40 and the model runs fine for the most part. I have run CESM1 and other CESM2-based versions on this machine for several years. I am now testing f09_g17 B1850 with CESM2.2.0. The default pe layout and performance are:

component     comp_pes  root_pe  tasks x threads  instances (stride)
---------     --------  -------  ---------------  ------------------
cpl = cpl          352        0        352 x 1             1 (1)
atm = cam          352        0        352 x 1             1 (1)
lnd = clm          176        0        176 x 1             1 (1)
ice = cice         176      176        176 x 1             1 (1)
ocn = pop           44      352         44 x 1             1 (1)
rof = mosart       176        0        176 x 1             1 (1)
glc = cism         352        0        352 x 1             1 (1)
wav = ww           352        0        352 x 1             1 (1)
iac = siac           1        0          1 x 1             1 (1)
esp = sesp           1        0          1 x 1             1 (1)

total pes active : 396
mpi tasks per node : 44
pe count for cost estimate : 396

Overall Metrics:
Model Cost: 4085.46 pe-hrs/simulated_year
Model Throughput: 2.33 simulated_years/day

Init Time : 191.743 seconds
Run Time : 508.774 seconds 101.755 seconds/day
Final Time : 0.009 seconds

Actual Ocn Init Wait Time : 3.971 seconds
Estimated Ocn Init Run Time : 4.212 seconds
Estimated Run Time Correction : 0.241 seconds
(This correction has been applied to the ocean and total run times)

Runs Time in total seconds, seconds/model-day, and model-years/wall-day
CPL Run Time represents time in CPL pes alone, not including time associated with data exchange with other components

TOT Run Time: 508.774 seconds 101.755 seconds/mday 2.33 myears/wday
CPL Run Time: 27.911 seconds 5.582 seconds/mday 42.40 myears/wday
ATM Run Time: 223.893 seconds 44.779 seconds/mday 5.29 myears/wday
LND Run Time: 35.713 seconds 7.143 seconds/mday 33.14 myears/wday
ICE Run Time: 10.327 seconds 2.065 seconds/mday 114.61 myears/wday
OCN Run Time: 505.465 seconds 101.093 seconds/mday 2.34 myears/wday
ROF Run Time: 2.221 seconds 0.444 seconds/mday 532.90 myears/wday
GLC Run Time: 0.957 seconds 0.191 seconds/mday 1236.74 myears/wday
WAV Run Time: 36.272 seconds 7.254 seconds/mday 32.63 myears/wday
IAC Run Time: 0.000 seconds 0.000 seconds/mday 0.00 myears/wday
ESP Run Time: 0.000 seconds 0.000 seconds/mday 0.00 myears/wday
CPL COMM Time: 248.946 seconds 49.789 seconds/mday 4.75 myears/wday

There are 44 cores per node on this machine. In the above pe layout the ocean model runs on only 44 cores and holds up the rest of the model: because POP runs concurrently on its own pes (root pe 352), the total run time (508.8 s) is essentially the ocean run time (505.5 s). When I increase the ocean pe count to 80 or 88 cores, POP hangs at the end of the run; I am not changing anything else. It looks like the hang is associated with writing history or restart files. 44 cores runs fine, but 80 or 88 does not. The default (44-core) POP block size is

<entry id="POP_AUTO_DECOMP" value="TRUE">
<entry id="POP_BLCKX" value="64">
<entry id="POP_BLCKY" value="48">
<entry id="POP_NX_BLOCKS" value="0">
<entry id="POP_NY_BLOCKS" value="0">
<entry id="POP_MXBLCKS" value="1">

which really uses only 32 of the 44 cores allocated. I have tried a number of different block sizes (both defaults and manually set) with 80 and 88 cores, with no luck. I have also tried switching pio from pnetcdf to netcdf. It always hangs with the ocean on 80 or 88 cores, but not with 44 cores. Any ideas? Is anyone else seeing issues like this?
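For reference, the failing change amounts to roughly the following sketch (assuming a standard CIME case with $CASEROOT as the case directory; the block arithmetic assumes the 320 x 384 gx1v7 grid and land-block elimination, which is my reading of the 32-core figure rather than a verified detail):

# default decomposition: 320/64 = 5 blocks in x, 384/48 = 8 in y -> 40 blocks;
# presumably land-only blocks are eliminated, leaving the 32 active cores noted above
cd $CASEROOT
./xmlchange NTASKS_OCN=88        # or 80; nothing else is changed
./case.setup --reset             # regenerate namelists and the POP decomposition
./case.build                     # POP block sizes are fixed at build time, so rebuild

# the pio experiment mentioned above:
./xmlchange PIO_TYPENAME=netcdf  # switch pio from pnetcdf to netcdf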
 

hshin74

Ho-Jeong Shin
New Member
tcraig said:
I have ported CESM2.2.0-rc.01 to a Cray XC40 and the model runs fine for the most part. [... full post quoted above ...]
I have a similar issue with CESM2.1.3 but have not resolved it yet. Have you solved your problem?
 

klindsay

CSEG and Liaisons
Staff member
Hi,

I apologize for the lack of response to tcraig's original post.

I am unable to reproduce this hang on the NCAR machine cheyenne with CESM 2.1.3. That is, I successfully ran for a month with NTASKS_OCN=80 and the model completed cleanly.
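That run was set up along these lines (a sketch; the case name is illustrative, and on cheyenne the machine is detected automatically):

./create_newcase --case b1850_ocn80 --compset B1850 --res f09_g17
cd b1850_ocn80
./xmlchange NTASKS_OCN=80
./xmlchange STOP_OPTION=nmonths,STOP_N=1   # one simulated month
./case.setup
./case.build
./case.submit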

"Processing of taavg_contents file was sometimes accessing uninitialized rows of tavg_contents_request in tavg:init_tavg(); this sometimes led to hangs in global sums (in tavg_global()) because different tasks had different flags for which fields were being averaged. Also led to inconsistent tavg buffer sizes."

When this bug was fixed, we added a flag, ldebug, to POP's namelist that triggers some internal consistency checks, in an attempt to help track down such hangs. Can you please add
ldebug=.true.
to user_nl_pop and rerun. Fingers crossed that this helps isolate the problem.
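One way to do that, from the case directory (a sketch; the preview_namelists step is optional and just lets you confirm the setting lands in the generated pop_in under CaseDocs):

echo "ldebug = .true." >> user_nl_pop
./preview_namelists   # check that ldebug shows up in CaseDocs/pop_in
./case.submit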

Keith
 