tcraig
Member
I have ported CESM2.2.0-rc.01 to a Cray XC40, and the model runs fine for the most part. I have run CESM1 and other CESM2-based versions on this machine for several years. I am now testing f09_g17 B1850 with CESM2.2.0. The default pe layout and performance are:
component      comp_pes   root_pe   tasks x threads   instances (stride)
---------      --------   -------   ---------------   ------------------
cpl = cpl           352         0     352 x 1          1 (1)
atm = cam           352         0     352 x 1          1 (1)
lnd = clm           176         0     176 x 1          1 (1)
ice = cice          176       176     176 x 1          1 (1)
ocn = pop            44       352      44 x 1          1 (1)
rof = mosart        176         0     176 x 1          1 (1)
glc = cism          352         0     352 x 1          1 (1)
wav = ww            352         0     352 x 1          1 (1)
iac = siac            1         0       1 x 1          1 (1)
esp = sesp            1         0       1 x 1          1 (1)

total pes active           : 396
mpi tasks per node         : 44
pe count for cost estimate : 396
Overall Metrics:
Model Cost: 4085.46 pe-hrs/simulated_year
Model Throughput: 2.33 simulated_years/day
Init Time : 191.743 seconds
Run Time : 508.774 seconds 101.755 seconds/day
Final Time : 0.009 seconds
Actual Ocn Init Wait Time : 3.971 seconds
Estimated Ocn Init Run Time : 4.212 seconds
Estimated Run Time Correction : 0.241 seconds
(This correction has been applied to the ocean and total run times)
Runs Time in total seconds, seconds/model-day, and model-years/wall-day
CPL Run Time represents time in CPL pes alone, not including time associated with data exchange with other components
TOT Run Time: 508.774 seconds 101.755 seconds/mday 2.33 myears/wday
CPL Run Time: 27.911 seconds 5.582 seconds/mday 42.40 myears/wday
ATM Run Time: 223.893 seconds 44.779 seconds/mday 5.29 myears/wday
LND Run Time: 35.713 seconds 7.143 seconds/mday 33.14 myears/wday
ICE Run Time: 10.327 seconds 2.065 seconds/mday 114.61 myears/wday
OCN Run Time: 505.465 seconds 101.093 seconds/mday 2.34 myears/wday
ROF Run Time: 2.221 seconds 0.444 seconds/mday 532.90 myears/wday
GLC Run Time: 0.957 seconds 0.191 seconds/mday 1236.74 myears/wday
WAV Run Time: 36.272 seconds 7.254 seconds/mday 32.63 myears/wday
IAC Run Time: 0.000 seconds 0.000 seconds/mday 0.00 myears/wday
ESP Run Time: 0.000 seconds 0.000 seconds/mday 0.00 myears/wday
CPL COMM Time: 248.946 seconds 49.789 seconds/mday 4.75 myears/wday
There are 44 cores per node on this machine. In the above pe layout, the ocean model is running on relatively few cores and is holding up the rest of the model. When I increase the ocean pe count to 80 or 88 cores, POP hangs at the end of the run; I am not changing anything else. The hang appears to be associated with writing history or restart files. A 44-core run completes fine, but 80 or 88 cores do not. The default (44-core) POP block settings are
<entry id="POP_AUTO_DECOMP" value="TRUE">
<entry id="POP_BLCKX" value="64">
<entry id="POP_BLCKY" value="48">
<entry id="POP_NX_BLOCKS" value="0">
<entry id="POP_NY_BLOCKS" value="0">
<entry id="POP_MXBLCKS" value="1">
which actually uses only 32 of the 44 allocated cores. I have tried a number of different block sizes (both defaults and manually set) with 80 and 88 cores, with no luck. I have also tried switching PIO from pnetcdf to netcdf. The run always hangs in the ocean with 80 or 88 cores but NOT with 44. Any ideas? Is anyone else seeing issues like this?
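For anyone following along, the block counts implied by the settings above can be worked out with a small script. This is just a sketch of the decomposition arithmetic, not POP code; it assumes the gx1v7 (g17) grid is 320 x 384 points, and it does not model POP's land-block elimination, which drops all-land blocks and is why the real run ends up with fewer active tasks than blocks.

```python
import math

def pop_block_count(nx_global, ny_global, blckx, blcky, max_blocks_per_task=1):
    """Number of Cartesian blocks and the task count needed to hold them.

    Sketch only: land-block elimination (blocks containing no ocean
    points are discarded) is NOT modeled, so the real number of active
    tasks can be lower than tasks_needed.
    """
    nx_blocks = math.ceil(nx_global / blckx)   # blocks along x
    ny_blocks = math.ceil(ny_global / blcky)   # blocks along y
    total_blocks = nx_blocks * ny_blocks
    tasks_needed = math.ceil(total_blocks / max_blocks_per_task)
    return total_blocks, tasks_needed

# gx1v7 is nominally 320 x 384; POP_BLCKX=64, POP_BLCKY=48, POP_MXBLCKS=1
blocks, tasks = pop_block_count(320, 384, 64, 48, max_blocks_per_task=1)
print(blocks, tasks)  # 40 40 -> at most 40 of the 44 tasks can hold a block
```

With only 40 blocks available and one block per task, at most 40 of the 44 cores can do ocean work before land-block elimination; dropping the all-land blocks would bring that down further, consistent with the 32 active cores noted above.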