
Identifying appropriate tests and fixes for port of CESM 2.1 on a 40 core system

We plan on using up to 40 cores on a Linux cluster, focusing on CESM 2.1 for short runs of a coarse-resolution atmosphere model; see the attached describe-version file. (I was helped in forum posts around July.)

I would like guidance on which tests of our installation should be run, and, for those that are failing, whether they matter for our purposes and whether there are known fixes.

First the good news:

  • Using BFBFLAG, I ran compset F2000climo with --res f19_f19_mg17 and got identical results in the hist/...cam.h0 files, for example comparing one 5-month run on 32 cores with three continuation runs on 20 cores.
  • Using the ensemble test UF-CAM-ECT, three runs were validated (thanks for that web service!).
    I assumed the other two ensembles weren't feasible for us: 12 months with finer grids.
I selected some tests from the Cheyenne pre-alpha list: some passed and some failed. I chose those with an f19 grid, excluding those demanding 72 or 144 cores and those needing ESMF, as I haven't built that. I got the following list, with overall results from cs.status. After the summary, I describe the problems found.

ERP_D_Ln9_P16.f19_f19_mg17.FSD.eddie_intel.cam-outfrq9s_sd (Overall: FAIL) - netcdf4 problem

ERP_D_Ln9_P16.f19_f19_mg17.QPC6.eddie_intel.cam-outfrq9s (Overall: PASS)

ERP_P16x2_D_Ld5.f19_g17_gl4.I1850Clm50BgcCropG.eddie_intel.clm-default (Overall: FAIL) - undefined variable problem

ERP_P16x2_D_Ld5.f19_g17.I2000Clm50Sp.eddie_intel.clm-default (Overall: FAIL) - URBPOI subscript problem

ERS_Ld3_P16.f19_g16.X.eddie_intel (Overall: PASS)

IRT_N3_P16_Ld7.f19_g17.BHIST.eddie_intel.allactive-defaultio (Overall: FAIL) - truncation build problem (extra compiler option needed?)

PEM_P16.f19_g16_rx1.A.eddie_intel (Overall: PASS)

SMS_Ld1_P16.f19_f19_mg17.FWmaHIST_BGC.eddie_intel.cam-reduced_hist1d (Overall: FAIL) - netcdf4 problem?

SMS_D_Ln9_P16.f19_f19_mg17.FWsc2010climo.eddie_intel.cam-outfrq9s (Overall: FAIL) - netcdf4 problem

SMS_P16.f19_f19_mg17.PC4.eddie_intel.cam-cam4_port (Overall: PASS)
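For reference, the selection rule described above (f19 grids, at most 40 cores) can be sketched by parsing the test names. This is a minimal sketch, not a CIME feature: it assumes the _Pnn token gives the task count (Pnnxmm for tasks x threads) and that the grid alias is the second dot-separated field, which holds for the tests listed here.

```python
import re

# Minimal sketch: filter CESM test names by grid and requested core count.
# Assumption: the _P<nn> token gives tasks (optionally _P<nn>x<mm> for
# tasks x threads) and the grid alias is the second dot-separated field.
MAX_CORES = 40

def requested_cores(test_name):
    """Core count implied by the P token; 0 if the name has no P token."""
    m = re.search(r'_P(\d+)(?:x(\d+))?(?:_|\.)', test_name)
    if not m:
        return 0
    tasks = int(m.group(1))
    threads = int(m.group(2)) if m.group(2) else 1
    return tasks * threads

def keep(test_name):
    """True if the test uses an f19 grid and fits within MAX_CORES."""
    grid = test_name.split('.')[1]
    return grid.startswith('f19') and 0 < requested_cores(test_name) <= MAX_CORES

tests = [
    "ERP_D_Ln9_P16.f19_f19_mg17.FSD.eddie_intel.cam-outfrq9s_sd",
    "ERS_Ld3_P72.f09_g17.B1850.eddie_intel",  # hypothetical 72-core test, excluded
]
print([t for t in tests if keep(t)])
```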

Am I running tests that are out of scope, and missing some that I should run?

Of those that failed in the above list, I report the problems in separate messages in this thread…
 

Attachments

  • eddie_describe_version.txt
NetCDF4 problem

Three of the Cheyenne tests reported failures where the cesm.log file gives:

"NetCDF: Attempt to use feature that was not turned on when netCDF was built...

cesm.exe 0000000008F1436E pio_support_mp_pi 118 pio_support.F90
cesm.exe 0000000008F12650 pio_utils_mp_chec 59 pio_utils.F90
cesm.exe 00000000090DF29B ionf_mod_mp_open_ 235 ionf_mod.F90
cesm.exe 0000000008F1058E piolib_mod_mp_pio 2831 piolib_mod.F90"

For two of the three instances, the last .nc file mentioned in the atm log file is in netcdf4 format. (Using an ncdump built against pnetcdf I can recreate the problem; the system's default "ncdump -k" says the files are netcdf4 format.) My understanding was that pnetcdf does not and cannot read netcdf4, or is this a failure of my netcdf installation?
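That understanding is consistent with the error: PnetCDF only reads the classic CDF formats, while netCDF-4 files are HDF5 containers. The format can be confirmed without ncdump by looking at the file's magic bytes; a minimal sketch, using the documented netCDF/HDF5 signatures:

```python
# Minimal sketch: classify a netCDF file by its magic bytes, without ncdump.
# netCDF-4 files are HDF5 containers (signature \x89HDF\r\n\x1a\n); PnetCDF
# can only read the classic CDF-1/CDF-2/CDF-5 formats, which matches the
# "feature not turned on" error above.

HDF5_MAGIC = b"\x89HDF\r\n\x1a\n"

def nc_format(path):
    """Return a rough format label based on the first 8 bytes of the file."""
    with open(path, "rb") as f:
        magic = f.read(8)
    if magic == HDF5_MAGIC:
        return "netcdf4 (HDF5-based; not readable by pnetcdf)"
    if magic.startswith(b"CDF\x01"):
        return "classic (CDF-1)"
    if magic.startswith(b"CDF\x02"):
        return "64-bit offset (CDF-2)"
    if magic.startswith(b"CDF\x05"):
        return "64-bit data (CDF-5)"
    return "unknown"
```

Running nc_format over the files named in the atm log should show which ones are netcdf4 and therefore trip a pnetcdf-based reader.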

The files are:

  • For ERP_D_Ln9_P16.f19_f19_mg17.FSD.eddie_intel.cam-outfrq9s_sd: cesm_indata/atm/cam/solar/SolarForcingNRLSSI2_daily_s18820101_e20171231_c180702.nc
  • For SMS_D_Ln9_P16.f19_f19_mg17.FWsc2010climo.eddie_intel.cam-outfrq9s: atm/waccm/waccm_forcing/SCWACCM_forcing_WACCM6_zm_5day_L70_1975-2014_c180216.nc
  • For SMS_Ld1_P16.f19_f19_mg17.FWmaHIST_BGC.eddie_intel.cam-reduced_hist1d: I got the same trace in the log, but "ncdump -k" says the last named file is classic, not netcdf4: atm/cam/chem/emis/CMIP6_emissions_1750_2015_2deg/emissions-cmip6_so4_a2_contvolcano_vertical_850-5000_1.9x2.5_c20190417.nc, so I'm puzzled that it failed. I notice in atm_in there is a reference to a file that does elicit the same error in ncdump: "/exports/csce/eddie/geos/groups/cesd/CESM/cesm_indata/atm/cam/chem/stratvolc/VolcanEESMv3.11_SO2_850-2016_Mscale_Zreduc_2deg_c180812.nc: NetCDF: Attempt to use feature that was not turned on when netCDF was built."
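To track down which input file a run actually tripped on, the .nc paths can be pulled out of atm_in and each checked in turn with "ncdump -k". A minimal sketch, assuming each input file appears in the namelist as a quoted string ending in .nc, as in the excerpts above:

```python
import re

def nc_paths(namelist_text):
    """Return the .nc paths referenced in an atm_in-style namelist text.

    Assumption: each input file appears as a single- or double-quoted
    string ending in .nc.
    """
    return re.findall(r"""['"]([^'"]+\.nc)['"]""", namelist_text)

# Each returned path could then be checked with, e.g., "ncdump -k <path>"
# to see which of the referenced files are in netcdf4 format.
```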
 
Undefined variable problem:

ERP_P16x2_D_Ld5.f19_g17_gl4.I1850Clm50BgcCropG.eddie_intel.clm-default


cesm.log file:

MCT::m_Router::initp_: GSMap indices not increasing...Will correct

calling getglobalwrite with decomp_index= 790 and clmlevel= pft

ERROR: get_proc_bounds ERROR: Calling from inside a threaded region

forrtl: severe (174): SIGSEGV, segmentation fault occurred

forrtl: severe (194): Run-Time Check Failure. The variable 'urbanfluxesmod_mp_urbanfluxes_$PI' is being used in '/exports/csce/eddie/geos/groups/cesd/CESM/my_cesm_sandbox/components/clm/src/biogeophys/UrbanFluxesMod.F90(577,16)' without being defined
 
URBPOI subscript problem

ERP_P16x2_D_Ld5.f19_g17.I2000Clm50Sp.eddie_intel.clm-default

cesm.log file:
Creating variable thk
Creating variable topg
Creating variable usurf
Writing to file ERP_P16x2_D_Ld5.f19_g17.I2000Clm50Sp.eddie_intel.clm-default
.AL1.cism.initial_hist.0001-01-01-00000.nc at time 0.000000000000000E+000
MCT::m_Router::initp_: GSMap indices not increasing...Will correct
MCT::m_Router::initp_: RGSMap indices not increasing...Will correct
MCT::m_Router::initp_: RGSMap indices not increasing...Will correct
MCT::m_Router::initp_: GSMap indices not increasing...Will correct
forrtl: severe (408): fort: (2): Subscript #1 of the array URBPOI has value 1081231932 which is greater than the upper bound of 1924
 

jedwards

CSEG and Liaisons
Staff member
The files in question from your first post all have pnetcdf-compatible replacements; if you update to release 2.1.3 you should get these updated files.
For example, atm/cam/solar/SolarForcingNRLSSI2_daily_s18820101_e20171231_c180702.nc is replaced by atm/cam/solar/SolarForcingNRLSSI2_daily_s18820101_e20171231_c191122.nc.

You should be able to use all of the port changes you've already made in this new code base.
 
I'm trying the update now.
Do you have any comments on
- whether I am running the right set of tests (and ignoring the right ones)?
- and, in light of that, whether the undefined variable problem and the subscript problem are ones I need to fix, or are irrelevant to our purposes (the atmosphere model)?
Thanks
 