
Problem running WACCM on Power6 in Toronto

Hi, we've downloaded WACCM v3.1.9 from the Community Data Portal and built it with the sample scripts on the IBM Power6 machine in Toronto. However, it crashes at runtime.

We are using xlf v12 and NetCDF v4.0, and we are currently running CCSM v3 on the same system without problems. Interestingly, when I followed the instructions given on this board yesterday and compiled with -chem waccm_ghg, the model ran fine. So the problem occurs only with -chem waccm_mozart.

The model initializes ok, then crashes soon after the run begins. We see many warnings like this prior to the crash:
1: QNEG3 from Gravity waves drag/HORZ:m= 62 lat/lchnk= 40 Min. mixing ratio violated at 26 points. Reset to 1.0E+00 Worst = 1.0E+00 at i,k= 1 16
12: QNEG4 WARNING from TPHYSAC , lchnk = 40; Max possible LH flx exceeded at 1 points. Worst excess = -7.0375E-06 at i = 18

Compiling with -qsigtrap suggests that the error is an FP division by zero and/or an FP overflow in
cam3_1_9_brnchT_waccm_14/models/atm/cam/src/chemistry/waccm_mozart/mo_lu_factor.F90
or
cam3_1_9_brnchT_waccm_14/models/atm/cam/src/chemistry/waccm_mozart/mo_imp_sol.F90

We used the input data files and namelist downloaded from the data portal, and haven't changed anything with the scripts (except path names) or compiler flags. Compiling with -qnohot (apparently suggested by Jim Edwards for use with xlf v12) makes no difference in this case.

Any advice on things we could try would be gratefully received.
 

marsh

Member
Does it run at least one timestep? If so, I would check that it is not an initialization problem by dumping instantaneous output of all fields after the first timestep. If that looks OK, then run up to just before the crash and dump again. Often the chemistry is not the cause of the crash, but just an indication of problems elsewhere (e.g. a negative temperature).

You could also try running with v11 of the compiler, which is working without problems for WACCM on the NCAR P6.

BTW, if you have IDL, geov is by far the easiest way to visualize WACCM output. geov is available for download at http://www.acd.ucar.edu/Applications/
 
Hi, thanks for the suggestions.

1. Unfortunately, the model does not complete even one timestep in waccm_mozart mode (i.e. I get the same error when I set nelapse=1). However, by inserting print statements into mo_imp_sol.F90 I found that it makes it through at least 3 Newton-Raphson iterations (line 187 of that routine), each of which calls lu_fac (the routine reporting the FP divide-by-zero errors). This suggests it's not falling over immediately after initialization, but it doesn't tell us much else.

Also, when I turn *off* the -debug flag in configure, I see lots more output before the model fails. This is just a sample:
10: Op 1.000E+00
10: NOp 1.000E+00
10: N2D 1.000E+00
10: e 3.918E-01
10: imp_sol: Time step 1.1250000000000E+01 failed to converge @ (lchnk,lev,col,nstep) = 264 44 9 0
10: imp_sol: Failed to converge @ (lchnk,lev,col,nstep,dt,time) = 264 44 9 0 1.1250000000000E+01 2.1375000000000E+02

2. I've asked about xlf version 11 and the sys admin tells me we can't use it... but both the waccm_ghg mode and CCSM compile/run ok with v12.
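
For anyone facing a similar flood of imp_sol messages, a small script can tally where the non-convergence occurs, which helps show whether the failures cluster on one task, chunk, or level. The following is only a minimal sketch: it assumes the log lines have exactly the two formats shown in the sample above, and the function name tally_failures is made up for illustration.

```python
import re
from collections import Counter

# Matches both imp_sol message variants from the sample output:
#   "... failed to converge @ (lchnk,lev,col,nstep) = 264 44 9 0"
#   "... Failed to converge @ (lchnk,lev,col,nstep,dt,time) = 264 44 9 0 ..."
# Assumption: every line starts with the "<task>:" prefix seen in the sample.
_IMP_SOL = re.compile(
    r"^\s*(?P<task>\d+):\s*imp_sol:.*?converge @ \(lchnk,lev,col,nstep[^)]*\)\s*=\s*"
    r"(?P<lchnk>\d+)\s+(?P<lev>\d+)\s+(?P<col>\d+)\s+(?P<nstep>\d+)"
)


def tally_failures(lines):
    """Count imp_sol non-convergence messages per (task, lchnk, lev, col)."""
    counts = Counter()
    for line in lines:
        m = _IMP_SOL.match(line)
        if m:
            key = tuple(int(m.group(k)) for k in ("task", "lchnk", "lev", "col"))
            counts[key] += 1
    return counts


# Lines copied from the sample output in this post.
sample = [
    "10: e 3.918E-01",
    "10: imp_sol: Time step 1.1250000000000E+01 failed to converge @ (lchnk,lev,col,nstep) = 264 44 9 0",
    "10: imp_sol: Failed to converge @ (lchnk,lev,col,nstep,dt,time) = 264 44 9 0 1.1250000000000E+01 2.1375000000000E+02",
]

for key, n in tally_failures(sample).most_common():
    print(key, n)
```

In practice the list would be replaced by the lines of the real log file; if all failures share one task number or one level, that narrows down where to look.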
 

marsh

Member
Hi,

OK, so it could still be an initialization problem. Can you print the ion mixing ratios after they are read in, and verify they are the same as in the initial condition file? Have you opened the input file and verified that the input fields are reasonable? Perhaps it was corrupted during the FTP transfer. Certainly the solver will fail if multiple fields have VMRs of 1.0.
 
Thanks, do you mean this input file?
0: (GETFIL): using /scratch/cgf/waccm/inputdata/atm/waccm/lb/LBC_Scen=A1b_1950-2050_4x5.nc

Below is the standard output from chem_surfvals_init concerning the mixing ratios for a few species, all of which I think look correct compared to that input file.

Clearly imp_sol is complaining about non-convergence for a large number of species, but I wasn't sure which file the others were being read from. When I try to print out the vmr array in mo_gas_phase_chemdr.F90 it is a huge mess of output, and the values range from 1.0 to 1.0E-37 by the end: I've no idea whether this is correct or not...!

0: chem_surfvals_init: diagnostics
0: chem_surfvals_init: mean co2
0: 3.5976E-04 3.5930E-04
0: chem_surfvals_init: lbc concentrations
0: 3.1153E-07 1.6826E-06 5.0000E-07 5.5066E-10 9.3475E-12 2.6895E-10 5.1888E-10 8.1450E-11 1.1303E-10 1.0389E-10
0: 1.1176E-10 2.2518E-12 3.1357E-12 3.5810E-04
0:
0: 3.1160E-07 1.6718E-06 5.0000E-07 5.5066E-10 9.3550E-12 2.6919E-10 5.1973E-10 8.1704E-11 1.1360E-10 1.0386E-10
0: 1.1151E-10 2.2630E-12 3.1520E-12 3.5601E-04
0: -------------------------------
0: co2 volume mixing ratio = 0.355000000000000010E-03
0: ch4 volume mixing ratio = 0.171400000000000002E-05
0: n2o volume mixing ratio = 0.311000000000000019E-06
0: f11 volume mixing ratio = 0.280000000000000015E-09
0: f12 volume mixing ratio = 0.503000000000000021E-09
0: Warning: Not reading O2_1S from IC file.
0: O2_1S initialized by "chem_init_cnst"
0: Warning: Not reading O2_1D from IC file.
0: O2_1D initialized by "chem_init_cnst"
0: Warning: Not reading AOA1 from IC file.
0: ADDITIONAL_CONSTITUENTS: INITIALIZING AOA1 60
0: AOA1 initialized by "tracers_init_cnst"
0: Warning: Not reading AOA2 from IC file.
0: ADDITIONAL_CONSTITUENTS: INITIALIZING AOA2 61
0: AOA2 initialized by "tracers_init_cnst"
0: Warning: Not reading HORZ from IC file.
0: ADDITIONAL_CONSTITUENTS: INITIALIZING HORZ 62
0: HORZ initialized by "tracers_init_cnst"
0: Warning: Not reading VERT from IC file.
0: ADDITIONAL_CONSTITUENTS: INITIALIZING VERT 63
0: VERT initialized by "tracers_init_cnst"
0: CHECK_VAR Warning: variable SGH30 is not on initial dataset
0: READ_INIDAT Warning: SGH30 not found on topo dataset.
0: The field SGH30 will be filled using data from SGH.
 

marsh

Member
This output looks reasonable - it is opening the LBC file and reading in CO2, CH4, etc. I have exactly the same values in my run here. The input file I was referring to is the 3-D initial condition file specified by 'ncdata' in the namelist file. Can you verify that the ions (Op, NOp, etc.) have been read in correctly? Perhaps just print out the max and min values after initialization.

You could also try running the 1.9x2.5 case - the ions are not present in the provided ncdata file for that run, and so would be set to a small (1e-39) number at initialization.
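
The min/max check can also be done offline once the fields have been extracted from the initial condition file. The following is a minimal sketch, not part of WACCM: the function name summarize_fields, the vmr_ceiling threshold, and the example values are all hypothetical, and reading the arrays out of the ncdata file (with whatever netCDF reader is available) is left out.

```python
import math


def summarize_fields(fields, vmr_ceiling=1.0e-2):
    """Return {name: (min, max, suspicious)} for each field.

    `fields` maps a species name to a flat iterable of volume mixing ratios.
    A field is flagged as suspicious when it holds non-finite or negative
    values, or when its maximum exceeds `vmr_ceiling`. The ceiling is an
    arbitrary illustrative threshold for trace species, not a WACCM setting.
    """
    report = {}
    for name, values in fields.items():
        vals = [float(v) for v in values]
        finite = all(math.isfinite(v) for v in vals)
        lo, hi = min(vals), max(vals)
        suspicious = (not finite) or lo < 0.0 or hi > vmr_ceiling
        report[name] = (lo, hi, suspicious)
    return report


# Hypothetical values: Op mimics the "1.000E+00" seen in the failing run,
# NOp mimics a field that was set to a tiny number at initialization.
example = {"Op": [1.0, 1.0], "NOp": [1.0e-39, 2.0e-12]}

for name, (lo, hi, bad) in summarize_fields(example).items():
    print(f"{name}: min={lo:.3e} max={hi:.3e} suspicious={bad}")
```

A trace ion whose minimum and maximum are both 1.0 would stand out immediately in such a summary.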
 
UPDATE July 16 2009:

We've moved forward a little with this problem, but we're still unable to run properly using mozart.

To test the code in serial mode I reconfigured with -nosmp and -nospmd, and the model runs fine with -chem waccm_mozart. Further tests show that the SMP part is what causes the crash for us: with -nosmp, the model runs whether or not I configure with -nospmd.

Clearly, this is limited progress, as it only allows us to run on a single PE, but at least we know the problem may be related to MPI/SMP. Furthermore, it seems there is no problem with the input data, and the earlier errors we saw from imp_sol were either a red herring or indicative of the MPI-related issue.

If anyone has any suggestions relating to these new developments, we'd be happy to hear them!

Thanks.
 

fvitt

CSEG and Liaisons
Staff member
Have you tried running in a pure MPI mode?

Can you provide the following information about your Power6 system?

-How many CPUs on each node?

-Is SMT (simultaneous multi-threading) enabled on your system?

-Is the batch system Load Leveler?

-What are your Load Leveler settings?

-How many MPI tasks per node are you trying to run?

-How many threads per MPI task? What is OMP_NUM_THREADS set to?
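
The reason for these questions is that tasks per node times threads per task should not exceed the logical CPUs available on a node. As a rough sketch of that arithmetic (the function name check_layout and all numbers below are hypothetical, and the default of 2 hardware threads per core with SMT enabled is an assumption to adjust for your system):

```python
def check_layout(cores_per_node, smt_enabled, tasks_per_node,
                 omp_num_threads, smt_ways=2):
    """Compare requested threads per node against available logical CPUs.

    Returns (requested, available, fits). `smt_ways` is the number of
    hardware threads per core when SMT is on (assumed to be 2 here).
    """
    available = cores_per_node * (smt_ways if smt_enabled else 1)
    requested = tasks_per_node * omp_num_threads
    return requested, available, requested <= available


# Hypothetical node: 32 cores, SMT on, 16 MPI tasks with 4 threads each.
requested, available, fits = check_layout(32, True, 16, 4)
print(f"requested={requested} available={available} fits={fits}")
```

If the requested count is larger than the available count, the node is oversubscribed, which is worth ruling out before looking for threading bugs in the code itself.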
 