Runtime error: OOB latitude and longitude during CLM initdata regridding

Seb Eastham
New Member
Hello,

I am porting CESM 2.1.1 to another machine, and trying to run the FCHIST compset (1.9x2.5 degree resolution, 32 cores). I am using Intel's most recent compilers (2022.0.2) and OpenMPI v4.0.4. I'm finding a very strange error. During initialization, when CLM tries to regrid the initial conditions, it throws this error:

ERROR initInterp set_mindist: Cannot find any input points matching output point:
subgrid level, index = pft 92169
lat, lon = 1.745329251994330E+034 , 1.745329251994330E+034
ltype: 8
ctype: 71
ptype: 0


At first I thought this might be the issue noted in (e.g.) CESM2 error: initinterp set_mindist: Cannot find any input points matching output points, but I double-checked the input file and it seems fine (correct size, no change after re-downloading, verified against a copy on a machine which DOES work). However, I then realized that the lat and lon reported in the error are both insane. Looking at initInterpMindist, it seems that the output grid on one or more PEs must be corrupted. I've verified that the compilers, MPI implementation, and ESMF installation are all working (they run a separate model across 3 nodes). Are there any known causes for this kind of behavior? If so, how might I resolve it?
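For anyone repeating the file check: comparing the downloaded file against the copy on the working machine can be done by hashing both and comparing digests. A rough sketch in Python (the path below is just a placeholder for the actual initial-conditions file):

import hashlib

def sha256sum(path, chunk_size=1 << 20):
    # Hash the file in 1 MiB chunks so large netCDF files don't need to fit in memory.
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()

# Run this on both machines and compare the printed digests.
print(sha256sum("path/to/finidat_input_file.nc"))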
 

Erik Kluzek
CSEG and Liaisons
Staff member
Since you are trying to get this working on a new machine, I'd suggest starting with simpler things first and seeing if they work. For instance, since it's failing in CLM, try running the case with just CLM (an I compset), and then see if you can simplify further. Can you get a single-point case to work with mpi-serial, for example? Rather than interpolating initial conditions, also see if it will run if you do a cold start.

It can also help, if you have access to a supported machine (such as cheyenne), to try the same thing there and see how it behaves. That might give you clues as to where things start to go wrong on the new machine. It sounds like you are already doing that, which is good; I'd recommend doing each step on both machines and comparing what happens. At some point you should have something that works on both, and then a change that makes it fail only on the new machine.

There are some suggestions on porting to other machines here:


There is obviously something going badly wrong here, and one possible explanation is a problem at the compiler-optimization level. Seeing whether you get the same behavior with optimization turned all the way down would be helpful. You can check that by setting DEBUG=TRUE, which will also turn on subscript-bounds checking and other runtime checks that could be relevant here, so definitely run that way and see what happens. We have seen compiler optimization produce bad code in some cases and send a run completely off the rails; that could be what's happening here. When that's the case it usually turns out to be just one or a few problematic files, and you can lower the compiler's optimization level for only the file(s) that need it.
 

Seb Eastham
New Member
Thanks for this advice! I have switched over to a pure CLM case (IHistClm50Sp at resolution f19_f19_mg17). Compiling after setting DEBUG=TRUE does not appear to have had any effect, sadly. Since the error crops up when a lat/lon combination in the output grid becomes illogical, I tried printing the min and max lon and lat on each PE's output grid. It appears that some subgrids have lats and lons which are "OK" (at least, within 0 - 2*pi) while some very much are not. My fear is that this issue is happening somewhere deep within ncd_io. I added the following line in initInterp.F90, immediately after the read_var_double call for pfts1d_lon, where the longitudes are read in from a file generated by CLM (finidat_interp_dest.nc):

! Printed fields: beg, end, local count (end+1-beg), position of the max lon, then min lon, max lon, lon(beg), lon(end)
write(iulog,'(a,4(x,I8),4(x,E16.5E4))') ' --> ', beg, end, end + 1 - beg, maxloc(subgrid%lon), minval(subgrid%lon), maxval(subgrid%lon), subgrid%lon(beg), subgrid%lon(end)
The results are as follows *when the output grid is being read in* (showing 3 good and 1 bad):

--> 25557 29181 3625 466 0.00000E+0000 0.35750E+0003 0.90000E+0002 0.27500E+0003
--> 63183 66901 3719 3090 0.00000E+0000 0.10000E+0037 0.21750E+0003 0.26750E+0003
--> 21878 25556 3679 511 0.75000E+0001 0.35500E+0003 0.77500E+0002 0.10500E+0003
--> 33034 36796 3763 2086 0.50000E+0001 0.35750E+0003 0.11500E+0003 0.29000E+0003



No such issues happen during the call that populates the input grid. What's interesting to me is that the first and last longitudes are fine, but the maximum value isn't; unfortunately, I can't imagine what would cause that. Looking at the file in question (finidat_interp_dest.nc), the _FillValue and missing_value attributes are 1e+36 (i.e. 0.10000E+0037 in the format above), but using Python to read the file shows that there should be no missing values.
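For reference, the Python check of the file was along these lines (a sketch rather than the exact script; it assumes the netCDF4 and numpy packages, and the path is a placeholder for wherever finidat_interp_dest.nc sits in the run directory):

import numpy as np
from netCDF4 import Dataset

# Read pfts1d_lon from the interpolated-destination file; netCDF4 masks any
# values matching _FillValue / missing_value.
with Dataset("path/to/finidat_interp_dest.nc") as nc:
    lon = nc.variables["pfts1d_lon"][:]

print("masked (fill) values:", np.ma.count_masked(lon))
print("min/max lon:", float(lon.min()), float(lon.max()))
# Anything far outside 0-360 (e.g. ~1e+36) would mean fill values are present after all.
print("values above 360:", int(np.sum(np.asarray(lon) > 360.0)))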
 

Seb Eastham
New Member
A quick update: I have now confirmed that this error occurs on our cluster when using either Intel (2022) or GNU (6.3.1) compilers, and with either CESM 2.1.1 or CESM 2.2.
 

Erik Kluzek
CSEG and Liaisons
Staff member
Does this occur only when you are using more than one MPI task? It would be good to see if this is a multiprocessing problem in MPI. If you can run a smaller case with one processor, or even a single-point case, that would be an interesting test.
 

Seb Eastham
New Member
Thanks for this tip! It turns out that running with 1, 2, or 16 tasks works fine, but running with 17+ fails. This is significant because our nodes each hold 16 tasks, so the issue seems specific to multi-node runs. As stated earlier, I do at least know that our MPI installation is functional (since we run the GCHP MPI code with it), but I wonder if CESM is exercising some aspect of it that GCHP doesn't. Any ideas would be very welcome!
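To rule out anything basic at the MPI layer, a bare-bones multi-node check run outside CESM, with more ranks than fit on one node (17+ in our case), might look something like the sketch below (it assumes mpi4py is available on the cluster; the script name and rank count are placeholders):

# mpi_check.py -- run with e.g.: mpirun -np 32 python mpi_check.py
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# Broadcast an array of known values from rank 0 and verify it arrives intact everywhere.
expected = np.arange(1000, dtype=np.float64)
data = expected.copy() if rank == 0 else np.empty(1000, dtype=np.float64)
comm.Bcast(data, root=0)
ok = bool(np.array_equal(data, expected))

# Gather per-rank results so rank 0 can report any ranks that saw corrupted data.
results = comm.gather(ok, root=0)
if rank == 0:
    bad = [i for i, r in enumerate(results) if not r]
    print("all ranks OK" if not bad else "corrupted data on ranks: %s" % bad)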
 