
How to see the results of preprocessing the C files

raeder

Member
I've merged (mostly) CIME maint_5.8 into a CESM2_1 that I'd like to keep using in a series of experiments started on cheyenne.
My case builds and runs one 6-hour span, but dies at the start of the second run with the error:
Abort with message NetCDF: Index exceeds dimension bound in file
/glade/derecho/scratch/csgteam/temp/spack/derecho/23.09/builds/spack-stage-parallelio-2.6.2-zyhuubh2c6tdzo3o3zugo55lv6atxzzv/spack-src/src/clib/pio_getput_int.c
at line 1212

I don't have access to that file, so I can't see what's happening at line 1212.
The file it was generated from has nothing related to the error message at its line 1212.
I've set DEBUG = TRUE in env_build.xml and even added -save-temps to the ftn compiler options,
but I'm not seeing any kind of pio_getput_int file in $EXEROOT/..., where the Fortran preprocessed files (*.i90) are archived.
Is there a way for me to see this spack-src/src/clib/pio_getput_int.c file?
 

jedwards

CSEG and Liaisons
Staff member
You can look at the ParallelIO 2.6.2 source at ParallelIO/src/clib/pio_getput_int.c at pio2_6_2 · NCAR/ParallelIO,
but I doubt that's going to help you find the problem. The issue is probably in the start and count arrays that you are passing into that routine, or in the
dimensions defined in the file you are trying to read or write; the error means you are trying to read or write something outside the bounds of a defined array.

But maybe you should ask someone to help you move your experiment to the cesm2.1.5 that runs on derecho.
 

raeder

Member
Hi Jim,
thanks for the quick reply and pointer. I don't have much experience with preprocessors,
but the code there doesn't look like preprocessor output; it still contains text-substitution macros like __FILE__.
The name of the file is actually one of the clues I'm hoping to find.
I haven't been able to figure it out from the log files.
The last one mentioned in the cesm.log file is ZO_UV2_debug_fpp.cam_0002.rs.2018-01-18-21600.nc.
The opening of that file is followed by several
photo_timestep_init: diagnostics
calday, last, next, dels = 18.3958333333333 1 2 8.261494252873559E-002
-----------------------------------
Then it aborts with the dimension bound problem.
Does it seem likely that reading that file is the problem?
It was created by the first 6-hour forecast, and the only thing I changed for the second is CONTINUE_RUN=TRUE.

Meanwhile, I noticed that the output from the preprocessor(?) for my latest test is in
/glade/derecho/scratch/jedwards/tmp/spack-stage/spack-stage-parallelio-2.6.2-q7fyefeg5lg44337zrklqh6rduj62g2m/spack-src/src/clib/pio_getput_int.c
instead of /glade/derecho/scratch/csgteam.
I still can't see it, but it seems slightly more accessible now!

There's certainly wisdom in using only what's proven to run on a new machine,
but making that change causes problems downstream in the interpretation of results
and documentation of code in publications. So I'm giving this my best shot.
I'm also learning a lot about the inner workings of CESM.
 

raeder

Member
Thanks for offering! I was hoping that there were just a couple bits of knowledge
that I needed to debug it myself, since this is a non-standard CESM.
But if that's not the case, my latest experiment is in
/glade/work/raeder/Exp/ZO_UV2_debug_fpp and
/glade/derecho/scratch/raeder/ZO_UV2_debug_fpp/run
The CESM is /glade/work/raeder/Models/cesm2_1_m5.8

It's a 3-instance CAM data assimilation test. I haven't yet gone to the trouble of recreating it
in a single-instance case.
 

jedwards

CSEG and Liaisons
Staff member
I cloned your case in /glade/derecho/scratch/jedwards/ZO_UV2_debug_fpp
and found that the problem seems to be in reading file:
/glade/campaign/cesm/cesmdata/inputdata/atm/cam/ozone_strataero/ozone_strataero_WACCM_L70_zm5day_18500101-20150103_CMIP6ensAvg_c180923.nc

Look for differences in tracer_data.F90 between your version and cesm2.1.5
 

raeder

Member
So far I don't see in your clone that reading the ozone file is the problem.
The last mention of the ozone file in the atm.log file is followed by all the GW stuff,
liquid and ice optics, microphysics, the MASTER field list and ends with "Included fields"
Then the action seems to move to CLM. The first ERROR in the cesm.log file
is the failure to find Z_OSSE_Trop_UV2.clm2.r.2018-01-18-00000.nc.
There are 2 problems there: it has the wrong date (it should be 18-21600,
because my problem happens only in the second 6-hour forecast),
and there aren't *any* CLM restart files in your rundir.
What am I missing?

I compared the tracer_data.F90 files and see only that the 2.1.5 has
an additional subroutine "vert_interp_ub_var", which does vertical interpolation
in the top layer.
There are also new error checks involving top_layer and top_bndry,
and not getting a variable ID from a file. I don't think a dimension problem
would trigger that, so it seems irrelevant to my failure.
 

jedwards

CSEG and Liaisons
Staff member
I think I was wrong about that, but the method I am recommending to resolve the problem is sound: can you
try updating the file components/clm/src/main/ncdio_pio.F90.in to the one used in 2.1.5? Possibly also the file restFileMod.F90 in the same directory.
 

raeder

Member
Those 2 files are identical in the 2 CLMs (mine: 5.0.14; cesm2.1.5: 5.0.37).

Meanwhile, I discovered env_run.xml:PIO_DEBUG_LEVEL and set it to 2.
I'm wading through lots of output, but the job didn't go as far (didn't get to CLM),
so I suspect that this setting activated a check that killed it sooner than necessary(?).
In all the pio_log_###.log files the last message about opening files referenced
/glade/campaign/cesm/cesmdata/cseg/inputdata/atm/waccm/lb/LBC_2014-2500_CMIP6_SSP370_0p5degLat_GlobAnnAvg_c190301.nc

The traceback has cam_grid_support.F90 as the last known module.
The listed line is
call pio_read_darray(File, varid, iodesc, hbuf, ierr)
where File is type(file_desc_t)
 

jedwards

CSEG and Liaisons
Staff member
You don't want to do that. If you want to debug at a particular spot in the code, instead
add a call to pio_setdebuglevel(3) before and pio_setdebuglevel(0) after the PIO call you are interested in.
 

raeder

Member
Well, that takes us back to my original question of how to see the temporary C source file created by the preprocessor.
So I gave up on trying to track the problem through the C code and focused on the Fortran code.
My prints eventually showed that the mosart history restart file name (.rh0.) had garbage character(s) after the end of the file name.
I couldn't see where those characters were coming from, until it occurred to me that those file names
were coming from the mosart restart file (.r.) from the first cycle. The file names were corrupted in there.
That led me to RtmIO.F90:ncd_io_char_var1_nf; elseif (flag == 'write'). I found that I could pad the rest of the tmpString variable with ' '
and set the number of 'good' characters to 255 (size(tmpString)), which prevented the garbage from becoming part of the file names
when they were written to the restart file.

It works now, but I'm guessing that this is not the preferred solution, and it's probably not a problem
in the derecho-compatible releases.
So this issue can be closed.
 

jedwards

CSEG and Liaisons
Staff member
I think that the file you want to check in Mosart is RtmHistFile.F90 and/or RtmRestFile.F90.
You could just try updating mosart to the release-cesm2.0.04 tag.
 