
problems with running 0.47x0.63 CESM

Hi CESM community:
I am trying to run high-resolution simulations (0.47x0.63, f05_g16) and am having trouble running a CESM1.2 F compset. In debug mode, the error message is:
MPT ERROR: Rank 132(g:132) received signal SIGFPE(8)
To my limited knowledge, this could be related to bad numbers being written into the output file. However, I did not modify the .F90 files shown in the log:
--

80:MPT: #1 0x00002b7160094db6 in mpi_sgi_system (
80:MPT: #2 MPI_SGI_stacktraceback (
80:MPT: header=header@entry=0x7fff84208a40 "MPT ERROR: Rank 80(g:80) received signal SIGFPE(8).\n\tProcess ID: 13560, Host: r4i6n4, Program: /glade/scratch/dleung/CESM/trial_res_047x063/cesm.exe\n\tMPT Version: HPE MPT 2.19 02/23/19 05:30:09\n") at sig.c:340
80:MPT: #3 0x00002b7160094fb2 in first_arriver_handler (signo=signo@entry=8,
80:MPT: stack_trace_sem=stack_trace_sem@entry=0x2b716a680080) at sig.c:489
80:MPT: #4 0x00002b716009534b in slave_sig_handler (signo=8, siginfo=<optimized out>,
80:MPT: extra=<optimized out>) at sig.c:564
80:MPT: #5 <signal handler called>
80:MPT: #6 0x00000000039428da in subgridavemod::p2g_1d (lbp=325327, ubp=329278,
80:MPT: lbc=50783, ubc=51262, lbl=32227, ubl=32595, lbg=25029, ubg=25340,
80:MPT: parr=..., garr=..., p2c_scale_type=..., c2l_scale_type=...,
80:MPT: l2g_scale_type=..., .tmp.P2C_SCALE_TYPE.len_V$1118=8,
80:MPT: .tmp.C2L_SCALE_TYPE.len_V$111b=8, .tmp.L2G_SCALE_TYPE.len_V$111e=8)
80:MPT: at /gpfs/fs1/work/dleung/cesm1_2_2_1_diameter_roughness_clayfrc_LULC/models/lnd/clm/src/clm4_0/main/subgridAveMod.F90:762
80:MPT: #7 0x000000000362a05b in histfilemod::hist_update_hbuf_field_1d (t=1, f=119,
80:MPT: begp=325327, endp=329278, begc=50783, endc=51262, begl=32227, endl=32595,
80:MPT: begg=25029, endg=25340)
80:MPT: at /gpfs/fs1/work/dleung/cesm1_2_2_1_diameter_roughness_clayfrc_LULC/models/lnd/clm/src/clm4_0/main/histFileMod.F90:1150
80:MPT: #8 0x0000000003626507 in histfilemod::hist_update_hbuf ()
80:MPT: at /gpfs/fs1/work/dleung/cesm1_2_2_1_diameter_roughness_clayfrc_LULC/models/lnd/clm/src/clm4_0/main/histFileMod.F90:1063
80:MPT: #9 0x0000000003471604 in clm_driver::clm_drv (doalb=.FALSE.,
80:MPT: nextsw_cday=1.0625, declinp1=-0.40294823456129064,
80:MPT: declin=-0.4030289369547867, rstwr=.FALSE., nlend=.FALSE., rdate=...,
80:MPT: .tmp.RDATE.len_V$2ef8=32)

--
I did not modify subgridAveMod.F90 or histFileMod.F90. I think the error at line 762 of subgridAveMod.F90 is in a line that averages pft-level quantities parr(p) to grid-level quantities garr(g). I am not familiar with this code, but I guess parr(p) is a generic dummy argument through which any pft-level variable gets aggregated to grid level. All of my code runs for many years at 1.9x2.5 (f19_g16) and at 0.9x1.25 (f09_g16), but the error occurs when running at 0.47x0.63, and I have no idea how to fix it.
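For readers unfamiliar with CLM's subgrid hierarchy, the operation at the failing line is, in essence, a weighted average of PFT-level values onto each grid cell. A minimal sketch in Python of that idea (the names `p2g_1d`, `parr`, `garr`, and the weight handling are illustrative, not the actual CLM4 interfaces):

```python
# Illustrative sketch of p2g averaging (NOT the actual CLM code): each
# grid cell g averages its PFT-level values parr[p], weighted by each
# PFT's fractional area wt[p]. With debug-mode floating-point trapping,
# a NaN in parr or a zero total weight (e.g. at a land/ocean boundary
# cell) is exactly the kind of thing that aborts with SIGFPE.
import math

def p2g_1d(parr, wt, pfts_of_gridcell):
    """Average PFT-level values onto grid cells.

    parr: list of PFT-level values
    wt:   list of PFT weights (fractional areas)
    pfts_of_gridcell: list of lists; PFT indices owned by each grid cell
    """
    garr = []
    for pfts in pfts_of_gridcell:
        wsum = sum(wt[p] for p in pfts)
        vsum = sum(wt[p] * parr[p] for p in pfts)
        if wsum == 0.0 or any(math.isnan(parr[p]) for p in pfts):
            # The Python analogue of the SIGFPE: refuse to divide.
            garr.append(float("nan"))
        else:
            garr.append(vsum / wsum)
    return garr
```

The sketch illustrates why the trap can appear only at one resolution: whether a grid cell ends up with zero total weight or an uninitialized PFT value depends on the surface dataset for that grid.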
I have attached the cesm and lnd log files. Any help or comments would be greatly appreciated, and I can provide further information if helpful.
Thank you,
Danny Leung

Some paths if helpful:
My case directory: /glade/scratch/dleung/CESM/trial_res_047x063

The cesm log file (in debug mode): /glade/scratch/dleung/CESM/trial_res_047x063/run/cesm.log.211117-140041
Some modified codes that work in coarse resolutions: /glade/scratch/dleung/CESM/trial_res_047x063/SourceMods/src.clm
My source code directory: /gpfs/fs1/work/dleung/cesm1_2_2_1_diameter_roughness_clayfrc_LULC/
 

Attachments

  • cesm.log.211117-140041.txt (476.3 KB)
  • lnd.log.211117-140041.txt (90.9 KB)

Erik Kluzek
CSEG and Liaisons, Staff member
Hmmm. I have a few suggestions here. One: is there a reason you need to use CESM1.2? I would suggest trying the same thing with a newer version of the model, such as CESM2.1.3, and seeing whether you get the same problem. Another suggestion would be to try a different resolution for the same case and make sure that works for you. Replicating a case that was run for this version would be a good idea as well. It would be good to establish that the only thing you've done differently here is running at the FV half-degree resolution.

And yes, I'd agree with your diagnosis: from the traceback above, you are getting a floating-point exception at line 762 of subgridAveMod.F90. The thing to figure out is why that's happening.
 
Thanks for your reply, Erik.
You are right; I was using the older version because I wanted to compare our newly developed parameterization in the older CESM1/CLM4/CAM4-BAM against the newer CESM2.1.1/CLM5/CAM6-MAM4. My guess is that during the p2g averaging there is some mismatch or bad definition of the grids at the higher resolution, so some NaN or negative value from lake or ocean cells was used and triggered the floating-point exception; that is just a guess. parr and garr are dummy arguments, so it's difficult to trace exactly which variable or calculation caused the problem. I'm not sure how to do that; maybe I should look at the upper-level code (histFileMod) and see which variables hist_update_hbuf_field_1d was handling. Yes, the same experiments were successful at 1.9x2.5 and 0.9x1.25.
Now I am trying the same thing with CESM2.1.1/CLM5 to see whether I can run the 0.47x0.63 simulations. If that succeeds, I will try to figure this out, or I may abandon this experiment. Thanks, Erik!
 