
Troubleshooting floating point error in cesm.log for adding new diagnostics and tracers

shiannder

james
New Member
Hi,

I am running CESM 2.2 with compset G1850ECOIAF.

I have made several large changes, following the "Add a diagnostic" and "Add a tracer" guides, to add carbon isotope (ciso) tracers and diagnostics (listed below) for the 13C and 14C (radiocarbon) content of semi-labile and refractory dissolved organic carbon. Essentially, I want to split DO13Ctot into DO13C (semi-labile) and DO13Cr (refractory), and DO14Ctot into DO14C and DO14Cr.
  • CISO_DOC_d13C
  • CISO_DOCr_d13C
  • CISO_DOC_d14C
  • CISO_DOCr_d14C
  • CISO_DO13C_prod
  • CISO_DO13Cr_prod
  • CISO_DO13C_remin
  • CISO_DO13Cr_remin
  • do13c
  • do13cr
  • do14c
  • do14cr
  • do13c_ind
  • do13cr_ind
  • do14c_ind
  • do14cr_ind
I made a new initial condition file containing the carbon isotope values and pointed to it in user_nl_pop:
  • init_ecosys_init_file = '/mnt/lustre/letscher/jsl1063/initcond/cisodocsl/jsl180_ciso_ecosys_jan_IC_gx3v7_20180308.nc'


My current model run (jsl.188), with "./xmlchange DEBUG=TRUE" set, reports "Caught signal 8 (Floating point exception: floating-point divide by zero)" in the cesm.log.26371.250311-103711 file. How can I find where this floating point error occurs?

The cesm and ocn logs are in "/mnt/lustre/letscher/jsl1063/jsl.188/run".

Let me know if you have any questions. I can attach a PDF of my methods; it includes the code changes tracked in GitHub.

Thank you,
James
 

mlevy

Michael Levy
CSEG and Liaisons
Staff member
Your CESM log contains a backtrace:

Code:
==== backtrace (tid: 241744) ====
 0  /mnt/lustre/software/ucx/1.12.1/gcc/9.1.0/lib/libucs.so.0(ucs_handle_error+0x2a4) [0x2aaac248b8d4]
 1  /mnt/lustre/software/ucx/1.12.1/gcc/9.1.0/lib/libucs.so.0(+0x2bad7) [0x2aaac248bad7]
 2  /mnt/lustre/software/ucx/1.12.1/gcc/9.1.0/lib/libucs.so.0(+0x2bf6a) [0x2aaac248bf6a]
 3  /mnt/lustre/letscher/jsl1063/jsl.188/bld/cesm.exe() [0xd31f9e]
 4  /mnt/lustre/letscher/jsl1063/jsl.188/bld/cesm.exe() [0xd106c0]
 5  /mnt/lustre/letscher/jsl1063/jsl.188/bld/cesm.exe() [0xcbbb5f]
 6  /mnt/lustre/letscher/jsl1063/jsl.188/bld/cesm.exe() [0x426e10]
 7  /mnt/lustre/letscher/jsl1063/jsl.188/bld/cesm.exe() [0x40c9e7]
 8  /mnt/lustre/letscher/jsl1063/jsl.188/bld/cesm.exe() [0x425960]
 9  /mnt/lustre/letscher/jsl1063/jsl.188/bld/cesm.exe() [0x425a6c]
10  /lib64/libc.so.6(__libc_start_main+0xf5) [0x2aaaad7cc555]
11  /mnt/lustre/letscher/jsl1063/jsl.188/bld/cesm.exe() [0x4073a9]
=================================

Since you ran in debug mode, you can use addr2line to figure out which lines of code those hex codes are referring to:

Code:
$ addr2line -e /mnt/lustre/letscher/jsl1063/jsl.188/bld/cesm.exe 0xd31f9e 0xd106c0 0xcbbb5f 0x426e10 0x40c9e7 0x425960 0x425a6c
/mnt/lustre/letscher/jsl1063/my_cesm_sandbox_isotope/components/ww3/src/source/w3iogomd.f90:508
/mnt/lustre/letscher/jsl1063/my_cesm_sandbox_isotope/components/ww3/src/source/w3wavemd.f90:859 (discriminator 1)
/mnt/lustre/letscher/jsl1063/my_cesm_sandbox_isotope/components/ww3/src/cpl_mct/wav_comp_mct.F90:889
/mnt/lustre/letscher/jsl1063/my_cesm_sandbox_isotope/cime/src/drivers/mct/main/component_mod.F90:729 (discriminator 12)
/mnt/lustre/letscher/jsl1063/my_cesm_sandbox_isotope/cime/src/drivers/mct/main/cime_comp_mod.F90:2751
/mnt/lustre/letscher/jsl1063/my_cesm_sandbox_isotope/cime/src/drivers/mct/main/cime_driver.F90:126
/mnt/lustre/letscher/jsl1063/my_cesm_sandbox_isotope/cime/src/drivers/mct/main/cime_driver.F90:23
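Copying the addresses by hand gets tedious when a log holds several backtraces. Below is a minimal sketch of automating that step, assuming the frames look like the ones above (the executable path, then the address in square brackets at the end of the line). The function name backtrace_addrs is made up for illustration and is not part of CESM or CIME:

```shell
# backtrace_addrs LOGFILE EXE_NAME  (hypothetical helper)
# Prints the bracketed hex address of every line in LOGFILE that mentions
# EXE_NAME and ends in "[0x...]", one address per line, in file order.
backtrace_addrs() {
    grep -F -- "$2" "$1" | sed -n 's/.*\[\(0x[0-9a-fA-F]*\)][[:space:]]*$/\1/p'
}

# Example usage with the paths from this thread (addr2line reads addresses
# from standard input when none are given on the command line):
#   backtrace_addrs cesm.log.26371.250311-103711 cesm.exe \
#     | addr2line -f -e /mnt/lustre/letscher/jsl1063/jsl.188/bld/cesm.exe
```

If several MPI tasks wrote backtraces to the same log, their frames may be interleaved, so the output is best read as a set of candidate locations rather than a single call stack.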

This is an issue in the wave model, and it's in a block of code we've modified to avoid other divide-by-zero issues... the line in question is

Code:
if ((LANGMT(JSEA)**2  &
    /0.4*LOG(MAX(ABS(HML(IX,IY)/4./HS(JSEA)),1.0))+COS(SWW)).eq.0.) then

So it's hard to pin down exactly but maybe we need to make sure HS(JSEA) is not 0? It's not clear to me what we should do if it is...
 

shiannder

james
New Member
Hi Mike,

Thank you for your insights. To me, it looks like the wave model is separate from my model changes in MARBL on dissolved organic carbon and isotope carbon.

I played around with the HS(JSEA) value by adding small constants to it. But it seems the MAX function inside the LOG will select 1.0 whenever ABS(HML(IX,IY)/4./HS(JSEA)) is less than 1.

My cesm.log also has errors related to:

after REGION_BOX3D print test
125 : (init_tidal_mixing1) ALL REGION_BOX2D values are zero
after REGION_BOX3D print test
Done setting default values!
NetCDF: Variable not found
NetCDF: Variable not found
NetCDF: Invalid dimension ID or name
NetCDF: Invalid dimension ID or name
NetCDF: Invalid dimension ID or name
NetCDF: Variable not found
NetCDF: Attribute not found
NetCDF: Attribute not found
NetCDF: Attribute not found
NetCDF: Variable not found
NetCDF: Attribute not found
NetCDF: Attribute not found
NetCDF: Attribute not found

Output requests :
--------------------------------------------------
no dedicated output process, any file system

I followed the directions from "Adding a Diagnostic" and "Adding a Tracer".
Adding a Diagnostic — MARBL cesm2.1 documentation
Adding a Tracer — MARBL cesm2.1 documentation

Do you know if there is additional code I should add so that my new DOC variables are read from my nc file? I think the namelist_defaults_pop.xml file may need new entries for ciso_tracer_init_ext(16, 17, 18, 19), similar to the existing ones:

<ciso_tracer_init_ext(1)%mod_varname>DI13C</ciso_tracer_init_ext(1)%mod_varname>
<ciso_tracer_init_ext(1)%file_varname>DIC</ciso_tracer_init_ext(1)%file_varname>
<ciso_tracer_init_ext(1)%scale_factor>1.025</ciso_tracer_init_ext(1)%scale_factor>
<ciso_tracer_init_ext(14)%mod_varname>diaz14C</ciso_tracer_init_ext(14)%mod_varname>
<ciso_tracer_init_ext(14)%file_varname>diazC</ciso_tracer_init_ext(14)%file_varname>
<ciso_tracer_init_ext(14)%scale_factor>1.0</ciso_tracer_init_ext(14)%scale_factor>
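One way to rule the initial condition file in or out as the source of the "Variable not found" messages is to compare the names the model will ask for (the file_varname entries) with what the file declares. Below is a rough sketch, assuming ncdump from the netCDF utilities is installed. missing_nc_vars is a made-up helper, and the variable names in the usage line are placeholders for whatever file_varname values are actually configured:

```shell
# missing_nc_vars NAME...  (hypothetical helper)
# Reads `ncdump -h` output on standard input and prints each NAME that is not
# declared as a variable in that header. Names are used inside a regular
# expression, so this assumes they contain only letters, digits and underscores.
missing_nc_vars() {
    header=$(cat)
    for v in "$@"; do
        # A declaration looks like "double DIC(z_t, nlat, nlon) ;" or "double time ;"
        printf '%s\n' "$header" \
            | grep -Eq "^[[:space:]]*[a-z0-9]+[[:space:]]+${v}[[:space:]]*[(;]" \
            || printf '%s\n' "$v"
    done
}

# Example usage (variable names are placeholders):
#   ncdump -h /mnt/lustre/letscher/jsl1063/initcond/cisodocsl/jsl180_ciso_ecosys_jan_IC_gx3v7_20180308.nc \
#     | missing_nc_vars DO13C DO13Cr DO14C DO14Cr
```

An empty result only shows that the names exist in the file; it does not say anything about dimensions or attributes, which the log also complains about.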
 

mlevy

Michael Levy
CSEG and Liaisons
Staff member
jsl.188 definitely aborted because of a problem in the wave model, and not anything in your CISO mods. It looks like jsl.220 might indicate a problem with your mods:

Code:
$ grep "MARBL ERROR" cesm.log.26845.250405-134916
(Task 1, block 1) MARBL ERROR (marbl_ciso_diagnostics_mod:store_diagnostics_ciso_interior): abs(CISO_Jint_14Ctot)= 0.317E-006 exceeds CISO_Jint_14Ctot_thres= 0.317E-006
(Task 1, block 1) MARBL ERROR (marbl_ciso_interior_tendency_mod:marbl_ciso_interior_tendency_compute): Error reported from store_diagnostics_ciso_interior
(Task 1, block 1) MARBL ERROR (marbl_interior_tendency_mod:marbl_interior_tendency_compute): Error reported from marbl_ciso_interior_tendency_compute()
(Task 1, block 1) MARBL ERROR (marbl_interface:interior_tendency_compute): Error reported from marbl_interior_tendency_compute()
(Task 1, block 1) MARBL ERROR (ecosys_driver:ecosys_driver_set_interior): Error reported from marbl_instances(1)%set_interior_forcing()
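In the output above, only the first line describes the actual failure; the remaining lines are each caller reporting that the routine it called returned an error. Below is a small sketch that filters a log down to the originating messages, assuming (as in this output) that propagated messages always contain the text "Error reported from". marbl_root_errors is a hypothetical name:

```shell
# marbl_root_errors LOGFILE  (hypothetical helper)
# Prints MARBL ERROR lines from LOGFILE, skipping the lines that only
# report an error passed up from a lower-level routine.
marbl_root_errors() {
    grep -F "MARBL ERROR" "$1" | grep -vF "Error reported from"
}

# Example usage:
#   marbl_root_errors cesm.log.26845.250405-134916
```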

I also noticed that you've already changed the default threshold (Jint_Ctot_thres_molpm2pyr) by 5 orders of magnitude. I have a few comments about this:

1. Changing defaults/json/settings_latest.json directly, rather than changing defaults/settings_latest.yaml and running MARBL_tools/yaml_to_json.py (or just using user_nl_marbl), makes it really hard to figure out why marbl_in is showing Jint_Ctot_thres_molpm2pyr = 0.0001 instead of 1e-9.
2. If you are expecting conservation check failures (maybe the conservation check itself needs to be updated to account for your changes), then you may as well comment out the checks in store_diagnostics_ciso_interior() (components/pop/externals/MARBL/src/marbl_ciso_diagnostics_mod.F90) or just turn the error into a regular message:

Code:
    if (abs(diags(ind%CISO_Jint_13Ctot)%field_2d(1)) .gt. CISO_Jint_13Ctot_thres) then
       write(log_message,"(A,E11.3e3,A,E11.3e3)") &
            'abs(CISO_Jint_13Ctot)=', abs(diags(ind%CISO_Jint_13Ctot)%field_2d(1)), &
            ' exceeds CISO_Jint_13Ctot_thres=', CISO_Jint_13Ctot_thres
-       call marbl_status_log%log_error(log_message, subname, ElemInd=1)
-       return
+       call marbl_status_log%log_noerror(log_message, subname)
    end if

(do this for 14Ctot as well!)
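For the first comment above, keeping the override in user_nl_marbl leaves a visible record in the case directory. Below is a minimal sketch, assuming user_nl_marbl takes one "name = value" setting per line. set_marbl_setting is a made-up helper, not part of CESM or MARBL, and the value shown is just the default quoted above:

```shell
# set_marbl_setting FILE NAME VALUE  (hypothetical helper)
# Leaves FILE with exactly one "NAME = VALUE" line: any earlier assignment of
# NAME is dropped, the new one is appended, and all other lines are kept.
set_marbl_setting() {
    file=$1; name=$2; value=$3
    touch "$file"
    tmp=$(mktemp) || return 1
    # grep exits non-zero when it prints nothing; that is not an error here.
    grep -v "^[[:space:]]*${name}[[:space:]]*=" "$file" > "$tmp" || true
    printf '%s = %s\n' "$name" "$value" >> "$tmp"
    cat "$tmp" > "$file"   # overwrite in place so FILE keeps its permissions
    rm -f "$tmp"
}

# Example usage, run from the case directory:
#   set_marbl_setting user_nl_marbl Jint_Ctot_thres_molpm2pyr 1e-9
#   ./preview_namelists   # should regenerate the namelists; then check marbl_in
```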
 