
SIGFPE errors in prealpha tests for CESM 2.1.3

cemac-ccs

Chris Symonds
New Member
I am trying to verify a port on the Archer2 computer in Edinburgh and am running the pre-alpha tests. After much trial and error I have got the number of failed tests down to "only" 29 out of 70 tests. A full 19 of those remaining 29 are floating point errors in one of three files:
  • /components/cam/src/physics/cam/micro_mg2_0.F90:1651 - 11 tests (Assorted seemingly unrelated tests)
  • /components/clm/src/biogeochem/ch4Mod.F90:3555 - 4 tests (Clm50Bgc tests)
  • /components/ww3/src/cpl_mct/wav_comp_mct.F90:761 - 4 tests (spacecurve tests)

I can't find much (if any) information on the forums from others who have had similar problems, so I was hoping that someone could tell me where I am going wrong.

My config XML files are attached if that helps, along with example output from one of the failed tests (SMS_D_Ln9.f19_f19_mg17.FWsc2010climo.archer2_gnu.cam-outfrq9s).
 

Attachments

  • config_machines.txt (5.2 KB)
  • config_compilers.txt (4.4 KB)
  • cesm.log.1363100.220331-234045.txt (373.5 KB)
  • config_batch.txt (3.3 KB)
  • describe_version.txt (7 KB)

cemac-ccs

Chris Symonds
New Member
A solution for anyone finding this post by searching:

The problem in these 19 tests, and in the three files they reference, is that the conditionals in all three files rely on short-circuit evaluation, which the Fortran standard permits but does not guarantee.

This means that there are conditionals of the form
Code:
if (A .and. B) then....
where B cannot be safely evaluated when A is false, for example
Code:
if ( x /= 0 .and. y/x > c ) then....
which would result in a divide-by-zero error if the second condition were evaluated after the first condition had failed.

When using the gnu compiler at optimisation levels below -O1, both conditions will be evaluated in all cases. However, if using the intel compiler, or gnu at -O1 or above, the compiler will 'short-circuit' this conditional and return
Code:
.FALSE.
after the first condition has failed. When running the pre-alpha tests the flag 'DEBUG' is set to 'TRUE', and thus the optimisation is set low enough that the conditional is not short-circuited, the second condition is evaluated, and the floating point error is raised.

The fix for this problem is to separate out the conditions in the affected files, resulting in a nested if statement.
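As a minimal sketch of that change, using the example condition from above rather than the actual CESM source: the variable names x, y and c come from that example, while the program name and the logical `exceeds` are hypothetical and only there to make the snippet self-contained.

```fortran
program nested_if_sketch
  implicit none
  real :: x, y, c
  logical :: exceeds   ! hypothetical flag, for illustration only

  x = 0.0
  y = 1.0
  c = 2.0

  ! The original form relies on short-circuit evaluation:
  !   if ( x /= 0 .and. y/x > c ) then ...
  ! In the nested form below, y/x is only evaluated once x is
  ! known to be nonzero, regardless of optimisation level.
  exceeds = .false.
  if ( x /= 0.0 ) then
     if ( y/x > c ) then
        exceeds = .true.
     end if
  end if

  print *, 'exceeds = ', exceeds
end program nested_if_sketch
```

With x set to zero as here, the inner test is never reached, so no division takes place even when both halves of a combined conditional would otherwise be evaluated.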

This behaviour was previously reported in a GitHub issue for the WW3-CESM component and a fix was applied, but only after the tagged release for CESM 2.1.3. The file in which the error appears in CAM exists only in the 2.1 release branch, not the 2.2 release branch, and as far as I can see the code in question does not appear in a different file in 2.2. In the CTSM repo the problem code has also been fixed in the master branch, but not in any of the current tagged releases (up to release 5.0.35).

Anyone porting CESM 2.1.3 should note that, because this behaviour only occurs when DEBUG=TRUE and running with the gnu compiler, it will not affect any production runs. However, if your users are likely to troubleshoot cases by setting DEBUG=TRUE, they may encounter this error.
 