CLM floating point error in multi-instance CAM run

raeder

Member
I'm having trouble debugging this problem because there's not much output about it.
The context is running CAM5+DART using the multi-instance capability. After
a half dozen successful short forecasts and assimilations, one of the CLM members
fails to finish initializing. The only message I can find is in the ccsm log file:
_pmii_daemon(SIGCHLD): [NID 00861] [c4-0c0s1n3] [Thu May 10 21:17:55 2012]
PE 110 exit signal Floating point exception
[NID 00861] 2012-05-10 21:17:55 Apid 7407109: initiated application termination
Application 7407109 exit codes: 136
Application 7407109 exit signals: Killed
Application 7407109 resources: utime ~5379s, stime ~32s

There's nothing unusual (compared to successful instances) in the clm instance 56 log file.
It just ends.

I've looked for NaNs in the CLM restart file, but see nothing unusual (irrig_rate is full of them,
but that's true for the other restart files).
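
For anyone repeating this check, here is a minimal sketch of comparing NaN counts per variable between a good and a suspect instance. It assumes the restart variables have already been read into plain Python lists (the NetCDF reading step is not shown), and the variable name other_var and all values are made up for illustration; only irrig_rate comes from my files.

```python
import math

def nan_counts(fields):
    """Return {variable name: number of NaN entries} for a dict of value lists."""
    return {name: sum(1 for v in values if math.isnan(v))
            for name, values in fields.items()}

def differing_vars(good, suspect):
    """Names whose NaN count differs between a good and a suspect instance."""
    good_counts = nan_counts(good)
    suspect_counts = nan_counts(suspect)
    return sorted(name for name in suspect_counts
                  if suspect_counts[name] != good_counts.get(name))

nan = float("nan")
# Hypothetical data: irrig_rate is all NaN in both files, so it is not flagged.
good = {"irrig_rate": [nan, nan, nan], "other_var": [271.3, 280.1, 265.0]}
suspect = {"irrig_rate": [nan, nan, nan], "other_var": [271.3, nan, 265.0]}

print(differing_vars(good, suspect))
```

A variable that is NaN-filled in every instance drops out of the comparison, so only variables that differ from the successful instances are listed. Note that a check like this says nothing about values that are finite but out of range.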

I've set INFO_DBUG = 2 and compiled with no optimization.

I haven't built a single-instance CAM and fed these ICs to it. Would that be worth the effort?

Is there anything else I can do to get more information about the death?

Thanks,
Kevin
 

raeder

Member
Using the ddt debugger on the core file on hopper, I've found that
clm_l2a%eflx_lwrad_out(1353) = -24.07, which is a problem when its square root is taken.
This value is generated from pptr%pef%eflx_lwrad_out(16453:16469), which have values
that range from -37.29 to 63.28, plus 8 identical values of 319.58240749919998.
It seems that the '319' values are not weighted heavily in the calculation, since the end product
is negative.
pptr comes directly from clm3%g%l%c%p, which I don't think I can track in the core file
to its source.
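
To illustrate the arithmetic, here is a minimal sketch of a weighted average over the 17 entries that can come out negative. Only the two extremes and the repeated 319.58... value are taken from the core file; the other 7 entries are not listed in this post, so they are placeholder zeros, and the weights are invented. I don't know the actual weights or the averaging routine CLM uses.

```python
import math

def weighted_mean(values, weights):
    """Weighted average with non-negative weights; result lies in [min, max] of values."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(v * w for v, w in zip(values, weights)) / total

# Extremes and repeated value from the core file; the 7 zeros are placeholders
# for entries not listed in this post.
values = [-37.29, 63.28] + [0.0] * 7 + [319.58240749919998] * 8
# Hypothetical weights: almost everything on the most negative entry, none on the 319s.
weights = [0.96, 0.04] + [0.0] * 7 + [0.0] * 8

mean = weighted_mean(values, weights)
print(mean < 0)

try:
    math.sqrt(mean)  # the square root of a negative flux is undefined
except ValueError as err:
    print("sqrt failed:", err)
```

So a negative grid-cell value is consistent with the '319' entries carrying little or no weight; the real question is why any of the per-pft values are negative in the first place.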

Any idea what's happening here?

Thanks,
Kevin
 