CLM floating point error in multi-instance CAM run

raeder · May 10, 2012

I'm having trouble debugging this problem because there's not much output about it.
The context is running CAM5+DART using the multi-instance capability. After
a half dozen successful short forecasts and assimilations, one of the CLM members
fails to finish initializing. The only message I can find is in the ccsm log file:
_pmii_daemon(SIGCHLD): [NID 00861] [c4-0c0s1n3] [Thu May 10 21:17:55 2012]
PE 110 exit signal Floating point exception
[NID 00861] 2012-05-10 21:17:55 Apid 7407109: initiated application termination
Application 7407109 exit codes: 136
Application 7407109 exit signals: Killed
Application 7407109 resources: utime ~5379s, stime ~32s
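
For reference, exit code 136 matches the common shell convention of 128 plus the signal number, and signal 8 is SIGFPE, which agrees with the "Floating point exception" message. A minimal sketch of that decoding follows; the helper name is hypothetical and it only assumes the 128 + signum convention:

```python
import signal

def signal_from_exit_code(code):
    """Return the signal name implied by a shell-style exit code (128 + signum).

    Returns None when the code does not map to a known signal.
    """
    if code <= 128:
        return None
    try:
        return signal.Signals(code - 128).name
    except ValueError:
        return None

# 136 - 128 = 8, which is SIGFPE on POSIX systems.
print(signal_from_exit_code(136))
```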

There's nothing unusual (compared to successful instances) in the clm instance 56 log file.
It just ends.

I've looked for NaNs in the CLM restart file, but see nothing unusual (irrig_rate is full of them,
but that's true for the other restart files).
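
One way to make that comparison systematic is to count NaNs per variable in the failing member's restart file and in a known-good member's file, then list only the variables whose counts differ. The sketch below assumes the variables have already been read into plain lists of floats (reading the netCDF file is left out); the function names and the field `some_field` are hypothetical, and the data are made up:

```python
import math

def nan_counts(fields):
    """Map each variable name to the number of NaN entries it holds."""
    return {name: sum(1 for v in values if math.isnan(v))
            for name, values in fields.items()}

def differing_nan_fields(suspect, reference):
    """Names of variables whose NaN count differs between the two members."""
    s = nan_counts(suspect)
    r = nan_counts(reference)
    return sorted(name for name in s if s[name] != r.get(name, 0))

nan = float("nan")
# Made-up stand-ins for two members' restart variables.
good_member = {"irrig_rate": [nan, nan, nan], "some_field": [1.0, 2.0, 3.0]}
bad_member = {"irrig_rate": [nan, nan, nan], "some_field": [1.0, nan, 3.0]}

# irrig_rate is all NaN in both members, so only some_field is reported.
print(differing_nan_fields(bad_member, good_member))
```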

I've set INFO_DBUG=2 and compiled with no optimization.

I haven't built a single-instance CAM and fed these ICs to it. Would that be worth the effort?

Is there anything else I can do to get more information about the death?

Thanks,
Kevin

raeder · May 21, 2012

Using the ddt debugger on the core file on hopper I've found that
clm_l2a%eflx_lwrad_out(1353) = -24.07 which is a problem when the sqrt of it is taken.
This value is generated from pptr%pef%eflx_lwrad_out(16453:16469), which have values
that range from -37.29 to 63.28, plus 8 identical values of 319.58240749919998.
It seems that the '319' values are not weighted heavily in the calculation, since the end product
is negative.
pptr comes directly from clm3%g%l%c%p, which I don't think I can track in the core file
to its source.
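
To illustrate how a weighted average can come out negative even when some inputs are large and positive, here is a minimal sketch. The three values are taken from the range quoted above, but the weights are purely hypothetical (the actual CLM weights are not known from the core file), and the helper names are made up. It also shows a guard that reports the bad value before the square root is taken:

```python
import math

def weighted_mean(values, weights):
    """Weighted average of values; weights need not sum to 1."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(v * w for v, w in zip(values, weights)) / total

def safe_sqrt(x, name="value"):
    """Square root that fails with a descriptive message on negative input."""
    if x < 0:
        raise ValueError(f"{name} is negative ({x}); sqrt is undefined")
    return math.sqrt(x)

values = [-37.29, 63.28, 319.58240749919998]
weights = [0.9, 0.1, 0.0]  # hypothetical: the large value carries no weight

mean = weighted_mean(values, weights)
print(mean < 0)  # the negative input dominates under these weights

try:
    safe_sqrt(mean, name="eflx_lwrad_out")
except ValueError as err:
    print(err)
```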

Any idea what's happening here?

Thanks,
Kevin
