I'm having trouble debugging this problem because there's not much output about it.
The context is running CAM5+DART using the multi-instance capability. After
a half dozen successful short forecasts and assimilations, one of the CLM members
fails to finish initializing. The only message I can find is in the ccsm log file:
_pmii_daemon(SIGCHLD): [NID 00861] [c4-0c0s1n3] [Thu May 10 21:17:55 2012]
PE 110 exit signal Floating point exception
[NID 00861] 2012-05-10 21:17:55 Apid 7407109: initiated application termination
Application 7407109 exit codes: 136
Application 7407109 exit signals: Killed
Application 7407109 resources: utime ~5379s, stime ~32s
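For reference, the exit code agrees with the "Floating point exception" line, assuming the launcher follows the common shell convention of reporting 128 + signal number for a process killed by a signal (an assumption here, not something the log states). A minimal sketch of that decoding, with `decode_exit_code` as a hypothetical helper name:

```python
import signal

def decode_exit_code(code):
    """Return the signal name if `code` looks like 128 + N, else None.

    Assumes the 128 + signal-number convention used by POSIX shells.
    """
    if code > 128:
        try:
            return signal.Signals(code - 128).name
        except ValueError:
            # No signal with that number on this platform.
            return None
    return None

# 136 - 128 = 8, which is SIGFPE on Linux.
print(decode_exit_code(136))
```

So 136 carries no information beyond what the log already says: the process received SIGFPE.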
There's nothing unusual (compared to successful instances) in the log file for CLM instance 56.
It just ends.
I've looked for NaNs in the CLM restart file, but see nothing unusual (irrig_rate is full of them,
but that's true for the other restart files).
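Since some fields are expected to contain NaNs, a raw NaN search is hard to read by eye. A rough sketch of a more targeted check is to count non-finite values per field in the failing member and in a member that ran, and report only the fields where the counts differ. The reading of the restart files is left out; the sketch assumes each file has already been loaded into a dict mapping field name to a flat list of floats, and the function names and the `t_example` field are hypothetical.

```python
import math

def nonfinite_counts(fields):
    """Count NaN/Inf values in each field (dict of name -> list of floats)."""
    return {
        name: sum(1 for v in values if not math.isfinite(v))
        for name, values in fields.items()
    }

def differing_fields(failing, reference):
    """Return {name: (failing_count, reference_count)} where the counts differ."""
    bad = nonfinite_counts(failing)
    ref = nonfinite_counts(reference)
    return {
        name: (bad[name], ref.get(name, 0))
        for name in bad
        if bad[name] != ref.get(name, 0)
    }

# Illustrative data only: irrig_rate is all NaN in both members, so it is
# not reported; t_example has one Inf in the failing member only.
nan = float("nan")
reference = {"irrig_rate": [nan, nan], "t_example": [270.0, 271.5]}
failing = {"irrig_rate": [nan, nan], "t_example": [270.0, float("inf")]}

suspects = differing_fields(failing, reference)
print(suspects)
```

This only catches NaN and Inf. A value that is finite but far outside the range of the other members (a zero or negative where a positive quantity is expected, say) could also trigger a floating point exception and would need a separate range comparison.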
I've set INFO_DBUG = 2 and compiled with no optimization.
I haven't built a single-instance CAM and fed these ICs to it. Would that be worth the effort?
Is there anything else I can do to get more information about the death?
Thanks,
Kevin