dbhart_sandia@gmail_com
New Member
I am running the CESM1 model in CLM-only mode (the I* compsets). I am trying to run the I_1948-2004 compset, and every time I do so the model crashes sometime during month 7 of year 6. This has occurred on two different generic_linux_intel systems, both with 8 PEs per node. The problem has occurred with the ICN, I4804, I4804CN and I compsets, and it occurs regardless of the number of nodes used (tested with 8-64 PEs).
I have also tried running with a 5-year restart and resubmission. This appears to get around the failure, but it will be a problem for our intended application of the code.
If anyone has any suggestions as to why this would be occurring, or where I should try to get more debug information to track down the problem, I would greatly appreciate it.
Thanks,
David Hart
Geoscience Research and Applications
Sandia National Laboratories
Machines are CentOS 5.5, Intel 11.1, 2x Intel Xeon Quad-Core, 48GB RAM, Infiniband connections.
The error output from ccsm.log is as follows (no other log files show the error):
[.....]
BalanceCheck: soil balance error nstep = 97378 point = 22492 imbalance = 0.000000 W/m2
BalanceCheck: soil balance error nstep = 97381 point = 12313 imbalance = -0.000001 W/m2
pio_support::pio_die:: myrank= 0 : ERROR: nf_mod.F90: 582 :
NetCDF: Not a valid ID
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image PC Routine Line Source
libc.so.6 000000377C6799B0 Unknown Unknown Unknown
ccsm.exe 0000000000D8BDF3 Unknown Unknown Unknown
ccsm.exe 0000000000C03203 Unknown Unknown Unknown
ccsm.exe 0000000000BC20C4 Unknown Unknown Unknown
ccsm.exe 0000000000BB6144 Unknown Unknown Unknown
ccsm.exe 0000000000A3B716 shr_dmodel_mod_mp 685 shr_dmodel_mod.F90
ccsm.exe 0000000000A38D7C shr_dmodel_mod_mp 527 shr_dmodel_mod.F90
ccsm.exe 0000000000AE7753 shr_strdata_mod_m 555 shr_strdata_mod.F90
ccsm.exe 00000000004C7FCE datm_comp_mod_mp_ 750 datm_comp_mod.F90
ccsm.exe 00000000004BFF9F atm_comp_mct_mp_a 96 atm_comp_mct.F90
ccsm.exe 0000000000424895 MAIN__ 2037 ccsm_driver.F90
ccsm.exe 000000000040E01C Unknown Unknown Unknown
libc.so.6 000000377C61D994 Unknown Unknown Unknown
ccsm.exe 000000000040DF29 Unknown Unknown Unknown
--------------------------------------------------------------------------
mpiexec has exited due to process rank 0 with PID 13782 on
node node19.cluster exiting without calling "finalize". This may
have caused other processes in the application to be
terminated by signals sent by mpiexec (as reported here).
--------------------------------------------------------------------------
forrtl: error (78): process killed (SIGTERM)
Image PC Routine Line Source
libmlx4-rdmav2.so 00002AAAAB0FC710 Unknown Unknown Unknown
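As a sanity check on where the run fails, the nstep values in the BalanceCheck lines can be converted to a model date. The sketch below is only illustrative: it assumes a 1800-second time step, a 365-day (no-leap) calendar, and that nstep counts from the start of the run. None of these is confirmed by the log, and the function name nstep_to_model_date is made up for this example.

```python
# Days per month in a 365-day (no-leap) year. ASSUMPTION: no-leap calendar.
DAYS_IN_MONTH = [31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31]


def nstep_to_model_date(nstep, dtime=1800):
    """Convert a step count to (year, month, day, seconds_of_day).

    ASSUMPTIONS: dtime is the time step in seconds (1800 here is a guess),
    and step 0 corresponds to 00:00 on day 1, month 1 of run year 1.
    Years are counted relative to the start of the run, not calendar years.
    """
    total_seconds = nstep * dtime
    days, sec_of_day = divmod(total_seconds, 86400)
    year_index, day_of_year = divmod(days, 365)

    # Walk through the months until the remaining days fit in one.
    month = 1
    for length in DAYS_IN_MONTH:
        if day_of_year < length:
            break
        day_of_year -= length
        month += 1

    return year_index + 1, month, day_of_year + 1, sec_of_day


if __name__ == "__main__":
    # The two nstep values from the BalanceCheck lines in the log above.
    for nstep in (97378, 97381):
        print(nstep, nstep_to_model_date(nstep))
```

Under these assumptions both steps fall on day 23 of month 7 in run year 6, which matches the point where the run is reported to crash. If the actual time step or calendar differs, the result changes accordingly.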