scs_wy@yahoo_cn
New Member
Hi everyone
I have run CCSM3 a number of times, and I recently replaced my cluster. Since then, some strange problems have appeared.
I can run T85_gx1v3, T42_gx1v3, and T31_gx3v5 smoothly, but I can't run T42_gx3v5. The output log shows that it runs up to 00010101 and then stops.
---------------------------------------------------------------------------------------------
(tStamp_write) cpl model date 0001-01-01 00000s wall clock 2009-02-04 20:49:36 avg dt 0s dt
0s
(cpl_map_npFixNew3) compute bilinear weights & indicies for NP region.
(cpl_bundle_copy) WARNING: bundle aoflux_o has accum count = 0
(flux_atmOcn) FYI: this routine is not threaded
print_memusage iam 0 stepon after dynpkg. -1 in the next line means unavailable
print_memusage: size, rss, share, text, datastack= 63097 33892 1577 1175 0
print_memusage iam 1 stepon after dynpkg. -1 in the next line means unavailable
print_memusage: size, rss, share, text, datastack= 54770 25481 1524 1175 0
print_memusage iam 12 stepon after dynpkg. -1 in the next line means unavailable
print_memusage: size, rss, share, text, datastack= 54197 24577 1539 1175 0
print_memusage iam 2 stepon after dynpkg. -1 in the next line means unavailable
print_memusage: size, rss, share, text, datastack= 53863 24586 1541 1175 0
print_memusage iam 3 stepon after dynpkg. -1 in the next line means unavailable
print_memusage: size, rss, share, text, datastack= 54472 25090 1541 1175 0
print_memusage iam 4 stepon after dynpkg. -1 in the next line means unavailable
print_memusage: size, rss, share, text, datastack= 53774 24380 1541 1175 0
print_memusage iam 5 stepon after dynpkg. -1 in the next line means unavailable
print_memusage: size, rss, share, text, datastack= 56325 24827 1541 1175 0
print_memusage iam 6 stepon after dynpkg. -1 in the next line means unavailable
print_memusage: size, rss, share, text, datastack= 53466 24019 1541 1175 0
print_memusage iam 7 stepon after dynpkg. -1 in the next line means unavailable
print_memusage: size, rss, share, text, datastack= 53160 23663 1541 1175 0
print_memusage iam 8 stepon after dynpkg. -1 in the next line means unavailable
print_memusage: size, rss, share, text, datastack= 54064 24494 1541 1175 0
print_memusage iam 9 stepon after dynpkg. -1 in the next line means unavailable
print_memusage: size, rss, share, text, datastack= 53600 24039 1541 1175 0
print_memusage iam 10 stepon after dynpkg. -1 in the next line means unavailable
print_memusage: size, rss, share, text, datastack= 54198 24656 1541 1175 0
print_memusage iam 11 stepon after dynpkg. -1 in the next line means unavailable
print_memusage: size, rss, share, text, datastack= 54146 24552 1539 1175 0
print_memusage iam 13 stepon after dynpkg. -1 in the next line means unavailable
print_memusage: size, rss, share, text, datastack= 53497 23715 1539 1175 0
print_memusage iam 14 stepon after dynpkg. -1 in the next line means unavailable
print_memusage: size, rss, share, text, datastack= 54205 23433 1500 1175 0
print_memusage iam 15 stepon after dynpkg. -1 in the next line means unavailable
print_memusage: size, rss, share, text, datastack= 54072 23269 1500 1175 0
[node21:06615] MPI_ABORT invoked on rank 40 in communicator MPI_COMM_WORLD with errorcode 1
[node20:26437] *** Process received signal ***
[node20:26437] Signal: Segmentation fault (11)
[node20:26437] Signal code: Address not mapped (1)
[node20:26437] Failing at address: 0x16819fa8
[node20:26435] *** Process received signal ***
[node20:26435] Signal: Segmentation fault (11)
[node20:26435] Signal code: Address not mapped (1)
[node20:26435] Failing at address: 0x1684e528
[node20:26438] *** Process received signal ***
[node20:26436] *** Process received signal ***
[node20:26438] Signal: Segmentation fault (11)
[node20:26438] Signal code: Address not mapped (1)
[node20:26438] Failing at address: 0x1680ff68
[node20:26436] Signal: Segmentation fault (11)
[node20:26436] Signal code: Address not mapped (1)
[node20:26436] Failing at address: 0x1621b068
[node20:26437] [ 0] /lib64/libpthread.so.0 [0x2b65704a4c00]
[node20:26437] [ 1] /dcfs2/users/wy/case_0204_35/exe/case_0204_35/all/cam(sphdep_+0xc14) [0x6e7f14]
[node20:26437] *** End of error message ***
[node20:26434] *** Process received signal ***
[node20:26439] *** Process received signal ***
[node20:26434] Signal: Segmentation fault (11)
[node20:26434] Signal code: Address not mapped (1)
[node20:26434] Failing at address: 0x16844528
[node20:26439] Signal: Segmentation fault (11)
[node20:26439] Signal code: Address not mapped (1)
[node20:26439] Failing at address: 0x1621af68
[node20:26435] [ 0] /lib64/libpthread.so.0 [0x2b70d554bc00]
[node20:26435] [ 1] /dcfs2/users/wy/case_0204_35/exe/case_0204_35/all/cam(sphdep_+0xc14) [0x6e7f14]
[node20:26435] *** End of error message ***
[node20:26438] [ 0] /lib64/libpthread.so.0 [0x2b0c9a68bc00]
[node20:26438] [ 1] /dcfs2/users/wy/case_0204_35/exe/case_0204_35/all/cam(sphdep_+0xc14) [0x6e7f14]
[node20:26438] *** End of error message ***
[node20:26436] [ 0] /lib64/libpthread.so.0 [0x2b3214a43c00]
[node20:26436] [ 1] /dcfs2/users/wy/case_0204_35/exe/case_0204_35/all/cam(sphdep_+0xc14) [0x6e7f14]
[node20:26436] *** End of error message ***
[node20:26434] [ 0] /lib64/libpthread.so.0 [0x2b70d554bc00]
[node20:26434] [ 1] /dcfs2/users/wy/case_0204_35/exe/case_0204_35/all/cam(sphdep_+0xc14) [0x6e7f14]
[node20:26434] *** End of error message ***
[node20:26439] [ 0] /lib64/libpthread.so.0 [0x2b8b9c06fc00]
[node20:26439] [ 1] /dcfs2/users/wy/case_0204_35/exe/case_0204_35/all/cam(sphdep_+0xc14) [0x6e7f14]
[node20:26439] *** End of error message ***
[node20:26440] *** Process received signal ***
[node20:26440] Signal: Segmentation fault (11)
[node20:26440] Signal code: Address not mapped (1)
[node20:26440] Failing at address: 0x16842ad8
[node20:26440] [ 0] /lib64/libpthread.so.0 [0x2b0f16cb7c00]
[node20:26440] [ 1] /dcfs2/users/wy/case_0204_35/exe/case_0204_35/all/cam(sphdep_+0xc14) [0x6e7f14]
[node20:26440] *** End of error message ***
[node21:06616] MPI_ABORT invoked on rank 41 in communicator MPI_COMM_WORLD with errorcode 1
[node20:26428] MPI_ABORT invoked on rank 43 in communicator MPI_COMM_WORLD with errorcode 1
[node20:26429] MPI_ABORT invoked on rank 44 in communicator MPI_COMM_WORLD with errorcode 1
[node20:26430] MPI_ABORT invoked on rank 45 in communicator MPI_COMM_WORLD with errorcode 1
[node20:26431] MPI_ABORT invoked on rank 46 in communicator MPI_COMM_WORLD with errorcode 1
[node20:26432] MPI_ABORT invoked on rank 47 in communicator MPI_COMM_WORLD with errorcode 1
[node20:26433] MPI_ABORT invoked on rank 48 in communicator MPI_COMM_WORLD with errorcode 1
[node23:25877] [0,0,0] ORTE_ERROR_LOG: Timeout in file ../../../../orte/mca/pls/base/pls_base_orted_cmds.c at line 275
[node23:25877] [0,0,0] ORTE_ERROR_LOG: Timeout in file ../../../../../orte/mca/pls/tm/pls_tm_module.c at line 572
[node23:25877] [0,0,0] ORTE_ERROR_LOG: Timeout in file ../../../../../orte/mca/errmgr/hnp/errmgr_hnp.c at line 90
mpirun noticed that job rank 0 with PID 25879 on node node23 exited on signal 15 (Terminated).
[node23:25877] [0,0,0] ORTE_ERROR_LOG: Timeout in file ../../../../orte/mca/pls/base/pls_base_orted_cmds.c at line 188
[node23:25877] [0,0,0] ORTE_ERROR_LOG: Timeout in file ../../../../../orte/mca/pls/tm/pls_tm_module.c at line 603
--------------------------------------------------------------------------
mpirun was unable to cleanly terminate the daemons for this job. Returned value Timeout instead of ORTE_SUCCESS.
--------------------------------------------------------------------------
34 additional processes aborted (not shown)
[node20:26426] OOB: Connection to HNP lost
[node21:06602] OOB: Connection to HNP lost
Wed Feb 4 20:49:57 CST 2009 -- CSM EXECUTION HAS FINISHED
Model did not complete - see cpl.log.090204-133502
-----------------------------------------------------------------------------------
It looks like a segmentation fault. Can anyone give me some advice?
Thanks!