2008sdzzq@163_com
Hi:
I am porting CESM2.1.1 to my university's cluster, and everything has worked so far.
When I run an F2000climo test case with f09_f09_mg17 using the default setting on one node, it runs without any problems.
However, when I use more than one node, the run crashes within the first several days. The attached log is from a two-node run; with more nodes it crashes even earlier.
Cluster: 80 CPUs per node (2 nodes = 160 CPUs), Intel compiler.
Best,
Z-Q
Opening file test_160.cism.initial_hist.0001-01-01-00000.nc for output;
Write output at start of run and every 1.00000000000000 years
Creating variables internal_time, time, and tstep_count
Creating variable level
Creating variable lithoz
Creating variable staglevel
Creating variable stagwbndlevel
Creating variable x0
Creating variable x1
Creating variable y0
Creating variable y1
Creating variable artm
Creating variable smb
Creating variable thk
Creating variable topg
Creating variable usurf
Writing to file test_160.cism.initial_hist.0001-01-01-00000.nc at time 0.000000000000000E+000
MCT::m_Router::initp_: GSMap indices not increasing...Will correct
MCT::m_Router::initp_: RGSMap indices not increasing...Will correct
MCT::m_Router::initp_: RGSMap indices not increasing...Will correct
MCT::m_Router::initp_: GSMap indices not increasing...Will correct
calcsize j,iq,jac, lsfrm,lstoo 1 1 1 26 21
calcsize j,iq,jac, lsfrm,lstoo 1 1 2 26 21
calcsize j,iq,jac, lsfrm,lstoo 1 2 1 22 15
calcsize j,iq,jac, lsfrm,lstoo 1 2 2 22 15
calcsize j,iq,jac, lsfrm,lstoo 1 3 1 24 17
calcsize j,iq,jac, lsfrm,lstoo 1 3 2 24 17
calcsize j,iq,jac, lsfrm,lstoo 1 4 1 25 20
calcsize j,iq,jac, lsfrm,lstoo 1 4 2 25 20
calcsize j,iq,jac, lsfrm,lstoo 1 5 1 23 19
calcsize j,iq,jac, lsfrm,lstoo 1 5 2 23 19
calcsize j,iq,jac, lsfrm,lstoo 2 1 1 21 26
calcsize j,iq,jac, lsfrm,lstoo 2 1 2 21 26
calcsize j,iq,jac, lsfrm,lstoo 2 2 1 15 22
calcsize j,iq,jac, lsfrm,lstoo 2 2 2 15 22
calcsize j,iq,jac, lsfrm,lstoo 2 3 1 17 24
calcsize j,iq,jac, lsfrm,lstoo 2 3 2 17 24
calcsize j,iq,jac, lsfrm,lstoo 2 4 1 20 25
calcsize j,iq,jac, lsfrm,lstoo 2 4 2 20 25
calcsize j,iq,jac, lsfrm,lstoo 2 5 1 19 23
calcsize j,iq,jac, lsfrm,lstoo 2 5 2 19 23
max rss=549.1 MB
max rss=549.1 MB
max rss=549.1 MB
max rss=549.1 MB
max rss=549.1 MB
max rss=549.1 MB
max rss=549.1 MB
max rss=549.1 MB
max rss=549.1 MB
max rss=549.1 MB
max rss=549.1 MB
max rss=549.1 MB
max rss=549.1 MB
max rss=584.3 MB
cesm.exe:121149 terminated with signal 11 at PC=b75ee3 SP=7ffd66cda3a0. Backtrace:
./cesm.exe[0xb75ee3]
./cesm.exe[0xb46943]
./cesm.exe[0xb39915]
./cesm.exe[0x74cb67]
./cesm.exe[0x7312dd]
./cesm.exe[0x6e6a05]
./cesm.exe[0x6df962]
./cesm.exe[0x4fef7c]
./cesm.exe[0x4efec0]
./cesm.exe[0x4322ca]
./cesm.exe[0x4192ee]
./cesm.exe[0x431f6d]
./cesm.exe[0x41569e]
/lib64/libc.so.6(__libc_start_main+0xf5)[0x2ba26a9813d5]
./cesm.exe[0x4155a9]