zhongq@cma_gov_cn
New Member
Dear all,
I’m building CAM5 standalone on an IBM AIX machine and encountered a problem after submitting the job. Could you help me work out what the problem is? Thanks.
I configured the model with the default settings from run-ibm.csh:
set ntasks = 16
setenv OMP_NUM_THREADS 4
$cfgdir/configure -dyn fv -hgrid 1.9x2.5 -spmd -smp -ntasks $ntasks -nthreads $OMP_NUM_THREADS -test
gmake -j8 >&! MAKE.out
This successfully produced the namelists and the executable ‘cam’. I then submitted "job.cmd" to LoadLeveler with llsubmit, as follows:
#@node = 16
#@tasks_per_node= 4
#@job_type = parallel
#@network.MPI = sn_single,shared,us
#@node_usage = shared
#@queue
setenv OMP_NUM_THREADS 4
poe ./cam
The following errors were then reported:
Information from file.err:
INFO: 0031-364 Contacting LoadLeveler to query information for batch job
ATTENTION: 0031-408 64 tasks allocated by LoadLeveler, continuing...
INFO: 0031-119 Host d23n02 allocated for task 0
INFO: 0031-120 Host address 172.172.5.60 allocated for task 0
INFO: 0031-377 Using sn0 for MPI euidevice for task 0
INFO: 0031-373 Using MPI for messaging API
…………….
INFO: 0031-377 Using sn0 for MPI euidevice for task 63
60:INFO: 0031-724 Executing program:
24:INFO: 0031-724 Executing program:
12:INFO: 0031-724 Executing program:
32:INFO: 0031-724 Executing program:
4:ATTENTION: 0031-722 can't set priority to 0
4:INFO: 0031-724 Executing program:
………………
28:LAPI version #6.73 2006/6/05 1.143.1.14 src/rsct/lapi/lapi.c, lapi, rsct_rag2, rag2s008a 64bit(us) library compiled on Thu Nov 9 12:08:16 2006
28:.
………………….
28:LAPI is using lightweight lock.
………….
28:The MPI shared memory protocol is used for the job
……………….
28: Traceback:
28: Offset 0x0000018c in procedure __abortutils_MOD_endrun
28: Offset 0x00000408 in procedure __spmd_dyn_MOD_spmdinit_dyn
28: Offset 0x00000714 in procedure __dyn_comp_MOD_dyn_init
28: Offset 0x00000094 in procedure __inital_MOD_cam_initial
60: Offset 0x00000714 in procedure __dyn_comp_MOD_dyn_init
60: Offset 0x00000094 in procedure __inital_MOD_cam_initial
60: Offset 0x000000b8 in procedure __cam_comp_MOD_cam_init
……………
39: Offset 0x00004bd4 in procedure __ccsm_comp_mod_MOD_ccsm_init
39: Offset 0x00000034 in procedure ccsm_driver
39: --- End of call chain ---
INFO: 0031-656 I/O file STDERR closed by task 1
INFO: 0031-656 I/O file STDOUT closed by task 49
………………………
ERROR: 0031-250 task 47: Terminated
ERROR: 0031-250 task 21: Terminated
ERROR: 0031-250 task 37: Terminated
INFO: 0031-656 I/O file STDOUT closed by task 46
INFO: 0031-656 I/O file STDERR closed by task 46
ERROR: 0031-250 task 46: Terminated
INFO: 0031-639 Exit status from pm_respond = 0
Information from file.out:
……………..
0: Read in dyn_fv_inparm namelist from: atm_in
0: Read in spmd_fv_inparm namelist from: atm_in
0: WARNING : npr_yz not present - using 1-D domain decomposition
0: Decomposing tracers into 1 groups
0: non-transpose geopk communication method = F
0: Z-parallel non-transpose geopk communication method = F
0: decomposition is effectively 1D - skipping transposes
……………………
0: Mod_comm t1_win window size = 53568
0: Mod_comm r8_win window size = 179286
0: Mod_comm r4_win window size = 1
0: Mod_comm i4_win window size = 1
0: ENDRUN:SPMDINIT_DYN: less than 3 latitudes per subdomain
1: ENDRUN:SPMDINIT_DYN: less than 3 latitudes per subdomain
2: ENDRUN:SPMDINIT_DYN: less than 3 latitudes per subdomain
………………….
Where is the problem? Is something wrong with MPI or with "job.cmd"? I also tried other numbers of tasks and threads in hybrid mode, and the same error was reported.
Or should I set a value for “npr_yz”? If so, how should I choose it? Is it sensitive to the number of MPI tasks?
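
For reference, the numbers quoted above can be checked with a few lines of arithmetic. This is only a sketch under stated assumptions: the 1.9x2.5 FV grid is assumed to have 96 latitudes (this does not appear in the log), the minimum of 3 latitudes per subdomain is taken from the ENDRUN message, and the helper `lats_per_subdomain` and its `npr_z` argument are hypothetical names for illustration, not part of CAM. It compares the 16 tasks passed to configure with the 64 tasks that LoadLeveler allocated (node = 16, tasks_per_node = 4).

```python
# Hypothetical helper for illustration only; not part of CAM or its scripts.
def lats_per_subdomain(nlat, ntasks, npr_z=1):
    """Average latitudes per subdomain when ntasks is split as npr_y * npr_z.

    npr_z = 1 corresponds to the 1-D (latitude-only) decomposition that the
    log reports when npr_yz is not present.
    """
    if ntasks % npr_z != 0:
        raise ValueError("ntasks must be divisible by npr_z")
    npr_y = ntasks // npr_z  # number of subdomains along latitude
    return nlat / npr_y


NLAT = 96      # ASSUMPTION: latitude count of the 1.9x2.5 FV grid
MIN_LATS = 3   # from "less than 3 latitudes per subdomain" in file.out

configure_tasks = 16   # "set ntasks = 16" in run-ibm.csh
job_tasks = 16 * 4     # "#@node = 16" times "#@tasks_per_node= 4" in job.cmd

for label, ntasks in (("configure", configure_tasks), ("job.cmd", job_tasks)):
    lats = lats_per_subdomain(NLAT, ntasks)
    status = "ok" if lats >= MIN_LATS else "too few"
    print(f"{label}: {ntasks} tasks, {lats:.1f} latitudes per subdomain ({status})")
```

If the assumption about the grid holds, the task count requested in job.cmd, rather than MPI itself, would be enough to trigger the ENDRUN message under a 1-D decomposition. Either requesting fewer MPI tasks or spreading the tasks over a second dimension (the `npr_z` argument here, standing in for whatever “npr_yz” configures) would raise the latitudes per subdomain.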