An error encountered when building and running CAM5 standalone

Dear all,
I’m building CAM5 standalone on an IBM AIX machine and encountered a problem after submitting the job. Could you help me work out what the problem is? Thanks.
I configured the model with the default settings from run-ibm.csh:
set ntasks = 16
setenv OMP_NUM_THREADS 4
$cfgdir/configure -dyn fv -hgrid 1.9x2.5 -spmd -smp -ntasks $ntasks -nthreads $OMP_NUM_THREADS -test
gmake -j8 >&! MAKE.out

This successfully produced the namelists and the executable ‘cam’. I then submitted "job.cmd" to LoadLeveler with llsubmit, as follows:
#@node = 16
#@tasks_per_node= 4
#@job_type = parallel
#@network.MPI = sn_single,shared,us
#@node_usage = shared
#@queue
setenv OMP_NUM_THREADS 4
poe ./cam

The following error was then reported:
Information from file.err:
INFO: 0031-364 Contacting LoadLeveler to query information for batch job
ATTENTION: 0031-408 64 tasks allocated by LoadLeveler, continuing...
INFO: 0031-119 Host d23n02 allocated for task 0
INFO: 0031-120 Host address 172.172.5.60 allocated for task 0
INFO: 0031-377 Using sn0 for MPI euidevice for task 0
INFO: 0031-373 Using MPI for messaging API
…………….
INFO: 0031-377 Using sn0 for MPI euidevice for task 63
60:INFO: 0031-724 Executing program:
24:INFO: 0031-724 Executing program:
12:INFO: 0031-724 Executing program:
32:INFO: 0031-724 Executing program:
4:ATTENTION: 0031-722 can't set priority to 0
4:INFO: 0031-724 Executing program:
………………
28:LAPI version #6.73 2006/6/05 1.143.1.14 src/rsct/lapi/lapi.c, lapi, rsct_rag2, rag2s008a 64bit(us) library compiled on Thu Nov 9 12:08:16 2006
28:.
………………….
28:LAPI is using lightweight lock.
………….
28:The MPI shared memory protocol is used for the job
……………….
28: Traceback:
28: Offset 0x0000018c in procedure __abortutils_MOD_endrun
28: Offset 0x00000408 in procedure __spmd_dyn_MOD_spmdinit_dyn
28: Offset 0x00000714 in procedure __dyn_comp_MOD_dyn_init
28: Offset 0x00000094 in procedure __inital_MOD_cam_initial
60: Offset 0x00000714 in procedure __dyn_comp_MOD_dyn_init
60: Offset 0x00000094 in procedure __inital_MOD_cam_initial
60: Offset 0x000000b8 in procedure __cam_comp_MOD_cam_init
……………
39: Offset 0x00004bd4 in procedure __ccsm_comp_mod_MOD_ccsm_init
39: Offset 0x00000034 in procedure ccsm_driver
39: --- End of call chain ---
INFO: 0031-656 I/O file STDERR closed by task 1
INFO: 0031-656 I/O file STDOUT closed by task 49
………………………
ERROR: 0031-250 task 47: Terminated
ERROR: 0031-250 task 21: Terminated
ERROR: 0031-250 task 37: Terminated
INFO: 0031-656 I/O file STDOUT closed by task 46
INFO: 0031-656 I/O file STDERR closed by task 46
ERROR: 0031-250 task 46: Terminated
INFO: 0031-639 Exit status from pm_respond = 0

Information from file.out:
……………..
0: Read in dyn_fv_inparm namelist from: atm_in
0: Read in spmd_fv_inparm namelist from: atm_in
0: WARNING : npr_yz not present - using 1-D domain decomposition
0: Decomposing tracers into 1 groups
0: non-transpose geopk communication method = F
0: Z-parallel non-transpose geopk communication method = F
0: decomposition is effectively 1D - skipping transposes
……………………
0: Mod_comm t1_win window size = 53568
0: Mod_comm r8_win window size = 179286
0: Mod_comm r4_win window size = 1
0: Mod_comm i4_win window size = 1
0: ENDRUN:SPMDINIT_DYN: less than 3 latitudes per subdomain
1: ENDRUN:SPMDINIT_DYN: less than 3 latitudes per subdomain
2: ENDRUN:SPMDINIT_DYN: less than 3 latitudes per subdomain
………………….

Where is the problem? Is something wrong with MPI or with "job.cmd"? I also tried other numbers of tasks and threads in hybrid mode, and the same error was reported.
Or should I give a value to “npr_yz”? If so, how should I choose the value of this variable? Does it depend on the number of MPI tasks?
 

eaton

CSEG and Liaisons
The configure command is specifying that the job will run with 16 MPI tasks. This will work with the 1.9x2.5 grid using the default value for npr_yz. However, the batch job appears to be requesting 64 tasks (16 nodes with 4 tasks per node). Running with 64 tasks will not work unless npr_yz is set to employ a 2D decomposition. npr_yz=32,2,2,32 would be a valid setting.
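The task-count mismatch and the latitude limit described in this reply can be checked with a little arithmetic. The following is a minimal Python sketch, not part of CAM. It assumes the 1.9x2.5 FV grid has 96 latitudes, reads npr_yz as (npr_y, npr_z, nprxy_x, nprxy_y), assumes the default 1-D decomposition gives every task its own latitude band, and assumes both pairs of factors should multiply to the MPI task count, as they do in the npr_yz=32,2,2,32 example. The function names are hypothetical.

```python
NLAT_1_9x2_5 = 96  # assumed latitude count of the 1.9x2.5 FV grid
MIN_LATS = 3       # limit named in the SPMDINIT_DYN error message


def batch_tasks(nodes, tasks_per_node):
    """Total MPI tasks requested by '#@node' and '#@tasks_per_node'."""
    return nodes * tasks_per_node


def check_decomp(ntasks, nlat, npr_yz=None):
    """Return a list of problems; an empty list means the checks passed.

    npr_yz=None models the (assumed) default 1-D decomposition, where
    every task gets its own latitude band.
    """
    if npr_yz is None:
        npr_yz = (ntasks, 1, 1, ntasks)
    npr_y, npr_z, nprxy_x, nprxy_y = npr_yz
    problems = []
    # Assumption: both decompositions must use exactly ntasks subdomains.
    if npr_y * npr_z != ntasks or nprxy_x * nprxy_y != ntasks:
        problems.append("npr_yz factors do not multiply to ntasks")
    # Latitude bands per subdomain, for the YZ and the XY decomposition.
    for name, n in (("npr_y", npr_y), ("nprxy_y", nprxy_y)):
        if nlat // n < MIN_LATS:
            problems.append(f"{name}={n}: less than {MIN_LATS} latitudes per subdomain")
    return problems


if __name__ == "__main__":
    ntasks = batch_tasks(nodes=16, tasks_per_node=4)  # the first job.cmd
    print(ntasks, check_decomp(ntasks, NLAT_1_9x2_5))
    print(16, check_decomp(16, NLAT_1_9x2_5))
    print(64, check_decomp(64, NLAT_1_9x2_5, (32, 2, 2, 32)))
```

Under these assumptions, 64 tasks in a 1-D decomposition leave fewer than 3 latitudes per task, while 16 tasks in 1-D, or 64 tasks with npr_yz=32,2,2,32, pass the check. The sketch does not check the longitude or vertical-level limits.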
 
Dear eaton,
Thanks for your suggestion. I revised the batch job as follows.
The configure settings remain at the defaults:
set ntasks = 16
# should be set equal to (CPUs-per-node / tasks_per_node)
setenv OMP_NUM_THREADS 4
$cfgdir/configure -dyn fv -hgrid 1.9x2.5 -spmd -smp -ntasks $ntasks -nthreads $OMP_NUM_THREADS
The revised job.cmd is:
#@node = 4
#@tasks_per_node= 4
poe ./cam

Question 1: Why must the value of “OMP_NUM_THREADS” be set equal to (CPUs-per-node / tasks_per_node)?

Question 2: A new error occurred, reported as follows:
INFO: 0031-364 Contacting LoadLeveler to query information for batch job
ATTENTION: 0031-408 16 tasks allocated by LoadLeveler, continuing...
INFO: 0031-119 Host d38n13 allocated for task 0
INFO: 0031-120 Host address 172.172.4.213 allocated for task 0
INFO: 0031-377 Using sn1 for MPI euidevice for task 0
INFO: 0031-373 Using MPI for messaging API
………….
0:INFO: 0031-724 Executing program:
………….
0:INFO: 0031-619 64bit(us) ppe_rsan, rsan0537a MPCI shared object was compiled at Tue Feb 20 12:42:45 2007
………….
0:The MPI shared memory protocol is used for the job
…………..
INFO: 0031-656 I/O file STDOUT closed by task 6
INFO: 0031-656 I/O file STDERR closed by task 6
ERROR: 0031-250 task 6: Trace/BPT trap
………………..
What does this error mean? Could you give me some advice on this problem?

Question 3:
For the 64-task case, you suggested setting “npr_yz=32,2,2,32”. How do I set this value in CAM5?
Question 4:
Once I change the number of tasks, how should I adjust the value of “npr_yz”?
Thanks for your help.
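One way to read the comment quoted in Question 1: each MPI task starts OMP_NUM_THREADS OpenMP threads, so tasks_per_node times OMP_NUM_THREADS is the number of threads sharing one node's CPUs. A minimal Python sketch of that arithmetic follows. The function names are hypothetical, and the 16-CPU node size is a made-up value for illustration, not something stated in this thread.

```python
def threads_per_task(cpus_per_node, tasks_per_node):
    """OMP_NUM_THREADS that fills a node without oversubscribing it."""
    if tasks_per_node < 1 or cpus_per_node % tasks_per_node != 0:
        raise ValueError("tasks_per_node must divide cpus_per_node evenly")
    return cpus_per_node // tasks_per_node


def is_oversubscribed(cpus_per_node, tasks_per_node, omp_num_threads):
    """True when more threads than CPUs would run on one node."""
    return tasks_per_node * omp_num_threads > cpus_per_node


if __name__ == "__main__":
    cpus = 16  # hypothetical node size
    print(threads_per_task(cpus, 4))       # threads each of 4 tasks can use
    print(is_oversubscribed(cpus, 8, 8))   # 8 tasks x 8 threads on 16 CPUs
```

The actual CPU count per node depends on the machine and should be confirmed with the system administrators.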
 
Regarding Question 3:
I tried 64 MPI tasks as follows:
set ntasks = 64
setenv OMP_NUM_THREADS 8
$cfgdir/configure -dyn fv -hgrid 1.9x2.5 -spmd -smp -ntasks $ntasks -nthreads $OMP_NUM_THREADS -test

Then I added the following to the file atm_in:
&spmd_fv_inparm
npr_yz=32,2,2,32
/
In job.cmd:
#@node = 8
#@tasks_per_node= 8
setenv OMP_NUM_THREADS 8

However, “ERROR: 0031-250 task 6: Trace/BPT trap” is still reported.
What’s wrong here? Any suggestions are welcome. Thanks.
 
It still doesn't work with "OMP_NUM_THREADS=1". With either "set ntasks = 64" or "set ntasks=16", the same error, "ERROR: 0031-250 task 36: Trace/BPT trap", is reported. :(
There must be a common underlying problem, I think, but how can I find it? I need your help, thanks!
 

eaton

CSEG and Liaisons
It's not clear whether you've been able to run an MPI job with any number of tasks. Can you run with 2 tasks?
 
Dear eaton, thanks for your suggestion. I ran some experiments, as follows.
Experiment 1:
I tried to run with 2 (or 4 or 8) tasks. However, “ERROR: 0031-250 task: Trace/BPT trap” is still reported. This means that MPI runs don’t work in this case (-dyn fv -hgrid 1.9x2.5). What might the problem be?

Experiment 2:
Then I tried to run in serial mode, but I can’t even produce the executable ‘cam’. gmake reports the following error:
xlf90_r: 1501-230 Internal compiler error; please contact your Service Representative
1501-511 Compilation failed for file clm_mct_mod.F90.
gmake: *** [clm_mct_mod.o] Error 40

Experiment 3:
I changed the resolution to “-dyn fv -hgrid 10x15” with nthreads=1.
(1) Serial: gmake failed:
gmake: *** [clm_mct_mod.o] Error 40
gmake: *** [mct_mod.o] Error 40

(2) Run with 2 MPI tasks: succeeded!

(3) Run with 4 MPI tasks: succeeded!

(4) Run with 8 tasks:
Succeeds if npr_yz=4,1,1,4 is set.

(5) Run with 16 tasks: configure failed:
Error: CICE decomposition generator returns:-1 ERROR( generate_cice_decomp.pl) No Decomp Created.
May need to explicitly specify the decomposition using the arguments -bsizex, -bsizey, and -maxblocks.

(6) Run with 64 tasks:
Same error as (5).

(7) With nthreads>=2 and ntasks=8, 16, or 64:
Same error as (5).

The experiments show that:
With “-dyn fv -hgrid 1.9x2.5”, the same error, “ERROR: 0031-250 task: Trace/BPT trap”, occurs regardless of the number of tasks.
With “-dyn fv -hgrid 10x15”, when tasks>=8, the CICE decomposition has to be specified manually in configure.

Could you help me judge what the problem might be in the “-dyn fv -hgrid 1.9x2.5” case?
And how should I specify the CICE decomposition in the “-dyn fv -hgrid 10x15” case?
Thanks for your help.
 

eaton

CSEG and Liaisons
It's hard to understand how you are able to run with 2 mpi tasks successfully at the 10x15 resolution, but not at the 1.9x2.5 resolution. There seems to be a system problem here. The 2 deg grid does not require a large memory to run so it's hard to imagine that that would be the problem. It would be really helpful to run in serial mode to get a baseline. The internal compiler error reported by xlf90_r needs to be resolved by the system administrator. There should be no problem running the code in serial mode. If that cannot be done successfully it indicates compiler and/or system level problems that need to be addressed.
 
I revised the compiler options because an error occurred during the configure test:
xlf90_r -o test_fc test_fc.o -q64 -lmassv -lmass -lessl
ld: 0711-738 ERROR: Input file /usr/lib/libmassv.a:
XCOFF32 object files are not allowed in 64-bit mode.
So I edited the file Makefile.in: in the AIX section, I changed the LDFLAGS setting by removing the “-lmassv” option.

My compiler also doesn’t recognize the “-b datapsize”, “-b stackpsize” and “-b textpsize” options, so I revised LDFLAGS as follows:
LDFLAGS:= -q64 -stackpsize:64k -textpsize:32k
Could these compiler options be causing the error I’m facing now?
 

eaton

CSEG and Liaisons
I believe that CAM should be able to build and run without adding the massv, mass, or essl libraries. These are all IBM libraries containing optimized code which is meant to improve performance.

Similarly, the datapsize, stackpsize, and textpsize settings are for optimal performance on the POWER6 platform at NCAR. You need to consult with your system administrators to find out what is best for your system. I would guess that you could just remove these settings and get the code to run even if it's not optimal.

The -q64 flags are to produce code for a 64-bit operating system. These are not necessary for low resolution models. The error message from ld indicated that the massv library was 32-bit, so perhaps you'll have more luck removing the -q64 flags. Again, the system admins should be able to help with this.
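The flag removals suggested in this reply amount to dropping tokens from a flags line such as LDFLAGS in Makefile.in. A minimal Python sketch of that edit as a plain string operation follows. The helper name is hypothetical, the sample flags are the ones quoted in the failing link command earlier in this thread, and the sketch does not check whether the resulting link line works on any particular system.

```python
def strip_flags(flags, unwanted):
    """Drop every whitespace-separated token that is listed in `unwanted`."""
    return " ".join(tok for tok in flags.split() if tok not in unwanted)


if __name__ == "__main__":
    sample = "-q64 -lmassv -lmass -lessl"  # flags from the failing link command
    # Remove the optional IBM performance libraries.
    print(strip_flags(sample, {"-lmassv", "-lmass", "-lessl"}))
    # Remove the 64-bit flag instead.
    print(strip_flags(sample, {"-q64"}))
```

Which of these removals is appropriate is a question for the system administrators, as noted in the reply.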
 