
CAM5 runs successfully at hgrid 10x15 but fails at resolution 1.9x2.5

Environment of the machine:
Red Hat Enterprise Linux 5.3 x86_64;
Intel compiler 11.1; ifort, mpif90;

Test 1: at resolution "-dyn fv -hgrid 10x15"
(1) serial: "configure -fc ifort -nosmp -nospmd"
(2) pure MPI (SPMD only, no SMP): "configure -fc mpif90 -nosmp -ntasks 6"
Both succeed.

Test 2: at resolution "-dyn fv -hgrid 1.9x2.5"
(1) serial: "configure -fc ifort -nosmp -nospmd"
and (2) pure MPI (SPMD only, no SMP): "configure -fc mpif90 -nosmp -ntasks 16" (other task counts, 2, 8, ..., were also tried). The following error occurs:
forrtl: severe (71): integer divide by zero
Image PC Routine Line Source
cam 00000000004044AC Unknown Unknown Unknown
libc.so.6 00002B0714E258A4 Unknown Unknown Unknown
cam 00000000004043B9 Unknown Unknown Unknown
yhrun: error: cn803: task 0: Exited with exit code 71

(3) hybrid mode: "configure -fc mpif90 -smp -spmd -ntasks 16 -nthreads 4". The error report is as follows:
yhrun: error: cn434: task 0: Aborted
yhrun: First task exited 60s ago
yhrun: tasks 1-15: running
yhrun: task 0: exited abnormally
yhrun: Terminating job step 997954.0
slurmd[cn434]: *** STEP 997954.0 KILLED AT 2012-11-20T23:33:01 WITH SIGNAL 9 ***
slurmd[cn435]: *** STEP 997954.0 KILLED AT 2012-11-20T23:33:01 WITH SIGNAL 9 ***
yhrun: Job step aborted: Waiting up to 2 seconds for job step to finish.
slurmd[cn434]: *** STEP 997954.0 KILLED AT 2012-11-20T23:33:01 WITH SIGNAL 9 ***
slurmd[cn435]: *** STEP 997954.0 KILLED AT 2012-11-20T23:33:01 WITH SIGNAL 9 ***

What can I infer from this error information? Is the problem in the environment or in my procedure?
Any suggestions would be appreciated!
 

eaton

CSEG and Liaisons
It's not clear whether the serial run at 1.9x2.5 was successful. Did it encounter the same failure as the pure MPI run? If you can successfully run the 10x15 case in serial mode, but the 1.9x2.5 case fails, that would indicate a memory problem. It will also be helpful to know what happens when you add the "-debug" flag to the configure command.
 
The serial run at 1.9x2.5 is not successful. Yes, it encounters the same error as the pure MPI run, as shown above.
Is memory the only difference between the two resolutions? We have 24 GB of memory on each node, which should be enough for 1.9x2.5, and "limit stacksize unlimited" is set in the job script. Do any other possibilities exist?
The -debug option has been added to configure, but I find no clear information in the output. Could you please have a look? Thanks.

View attachment 99
 

eaton

CSEG and Liaisons
There's no obvious problem indicated by the Make output.

I agree that 24-GB is plenty of memory. And setting the stacksize to unlimited should deal with stack size issues.
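
One way to confirm that the stack limit is really in effect where the model runs is to print it from inside the job script itself. The following is a minimal sketch for a POSIX-style shell (the job script in this thread uses csh, where the equivalent query is "limit stacksize"); the function name stack_limit_ok and the threshold argument are hypothetical.

```shell
# Hypothetical helper: succeed if the soft stack limit is "unlimited"
# or at least min_kb kilobytes. Prints the limit it found.
stack_limit_ok() {
    min_kb="$1"
    current=$(ulimit -s)          # soft stack limit, in kilobytes or "unlimited"
    echo "stack limit: $current"
    [ "$current" = "unlimited" ] || [ "$current" -ge "$min_kb" ]
}
```

Calling, for example, stack_limit_ok 1048576 just before the model is launched would report whether at least 1 GB of stack is available to that shell. Limits seen by tasks started through the batch launcher may differ from those of the login shell, so the check is only meaningful when run in the same environment as the model.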

The two resolutions use different datasets. I can't tell from the output in your posts whether the failure is due to a missing dataset. Please post the log output from the run attempt. Also, if you run the build-namelist command with the -test option it will check that all required datasets are present on a locally accessible disk.
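
As a manual cross-check alongside build-namelist -test, the sketch below scans a namelist file for single-quoted absolute paths and reports any that do not exist on disk. It is written for a POSIX-style shell rather than csh. The function name check_namelist_paths is hypothetical, and the assumption that dataset entries appear as single-quoted absolute paths is mine, not something stated in this thread.

```shell
# Hypothetical helper: report single-quoted absolute paths in a namelist
# file that are missing on disk. Assumes entries look like
#   name = '/abs/path/file.nc'
check_namelist_paths() {
    nl_file="$1"
    if [ ! -r "$nl_file" ]; then
        echo "cannot read namelist file: $nl_file" >&2
        return 2
    fi
    # Extract every quoted string starting with "/", strip the quotes,
    # and collect the ones that do not exist.
    missing=$(grep -o "'/[^']*'" "$nl_file" | tr -d "'" | while IFS= read -r path; do
        [ -e "$path" ] || echo "MISSING: $path"
    done)
    if [ -n "$missing" ]; then
        echo "$missing"
        return 1
    fi
    echo "no missing paths found"
}
```

The function would be given the path of a namelist file written by build-namelist; a non-zero return status means at least one quoted path could not be found.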
 
Yes, the required data has been checked:
$cfgdir/configure -fc ifort -dyn fv -hgrid 1.9x2.5 -nospmd -nosmp -debug || echo "configure failed" && exit 1
gmake -j8 >&! MAKE.out || echo "CAM build failed: see $blddir/MAKE.out" && exit 1
rm *.o *.mod
$cfgdir/build-namelist -test -s -config $blddir/config_cache.xml -case $case -runtype $runtype \
 -namelist "&camexp stop_option='ndays', stop_n=$stop_n /" |tee ./datacheck.log || echo "build-namelist failed" && exit 1
The log output from the run attempt with -debug added is in the attached slurm.out.
 

eaton

CSEG and Liaisons
The output log shows the failure coming from the land model.

Image PC Routine Line Source
cam 00000000009BC614 accumulmod_mp_upd 496 accumulMod.F90
cam 000000000099BB18 accfldsmod_mp_upd 473 accFldsMod.F90
cam 00000000010AAF43 clm_driver_mp_clm 756 clm_driver.F90
cam 00000000023B6F5D lnd_comp_mct_mp_l 693 lnd_comp_mct.F90
cam 0000000000D1B332 ccsm_comp_mod_mp_ 1784 ccsm_comp_mod.F90
cam 0000000000D25978 MAIN__ 91 ccsm_driver.F90

This doesn't make any sense to me. Since you appear to be setting things up correctly, I suspect that this is some kind of system problem. My recommendation would be to either update the Intel compiler or look for another compiler to use.
 
My Intel compiler version is 11.1, which should satisfy the CESM requirement (the user's guide requires ifort Intel 10.1.018). Has the model been tested with Intel version 11.x?
I also tested other resolutions in serial mode, and the run succeeds at hgrid 4x5. Resolutions finer than 1.9x2.5 also report an error in the land model. For example, at hgrid 0.9x1.25 it reports the following:
surfrd_wtxy_veg_all ERROR: sum(pct) over numpft+1 is not = 100.
ENDRUN: called without a message string
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image PC Routine Line Source
libc.so.6 00002B16184EA520 Unknown Unknown Unknown
cam 0000000004043B9 Unknown Unknown Unknown
yhrun: error: cn3749: task 0: Exited with exit code 174
Does this error mean anything? Is it possible to avoid the land model problem when running CAM5 standalone?
Thanks a lot!
 

eaton

CSEG and Liaisons
Unfortunately "compiler version X works" does not imply "compiler version Y where Y>X works". I don't know for sure the status of intel-11.1.

The error you report from CLM is coming from line 1439 of file surfrdMod.F90. The suggestion I got from a colleague was to comment out the "call endrun()" statement in line 1441 and see if that works. This check has been removed in the version of surfrdMod.F90 that was just released in CESM-1.1.
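
A cautious way to apply that edit is to comment out the line only after confirming it really contains the endrun call, since line numbers can differ between source versions. The following is a minimal sketch for a POSIX-style shell; the function name comment_out_endrun is hypothetical, and the location of surfrdMod.F90 in your source tree is left for you to supply.

```shell
# Hypothetical helper: prefix line N of a Fortran source file with "!"
# only if that line contains "call endrun". Keeps a backup in FILE.orig.
comment_out_endrun() {
    src="$1"
    lineno="$2"
    target=$(sed -n "${lineno}p" "$src")
    case "$target" in
        *"call endrun"*) ;;
        *)
            echo "line $lineno of $src is not a 'call endrun' statement; nothing changed" >&2
            return 1
            ;;
    esac
    cp "$src" "$src.orig"
    # Insert the Fortran comment character at the start of the chosen line.
    sed "${lineno}s/^/!/" "$src.orig" > "$src"
}
```

With the line number quoted above, the call would look like comment_out_endrun /path/to/surfrdMod.F90 1441. If the function refuses to change the file, inspect the lines around the "sum(pct) over numpft+1" message to find where the call sits in your version.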

Generally "CAM5 standalone" implies a configuration which includes the active CLM land component. It is possible to run without land, in an aquaplanet configuration for example.
 