
CESM won't run on Ranger with nonzero rootpe

bitz@uw_edu

New Member
I am able to run CESM (1deg CCSM4 physics with the cesm1_0 code base) on Ranger with 128 and 256 processors, provided the rootpe's are all zero. When I try to have the ocean run all the time, that is, concurrently on its own processors (ROOTPE_OCN=448), the run gets to the ocn initialization and dies without a word in any of the log files. The environment variables are all set correctly in the run script, so it produces the correct summary (see below). I'm happy to provide more info if you tell me what is useful.

Here is some stuff I thought might be relevant:

Ranger has Opteron chips, and I am using PGI and MVAPICH:
/opt/apps/pgi7_2/mvapich/1.0.1/bin/mpif90

The run command is "ibrun ./ccsm.exe"

/share/sge6.2/default/pe_scripts/getmode.sh
mvapich1_ssh

Do I need MPI-2?

set MODELS = ( cpl atm lnd ice ocn glc )
set COMPONENTS = ( cpl cam clm cice pop2 sglc )
set NTASKS = ( 320 448 128 320 64 1 )
set NTHRDS = ( 1 1 1 1 1 1 )
set ROOTPE = ( 0 0 320 0 448 0 )
set PSTRID = ( 1 1 1 1 1 1 )
 

eaton

CSEG and Liaisons
You don't need MPI-2.

The pe layout that you describe looks like it's designed for 512 processors, but you only stated that you could run with 128 and 256 procs with all the root pes set to zero. Can you also run with 512 procs and all the root pes set to 0? Can you run with a pe layout appropriate for 256 procs and run the ocn concurrently with atm/lnd/ice?
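The 512 figure can be checked with a little arithmetic on the posted layout. The following is a minimal Python sketch, not part of the CESM scripts: the helper names `rank_ranges` and `total_tasks` are hypothetical, and it assumes each component occupies MPI ranks ROOTPE, ROOTPE+PSTRID, ... for NTASKS tasks, with NTHRDS = 1 everywhere as in the posted summary.

```python
# Values copied from the layout summary posted above.
MODELS = ["cpl", "atm", "lnd", "ice", "ocn", "glc"]
NTASKS = [320, 448, 128, 320, 64, 1]
ROOTPE = [0, 0, 320, 0, 448, 0]
PSTRID = [1, 1, 1, 1, 1, 1]


def rank_ranges(models, ntasks, rootpe, pstrid):
    """Return {model: (first_rank, last_rank)} for each component.

    Assumption: a component's tasks sit on ranks
    rootpe, rootpe + pstrid, ..., rootpe + (ntasks - 1) * pstrid.
    """
    return {
        m: (r, r + (n - 1) * s)
        for m, n, r, s in zip(models, ntasks, rootpe, pstrid)
    }


def total_tasks(ranges):
    """Smallest number of MPI tasks that covers every component."""
    return max(last for _, last in ranges.values()) + 1


ranges = rank_ranges(MODELS, NTASKS, ROOTPE, PSTRID)
for model, (first, last) in ranges.items():
    print(f"{model}: ranks {first}-{last}")
print("MPI tasks required:", total_tasks(ranges))
```

Under these assumptions the ocn lands on ranks 448-511, after the atm (0-447), so the layout as posted needs 512 MPI tasks and would not fit in a 128- or 256-processor job.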
 

bitz@uw_edu

New Member
I have confirmed that I can run with 512 processors with all rootpe=0. I cannot run with nonzero rootpe with either 256 or 512 processors. Finally, it is true that MPI-2 did not help. In all cases with a nonzero rootpe for the ocn, the model fails just after the ocn.log says "Initializing diagnostic BSF variables ..." and the ccsm.log says "(init_tavg) tavg_streams ...". Then the ccsm.log has "MPI process terminated unexpectedly". I didn't add any statements to flush the output buffer, so the location of these last messages may be a red herring.

I'm not going to pursue this further. I'm satisfied to just be able to run at this point.
 