Running CESM in parallel

TCNasa

Tom Caldwell
Member
Any time I set NTHRDS greater than 1, my CESM code crashes with errors like this.
Any suggestions?

--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
orterun noticed that process rank 2 with PID 0 on node bn12 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
[bn12:10058] PMIX ERROR: NO-PERMISSIONS in file dstore_base.c at line 237
[bn12:10058] PMIX ERROR: NO-PERMISSIONS in file dstore_base.c at line 246
 

jedwards

CSEG and Liaisons
Staff member
A couple of answers:
1. CESM does not perform as well with OpenMP threading as with MPI. Whenever possible, use MPI for parallelism instead of OpenMP.
2. How threading is set up in the config_batch.xml file is very machine dependent. If possible, seek help from a local system administrator.

Is this happening before any cesm.log file is written? Can you run a simple test with threading, for example a hello world program?
 

TCNasa

Tom Caldwell
Member
Is there a setting to switch from OpenMP to MPI?
I don't know precisely when it happens. The error messages are from the cesm log file.
 

TCNasa

Tom Caldwell
Member
bn12:/CERES/sarb/caldwell/CESM2.2 {187} ./describe_version
------------------------------------------------------------------------
git describe:
cesm2.2.0-0-g332937b
------------------------------------------------------------------------

These commands are used to set the task and thread values:
./xmlchange --id MAX_TASKS_PER_NODE --val 2
./xmlchange --id MAX_MPITASKS_PER_NODE --val 2
./xmlchange NTHRDS=2,NTASKS=6
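
As a rough sanity check on the settings above, the layout arithmetic can be sketched as follows. It assumes that MAX_TASKS_PER_NODE caps tasks times threads on a node and that MAX_MPITASKS_PER_NODE caps MPI tasks alone. The helper names tasks_per_node and nodes_needed are made up for illustration and are not CIME commands, so treat the result as an estimate rather than what the case actually requests.

```shell
#!/bin/sh
# Hypothetical helpers; the formula is an assumption about how the
# per-node limits combine, not something read back from the case.

# Usage: tasks_per_node MAX_TASKS_PER_NODE MAX_MPITASKS_PER_NODE NTASKS NTHRDS
tasks_per_node() {
    tpn=$(( $1 / $4 ))                        # slots left once each task holds NTHRDS threads
    if [ "$2" -lt "$tpn" ]; then tpn=$2; fi   # cap by the MPI-tasks-per-node limit
    if [ "$3" -lt "$tpn" ]; then tpn=$3; fi   # never more than the total task count
    if [ "$tpn" -lt 1 ]; then tpn=1; fi       # always place at least one task per node
    echo "$tpn"
}

# Usage: nodes_needed NTASKS TASKS_PER_NODE  (ceiling division)
nodes_needed() {
    echo $(( ($1 + $2 - 1) / $2 ))
}

# Values from the xmlchange commands above.
tpn=$(tasks_per_node 2 2 6 2)
echo "MPI tasks per node: $tpn"
echo "nodes needed:       $(nodes_needed 6 "$tpn")"
echo "threads in total:   $(( 6 * 2 ))"
```

If that assumption holds, these settings place 1 MPI task on each of 6 nodes, for 12 threads in total. Comparing those numbers with what the batch job was actually given is one way to check the layout.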

The browser won't let me select the log files for upload.
 

TCNasa

Tom Caldwell
Member
Yes, a run with NTHRDS=1 works, but I also can't increase NTASKS above 12 without failure.
 