CESM2.1.1 on Gaea at GFDL

Dear All:

After Gaea's batch system changed from PBS to SLURM (along with other upgrades), I am trying to run CESM2.1.1 on Gaea at NOAA GFDL. I am using the pgi compiler, the mpt MPI library, and the cray-netcdf-hdf5parallel, cray-hdf5-parallel, and cray-parallel-netcdf modules. The build completes smoothly, but the run fails with the following error message:
...
0: 234 nid00047
0: 235 nid00047
0: 236 nid00047
0: 237 nid00047
1: Assertion failed in file /notbackedup/tmp/ulib/mpt_base/mpich2/src/mpi/romio/adio/ad_cray/ad_cray_adio_open.c at line 470: _mpiio_o_lov_delay_create != 0
1: Rank 1 [Thu Oct 3 12:50:20 2019] [c0-0c0s2n1] internal ABORT - process 1
srun: error: nid00009: task 1: Aborted
srun: Terminating job step 201571108.0
...
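Not a fix, but for anyone scanning long job logs for this class of failure: MPT prints aborts in a fixed format, so a small helper can pull out the failing assertion and its source location. This is just a generic sketch I wrote for illustration (the regex and function name are my own, not part of CESM or MPT):

```python
import re

# MPT reports assertion aborts as:
#   <rank>: Assertion failed in file <path> at line <N>: <condition>
_ASSERT_RE = re.compile(
    r"^\s*(?P<rank>\d+):\s*Assertion failed in file (?P<file>\S+) "
    r"at line (?P<line>\d+):\s*(?P<cond>.+)$"
)

def find_assertions(log_text):
    """Return (rank, file, line, condition) tuples for MPT assertion aborts."""
    hits = []
    for line in log_text.splitlines():
        m = _ASSERT_RE.match(line)
        if m:
            hits.append((int(m.group("rank")), m.group("file"),
                         int(m.group("line")), m.group("cond").strip()))
    return hits
```

Running it over the log above pulls out the `_mpiio_o_lov_delay_create != 0` assertion from `ad_cray_adio_open.c` at line 470.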

I tried a few different MPI settings but couldn't figure it out. Below are my config_batch.xml and config_machines.xml (the forum does not allow me to attach the *.xml files):

***********************
* config_batch.xml *
***********************
<!-- gaea is SLURM -->
<batch_system MACH="gaea" type="slurm">
  <batch_submit>sbatch</batch_submit>
  <submit_args>
    <arg flag="--time" name="$JOB_WALLCLOCK_TIME"/>
    <arg flag="-q" name="$JOB_QUEUE"/>
  </submit_args>
  <directives>
    <directive>-p batch</directive>
    <directive>--clusters c3</directive>
    <directive>-A cpo_ngrr_e</directive>
  </directives>
  <queues>
    <queue walltimemax="06:00:00" nodemin="1" nodemax="710">normal</queue>
    <!-- <queue walltimemax="00:30:00" nodemin="1" nodemax="3072" default="true">debug</queue> -->
  </queues>
</batch_system>
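For context, my understanding is that the `<directives>` above become `#SBATCH` header lines in the generated job script, while the `<submit_args>` become flags on the `sbatch` command itself. A rough sketch of that mapping (this is not CIME's actual code; the resolved values and the script name are made up for illustration):

```python
def render_batch(directives, submit_args, script="case.run"):
    """Illustrate how batch directives become a job-script header
    and how submit args become flags on the sbatch command line."""
    header = "\n".join(["#!/bin/bash"] + [f"#SBATCH {d}" for d in directives])
    flags = " ".join(f"{flag} {value}" for flag, value in submit_args)
    return header, f"sbatch {flags} {script}"

header, cmd = render_batch(
    directives=["-p batch", "--clusters c3", "-A cpo_ngrr_e"],
    # Example resolved values for $JOB_WALLCLOCK_TIME and $JOB_QUEUE:
    submit_args=[("--time", "06:00:00"), ("-q", "normal")],
)
```

So with these settings the job should be submitted with something like `sbatch --time 06:00:00 -q normal case.run`, and the script header should carry the partition, cluster, and account directives.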


****************************
* config_machines.xml *
****************************
<machine MACH="gaea">
  <DESC>NOAA XE6, os is CNL, 32 pes/node, batch system is SLURM</DESC>
  <OS>CNL</OS>
  <COMPILERS>pgi,intel</COMPILERS>
  <MPILIBS>mpt,mpich</MPILIBS>
  <CIME_OUTPUT_ROOT>/lustre/f2/dev/Aaron.Wang</CIME_OUTPUT_ROOT>
  <DIN_LOC_ROOT>/lustre/f2/dev/Aaron.Wang/cesm_input</DIN_LOC_ROOT>
  <DIN_LOC_ROOT_CLMFORC>/lustre/f2/dev/Aaron.Wang/cesm_input</DIN_LOC_ROOT_CLMFORC>
  <DOUT_S_ROOT>/lustre/f2/dev/Aaron.Wang/archive/$CASE</DOUT_S_ROOT>
  <BASELINE_ROOT>UNSET</BASELINE_ROOT>
  <CCSM_CPRNC>UNSET</CCSM_CPRNC>
  <GMAKE_J>8</GMAKE_J>
  <BATCH_SYSTEM>slurm</BATCH_SYSTEM>
  <SUPPORTED_BY>Aaron.Wang -at- ucar.edu</SUPPORTED_BY>
  <MAX_TASKS_PER_NODE>32</MAX_TASKS_PER_NODE>
  <MAX_MPITASKS_PER_NODE>32</MAX_MPITASKS_PER_NODE>
  <mpirun mpilib="default">
    <executable>srun</executable>
    <arguments>
      <arg name="label"> --label</arg>
      <arg name="num_tasks"> -n {{ total_tasks }}</arg>
      <arg name="binding"> -c {{ srun_binding }} --cpu_bind=cores</arg>
    </arguments>
  </mpirun>
  <module_system type="module">
    <init_path lang="perl">/opt/modules/default/init/perl.pm</init_path>
    <init_path lang="python">/opt/modules/default/init/python.py</init_path>
    <init_path lang="csh">/opt/modules/default/init/csh</init_path>
    <init_path lang="sh">/opt/modules/default/init/sh</init_path>
    <cmd_path lang="perl">/opt/modules/default/bin/modulecmd perl</cmd_path>
    <cmd_path lang="python">/opt/modules/default/bin/modulecmd python</cmd_path>
    <cmd_path lang="csh">module</cmd_path>
    <cmd_path lang="sh">module</cmd_path>
    <modules>
      <command name="rm">PrgEnv-pgi</command>
      <command name="rm">PrgEnv-cray</command>
      <command name="rm">PrgEnv-gnu</command>
      <command name="rm">PrgEnv-intel</command>
      <command name="rm">pgi</command>
      <command name="rm">cray</command>
    </modules>
    <modules compiler="pgi">
      <command name="load">PrgEnv-pgi</command>
      <command name="load">pgi</command>
    </modules>
    <modules compiler="gnu">
      <command name="load">PrgEnv-gnu</command>
      <command name="load">torque</command>
    </modules>
    <modules compiler="cray">
      <command name="load">PrgEnv-cray/4.0.36</command>
      <command name="load">cce/8.0.2</command>
    </modules>
    <modules>
      <command name="load">cmake</command>
    </modules>
    <modules mpilib="!mpi-serial">
      <command name="rm">netcdf</command>
      <command name="rm">hdf5</command>
      <command name="rm">cray-netcdf</command>
      <command name="load">cray-netcdf-hdf5parallel</command>
      <command name="load">cray-hdf5-parallel</command>
      <command name="load">cray-parallel-netcdf</command>
    </modules>
    <modules mpilib="mpi-serial">
      <command name="rm">netcdf</command>
      <command name="rm">hdf5</command>
      <command name="load">cray-hdf5</command>
      <command name="load">cray-netcdf</command>
    </modules>
  </module_system>
  <environment_variables>
    <env name="OMP_STACKSIZE">128M</env>
    <env name="MPICH_ENV_DISPLAY">1</env>
    <env name="MPICH_VERSION_DISPLAY">1</env>
  </environment_variables>
</machine>
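In case it helps anyone spot a problem with my `<mpirun>` block: my understanding is that the `{{ ... }}` placeholders get substituted with per-case values before launch. A minimal sketch of that substitution (this is not CIME's real template engine; the task count and binding values below are illustrative, not from my case):

```python
import re

def expand(template, values):
    """Replace {{ name }} placeholders with values from a dict."""
    return re.sub(r"\{\{\s*(\w+)\s*\}\}",
                  lambda m: str(values[m.group(1)]), template)

# The three <arg> bodies from the <mpirun> block above:
args = ["--label", "-n {{ total_tasks }}", "-c {{ srun_binding }} --cpu_bind=cores"]
values = {"total_tasks": 64, "srun_binding": 1}  # illustrative values
cmd = "srun " + " ".join(expand(a, values) for a in args)
# cmd == "srun --label -n 64 -c 1 --cpu_bind=cores"
```

So for a hypothetical 64-task case the launch line would come out as `srun --label -n 64 -c 1 --cpu_bind=cores`, which looks right to me; the failure happens after launch, inside MPI-IO.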


Does anyone know how to solve this problem? I'd really appreciate your help. Thank you very much.

Best Regards,
Aaron
 