CESM2.1.1 on Gaea of GFDL

Dear All:

After Gaea's batch system changed from PBS to SLURM, along with other upgrades, I have been trying to run CESM2.1.1 on Gaea at NOAA GFDL. I used pgi for COMPILER, mpt for MPILIBS, and the cray-netcdf-hdf5parallel, cray-hdf5-parallel, and cray-parallel-netcdf modules. The build went smoothly. However, the run failed with the following error message:
...
0: 234 nid00047
0: 235 nid00047
0: 236 nid00047
0: 237 nid00047
1: Assertion failed in file /notbackedup/tmp/ulib/mpt_base/mpich2/src/mpi/romio/adio/ad_cray/ad_cray_adio_open.c at line 470: _mpiio_o_lov_delay_create != 0
1: Rank 1 [Thu Oct 3 12:50:20 2019] [c0-0c0s2n1] internal ABORT - process 1
srun: error: nid00009: task 1: Aborted
srun: Terminating job step 201571108.0
...

I tried a few different MPI settings but could not resolve the error. My config_batch.xml and config_machines.xml entries follow (the forum does not allow me to attach the *.xml files):

***********************
* config_batch.xml *
***********************
<!-- gaea is SLURM -->
<batch_system MACH="gaea" type="slurm" >
  <batch_submit>sbatch</batch_submit>
  <submit_args>
    <arg flag="--time" name="$JOB_WALLCLOCK_TIME"/>
    <arg flag="-q" name="$JOB_QUEUE"/>
  </submit_args>
  <directives>
    <directive>-p batch</directive>
    <directive>--clusters c3</directive>
    <directive>-A cpo_ngrr_e</directive>
  </directives>
  <queues>
    <queue walltimemax="06:00:00" nodemin="1" nodemax="710">normal</queue>
    <!-- <queue walltimemax="00:30:00" nodemin="1" nodemax="3072" default="true">debug</queue> -->
  </queues>
</batch_system>


****************************
* config_machines.xml *
****************************
<machine MACH="gaea">
  <DESC>NOAA XE6, os is CNL, 32 pes/node, batch system is SLURM</DESC>
  <OS>CNL</OS>
  <COMPILERS>pgi,intel</COMPILERS>
  <MPILIBS>mpt,mpich</MPILIBS>
  <CIME_OUTPUT_ROOT>/lustre/f2/dev/Aaron.Wang</CIME_OUTPUT_ROOT>
  <DIN_LOC_ROOT>/lustre/f2/dev/Aaron.Wang/cesm_input</DIN_LOC_ROOT>
  <DIN_LOC_ROOT_CLMFORC>/lustre/f2/dev/Aaron.Wang/cesm_input</DIN_LOC_ROOT_CLMFORC>
  <DOUT_S_ROOT>/lustre/f2/dev/Aaron.Wang/archive/$CASE</DOUT_S_ROOT>
  <BASELINE_ROOT>UNSET</BASELINE_ROOT>
  <CCSM_CPRNC>UNSET</CCSM_CPRNC>
  <GMAKE_J> 8</GMAKE_J>
  <BATCH_SYSTEM>slurm</BATCH_SYSTEM>
  <SUPPORTED_BY>Aaron.Wang -at- ucar.edu</SUPPORTED_BY>
  <MAX_TASKS_PER_NODE>32</MAX_TASKS_PER_NODE>
  <MAX_MPITASKS_PER_NODE>32</MAX_MPITASKS_PER_NODE>
  <mpirun mpilib="default">
    <executable>srun</executable>
    <arguments>
      <arg name="label"> --label</arg>
      <arg name="num_tasks" > -n {{ total_tasks }}</arg>
      <arg name="binding"> -c {{ srun_binding }} --cpu_bind=cores</arg>
    </arguments>
  </mpirun>
  <module_system type="module">
    <init_path lang="perl">/opt/modules/default/init/perl.pm</init_path>
    <init_path lang="python">/opt/modules/default/init/python.py</init_path>
    <init_path lang="csh">/opt/modules/default/init/csh</init_path>
    <init_path lang="sh">/opt/modules/default/init/sh</init_path>
    <cmd_path lang="perl">/opt/modules/default/bin/modulecmd perl</cmd_path>
    <cmd_path lang="python">/opt/modules/default/bin/modulecmd python</cmd_path>
    <cmd_path lang="csh">module</cmd_path>
    <cmd_path lang="sh">module</cmd_path>
    <modules>
      <command name="rm">PrgEnv-pgi</command>
      <command name="rm">PrgEnv-cray</command>
      <command name="rm">PrgEnv-gnu</command>
      <command name="rm">PrgEnv-intel</command>
      <command name="rm">pgi</command>
      <command name="rm">cray</command>
    </modules>
    <modules compiler="pgi">
      <command name="load">PrgEnv-pgi</command>
      <command name="load">pgi</command>
    </modules>
    <modules compiler="gnu">
      <command name="load">PrgEnv-gnu</command>
      <command name="load">torque</command>
    </modules>
    <modules compiler="cray">
      <command name="load">PrgEnv-cray/4.0.36</command>
      <command name="load">cce/8.0.2</command>
    </modules>
    <modules>
      <command name="load">cmake</command>
    </modules>
    <modules mpilib="!mpi-serial">
      <command name="rm">netcdf</command>
      <command name="rm">hdf5</command>
      <command name="rm">cray-netcdf</command>
      <command name="load">cray-netcdf-hdf5parallel</command>
      <command name="load">cray-hdf5-parallel</command>
      <command name="load">cray-parallel-netcdf</command>
    </modules>
    <modules mpilib="mpi-serial">
      <command name="rm">netcdf</command>
      <command name="rm">hdf5</command>
      <command name="load">cray-hdf5</command>
      <command name="load">cray-netcdf</command>
    </modules>
  </module_system>
  <environment_variables>
    <env name="OMP_STACKSIZE">128M</env>
    <env name="MPICH_ENV_DISPLAY">1</env>
    <env name="MPICH_VERSION_DISPLAY">1</env>
  </environment_variables>
</machine>


Does anyone know how to solve this problem? I would really appreciate your help. Thank you very much.

Best Regards,
Aaron
 