CESM freeze issue on Perlmutter when running on multiple nodes

jray

Ray Shi
New Member
Hi,

I'm currently working on porting CESM to the Perlmutter system at NERSC. While the model builds and runs successfully on a single node, it consistently freezes without any error message during runtime when I attempt to use multiple nodes. This issue persists across multiple CESM versions that I've tested, including:
  • cesm2_3_beta17
  • cesm3_0_beta3
Has anyone encountered similar problems when running CESM on Perlmutter with more than one node? I would greatly appreciate any suggestions, debugging tips, or working configurations known to the community.

Thanks,
 

jray

Ray Shi
New Member
Hi Jim, thank you for your reply.

It seems that ESMF might be the key difference here. In your config_machines.xml, the module esmf/8.7.0 is listed, along with parallelio/2.6.3, but I had installed ESMF manually. When I try module load esmf/8.7.0 or module load parallelio/2.6.3, the system cannot find these modules.
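
For anyone who wants to reproduce the check, here is a minimal sketch of how one could confirm whether a modulefile is visible at all, assuming the module system resolves names by searching the directories in the MODULEPATH environment variable for either a plain file or a ".lua" file. The helper name find_modulefile is hypothetical, not part of CESM or the module system.

```python
import os


def find_modulefile(name, modulepath=None):
    """Return candidate modulefile paths for `name` (e.g. "esmf/8.7.0").

    Assumption: modulefiles live at <dir>/<name> or <dir>/<name>.lua for
    each directory listed in MODULEPATH.
    """
    if modulepath is None:
        modulepath = os.environ.get("MODULEPATH", "")
    hits = []
    for root in modulepath.split(os.pathsep):
        if not root:
            continue  # skip empty entries such as a trailing ':'
        for candidate in (os.path.join(root, name),
                          os.path.join(root, name + ".lua")):
            if os.path.isfile(candidate):
                hits.append(candidate)
    return hits


if __name__ == "__main__":
    for mod in ("esmf/8.7.0", "parallelio/2.6.3"):
        found = find_modulefile(mod)
        print(mod, "->", found if found else "not found on MODULEPATH")
```

If nothing is found, the directory that provides these modules is probably either missing from MODULEPATH or not readable by the current user.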

Could you let me know how these modules are made available on your system?
 

jray

Ray Shi
New Member
I attempted to access the module path /global/cfs/cdirs/ccsm1/modulefiles/perlmutter/, but it returned a permission denied error.
 

jray

Ray Shi
New Member
It also appears that access to the following directory is required:
/global/cfs/cdirs/ccsm1/sw/perlmutter/

This is because ESMFMKFILE is defined as:
/global/cfs/cdirs/ccsm1/sw/perlmutter/modules/cray-mpich/8.1.28/intel/2023.2.0/esmf/8.7.0/lib/libO/Unicos.intel.64.mpi.default//esmf.mk
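
To narrow down where access is blocked, a small sketch like the following could walk the path from the root and report the first component that is missing or not accessible to the current user. The function name first_blocked_component is hypothetical, and the script assumes ESMFMKFILE is set in the environment.

```python
import os


def first_blocked_component(path):
    """Return the first component of `path` that is missing or inaccessible.

    Intermediate directories only need search (execute) permission; the
    final target must also be readable. Returns None if the whole path
    is accessible.
    """
    path = os.path.abspath(path)  # also collapses doubled slashes
    parts = []
    head = path
    while True:
        parts.append(head)
        parent = os.path.dirname(head)
        if parent == head:  # reached the filesystem root
            break
        head = parent
    for p in reversed(parts):  # check root first, target last
        if not os.path.exists(p):
            return p
        need = os.X_OK if os.path.isdir(p) else os.F_OK
        if p == path:
            need |= os.R_OK
        if not os.access(p, need):
            return p
    return None


if __name__ == "__main__":
    mk = os.environ.get("ESMFMKFILE")
    if mk is None:
        print("ESMFMKFILE is not set")
    else:
        blocked = first_blocked_component(mk)
        print("accessible" if blocked is None else "blocked at: " + blocked)
```

The reported component indicates which directory would need its permissions or group membership changed.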

Thank you
 