
MPI error in CESM1.2.2

Hi all,

I am trying to run an experiment in which I impose an atmospheric heating anomaly in the radiative heating subroutine of CAM5, and I am running into some issues. I have previously run hundreds of similar experiments successfully, but I am now attempting some additional experiments for the first time in six months and I am getting error messages. I am using the FC5 compset of CESM1.2.2 on the Australian supercomputer (Gadi). I can run a nine-month control experiment without the imposed anomaly, but when I include the anomaly the job fails almost immediately. I have tried re-submitting the job, but the error message changes each time.

On the first and second attempts, I get the following error:
Code:
CalcWorkPerBlock: Total blocks:  1152 Ice blocks:  1152 IceFree blocks:     0 Land blocks:     0
malloc(): invalid size (unsorted)

On the third attempt, I get this error:
Code:
MCT::m_Router::initp_: GSMap indices not increasing...Will correct
(seq_frac_check) [ice set] ERROR aborting
(shr_sys_abort) WARNING: calling shr_mpi_abort() and stopping
(seq_frac_check) [ice set] ERROR aborting
(shr_sys_abort) WARNING: calling shr_mpi_abort() and stopping
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 137 in communicator MPI_COMM_WORLD
with errorcode 1001.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.

And on the fourth attempt, I get this error:
Code:
MCT::m_Router::initp_: GSMap indices not increasing...Will correct
MCT::m_Rearranger::Rearrange_: TargetAV size is not appropriate for this Rearranger
MCT::m_Rearranger::Rearrange_: error, InRearranger%RecvRouter%lAvsize=121, AttrVect_lsize(TargetAV)=0.
022.MCT(MPEU)::die.: from MCT::m_Rearranger::Rearrange_()
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 34 in communicator MPI_COMM_WORLD
with errorcode 2.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
[gadi-cpu-clx-1039:1706032] *** An error occurred in MPI_Isend
[gadi-cpu-clx-1039:1706032] *** reported by process [2884501505,28]
[gadi-cpu-clx-1039:1706032] *** on communicator MPI COMMUNICATOR 33 DUP FROM 0
[gadi-cpu-clx-1039:1706032] *** MPI_ERR_OTHER: known error not in list
[gadi-cpu-clx-1039:1706032] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[gadi-cpu-clx-1039:1706032] ***    and potentially your MPI job)

I have attached the CESM log files, Macros, env_mach_specific and env_mach_pes.xml.

I have never received a single error when running these experiments before, nor have I made any changes to the experimental setup or source code, so I am not sure what the problem is. Do you have any suggestions for what I could try?

Thanks very much for your help,
Zoe
 

Attachments

  • cesm.log.220527-141415.txt (377.9 KB)
  • cesm.log.220527-144203.txt (401.9 KB)
  • cesm.log.220527-145447.txt (408.3 KB)
  • env_mach_pes.txt (5.8 KB)
  • env_mach_pes.xml.txt (6.8 KB)
  • env_mach_specific.txt (965 bytes)

cacraig

Cheryl Craig
CSEG and Liaisons
Staff member
Unfortunately, CESM1 is no longer a supported release. See: CESM Support Policy

That said, I can give you some general advice which may or may not be useful.

This kind of CICE issue has usually been resolved by changing the PE layout. I have a vague memory of CICE having problems when the PE layout was not appropriate for the grid size, though unfortunately I don't remember the specifics.
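
As a very rough illustration of what I mean by the PE layout versus the grid size (this is only a sketch: the numbers are placeholders and CICE_BLCKX / CICE_BLCKY / CICE_MXBLCKS are the env_build.xml names I remember from the CESM1 era, so check them against your own case), a quick back-of-the-envelope check is whether the block decomposition can actually hold the ice grid:
Code:
import math

# Back-of-the-envelope check: can the CICE block decomposition hold the grid?
# All values are placeholders -- substitute the ones from your own
# env_mach_pes.xml / env_build.xml.
nx_global, ny_global = 320, 384   # ice/ocean grid dimensions (e.g. gx1v6)
ntasks_ice = 144                  # NTASKS_ICE
blckx, blcky = 20, 24             # CICE_BLCKX, CICE_BLCKY
max_blocks = 4                    # CICE_MXBLCKS

# Blocks needed to tile the grid, versus blocks the PE layout can hold
blocks_needed = math.ceil(nx_global / blckx) * math.ceil(ny_global / blcky)
capacity = ntasks_ice * max_blocks

print(f"blocks needed: {blocks_needed}, capacity: {capacity}")
if blocks_needed > capacity:
    print("PE layout cannot hold the grid -> adjust NTASKS_ICE, "
          "CICE_MXBLCKS, or the block sizes")

A mismatch there is the sort of thing that could surface very early in the run, around the CalcWorkPerBlock output in your first log.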

For the third and fourth attempts, the MPI errors may not be the true problem; more likely they just indicate that the model is erroring out and MPI is being halted. I would look in the other log files to see if there is any indication of what is actually going wrong.
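
For example, something along these lines (just a sketch; the run-directory path and search strings are made up, so adapt them) will pull out the first error-like line from each component log, which is usually more informative than the MPI_ABORT messages:
Code:
# Scan the component logs in the run directory for the first error-like line;
# the real failure usually appears before the MPI_ABORT noise.
from pathlib import Path

run_dir = Path("/path/to/your/case/run")   # placeholder path
patterns = ("ERROR", "ABORT", "abort", "endrun", "forrtl", "Traceback")

for log in sorted(run_dir.glob("*.log.*")):
    with open(log, errors="replace") as f:
        for lineno, line in enumerate(f, 1):
            if any(p in line for p in patterns):
                print(f"{log.name}:{lineno}: {line.rstrip()}")
                break   # first hit per log; remove this to see every match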

If I were trying to get this to work, I would go back to the experiment without the modification, create it from scratch, and make sure it still runs. Once I had verified that it runs, I would apply one change at a time and check that it continues to run. If it stops running, I would compare the files in the two sandboxes (the one that worked and the one that didn't) and see what differs. PE layout changes can sometimes be the problem, but that is not the only possibility.
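
If it helps, the sandbox comparison can be as simple as a recursive directory comparison between the two case directories, something like this (the paths are placeholders):
Code:
# Minimal sketch: recursively compare a working and a failing case directory
# and report files that differ or exist in only one of them.
import filecmp

def report(dcmp, prefix=""):
    for name in dcmp.diff_files:
        print(f"differs: {prefix}{name}")
    for name in dcmp.left_only:
        print(f"only in working case: {prefix}{name}")
    for name in dcmp.right_only:
        print(f"only in failing case: {prefix}{name}")
    for name, sub in dcmp.subdirs.items():
        report(sub, prefix=f"{prefix}{name}/")

# Placeholder paths -- point these at your two sandboxes
report(filecmp.dircmp("/path/to/working_case", "/path/to/failing_case"))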
 