Run failure in previously successful cases

polly

Polly Thornton
New Member
I'm getting errors in CLM-FATES cases on Derecho that ran successfully on Feb 23. The land and datm models initialize successfully. The CESM log files are hard (impossible for me) to interpret:

dec1786.hsn.de.hpc.ucar.edu 17: MPICH ERROR [Rank 17] [job id d19fa325-7722-4290-a09a-5182cd0be3ca] [Thu Feb 29 16:48:12 2024] [dec1786] - Abort(1) (rank 17 in comm 496): application called MPI_Abort(comm=0x84000001, 1) - process 17
dec1786.hsn.de.hpc.ucar.edu 17:
dec1786.hsn.de.hpc.ucar.edu 16: forrtl: severe (174): SIGSEGV, segmentation fault occurred
dec1786.hsn.de.hpc.ucar.edu 16: Image PC Routine Line Source
dec1786.hsn.de.hpc.ucar.edu 16: libpthread-2.31.s 000014C115F3B8C0 Unknown Unknown Unknown
dec1786.hsn.de.hpc.ucar.edu 16: libmpi_intel.so.1 000014C113EFAE7E Unknown Unknown Unknown
dec1786.hsn.de.hpc.ucar.edu 16: libmpi_intel.so.1 000014C113D0922F Unknown Unknown Unknown
dec1786.hsn.de.hpc.ucar.edu 16: libmpi_intel.so.1 000014C1123366A8 MPI_Abort Unknown Unknown
dec1786.hsn.de.hpc.ucar.edu 16: libesmf.so 000014C11DF1F1D7 _ZN5ESMCI3VMK5abo Unknown Unknown
dec1786.hsn.de.hpc.ucar.edu 16: libesmf.so 000014C11DF1D9F4 _ZN5ESMCI2VM5abor Unknown Unknown
dec1786.hsn.de.hpc.ucar.edu 16: libesmf.so 000014C11DF32E45 c_esmc_vmabort_ Unknown Unknown
dec1786.hsn.de.hpc.ucar.edu 16: libesmf.so 000014C11E720868 esmf_vmmod_mp_esm Unknown Unknown
dec1786.hsn.de.hpc.ucar.edu 16: libesmf.so 000014C11E5A751A esmf_initmod_mp_e Unknown Unknown
dec1786.hsn.de.hpc.ucar.edu 16: cesm.exe 00000000004329FB MAIN__ 145 esmApp.F90
dec1786.hsn.de.hpc.ucar.edu 16: cesm.exe 000000000042217D Unknown Unknown Unknown
dec1786.hsn.de.hpc.ucar.edu 16: libc-2.31.so 000014C11182D29D __libc_start_main Unknown Unknown
dec1786.hsn.de.hpc.ucar.edu 16: cesm.exe 00000000004220AA Unknown Unknown Unknown

One case is here
/glade/u/home/pbuotte/Earthshot/fates_cases/derecho/DryBrazil_7BET_2BDT_coffee
with log files here
/glade/derecho/scratch/pbuotte/glade/derecho/scratch/pbuotte/run

I appreciate any insights on what these errors mean.
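
For readers facing a similar interleaved cesm log, here is a minimal Python sketch that groups lines by MPI rank so each rank's messages can be read together. The prefix format `<hostname> <rank>: <message>` is an assumption inferred from the excerpt above, and `group_by_rank` is a hypothetical helper name, not part of CESM or CIME.

```python
import re
from collections import defaultdict

# Assumed prefix format, inferred from the log excerpt: "<hostname> <rank>: <message>"
PREFIX = re.compile(r"^(\S+)\s+(\d+):\s?(.*)$")


def group_by_rank(lines):
    """Return {rank: [messages]} for lines that match the assumed prefix."""
    by_rank = defaultdict(list)
    for line in lines:
        match = PREFIX.match(line.rstrip("\n"))
        if match:
            _host, rank, message = match.groups()
            by_rank[int(rank)].append(message)
    return dict(by_rank)


# Three lines copied from the log excerpt in this post
sample = [
    "dec1786.hsn.de.hpc.ucar.edu 17:",
    "dec1786.hsn.de.hpc.ucar.edu 16: forrtl: severe (174): SIGSEGV, segmentation fault occurred",
    "dec1786.hsn.de.hpc.ucar.edu 16: cesm.exe 00000000004329FB MAIN__ 145 esmApp.F90",
]

grouped = group_by_rank(sample)
for rank in sorted(grouped):
    print(f"rank {rank}: {len(grouped[rank])} line(s)")
```

In the excerpt, rank 17 reports the MPI_Abort call while rank 16 carries the traceback, so reading one rank at a time makes it easier to see that the traceback passes through the ESMF abort path called from esmApp.F90.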
 

oleson

Keith Oleson
CSEG and Liaisons
Staff member
There was a leap-day bug today that stopped every run using a NO_LEAP calendar:


However, I don't see the relevant errors in your PET log, so that may not be the cause.
The pull request to fix this is here:

 

oleson

Keith Oleson
CSEG and Liaisons
Staff member
The errors in the PET log file of this type:

.0b04-un2qwjvc54ac5lwa63x62gwgaxfhswp5/spack-src/src/Infrastructure/Trace/src/ESMCI_Trace.C:1797 ESMCI::TraceEventRegionExit() Wrong argument specified - Trace regions not properly nested. Attempt to exit region: CNZero-vegbgc-nflux that was never entered.

are caused by this:


but I don't think this actually stops the model.
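
Since these trace-nesting messages can crowd out everything else in a PET log, a small Python sketch like the following could set them aside and show what remains. The marker string is taken from the message quoted above; `split_pet_log_lines` is a hypothetical helper, and the placeholder line in the sample is not real ESMF output.

```python
# Marker text copied from the trace-nesting message quoted in this post
BENIGN_MARKER = "Trace regions not properly nested"


def split_pet_log_lines(lines):
    """Split log lines into (trace-nesting messages, everything else)."""
    nesting, remaining = [], []
    for line in lines:
        text = line.strip()
        if not text:
            continue  # skip blank lines
        if BENIGN_MARKER in text:
            nesting.append(text)
        else:
            remaining.append(text)
    return nesting, remaining


sample = [
    "ESMCI::TraceEventRegionExit() Wrong argument specified - "
    "Trace regions not properly nested. Attempt to exit region: "
    "CNZero-vegbgc-nflux that was never entered.",
    "",
    "PLACEHOLDER: any other log line (not real ESMF output)",
]

nesting, remaining = split_pet_log_lines(sample)
print(len(nesting), "trace-nesting message(s) set aside")
for text in remaining:
    print(text)
```

To apply it to a real file, pass the open file object (for example the PET000.ESMF_LogFile in the run directory) as `lines`. Whatever is left in `remaining` is the better place to look for the error that actually stopped the run.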
 

polly

Polly Thornton
New Member
I am running with a NO_LEAP calendar, so maybe that's it. None of the cases I've tried today can run successfully.
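
As a rough illustration of why a leap day can collide with a NO_LEAP calendar, the Python sketch below checks whether a date exists in a 365-day calendar. It is only an illustration under that assumption, not the confirmed mechanism of the bug; `is_valid_noleap_date` is a hypothetical helper, and the date is the one in the MPICH error timestamp in the first post.

```python
from datetime import date

# Days per month in a no-leap (365-day) calendar: February always has 28 days.
NOLEAP_DAYS_IN_MONTH = (31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31)


def is_valid_noleap_date(year, month, day):
    """Return True if month/day exists in a no-leap calendar (year is unused)."""
    if not 1 <= month <= 12:
        return False
    return 1 <= day <= NOLEAP_DAYS_IN_MONTH[month - 1]


# Wall-clock date of the failed run, from the MPICH error timestamp
failure_day = date(2024, 2, 29)

print(is_valid_noleap_date(failure_day.year, failure_day.month, failure_day.day))
# prints False: 29 February is a valid Gregorian date but not a valid no-leap date
```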
 

Yuan Sun

Yuan Sun
Member
Interesting. I also found this issue in my run/PET000.ESMF_LogFile.
 

Attachment: PET000.ESMF_LogFile.txt (6.7 KB)

slevis

Moderator
Link to related post
 