
CESM2.1.4 pacemaker experiments stop without a clear reason on Derecho

Jingyi zhuo

New Member
Hi all,

After porting to Derecho, my pacemaker experiments stop after running for some months or years, without clear error information. Could anyone give me some hints or help? Thank you in advance!

[1] Case Status log:
2023-10-15 10:39:33: case.run error ERROR: RUN FAIL: Command 'mpiexec --label -n 768 /glade/derecho/scratch/jingyiz/b.e21.f19_g17.BHIST.H.nudge3/bld/cesm.exe >> cesm.log.$LID 2>&1' failed. See the log file for details: /glade/derecho/scratch/jingyiz/b.e21.f19_g17.BHIST.H.nudge3/run/cesm.log.1458915.desched1.231015-102911

What does the 'mpiexec' error signify? Is it related to the machine settings? I tried modifying env_mach_specific.xml, and although the experiments then run longer, they are still interrupted at some point.

[2] The cesm.log near the 1st error is:
dec0018.hsn.de.hpc.ucar.edu 15: forrtl: error (73): floating divide by zero
Image PC Routine Line Source
libpthread-2.31.s 0000145E2D2C58C0 Unknown Unknown Unknown
cesm.exe 0000000009EDFA2C w3iogomd_mp_w3out 508 w3iogomd.f90
cesm.exe 0000000009E80DB3 w3wavemd_mp_w3wav 859 w3wavemd.f90
cesm.exe 0000000009C7147F wav_comp_mct_mp_w 884 wav_comp_mct.F90
cesm.exe 000000000046CE71 component_mod_mp_ 728 component_mod.F90
cesm.exe 000000000043C9CA cime_comp_mod_mp_ 2746 cime_comp_mod.F90
cesm.exe 000000000045511E MAIN__ 125 cime_driver.F90
cesm.exe 00000000004142BD Unknown Unknown Unknown
libc-2.31.so 0000145E2957829D __libc_start_main Unknown Unknown
cesm.exe 00000000004141EA Unknown Unknown Unknown
dec0018.hsn.de.hpc.ucar.edu: rank 15 died from signal 6

[3] env_mach_specific.xml is also attached. There are no other errors in any of the component log files.

[4] The case path is /glade/work/jingyiz/cases/cesm2_derecho/b.e21.f19_g17.BHIST.H.nudge3
 

Attachments

  • env_mach_specific.xml.txt

jedwards

CSEG and Liaisons
Staff member
This error, "floating divide by zero", indicates that the model is generating bad values.

The traceback points to the wav component (w3iogomd.f90, called from wav_comp_mct.F90), and I think that component has too many tasks.
Try setting NTASKS_WAV=64.
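
For reference, a minimal sketch of how that change might be applied. It assumes the standard CIME case scripts (xmlquery, xmlchange, case.setup) are present in the case directory. The helper name set_wav_tasks is hypothetical and used only for illustration.

```shell
#!/usr/bin/env bash
# Sketch: apply the suggested WAV task count to an existing case.
# Assumes the usual CIME scripts live in the case directory;
# set_wav_tasks is a hypothetical helper, not part of CIME.
set_wav_tasks() {
    local casedir="$1"
    local ntasks="${2:-64}"   # 64 is the value suggested in this reply
    (
        cd "$casedir" || exit 1
        ./xmlquery NTASKS_WAV || exit 1               # show the current value
        ./xmlchange NTASKS_WAV="$ntasks" || exit 1    # set the new task count
        ./case.setup --reset || exit 1                # regenerate the PE layout
    )
}

# Example, using the case path from the original post:
# set_wav_tasks /glade/work/jingyiz/cases/cesm2_derecho/b.e21.f19_g17.BHIST.H.nudge3 64
```

After a change to the PE layout, a rebuild with ./case.build may also be needed before resubmitting the run.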

Other than that I don't have any suggestions.
 