pop segmentation fault for certain processor counts

Leo van Kampenhout

When the OCN component POP is configured to run at a processor count which is not a multiple of 8, a segmentation fault may occur.

Symptom:
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image PC Routine Line Source
cesm.exe 0000000001E6BA2A pop_spacecurvemod 1690 POP_SpaceCurveMod.F90
cesm.exe 00000000020B77E8 distribution_mp_c 607 distribution.F90
cesm.exe 00000000020B316A distribution_mp_c 139 distribution.F90
cesm.exe 0000000001EDA4A7 domain_mp_init_do 438 domain.F90
cesm.exe 0000000001FCD32E initial_mp_pop_in 253 initial.F90
cesm.exe 0000000001E3854C pop_initmod_mp_po 102 POP_InitMod.F90
cesm.exe 0000000001D5F4EE ocn_comp_mct_mp_o 261 ocn_comp_mct.F90
cesm.exe 000000000042EB77 ccsm_comp_mod_mp_ 1130 ccsm_comp_mod.F90
cesm.exe 000000000043627C MAIN__ 90 ccsm_driver.F90
cesm.exe 0000000000411E2C Unknown Unknown Unknown
libc.so.6 0000003CB701ECDD Unknown Unknown Unknown
cesm.exe 0000000000411D29 Unknown Unknown Unknown

Versions affected:
I tried CESM 1.1.2 and the latest CESM 1.2.1 (rev 61100). Both suffer from the problem.

How to reproduce:

./create_newcase -case tw.r01.B1850.C5CN.f09_g16.032 -compset B1850C5CN -res f09_g16 -mach cartesius

Set the processor count for OCN to something other than a multiple of 8. In my case, I used PES = 28.
Build & run the case.
Log and settings attached.

Workaround:
Set the OCN processor count to a multiple of 8; this seems to work.

Proposed solution:
From what I have heard, the POP configure script in CCSM4 used to raise a fatal error regarding the number of processors. That check seems to have been removed, which is a likely cause of this bug. The current segmentation fault is quite nasty because it gives no hint about the origin of the problem. Perhaps the check at configure time needs to be brought back.
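
To illustrate what such a configure-time check could look like, here is a minimal Python sketch. It is not the original CCSM4 check, which I have not seen. The function name check_ocn_task_count and the "multiple of 8" rule are assumptions taken only from the behaviour reported in this thread, not from POP documentation, so the rule may be incomplete.

```python
def check_ocn_task_count(ntasks, multiple=8):
    """Fail early with a readable message instead of a late segfault.

    Hypothetical helper: the multiple-of-8 rule is an empirical assumption
    from this report, not a documented POP constraint.
    """
    if ntasks <= 0:
        raise ValueError(f"OCN task count must be positive, got {ntasks}")
    remainder = ntasks % multiple
    if remainder != 0:
        # Suggest the nearest valid counts below and above the request.
        lower = ntasks - remainder
        upper = lower + multiple
        candidates = [n for n in (lower, upper) if n > 0]
        raise ValueError(
            f"OCN task count {ntasks} is not a multiple of {multiple}; "
            f"try one of {candidates}"
        )
    return ntasks
```

With the value from this report, check_ocn_task_count(28) would stop with a message suggesting 24 or 32, which points directly at the processor layout rather than at POP_SpaceCurveMod.F90.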
 
I have also had a similar segmentation fault when I configured POP to run with 24 cores (although not with any other multiple of 8). I have managed to reproduce this behaviour on four different machines. It happened with both CESM1.2.0 and CESM1.2.1.
 