Affected releases - CESM1.2.1 and earlier
UPDATED - fixed in CESM1.2.2 (release target 01-June-2014)
(Bugzilla 1919) - fixed the one line which was reported
On Sat, Feb 08, 2014 at 08:02:01AM -0700, Jim Edwards wrote:
> A bug report on pilgrim code. The report is on 1.1.1 but the code is
> still there on the trunk. I think that the proposed change makes sense
> (if having a global index not found makes sense). What do you think?
>
> Jim
>
> ---------- Forwarded message ----------
> From: Marcus Wagner
> Date: Sat, Feb 8, 2014 at 3:44 AM
> Subject: possible bug in cesm1_1_1 seems to trigger core dump when built
> with intel/14.* but not when built with intel/13.*
> To: Jim Edwards
>
>
> Hi Jim,
> After some debugging, I think that I have found
> what triggers the core dump and the error messages
> *** glibc detected ***
> /lus/scratch/marcus/CESM/cesm_work/B1850C5CN.f19_g16.intel/bld/
> cesm.exe: double free or corruption (out)
> when I build cesm1_1_1 with intel/14.0.1.106
> and try to run "-compset B1850C5CN -res f19_g16",
> even though I had no problems when I built and
> ran that test case months ago with intel/13.* .
> ------------------------------------------------------------------
> cesm1_1_1/models/atm/cam/src/utils/pilgrim/parutilitiesmodule.F90
> 4921 subroutine ParCalcInfoGhostToDecomp(InComm, GA,DB,Info)
> ...
> 4963 allocate(sCount(npes),rCount(npes))
> ...
> 4978 call DecompGlobalToLocal(GA%Decomp,tag,Local,Pe)
> 4979 !
> 4980 ! If ipe-1 is my id, then this is an entry ipe will receive from Pe
> 4981 !
> 4982 if( pe /= oldpe .or. local /= OldLocal+1 ) then
> 4983 sCount(pe+1) = sCount(pe+1) + 1
> 4984 endif
> ...
> 5001 deallocate(sCount,rCount)
> ==================================================================
> The trouble is that DecompGlobalToLocal(GA%Decomp,tag,Local,Pe)
> returns "pe= -1" if the Global index (=tag) is not found, which
> triggers a core dump and the above error message when sCount is
> being deallocated.
> The subroutine DecompGlobalToLocal is in
> cesm1_1_1/models/atm/cam/src/utils/pilgrim/decompmodule.F90
> 127 INTERFACE DecompLocalToGlobal
> 128 MODULE PROCEDURE DecompL2G
>
> Do you think that
> 1. I am likely doing something wrong, or
> 2. found a bug in cesm1_1_1 which might be fixed by, e.g., changing
> 4982 if( pe /= oldpe .or. local /= OldLocal+1 ) then
> to
> 4982 if((pe /= -1).and.(pe /= oldpe .or. local /= OldLocal+1)) then
>
> thanks,
> marcus
>
>
> Marcus Wagner, Ph.D.
> Performance Engineer
> Cray Inc.
> Work: (909) 623-7827
> Cell: (310) 902-0676
> marcus@cray.com
============================================================================
I agree with the above analysis. We'll test this fix with the CAM regression
tests and, if there are no problems, commit it to the trunk and add it to
future bug-fix releases of CESM.
[reply] [-] Comment 1 Cheryl Craig 2014-04-14 14:42:54 MDT
Fixed in cesm1_2_2_n17_cam5_3_01 and cam5_3_31
[reply] [-] Comment 2 Sean Santos 2014-04-14 15:01:49 MDT
Should this fix be applied to the other half-dozen similar conditionals in
parutilitiesmodule.F90?
[reply] [-] Comment 3 Cheryl Craig 2014-04-14 15:35:21 MDT
After talking with Brian Eaton, it was decided to leave the fix with just the
one line specified.
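For illustration, a minimal sketch of the guarded counting loop follows. It
is not the pilgrim source: the lookup results in pes and locals are made-up
stand-ins for what DecompGlobalToLocal would return, and the starting values
of oldpe and oldlocal, the nskipped counter and the final_counts copy are
assumptions added for the example. Because sCount is allocated with bounds
1..npes, an unguarded pe = -1 would write to sCount(0), one element before
the array, which is presumably how the corruption reported at deallocate
arises.

```fortran
program guard_missing_pe
  implicit none
  integer, parameter :: npes = 4
  ! Hypothetical lookup results, one per tag: owning PE (0-based) and local
  ! index. pe = -1 marks a global index that was not found.
  integer, parameter :: pes(6)    = [0, 0, 2, -1, 3, 3]
  integer, parameter :: locals(6) = [1, 2, 5,  0, 7, 9]
  integer, allocatable :: sCount(:)
  integer :: final_counts(npes)
  integer :: i, pe, local, oldpe, oldlocal, nskipped

  allocate(sCount(npes))      ! valid indices are 1..npes
  sCount = 0
  nskipped = 0
  oldpe = -1                  ! assumed starting values
  oldlocal = 0

  do i = 1, size(pes)
     pe = pes(i)
     local = locals(i)
     ! Proposed condition: never index sCount when the lookup failed.
     if ((pe /= -1) .and. (pe /= oldpe .or. local /= oldlocal + 1)) then
        sCount(pe+1) = sCount(pe+1) + 1
     end if
     if (pe == -1) nskipped = nskipped + 1
     oldpe = pe
     oldlocal = local
  end do

  final_counts = sCount       ! keep a copy for printing after deallocate
  deallocate(sCount)

  print '(a,4i3)', 'sCount  =', final_counts
  print '(a,i0)',  'skipped = ', nskipped
end program guard_missing_pe
```

With these inputs the one tag with pe = -1 is skipped and the counts come out
as 1, 0, 1, 2. Building with run-time bounds checking (-check bounds with the
Intel compiler, -fcheck=bounds with gfortran) should flag an unguarded write
at the offending line rather than later at deallocate.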