
Error in test case

I am porting CESM1.0.2 to a local HPC system. Some test cases passed.

ERS_D.T31_g37.A failed at first, but passed after setting DFBFLAG to true.


(1) ERS_D.f19_g16.B1850CN failed:

Error information in TestStatus.out:


doing a 10 day initial test
doing a 5 day restart test
Initial Test log is
***/ERS_D.f19_g16.B1850CN.***.t01/run/cpl.log.110411-183547
Restart Test log is
***/ERS_D.f19_g16.B1850CN.***.t01/run/cpl.log.110411-183905
Comparing initial log file with second log file
Error: output failed in
***/ERS_D.f19_g16.B1850CN.***.t01/run/cpl.log.110411-183905
FAIL

=======================
comm_diag xxx sorr 17 2.9138769163793648000E+16 send ice Faxa_swvdr
comm_diag xxx sorr 18 8.3728689914170350000E+15 send ice Faxa_swndf
comm_diag xxx sorr 19 1.4581980482047156000E+16 send ice Faxa_swvdf
comm_diag xxx sorr 20 1.2722476995354657600E+17 send ice Faxa_lwdn
comm_diag xxx sorr 21 1.0435867259904390335E+10 send ice Faxa_rain
comm_diag xxx sorr 22 4.1520101014284741879E+08 send ice Faxa_snow
comm_diag xxx sorr 23 4.4729977633970623430E+00 send ice Faxa_bcphidry
comm_diag xxx sorr 24 1.4881726069932079692E+00 send ice Faxa_bcphodry
comm_diag xxx sorr 25 3.5312902110328387550E+01 send ice Faxa_bcphiwet
comm_diag xxx sorr 26 3.0994728568099031207E+01 send ice Faxa_ocphidry
comm_diag xxx sorr 27 5.3935057830377415300E+00 send ice Faxa_ocphodry
comm_diag xxx sorr 28 2.2053728728725909036E+02 send ice Faxa_ocphiwet
comm_diag xxx sorr 29 1.4520152681898164246E+03 send ice Faxa_dstwet1
comm_diag xxx sorr 30 3.5924732432649670955E+03 send ice Faxa_dstwet2
comm_diag xxx sorr 31 1.6860603250195511009E+03 send ice Faxa_dstwet3
comm_diag xxx sorr 32 1.1071326539462259007E+03 send ice Faxa_dstwet4
comm_diag xxx sorr 33 4.8467222923425843817E+01 send ice Faxa_dstdry1
comm_diag xxx sorr 34 5.2026050851909315043E+02 send ice Faxa_dstdry2
comm_diag xxx sorr 35 1.5744626083906414351E+03 send ice Faxa_dstdry3
comm_diag xxx sorr 36 5.4303896067308278361E+03 send ice Faxa_dstdry4
=======================
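For context, the ERS test's "Comparing initial log file with second log file" step essentially checks that the `comm_diag` summary values (like the ones above) are bit-for-bit identical between the initial and restart runs. A minimal Python sketch of that kind of comparison (the parsing and field values here are illustrative, not the actual test script):

```python
# Compare comm_diag summary lines from two coupler logs and report
# the fields whose checksums differ (i.e., the run is not bit-for-bit).

def comm_diag_fields(lines):
    """Map field name -> value for lines like
    'comm_diag xxx sorr 21 1.0435867259904390335E+10 send ice Faxa_rain'."""
    fields = {}
    for line in lines:
        parts = line.split()
        if parts and parts[0] == "comm_diag":
            fields[parts[-1]] = float(parts[4])
    return fields

def diff_logs(initial_lines, restart_lines):
    """Return the field names whose values differ between the two logs."""
    a = comm_diag_fields(initial_lines)
    b = comm_diag_fields(restart_lines)
    return [name for name in a if name in b and a[name] != b[name]]

# Illustrative data: Faxa_rain differs between the initial and restart runs.
init_log = ["comm_diag xxx sorr 21 1.0435867259904390335E+10 send ice Faxa_rain"]
rest_log = ["comm_diag xxx sorr 21 1.0435867300000000000E+10 send ice Faxa_rain"]
print(diff_logs(init_log, rest_log))  # -> ['Faxa_rain']
```

Any field appearing in such a diff means the restart run diverged from the initial run, which is what the FAIL above is reporting.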


(2) ERI.f19_g16.B1850CN also failed:

In TestStatus.out:


ref1: doing a 7 day initial startup from 0001-12-27
Checking successful completion of init cpl log file
PASS ERI.f19_g16.B1850.***.1
ref2: doing a 15 day hybrid startup from 0001-01-01 using ref1 0002-01-01
Checking successful completion of hybr cpl log file
PASS ERI.f19_g16.B1850.***.2
doing a 9 day branch startup from ref2 0001-01-06
Checking successful completion of brch cpl log file
PASS ERI.f19_g16.B1850.***.3
doing a 5 day continue restart test from 0001-01-11
Initial Test log is ***/archive/ERI.f19_g16.B1850.***.t02.ref1/cpl/logs/cpl.log.110412-154818
Hybrid Test log is ***/archive/ERI.f19_g16.B1850.***.t02.ref2/cpl/logs/cpl.log.110412-160201
Branch Test log is ***/archive/ERI.f19_g16.B1850.***.t02/cpl/logs/cpl.log.110412-161722
Restart Test log is ***/archive/ERI.f19_g16.B1850.***.t02/cpl/logs/cpl.log.110412-164618
Checking successful completion in init cpl log file
Comparing initial log file with second log file
Error: ***/archive/ERI.f19_g16.B1850.***.t02.ref2/cpl/logs/cpl.log.110412-160201 and ***/archive/ERI.f19_g16.B1850.***.t02/cpl/logs/cpl.log.110412-164618 are different.
>comm_diag xxx sorr 10 1.0449232266695065600E+17 send ice Sa_ptem

 
The settings I used for the C and Fortran compilers (Intel ifort 10.1.*, with the latest netCDF) are:

----------------------------------------
FC := mpif90
CC := mpicc

CFLAGS :=
CFLAGS := $(CPPDEFS) -w -O2 -ftz -tpp2 -fno-alias -fno-fnalias -ip -g
FIXEDFLAGS :=
FREEFLAGS := -FR
FFLAGS := $(CPPDEFS) -w -cm -cpp -WB -fpp -ftz -fpconstant -mtune=itanium2 -autodouble -tpp2 -fno-alias -fno-fnalias -stack_temps -ip -assume byterecl -convert big_endian -g
FFLAGS_OPT := -O2
FFLAGS_NOOPT := $(FFLAGS)
#LDFLAGS := -Wl,--noinhibit-exec -Vaxlib -posixlib
LDFLAGS :=
AR := ar
MOD_SUFFIX := mod
CONFIG_SHELL :=
--------------------------------------------------

Test case ERS_D.f19_g16.B1850CN also failed, with this error information:

forrtl: severe (408): fort: (3): Subscript #3 of the array TIDAL_DIFF has value 0 which is less than the lower bound of 1

Image PC Routine Line Source
ccsm.exe 00000000025A000D Unknown Unknown Unknown
ccsm.exe 000000000259EB15 Unknown Unknown Unknown
ccsm.exe 0000000002538D30 Unknown Unknown Unknown
ccsm.exe 00000000024CA91F Unknown Unknown Unknown
ccsm.exe 00000000024CAD22 Unknown Unknown Unknown
ccsm.exe 0000000001C9CCA2 Unknown Unknown Unknown
ccsm.exe 0000000001C809AD Unknown Unknown Unknown
ccsm.exe 0000000001C602A9 Unknown Unknown Unknown
ccsm.exe 0000000001A0F466 Unknown Unknown Unknown
ccsm.exe 00000000018F1BEC Unknown Unknown Unknown
ccsm.exe 0000000001856804 Unknown Unknown Unknown
ccsm.exe 000000000042465A Unknown Unknown Unknown
ccsm.exe 000000000041840C Unknown Unknown Unknown
libc.so.6 000000337181D994 Unknown Unknown Unknown
ccsm.exe 0000000000418319 Unknown Unknown Unknown

...

forrtl: severe (29): file not found, unit 98, file /.../ERS_D.f19_g16.B1850CN.cpus.t01/run/rpointer.drv

I tried to fix the bound for TIDAL_DIFF, but then found a new array-bounds error in the ocean horizontal mixing: the upper bound should be 60, but 61 appears while running the model.
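One possible explanation, given the fix reported below: "-convert big_endian" forces the Intel compiler to byte-swap unformatted Fortran I/O, so integer index data read from a native little-endian binary file (a KMT field, for example) comes back scrambled, which can produce exactly this kind of wildly out-of-range subscript. A small Python illustration of the byte-swap effect (not CESM code, just the mechanism):

```python
import struct

# An index value such as 33, written in native little-endian byte order...
raw = struct.pack("<i", 33)

# ...but read back as big-endian, as "-convert big_endian" would force
# for unformatted input, becomes a huge, invalid value.
(swapped,) = struct.unpack(">i", raw)

print(swapped)  # 553648128: nothing like a valid array subscript
```

So the flag should only be used for input files that really are big-endian; applying it to native little-endian data corrupts every integer and real read this way.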

Could these errors be related to the compiler settings?

Thanks,
 
The array-bounds error disappeared after I removed the "-convert big_endian" compiler flag; now the error information is:

--------------
rm: No match.
application called MPI_Abort(MPI_COMM_WORLD, 0) - process 4
application called MPI_Abort(MPI_COMM_WORLD, 0) - process 0
application called MPI_Abort(MPI_COMM_WORLD, 0) - process 56
gunzip: No match.
rm: No match.
forrtl: No such file or directory
forrtl: severe (29): file not found, unit 98, file /.../ERS_D.f19_g16.B1850CN.cpus.t03/run/rpointer.drv
Image PC Routine Line Source
ccsm.exe 00000000024F884D Unknown Unknown Unknown
ccsm.exe 00000000024F7355 Unknown Unknown Unknown
ccsm.exe 0000000002491570 Unknown Unknown Unknown
ccsm.exe 000000000242315F Unknown Unknown Unknown
ccsm.exe 0000000002422992 Unknown Unknown Unknown
ccsm.exe 0000000002436C6D Unknown Unknown Unknown
ccsm.exe 0000000001E64198 Unknown Unknown Unknown
ccsm.exe 0000000000418AD6 Unknown Unknown Unknown
ccsm.exe 000000000041840C Unknown Unknown Unknown
libc.so.6 0000003BD8C1D994 Unknown Unknown Unknown
ccsm.exe 0000000000418319 Unknown Unknown Unknown
gunzip: No match.
ls: No match.
/.../archive/ERS_D.f19_g16.B1850CN.cpus.t03/cpl: No such file or directory.

---------------------

in TestStatus.out
-----------------------
doing a 10 day initial test
doing a 5 day restart test
Initial Test log is /.../ERS_D.f19_g16.B1850CN.cpus.t03/run/cpl.log.110512-170244
Restart Test log is /.../ERS_D.f19_g16.B1850CN.cpus.t03/run/cpl.log.110512-170314
Comparing initial log file with second log file
Error: output failed in /.../ERS_D.f19_g16.B1850CN.cpus.t03/run/cpl.log.110512-170244
FAIL
------------------------
 
I ran the same ERI.f19_g16.B1850 test twice.

The error for the first test:

---------------------


------------------------------------------------------------------------

POP aborting...
ERROR kmt inconsistency for overflows

------------------------------------------------------------------------
init_overflows_kmt: KMT = ***** at global (i,j) = 198 7 changed to 33
init_overflows_kmt: KMT = ***** at global (i,j) = 199 7 changed to 33
init_overflows_kmt: KMT = ***** at global (i,j) = 200 7 changed to 33
init_overflows_kmt: kmt inconsistencies for 3 points
original kmt not equal to actual kmt
------------------------------------------------------------------------

POP aborting...
ERROR kmt inconsistency for overflows

------------------------------------------------------------------------
init_overflows_kmt: KMT = ***** at global (i,j) = 38 349 changed to 37
init_overflows_kmt: KMT = ***** at global (i,j) = 38 350 changed to 37
init_overflows_kmt: KMT = ***** at global (i,j) = 38 351 changed to 37
init_overflows_kmt: KMT = ***** at global (i,j) = 19 370 changed to 32
init_overflows_kmt: KMT = ***** at global (i,j) = 19 371 changed to 32
init_overflows_kmt: KMT = ***** at global (i,j) = 19 372 changed to 32
init_overflows_kmt: kmt inconsistencies for 6 points
original kmt not equal to actual kmt
------------------------------------------------------------------------

POP aborting...
ERROR kmt inconsistency for overflows

------------------------------------------------------------------------
rank 56 in job 1 scs-2-9.local_54765 caused collective abort of all ranks
exit status of rank 56: killed by signal 9
rank 0 in job 1 scs-2-9.local_54765 caused collective abort of all ranks
exit status of rank 0: killed by signal 9



The error for the second test (almost exactly the same settings as the first, except the batch-job wall time):

------------------------------------------------------------------------

POP aborting...
ERROR kmt inconsistency for overflows

------------------------------------------------------------------------
init_overflows_kmt: KMT = ***** at global (i,j) = 38 349 changed to 37
init_overflows_kmt: KMT = ***** at global (i,j) = 38 350 changed to 37
init_overflows_kmt: KMT = ***** at global (i,j) = 38 351 changed to 37
init_overflows_kmt: KMT = ***** at global (i,j) = 19 370 changed to 32
init_overflows_kmt: KMT = ***** at global (i,j) = 19 371 changed to 32
init_overflows_kmt: KMT = ***** at global (i,j) = 19 372 changed to 32
init_overflows_kmt: kmt inconsistencies for 6 points
original kmt not equal to actual kmt
------------------------------------------------------------------------

POP aborting...
ERROR kmt inconsistency for overflows

------------------------------------------------------------------------
rank 56 in job 1 scs-2-5.local_53139 caused collective abort of all ranks
exit status of rank 56: killed by signal 9
rank 0 in job 1 scs-2-5.local_53139 caused collective abort of all ranks
exit status of rank 0: killed by signal 9


There were fewer errors in the second test than in the first. Is there any clue as to the cause of the error, and why the two runs differ?
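From the messages above, init_overflows_kmt appears to adjust KMT (the ocean depth-level index) at the overflow points and then abort when the adjusted values disagree with the original topography. A rough Python sketch of that kind of consistency check (the logic is inferred from the log messages, not taken from the POP source; all values are illustrative):

```python
# Inferred sketch of the kmt consistency check reported in the log:
# compare the original KMT field against the values the overflow setup
# wants, and flag every point that had to be changed.

def check_kmt(original, adjusted):
    """Return the (i, j) points where the adjusted kmt differs from the
    original; print messages in the style of the POP log."""
    bad = []
    for (i, j), kmt in adjusted.items():
        if original.get((i, j)) != kmt:
            print(f"init_overflows_kmt: KMT at global (i,j) = {i} {j} "
                  f"changed to {kmt}")
            bad.append((i, j))
    if bad:
        print(f"init_overflows_kmt: kmt inconsistencies for {len(bad)} points")
        print("original kmt not equal to actual kmt")
    return bad

# Illustrative values echoing the second test's log (3 changed points):
original = {(38, 349): 40, (38, 350): 40, (38, 351): 40}
adjusted = {(38, 349): 37, (38, 350): 37, (38, 351): 37}
assert len(check_kmt(original, adjusted)) == 3
```

If a check like this fires, it usually points at the input topography/overflow files rather than the model state, which would fit an endianness or file-corruption problem on the input datasets.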

P.S. The compiler I am using is the Intel compiler, with the flags set as follows:

CFLAGS := $(CPPDEFS) -w -O2 -ftz -fno-alias -fno-fnalias
FIXEDFLAGS :=
FREEFLAGS := -FR
FFLAGS := $(CPPDEFS) -cpp -WB -fpp -ftz -fpconstant -fno-alias -fno-fnalias -assume byterecl -autodouble
FFLAGS_OPT := -O2

Thanks.
 