jenny@misu_su_se
New Member
Dear all,
I am running CCSM on a Linux cluster with Intel compilers. I would be most grateful
for your ideas or experience on the following:
The problem: CCSM3 crashes when I try to do branch runs from NCAR's restart files at
T42_gx1v3 and T85_gx1v3 resolution.
Background: CCSM3 runs nicely when I do a branch run from NCAR's restart files at
T31_gx3v5 resolution. I have run a 200-year branch run (starting from the
date 0501-01-01 in NCAR's control run b30.031). My simulated climate is
very similar to the control climate:
http://www.ccsm.ucar.edu/experiments/ccsm3.0/atm/b30.031-obs_801-820/
Startup runs at T31_gx3v5, T42_gx1v3 and T85_gx1v3 run and finish
nicely. Further, I can do branch runs from these startup runs at all
these resolutions.
Machine: ~120 dual Intel Xeon nodes with 2GB RAM each,
64 bit,
CentOS Linux,
Infiniband,
Intel compilers (we have settled on version 9.0 after testing 8.1 and 9.0
for both CAM and CCSM)
Details: I do / try to do my branch runs from NCAR's files (from ESG):
b30.004.ccsm.r.0500-01-01-00000.040129-054016.tar
b30.009.ccsm.r.0500-01-01-00000.040324-161431.tar
b30.031.ccsm.r.0501-01-01-00000.040508-071805.tar
All branch runs from the b30.004 and the b30.009 runs crash at the
same point (tail of the output file):
(cpl_control_readNList) ------------------------------------------------------------
(cpl_control_readNList) orbit based on orb_year = 1990
(shr_orb_params) Calculate characteristics of the orbit:
(shr_orb_params) CVS revision: $Revision: 1.2 $
(shr_orb_params) CVS Tag : $Name: ccsm3_0_rel04 $
(shr_orb_params) Calculate orbit for year: 1990
(shr_msg_chStdIn) read cpl_stdio.nml, unit 5 connected to cpl.stdin
(shr_orb_params) ------ Computed Orbital Parameters ------
(shr_orb_params) Eccentricity = 1.670772E-02
(shr_orb_params) Obliquity (deg) = 2.344107E+01
(shr_orb_params) Obliquity (rad) = 4.091238E-01
(shr_orb_params) Long of perh(deg) = 1.027242E+02
(shr_orb_params) Long of perh(rad) = 4.934468E+00
(shr_orb_params) Long at v.e.(rad) = -3.250364E-02
(shr_orb_params) -----------------------------------------
(main) -------------------------------------------------------------------------
(main) get simulation start date
(main) -------------------------------------------------------------------------
(restart_readDate) restart type = continue => restart file specified by pointer file
(restart_readDate) reading start date from restart file: b30.004.cpl6.r.0500-01-01-00000
(main) simulation start date is 05000101
(main) -------------------------------------------------------------------------
(main) contract init: establishes domains & routers (excluding lnd)
(main) -------------------------------------------------------------------------
(cpl_contract_init) cpl-recv-atm
***** MPI-error in rank 12 Routine MPI_Abort : Terminating after call to MPI_Abort *****
--- mpimon --- Aborting run after process-12 terminated abnormally Childprocess 12666 exited with exitcode 2 ---
Thu Jan 26 21:49:29 CET 2006 -- CSM EXECUTION HAS FINISHED
Model did not complete - see cpl.log.060126-214036
tornado:ccsm3.0>tail -30 cpl.log.060126-214036
(cpl_control_readNList) ------------------------------------------------------------
(cpl_control_readNList) orbit based on orb_year = 1990
(shr_orb_params) Calculate characteristics of the orbit:
(shr_orb_params) CVS revision: $Revision: 1.2 $
(shr_orb_params) CVS Tag : $Name: ccsm3_0_rel04 $
(shr_orb_params) Calculate orbit for year: 1990
(shr_orb_params) ------ Computed Orbital Parameters ------
(shr_orb_params) Eccentricity = 1.670772E-02
(shr_orb_params) Obliquity (deg) = 2.344107E+01
(shr_orb_params) Obliquity (rad) = 4.091238E-01
(shr_orb_params) Long of perh(deg) = 1.027242E+02
(shr_orb_params) Long of perh(rad) = 4.934468E+00
(shr_orb_params) Long at v.e.(rad) = -3.250364E-02
(shr_orb_params) -----------------------------------------
(main) -------------------------------------------------------------------------
(main) get simulation start date
(main) -------------------------------------------------------------------------
(restart_readDate) restart type = continue => restart file specified by pointer file
(shr_file_get) rcode = 0 cmd = cp rpointer.cpl rpointer
(restart_readDate) reading start date from restart file: b30.004.cpl6.r.0500-01-01-00000
(cpl_iobin_open) format: format- str256:name:F90_date_and_time:caseDesc:cvsId
(cpl_iobin_open) title: Header for b30.004.cpl6.r.0500-01-01-00000
(cpl_iobin_open) File created: 2004-01-29 09:15:25
(cpl_iobin_open) comment: b30.004 fully coupled b30.004 T42_gx1v3
(cpl_iobin_open) CVS Id : @(#) CVS: $RCSfile: cpl_iobin_mod.F90,v $ $Revision: 1.2 $
(main) simulation start date is 05000101
(main) -------------------------------------------------------------------------
(main) contract init: establishes domains & routers (excluding lnd)
(main) -------------------------------------------------------------------------
(cpl_contract_init) cpl-recv-atm
tornado:ccsm3.0>
By inserting print statements in the routines, I have found that the model
crashes at this call:
call MPI_RECV(lvec,lsize,MPI_INTEGER,pid,tag,comm,status,ierr)
in ccsm3_0/models/csm_share/shr/shr_mpi_mod.F90
So I suppose it is a problem with MPI, but why is this not a problem
at the lower resolution, or for branch runs from my own startup runs?
If you have ANY ideas about what might be happening here, I am VERY
interested!
Thanks in advance,
Jenny