
CESM 1.1 Hangup Error

Dear All,

I got an error while trying to run CESM 1.1 compset B (after trying compset X) on a single computer with several nodes (of which I am using only 4). I am using the PGI compilers for Fortran and the MPICH implementation of MPI. My env_mach_pes.xml file sets all the NTASKS, NTHRDS and ROOTPE values to 4, 1 and 0 respectively.

The error in the standard output file (ccsm.log.*) seems to be MPI related:

APPLICATION TERMINATED WITH THE EXIT STRING: Hangup (signal 1)

The error appears to be in the subroutine initialize1 (in clm_initializeMod.F90), called by subroutine lnd_init_mct.

The last line of lnd.log.* is:

Attempting to read ldomain from $din_loc_root_csmdata/share/domains/domain.lnd.fv4x5_gx3v7.091218.nc

The last line of cpl.log.* is:

(seq_mct_drv) : Initialize lnd component LND

Does anyone see an obvious error or bug, MPI related or otherwise, that could cause the hang in the subroutines above? The error occurs at this same step on every trial run.

Aside (while building the code): since mpich/PGI does not allow m_list to be used as both a variable name and a module name, I have substituted the variable m_list with m_list_new in rad_constituents.F90. Could someone suggest how to make mpich/PGI deal with these double declarations?

Regards,
Ankur
 

jedwards

CSEG and Liaisons
Staff member
Look again at the ccsm.log file and check the lines preceding the hangup message for anything significant. The issue may also be that the script variable $din_loc_root_csmdata was not resolved for some reason. This may be fixed in the cesm1.1.1 update to cesm1.1; please get the newer code and try again.
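Both checks suggested above can be done by hand, but a small script makes them repeatable across trial runs. The following is only a sketch, not part of CESM: the function names `lines_before_marker` and `unresolved_variables` are hypothetical, and it assumes the log is plain text and that an unresolved variable shows up literally as `$name` or `${name}` in the log.

```python
import re
from collections import deque

# Matches shell-style variables such as $din_loc_root_csmdata or ${DIN_LOC_ROOT}.
UNRESOLVED_VAR = re.compile(r"\$\{?[A-Za-z_][A-Za-z0-9_]*\}?")


def lines_before_marker(lines, marker="Hangup", context=20):
    """Return up to `context` lines preceding the first line containing `marker`."""
    recent = deque(maxlen=context)  # keeps only the most recent lines seen
    for line in lines:
        if marker in line:
            return list(recent)
        recent.append(line.rstrip("\n"))
    return []  # marker not found


def unresolved_variables(lines):
    """Return (line_number, variable) pairs for every $VAR left in the text."""
    hits = []
    for number, line in enumerate(lines, start=1):
        for match in UNRESOLVED_VAR.finditer(line):
            hits.append((number, match.group(0)))
    return hits
```

To use it, read the ccsm.log.* and lnd.log.* files with `open(path).readlines()` and pass the resulting lists to the two functions. Any hit from `unresolved_variables` on the "Attempting to read ldomain" line would point to the path variable not being expanded.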
 
Hello,

As per your suggestion, I downloaded CESM 1.1.1 and built it using mpich2-1.3-pgi (which uses PGI 10.9 pgf90). However, the problem recurs at the same point during the CESM run.

From the last line of cpl.log.*

(seq_mct_drv) : Initialize lnd component LND

(and as suggested in the cesm1_1_1/models/lnd/clm/doc documentation) it seems that the error occurs while initializing the land model. This is also clear from the lnd.log.* files:

Attempting to read ldomain from /storage/ankurgupta/cesm/inputdata/share/domains/domain.lnd.fv4x5_gx3v7.091218.nc

which suggests that the error is in subroutine initialize1 (defined in clm_initializeMod.F90) during its call by subroutine lnd_init_mct (defined in lnd_comp_mct.F90). The model seems to hang while reading ldomain from this file. The error seems to be MPI related (several forums suggest that the last line of ccsm.log.* is MPI related). I am attaching lnd.log.*

I have also tried setting DEBUG to TRUE in env_build.xml, and stacksize (unlimited) and coredumpsize (1000000) are also set in the run script. However, the error recurs at the same point.

It has been suggested elsewhere to use PGI 11.x (and none of PGI 10.x). Since the netCDF on the machine uses PGI 10.x, it would be good to know what went wrong in the subroutine initialize1 above. The netCDF input/output seems to be working fine, since the model reads other .nc files correctly.

Let me know if you need more information to debug this.

Regards,
Ankur
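Since the hang happens while reading the domain file, one cheap thing to rule out is a missing, unreadable or truncated input file. The sketch below is an assumption-laden illustration rather than anything CESM provides: the function name `describe_input_file` is hypothetical, and it only inspects the first four bytes of the file, so it says nothing about whether a parallel read through MPI would succeed.

```python
import os

# Leading bytes of the on-disk formats a NetCDF file can use.
# Assumes the signature sits at offset 0, which is the usual case.
SIGNATURES = {
    b"CDF\x01": "NetCDF classic",
    b"CDF\x02": "NetCDF 64-bit offset",
    b"CDF\x05": "NetCDF 64-bit data (CDF-5)",
    b"\x89HDF": "HDF5 (NetCDF-4)",
}


def describe_input_file(path):
    """Report whether `path` exists, is readable, and looks like NetCDF."""
    if not os.path.isfile(path):
        return "missing"
    if not os.access(path, os.R_OK):
        return "not readable"
    with open(path, "rb") as handle:
        magic = handle.read(4)  # only the signature is inspected
    return SIGNATURES.get(magic, "unrecognized format")
```

Calling `describe_input_file` on the domain.lnd.fv4x5_gx3v7.091218.nc path printed in lnd.log.* should return one of the NetCDF descriptions. Anything else would suggest re-downloading the input data before looking further at the compiler or MPI setup.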
 