
cesm.exe terminated by errors

yangx2

xinyi yang
Member
Hi everyone,
I am trying to port CESM 2.1.0 to a cluster that uses SLURM. First I built a simple case (--compset X), and everything went well. The CaseStatus is shown below:

**********************************************************************************************
2020-09-11 06:48:50: case.run starting
---------------------------------------------------
2020-09-11 06:48:54: model execution starting
---------------------------------------------------
2020-09-11 06:50:16: model execution success
---------------------------------------------------
2020-09-11 06:50:16: case.run success
---------------------------------------------------
2020-09-11 06:50:23: st_archive starting
---------------------------------------------------
2020-09-11 06:50:26: st_archive success

**********************************************************************************************
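For reference, this is roughly the workflow I used for both cases (the resolution and machine name below are placeholders for my actual settings; the case name matches the B1850 case discussed next):

> ./cime/scripts/create_newcase --case b1850.test --compset B1850 --res f09_g17 --mach mymachine
> cd b1850.test
> ./case.setup
> ./case.build --skip-provenance-check
> ./case.submit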

Then I tried to build another case (--compset B1850). The build went well and the case submitted successfully, but it fails while running. The CaseStatus for the B1850 case is shown below:

**********************************************************************************************
2020-09-11 07:24:32: case.submit success case.run:10807627, case.st_archive:10807628
---------------------------------------------------
2020-09-11 07:33:13: case.run starting
---------------------------------------------------
2020-09-11 07:33:28: model execution starting
---------------------------------------------------
2020-09-11 07:48:24: model execution success
---------------------------------------------------
2020-09-11 07:48:24: case.run error
ERROR: RUN FAIL: Command 'srun -n 108 /mnt/scratch/nfs_fs02/yangx2/b1850.test/bld/cesm.exe >> cesm.log.$LID 2>&1 ' failed
See log file for details: /mnt/scratch/nfs_fs02/yangx2/b1850.test/run/cesm.log.10807627.200911-073313

**********************************************************************************************

The corresponding "/mnt/scratch/nfs_fs02/yangx2/b1850.test/run/cesm.log.10807627.200911-073313" file is shown below (truncated):

**********************************************************************************************
Creating variable thk
Creating variable topg
Creating variable usurf
Writing to file b1850.test.cism.initial_hist.0001-01-01-00000.nc at time 0
.000000000000000E+000
starttype: initial
starttype: initial

Output requests :
--------------------------------------------------
no dedicated output process, any file system
starttype: initial

Output requests :
--------------------------------------------------
no dedicated output process, any file system
starttype: initial

Output requests :
--------------------------------------------------

no dedicated output process, any file system
starttype: initial

Output requests :
--------------------------------------------------
no dedicated output process, any file system
1 IMOD, NAPROC, NBLKRS, NSPEC, RSBLKS= 1 18 35
600 35
2 IMOD, NAPROC, NBLKRS, NSPEC, RSBLKS= 1 18 10
600 17
1 IMOD, NAPROC, NBLKRS, NSPEC, RSBLKS= 1 18 0
600 0
2 IMOD, NAPROC, NBLKRS, NSPEC, RSBLKS= 1 18 10
MCT::m_Router::initp_: GSMap indices not increasing...Will correct
MCT::m_Router::initp_: RGSMap indices not increasing...Will correct
MCT::m_Router::initp_: RGSMap indices not increasing...Will correct
MCT::m_Router::initp_: GSMap indices not increasing...Will correct
MCT::m_Router::initp_: GSMap indices not increasing...Will correct
MCT::m_Router::initp_: RGSMap indices not increasing...Will correct
MCT::m_Router::initp_: RGSMap indices not increasing...Will correct
MCT::m_Router::initp_: GSMap indices not increasing...Will correct
(seq_domain_areafactinit) : min/max mdl2drv 0.999565346641447 1.00000000000000 areafact_o_OCN
(seq_domain_areafactinit) : min/max drv2mdl 1.00000000000000 1.00043484236425 areafact_o_OCN
(seq_domain_areafactinit) : min/max mdl2drv 0.999565346641447 1.00000000000000 areafact_i_ICE
(seq_domain_areafactinit) : min/max drv2mdl 1.00000000000000 1.00043484236425 areafact_i_ICE
calcsize j,iq,jac, lsfrm,lstoo 1 1 1 26 21
calcsize j,iq,jac, lsfrm,lstoo 1 1 2 26 21
calcsize j,iq,jac, lsfrm,lstoo 1 2 1 22 15
calcsize j,iq,jac, lsfrm,lstoo 1 2 2 22 15
calcsize j,iq,jac, lsfrm,lstoo 2 3 2 17 24
calcsize j,iq,jac, lsfrm,lstoo 2 4 1 20 25
calcsize j,iq,jac, lsfrm,lstoo 2 4 2 20 25
calcsize j,iq,jac, lsfrm,lstoo 2 5 1 19 23
calcsize j,iq,jac, lsfrm,lstoo 2 5 2 19 23
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image PC Routine Line Source
cesm.exe 0000000002E50553 Unknown Unknown Unknown
libc-2.17.so 00002B5CBD4E0400 Unknown Unknown Unknown
libmpi.so.12.0.0 00002B5CBCBC30D0 Unknown Unknown Unknown
libmpi.so.12.0.0 00002B5CBC78AEF7 Unknown Unknown Unknown
libmpi.so.12.0.0 00002B5CBCC02E18 Unknown Unknown Unknown
libmpi.so.12.0.0 00002B5CBCC05CE8 Unknown Unknown Unknown
libmpi.so.12.0.0 00002B5CBCC0D92D MPI_Startall Unknown Unknown
libmpifort.so.12. 00002B5CBBE5361B mpi_startall Unknown Unknown
cesm.exe 000000000268DA14 w3wavemd_mp_w3wav 882 w3wavemd.f90
cesm.exe 000000000262C90E wav_comp_mct_mp_w 884 wav_comp_mct.F90
cesm.exe 0000000000434530 component_mod_mp_ 728 component_mod.F90
cesm.exe 0000000000418BC4 cime_comp_mod_mp_ 2738 cime_comp_mod.F90
cesm.exe 0000000000434177 MAIN__ 125 cime_driver.F90
cesm.exe 00000000004166D2 Unknown Unknown Unknown
libc-2.17.so 00002B5CBD4CC555 __libc_start_main Unknown Unknown
cesm.exe 00000000004165E9 Unknown Unknown Unknown
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image PC Routine Line Source
cesm.exe 0000000002E50553 Unknown Unknown Unknown
libc-2.17.so 00002ABB3511E400 Unknown Unknown Unknown
libmpi.so.12.0.0 00002ABB348010D0 Unknown Unknown Unknown
libmpi.so.12.0.0 00002ABB343C8EF7 Unknown Unknown Unknown
libmpi.so.12.0.0 00002ABB34840E18 Unknown Unknown Unknown
libmpi.so.12.0.0 00002ABB34843CE8 Unknown Unknown Unknown
libmpi.so.12.0.0 00002ABB3484B92D MPI_Startall Unknown Unknown
libmpifort.so.12. 00002ABB33A9161B mpi_startall Unknown Unknown
cesm.exe 000000000268DA14 w3wavemd_mp_w3wav 882 w3wavemd.f90
cesm.exe 000000000262C90E wav_comp_mct_mp_w 884 wav_comp_mct.F90
cesm.exe 0000000000434530 component_mod_mp_ 728 component_mod.F90
cesm.exe 0000000000418BC4 cime_comp_mod_mp_ 2738 cime_comp_mod.F90
cesm.exe 0000000000434177 MAIN__ 125 cime_driver.F90
cesm.exe 00000000004166D2 Unknown Unknown Unknown
libc-2.17.so 00002ABB3510A555 __libc_start_main Unknown Unknown
cesm.exe 00000000004165E9 Unknown Unknown Unknown
forrtl: severe (174): SIGSEGV, segmentation fault occurred
...
cesm.exe 00000000004166D2 Unknown Unknown Unknown
libc-2.17.so 00002AAD7F978555 __libc_start_main Unknown Unknown
cesm.exe 00000000004165E9 Unknown Unknown Unknown
forrtl: error (78): process killed (SIGTERM)
Image PC Routine Line Source
cesm.exe 0000000002E50584 Unknown Unknown Unknown
libpthread-2.17.s 00002AB72D843630 Unknown Unknown Unknown
libmpi.so.12.0.0 00002AB7359AA8A6 Unknown Unknown Unknown
libmpi.so.12.0.0 00002AB735A08239 Unknown Unknown Unknown
libmpi.so.12.0.0 00002AB735A07DD0 MPI_Barrier Unknown Unknown
libmpifort.so.12. 00002AB7354CE6BC pmpi_barrier Unknown Unknown
cesm.exe 0000000000418443 cime_comp_mod_mp_ 2458 cime_comp_mod.F90
cesm.exe 0000000000434177 MAIN__ 125 cime_driver.F90
cesm.exe 00000000004166D2 Unknown Unknown Unknown
libc-2.17.so 00002AB736B5C555 __libc_start_main Unknown Unknown
cesm.exe 00000000004165E9 Unknown Unknown Unknown
srun: error: node-0155: tasks 90-107: Exited with exit code 1

**********************************************************************************************
Any hints would be helpful!
Thanks in advance!
Best,
Skylar
 

jedwards

CSEG and Liaisons
Staff member
Why did you post to a new group instead of continuing the previous? Did you talk to your system support team?
What version of mpi are you using?
 

yangx2

xinyi yang
Member
Why did you post to a new group instead of continuing the previous? Did you talk to your system support team?
What version of mpi are you using?
Hi,
I solved the previous problem based on a suggestion from my HPC support team. They believed that error was caused by mixing MPI implementations, so I revised the config_machines.xml file, and I can now run the --compset X case successfully. When I try to run --compset B1850, I get a different error, which is why I started a new thread. If that is not the right way to do it, I will correct it; sorry for the inconvenience. Any hints about this error?
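For what it's worth, after editing config_machines.xml I sanity-check the compiler/MPI environment it is supposed to load with something like this (the module name here is just our cluster's Intel toolchain module):

> module purge
> module load toolchain/intel/2018.5.274
> which mpiifort mpirun
> mpirun --version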
best,
Skylar
 

jedwards

CSEG and Liaisons
Staff member
What version of mpi are you using? What is your stack size limit (ulimit -s)? The stack limit should be set to be the maximum possible on your system or unlimited if possible.
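For example, in a bash shell (or at the top of your batch script, before the model is launched) you can check and raise it with:

> ulimit -s
> ulimit -s unlimited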
 

yangx2

xinyi yang
Member
What version of mpi are you using? What is your stack size limit (ulimit -s)? The stack limit should be set to be the maximum possible on your system or unlimited if possible.
Hi Jedwards,
Sorry for the late reply. I have both impi and OpenMPI installed. I have attached config_machines.xml and config_compilers.xml.
**********************************************
[yangx2@node-0001 ~]$ ulimit -s
unlimited
**********************************************
The problem is not solved. Thanks in advance.

Best,
Xinyi
 

Attachments

  • config_machines.txt (7.8 KB)
  • config_compilers.txt (1.5 KB)

yangx2

xinyi yang
Member
What version of mpi are you using? What is your stack size limit (ulimit -s)? The stack limit should be set to be the maximum possible on your system or unlimited if possible.
I have also attached the "/mnt/scratch/nfs_fs02/yangx2/b1850.test/run/cesm.log.10807627.200911-073313" file, in case you want to take a look.
best,
Skylar
 

Attachments

  • cesm.log.10807627.200911-073313.txt (888.4 KB)

yangx2

xinyi yang
Member
What version of mpi are you using? What is your stack size limit (ulimit -s)? The stack limit should be set to be the maximum possible on your system or unlimited if possible.
One more thing: when building the case, I used the --skip-provenance-check flag to avoid the error shown below:
*********************************************************
Processing externals description file : Externals.cfg
Processing externals description file : Externals_CAM.cfg
Processing externals description file : Externals_CISM.cfg
Processing externals description file : Externals_CLM.cfg
Processing externals description file : Externals_POP.cfg
Checking status of externals: cam, chem_proc,
ERROR: SVN returned invalid XML message

To solve this, either:

(1) Find and fix the problem: From /home/yangx2/my_cesm_sandbox, try to get this command to work:
./manage_externals/checkout_externals --status --verbose --no-logging

(2) If you don't need provenance information, rebuild with --skip-provenance-check
*********************************************************
Then I ran the commands below and got the following results:
cd /home/yangx2/my_cesm_sandbox
./manage_externals/checkout_externals --status --verbose --no-logging

*********************************************************
Processing externals description file : Externals.cfg
Processing externals description file : Externals_CLM.cfg
Processing externals description file : Externals_POP.cfg
Processing externals description file : Externals_CISM.cfg
Processing externals description file : Externals_CAM.cfg
Checking status of externals: clm, fates, ptclm, mosart, ww3, cime, cice, pop, cvmix, marbl, cism, source_cism, rtm, cam, clubb, carma, cosp2, chem_proc,
./cime
clean sandbox, on cime5.6.32
./components/cam
clean sandbox, on cam_cesm2_1_rel_41
./components/cam/chem_proc
clean sandbox, on tools/proc_atm/chem_proc/release_tags/chem_proc5_0_03_rel
./components/cam/src/physics/carma/base
clean sandbox, on carma/release_tags/carma3_49_rel
./components/cam/src/physics/clubb
clean sandbox, on vendor_clubb_r8099_n03
./components/cam/src/physics/cosp2/src
clean sandbox, on CFMIP/COSPv2.0/tags/v2.1.4cesm/src
./components/cice
clean sandbox, on cice5_cesm2_1_1_20190321
./components/cism
clean sandbox, on cism-release-cesm2.1.2_02
./components/cism/source_cism
clean sandbox, on release-cism2.1.03
./components/clm
clean sandbox, on release-clm5.0.30
./components/clm/src/fates
clean sandbox, on sci.1.30.0_api.8.0.0
./components/clm/tools/PTCLM
clean sandbox, on PTCLM2_20200121
./components/mosart
clean sandbox, on release-cesm2.0.04
./components/pop
clean sandbox, on pop2_cesm2_1_rel_n09
./components/pop/externals/CVMix
clean sandbox, on v0.93-beta
./components/pop/externals/MARBL
clean sandbox, on cesm2.1-n00
./components/rtm
clean sandbox, on release-cesm2.0.04
./components/ww3
clean sandbox, on ww3_181001
*********************************************************

I am posting this in case it has some impact on the run.

Best,
Skylar
 

yangx2

xinyi yang
Member
What version of mpi are you using? What is your stack size limit (ulimit -s)? The stack limit should be set to be the maximum possible on your system or unlimited if possible.
I believe here I am using impi instead of openmpi.
 

jedwards

CSEG and Liaisons
Staff member
This kind of error might be expected if you are mixing your mpi - that is linking with one but attempting to run with the other.
Are you sure you are consistently using the same mpi library?
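One quick way to check is to look at which MPI runtime the executable is actually linked against and which launcher is first in your PATH, for example something like:

> ldd /mnt/scratch/nfs_fs02/yangx2/b1850.test/bld/cesm.exe | grep -i mpi
> which mpirun mpiexec srun
> mpirun --version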
 

yangx2

xinyi yang
Member
This kind of error might be expected if you are mixing your mpi - that is linking with one but attempting to run with the other.
Are you sure you are consistently using the same mpi library?
Hi Jedwards,
Based on your suggestion, I am going to revise config_machines.xml (basically removing OpenMPI and other unnecessary modules), since impi comes automatically with the Intel compiler. Let's see whether this change solves the problem. BTW, do you think this is a good approach?
Best,
Skylar
 

jedwards

CSEG and Liaisons
Staff member
Yes, this is a good plan. You might also consider trying a simple program like hello_world_mpi.c
 

yangx2

xinyi yang
Member
Yes, this is a good plan. You might also consider trying a simple program like hello_world_mpi.c
Hi Jedwards,
First, I tried a test program named fhello_world_mpi.F90 (suggested in "6. Porting and validating CIME on a new platform" of the CIME master documentation) to make sure that I can run a basic MPI parallel program on my machine.
fhello_world_mpi.F90 is shown below:
**************************************************************
program fhello_world_mpi
  use mpi
  implicit none

  integer ( kind = 4 ) error
  integer ( kind = 4 ) id
  integer p
  character(len=MPI_MAX_PROCESSOR_NAME) :: name
  integer clen
  integer, allocatable :: mype(:)
  real ( kind = 8 ) wtime

  call MPI_Init ( error )
  call MPI_Comm_size ( MPI_COMM_WORLD, p, error )
  call MPI_Comm_rank ( MPI_COMM_WORLD, id, error )

  if ( id == 0 ) then
    wtime = MPI_Wtime ( )

    write ( *, '(a)' ) ' '
    write ( *, '(a)' ) 'HELLO_MPI - Master process:'
    write ( *, '(a)' ) ' FORTRAN90/MPI version'
    write ( *, '(a)' ) ' '
    write ( *, '(a)' ) ' An MPI test program.'
    write ( *, '(a)' ) ' '
    write ( *, '(a,i8)' ) ' The number of processes is ', p
    write ( *, '(a)' ) ' '
  end if

  call MPI_GET_PROCESSOR_NAME(name, clen, error)
  write ( *, '(a)' ) ' '
  write ( *, '(a,i8,a,a)' ) ' Process ', id, ' says "Hello, world!" ', name(1:clen)

  call MPI_Finalize ( error )
end program
**************************************************************
I compiled and ran fhello_world_mpi.F90 by executing the commands shown below:
> module purge
> module load toolchain/intel/2018.5.274
> mpiifort fhello_world_mpi.F90 -o hello_world
> mpirun -np 2 ./hello_world
FYI, "impi" will automatically be loaded with "intel".
Then I get results:
*************************************************
[yangx2@node-0001 test]$ mpirun -n 2 ./hello_world


Process 1 says "Hello, world!" node-0001
HELLO_MPI - Master process:
FORTRAN90/MPI version

An MPI test program.

The number of processes is 2


Process 0 says "Hello, world!" node-0001
*************************************************
So I assume "impi" is working, right? I hope I am doing this the way you suggested. I am revising config_machines.xml now and will report back later.
Best,
Skylar
 

yangx2

xinyi yang
Member
Also, I tried launching with srun, since my university HPC uses SLURM:
srun mpirun -np 2 ./hello_world

********************
[yangx2@node-0001 test]$ srun mpirun ./hello_world

HELLO_MPI - Master process:
FORTRAN90/MPI version

An MPI test program.

The number of processes is 2


Process 0 says "Hello, world!" node-0001

Process 1 says "Hello, world!" node-0003


Process 1 says "Hello, world!" node-0003
HELLO_MPI - Master process:
FORTRAN90/MPI version

An MPI test program.

The number of processes is 2


Process 0 says "Hello, world!" node-0001
*******************
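Side note: the hello-world output above appears twice, which makes me suspect that wrapping mpirun in srun started a separate copy of mpirun for each SLURM task. If that is the case, I should probably use only one launcher at a time, for example

> srun -n 2 ./hello_world

(letting SLURM start the ranks directly, assuming Intel MPI is set up to work with srun/PMI), or

> mpirun -np 2 ./hello_world

inside the allocation, letting Intel MPI's mpirun do the launching on its own.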

Best,
Skylar
 