Hi everyone,
I am trying to port CESM 2.1.0 to a cluster that uses SLURM. I first built and ran a simple case (--compset X), and everything went well. The CaseStatus file for that case shows:
**********************************************************************************************
2020-09-11 06:48:50: case.run starting
---------------------------------------------------
2020-09-11 06:48:54: model execution starting
---------------------------------------------------
2020-09-11 06:50:16: model execution success
---------------------------------------------------
2020-09-11 06:50:16: case.run success
---------------------------------------------------
2020-09-11 06:50:23: st_archive starting
---------------------------------------------------
2020-09-11 06:50:26: st_archive success
**********************************************************************************************
Then I built another case (--compset B1850). The build and the submission both succeeded, but the case fails at run time. The CaseStatus file for the B1850 case shows:
**********************************************************************************************
2020-09-11 07:24:32: case.submit success case.run:10807627, case.st_archive:10807628
---------------------------------------------------
2020-09-11 07:33:13: case.run starting
---------------------------------------------------
2020-09-11 07:33:28: model execution starting
---------------------------------------------------
2020-09-11 07:48:24: model execution success
---------------------------------------------------
2020-09-11 07:48:24: case.run error
ERROR: RUN FAIL: Command 'srun -n 108 /mnt/scratch/nfs_fs02/yangx2/b1850.test/bld/cesm.exe >> cesm.log.$LID 2>&1 ' failed
See log file for details: /mnt/scratch/nfs_fs02/yangx2/b1850.test/run/cesm.log.10807627.200911-073313
**********************************************************************************************
A truncated excerpt of the corresponding log file, "/mnt/scratch/nfs_fs02/yangx2/b1850.test/run/cesm.log.10807627.200911-073313", is shown below:
**********************************************************************************************
Creating variable thk
Creating variable topg
Creating variable usurf
Writing to file b1850.test.cism.initial_hist.0001-01-01-00000.nc at time 0
.000000000000000E+000
starttype: initial
starttype: initial
Output requests :
--------------------------------------------------
no dedicated output process, any file system
starttype: initial
Output requests :
--------------------------------------------------
no dedicated output process, any file system
starttype: initial
Output requests :
--------------------------------------------------
no dedicated output process, any file system
starttype: initial
Output requests :
--------------------------------------------------
no dedicated output process, any file system
1 IMOD, NAPROC, NBLKRS, NSPEC, RSBLKS= 1 18 35
600 35
2 IMOD, NAPROC, NBLKRS, NSPEC, RSBLKS= 1 18 10
600 17
1 IMOD, NAPROC, NBLKRS, NSPEC, RSBLKS= 1 18 0
600 0
2 IMOD, NAPROC, NBLKRS, NSPEC, RSBLKS= 1 18 10
MCT::m_Router::initp_: GSMap indices not increasing...Will correct
MCT::m_Router::initp_: RGSMap indices not increasing...Will correct
MCT::m_Router::initp_: RGSMap indices not increasing...Will correct
MCT::m_Router::initp_: GSMap indices not increasing...Will correct
MCT::m_Router::initp_: GSMap indices not increasing...Will correct
MCT::m_Router::initp_: RGSMap indices not increasing...Will correct
MCT::m_Router::initp_: RGSMap indices not increasing...Will correct
MCT::m_Router::initp_: GSMap indices not increasing...Will correct
(seq_domain_areafactinit) : min/max mdl2drv 0.999565346641447 1.00000000000000 areafact_o_OCN
(seq_domain_areafactinit) : min/max drv2mdl 1.00000000000000 1.00043484236425 areafact_o_OCN
(seq_domain_areafactinit) : min/max mdl2drv 0.999565346641447 1.00000000000000 areafact_i_ICE
(seq_domain_areafactinit) : min/max drv2mdl 1.00000000000000 1.00043484236425 areafact_i_ICE
calcsize j,iq,jac, lsfrm,lstoo 1 1 1 26 21
calcsize j,iq,jac, lsfrm,lstoo 1 1 2 26 21
calcsize j,iq,jac, lsfrm,lstoo 1 2 1 22 15
calcsize j,iq,jac, lsfrm,lstoo 1 2 2 22 15
calcsize j,iq,jac, lsfrm,lstoo 2 3 2 17 24
calcsize j,iq,jac, lsfrm,lstoo 2 4 1 20 25
calcsize j,iq,jac, lsfrm,lstoo 2 4 2 20 25
calcsize j,iq,jac, lsfrm,lstoo 2 5 1 19 23
calcsize j,iq,jac, lsfrm,lstoo 2 5 2 19 23
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image PC Routine Line Source
cesm.exe 0000000002E50553 Unknown Unknown Unknown
libc-2.17.so 00002B5CBD4E0400 Unknown Unknown Unknown
libmpi.so.12.0.0 00002B5CBCBC30D0 Unknown Unknown Unknown
libmpi.so.12.0.0 00002B5CBC78AEF7 Unknown Unknown Unknown
libmpi.so.12.0.0 00002B5CBCC02E18 Unknown Unknown Unknown
libmpi.so.12.0.0 00002B5CBCC05CE8 Unknown Unknown Unknown
libmpi.so.12.0.0 00002B5CBCC0D92D MPI_Startall Unknown Unknown
libmpifort.so.12. 00002B5CBBE5361B mpi_startall Unknown Unknown
cesm.exe 000000000268DA14 w3wavemd_mp_w3wav 882 w3wavemd.f90
cesm.exe 000000000262C90E wav_comp_mct_mp_w 884 wav_comp_mct.F90
cesm.exe 0000000000434530 component_mod_mp_ 728 component_mod.F90
cesm.exe 0000000000418BC4 cime_comp_mod_mp_ 2738 cime_comp_mod.F90
cesm.exe 0000000000434177 MAIN__ 125 cime_driver.F90
cesm.exe 00000000004166D2 Unknown Unknown Unknown
libc-2.17.so 00002B5CBD4CC555 __libc_start_main Unknown Unknown
cesm.exe 00000000004165E9 Unknown Unknown Unknown
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image PC Routine Line Source
cesm.exe 0000000002E50553 Unknown Unknown Unknown
libc-2.17.so 00002ABB3511E400 Unknown Unknown Unknown
libmpi.so.12.0.0 00002ABB348010D0 Unknown Unknown Unknown
libmpi.so.12.0.0 00002ABB343C8EF7 Unknown Unknown Unknown
libmpi.so.12.0.0 00002ABB34840E18 Unknown Unknown Unknown
libmpi.so.12.0.0 00002ABB34843CE8 Unknown Unknown Unknown
libmpi.so.12.0.0 00002ABB3484B92D MPI_Startall Unknown Unknown
libmpifort.so.12. 00002ABB33A9161B mpi_startall Unknown Unknown
cesm.exe 000000000268DA14 w3wavemd_mp_w3wav 882 w3wavemd.f90
cesm.exe 000000000262C90E wav_comp_mct_mp_w 884 wav_comp_mct.F90
cesm.exe 0000000000434530 component_mod_mp_ 728 component_mod.F90
cesm.exe 0000000000418BC4 cime_comp_mod_mp_ 2738 cime_comp_mod.F90
cesm.exe 0000000000434177 MAIN__ 125 cime_driver.F90
cesm.exe 00000000004166D2 Unknown Unknown Unknown
libc-2.17.so 00002ABB3510A555 __libc_start_main Unknown Unknown
cesm.exe 00000000004165E9 Unknown Unknown Unknown
forrtl: severe (174): SIGSEGV, segmentation fault occurred
...
cesm.exe 00000000004166D2 Unknown Unknown Unknown
libc-2.17.so 00002AAD7F978555 __libc_start_main Unknown Unknown
cesm.exe 00000000004165E9 Unknown Unknown Unknown
forrtl: error (78): process killed (SIGTERM)
Image PC Routine Line Source
cesm.exe 0000000002E50584 Unknown Unknown Unknown
libpthread-2.17.s 00002AB72D843630 Unknown Unknown Unknown
libmpi.so.12.0.0 00002AB7359AA8A6 Unknown Unknown Unknown
libmpi.so.12.0.0 00002AB735A08239 Unknown Unknown Unknown
libmpi.so.12.0.0 00002AB735A07DD0 MPI_Barrier Unknown Unknown
libmpifort.so.12. 00002AB7354CE6BC pmpi_barrier Unknown Unknown
cesm.exe 0000000000418443 cime_comp_mod_mp_ 2458 cime_comp_mod.F90
cesm.exe 0000000000434177 MAIN__ 125 cime_driver.F90
cesm.exe 00000000004166D2 Unknown Unknown Unknown
libc-2.17.so 00002AB736B5C555 __libc_start_main Unknown Unknown
cesm.exe 00000000004165E9 Unknown Unknown Unknown
srun: error: node-0155: tasks 90-107: Exited with exit code 1
**********************************************************************************************
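Since the log repeats the same traceback for many tasks, the frames that carry a source file and line number (here the ones in w3wavemd.f90 and wav_comp_mct.F90) are the useful part. Below is a minimal sketch, assuming the five-column traceback layout shown above (Image, PC, Routine, Line, Source), of how those frames could be pulled out of a log. The function name `frames_with_source` and the sample text are only illustrative and are not part of CESM or CIME.

```python
import re

# Matches traceback rows that have a numeric source line, for example:
#   cesm.exe 000000000268DA14 w3wavemd_mp_w3wav 882 w3wavemd.f90
# Rows ending in "Unknown Unknown" do not match because the line column is not numeric.
FRAME = re.compile(r"^(\S+)\s+([0-9A-Fa-f]+)\s+(\S+)\s+(\d+)\s+(\S+)$")

def frames_with_source(log_text):
    """Return (routine, line, source) for every frame that has a source line number."""
    found = []
    for row in log_text.splitlines():
        match = FRAME.match(row.strip())
        if match:
            _image, _pc, routine, line, source = match.groups()
            found.append((routine, int(line), source))
    return found

# Hypothetical sample: three rows copied from the traceback above.
sample = """\
libmpi.so.12.0.0 00002B5CBCC0D92D MPI_Startall Unknown Unknown
cesm.exe 000000000268DA14 w3wavemd_mp_w3wav 882 w3wavemd.f90
cesm.exe 000000000262C90E wav_comp_mct_mp_w 884 wav_comp_mct.F90
"""

for routine, line, source in frames_with_source(sample):
    print(f"{source}:{line} ({routine})")
```

Reading the real log file and passing its contents to the function would list every frame with a known source location, in the order they appear.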
Any hints would be helpful!
Thanks in advance!
Best,
Skylar