
CESM run failed with error in wrap_mpi.F90

I'm trying to run cesm1.2.2 and get the following error messages. Does anyone have any clue? However, if I use only one node, the run executes successfully. I use icc_15.0.2, netcdf_fortran+c_4.4.2 and openmpi_1.8.4. Thanks!

/inputdata/atm/cam/chem/trop_mozart_aero/emis/RCP85_mam3_num_a2_elev_2000-2100_c20110913.nc     1376256
NetCDF: Variable not found
NetCDF: Variable not found
Opened existing file /inputdata/atm/cam/chem/trop_mozart/dvel/regrid_vegetation.nc     1441792
forrtl: error (78): process killed (SIGTERM)
Image              PC                Routine            Line        Source
cesm.exe           0000000008CD80C1  Unknown               Unknown  Unknown
cesm.exe           0000000008CD6817  Unknown               Unknown  Unknown
libnetcdff.so.6    00002B515F483912  Unknown               Unknown  Unknown
libnetcdff.so.6    00002B515F483766  Unknown               Unknown  Unknown
libnetcdff.so.6    00002B515F46A30C  Unknown               Unknown  Unknown
libnetcdff.so.6    00002B515F46E343  Unknown               Unknown  Unknown
libpthread.so.0    00002B5163B377E0  Unknown               Unknown  Unknown
libc.so.6          00002B5163E24113  Unknown               Unknown  Unknown
libopen-pal.so.6   00002B516704F95A  Unknown               Unknown  Unknown
libopen-pal.so.6   00002B516704563B  Unknown               Unknown  Unknown
libopen-pal.so.6   00002B5166FFCC3D  Unknown               Unknown  Unknown
mca_pml_ob1.so     00002B516EE2FE9E  Unknown               Unknown  Unknown
libmpi.so.1        00002B5162E6B531  Unknown               Unknown  Unknown
libmpi_mpifh.so.2  00002B51636765F2  Unknown               Unknown  Unknown
cesm.exe           00000000027EBD94  mpiscatterv_              976  wrap_mpi.F90
cesm.exe           0000000001496C8E  phys_grid_mp_scat        2076  phys_grid.F90
cesm.exe           0000000000BCDF08  mo_drydep_mp_inte        2286  mo_drydep.F90
cesm.exe           0000000000BAFC90  mo_drydep_mp_dvel        1879  mo_drydep.F90
cesm.exe           0000000000AD2F8B  mo_chemini_mp_che         215  mo_chemini.F90
cesm.exe           000000000095302D  chemistry_mp_chem        1010  chemistry.F90
cesm.exe           000000000161C041  physpkg_mp_phys_i         745  physpkg.F90
cesm.exe           0000000000714CC9  cam_comp_mp_cam_i         181  cam_comp.F90
cesm.exe           00000000006D1D48  atm_comp_mct_mp_a         276  atm_comp_mct.F90
cesm.exe           000000000042945B  ccsm_comp_mod_mp_        1058  ccsm_comp_mod.F90
cesm.exe           00000000004BB2E5  MAIN__                     90  ccsm_driver.F90
cesm.exe           000000000040B1FE  Unknown               Unknown  Unknown
libc.so.6          00002B5163D63D5D  Unknown               Unknown  Unknown
cesm.exe           000000000040B109  Unknown               Unknown  Unknown
 

santos

Member
Often a "SIGTERM" error occurs if your job is longer than the wall clock time that you specified for the batch system. How long did this job run for? Is it under the limit for the batch system on the machine you're using?
 

santos

Member
Hmm. I ask because CESM doesn't ever send SIGTERM to itself. There must be some interaction with a system or MPI daemon that is terminating the process, and it's not clear what's happening from the piece of the log that you sent. If you could attach the full cesm and atm logs it might help.
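When going through full logs, the lines that report a signal are often the quickest way to find which process died first. A minimal sketch, assuming a POSIX shell and a grep that supports `-H`; the function name `find_signal_lines` is hypothetical, and the patterns are modelled only on the Intel runtime and mpirun messages seen in this kind of log, so they may need adjusting for other compilers or MPI libraries:

```shell
# Hypothetical helper: list signal-related lines from one or more log files.
# The patterns are modelled on forrtl and mpirun messages; adjust as needed.
find_signal_lines() {
    # -n prints line numbers, -H prints the file name, -E enables extended regex.
    grep -nHE 'exited on signal [0-9]+|forrtl: (severe|error) \([0-9]+\)' "$@"
}

# Example (log file names are placeholders):
#   find_signal_lines cesm.log.* atm.log.*
```

Each match is printed with its file name and line number, so the surrounding context can be opened directly in an editor.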
 
I have attached the log file. ccsm.log and atm.log don't reveal any further information. Actually the run with compset B20TRC5CN is successful. The problem seems to be related to the compset BRCP85C5CN. 
 

santos

Member
Ah, this is actually useful, from the last line of the CESM log: "mpirun noticed that process rank 51 with PID 0 on node n0607 exited on signal 11 (Segmentation fault)." That segfault is what's causing all the rest of the errors. Can you create a new case, set DEBUG=TRUE before building, and try running that? That might provide a better idea of where the error is, and it will hopefully print more information to the log.
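For reference, setting DEBUG in a new case might look roughly like the sketch below. It assumes a CESM 1.2 case directory that has already been created and set up, and that its `xmlchange` script accepts the `-file`/`-id`/`-val` form; the function name `enable_debug_build` and the example path are hypothetical:

```shell
# Hypothetical helper: enable DEBUG in a freshly set-up CESM 1.2 case.
# Assumes "$1" is the case directory and that the case has already been set up.
enable_debug_build() {
    casedir="$1"
    if [ ! -x "$casedir/xmlchange" ]; then
        echo "no xmlchange script in '$casedir'; is this a case directory?" >&2
        return 1
    fi
    # Run in a subshell so the caller's working directory is unchanged.
    (
        cd "$casedir" || exit 1
        # DEBUG is a build-time setting, so it must be changed before building.
        ./xmlchange -file env_build.xml -id DEBUG -val TRUE
    )
}

# Example (case path is a placeholder):
#   enable_debug_build "$HOME/cases/my_debug_case"
#   then build and submit with the case's own build and submit scripts
```

Using a new case rather than flipping the flag in an existing one avoids mixing objects from the earlier non-debug build.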
 
Thanks a lot for looking into the problem. I already turned on DEBUG and also set debug flags for the compiler. I checked node n0607 with the system manager and the node seems OK.
 

santos

Member
Do you have a core file from the segfault with DEBUG on? It would be good to know what line of code the model was on when it crashed.
 

santos

Member
There's not much that I can do with a binary core file alone. Can you run gdb on this and send me the output? The command is just `gdb cesm.exe core.19360` in the run directory for this case.
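If an interactive session is awkward on the compute system, the backtrace can also be captured non-interactively. A minimal sketch, assuming gdb is on the PATH; the function name `core_backtrace` and the default output file name are hypothetical:

```shell
# Hypothetical helper: write a backtrace from a core file to a text file.
# Arguments: executable, core file, optional output file (names are placeholders).
core_backtrace() {
    exe="$1"; core="$2"; out="${3:-gdb_backtrace.txt}"
    if [ ! -f "$exe" ] || [ ! -f "$core" ]; then
        echo "usage: core_backtrace <executable> <core file> [output file]" >&2
        return 1
    fi
    # -batch exits after running the -ex commands; "bt" prints the backtrace
    # of the thread that was current when the core was written.
    gdb -batch -ex "bt" "$exe" "$core" > "$out" 2>&1
}

# Example, run from the case's run directory:
#   core_backtrace cesm.exe core.19360
```

The resulting text file is small enough to attach to a post, unlike the core file itself.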
 