CESM run failed with error in wrap_mpi.F90

qiong.yang@...

I'm trying to run CESM 1.2.2 and get the error messages below. Does anyone have any clue? If I use only one node, the run executes successfully.

I use icc 15.0.2, netCDF-Fortran/C 4.4.2, and OpenMPI 1.8.4. Thanks!

```
/inputdata/atm/cam/chem/trop_mozart_aero/emis/RCP85_mam3_num_a2_elev_2000-2100_c20110913.nc     1376256
 NetCDF: Variable not found
 NetCDF: Variable not found
 Opened existing file
/inputdata/atm/cam/chem/trop_mozart/dvel/regrid_vegetation.nc     1441792
forrtl: error (78): process killed (SIGTERM)
Image              PC                Routine            Line        Source
cesm.exe           0000000008CD80C1  Unknown               Unknown  Unknown
cesm.exe           0000000008CD6817  Unknown               Unknown  Unknown
libnetcdff.so.6    00002B515F483912  Unknown               Unknown  Unknown
libnetcdff.so.6    00002B515F483766  Unknown               Unknown  Unknown
libnetcdff.so.6    00002B515F46A30C  Unknown               Unknown  Unknown
libnetcdff.so.6    00002B515F46E343  Unknown               Unknown  Unknown
libpthread.so.0    00002B5163B377E0  Unknown               Unknown  Unknown
libc.so.6          00002B5163E24113  Unknown               Unknown  Unknown
libopen-pal.so.6   00002B516704F95A  Unknown               Unknown  Unknown
libopen-pal.so.6   00002B516704563B  Unknown               Unknown  Unknown
libopen-pal.so.6   00002B5166FFCC3D  Unknown               Unknown  Unknown
mca_pml_ob1.so     00002B516EE2FE9E  Unknown               Unknown  Unknown
libmpi.so.1        00002B5162E6B531  Unknown               Unknown  Unknown
libmpi_mpifh.so.2  00002B51636765F2  Unknown               Unknown  Unknown
cesm.exe           00000000027EBD94  mpiscatterv_              976  wrap_mpi.F90
cesm.exe           0000000001496C8E  phys_grid_mp_scat        2076  phys_grid.F90
cesm.exe           0000000000BCDF08  mo_drydep_mp_inte        2286  mo_drydep.F90
cesm.exe           0000000000BAFC90  mo_drydep_mp_dvel        1879  mo_drydep.F90
cesm.exe           0000000000AD2F8B  mo_chemini_mp_che         215  mo_chemini.F90
cesm.exe           000000000095302D  chemistry_mp_chem        1010  chemistry.F90
cesm.exe           000000000161C041  physpkg_mp_phys_i         745  physpkg.F90
cesm.exe           0000000000714CC9  cam_comp_mp_cam_i         181  cam_comp.F90
cesm.exe           00000000006D1D48  atm_comp_mct_mp_a         276  atm_comp_mct.F90
cesm.exe           000000000042945B  ccsm_comp_mod_mp_        1058  ccsm_comp_mod.F90
cesm.exe           00000000004BB2E5  MAIN__                     90  ccsm_driver.F90
cesm.exe           000000000040B1FE  Unknown               Unknown  Unknown
libc.so.6          00002B5163D63D5D  Unknown               Unknown  Unknown
cesm.exe           000000000040B109  Unknown               Unknown  Unknown
```

 
santos

Often a "SIGTERM" error occurs if your job is longer than the wall clock time that you specified for the batch system. How long did this job run for? Is it under the limit for the batch system on the machine you're using?

Sean Patrick Santos

CESM Software Engineering Group

qiong.yang@...

The job failed almost instantaneously, lasting about 40 seconds. The walltime was set to 8 hours, so I don't think this is related to the walltime.

santos

Hmm. I ask because CESM doesn't ever send SIGTERM to itself. There must be some interaction with a system or MPI daemon that is terminating the process, and it's not clear what's happening from the piece of the log that you sent. If you could attach the full cesm and atm logs, it might help.

Sean Patrick Santos

CESM Software Engineering Group

qiong.yang@...

I have attached the log file. ccsm.log and atm.log don't reveal any further information. A run with compset B20TRC5CN is actually successful; the problem seems to be specific to the compset BRCP85C5CN.

 

santos

Ah, this is actually useful, from the last line of the CESM log:

"mpirun noticed that process rank 51 with PID 0 on node n0607 exited on signal 11 (Segmentation fault)."

That segfault is what's causing all the rest of the errors.

Can you create a new case, set DEBUG=TRUE before building, and try running that? That might provide a better idea of where the error is, and it will hopefully print more information to the log.
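For reference, the case setup suggested above might look roughly like this with the CESM 1.2 scripts. This is only a sketch: the case name, path, resolution, and machine name are placeholders to be replaced with your own.

```shell
# Create a fresh case (case name, path, resolution, and machine are placeholders)
./create_newcase -case ~/cases/brcp85_debug -res f19_g16 \
    -compset BRCP85C5CN -mach yourmachine

cd ~/cases/brcp85_debug
./cesm_setup

# Turn on debug compilation BEFORE building
./xmlchange -file env_build.xml -id DEBUG -val TRUE

# Build and submit the case
./brcp85_debug.build
./brcp85_debug.submit
```

With DEBUG=TRUE the model is compiled with bounds checking and traceback flags, so a segfault should report a meaningful source line instead of "Unknown".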

Sean Patrick Santos

CESM Software Engineering Group

qiong.yang@...

Thanks a lot for looking into the problem. I already turned on debugging and also set debug flags for the compiler. I checked node n0607 with the system administrator and the node seems OK.

santos

Do you have a core file from the segfault with DEBUG on? It would be good to know what line of code the model was on when it crashed.

Sean Patrick Santos

CESM Software Engineering Group

qiong.yang@...

Here is the core file. Thanks!

santos

There's not much that I can do with a binary core file alone. Can you run gdb on this and send me the output? The command is just `gdb cesm.exe core.19360` in the run directory for this case.
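If an interactive session is inconvenient, gdb can also be scripted to dump the backtrace straight to a file (the core-file name here is the example from above and will differ on your system):

```shell
# Print a full backtrace, including local variables, from the core file
# non-interactively; redirect everything into a text file to share.
gdb -batch -ex "bt full" cesm.exe core.19360 > gdb_backtrace.txt 2>&1
```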

Sean Patrick Santos

CESM Software Engineering Group

