Run Error while building CESM 1.2.2 using threads

nitkbhat@...
Run Error while building CESM 1.2.2 using threads

I am getting a runtime error when I try to run a threaded model with just 2 threads per MPI task (please find attached the env_mach_pes.xml file). The model runs for some time and then fails with the following error in the log (please find attached the CESM log file).

 

MCT::m_Router::initp_: RGSMap indices not increasing...Will correct
MCT::m_Router::initp_: GSMap indices not increasing...Will correct
(seq_domain_areafactinit) : min/max mdl2drv   0.999841513526222       1.00031732638246    areafact_a_ATM
(seq_domain_areafactinit) : min/max drv2mdl   0.999682774281628       1.00015851159572    areafact_a_ATM
(seq_domain_areafactinit) : min/max mdl2drv   0.999841513526350       1.00076245909423    areafact_l_LND
(seq_domain_areafactinit) : min/max drv2mdl   0.999238121806731       1.00015851159559    areafact_l_LND
(seq_domain_areafactinit) : min/max mdl2drv   0.999996826904345      0.999996826905162    areafact_r_ROF
(seq_domain_areafactinit) : min/max drv2mdl    1.00000317310491       1.00000317310572    areafact_r_ROF
(seq_domain_areafactinit) : min/max mdl2drv   0.999565456406962       1.00000000000000    areafact_o_OCN
(seq_domain_areafactinit) : min/max drv2mdl    1.00000000000000       1.00043473250326    areafact_o_OCN
(seq_domain_areafactinit) : min/max mdl2drv   0.999565456406962       1.00000000000000    areafact_i_ICE
(seq_domain_areafactinit) : min/max drv2mdl    1.00000000000000       1.00043473250326    areafact_i_ICE
(seq_mct_drv) : Initialize atm component phase 2 ATM
[48:node2] unexpected disconnect completion event from [77:node7]
Assertion failed in file ../../dapl_conn_rc.c at line 1179: 0
internal ABORT - process 48
[56:node1] unexpected disconnect completion event from [77:node7]
Assertion failed in file ../../dapl_conn_rc.c at line 1179: 0
[52:node2] unexpected disconnect completion event from [77:node7]
Assertion failed in file ../../dapl_conn_rc.c at line 1179: 0
internal ABORT - process 56
internal ABORT - process 52
[50:node2] unexpected disconnect completion event from [56:node1]
Assertion failed in file ../../dapl_conn_rc.c at line 1179: 0
internal ABORT - process 50
[58:node1] unexpected disconnect completion event from [48:node2]
Assertion failed in file ../../dapl_conn_rc.c at line 1179: 0
internal ABORT - process 58
[63:node1] unexpected disconnect completion event from [48:node2]
Assertion failed in file ../../dapl_conn_rc.c at line 1179: 0
internal ABORT - process 63
[55:node2] unexpected disconnect completion event from [56:node1]
Assertion failed in file ../../dapl_conn_rc.c at line 1179: 0
internal ABORT - process 55
[59:node1] unexpected disconnect completion event from [77:node7]
Assertion failed in file ../../dapl_conn_rc.c at line 1179: 0
internal ABORT - process 59
[54:node2] unexpected disconnect completion event from [56:node1]
Assertion failed in file ../../dapl_conn_rc.c at line 1179: 0
internal ABORT - process 54
[53:node2] unexpected disconnect completion event from [56:node1]
Assertion failed in file ../../dapl_conn_rc.c at line 1179: 0
internal ABORT - process 53
[61:node1] unexpected disconnect completion event from [77:node7]
Assertion failed in file ../../dapl_conn_rc.c at line 1179: 0
internal ABORT - process 61
[60:node1] unexpected disconnect completion event from [77:node7]
Assertion failed in file ../../dapl_conn_rc.c at line 1179: 0
internal ABORT - process 60
[57:node1] unexpected disconnect completion event from [48:node2]
Assertion failed in file ../../dapl_conn_rc.c at line 1179: 0
internal ABORT - process 57
[51:node2] unexpected disconnect completion event from [56:node1]
Assertion failed in file ../../dapl_conn_rc.c at line 1179: 0
internal ABORT - process 51
[62:node1] unexpected disconnect completion event from [78:node7]
Assertion failed in file ../../dapl_conn_rc.c at line 1179: 0
internal ABORT - process 62
APPLICATION TERMINATED WITH THE EXIT STRING: Segmentation fault (signal 11)

 

How do I solve this error? Is there anything incorrect in my PE layout?

 

In this case, I have 2 threads for each component. Additionally, I wanted to know whether it is possible to run the CESM model with a different number of threads for each component. I see that OMP_NUM_THREADS is set in the $CASE.run file.

 

Thanks

jedwards

I'm not really sure what's going on, but try setting NTHRDS for all of the components to the same value. I know some of them don't use threads, but it makes the layout easier and may solve the problem.
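One way to apply this advice is a small shell loop from the case directory (a sketch: it assumes a CESM 1.2 case where ./xmlchange is available, and the component list matches the groups in env_mach_pes.xml; with echo it only prints the commands, so drop the echo to actually apply them):

```shell
# Print the xmlchange commands that set NTHRDS=2 for every component.
# Remove "echo" to run them for real inside the case directory.
for comp in ATM LND ICE OCN CPL GLC ROF WAV; do
  echo ./xmlchange NTHRDS_${comp}=2
done
```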

CESM Software Engineer

nitkbhat@...

I set every component to use 2 threads. I have the following configuration:

 

<entry id="NTASKS_ATM"   value="48"  />
<entry id="NTHRDS_ATM"   value="2"  />
<entry id="ROOTPE_ATM"   value="0"  />
<entry id="NINST_ATM"   value="1"  />
<entry id="NINST_ATM_LAYOUT"   value="concurrent"  />

<entry id="NTASKS_LND"   value="4"  />
<entry id="NTHRDS_LND"   value="2"  />
<entry id="ROOTPE_LND"   value="32"  />
<entry id="NINST_LND"   value="1"  />
<entry id="NINST_LND_LAYOUT"   value="concurrent"  />

<entry id="NTASKS_ICE"   value="16"  />
<entry id="NTHRDS_ICE"   value="2"  />
<entry id="ROOTPE_ICE"   value="0"  />
<entry id="NINST_ICE"   value="1"  />
<entry id="NINST_ICE_LAYOUT"   value="concurrent"  />

<entry id="NTASKS_OCN"   value="16"  />
<entry id="NTHRDS_OCN"   value="2"  />
<entry id="ROOTPE_OCN"   value="48"  />
<entry id="NINST_OCN"   value="1"  />
<entry id="NINST_OCN_LAYOUT"   value="concurrent"  />

<entry id="NTASKS_CPL"   value="16"  />
<entry id="NTHRDS_CPL"   value="2"  />
<entry id="ROOTPE_CPL"   value="16"  />

<entry id="NTASKS_GLC"   value="1"  />
<entry id="NTHRDS_GLC"   value="2"  />
<entry id="ROOTPE_GLC"   value="46"  />
<entry id="NINST_GLC"   value="1"  />
<entry id="NINST_GLC_LAYOUT"   value="concurrent"  />

<entry id="NTASKS_ROF"   value="6"  />
<entry id="NTHRDS_ROF"   value="2"  />
<entry id="ROOTPE_ROF"   value="40"  />
<entry id="NINST_ROF"   value="1"  />
<entry id="NINST_ROF_LAYOUT"   value="concurrent"  />

<entry id="NTASKS_WAV"   value="1"  />
<entry id="NTHRDS_WAV"   value="2"  />
<entry id="ROOTPE_WAV"   value="47"  />
<entry id="NINST_WAV"   value="1"  />
<entry id="NINST_WAV_LAYOUT"   value="concurrent"  />

<entry id="PSTRID_ATM"   value="1"  />
<entry id="PSTRID_LND"   value="1"  />
<entry id="PSTRID_ICE"   value="1"  />
<entry id="PSTRID_OCN"   value="1"  />
<entry id="PSTRID_CPL"   value="1"  />
<entry id="PSTRID_GLC"   value="1"  />
<entry id="PSTRID_ROF"   value="1"  />
<entry id="PSTRID_WAV"   value="1"  />

<entry id="TOTALPES"   value="128"  />
<entry id="PES_LEVEL"   value="1r"  />
<entry id="MAX_TASKS_PER_NODE"   value="16"  />
<entry id="PES_PER_NODE"   value="$MAX_TASKS_PER_NODE"  />
<entry id="COST_PES"   value="0"  />
<entry id="CCSM_PCOST"   value="-1"  />
<entry id="CCSM_TCOST"   value="0"  />
<entry id="CCSM_ESTCOST"   value="2"  />
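For reference, the TOTALPES value is consistent with this layout under the usual CESM convention (total MPI tasks = the highest ROOTPE + NTASKS over all components, times the maximum thread count). A quick sketch of that arithmetic, using the values above:

```python
# Layout from env_mach_pes.xml above: component -> (NTASKS, NTHRDS, ROOTPE).
layout = {
    "ATM": (48, 2, 0), "LND": (4, 2, 32), "ICE": (16, 2, 0),
    "OCN": (16, 2, 48), "CPL": (16, 2, 16), "GLC": (1, 2, 46),
    "ROF": (6, 2, 40), "WAV": (1, 2, 47),
}
# Total MPI tasks is set by the component that extends furthest (OCN: 48 + 16).
total_tasks = max(rootpe + ntasks for ntasks, _, rootpe in layout.values())
total_pes = total_tasks * max(nthrds for _, nthrds, _ in layout.values())
print(total_tasks, total_pes)  # 64 tasks -> 128 PEs, matching TOTALPES=128
```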

 

 

 

I am still getting the same error when I run the model. 

 Thanks
jedwards

Given the task numbers in your error message, it looks like the problem is in the ocean component. It could be a threading issue or an out-of-memory problem. Try turning off threading in the ocean only; if it still gives an error, try giving more PEs to the ocean.

Assertion failed in file ../../dapl_conn_rc.c at line 1179: 0

internal ABORT - process 48
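A sketch of those two experiments as commands (the NTASKS value of 32 is a hypothetical example, not a recommendation; echo just prints the commands, so run them without echo from the case directory and rebuild):

```shell
# Experiment 1: disable threading in the ocean only.
echo ./xmlchange NTHRDS_OCN=1
# Experiment 2: if it still fails, give the ocean more tasks (example value).
echo ./xmlchange NTASKS_OCN=32
```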

 

CESM Software Engineer

nitkbhat@...

I have corrected the previous error. Now, when I run with OpenMP, the X compset runs successfully.

However, I get an error when I run the F and B compsets (with the f19_g16 resolution).

 

NetCDF: Variable not found
NetCDF: Attribute not found
NetCDF: Attribute not found
NetCDF: Attribute not found
Reading setup_nml
Reading grid_nml
Reading ice_nml
Reading tracer_nml
CalcWorkPerBlock: Total blocks:    64 Ice blocks:    24 IceFree blocks:    40 Land blocks:     0
MCT::m_Router::initp_: GSMap indices not increasing...Will correct
MCT::m_Router::initp_: RGSMap indices not increasing...Will correct
MCT::m_Router::initp_: RGSMap indices not increasing...Will correct
MCT::m_Router::initp_: GSMap indices not increasing...Will correct
MCT::m_Router::initp_: GSMap indices not increasing...Will correct
MCT::m_Router::initp_: RGSMap indices not increasing...Will correct
MCT::m_Router::initp_: RGSMap indices not increasing...Will correct
MCT::m_Router::initp_: GSMap indices not increasing...Will correct
forrtl: severe (174): SIGSEGV, segmentation fault occurred

Stack trace terminated abnormally.

forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image              PC                Routine            Line        Source
libintlc.so.5      00002B9B529402C9  Unknown               Unknown  Unknown
libintlc.so.5      00002B9B5293EB9E  Unknown               Unknown  Unknown
libifcoremt.so.5   00002B9B517878EF  Unknown               Unknown  Unknown

Stack trace terminated abnormally.

forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image              PC                Routine            Line        Source
libintlc.so.5      00002B9C6946C2C9  Unknown               Unknown  Unknown
libintlc.so.5      00002B9C6946AB9E  Unknown               Unknown  Unknown
libifcoremt.so.5   00002B9C682B38EF  Unknown               Unknown  Unknown
libifcoremt.so.5   00002B9C68218279  Unknown               Unknown  Unknown
libifcoremt.so.5   00002B9C682298B3  Unknown               Unknown  Unknown
libpthread.so.0    00000037FD40F500  Unknown               Unknown  Unknown
cesm.exe           00000000005FCFA2  radsw_mp_radcswmx        1599  radsw.F90
cesm.exe           00000000005DF4F2  radiation_mp_radi         778  radiation.F90
cesm.exe           0000000000590799  physpkg_mp_tphysb        2153  physpkg.F90
cesm.exe           000000000058BAB5  physpkg_mp_phys_r         944  physpkg.F90
libiomp5.so        00002B9C691D2233  Unknown               Unknown  Unknown

forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image              PC                Routine            Line        Source
libintlc.so.5      00002BA6D719A2C9  Unknown               Unknown  Unknown
libintlc.so.5      00002BA6D7198B9E  Unknown               Unknown  Unknown
libifcoremt.so.5   00002BA6D5FE18EF  Unknown               Unknown  Unknown
libifcoremt.so.5   00002BA6D5F46279  Unknown               Unknown  Unknown
libifcoremt.so.5   00002BA6D5F578B3  Unknown               Unknown  Unknown
libpthread.so.0    00000037FD40F500  Unknown               Unknown  Unknown
cesm.exe           00000000005FCFA2  radsw_mp_radcswmx        1599  radsw.F90
cesm.exe           00000000005DF4F2  radiation_mp_radi         778  radiation.F90
cesm.exe           0000000000590799  physpkg_mp_tphysb        2153  physpkg.F90
cesm.exe           000000000058BAB5  physpkg_mp_phys_r         944  physpkg.F90
libiomp5.so        00002BA6D6F00233  Unknown               Unknown  Unknown

forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image              PC                Routine            Line        Source
libintlc.so.5      00002BA1DB5C02C9  Unknown               Unknown  Unknown
libintlc.so.5      00002BA1DB5BEB9E  Unknown               Unknown  Unknown
libifcoremt.so.5   00002BA1DA4078EF  Unknown               Unknown  Unknown
libifcoremt.so.5   00002BA1DA36C279  Unknown               Unknown  Unknown
libifcoremt.so.5   00002BA1DA37D8B3  Unknown               Unknown  Unknown
libpthread.so.0    00000037FD40F500  Unknown               Unknown  Unknown
cesm.exe           00000000005FCFA2  radsw_mp_radcswmx        1599  radsw.F90
cesm.exe           00000000005DF4F2  radiation_mp_radi         778  radiation.F90
cesm.exe           0000000000590799  physpkg_mp_tphysb        2153  physpkg.F90
cesm.exe           000000000058BAB5  physpkg_mp_phys_r         944  physpkg.F90
libiomp5.so        00002BA1DB326233  Unknown               Unknown  Unknown

forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image              PC                Routine            Line        Source
libintlc.so.5      00002B5B8CF6E2C9  Unknown               Unknown  Unknown
libintlc.so.5      00002B5B8CF6CB9E  Unknown               Unknown  Unknown
libifcoremt.so.5   00002B5B8BDB58EF  Unknown               Unknown  Unknown
libifcoremt.so.5   00002B5B8BD1A279  Unknown               Unknown  Unknown
libifcoremt.so.5   00002B5B8BD2B8B3  Unknown               Unknown  Unknown
libpthread.so.0    00000037FD40F500  Unknown               Unknown  Unknown
cesm.exe           00000000005FCFA2  radsw_mp_radcswmx        1599  radsw.F90
cesm.exe           00000000005DF4F2  radiation_mp_radi         778  radiation.F90
cesm.exe           0000000000590799  physpkg_mp_tphysb        2153  physpkg.F90
cesm.exe           000000000058BAB5  physpkg_mp_phys_r         944  physpkg.F90
libiomp5.so        00002B5B8CCD4233  Unknown               Unknown  Unknown

Please find attached the log files and the PE layout files. Have there been OpenMP versions of CESM run with Intel compilers? Should I build NetCDF differently? If anyone has successfully built threaded versions, please let me know. I have been struggling to solve these errors.

Thanks,

Nitin

Indian Institute of Science

 

jedwards

Try running in DEBUG mode:


./xmlchange DEBUG=TRUE

$CASE.clean_build

$CASE.build

 

CESM Software Engineer

nitkbhat@...

I ran in debug mode and got the same output in the log. I tried compiling with the mt_mpi option for the Intel compilers, and I even updated to the Intel v15 compilers. The error just doesn't go away. How can you run hybrid CESM with OpenMP (and MPI) using the Intel v15 compilers? I have successfully built the code, but I am having problems running it. I am not able to debug this code either.

 

Thanks,

Nitin

jedwards

If you are not able to debug the code and it gives an error with OpenMP threading on, have you tried turning threading off? Does it still give an error?

If not, then run with threading off.

CESM Software Engineer

nitkbhat@...

I have tried the following:

1) Running with threading off, where I set NTHRDS to 1 for each component. This run is successful; no errors seen.

2) Running with threading on (which adds an -openmp flag), where I set NTHRDS to 1 for each component. This run is successful; no errors seen.

3) Running with threading on, where I set NTHRDS >= 2. This is where I hit the runtime error: after the MCT router initialization, the process stops running unexpectedly.

 

Is there a specific configuration for running OpenMP threads along with MPI processes in CESM? Should I give the same number of threads to each process? I have tried all possible combinations of PE layout, yet I face the same error.

Thanks,

Nitin 

jedwards

I guess what you do next depends on what your goal is. If your goal does not require threading, then leave it off. If you do require threading, then you will need to debug what you are doing:

Try using a different compiler or compiler version.

Run under a parallel debugger such as ddt or totalview.

Identify the location of the error and manually turn off openmp in that section of code - does this solve the problem?

CESM is regularly run using threading; the issue you are having is somehow unique to your situation. Good luck.

 

CESM Software Engineer

nitkbhat@...

 

I need threading in order to compare the performance of the threaded and non-threaded configurations of the model.

 

I ran the F compset (CAM) on a single node with the f19_g16 resolution. I gave each component 8 processes with 2 threads each.

I used Allinea DDT to debug the code and found that the problem occurs at line 40 of radsw.F90.

The error I am getting is "Process stopped in radsw::radcswmx (radsw.F90:40) with signal SIGSEGV (Segmentation fault)".

 

Memory error detected in radsw::radcswmx (radsw.F90:40):

 

Please find attached the env_mach_pes.xml and the DDT debugging Report. 

Please look at the current stack section in the Additional Information. That indicates the point where the code stopped. 

 

What do you think the issue is about?

 

Thanks,

Nitin K Bhat

SERC,

Indian Institute of Science

jedwards

Is it stopping on the subroutine interface? This is almost certainly because your thread stack size is not large enough; try setting OMP_STACKSIZE to 64M in your environment.
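A sketch of what that looks like (64M is the value suggested above; where exactly to export it depends on your batch script, e.g. before cesm.exe is launched in $CASE.run):

```shell
# Give each OpenMP thread a 64 MB stack before launching cesm.exe.
export OMP_STACKSIZE=64M
# Many sites also raise the master thread's stack limit alongside it:
ulimit -s unlimited 2>/dev/null || true
echo "OMP_STACKSIZE=$OMP_STACKSIZE"
```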

CESM Software Engineer

