Run Error while building CESM 1.2.2 using threads

I am getting a run-time error when I try to run a threaded model with just 2 threads per MPI task (PFA the env_mach_pes.xml file). The model runs for some time and then fails with the following error in the log (PFA the cesm log file):

MCT::m_Router::initp_: RGSMap indices not increasing...Will correct
MCT::m_Router::initp_: GSMap indices not increasing...Will correct
(seq_domain_areafactinit) : min/max mdl2drv   0.999841513526222       1.00031732638246    areafact_a_ATM
(seq_domain_areafactinit) : min/max drv2mdl   0.999682774281628       1.00015851159572    areafact_a_ATM
(seq_domain_areafactinit) : min/max mdl2drv   0.999841513526350       1.00076245909423    areafact_l_LND
(seq_domain_areafactinit) : min/max drv2mdl   0.999238121806731       1.00015851159559    areafact_l_LND
(seq_domain_areafactinit) : min/max mdl2drv   0.999996826904345      0.999996826905162    areafact_r_ROF
(seq_domain_areafactinit) : min/max drv2mdl    1.00000317310491       1.00000317310572    areafact_r_ROF
(seq_domain_areafactinit) : min/max mdl2drv   0.999565456406962       1.00000000000000    areafact_o_OCN
(seq_domain_areafactinit) : min/max drv2mdl    1.00000000000000       1.00043473250326    areafact_o_OCN
(seq_domain_areafactinit) : min/max mdl2drv   0.999565456406962       1.00000000000000    areafact_i_ICE
(seq_domain_areafactinit) : min/max drv2mdl    1.00000000000000       1.00043473250326    areafact_i_ICE
(seq_mct_drv) : Initialize atm component phase 2 ATM
[48:node2] unexpected disconnect completion event from [77:node7]
Assertion failed in file ../../dapl_conn_rc.c at line 1179: 0
internal ABORT - process 48
[56:node1] unexpected disconnect completion event from [77:node7]
Assertion failed in file ../../dapl_conn_rc.c at line 1179: 0
internal ABORT - process 56
[52:node2] unexpected disconnect completion event from [77:node7]
Assertion failed in file ../../dapl_conn_rc.c at line 1179: 0
internal ABORT - process 52
[50:node2] unexpected disconnect completion event from [56:node1]
Assertion failed in file ../../dapl_conn_rc.c at line 1179: 0
internal ABORT - process 50
[58:node1] unexpected disconnect completion event from [48:node2]
Assertion failed in file ../../dapl_conn_rc.c at line 1179: 0
internal ABORT - process 58
[63:node1] unexpected disconnect completion event from [48:node2]
Assertion failed in file ../../dapl_conn_rc.c at line 1179: 0
internal ABORT - process 63
[55:node2] unexpected disconnect completion event from [56:node1]
Assertion failed in file ../../dapl_conn_rc.c at line 1179: 0
internal ABORT - process 55
[59:node1] unexpected disconnect completion event from [77:node7]
Assertion failed in file ../../dapl_conn_rc.c at line 1179: 0
internal ABORT - process 59
[54:node2] unexpected disconnect completion event from [56:node1]
Assertion failed in file ../../dapl_conn_rc.c at line 1179: 0
internal ABORT - process 54
[53:node2] unexpected disconnect completion event from [56:node1]
Assertion failed in file ../../dapl_conn_rc.c at line 1179: 0
internal ABORT - process 53
[61:node1] unexpected disconnect completion event from [77:node7]
Assertion failed in file ../../dapl_conn_rc.c at line 1179: 0
internal ABORT - process 61
[60:node1] unexpected disconnect completion event from [77:node7]
Assertion failed in file ../../dapl_conn_rc.c at line 1179: 0
internal ABORT - process 60
[57:node1] unexpected disconnect completion event from [48:node2]
Assertion failed in file ../../dapl_conn_rc.c at line 1179: 0
internal ABORT - process 57
[51:node2] unexpected disconnect completion event from [56:node1]
Assertion failed in file ../../dapl_conn_rc.c at line 1179: 0
internal ABORT - process 51
[62:node1] unexpected disconnect completion event from [78:node7]
Assertion failed in file ../../dapl_conn_rc.c at line 1179: 0
internal ABORT - process 62
APPLICATION TERMINATED WITH THE EXIT STRING: Segmentation fault (signal 11)

How do I solve this error? Is there anything incorrect with my PE layout? In this case, I have 2 threads for each component. Additionally, I wanted to know if it is possible to run the CESM model with a different number of threads for each component. I see that OMP_NUM_THREADS is set in the $CASE.run file.

Thanks
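(For reference: per-component thread counts are controlled by the NTHRDS_* entries in env_mach_pes.xml, so different components can in principle use different values. A minimal sketch, assuming the standard CESM 1.2 xmlchange workflow; the values shown are purely illustrative:

    ./xmlchange -file env_mach_pes.xml -id NTHRDS_ATM -val 2   # 2 OpenMP threads per ATM MPI task
    ./xmlchange -file env_mach_pes.xml -id NTHRDS_OCN -val 1   # ocean left unthreaded
    ./cesm_setup -clean                                        # regenerate the run script
    ./cesm_setup
    ./$CASE.clean_build                                        # PE-layout changes require a clean rebuild
    ./$CASE.build

The $CASE.run script then sets OMP_NUM_THREADS from these values at run time.)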
 

jedwards

CSEG and Liaisons
Staff member
I'm not really sure what's going on, but try setting NTHRDS for all of the components to the same value. Some of them don't use threads, I know, but it makes the layout easier and may solve the problem.
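A minimal sketch of what "the same NTHRDS everywhere" could look like with xmlchange (the component list and the re-setup/rebuild steps are assumptions based on the usual CESM 1.2.2 case workflow, and may vary by version):

    cd $CASEROOT
    for comp in ATM LND ICE OCN CPL GLC ROF WAV; do
        ./xmlchange -file env_mach_pes.xml -id NTHRDS_$comp -val 2
    done
    ./cesm_setup -clean
    ./cesm_setup
    ./$CASE.clean_build
    ./$CASE.build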
 

 
I set every component to have 2 threads. I have the following configuration. I am still getting the same error when I run the model. Thanks
 
 

jedwards

CSEG and Liaisons
Staff member
Given the task numbers in your error message, it looks like the problem is in the ocean component. It could be a threading issue or it could be an out-of-memory problem. Try turning off threading in the ocean only; if it still gives an error, try giving more pes to the ocean.

Assertion failed in file ../../dapl_conn_rc.c at line 1179: 0
internal ABORT - process 48
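A sketch of the two experiments suggested above, expressed as the usual xmlchange calls (the NTASKS value is purely illustrative, and re-running cesm_setup plus a clean rebuild after any PE-layout change is assumed):

    # 1) turn threading off for the ocean only
    ./xmlchange -file env_mach_pes.xml -id NTHRDS_OCN -val 1

    # 2) if it still fails, also give the ocean more MPI tasks (example value)
    ./xmlchange -file env_mach_pes.xml -id NTASKS_OCN -val 32

    ./cesm_setup -clean; ./cesm_setup
    ./$CASE.clean_build; ./$CASE.build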
 

 
I have corrected the previous error. Now, when I try to run with OpenMP, the X compset runs successfully, but I get an error when I run the F and B compsets (with the f19_g16 resolution):

NetCDF: Variable not found
NetCDF: Attribute not found
NetCDF: Attribute not found
NetCDF: Attribute not found
Reading setup_nml
Reading grid_nml
Reading ice_nml
Reading tracer_nml
CalcWorkPerBlock: Total blocks:    64 Ice blocks:    24 IceFree blocks:    40 Land blocks:     0
MCT::m_Router::initp_: GSMap indices not increasing...Will correct
MCT::m_Router::initp_: RGSMap indices not increasing...Will correct
MCT::m_Router::initp_: RGSMap indices not increasing...Will correct
MCT::m_Router::initp_: GSMap indices not increasing...Will correct
MCT::m_Router::initp_: GSMap indices not increasing...Will correct
MCT::m_Router::initp_: RGSMap indices not increasing...Will correct
MCT::m_Router::initp_: RGSMap indices not increasing...Will correct
MCT::m_Router::initp_: GSMap indices not increasing...Will correct
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Stack trace terminated abnormally.
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image              PC                Routine            Line        Source
libintlc.so.5      00002B9B529402C9  Unknown               Unknown  Unknown
libintlc.so.5      00002B9B5293EB9E  Unknown               Unknown  Unknown
libifcoremt.so.5   00002B9B517878EF  Unknown               Unknown  Unknown
Stack trace terminated abnormally.
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image              PC                Routine            Line        Source
libintlc.so.5      00002B9C6946C2C9  Unknown               Unknown  Unknown
libintlc.so.5      00002B9C6946AB9E  Unknown               Unknown  Unknown
libifcoremt.so.5   00002B9C682B38EF  Unknown               Unknown  Unknown
libifcoremt.so.5   00002B9C68218279  Unknown               Unknown  Unknown
libifcoremt.so.5   00002B9C682298B3  Unknown               Unknown  Unknown
libpthread.so.0    00000037FD40F500  Unknown               Unknown  Unknown
cesm.exe           00000000005FCFA2  radsw_mp_radcswmx        1599  radsw.F90
cesm.exe           00000000005DF4F2  radiation_mp_radi         778  radiation.F90
cesm.exe           0000000000590799  physpkg_mp_tphysb        2153  physpkg.F90
cesm.exe           000000000058BAB5  physpkg_mp_phys_r         944  physpkg.F90
libiomp5.so        00002B9C691D2233  Unknown               Unknown  Unknown
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image              PC                Routine            Line        Source
libintlc.so.5      00002BA6D719A2C9  Unknown               Unknown  Unknown
libintlc.so.5      00002BA6D7198B9E  Unknown               Unknown  Unknown
libifcoremt.so.5   00002BA6D5FE18EF  Unknown               Unknown  Unknown
libifcoremt.so.5   00002BA6D5F46279  Unknown               Unknown  Unknown
libifcoremt.so.5   00002BA6D5F578B3  Unknown               Unknown  Unknown
libpthread.so.0    00000037FD40F500  Unknown               Unknown  Unknown
cesm.exe           00000000005FCFA2  radsw_mp_radcswmx        1599  radsw.F90
cesm.exe           00000000005DF4F2  radiation_mp_radi         778  radiation.F90
cesm.exe           0000000000590799  physpkg_mp_tphysb        2153  physpkg.F90
cesm.exe           000000000058BAB5  physpkg_mp_phys_r         944  physpkg.F90
libiomp5.so        00002BA6D6F00233  Unknown               Unknown  Unknown
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image              PC                Routine            Line        Source
libintlc.so.5      00002BA1DB5C02C9  Unknown               Unknown  Unknown
libintlc.so.5      00002BA1DB5BEB9E  Unknown               Unknown  Unknown
libifcoremt.so.5   00002BA1DA4078EF  Unknown               Unknown  Unknown
libifcoremt.so.5   00002BA1DA36C279  Unknown               Unknown  Unknown
libifcoremt.so.5   00002BA1DA37D8B3  Unknown               Unknown  Unknown
libpthread.so.0    00000037FD40F500  Unknown               Unknown  Unknown
cesm.exe           00000000005FCFA2  radsw_mp_radcswmx        1599  radsw.F90
cesm.exe           00000000005DF4F2  radiation_mp_radi         778  radiation.F90
cesm.exe           0000000000590799  physpkg_mp_tphysb        2153  physpkg.F90
cesm.exe           000000000058BAB5  physpkg_mp_phys_r         944  physpkg.F90
libiomp5.so        00002BA1DB326233  Unknown               Unknown  Unknown
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image              PC                Routine            Line        Source
libintlc.so.5      00002B5B8CF6E2C9  Unknown               Unknown  Unknown
libintlc.so.5      00002B5B8CF6CB9E  Unknown               Unknown  Unknown
libifcoremt.so.5   00002B5B8BDB58EF  Unknown               Unknown  Unknown
libifcoremt.so.5   00002B5B8BD1A279  Unknown               Unknown  Unknown
libifcoremt.so.5   00002B5B8BD2B8B3  Unknown               Unknown  Unknown
libpthread.so.0    00000037FD40F500  Unknown               Unknown  Unknown
cesm.exe           00000000005FCFA2  radsw_mp_radcswmx        1599  radsw.F90
cesm.exe           00000000005DF4F2  radiation_mp_radi         778  radiation.F90
cesm.exe           0000000000590799  physpkg_mp_tphysb        2153  physpkg.F90
cesm.exe           000000000058BAB5  physpkg_mp_phys_r         944  physpkg.F90
libiomp5.so        00002B5B8CCD4233  Unknown               Unknown  Unknown

Please find attached the log files and the PE layout files. Have there been OpenMP versions of CESM run with Intel compilers? Should I build NetCDF differently? If anyone has successfully built threaded versions, please let me know. I have been struggling to solve these errors.

Thanks,
Nitin
Indian Institute of Science
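One possibility worth ruling out before rebuilding libraries (this is an assumption, not something established in this thread): segmentation faults inside threaded CAM physics such as radcswmx are often caused by too-small per-thread stacks, since the radiation code uses large automatic arrays. The settings below are standard Linux/OpenMP knobs with illustrative, machine-dependent values; in the csh-based $CASE.run script the equivalents would be set with limit/setenv:

    ulimit -s unlimited         # process / master-thread stack limit
    export OMP_STACKSIZE=256M   # per-thread stack for OpenMP worker threads
    export KMP_STACKSIZE=256M   # Intel-runtime-specific equivalent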
 
I ran in debug mode and got the same output in the log. I tried compiling with the mt_mpi option for the Intel compilers. I even updated to the Intel v15 compilers. The error just doesn't go away. How can you run hybrid CESM with OpenMP (and MPI) using the Intel v15 compilers? I have successfully built the code, but I am having problems running it. I am also not able to debug this code.

Thanks,
Nitin
 

jedwards

CSEG and Liaisons
Staff member
If you are not able to debug the code and it gives an error with OpenMP threading on, have you tried turning threading off? Does it still give an error? If not, then run with threading off.
 

 
I have tried the following:
1) Running with threading off, where I give nthreads as 1 for each process. This run is successful; no errors seen.
2) Running with threading on (which adds an -openmp flag), where I give nthreads as 1 for each process. This run is successful; no errors seen.
3) Running with threading on, where I give nthreads >= 2. This is where I face the run-time error: after the MCT router initialization, the processes stop running unexpectedly.

Is there a specific configuration for running OpenMP threads along with MPI processes in CESM? Should I give the same number of threads to each process? I have tried all possible combinations of PE layout, yet I face the same error.

Thanks,
Nitin
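For completeness, a sketch of a consistent "stacked" hybrid layout as it would appear in env_mach_pes.xml (entry names follow the standard CESM 1.2 file; the values are illustrative only). With every component given the same NTASKS/NTHRDS and ROOTPE = 0, the batch job should request NTASKS x NTHRDS cores in total:

    <entry id="NTASKS_ATM"   value="8"  />
    <entry id="NTHRDS_ATM"   value="2"  />
    <entry id="ROOTPE_ATM"   value="0"  />
    <!-- repeat the same NTASKS/NTHRDS/ROOTPE pattern for LND, ICE, OCN, CPL, GLC, ROF (and WAV) -->
    <!-- this example layout needs 8 MPI tasks x 2 threads = 16 cores -->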
 

jedwards

CSEG and Liaisons
Staff member
I guess that what you do next depends on what your goal is. If your goal does not require threading, then leave it off. If you do require threading, then you will need to debug what you are doing:
1) Try using a different compiler or compiler version.
2) Run under a parallel debugger such as ddt or totalview.
3) Identify the location of the error and manually turn off OpenMP in that section of code. Does this solve the problem?
CESM is regularly run using threading; the issue you are having is somehow unique to your situation. Good luck.
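If the earlier "debug mode" run did not already use it, one low-effort step along the compiler-flags route is the stock DEBUG switch in env_build.xml (a sketch of the standard CESM 1.2 workflow; the exact flags it adds, typically bounds checking and traceback for Intel, are compiler-dependent). This often turns a bare SIGSEGV into a specific array-bounds message:

    ./xmlchange -file env_build.xml -id DEBUG -val TRUE
    ./$CASE.clean_build
    ./$CASE.build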
 

 
I need threading to compare the performance between the threaded model and the non-threaded model. I ran the F model (CAM) on a single node with the f19_g16 resolution. For each component I gave 8 processes and 2 threads each. I used Allinea DDT to debug the code and found that the problem occurs at line 40 of radsw.F90. The error I am getting is "Process stopped in radsw::radcswmx (radsw.F90:40) with signal SIGSEGV (Segmentation fault)", along with "Memory error detected in radsw::radcswmx (radsw.F90:40)". Please find attached the env_mach_pes.xml and the DDT debugging report; please look at the current stack section in the Additional Information, which indicates the point where the code stopped. What do you think the issue is?

Thanks,
Nitin K Bhat
SERC, Indian Institute of Science
 