
BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES

Lumoss

Member
Hi, everyone:

I have successfully created a case, and ./case.setup and ./case.build both completed. Now ./case.submit fails with the following error (I have also turned on debug):

Bash:
ERROR: RUN FAIL: Command 'mpirun  -np 135  /public/home//src_cesm2_3_beta08/projects/scratch/mycase_test1/bld/cesm.exe   >> cesm.log.$LID 2>&1 ' failed
See log file for details: /public/home//src_cesm2_3_beta08/projects/scratch/mycase_test1/run/cesm.log.788157.mgr.231104-193446

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   RANK 0 PID 151022 RUNNING AT node15
=   KILLED BY SIGNAL: 9 (Killed)
===================================================================================

I am using cesm2_3_beta08.

Could anyone suggest how I can resolve this problem?

Attached is my cesm.log file.

All the best,
Lumos.
 

Attachments

  • cesm.log.788157.mgr.231104-193446.txt
    122.5 KB

jedwards

CSEG and Liaisons
Staff member
There is no information here regarding the nature of the issue. Have you looked at the component logs? The atm.log may have more useful information.
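For reference, each component writes its own log to the case run directory; a minimal way to locate them, assuming a standard CIME case directory (a sketch):

Bash:
# Run from the case directory: print the run directory, then list the logs.
./xmlquery RUNDIR
ls -lt $(./xmlquery --value RUNDIR)/*.log.*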
 

Lumoss

Member
jedwards,

Thank you for your reply. I checked the other component logs, but I could not find any informative message in them. Attached is my atm.log.

Thank you,
Lumos.
 

Lumoss

Member
Since the atm.log file is too large to upload, I split it into two files.
 

Attachments

  • atm.log.788157.mgr.231104-193446(1).txt
    554.2 KB
  • atm.log.788157.mgr.231104-193446(2).txt
    594.3 KB

jedwards

CSEG and Liaisons
Staff member
Nothing there. You may need to seek help from your local system administrators. Can you run a simple MPI 'hello world' program?
 

Lumoss

Member
Thanks for your reply. Could you tell me how to run an MPI 'hello world' program? I'm using Intel MPI and have successfully run the CMAQ model on this machine; I don't know if that helps.
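For reference, a minimal MPI 'hello world' test looks like the sketch below (it assumes Intel MPI's mpiicc wrapper is on your PATH; the file and program names are arbitrary):

Bash:
# Write a minimal MPI program to hello.c.
cat > hello.c << 'EOF'
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, size, len;
    char name[MPI_MAX_PROCESSOR_NAME];
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this rank's id */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total number of ranks */
    MPI_Get_processor_name(name, &len);     /* host this rank runs on */
    printf("Hello world from processor %s, rank %d out of %d processors\n",
           name, rank, size);
    MPI_Finalize();
    return 0;
}
EOF
# Compile with the Intel MPI C wrapper and run on 4 ranks.
mpiicc hello.c -o helloworld
mpirun -np 4 ./helloworld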
 

Lumoss

Member
Also, I am creating the case at the ne0CONUSne30x8_ne0CONUSne30x8_mt1 resolution. How many resources will be needed to run this case?
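As a general pointer, CIME can report and adjust the processor layout of an existing case; a sketch run from the case directory (the task count below is purely illustrative):

Bash:
# Show how many tasks/threads each component currently uses.
./pelayout
# Illustrative: give every component four nodes' worth of tasks
# (in CIME, a negative NTASKS is interpreted as a number of whole nodes).
./xmlchange NTASKS=-4
./case.setup --reset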
 

Lumoss

Member
As an aside, I asked a question in the High resolution/variable resolution topic (atmsrf_ne120np4_181018.nc could not be found at "https://svn-ccsm-inputdata.cgd.ucar.edu/trunk/inputdata/atm/cam/chem/trop_mam/"), but no one has responded; I wonder if you could help me? This file seems to be stored on Cheyenne.

Thank you,
Lumos.
 

Lumoss

Member
I tried to run an MPI 'hello world' program, and the result is below.
Bash:
mpirun -np 4 helloworld

Hello world from processor login, rank 2 out of 4 processors
Hello world from processor login, rank 3 out of 4 processors
Hello world from processor login, rank 1 out of 4 processors
[0] MPI startup(): I_MPI_SHM_LMT environment variable is not supported.
[0] MPI startup(): Similar variables:
         I_MPI_SHM_HEAP
         I_MPI_SHM
         I_MPI_SHM_OPT
         I_MPI_SHM_THP
[0] MPI startup(): To check the list of supported variables, use the impi_info utility or refer to https://software.intel.com/en-us/mpi-library/documentation/get-started.
Hello world from processor login, rank 0 out of 4 processors
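For what it's worth, that startup warning only means this Intel MPI release does not recognize I_MPI_SHM_LMT; unsetting it silences the message, and impi_info (as the message itself suggests) lists what is supported:

Bash:
unset I_MPI_SHM_LMT
impi_info   # lists the environment variables this Intel MPI release supports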
 

Lumoss

Member
Hi jedwards,

I tried to modify my config_batch.xml, but I got a new error:

Bash:
shr_carma_readnl:  no carma_inparm namelist found in drv_flds_in
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image              PC                Routine            Line        Source
cesm.exe           0000000002A06C7A  Unknown               Unknown  Unknown
libpthread-2.17.s  00002ACAA85D45E0  Unknown               Unknown  Unknown
librxm-fi.so       00002ACBA73B99A3  Unknown               Unknown  Unknown
librxm-fi.so       00002ACBA73BC568  Unknown               Unknown  Unknown
librxm-fi.so       00002ACBA73BC609  Unknown               Unknown  Unknown
librxm-fi.so       00002ACBA73D2A3D  Unknown               Unknown  Unknown
librxm-fi.so       00002ACBA73D380B  Unknown               Unknown  Unknown
libmpi.so.12.0.0   00002ACAA78AF166  Unknown               Unknown  Unknown
libmpi.so.12.0.0   00002ACAA73F4F4B  Unknown               Unknown  Unknown
libmpi.so.12.0.0   00002ACAA7A4F44B  Unknown               Unknown  Unknown
libmpi.so.12.0.0   00002ACAA7999AD8  PMPI_Send             Unknown  Unknown
cesm.exe           000000000284853B  Unknown               Unknown  Unknown
cesm.exe           0000000002845B49  Unknown               Unknown  Unknown
libesmf.so         00002ACA9764E481  _Z33get_nodeCoord     Unknown  Unknown
libesmf.so         00002ACA97B138F2  _Z36ESMCI_mesh_cr     Unknown  Unknown
libesmf.so         00002ACA97B12B62  _Z27ESMCI_mesh_cr     Unknown  Unknown
libesmf.so         00002ACA97AE66A5  _ZN5ESMCI7MeshCap     Unknown  Unknown
libesmf.so         00002ACA97B17351  c_esmc_meshcreate     Unknown  Unknown
libesmf.so         00002ACA98513383  esmf_meshmod_mp_e     Unknown  Unknown
cesm.exe           0000000001C9F122  lnd_set_decomp_an          97  lnd_set_decomp_and_domain.F90
cesm.exe           0000000001C901CE  lnd_comp_nuopc_mp         627  lnd_comp_nuopc.F90
libesmf.so         00002ACA976838D0  _ZN5ESMCI6FTable1     Unknown  Unknown
libesmf.so         00002ACA9768787B  ESMCI_FTableCallE     Unknown  Unknown
libesmf.so         00002ACA97D9D33A  _ZN5ESMCI3VMK5ent     Unknown  Unknown
libesmf.so         00002ACA97DBD015  _ZN5ESMCI2VM5ente     Unknown  Unknown
libesmf.so         00002ACA97684F8A  c_esmc_ftablecall     Unknown  Unknown
libesmf.so         00002ACA98044C50  esmf_compmod_mp_e     Unknown  Unknown
libesmf.so         00002ACA982A2F11  esmf_gridcompmod_     Unknown  Unknown
libesmf.so         00002ACA988DFA67  nuopc_driver_mp_l     Unknown  Unknown
libesmf.so         00002ACA988FBFFF  nuopc_driver_mp_i     Unknown  Unknown
libesmf.so         00002ACA976838D0  _ZN5ESMCI6FTable1     Unknown  Unknown
libesmf.so         00002ACA9768787B  ESMCI_FTableCallE     Unknown  Unknown
libesmf.so         00002ACA97D9D33A  _ZN5ESMCI3VMK5ent     Unknown  Unknown
libesmf.so         00002ACA97DBD015  _ZN5ESMCI2VM5ente     Unknown  Unknown
libesmf.so         00002ACA97684F8A  c_esmc_ftablecall     Unknown  Unknown
libesmf.so         00002ACA98044C50  esmf_compmod_mp_e     Unknown  Unknown
libesmf.so         00002ACA982A2F11  esmf_gridcompmod_     Unknown  Unknown
libesmf.so         00002ACA988DFA67  nuopc_driver_mp_l     Unknown  Unknown
libesmf.so         00002ACA988FC116  nuopc_driver_mp_i     Unknown  Unknown
libesmf.so         00002ACA98911049  nuopc_driver_mp_i     Unknown  Unknown
libesmf.so         00002ACA976838D0  _ZN5ESMCI6FTable1     Unknown  Unknown
libesmf.so         00002ACA9768787B  ESMCI_FTableCallE     Unknown  Unknown
libesmf.so         00002ACA97D9D33A  _ZN5ESMCI3VMK5ent     Unknown  Unknown
libesmf.so         00002ACA97DBD015  _ZN5ESMCI2VM5ente     Unknown  Unknown
libesmf.so         00002ACA97684F8A  c_esmc_ftablecall     Unknown  Unknown
libesmf.so         00002ACA98044C50  esmf_compmod_mp_e     Unknown  Unknown
libesmf.so         00002ACA982A2F11  esmf_gridcompmod_     Unknown  Unknown
cesm.exe           0000000000432A83  MAIN__                    140  esmApp.F90
cesm.exe           0000000000424AE2  Unknown               Unknown  Unknown
libc-2.17.so       00002ACAA8B04C05  __libc_start_main     Unknown  Unknown
cesm.exe           00000000004249E9  Unknown               Unknown  Unknown
Abort(471451535) on node 28 (rank 28 in comm 0): Fatal error in PMPI_Recv: Other MPI error, error stack:
PMPI_Recv(173).................: MPI_Recv(buf=0x2aac26b65010, count=4521720, MPI_DOUBLE, src=0, tag=17, comm=0xc400012d, status=0x7ffc6b7894b0) failed
MPID_Recv(590).................:
MPIDI_recv_unsafe(205).........:
MPIDI_OFI_handle_cq_error(1042): OFI poll failed (ofi_events.c:1042:MPIDI_OFI_handle_cq_error:Transport endpoint is not connected)

The local system administrators do not know much about CESM and do not have a solution. For reference, I am using Intel compiler version 2021.1 with the matching Intel MPI. Could you kindly provide any recommendations?
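The librxm-fi frames and the "Transport endpoint is not connected" error often point at libfabric provider selection rather than at CESM itself. One thing worth trying before mpirun is pinning the fabric explicitly (a sketch; the right provider depends on your interconnect):

Bash:
export I_MPI_FABRICS=shm:ofi   # explicit fabric selection
export FI_PROVIDER=tcp         # or verbs/psm2/mlx, depending on the network
export I_MPI_DEBUG=5           # verbose startup prints the provider actually used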

Thank you,
Lumos

My .xml files are attached:
 

Attachments

  • config_machines+compilers+batch.xml.txt
    6.6 KB

Lumoss

Member
Hi jedwards,

Excuse me again. I tried to run an FHIST case on our machine, using one node with 28 cores, and it worked.

However, when I request multiple nodes for an FCnudged case, it only runs on one node and eventually reports the aforementioned error. I also tried another machine (70 cores per node) to run the FCnudged case; there is some extra log output, but it still stops at:

Bash:
 Opened existing file
 /home/cesminputdata/atm/cam/chem/trop_mozart/ub/clim_p_trop
 .nc         396
calcsize j,iq,jac, lsfrm,lstoo  1  1  1 139 138
calcsize j,iq,jac, lsfrm,lstoo  1  1  2 139 138
calcsize j,iq,jac, lsfrm,lstoo  1  2  1 163 162
calcsize j,iq,jac, lsfrm,lstoo  1  2  2 163 162
calcsize j,iq,jac, lsfrm,lstoo  1  3  1 166 165
calcsize j,iq,jac, lsfrm,lstoo  1  3  2 166 165
calcsize j,iq,jac, lsfrm,lstoo  1  4  1 168 167
calcsize j,iq,jac, lsfrm,lstoo  1  4  2 168 167
calcsize j,iq,jac, lsfrm,lstoo  1  5  1 170 169
calcsize j,iq,jac, lsfrm,lstoo  1  5  2 170 169
calcsize j,iq,jac, lsfrm,lstoo  1  6  1 172 171
calcsize j,iq,jac, lsfrm,lstoo  1  6  2 172 171
calcsize j,iq,jac, lsfrm,lstoo  1  7  1 174 173
calcsize j,iq,jac, lsfrm,lstoo  1  7  2 174 173
calcsize j,iq,jac, lsfrm,lstoo  1  8  1 127 126
calcsize j,iq,jac, lsfrm,lstoo  1  8  2 127 126
calcsize j,iq,jac, lsfrm,lstoo  1  9  1  78  77
calcsize j,iq,jac, lsfrm,lstoo  1  9  2  78  77
calcsize j,iq,jac, lsfrm,lstoo  2  1  1 138 139
calcsize j,iq,jac, lsfrm,lstoo  2  1  2 138 139
calcsize j,iq,jac, lsfrm,lstoo  2  2  1 162 163
calcsize j,iq,jac, lsfrm,lstoo  2  2  2 162 163
calcsize j,iq,jac, lsfrm,lstoo  2  3  1 165 166
calcsize j,iq,jac, lsfrm,lstoo  2  3  2 165 166
calcsize j,iq,jac, lsfrm,lstoo  2  4  1 167 168
calcsize j,iq,jac, lsfrm,lstoo  2  4  2 167 168
calcsize j,iq,jac, lsfrm,lstoo  2  5  1 169 170
calcsize j,iq,jac, lsfrm,lstoo  2  5  2 169 170
calcsize j,iq,jac, lsfrm,lstoo  2  6  1 171 172
calcsize j,iq,jac, lsfrm,lstoo  2  6  2 171 172
calcsize j,iq,jac, lsfrm,lstoo  2  7  1 173 174
calcsize j,iq,jac, lsfrm,lstoo  2  7  2 173 174
calcsize j,iq,jac, lsfrm,lstoo  2  8  1 126 127
calcsize j,iq,jac, lsfrm,lstoo  2  8  2 126 127
calcsize j,iq,jac, lsfrm,lstoo  2  9  1  77  78
calcsize j,iq,jac, lsfrm,lstoo  2  9  2  77  78

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   RANK 0 PID 99322 RUNNING AT pam2
=   KILLED BY SIGNAL: 9 (Killed)
===================================================================================

Do you have any other suggestions? Thank you so much!

All the best,
Lumos.
 

jedwards

CSEG and Liaisons
Staff member
Can you run a simple MPI 'hello world' program on multiple nodes? You should probably discuss this with your system admin support staff (if you have one).
Also, FCnudged is one of the more complicated compsets; you should try QPC6 with the same grid and see if that works.
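Creating the simpler test case follows the usual CIME workflow; a sketch (the case path is illustrative):

Bash:
cd cime/scripts
./create_newcase --case ~/cases/qpc6_test --compset QPC6 \
    --res ne0CONUSne30x8_ne0CONUSne30x8_mt12 --run-unsupported
cd ~/cases/qpc6_test
./case.setup && ./case.build && ./case.submit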
 

Lumoss

Member
Hi jedwards,

Thank you very much for your prompt reply. I am communicating with my system admin support staff about this issue, but there is no answer yet. Meanwhile, I'm trying to run a simple MPI 'hello world' program on multiple nodes, but the nodes are busy and I don't have results yet.

In addition, I tried QPC6 with the f09_f09_mg17 grid, which runs on a single core.

When I try QPC6 with the ne0CONUSne30x8_ne0CONUSne30x8_mt12 grid, I get the error "negative moist layer thickness. timestep or remap time too large". I found the same question in the forum ("The error named "negative moist layer thickness. timestep or remap time too large" in CESM2.2.0"), but it does not seem to have been answered.
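For context, this error usually means the dynamics/vertical-remap timestep is too long for the refined part of the grid; the common mitigation is a higher coupling frequency and/or a shorter SE timestep, along these lines (values illustrative, not tuned; se_tstep availability depends on the CAM version):

Bash:
./xmlchange ATM_NCPL=96               # more atmosphere coupling steps per day
echo "se_tstep = 75" >> user_nl_cam   # shorter SE dynamics timestep, in seconds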

I will follow up on this issue. Thank you again for your reply and suggestions.

Best regards,
Lumos.
 

Lumoss

Member
I can now run a simple MPI 'hello world' program on multiple nodes on my machine.

I tried updating to cesm2_3_alpha016g, but when I ran FCnudged it required additional new input data, and there were two files I couldn't find. Could you help me?

Bash:
Model ctsm missing file fsurdat = '/cesminputdata/lnd/clm2/surfdata_esmf/ctsm5.2.0/surfdata_ne0np4CONUS.ne30X8_hist_78pfts_CMIP6_1850_c230517.nc'
Model ctsm missing file flanduse_timeseries = '/cesminputdata/lnd/clm2/surfdata_esmf/ctsm5.2.0/landuse.timeseries_ne0np4CONUS.ne30x8_SSP5-8.5_78_CMIP6_1850-2100_c230530.nc'

Best regards,
Lumos.
 

jedwards

CSEG and Liaisons
Staff member
These files have been added to the repository; you should be able to get them now. But it looks like you are trying to begin with one of the most complicated CESM configurations. I recommend that you start with simple cases and work your way up; if you don't have QPC6 working yet, there is little chance that FCnudged will work.
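If anything is still missing, CIME can check and fetch input data from the case directory; a minimal sketch:

Bash:
# Reports missing input files and downloads them from the inputdata server.
./check_input_data --download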
 

Lumoss

Member
Thank you again for your reply. I found these two files, but one of them is very large and the download is very slow. Is there a faster and easier way to get the input data?

In addition, thanks to your advice I have successfully run QPC6 with the f09_f09_mg17 grid on two nodes, so I am starting to try FCnudged in the new version of CESM.

Thank you again for your reply and suggestions.

Best regards,
Lumos.
 

Lumoss

Member
I tried to compile netcdf 4.7.4 using mpiicc together with pnetcdf 1.12.3 (which I had not used before), but with cesm2_3_alpha016g I got an error when building PIO:

Bash:
/public/home/cesm2_3_alpha016g/libraries/parallelio/src/flib/pio.F90(90): error #6580: Name in only-list does not exist or is not accessible.   [PIO_INQ_VAR_FILTER_IDS]
       PIO_inq_var_filter_ids   , &
-------^
/public/home/cesm2_3_alpha016g/libraries/parallelio/src/flib/pio.F90(91): error #6580: Name in only-list does not exist or is not accessible.   [PIO_INQ_VAR_FILTER_INFO]
       PIO_inq_var_filter_info  , &
-------^
/public/home/cesm2_3_alpha016g/libraries/parallelio/src/flib/pio.F90(92): error #6580: Name in only-list does not exist or is not accessible.   [PIO_INQ_FILTER_AVAIL]
       PIO_inq_filter_avail     , &
-------^
/public/home/cesm2_3_alpha016g/libraries/parallelio/src/flib/pio.F90(93): error #6580: Name in only-list does not exist or is not accessible.   [PIO_DEF_VAR_SZIP]
       PIO_def_var_szip, &
-------^
compilation aborted for /public/home/cesm2_3_alpha016g/libraries/parallelio/src/flib/pio.F90 (code 1)
make[2]: *** [src/flib/CMakeFiles/piof.dir/pio.F90.o] Error 1
make[1]: *** [src/flib/CMakeFiles/piof.dir/all] Error 2
make: *** [all] Error 2

This problem does not appear in the older cesm2_3_beta08. Do you have any suggestions?

Best regards,
Lumos.
 

jedwards

CSEG and Liaisons
Staff member
Please try again after updating netcdf to version 4.9.2; PIO was supposed to be backward compatible, but it turns out not to be.
Sorry for the issues.
 

Lumoss

Member
Thank you for your reply.

If I want to use cesm2_3_alpha016g, what versions of netcdf-c, netcdf-fortran, hdf5, ESMF, etc., should I use? Is there a specific recommended version of each? What is required?

Thanks again.

Best regards,
Lumos.

For reference, the CESM documentation page says: "The following are the external system and software requirements for installing and running CESM2."

 

jedwards

CSEG and Liaisons
Staff member
Those are pretty old requirements - I'll try to get that web page updated.
python >= 3.7
netcdf-c 4.9.2, netcdf-fortran 4.6.1
hdf5 1.12.3 or newer
esmf 8.6.0
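A quick way to verify an installed stack against this list (a sketch; it assumes the usual config helpers are on PATH and that ESMFMKFILE points at your esmf.mk):

Bash:
python3 --version
nc-config --version                      # netcdf-c
nf-config --version                      # netcdf-fortran
h5cc -showconfig | grep 'HDF5 Version'   # hdf5
grep ESMF_VERSION_STRING "$ESMFMKFILE"   # esmf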
 