Runtime error on multiple nodes with CESM2.1.1

Hi:

I am porting CESM2.1.1 to my university's cluster, and everything has gone well so far.

When I run a test case for F2000climo with f09_f09_mg17, using the default settings on one node, it runs fine.

However, when I use more than one node, the run crashes within the first several simulated days. The attached file is from a two-node run; it crashes even earlier with more nodes.

The cluster has 80 CPUs per node (2 nodes = 160 CPUs), and I am using the Intel compiler.

Best,
Z-Q

Opening file test_160.cism.initial_hist.0001-01-01-00000.nc for output;
Write output at start of run and every 1.00000000000000 years
Creating variables internal_time, time, and tstep_count
Creating variable level
Creating variable lithoz
Creating variable staglevel
Creating variable stagwbndlevel
Creating variable x0
Creating variable x1
Creating variable y0
Creating variable y1
Creating variable artm
Creating variable smb
Creating variable thk
Creating variable topg
Creating variable usurf
Writing to file test_160.cism.initial_hist.0001-01-01-00000.nc at time 0.000000000000000E+000
MCT::m_Router::initp_: GSMap indices not increasing...Will correct
MCT::m_Router::initp_: RGSMap indices not increasing...Will correct
MCT::m_Router::initp_: RGSMap indices not increasing...Will correct
MCT::m_Router::initp_: GSMap indices not increasing...Will correct
calcsize j,iq,jac, lsfrm,lstoo 1 1 1 26 21
calcsize j,iq,jac, lsfrm,lstoo 1 1 2 26 21
calcsize j,iq,jac, lsfrm,lstoo 1 2 1 22 15
calcsize j,iq,jac, lsfrm,lstoo 1 2 2 22 15
calcsize j,iq,jac, lsfrm,lstoo 1 3 1 24 17
calcsize j,iq,jac, lsfrm,lstoo 1 3 2 24 17
calcsize j,iq,jac, lsfrm,lstoo 1 4 1 25 20
calcsize j,iq,jac, lsfrm,lstoo 1 4 2 25 20
calcsize j,iq,jac, lsfrm,lstoo 1 5 1 23 19
calcsize j,iq,jac, lsfrm,lstoo 1 5 2 23 19
calcsize j,iq,jac, lsfrm,lstoo 2 1 1 21 26
calcsize j,iq,jac, lsfrm,lstoo 2 1 2 21 26
calcsize j,iq,jac, lsfrm,lstoo 2 2 1 15 22
calcsize j,iq,jac, lsfrm,lstoo 2 2 2 15 22
calcsize j,iq,jac, lsfrm,lstoo 2 3 1 17 24
calcsize j,iq,jac, lsfrm,lstoo 2 3 2 17 24
calcsize j,iq,jac, lsfrm,lstoo 2 4 1 20 25
calcsize j,iq,jac, lsfrm,lstoo 2 4 2 20 25
calcsize j,iq,jac, lsfrm,lstoo 2 5 1 19 23
calcsize j,iq,jac, lsfrm,lstoo 2 5 2 19 23
max rss=549.1 MB
max rss=549.1 MB
max rss=549.1 MB
max rss=549.1 MB
max rss=549.1 MB
max rss=549.1 MB
max rss=549.1 MB
max rss=549.1 MB
max rss=549.1 MB
max rss=549.1 MB
max rss=549.1 MB
max rss=549.1 MB
max rss=549.1 MB
max rss=584.3 MB

cesm.exe:121149 terminated with signal 11 at PC=b75ee3 SP=7ffd66cda3a0. Backtrace:
./cesm.exe[0xb75ee3]
./cesm.exe[0xb46943]
./cesm.exe[0xb39915]
./cesm.exe[0x74cb67]
./cesm.exe[0x7312dd]
./cesm.exe[0x6e6a05]
./cesm.exe[0x6df962]
./cesm.exe[0x4fef7c]
./cesm.exe[0x4efec0]
./cesm.exe[0x4322ca]
./cesm.exe[0x4192ee]
./cesm.exe[0x431f6d]
./cesm.exe[0x41569e]
/lib64/libc.so.6(__libc_start_main+0xf5)[0x2ba26a9813d5]
./cesm.exe[0x4155a9]
 

Attachments

  • runlog.txt · 83.6 KB
  • cpl.txt · 76.8 KB

Bill Sacks (sacks) · CSEG and Liaisons · Staff member
A few suggestions of things to check:

(1) Try rebuilding in DEBUG mode. The easiest way to do this is to create a brand-new case, then run ./xmlchange DEBUG=TRUE before running ./case.build. (A sketch of the full workflow follows after this list.)

(2) Have you checked the log files from the other components to see if they might have an error message near the end? (See the log-scanning sketch after this list.)

(3) The tracebacks at the end of runlog.txt are hex values rather than file / line numbers. On some systems you can convert these hex values to something more meaningful with the addr2line command-line utility, if it is installed on your system. Usage is addr2line -e ../bld/cesm.exe 0xb75ee3, where you should replace ../bld/cesm.exe with the path to your built executable (that path assumes you are in the run directory). Try that with a number of the different hex values given; sometimes it works on some of the items in the backtrace but not others. However, I would only try this after (1) and (2). (A loop over several addresses is sketched below.)
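
For (1), a minimal sketch of that workflow with the compset and resolution from this thread (the case path and machine name are placeholders for your setup):

# Run from cime/scripts in the CESM2.1.1 source tree
./create_newcase --case ~/cases/f2000_debug --compset F2000climo --res f09_f09_mg17 --machine mymachine
cd ~/cases/f2000_debug
./case.setup
./xmlchange DEBUG=TRUE    # debug build: typically -O0 plus run-time checks
./case.build
./case.submit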
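
For (2), from the run directory you can scan the tail of each component's log for the first error message (log file names carry timestamps, so the exact names will differ):

# In the case run directory
for f in cesm.log.* cpl.log.* atm.log.* lnd.log.* ice.log.* ocn.log.*; do
  echo "=== $f ==="
  tail -n 30 "$f"
done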
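
For (3), looping over several of the addresses from the backtrace above (again assuming you are in the run directory):

# -f also prints the enclosing function name; -e names the executable
for addr in 0xb75ee3 0xb46943 0xb39915 0x74cb67; do
  addr2line -f -e ../bld/cesm.exe "$addr"
done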
 
Hi sacks,

Thank you very much for your reply.
1. I have tried creating a new case with DEBUG=TRUE; the error appears as:
forrtl: error (65): floating invalid

Image PC Routine Line Source
cesm.exe 00000000094F613E for__signal_handl Unknown Unknown
libpthread-2.17.s 00002B935A9585D0 Unknown Unknown Unknown
cesm.exe 0000000002D9EC7F Unknown Unknown Unknown

2. Then I changed the compiler from mpicc to mpiicc and the flags (see Macros.make for more detail):
FFLAGS = -O0 -ip -m64 -convert big_endian -assume byterecl -mcmodel=medium
CFLAGS = -O0 -ip -m64 -xHost -mcmodel=medium

This time it works on multiple nodes; however, it crashes after one or two simulated months with a similar error as before:
cesm.exe:258439 terminated with signal 11 at PC=19380ba SP=7ffe6a447500. Backtrace:

3. I also tried B1850; it behaves the same as F2000climo and FHIST.
 

Attachments

  • Macrosmake.txt · 2.1 KB

Bill Sacks (sacks) · CSEG and Liaisons · Staff member
Can you please attach all of the log files from (1) (the run with DEBUG=TRUE)?
 
Sorry, I just tried rebuilding the libraries and the model again, and the result turns out to be the same. Please find the logs with DEBUG=TRUE attached.
 

Attachments

  • logs.zip · 122.7 KB

Thank you very much for your reply. I have rebuilt the NetCDF library (previously installed by our administrator) with the Intel compiler myself.

It now works on multiple nodes, with two months of output so far, so the compiler mismatch may have been the cause. However, the speed is not great: at 1-degree resolution it takes almost 3 minutes per simulated day on 320 PEs with mpiicc, nearly identical to a single node with 80 PEs and gcc.
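
For anyone repeating this, a rough sketch of rebuilding NetCDF with the Intel compilers (the install prefix is a placeholder, and netcdf-fortran must be built against the freshly built netcdf-c):

# netcdf-c
CC=icc ./configure --prefix=$HOME/netcdf-intel && make && make install
# netcdf-fortran, pointed at the netcdf-c installation above
CC=icc FC=ifort CPPFLAGS=-I$HOME/netcdf-intel/include LDFLAGS=-L$HOME/netcdf-intel/lib \
    ./configure --prefix=$HOME/netcdf-intel
make && make install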

When I changed the flags from -O0 to -O2 or -O1, the following error appeared. Do you have any idea about this?

Many thanks and Merry Christmas.

ifeq ($(DEBUG),FALSE)
  FFLAGS := $(FFLAGS) -O1 -ip -m64 -convert big_endian -assume byterecl -mcmodel=medium
  CFLAGS := $(CFLAGS) -O1 -ip -m64 -xHost -mcmodel=medium
endif

SHR_REPROSUM_CALC: Input contains 0.34656E+05 NaNs and 0.00000E+00 INFs on process 315
ERROR: shr_reprosum_calc ERROR: NaNs or INFs in input
SHR_REPROSUM_CALC: Input contains 0.35264E+05 NaNs and 0.00000E+00 INFs on process 316
ERROR: shr_reprosum_calc ERROR: NaNs or INFs in input
SHR_REPROSUM_CALC: Input contains 0.35264E+05 NaNs and 0.00000E+00 INFs on process 317
ERROR: shr_reprosum_calc ERROR: NaNs or INFs in input
SHR_REPROSUM_CALC: Input contains 0.35264E+05 NaNs and 0.00000E+00 INFs on process 318
ERROR: shr_reprosum_calc ERROR: NaNs or INFs in input
SHR_REPROSUM_CALC: Input contains 0.34656E+05 NaNs and 0.00000E+00 INFs on process 319
ERROR: shr_reprosum_calc ERROR: NaNs or INFs in input
Image PC Routine Line Source
cesm.exe 000000000229F6F6 tracebackqq_ Unknown Unknown
cesm.exe 0000000001EBEE9E Unknown Unknown Unknown
cesm.exe 0000000001FEE7F8 Unknown Unknown Unknown
cesm.exe 00000000010D00A4 Unknown Unknown Unknown
cesm.exe 0000000000F3E0CB Unknown Unknown Unknown
cesm.exe 0000000000DAADB4 Unknown Unknown Unknown
cesm.exe 00000000004CA4A8 Unknown Unknown Unknown
cesm.exe 00000000004C2B23 Unknown Unknown Unknown
cesm.exe 0000000000433E65 Unknown Unknown Unknown
cesm.exe 000000000041B8E0 Unknown Unknown Unknown
cesm.exe 00000000004339CE Unknown Unknown Unknown
cesm.exe 0000000000414F9E Unknown Unknown Unknown
libc-2.17.so 00002AD12A8533D5 __libc_start_main Unknown Unknown
cesm.exe 0000000000414EA9 Unknown Unknown Unknown
application called MPI_Abort(MPI_COMM_WORLD, 1001) - process 318
 

Attachments

  • logs.zip · 97.1 KB

Bill Sacks (sacks) · CSEG and Liaisons · Staff member
I'm glad to hear you're getting further. Yes, I would expect slow speeds when compiling with -O0; I only suggested DEBUG mode for the sake of getting additional information on the error.

This is getting to be a hard issue for me to debug remotely: the error that you're seeing now is just a symptom of the problem rather than the root cause.

What version of the Intel compiler are you using (ifort --version)? I'd like to check whether it's a version that we currently test with. Also, if you have a different compiler available (e.g., gfortran), you could try that, though the performance may not be as good.
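
For example (mpiifort here assumes Intel MPI; with Open MPI the Fortran wrapper is typically mpif90):

ifort --version      # bare compiler version
mpiifort --version   # version seen through the MPI wrapper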

Also, can you please give the details of how you are setting up this case, starting with the create_newcase command and including any modifications you made to the case before building / running? Please also note any changes you have made to the code base (i.e., any files you changed after downloading CESM), and if you have created your own versions of config_compilers.xml or config_machines.xml (or any other files), please attach those. To be honest, I'm not sure how helpful this information will be in solving this problem, but it would help us to know whether you're running something very standard or something else.
 
Hi Bill,

Thank you for all your help. I tried your suggestions several times, and the problem was still there.

Then I decided to reinstall the software and the related NetCDF library myself, using the Intel 18.0.5 and Open MPI 4 that our administrator had installed before. It finally works, so I am actually not sure why it didn't work earlier; maybe it was a problem with the previous software installation.

Thanks a lot and please delete the current thread.
 

Neil Tandon (ntandon) · Member
I was experiencing the same type of error with CESM 2.1.3. In my case, switching from intelmpi/2018.3.222, netcdf-fortran-mpi/4.4.4, and netcdf-mpi/4.4.1.1 to openmpi/3.1.2, netcdf-fortran-mpi/4.5.1, and netcdf-mpi/4.6.1 resolved the problem.
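
On a cluster using environment modules, that switch might look like the following sketch (module names as listed above; availability varies by system), followed by a clean rebuild of the case:

module unload netcdf-fortran-mpi/4.4.4 netcdf-mpi/4.4.1.1 intelmpi/2018.3.222
module load openmpi/3.1.2 netcdf-mpi/4.6.1 netcdf-fortran-mpi/4.5.1
./case.build --clean-all && ./case.build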
 