gchiodo@fis_ucm_es
Member
Dear all,
We were able to build CESM1.0.2 on our machine with the IFORT compiler and run it for a few years with the "B1850WCN" compset on 256 cores in pure MPI parallelism (16 nodes with 16 cores each).
Unfortunately, the model crashed at the end of the third year.
At first sight it seemed to be a problem related to memory usage. A closer check of the memory usage on each node revealed that one of the nodes (which was only partially used, since only 2 of its 16 cores were assigned to the run) was overloaded. The high memory load on that node may imply that only one node handles the I/O for the coupling between all components.
We first thought that using complete nodes (16 cores on each node used; 16*16 = 256 cores) might solve that memory problem, so our second try was to run on complete nodes only, but we experienced the same problem again: the model crashes at the end of the third year.
Every time we launch a job we request 112 GB of RAM on each node (7.5 GB of RAM per core, which is close to the physical limit of 8 GB per core, i.e. 128 GB per node).
Again, even with full nodes, it looks like a memory problem: one node is overloaded and appears to hit its limit when the model crashes, while the other nodes apparently use only 1% of the assigned RAM.
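For reference, this is the kind of check we run to see the per-node memory usage while the job is active. It is only a minimal sketch: it assumes passwordless ssh to the compute nodes and a node-list file (the PBS_NODEFILE variable below is just an example name; our scheduler may expose the node list differently).

#!/usr/bin/env python
# Sketch only: snapshot the memory in use on every node of the job,
# to see which node is filling up. Assumes passwordless ssh to the nodes
# and a node-list file (PBS_NODEFILE is just an example name).
import os
import subprocess

nodefile = os.environ.get("PBS_NODEFILE", "nodes.txt")
with open(nodefile) as f:
    nodes = sorted(set(line.strip() for line in f if line.strip()))

for node in nodes:
    # 'free -m' reports total/used/free memory in MB on that node
    out = subprocess.check_output(["ssh", node, "free", "-m"]).decode()
    used_mb = out.splitlines()[1].split()[2]   # "Mem:" line, 'used' column
    print("%s: %s MB used" % (node, used_mb))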
The error in the log file reads as follows:
Model did not complete - see /sfs/home/uvi/fa/waccm/test-waccm4//B1850WCN_256pes/run/cpl.log.110322-181413
tail -f run/ccsm.log.110322-181413
filew failed, worst i, j, qtmp, q = 1 86
-5.328317601894912E-203 -4.040628976044413E-204
filew failed, worst i, j, qtmp, q = 1 86
-6.372875223089407E-203 -5.310499314414365E-204
d_p_coupling: failed to allocate cbuffer; error = 41
ENDRUN: called without a message string
[cli_219]: aborting job:
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 219
rank 219 in job 1 cn041.null_50393 caused collective abort of all ranks
exit status of rank 219: return code 1
The tail of the cpl.log file reads as follows (it reports no error, but it gives a hint about the memory usage):
memory_write: model date = 51229 0 memory = 7385.59 MB (highwater) 11565.47 MB (usage) (pe= 0 comps= cpl ocn atm lnd ice glc)
tStamp_write: model date = 51230 0 wall clock = 2011-03-25 10:53:37 avg dt = 212.49 dt = 197.28
memory_write: model date = 51230 0 memory = 7391.44 MB (highwater) 11571.32 MB (usage) (pe= 0 comps= cpl ocn atm lnd ice glc)
I checked the memory usage value since the beginning of the simulation in the same log file. It starts at 989.00 MB on the first model day and increases linearly by about 6 MB per model day, until it reaches 7391.44 MB, which is very close to the physical limit of the RAM available to a single core on this machine. At roughly 6 MB per model day, about three simulated years (~1100 days) are enough to grow from ~1 GB to the ~7.5 GB per-core limit, which matches the point at which our runs crash. How is it possible that the memory usage increases every model day? It looks as if the coupler is calculating (or retrieving) a wrong amount of memory to be assigned by the machine after each model day. If the aim of this memory diagnostic in the coupler is really to report the amount of memory needed by the model, then it may well be that, for some strange reason, it is calculating a biased value. As soon as this value hits the physical limit of the machine, the model crashes...
Could this be a "bug" of the coupled model?
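In case it helps, below is a small script (just an illustration, not a CESM tool) that pulls the memory_write highwater values out of cpl.log and estimates the growth rate and the number of model days left before the ~7.5 GB per-core limit quoted above is hit. The file name is only an example, and it assumes one memory_write line per model day.

#!/usr/bin/env python
# Sketch: parse the "memory_write" lines of cpl.log, estimate the per-day
# growth of the highwater memory, and extrapolate when it reaches the
# ~7.5 GB per-core limit. File name and limit are examples, not fixed values.
import re

pattern = re.compile(r"memory_write: model date =\s*(\d+)\s+\d+\s+"
                     r"memory =\s*([\d.]+) MB \(highwater\)")

dates, highwater = [], []
with open("cpl.log.110322-181413") as f:
    for line in f:
        m = pattern.search(line)
        if m:
            dates.append(int(m.group(1)))        # model date, e.g. 51230
            highwater.append(float(m.group(2)))  # highwater memory in MB

if len(highwater) > 1:
    ndays = len(highwater)                       # assumes one line per model day
    rate = (highwater[-1] - highwater[0]) / (ndays - 1)
    limit_mb = 7.5 * 1024.0                      # ~7.5 GB per core
    if rate > 0:
        days_left = (limit_mb - highwater[-1]) / rate
        print("growth rate: %.2f MB/model day" % rate)
        print("model days until %.0f MB: %.0f" % (limit_mb, days_left))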
The machine we use is the FinisTerrae: http://www.cesga.es/content/view/917/115/lang,en/
142 HP Integrity rx7640 nodes with 16 Itanium Montvale cores and 128 GB of memory each.
390,000 GB of disk.
2,200,000 GB on robotized tape.
An InfiniBand 4x DDR network interconnecting the nodes at 20 Gbps.
Many thanks!