gchiodo@fis_ucm_es
Member
Dear all,
We were able to build CESM1.0.2 on our machine with the IFORT compiler and run it for a few years with the "B1850WCN" compset on 256 cores in pure MPI parallelism (16 nodes with 16 cores each).
Unfortunately, the model crashed at the end of the third year.
At first sight it seemed to be a problem related to memory usage. A closer check of the memory usage on each node revealed that one of the nodes (which was only partially used, since only 2 of its 16 cores were assigned to the run) was overloaded. The high memory load on that node may imply that only one node handles the I/O for the coupling between all components.
We first thought that using complete nodes (16 cores on each node used; 16*16 = 256 cores) might solve that memory problem, so our second try was to run on complete nodes only, but we experienced the same problem again: the model crashes at the end of the third year.
Every time we launch a job we request 112 GB of RAM on each node (7.5 GB of RAM per core, which is close to the physical limit of 8 GB per core, i.e. 128 GB per node).
Again, even with full nodes, it looks like a memory problem: one node is overloaded and appears to hit its limit when the model crashes, while the other nodes apparently use only 1% of the assigned RAM.
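For reference, this is the kind of check we run to see the per-node memory usage while the job is active. It is only a minimal sketch: it assumes passwordless ssh to the compute nodes and a node-list file (the PBS_NODEFILE variable below is just an example name; our scheduler may expose the node list differently).

#!/usr/bin/env python
# Sketch only: snapshot the memory in use on every node of the job,
# to see which node is filling up. Assumes passwordless ssh to the nodes
# and a node-list file (PBS_NODEFILE is just an example name).
import os
import subprocess

nodefile = os.environ.get("PBS_NODEFILE", "nodes.txt")
with open(nodefile) as f:
    nodes = sorted(set(line.strip() for line in f if line.strip()))

for node in nodes:
    # 'free -m' reports total/used/free memory in MB on that node
    out = subprocess.check_output(["ssh", node, "free", "-m"]).decode()
    used_mb = out.splitlines()[1].split()[2]   # "Mem:" line, 'used' column
    print("%s: %s MB used" % (node, used_mb))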
The error in the log file reads as follows:
Model did not complete - see /sfs/home/uvi/fa/waccm/test-waccm4//B1850WCN_256pes/run/cpl.log.110322-181413
tail -f run/ccsm.log.110322-181413
filew failed, worst i, j, qtmp, q = 1 86
-5.328317601894912E-203 -4.040628976044413E-204
filew failed, worst i, j, qtmp, q = 1 86
-6.372875223089407E-203 -5.310499314414365E-204
d_p_coupling: failed to allocate cbuffer; error = 41
ENDRUN: called without a message string
[cli_219]: aborting job:
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 219
rank 219 in job 1 cn041.null_50393 caused collective abort of all ranks
exit status of rank 219: return code 1
The tail of the cpl.log file reads as follows (it reports no error, but it gives a hint about the memory usage):
memory_write: model date = 51229 0 memory = 7385.59 MB (highwater) 11565.47 MB (usage) (pe= 0 comps= cpl ocn atm lnd ice glc)
tStamp_write: model date = 51230 0 wall clock = 2011-03-25 10:53:37 avg dt = 212.49 dt = 197.28
memory_write: model date = 51230 0 memory = 7391.44 MB (highwater) 11571.32 MB (usage) (pe= 0 comps= cpl ocn atm lnd ice glc)
I checked the memory usage value since the beginning of the simulation in the same log file. It starts at 989.00 MB on the first model day and increases linearly by about 6 MB per model day, until it reaches 7391.44 MB, which is very close to the physical limit of the RAM available to a single core on this machine. At roughly 6 MB per model day, about three simulated years (~1100 days) are enough to grow from ~1 GB to the ~7.5 GB per-core limit, which matches the point at which our runs crash. How is it possible that the memory usage increases every model day? It looks as if the coupler is calculating (or retrieving) a wrong amount of memory to be assigned by the machine after each model day. If the aim of this memory diagnostic in the coupler is really to report the amount of memory needed by the model, then it may well be that, for some strange reason, it is calculating a biased value. As soon as this value hits the physical limit of the machine, the model crashes...
Could this be a "bug" of the coupled model?
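In case it helps, below is a small script (just an illustration, not a CESM tool) that pulls the memory_write highwater values out of cpl.log and estimates the growth rate and the number of model days left before the ~7.5 GB per-core limit quoted above is hit. The file name is only an example, and it assumes one memory_write line per model day.

#!/usr/bin/env python
# Sketch: parse the "memory_write" lines of cpl.log, estimate the per-day
# growth of the highwater memory, and extrapolate when it reaches the
# ~7.5 GB per-core limit. File name and limit are examples, not fixed values.
import re

pattern = re.compile(r"memory_write: model date =\s*(\d+)\s+\d+\s+"
                     r"memory =\s*([\d.]+) MB \(highwater\)")

dates, highwater = [], []
with open("cpl.log.110322-181413") as f:
    for line in f:
        m = pattern.search(line)
        if m:
            dates.append(int(m.group(1)))        # model date, e.g. 51230
            highwater.append(float(m.group(2)))  # highwater memory in MB

if len(highwater) > 1:
    ndays = len(highwater)                       # assumes one line per model day
    rate = (highwater[-1] - highwater[0]) / (ndays - 1)
    limit_mb = 7.5 * 1024.0                      # ~7.5 GB per core
    if rate > 0:
        days_left = (limit_mb - highwater[-1]) / rate
        print("growth rate: %.2f MB/model day" % rate)
        print("model days until %.0f MB: %.0f" % (limit_mb, days_left))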
The machine we use is the FinisTerrae: http://www.cesga.es/content/view/917/115/lang,en/
142 HP Integrity rx7640 nodes with 16 Itanium Montvale cores and 128 GB of memory each.
390,000 GB of disk.
2,200,000 GB on robotized tape.
An InfiniBand 4x DDR network interconnecting the nodes at 20 Gbps.
Many thanks!