Signal error of compset B_1850-2000_CN

yaozhixiong@foxmail_com · May 9, 2014

Hi,

I am using CESM1_2_0. Compset CIAF can run for several decades on our cluster, but I ran into a problem when I created a case with compset B_1850-2000_CN.

create_newcase -case ../caseoutput/b20trcn_f19g16 -compset B_1850-2000_CN -res f19_g16 -mach szhpc

I have run several such cases before. In all of them, output files stopped being written after several simulated years (fewer than 10), while the job status still showed as running. There are no errors in any log except cesm.log. Several error messages from cesm.log are shown below. The cesm.log is also attached.

******
[gk0345:27622] *** Process received signal ***
[gk0345:27622] Signal: Bus error (7)
[gk0345:27622] Signal code: (128)
[gk0345:27622] Failing at address: (nil)
[gk0345:27623] *** Process received signal ***
[gk0345:27623] Signal: Bus error (7)
[gk0345:27623] Signal code: (128)
[gk0345:27623] Failing at address: (nil)
[gk0345:27624] *** Process received signal ***
[gk0345:27624] Signal: Bus error (7)
[gk0345:27624] Signal code: (128)
[gk0345:27624] Failing at address: (nil)
*****

The cluster manager does not know why this problem happened.
Thanks,

yao

jedwards · May 9, 2014

I see in the log that these files, for example

Opened file ./b20trcn_f19g16.rtm.h0.1958-03.nc to write      589824
Opened file ./b20trcn_f19g16.clm2.h0.1958-03.nc to write      589824
Opened file b20trcn_f19g16.cam.h0.1958-03.nc to write      589824

were written. The only thing I can gather from the cesm.log is that the node gk0345 stopped responding.

yaozhixiong@foxmail_com · May 10, 2014

Thanks jedwards.

Both compset B and B_1850-2000_CN hit the same error after running for several years. Why does the node always stop when I run the coupled model, while the ocean model runs fine? The cluster manager does not know the reason either. This did not happen on another cluster that I used before. Can you tell me how to fix it, or how to find the cause?

Best regards,
yao

jedwards · May 10, 2014

It could be a memory leak - look at the cpl.log output to see if the memory highwater is growing unreasonably.
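One way to check the highwater trend without reading the log by eye is to pull the memory_write lines out of cpl.log with a short script. The following is only a sketch: the regular expression assumes the memory_write line layout quoted in this thread, and the function names parse_memory and highwater_growth are made up for illustration, not part of CESM.

```python
import re

# Assumed layout, taken from the memory_write lines quoted in this thread:
#   memory_write: model date = YYYYMMDD SSSSS memory = X MB (highwater) Y MB (usage) ...
MEMORY_LINE = re.compile(
    r"memory_write: model date =\s*(\d+)\s+\d+\s+"
    r"memory =\s*([\d.]+)\s+MB \(highwater\)\s+"
    r"([\d.]+)\s+MB \(usage\)"
)


def parse_memory(text):
    """Return a list of (model_date, highwater_mb, usage_mb) tuples."""
    return [(int(d), float(hw), float(us)) for d, hw, us in MEMORY_LINE.findall(text)]


def highwater_growth(records):
    """Ratio of the last highwater value to the first, or None if unavailable."""
    if not records or records[0][1] == 0.0:
        return None
    return records[-1][1] / records[0][1]


# Sample text built from values quoted in this thread (component list shortened).
sample = """
memory_write: model date = 19580102 0 memory = 335.77 MB (highwater) 0.00 MB (usage) (pe= 0 comps= cpl ATM)
memory_write: model date = 19580201 0 memory = 1528.53 MB (highwater) 0.00 MB (usage) (pe= 0 comps= cpl ATM)
memory_write: model date = 19610624 0 memory = 1916.74 MB (highwater) 0.00 MB (usage) (pe= 0 comps= cpl ATM)
"""

records = parse_memory(sample)
for date, highwater, usage in records:
    print(date, highwater, usage)
print("growth factor:", highwater_growth(records))
```

To use it on a real run, read the uncompressed cpl.log file into a string and pass it to parse_memory. A growth factor that keeps climbing over simulated years is the pattern to look for.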

yaozhixiong@foxmail_com · May 13, 2014

The cesm.log uploaded before was from a continuation run. Let us call it CASE A. It hit the same problem, which may be related to a "memory leak".

CASE A: 24 nodes and 12 processors per node (total 240 tasks)
cpl.log_.140508-110728.zip

I also created an identical case with a different number of tasks.

CASE B: 5 nodes and 12 processors per node (total 60 tasks), Time: 1850-1859
cpl.log_.140509-192020.gz

Memory usage in cpl.log of CASE A is shown below.

memory_write: model date = 19580102 0 memory = 335.77 MB (highwater) 0.00 MB (usage) (pe= 0 comps= cpl ATM LND OCN ICE GLC ROF WAV)
memory_write: model date = 19580201 0 memory = 1528.53 MB (highwater) 0.00 MB (usage) (pe= 0 comps= cpl ATM LND OCN ICE GLC ROF WAV)
memory_write: model date = 19580301 0 memory = 1679.62 MB (highwater) 0.00 MB (usage) (pe= 0 comps= cpl ATM LND OCN ICE GLC ROF WAV)
...
memory_write: model date = 19610624 0 memory = 1916.74 MB (highwater) 0.00 MB (usage) (pe= 0 comps= cpl ATM LND OCN ICE GLC ROF WAV)

Memory usage in cpl.log of CASE B is shown below.

tStamp_write: model date = 18500102 0 wall clock = 2014-05-09 19:23:02 avg dt = 27.13 dt = 27.13
memory_write: model date = 18500102 0 memory = 405.48 MB (highwater) 0.00 MB (usage) (pe= 0 comps= cpl ATM LND OCN ICE GLC ROF WAV)
memory_write: model date = 18500201 0 memory = 610.96 MB (highwater) 0.00 MB (usage) (pe= 0 comps= cpl ATM LND OCN ICE GLC ROF WAV)
memory_write: model date = 18500301 0 memory = 614.76 MB (highwater) 0.00 MB (usage) (pe= 0 comps= cpl ATM LND OCN ICE GLC ROF WAV)
....
memory_write: model date = 18591231 0 memory = 645.38 MB (highwater) 0.00 MB (usage) (pe= 0 comps= cpl ATM LND OCN ICE GLC ROF WAV)

The memory highwater of CASE A increased quickly from the first month to the second, and was close to 2 GB when the case crashed. For CASE B, it was only about 645 MB at the end of the run. Each node of the cluster has 24 GB of memory, so there is about 2.0 GB per processor.

Question 1: Does the "memory peak" happen because of the different number of tasks? Why?

Question 2: What is the meaning of "usage" and "pe= 0 comps= cpl ATM LND OCN ICE GLC ROF WAV"? The usage is 0.00 MB in my case, but it is not zero in some other people's cases. I think "comps" means "components". What does "pe" mean? Why is it equal to 0?

memory_write: model date = 18500131 0 memory = 450.47 MB (highwater) 0.00 MB (usage) (pe= 0 comps= cpl ATM LND OCN ICE GLC ROF WAV)
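The per-processor arithmetic in this post can be written out as a small sketch. It assumes that the 24 G per node means 24 x 1024 MB, that memory is shared evenly, and that one task runs on each of the 12 processors; the function names are hypothetical.

```python
def per_task_budget_mb(node_memory_gb, tasks_per_node):
    """Memory available to each task if a node's memory is split evenly (assumption)."""
    return node_memory_gb * 1024.0 / tasks_per_node


def headroom_mb(highwater_mb, node_memory_gb, tasks_per_node):
    """Remaining memory per task after subtracting the reported highwater."""
    return per_task_budget_mb(node_memory_gb, tasks_per_node) - highwater_mb


# Values quoted in this post: 24 GB nodes, 12 processors per node.
print("budget per task (MB):", per_task_budget_mb(24, 12))
print("CASE A headroom (MB):", headroom_mb(1916.74, 24, 12))
print("CASE B headroom (MB):", headroom_mb(645.38, 24, 12))
```

Under these assumptions CASE A has only about 131 MB left per task at its last reported highwater, while CASE B still has well over 1 GB.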

jedwards · May 13, 2014

Both cases indicate a significant memory leak. If your model starts out using 400 MB per task, you should not expect it to exceed 500 MB per task after several years of simulation.

> What is the meaning of "usage" and "pe= 0 comps= cpl ATM LND OCN ICE GLC ROF WAV"?

On some systems we are able to get current memory usage as well as memory high water; on other systems this functionality is not available and we just report 0.
The pe = 0 means that the maximum memory usage is occurring on MPI task 0. The list of comps is just a list of model components active on that task; in this case it's all of them.
Your next step might be to run some simpler compsets and see if you can isolate the source of the leak. You might also try a different compiler if you have one available.

yaozhixiong@foxmail_com · May 16, 2014

Two new cases were created. One is the ocean model with compset CIAF on 120 processors. The other is the coupled model with compset B, also on 120 processors.

CIAF: cpl.log.140514-212549.gz
B: cpl.log.140514-205418.gz

The memory highwater in both log files is small in the first month, but it increases to more than 1000 MB in the second month. The same thing also happens on two other clusters. Over the last several months in cpl.log, the memory highwater does not change for compset B, but it does change for compset CIAF.

I wonder how the memory leak happens. In fact, I think the "memory leak" cannot affect the model run. I can neglect this problem as long as it does not exceed the maximum memory of one processor in the cluster.

jedwards · May 16, 2014

You neglected to attach the log files and I'm confused by your comments - at the start of the thread you said that CIAF was fine but that with B there were problems; now you seem to be saying the opposite. Do all of these clusters have the same OS and compiler versions?

yaozhixiong@foxmail_com · May 16, 2014

Sorry for my carelessness.

When I said at the start of the thread that CIAF was fine, I only meant that the model can run successfully. It also has the "memory leak" problem, although it does not exceed the maximum memory of one processor. Compset B crashed because its memory highwater exceeded the maximum.

Three different clusters were used. I always use the first one.

1. SuSE Linux Enterprise Server (SLES) 11 SP1, INTEL
2. Red Hat 5, PGI
3. CentOS 6.4, INTEL

jedwards · May 16, 2014

Can you please send the README.case from your CIAF case and let me know if you have any source modifications. I would not expect this kind of memory increase at monthly boundaries in the first year and would like to see if I can reproduce your run.
Thanks

yaozhixiong@foxmail_com · May 17, 2014

The README.case of compset CIAF is attached.

dingnan0701@163_com · Jun 5, 2014

Hi, what's the difference between memory high water and memory usage in the cpl log?

jedwards · Jun 5, 2014

Memory highwater is the maximum memory consumed by the application on that task since it began, memory usage is the amount used at the time of the query. Depending on the OS we may get data for one or both, if we report 0 it's because we don't get that data from the OS.
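The distinction can be illustrated with a small sketch. It is not how CESM itself collects these numbers; it assumes a Linux system where /proc/self/status exposes VmHWM (peak resident set size) and VmRSS (current resident set size), and it returns 0 when a field is missing, mirroring the "report 0" behaviour described above. The function names are hypothetical.

```python
import re


def memory_from_status(status_text):
    """Return (highwater_mb, usage_mb) parsed from /proc/<pid>/status-style text.

    A field that is not present is reported as 0.0.
    """
    def field_mb(name):
        match = re.search(rf"^{name}:\s+(\d+)\s+kB", status_text, re.MULTILINE)
        return int(match.group(1)) / 1024.0 if match else 0.0

    return field_mb("VmHWM"), field_mb("VmRSS")


def current_process_memory():
    """Query this process now; (0.0, 0.0) if the OS does not provide the data."""
    try:
        with open("/proc/self/status") as status_file:
            return memory_from_status(status_file.read())
    except OSError:
        return 0.0, 0.0


highwater_mb, usage_mb = current_process_memory()
print(f"memory = {highwater_mb:.2f} MB (highwater) {usage_mb:.2f} MB (usage)")
```

Here "the time of the query" is simply the moment current_process_memory() is called: usage can go up or down between calls, while highwater never decreases during the life of the process.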

dingnan0701@163_com · Jun 6, 2014

Thank you very much. But I do not understand what "the query" stands for in the sentence "memory usage is the amount used at the time of the query". Does the query refer to the operation that collects the memory high water statistics?

dingnan0701@163_com · Jun 6, 2014

And about memory highwater: does the maximum memory consumed equal the size of the memory stack? Does it include the cache?
