Signal error of compset B_1850-2000_CN

yaozhixiong@...
Signal error of compset B_1850-2000_CN

Hi,

I am using CESM1_2_0. Compset CIAF runs for several decades on our cluster, but I ran into a problem with a case created with compset B_1850-2000_CN.

create_newcase -case ../caseoutput/b20trcn_f19g16 -compset B_1850-2000_CN -res f19_g16 -mach szhpc

I have run several such cases. In all of them, no more files were output after a few simulated years (fewer than 10), while the job status still showed as running. There are no errors in any log except cesm.log. Some of the error messages from cesm.log are shown below. The cesm.log is also attached.

******
[gk0345:27622] *** Process received signal ***
[gk0345:27622] Signal: Bus error (7)
[gk0345:27622] Signal code: (128)
[gk0345:27622] Failing at address: (nil)
[gk0345:27623] *** Process received signal ***
[gk0345:27623] Signal: Bus error (7)
[gk0345:27623] Signal code: (128)
[gk0345:27623] Failing at address: (nil)
[gk0345:27624] *** Process received signal ***
[gk0345:27624] Signal: Bus error (7)
[gk0345:27624] Signal code: (128)
[gk0345:27624] Failing at address: (nil)
*****

The cluster manager does not know why this problem happened.
Thanks,

yao

Attachment: 
jedwards

I see in the log that these files for example

 Opened file ./b20trcn_f19g16.rtm.h0.1958-03.nc to write      589824
 Opened file ./b20trcn_f19g16.clm2.h0.1958-03.nc to write      589824
 Opened file b20trcn_f19g16.cam.h0.1958-03.nc to write      589824

were written.

 

The only thing I can gather from the cesm.log is that the node gk0345 stopped responding.

CESM Software Engineer

yaozhixiong@...

Thanks jedwards.

Both compset B and B_1850-2000_CN fail with the same error after running for several years. Why does the node always stop responding when I run the coupled model, while the ocean model runs fine? The cluster manager does not know the reason either. This did not happen on another cluster that I used before. Can you tell me how to fix it, or how to find the cause?

Best regards,

yao

jedwards

It could be a memory leak - look at the cpl.log output to see if the memory highwater is growing unreasonably.
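One way to check this is sketched below, assuming the `memory_write:` line layout quoted elsewhere in this thread. The sample values are the CASE A lines posted in the thread; the function names `parse_highwater` and `growth_ratio` are made up for illustration and are not part of CESM.

```python
import re

# Matches cpl.log lines such as:
#  memory_write: model date = 19580102       0 memory =     335.77 MB (highwater) ...
MEMORY_LINE = re.compile(
    r"memory_write: model date =\s*(\d{8})\s+\d+\s+memory =\s*([\d.]+) MB \(highwater\)"
)


def parse_highwater(lines):
    """Return a list of (model_date, highwater_mb) tuples found in the given lines."""
    records = []
    for line in lines:
        match = MEMORY_LINE.search(line)
        if match:
            records.append((match.group(1), float(match.group(2))))
    return records


def growth_ratio(records):
    """Ratio of the last reported highwater value to the first one."""
    return records[-1][1] / records[0][1]


# Sample lines copied from the CASE A excerpt in this thread (trailing fields shortened).
sample = [
    " memory_write: model date = 19580102       0 memory =     335.77 MB (highwater)          0.00 MB (usage)",
    " memory_write: model date = 19580201       0 memory =    1528.53 MB (highwater)          0.00 MB (usage)",
    " memory_write: model date = 19580301       0 memory =    1679.62 MB (highwater)          0.00 MB (usage)",
    " memory_write: model date = 19610624       0 memory =    1916.74 MB (highwater)          0.00 MB (usage)",
]

records = parse_highwater(sample)
for date, highwater in records:
    print(date, highwater)
print(f"highwater grew by a factor of {growth_ratio(records):.1f}")
```

To use it on a real log, replace `sample` with the lines read from the uncompressed cpl.log file.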

CESM Software Engineer

yaozhixiong@...

The cesm.log I uploaded earlier was from a continuation run. Let's call it CASE A. It hit the same problem, which may be related to a "memory leak".

CASE A:  24 nodes and 12 processors per node (total 240 tasks)

cpl.log_.140508-110728.zip

I also created an identical case with a different number of tasks.

CASE B:  5 nodes and 12 processors per node (total 60 tasks), Time: 1850-1859

cpl.log_.140509-192020.gz

 

Memory usage from the cpl.log of CASE A is shown below.

 memory_write: model date = 19580102       0 memory =     335.77 MB (highwater)          0.00 MB (usage)  (pe=    0 comps= cpl ATM LND OCN ICE GLC ROF WAV)

 memory_write: model date = 19580201       0 memory =    1528.53 MB (highwater)          0.00 MB (usage)  (pe=    0 comps= cpl ATM LND OCN ICE GLC ROF WAV)

 memory_write: model date = 19580301       0 memory =    1679.62 MB (highwater)          0.00 MB (usage)  (pe=    0 comps= cpl ATM LND OCN ICE GLC ROF WAV)

...

 memory_write: model date = 19610624       0 memory =    1916.74 MB (highwater)          0.00 MB (usage)  (pe=    0 comps= cpl ATM LND OCN ICE GLC ROF WAV)

 

 

Memory usage from the cpl.log of CASE B is shown below.

tStamp_write: model date = 18500102       0 wall clock = 2014-05-09 19:23:02 avg dt =    27.13 dt =    27.13

 memory_write: model date = 18500102       0 memory =     405.48 MB (highwater)          0.00 MB (usage)  (pe=    0 comps= cpl ATM LND OCN ICE GLC ROF WAV)

 memory_write: model date = 18500201       0 memory =     610.96 MB (highwater)          0.00 MB (usage)  (pe=    0 comps= cpl ATM LND OCN ICE GLC ROF WAV)

 memory_write: model date = 18500301       0 memory =     614.76 MB (highwater)          0.00 MB (usage)  (pe=    0 comps= cpl ATM LND OCN ICE GLC ROF WAV)

....

 memory_write: model date = 18591231       0 memory =     645.38 MB (highwater)          0.00 MB (usage)  (pe=    0 comps= cpl ATM LND OCN ICE GLC ROF WAV)

 

The memory highwater of CASE A increased rapidly from the first month to the second month, and was close to 2 GB when the case crashed. CASE B used only about 645 MB at the end of the run. Each node of the cluster has 24 GB of memory, so about 2.0 GB is available per processor.
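The per-processor figure above can be worked through as a rough sketch. It assumes that the node memory is shared evenly between the 12 processors, that each processor runs one task, and that 1 GB = 1024 MB. It ignores memory used by the operating system, so the real headroom is smaller.

```python
NODE_MEMORY_GB = 24   # memory per node, from the post
TASKS_PER_NODE = 12   # processors per node, from the post
MB_PER_GB = 1024      # assumption: binary units


def per_task_budget_mb(node_memory_gb, tasks_per_node):
    """Memory available to each task if the node is shared evenly."""
    return node_memory_gb * MB_PER_GB / tasks_per_node


budget = per_task_budget_mb(NODE_MEMORY_GB, TASKS_PER_NODE)

# Last highwater values reported in the two cpl.log excerpts.
for name, highwater in (("CASE A", 1916.74), ("CASE B", 645.38)):
    share = 100 * highwater / budget
    print(f"{name}: {highwater:.2f} MB of {budget:.0f} MB ({share:.0f}%)")
```

Under these assumptions CASE A is at roughly 94% of its per-task share when it crashes, while CASE B ends at roughly 32%.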

 

Question 1:

Is the difference in peak memory caused by the different number of tasks? Why?

Question 2:

What do "usage" and "pe=    0 comps= cpl ATM LND OCN ICE GLC ROF WAV" mean? Usage is 0.00 MB in my case, but it is not zero in some other people's cases. I think "comps" means "components". What does "pe" mean, and why is it equal to 0?

memory_write: model date = 18500131       0 memory =     450.47 MB (highwater)          0.00 MB (usage)  (pe=    0 comps= cpl ATM LND OCN ICE GLC ROF WAV)

 

jedwards

Both cases indicate a significant memory leak.  If your model starts out using 400MB per task, you should not expect it to exceed 500MB per task after several years of simulation.

 

> What is the meaning of "usage" and "pe=    0 comps= cpl ATM LND OCN ICE GLC ROF WAV"?

On some systems we are able to get current memory usage as well as memory highwater; on other systems this functionality is not available and we just report 0.

The pe = 0 means that the maximum memory usage is occurring on MPI task 0. The list of comps is just a list of the model components active on that task; in this case it's all of them.

Your next step might be to run some simpler compsets and see if you can isolate the source of the leak. You might also try a different compiler if you have one available.


CESM Software Engineer

yaozhixiong@...

I created two new cases. One is the ocean model with compset CIAF on 120 processors. The other is the coupled model with compset B, also on 120 processors.

CIAF: cpl.log.140514-212549.gz

B: cpl.log.140514-205418.gz

The memory highwater in both log files is small in the first month, but it increases to more than 1000 MB in the second month. The same thing happens on two other clusters.

For compset B, the memory highwater does not change over the last several months in cpl.log. For compset CIAF, it does change.

How does the memory leak arise?

In fact, I think the "memory leak" does not affect the model run itself. I can ignore this problem as long as memory does not exceed the maximum available to one processor on the cluster.


 

jedwards

You neglected to attach the log files, and I'm confused by your comments. At the start of the thread you said that CIAF was fine but that there were problems with B; now you seem to be saying the opposite. Do all of these clusters have the same OS and compiler versions?

CESM Software Engineer

yaozhixiong@...

Sorry for my carelessness.

When I said at the start of the thread that CIAF was fine, I only meant that the model runs successfully. It also shows the "memory leak", although it does not exceed the maximum memory of one processor. Compset B crashed because its memory highwater exceeded the maximum.

I have used three different clusters. I mostly use the first one.

1. SuSE Linux Enterprise Server (SLES) 11 SP1, Intel compiler

2. Red Hat 5, PGI compiler

3. CentOS 6.4, Intel compiler

 


 

 

jedwards

Can you please send the README.case from your CIAF case and let me know if you have any source modifications?

I would not expect this kind of memory increase at monthly boundaries in the first year and would like to see if I can reproduce your run.  


Thanks

CESM Software Engineer

yaozhixiong@...

The README.case of compset CIAF is attached.

Attachment: 
dingnan0701@...

Hi

What's the difference between memory highwater and memory usage in the cpl log?

jedwards

Memory highwater is the maximum memory consumed by the application on that task since it began; memory usage is the amount used at the time of the query. Depending on the OS we may get data for one or both. If we report 0, it's because we don't get that data from the OS.
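The distinction can be illustrated on Linux with a small sketch. This is not how CESM collects the numbers; it only reads the `VmHWM` (peak resident memory) and `VmRSS` (current resident memory) fields that the Linux kernel exposes in `/proc/self/status`. The helper names `parse_status` and `read_memory` are hypothetical, and returning 0 when the data is unavailable is a choice made here to mirror the behaviour described above.

```python
import re


def parse_status(text):
    """Extract peak (VmHWM) and current (VmRSS) resident memory in MB from status text."""
    values = {}
    for key in ("VmHWM", "VmRSS"):
        match = re.search(rf"^{key}:\s+(\d+)\s+kB", text, re.MULTILINE)
        # Report 0 when the field is missing.
        values[key] = int(match.group(1)) / 1024 if match else 0.0
    return values


def read_memory(path="/proc/self/status"):
    """Query the OS now; fall back to zeros where /proc is not available."""
    try:
        with open(path) as f:
            return parse_status(f.read())
    except OSError:
        return {"VmHWM": 0.0, "VmRSS": 0.0}


before = read_memory()
buffer = bytearray(50 * 1024 * 1024)  # temporarily hold about 50 MB
during = read_memory()
del buffer
after = read_memory()

for label, mem in (("before", before), ("during", during), ("after", after)):
    print(f"{label}: highwater = {mem['VmHWM']:.2f} MB, usage = {mem['VmRSS']:.2f} MB")
```

Each call to `read_memory` is one "query". On a typical Linux system the usage value drops again after the buffer is released, while the highwater value stays at its peak.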

CESM Software Engineer

dingnan0701@...

Thank you very much. But I do not understand what "the query" refers to in the sentence "memory usage is the amount used at the time of the query". Does the query mean the operation that collects the memory highwater statistics?

dingnan0701@...

 

 

Also, about memory highwater: does the maximum memory consumed equal the amount of stack memory? Does it include the cache?
