
Signal error of compset B_1850-2000_CN

Hi,

I am using CESM1_2_0. The CIAF compset runs for several decades on our cluster, but I ran into a problem when I created a case with compset B_1850-2000_CN.

create_newcase -case ../caseoutput/b20trcn_f19g16 -compset B_1850-2000_CN -res f19_g16 -mach szhpc

I have run several such cases before. In all of them, output files stopped being written after several years of simulation (fewer than 10 years), while the job status still showed as running. There are no errors in any log except cesm.log. Some of the error messages from cesm.log are shown below. The cesm.log is also attached.

******
[gk0345:27622] *** Process received signal ***
[gk0345:27622] Signal: Bus error (7)
[gk0345:27622] Signal code: (128)
[gk0345:27622] Failing at address: (nil)
[gk0345:27623] *** Process received signal ***
[gk0345:27623] Signal: Bus error (7)
[gk0345:27623] Signal code: (128)
[gk0345:27623] Failing at address: (nil)
[gk0345:27624] *** Process received signal ***
[gk0345:27624] Signal: Bus error (7)
[gk0345:27624] Signal code: (128)
[gk0345:27624] Failing at address: (nil)
*****

The cluster manager does not know why this problem happened.
Thanks,

yao
 

jedwards

CSEG and Liaisons
Staff member
I see in the log that these files, for example
 Opened file ./b20trcn_f19g16.rtm.h0.1958-03.nc to write      589824
 Opened file ./b20trcn_f19g16.clm2.h0.1958-03.nc to write      589824
 Opened file b20trcn_f19g16.cam.h0.1958-03.nc to write      589824

were written. The only thing I can gather from the cesm.log is that the node gk0345 stopped responding.
 
Thanks jedwards. Both compset B and B_1850-2000_CN hit the same error after running several years. Why does the node always stop responding when I run the coupled model, while the ocean model runs fine? The cluster manager does not know the reason either. It does not happen on another cluster that I used before. Can you tell me how to fix it or how to find the cause?
Best regards,
yao
 

jedwards

CSEG and Liaisons
Staff member
It could be a memory leak - look at the cpl.log output to see if the memory highwater is growing unreasonably.
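One rough way to make this check is to pull the highwater values out of cpl.log with a short script. The following is only a sketch in Python and is not part of CESM: it assumes the memory_write lines have the format quoted in this thread, and the names parse_highwater and growth_ratio are made up for illustration.

```python
import re

# Assumed line format (taken from the cpl.log excerpts in this thread):
# memory_write: model date = 19580102       0 memory =     335.77 MB (highwater) ...
MEMORY_LINE = re.compile(
    r"memory_write: model date =\s*(\d{8})\s+\d+\s+memory =\s*([\d.]+) MB \(highwater\)"
)

def parse_highwater(lines):
    """Return (model_date, highwater_mb) tuples for every memory_write line."""
    samples = []
    for line in lines:
        match = MEMORY_LINE.search(line)
        if match:
            samples.append((match.group(1), float(match.group(2))))
    return samples

def growth_ratio(samples):
    """Last highwater divided by the first one; 1.0 means no growth."""
    if len(samples) < 2:
        return 1.0
    return samples[-1][1] / samples[0][1]

# Two lines copied from the CASE A log quoted in this thread.
SAMPLE = [
    " memory_write: model date = 19580102       0 memory =     335.77 MB (highwater)          0.00 MB (usage)  (pe=    0 comps= cpl ATM LND OCN ICE GLC ROF WAV)",
    " memory_write: model date = 19610624       0 memory =    1916.74 MB (highwater)          0.00 MB (usage)  (pe=    0 comps= cpl ATM LND OCN ICE GLC ROF WAV)",
]

if __name__ == "__main__":
    for date, mb in parse_highwater(SAMPLE):
        print(date, mb)
    print("growth ratio:", round(growth_ratio(parse_highwater(SAMPLE)), 2))
```

To use it on a real log, read the uncompressed cpl.log with open(path) and pass the file object to parse_highwater.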
 
The cesm.log uploaded before was from a continuation run. Let us call it CASE A. It hit the same problem, which may be a "memory leak".
CASE A: 24 nodes and 12 processors per node (total 240 tasks)
cpl.log_.140508-110728.zip
I also created an identical case with a different number of tasks.
CASE B: 5 nodes and 12 processors per node (total 60 tasks), Time: 1850-1859
cpl.log_.140509-192020.gz

Memory usage in the cpl.log of CASE A is shown below.
 memory_write: model date = 19580102       0 memory =     335.77 MB (highwater)          0.00 MB (usage)  (pe=    0 comps= cpl ATM LND OCN ICE GLC ROF WAV)
 memory_write: model date = 19580201       0 memory =    1528.53 MB (highwater)          0.00 MB (usage)  (pe=    0 comps= cpl ATM LND OCN ICE GLC ROF WAV)
 memory_write: model date = 19580301       0 memory =    1679.62 MB (highwater)          0.00 MB (usage)  (pe=    0 comps= cpl ATM LND OCN ICE GLC ROF WAV)
...
 memory_write: model date = 19610624       0 memory =    1916.74 MB (highwater)          0.00 MB (usage)  (pe=    0 comps= cpl ATM LND OCN ICE GLC ROF WAV)

Memory usage in the cpl.log of CASE B is shown below.
tStamp_write: model date = 18500102       0 wall clock = 2014-05-09 19:23:02 avg dt =    27.13 dt =    27.13
 memory_write: model date = 18500102       0 memory =     405.48 MB (highwater)          0.00 MB (usage)  (pe=    0 comps= cpl ATM LND OCN ICE GLC ROF WAV)
 memory_write: model date = 18500201       0 memory =     610.96 MB (highwater)          0.00 MB (usage)  (pe=    0 comps= cpl ATM LND OCN ICE GLC ROF WAV)
 memory_write: model date = 18500301       0 memory =     614.76 MB (highwater)          0.00 MB (usage)  (pe=    0 comps= cpl ATM LND OCN ICE GLC ROF WAV)
....
 memory_write: model date = 18591231       0 memory =     645.38 MB (highwater)          0.00 MB (usage)  (pe=    0 comps= cpl ATM LND OCN ICE GLC ROF WAV)

The memory highwater of CASE A increased rapidly from the first month to the second month, and was close to 2 GB when the case crashed.
For CASE B, memory use was only 645 MB at the end of the run. Each node of the cluster has 24 GB of memory, so there is about 2.0 GB per processor.

Question 1: Does the "memory peak" happen because of the different number of tasks? Why?
Question 2: What is the meaning of "usage" and "pe=    0 comps= cpl ATM LND OCN ICE GLC ROF WAV"? The usage is 0.00 MB in my case, but it is not zero in some other people's cases. I think "comps" means "component". What does "pe" mean? Why is it equal to 0?
memory_write: model date = 18500131       0 memory =     450.47 MB (highwater)          0.00 MB (usage)  (pe=    0 comps= cpl ATM LND OCN ICE GLC ROF WAV)
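The per-processor figure above can be worked out as follows. This is a sketch only, under assumptions not stated in the thread: that 1 GB is counted as 1024 MB, that node memory is split evenly across the 12 tasks, and that nothing is reserved for the operating system. The function names are hypothetical.

```python
def per_task_budget_mb(node_memory_gb, tasks_per_node):
    """Memory available to each MPI task, assuming an even split of the node."""
    return node_memory_gb * 1024.0 / tasks_per_node

def fraction_of_budget(highwater_mb, budget_mb):
    """How much of the per-task budget a reported highwater value takes up."""
    return highwater_mb / budget_mb

# Values quoted in the thread: 24 GB per node, 12 tasks per node.
budget = per_task_budget_mb(24, 12)

# Last reported highwater for CASE A and CASE B.
case_a = fraction_of_budget(1916.74, budget)
case_b = fraction_of_budget(645.38, budget)

if __name__ == "__main__":
    print(f"budget per task: {budget:.0f} MB")
    print(f"CASE A uses {case_a:.0%} of the budget")
    print(f"CASE B uses {case_b:.0%} of the budget")
```

Under these assumptions CASE A is close to the per-task limit at its last reported highwater, while CASE B is well below it.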
 

jedwards

CSEG and Liaisons
Staff member
Both cases indicate a significant memory leak.  If your model starts out using 400MB per task, you should not expect it to exceed 500MB per task after several years of simulation.
> What is the meaning of "usage" and "pe=    0 comps= cpl ATM LND OCN ICE GLC ROF WAV"?
On some systems we are able to get current memory usage as well as memory high water; on other systems this functionality is not available and we just report 0.
The pe = 0 means that the maximum memory usage is occurring on MPI task 0.  The list of comps is just a list of model components active on that task; in this case it's all of them.
Your next step might be to run some simpler compsets and see if you can isolate the source of the leak.   You might also try a different compiler if you have one available.
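The rule of thumb above (a start near 400MB should not end above 500MB) can be turned into a simple check. This is a hedged sketch, not a CESM tool: the 1.25 ratio is just 500/400 taken from that rule of thumb, and looks_like_leak is a hypothetical name. Which sample counts as the "start" is also an assumption, since the first highwater value may be written before the model has finished initializing.

```python
def looks_like_leak(first_mb, last_mb, allowed_ratio=1.25):
    """Flag a run whose highwater grew by more than allowed_ratio.

    allowed_ratio=1.25 is an assumption derived from the 400MB -> 500MB
    rule of thumb; it is not a documented CESM threshold.
    """
    if first_mb <= 0:
        raise ValueError("first_mb must be positive")
    return last_mb / first_mb > allowed_ratio

# First and last highwater values quoted in the thread.
CASE_A = (335.77, 1916.74)
CASE_B = (405.48, 645.38)

if __name__ == "__main__":
    print("CASE A leak suspected:", looks_like_leak(*CASE_A))
    print("CASE B leak suspected:", looks_like_leak(*CASE_B))
```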

 
Two new cases were created. One is the ocean model with compset CIAF on 120 processors. The other is the coupled model with compset B, also on 120 processors.
CIAF: cpl.log.140514-212549.gz
B: cpl.log.140514-205418.gz
The memory highwater in both log files is small in the first month, but it increases to more than 1000 MB in the second month. The same thing happens on two other clusters. The memory highwater over the last several months in cpl.log does not change for compset B, but it does change for compset CIAF. I wonder how the memory leak happens. In fact, I think the "memory leak" does not affect the model run. I can ignore this problem as long as it does not exceed the maximum memory of one processor on the cluster.
 
 

jedwards

CSEG and Liaisons
Staff member
You neglected to attach the log files and I'm confused by your comments. At the start of the thread you said that CIAF was fine but that there were problems with B; now you seem to be saying the opposite.   Do all of these clusters have the same OS and compiler versions?
 
Sorry for my carelessness. When I said at the start of the thread that CIAF was fine, I only meant that the model runs successfully. It also has a "memory leak" problem, although it does not exceed the maximum memory of one processor. Compset B crashed because its memory highwater exceeded the maximum.
I have used three different clusters. I always use the first one.
1. SuSE Linux Enterprise Server (SLES) 11 SP1, INTEL
2. Red Hat 5, PGI
3. CentOS 6.4, INTEL
  
 

jedwards

CSEG and Liaisons
Staff member
Can you please send the README.case from your CIAF case and let me know if you have any source modifications. I would not expect this kind of memory increase at monthly boundaries in the first year and would like to see if I can reproduce your run.
Thanks
 

jedwards

CSEG and Liaisons
Staff member
Memory highwater is the maximum memory consumed by the application on that task since it began, memory usage is the amount used at the time of the query.  Depending on the OS we may get data for one or both, if we report 0 it's because we don't get that data from the OS.
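The difference between the two numbers can be illustrated with Python's standard tracemalloc module, which reports both a current and a peak value. This is only an analogy under stated assumptions: tracemalloc tracks allocations made by the Python interpreter, not the OS-level query that CESM uses, and measure is a hypothetical helper name.

```python
import tracemalloc

def measure():
    """Allocate about 10 MB, free it, then report (current, peak) in bytes."""
    tracemalloc.start()
    block = bytearray(10 * 1024 * 1024)  # memory use rises by about 10 MB
    del block                            # memory is released again
    # "current" is the amount in use at the moment of this call (the query);
    # "peak" is the largest amount seen since tracemalloc.start().
    current, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return current, peak

if __name__ == "__main__":
    current, peak = measure()
    print(f"usage at query time: {current / 1e6:.2f} MB")
    print(f"highwater:           {peak / 1e6:.2f} MB")
```

Here the peak stays near 10 MB even though the usage at the time of the call has dropped back down, which is the same relationship as highwater versus usage in cpl.log.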
 
Thank you very much. But I do not understand what "the query" refers to in the sentence "memory usage is the amount used at the time of the query". Does the query mean the operation that collects the memory high water statistics?
 