
Running CESM out of memory

I recently built a CESM case using the "20TR_CAM5%FCHM_CLM40%SP_CICE_POP2_RTM_SGLC_SWAV" compset with CESM 1.2.2, but it stopped after nearly one hour. The cesm.log file showed:

YOUR APPLICATION TERMINATED WITH THE EXIT STRING: Killed (signal 9)
This typically refers to a problem with your application.

The cpl.log file showed that it used a large and growing amount of memory during the simulation:

tStamp_write: model date = 19000102       0 wall clock = 2015-10-27 10:04:36 avg dt =   382.71 dt =   382.71
memory_write: model date = 19000102       0 memory =    1794.47 MB (highwater)          0.00 MB (usage)  (pe=    0 comps= cpl ATM LND OCN ICE GLC ROF WAV)
tStamp_write: model date = 19000103       0 wall clock = 2015-10-27 10:11:46 avg dt =   406.08 dt =   429.45
memory_write: model date = 19000103       0 memory =    2501.36 MB (highwater)          0.00 MB (usage)  (pe=    0 comps= cpl ATM LND OCN ICE GLC ROF WAV)
tStamp_write: model date = 19000104       0 wall clock = 2015-10-27 10:18:56 avg dt =   414.22 dt =   430.49
memory_write: model date = 19000104       0 memory =    3118.88 MB (highwater)          0.00 MB (usage)  (pe=    0 comps= cpl ATM LND OCN ICE GLC ROF WAV)
tStamp_write: model date = 19000105       0 wall clock = 2015-10-27 10:26:09 avg dt =   418.70 dt =   432.13
memory_write: model date = 19000105       0 memory =    3736.44 MB (highwater)          0.00 MB (usage)  (pe=    0 comps= cpl ATM LND OCN ICE GLC ROF WAV)
tStamp_write: model date = 19000106       0 wall clock = 2015-10-27 10:33:23 avg dt =   421.90 dt =   434.74
memory_write: model date = 19000106       0 memory =    4353.89 MB (highwater)          0.00 MB (usage)  (pe=    0 comps= cpl ATM LND OCN ICE GLC ROF WAV)
tStamp_write: model date = 19000107       0 wall clock = 2015-10-27 10:41:04 avg dt =   428.33 dt =   460.47
memory_write: model date = 19000107       0 memory =    4971.40 MB (highwater)          0.00 MB (usage)  (pe=    0 comps= cpl ATM LND OCN ICE GLC ROF WAV)
tStamp_write: model date = 19000108       0 wall clock = 2015-10-27 10:49:12 avg dt =   436.95 dt =   488.67
memory_write: model date = 19000108       0 memory =    5588.85 MB (highwater)          0.00 MB (usage)  (pe=    0 comps= cpl ATM LND OCN ICE GLC ROF WAV)
tStamp_write: model date = 19000109       0 wall clock = 2015-10-27 10:58:21 avg dt =   450.85 dt =   548.12
memory_write: model date = 19000109       0 memory =    6206.26 MB (highwater)          0.00 MB (usage)  (pe=    0 comps= cpl ATM LND OCN ICE GLC ROF WAV)
tStamp_write: model date = 19000110       0 wall clock = 2015-10-27 11:06:45 avg dt =   456.81 dt =   504.55

I have never run into this problem with other compsets. How can I fix it?
 

jedwards

CSEG and Liaisons
Staff member
This is a clear indication of a memory leak. The leak could be coming from several places, including your compiler, your MPI library, or other non-CESM-specific parts of your environment. Please provide a detailed description of your tools and runtime environment.
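To make the growth in the cpl.log excerpt above easier to see, here is a minimal Python sketch that extracts the highwater value from each memory_write record and prints the increase between consecutive model days. The regular expression is an assumption based only on the record layout shown in this thread, and the function names are made up for illustration. Note that these records report pe 0 only, so they say nothing directly about the other tasks.

```python
import re

# Assumed record layout, taken from the cpl.log lines quoted in this thread.
MEMORY_RE = re.compile(
    r"memory_write: model date =\s*(\d{8})\s+\d+\s+memory =\s*([\d.]+) MB \(highwater\)"
)


def highwater_by_date(log_text):
    """Return (model_date, highwater_MB) tuples in the order they appear."""
    return [(date, float(mb)) for date, mb in MEMORY_RE.findall(log_text)]


def daily_growth(records):
    """Return the highwater increase in MB between consecutive records."""
    return [round(b[1] - a[1], 2) for a, b in zip(records, records[1:])]


# First three memory_write records from the original post.
sample = """
memory_write: model date = 19000102       0 memory =    1794.47 MB (highwater)          0.00 MB (usage)  (pe=    0 comps= cpl ATM LND OCN ICE GLC ROF WAV)
memory_write: model date = 19000103       0 memory =    2501.36 MB (highwater)          0.00 MB (usage)  (pe=    0 comps= cpl ATM LND OCN ICE GLC ROF WAV)
memory_write: model date = 19000104       0 memory =    3118.88 MB (highwater)          0.00 MB (usage)  (pe=    0 comps= cpl ATM LND OCN ICE GLC ROF WAV)
"""

records = highwater_by_date(sample)
for (date, _), growth in zip(records[1:], daily_growth(records)):
    print(f"{date}: +{growth} MB since previous record")
```

Because the pattern is applied with findall, it should also work on a log whose records have been run together on one line, as long as the field layout matches.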
 
Thank you, Jedwards! I built this case using these commands:

./create_newcase -case $casedir -res f19_g16 -user_compset 20TR_CAM5%FCHM_CLM40%SP_CICE_POP2_RTM_SGLC_SWAV -mach userdefined
cd ./$casedir
./xmlchange -file env_build.xml -id OS -val linux
./xmlchange -file env_build.xml -id COMPILER -val pgi
./xmlchange -file env_build.xml -id MPILIB -val mpich

and ran it on 16 nodes. My .bashrc file and CESM namelist file are attached below. I hope this information is helpful.
 

jedwards

The problem is that the PE layout you are getting by default is a poor choice for this compset - the OCN starts on day 2 and you are running out of memory. Try the attached env_mach_pes.xml file.
 
Thank you for your reply, Jedwards. I tried your env_mach_pes.xml file, but I got this error: "qsub: Job exceeds queue resource limits MSG=cannot locate feasible nodes (nodes file is empty or all systems are busy)". I then realized that I had made a mistake that misled you: I ran this case on just one node with 16 CPUs (the maximum is 64 CPUs). I am very sorry for that.

Is this insufficient for this compset? How should I change the env_mach_pes.xml file or other settings in this situation? My original env_mach_pes.xml is attached below.
 

jedwards

You should divide the available 64 PEs among the components in the proportions shown in that file. The OCN should get its own PEs, separate from the other components. This will help performance; however, since you have just one node, it won't help the memory usage at all. You may need to try a lower resolution first - f45_g37?
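As a rough illustration of that advice, the Python sketch below splits a fixed PE count so that OCN gets its own block and every other component shares the remainder. The proportions in the attached env_mach_pes.xml are not reproduced in this thread, so the 25% ocean share is purely hypothetical, as is the simplification that all non-ocean components share one block starting at PE 0. The NTASKS_ and ROOTPE_ ids are assumed to match the entries in your own env_mach_pes.xml; check them before running anything.

```python
# Components listed in the cpl.log "comps=" field, minus OCN.
SHARED_COMPONENTS = ("ATM", "LND", "ICE", "CPL", "ROF", "GLC", "WAV")


def split_pes(total_pes, ocn_fraction):
    """Give OCN its own block of PEs; all other components share the rest."""
    if not 0.0 < ocn_fraction < 1.0:
        raise ValueError("ocn_fraction must be strictly between 0 and 1")
    ocn_tasks = max(1, round(total_pes * ocn_fraction))
    shared_tasks = total_pes - ocn_tasks
    if shared_tasks < 1:
        raise ValueError("not enough PEs left for the other components")
    layout = {c: {"ntasks": shared_tasks, "rootpe": 0} for c in SHARED_COMPONENTS}
    # OCN starts right after the shared block, so it never overlaps it.
    layout["OCN"] = {"ntasks": ocn_tasks, "rootpe": shared_tasks}
    return layout


def xmlchange_lines(layout):
    """Format the layout as xmlchange commands (ids are assumptions, see above)."""
    lines = []
    for comp, values in layout.items():
        lines.append(
            f"./xmlchange -file env_mach_pes.xml -id NTASKS_{comp} -val {values['ntasks']}"
        )
        lines.append(
            f"./xmlchange -file env_mach_pes.xml -id ROOTPE_{comp} -val {values['rootpe']}"
        )
    return lines


# Hypothetical split: 64 PEs in total, a quarter of them reserved for OCN.
for line in xmlchange_lines(split_pes(64, 0.25)):
    print(line)
```

This only rearranges tasks within the same 64 PEs. As noted in the reply above, on a single node it may improve throughput but will not reduce the total memory the run needs.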
 
But I'm very confused: if I switch to another compset, BMOZ, at the same resolution, it works well. I use the present compset "20TR_CAM5%FCHM_CLM40%SP_CICE_POP2_RTM_SGLC_SWAV" because its chemistry is much simpler. As far as I know, this should save simulation time. Why would it run out of memory instead?
 

jedwards

Perhaps you should compare the env_mach_pes.xml from your BMOZ case to the one you are trying to run. User-defined compsets such as the one you are using do not have tuned PE layouts by default; the BMOZ alias does.
 
I have compared the env_mach_pes.xml files as you suggested, but found that they differ very little. What's more, I have tried other user-defined compsets, for example combining CAM5 with trop_mozart chemistry, and that works well. However, if I combine CAM5 with fast chemistry, I get a similar problem no matter which time period I choose or how many nodes I use. In the cpl.log file, the memory used increases constantly. I therefore believe it would crash sooner or later no matter how many nodes or PEs I use. Is something going wrong with releasing the used memory after every step?

tStamp_write: model date = 19000102       0 wall clock = 2015-11-02 20:26:14 avg dt =   279.89 dt =   279.89
memory_write: model date = 19000102       0 memory =    1283.60 MB (highwater)          0.00 MB (usage)  (pe=    0 comps= cpl ATM LND OCN ICE GLC ROF WAV)
tStamp_write: model date = 19000103       0 wall clock = 2015-11-02 20:31:52 avg dt =   308.83 dt =   337.78
memory_write: model date = 19000103       0 memory =    1617.78 MB (highwater)          0.00 MB (usage)  (pe=    0 comps= cpl ATM LND OCN ICE GLC ROF WAV)
tStamp_write: model date = 19000104       0 wall clock = 2015-11-02 20:37:30 avg dt =   318.43 dt =   337.61
memory_write: model date = 19000104       0 memory =    1861.06 MB (highwater)          0.00 MB (usage)  (pe=    0 comps= cpl ATM LND OCN ICE GLC ROF WAV)
tStamp_write: model date = 19000105       0 wall clock = 2015-11-02 20:43:09 avg dt =   323.55 dt =   338.91
memory_write: model date = 19000105       0 memory =    2104.25 MB (highwater)          0.00 MB (usage)  (pe=    0 comps= cpl ATM LND OCN ICE GLC ROF WAV)
tStamp_write: model date = 19000106       0 wall clock = 2015-11-02 20:48:47 avg dt =   326.44 dt =   338.02
memory_write: model date = 19000106       0 memory =    2347.44 MB (highwater)          0.00 MB (usage)  (pe=    0 comps= cpl ATM LND OCN ICE GLC ROF WAV)
tStamp_write: model date = 19000107       0 wall clock = 2015-11-02 20:54:26 avg dt =   328.68 dt =   339.85
memory_write: model date = 19000107       0 memory =    2590.59 MB (highwater)          0.00 MB (usage)  (pe=    0 comps= cpl ATM LND OCN ICE GLC ROF WAV)
tStamp_write: model date = 19000108       0 wall clock = 2015-11-02 21:00:03 avg dt =   329.78 dt =   336.38
memory_write: model date = 19000108       0 memory =    2833.76 MB (highwater)          0.00 MB (usage)  (pe=    0 comps= cpl ATM LND OCN ICE GLC ROF WAV)
tStamp_write: model date = 19000109       0 wall clock = 2015-11-02 21:05:40 avg dt =   330.67 dt =   336.88
memory_write: model date = 19000109       0 memory =    3076.88 MB (highwater)          0.00 MB (usage)  (pe=    0 comps= cpl ATM LND OCN ICE GLC ROF WAV)
tStamp_write: model date = 19000110       0 wall clock = 2015-11-02 21:11:17 avg dt =   331.36 dt =   336.93
memory_write: model date = 19000110       0 memory =    3320.11 MB (highwater)          0.00 MB (usage)  (pe=    0 comps= cpl ATM LND OCN ICE GLC ROF WAV)
tStamp_write: model date = 19000111       0 wall clock = 2015-11-02 21:16:53 avg dt =   331.88 dt =   336.51
memory_write: model date = 19000111       0 memory =    3563.23 MB (highwater)          0.00 MB (usage)  (pe=    0 comps= cpl ATM LND OCN ICE GLC ROF WAV)
tStamp_write: model date = 19000112       0 wall clock = 2015-11-02 21:22:30 avg dt =   332.30 dt =   336.49
memory_write: model date = 19000112       0 memory =    3806.35 MB (highwater)          0.00 MB (usage)  (pe=    0 comps= cpl ATM LND OCN ICE GLC ROF WAV)
tStamp_write: model date = 19000113       0 wall clock = 2015-11-02 21:28:06 avg dt =   332.60 dt =   335.97
memory_write: model date = 19000113       0 memory =    4049.45 MB (highwater)          0.00 MB (usage)  (pe=    0 comps= cpl ATM LND OCN ICE GLC ROF WAV)
tStamp_write: model date = 19000114       0 wall clock = 2015-11-02 21:33:41 avg dt =   332.83 dt =   335.53
memory_write: model date = 19000114       0 memory =    4292.57 MB (highwater)          0.00 MB (usage)  (pe=    0 comps= cpl ATM LND OCN ICE GLC ROF WAV)
tStamp_write: model date = 19000115       0 wall clock = 2015-11-02 21:39:17 avg dt =   333.02 dt =   335.53
memory_write: model date = 19000115       0 memory =    4535.72 MB (highwater)          0.00 MB (usage)  (pe=    0 comps= cpl ATM LND OCN ICE GLC ROF WAV)
tStamp_write: model date = 19000116       0 wall clock = 2015-11-02 21:44:56 avg dt =   333.47 dt =   339.77
memory_write: model date = 19000116       0 memory =    4778.85 MB (highwater)          0.00 MB (usage)  (pe=    0 comps= cpl ATM LND OCN ICE GLC ROF WAV)
tStamp_write: model date = 19000117       0 wall clock = 2015-11-02 21:50:44 avg dt =   334.32 dt =   347.08
memory_write: model date = 19000117       0 memory =    5021.99 MB (highwater)          0.00 MB (usage)  (pe=    0 comps= cpl ATM LND OCN ICE GLC ROF WAV)
tStamp_write: model date = 19000118       0 wall clock = 2015-11-02 21:56:33 avg dt =   335.20 dt =   349.27
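One way to put a number on "sooner or later" is to fit a straight line to the highwater values in the excerpt above and extrapolate. The Python sketch below does an ordinary least-squares fit against the record index (one record per model day here). The 32 GB limit is a hypothetical figure, since the memory available on the node is not stated in this thread, and the values are for pe 0 only, so treat the result as an order-of-magnitude estimate at best.

```python
def fit_line(ys):
    """Least-squares fit of ys against 0, 1, 2, ...; returns (slope, intercept)."""
    n = len(ys)
    if n < 2:
        raise ValueError("need at least two samples to fit a line")
    x_mean = (n - 1) / 2
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in range(n))
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(ys))
    slope = sxy / sxx
    return slope, y_mean - slope * x_mean


def days_until(limit_mb, ys):
    """Samples after the last one until the fitted line reaches limit_mb.

    Returns None if the fitted trend is flat or decreasing.
    """
    slope, intercept = fit_line(ys)
    if slope <= 0:
        return None
    return (limit_mb - intercept) / slope - (len(ys) - 1)


# Highwater values (MB) for model dates 19000102 to 19000117, from the post above.
highwater_mb = [
    1283.60, 1617.78, 1861.06, 2104.25, 2347.44, 2590.59, 2833.76, 3076.88,
    3320.11, 3563.23, 3806.35, 4049.45, 4292.57, 4535.72, 4778.85, 5021.99,
]

HYPOTHETICAL_LIMIT_MB = 32 * 1024  # assumption: not a figure from this thread

slope, _ = fit_line(highwater_mb)
remaining = days_until(HYPOTHETICAL_LIMIT_MB, highwater_mb)
print(f"fitted growth: {slope:.1f} MB per model day")
print(f"model days until the hypothetical limit: {remaining:.0f}")
```

A steady linear trend like this one is consistent with memory that is allocated every model day and never freed, but the fit alone cannot say which component or library is responsible.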
 