
Model Running Empty After First 14 Days

Hi All,

I ran into a strange situation. I used $COMPSET=B_2000_CAM5 and $RES=ne120_g16 on Edison (a Cray supercomputer at NERSC). A couple of short test runs completed without any problems, so I tried to run the model for 3 months. I then received an email saying that my run had exceeded the walltime limit, which is strange because I did the math carefully before requesting the computer resources.

I found that the model actually stopped on Day 15 (about 1 hour of wall clock time), even though the PBS server showed it as still running (wall clock 1 hr to 9 hr). So for 8 hours the model basically ran empty. The log files stopped recording anything new after the first hour of wall time, or the first 14 model days. The only errors I can see in the log files are complaints about negative mixing ratios or soil imbalance in the atm.log.* and cesm.log.* files.

I tried doubling the number of nodes because the IT staff thought I might not have enough memory, but the same thing happened: after the first 14 model days the run became a zombie again, doing nothing for several hours except consuming computer time. "qstat" still showed the job as running on the nodes, and it stayed that way until I killed it manually. So far I have not seen anything fatal in the log files.

Does anyone have a hint for this kind of problem? Thank you very much.

Best Regards,
Aaron
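One way to catch this kind of silent hang early, instead of burning the full walltime, is to watch whether the log files are still being written. The following is only a minimal sketch: the function names and the 30-minute threshold are illustrative assumptions and are not part of CESM or PBS.

```python
import os
import time


def seconds_since_last_write(log_path, now=None):
    """Return the number of seconds since log_path was last modified."""
    if now is None:
        now = time.time()
    return now - os.path.getmtime(log_path)


def looks_hung(log_paths, max_idle_seconds=1800, now=None):
    """Return True if no log in log_paths was written within max_idle_seconds.

    The 1800 s default is an arbitrary assumption; choose a value longer
    than the slowest legitimate gap between log writes in your run.
    """
    if not log_paths:
        raise ValueError("at least one log file is required")
    return all(
        seconds_since_last_write(path, now) > max_idle_seconds
        for path in log_paths
    )
```

A script running alongside the job could call `looks_hung` on the current atm.log.* and cesm.log.* files every few minutes and alert you (or cancel the job) once it returns True.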
 

jedwards

CSEG and Liaisons
Staff member
Hi Aaron, It sounds like it's hanging in MPI, and the fact that it happens on day 15 suggests that you are having trouble reading a boundary file. You didn't tell us which CESM version you are using. What value of PIO_TYPENAME is in env_run.xml? What is the last thing in the atm.log? My guess is that it's a file open message.
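For readers who want to check the PIO_TYPENAME setting programmatically, here is a minimal sketch. It assumes env_run.xml stores settings as `<entry id="..." value="..." />` elements; the sample XML and the helper name `get_entry_value` are illustrative only, so verify the layout against the file in your own case directory.

```python
import xml.etree.ElementTree as ET


def get_entry_value(xml_text, entry_id):
    """Return the value attribute of the <entry> whose id is entry_id, or None."""
    root = ET.fromstring(xml_text)
    for entry in root.iter("entry"):
        if entry.get("id") == entry_id:
            return entry.get("value")
    return None


# Hypothetical, heavily trimmed stand-in for env_run.xml (assumed layout).
SAMPLE_ENV_RUN = """<?xml version="1.0"?>
<config_definition>
<entry id="PIO_TYPENAME" value="pnetcdf" />
</config_definition>
"""

print(get_entry_value(SAMPLE_ENV_RUN, "PIO_TYPENAME"))  # prints "pnetcdf"
```

To use it on a real case, read the file contents first, for example with `open("env_run.xml").read()`, and pass the text to `get_entry_value`.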
 
Hello jedwards, I am using CESM 1.2.2. The PIO_TYPENAME is pnetcdf. And you are right, the last thing in the atm.log is about reading some files:
---
 READ_NEXT_TRCDATA emiss_awb
 emiss_dom                       emiss_tra
 emiss_wst                       emiss_shp
 READ_NEXT_TRCDATA SOAG_BIGALK
 SOAG_BIGENE                     SOAG_ISOPRENE
 SOAG_TERPENE                    SOAG_TOLUENE
 READ_NEXT_TRCDATA emiss_awb
 emiss_dom                       emiss_ene
 emiss_ind                       emiss_tra
 emiss_wst                       emiss_shp
 READ_NEXT_TRCDATA BC_emiss_awb
 BC_emiss_dom                    BC_emiss_ene
 BC_emiss_ind                    BC_emiss_tra
 BC_emiss_wst                    BC_emiss_shp
 OC_emiss_awb                    OC_emiss_dom
 OC_emiss_ene                    OC_emiss_ind
 OC_emiss_tra                    OC_emiss_wst
 OC_emiss_shp                    SO4_emiss_awb
 SO4_emiss_wst                   SO4_emiss_shp
 READ_NEXT_TRCDATA SO4_emiss_dom
 SO4_emiss_tra
 READ_NEXT_TRCDATA emiss_awb
 emiss_dom                       emiss_ene
 emiss_ind                       emiss_tra
 emiss_wst                       emiss_shp
 READ_NEXT_TRCDATA emiss_awb
 emiss_wst                       emiss_shp
 READ_NEXT_TRCDATA emiss_dom
 emiss_tra
 READ_NEXT_TRCDATA emiss_ene
 emiss_ind                       forestfire
 grassfire                       contvolc
 READ_NEXT_TRCDATA emiss_ene
 emiss_ind                       forestfire
 grassfire                       contvolc
 READ_NEXT_TRCDATA forestfire
 grassfire
 READ_NEXT_TRCDATA forestfire
 grassfire
 READ_NEXT_TRCDATA BC_forestfire
 BC_grassfire                    OC_forestfire
 OC_grassfire                    SO4_emiss_ene
 SO4_emiss_ind                   SO4_forestfire
 SO4_grassfire                   SO4_contvolc
 READ_NEXT_TRCDATA contvolc
 READ_NEXT_TRCDATA SO4_contvolc
 READ_NEXT_TRCDATA O3
 OH                              NO3
 HO2
 READ_NEXT_TRCDATA ozone
---
It looks to me like some chemistry input. I thought the "build" had downloaded the necessary files for me, but I guess I was wrong. I am not familiar with atmospheric chemistry. Should I somehow turn it off, or download the data from somewhere? Thanks a lot.
 
Hi All, In my last post, I suspected that one of the boundary files (e.g., prescribed ozone data) had not been downloaded properly before I submitted the run, and that this was why my run stopped at Day 15. I re-ran the model at a much lower resolution (changing ne120_g16 to ne30_g16), and it ran successfully for the whole model year. So apparently I was wrong about the missing boundary file. Reading jedwards' post again, the problem could be, as he/she said, associated with the READING of the boundary file. Now the question is: why does this problem occur, and how can I avoid it? Does anyone have any suggestions? Thank you very much.
 
Hi All, After several trials, I found that the error is associated with the history output options for the land component. My user_nl_clm looks like this:
----------------
 hist_nhtfrq = 0, -24
 hist_fincl2 = 'TG', 'TV'
----------------
It caused the error on Edison when I used $COMPSET=B_2000_CAM5 and $RES=ne120_g16. I will try to track down the bug when I have time. If anybody knows what is going on, please let me know. Thank you very much.
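For readers unfamiliar with these settings, the sketch below spells out what the two hist_nhtfrq values request, following the usual CLM convention: 0 means monthly averages, a negative value is an interval in hours, and a positive value is an interval in model time steps. The helper name `describe_hist_nhtfrq` is hypothetical and exists only for illustration.

```python
def describe_hist_nhtfrq(value):
    """Translate one hist_nhtfrq entry into a plain-language output frequency."""
    if value == 0:
        return "monthly average"
    if value < 0:
        return f"every {-value} hours"
    return f"every {value} model time steps"


# Values from the user_nl_clm above: one entry per history tape.
for tape_number, value in enumerate([0, -24], start=1):
    print(f"history tape {tape_number}: {describe_hist_nhtfrq(value)}")
# history tape 1: monthly average
# history tape 2: every 24 hours
```

So the namelist above adds a second land history tape that writes TG and TV once per model day, on top of the default monthly tape.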
 