jihwang@colorado_edu
Member
Hi all, I ran into a strange situation. I used $COMPSET=B_2000_CAM5 and $RES=ne120_g16 on edison (a Cray supercomputer at NERSC). A couple of short test runs completed without any problems, so I tried to run the model for 3 months. I then got an email saying that my run had exceeded the walltime limit, which is very strange because I did my math carefully before requesting the computer resources.

I found that the model had actually stopped on Day 15 (wall clock around 1 hour), even though the PBS server showed it as still running (wall clock 1 hr to 9 hr). So for those 8 hours the model basically ran empty. The log files stopped recording anything new after the first hour of wall time, i.e. the first 14 model days. The only errors I can see in the log files are complaints about negative mixing ratio or soil imbalance in the atm.log.* and cesm.log.* files.

I tried doubling the number of nodes because the IT staff thought I might not have enough memory, but the same thing happened: after the first 14 model days the run became a zombie again, doing nothing for several hours except consuming more computer time. PBS still shows the job as running on the nodes when I check with "qstat", and it stays that way until I kill it manually. So far I haven't seen anything fatal in the log files.
Does anyone have a hint for this kind of problem? Thank you very much. Best regards, Aaron