
CESM got stuck for unknown reason

AndrewY

New Member
Hello all,

I recently ported CESM2 to a new machine, and everything works up to job submission (i.e., new cases can be created and built successfully). However, the model gets stuck while running, and the jobs are eventually killed when they hit the time limit. I am using an HPC system and set up the test runs to simulate only a few months, which I believed would not take much time. Can you give me some hints on where to look for the problem and how to fix it?

The details of the model simulation are listed as follows:

The version I'm using is cesm2.1.3-rc.01-0-g0596a97
I created a test case with ./create_newcase --case ~/cesm2/cesm2.1.3/Case/F2000climo_nwtr.test01 --compset F2000climo --res f09_f09_mg17 --machine odyssey2
The changes I made were ./xmlchange JOB_WALLCLOCK_TIME=24:00:00,walltime=8:00:00,STOP_N=1,STOP_OPTION=nmonths,RESUBMIT=3

Attached please find the zip file containing all the log and configuration files.

Thank you!

P.S. "odyssey2" is the MACH name used in files such as config_compilers.xml, config_machines.xml, and config_batch.xml.
 

Attachments

  • 2.1.3.zip
    223.9 KB · Views: 6

nusbaume

Jesse Nusbaumer
CSEG and Liaisons
Staff member
Hi AndrewY,

It does appear that the model is running, but it could be running very slowly. Can you attach the env_mach_pes.xml and env_workflow.xml files from your case directory? If there is also a file that looks like run.F2000climo_nwtr.test01.oXXX in your case directory, please attach that as well. These could help us see whether everything is being set correctly.

Thanks, and have a great day!

Jesse
 

AndrewY

New Member
Hi Jesse,

Thanks for your reply! I have included the three files you mentioned in another zip file. Please let me know if any other information is needed!
 

Attachments

  • 2.1.3_supplement.zip
    3.4 KB · Views: 3

nusbaume

Jesse Nusbaumer
CSEG and Liaisons
Staff member
Hi AndrewY,

Thanks for the extra files! In general, 64 tasks is not that many for running a 1-degree CESM2 simulation, so I would probably try increasing that number to see if you get better throughput (I personally run with 180 tasks or more when doing F-compset runs). I would also make sure that you are not running with runtime debugging on (i.e., make sure DEBUG is FALSE in env_build.xml), as that will certainly hurt model performance too.

I also couldn't find any documentation online for a machine called "odyssey". In config_machines.xml, however, the DESC tag states that there are 32 PEs per node, while MAX_TASKS_PER_NODE is actually set to 16. On top of that, I was able to find a description of the huce_cascade queue on Harvard's website here:


which says there are actually 48 cores per node. I would definitely recommend finding out exactly how many tasks per node the machine supports and using that specific value in the config files. Otherwise you are going to be doing a lot of unnecessary inter-node communication, which will hurt performance as well.
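To illustrate why the tasks-per-node value matters, here is a minimal Python sketch of the arithmetic. It only counts how many nodes a job would be spread over for each of the three per-node values mentioned in this thread (16, 32, and 48); the function name `nodes_needed` is hypothetical, and the batch scheduler on the actual machine may place tasks differently.

```python
import math


def nodes_needed(total_tasks: int, tasks_per_node: int) -> int:
    """Smallest number of nodes that can hold total_tasks MPI tasks."""
    if total_tasks <= 0 or tasks_per_node <= 0:
        raise ValueError("task counts must be positive")
    return math.ceil(total_tasks / tasks_per_node)


if __name__ == "__main__":
    # 64 is the task count from this case; 180 is the count suggested above.
    for total in (64, 180):
        for per_node in (16, 32, 48):
            n = nodes_needed(total, per_node)
            print(f"{total} tasks at {per_node} per node -> {n} node(s)")
```

With the 64 tasks used in this case, a setting of 16 tasks per node spreads the job over twice as many nodes as a setting of 48 would, so more of the MPI traffic has to cross the network instead of staying within a node.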

Finally, I am moving this thread to the Infrastructure forum, as that forum is watched by our porting and performance experts, who might be able to provide better advice than I have, or catch something I have missed.

Hope that helps, and have a good weekend!

Jesse
 

jedwards

CSEG and Liaisons
Staff member
You are running very slowly: 1.1 model days per hour. Increasing the task count will certainly help. There may also be machine-specific improvements that you can make; I suggest discussing this with your HPC support staff.
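As a rough check on what that throughput means for the job in question, the following Python sketch estimates the wall-clock time for one simulated month at 1.1 model days per hour and compares it with the 24:00:00 and 8:00:00 limits set in the original post. The 30-day month and the function names are assumptions made for illustration only.

```python
def hours_for_run(model_days: float, model_days_per_hour: float) -> float:
    """Wall-clock hours needed to simulate model_days at a given throughput."""
    if model_days_per_hour <= 0:
        raise ValueError("throughput must be positive")
    return model_days / model_days_per_hour


def fits_in_walltime(model_days: float, model_days_per_hour: float,
                     walltime_hours: float) -> bool:
    """True if the run is expected to finish within the wall-clock limit."""
    return hours_for_run(model_days, model_days_per_hour) <= walltime_hours


if __name__ == "__main__":
    days_in_month = 30   # assumption: a nominal 30-day month
    throughput = 1.1     # model days per wall-clock hour, from this thread
    needed = hours_for_run(days_in_month, throughput)
    print(f"Estimated wall-clock time: {needed:.1f} hours")
    for limit in (24, 8):  # hours, from the xmlchange settings in the first post
        ok = fits_in_walltime(days_in_month, throughput, limit)
        print(f"Fits in {limit} h limit: {ok}")
```

Under these assumptions a single model month needs more than 24 hours of wall-clock time, which is consistent with the jobs being killed at the time limit rather than actually hanging.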
 

AndrewY

New Member
Thanks jedwards!

It turned out that increasing the task count didn't solve the problem. I have reported this to the support staff and asked for their suggestions. I also noticed that building cases takes much longer on this new machine (1300 - 1600 seconds) than on the one I used before (400 - 500 seconds), even though I used the same compset (F2000climo). Do you think the slow run speed is somehow related to the slow build, or could this help us spot where the actual problem lies?
 

jedwards

CSEG and Liaisons
Staff member
Slow build speed often indicates a problem accessing a compiler license server, or contention for resources on the (often shared) build server.
It is probably not related to the slow run issue.
 