
CESM.EXE runs but does not finish, does not output log files

joe_s

Joe Salamone
New Member
Hello.
I am using CESM 2.1.5 and am having difficulty getting cesm.exe to run as ported to my machine. I have not made any code changes or namelist changes.

I created the case with:

./create_newcase --case f2000climo_testcase --compset F2000climo --res f09_f09_mg17 --mach lake

Based on other threads suggesting that the number of tasks needs to be less than the total number of cores, I set NTASKS=32 and NTASKS_ESP=1 to see whether that would resolve the issue, but it did not. I am running on 8 nodes with 32 cores each, and I made sure to set max tasks per core to 32 in my machine config file (attached).

I ran case.setup, preview_namelists, and check_input_data with the download option, then built the case. The code builds successfully without errors.

I can successfully submit the job to our PBS queuing system (file attached). The job runs on a cluster of Intel Xeon Gold processors, and the PBS job script contains this execution line:

mpirun --report-bindings --bind-to core --map-by socket:PE=1 -n 256 -N 32 bld/cesm.exe

The job runs and uses all of the requested 256 cores. But there are no log files generated in the run directory and the cesm.exe just runs endlessly (I have tried up to 10 hours and it does not finish). There are no errors output in the PBS logfile generated when the job is accepted and run by the queuing system.

If there are build steps I am missing or other porting steps I have omitted please offer recommendations, checks and steps to follow. Any help or support is greatly appreciated.

Thank you
 

joe_s

Joe Salamone
New Member
I also forgot to mention that I set INFO_DBUG = 2 and DEBUG = TRUE before building the case.
 

jedwards

CSEG and Liaisons
Staff member
If there are no log files then it is likely you are hanging in the mpi_init step - can you run a basic hello world mpi program on 256 tasks?
Try running cesm on a single node and work your way up to your goal of 32.
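
One way to follow this suggestion is to step the node count up gradually and note where the run first hangs. The following is a minimal dry-run sketch, assuming 32 cores per node and up to 8 nodes as described in the original post. `./hello_mpi` is a hypothetical name for a compiled MPI hello-world binary, and the script only prints the commands instead of running them.

```shell
#!/bin/sh
# Dry-run sketch: print one mpirun command per step of a scaling test.
# CORES_PER_NODE and the node counts follow the original post (8 nodes x 32 cores).
# ./hello_mpi is a hypothetical MPI hello-world binary; swap in bld/cesm.exe later.
CORES_PER_NODE=32

plan_scaling() {
    exe="$1"
    for nodes in 1 2 4 8; do
        ntasks=$((nodes * CORES_PER_NODE))
        # -n (total tasks) and -N (tasks per node) mirror the flags in the PBS script above.
        echo "mpirun -n $ntasks -N $CORES_PER_NODE $exe"
    done
}

plan_scaling ./hello_mpi
```

Each printed line can be pasted into a PBS job that requests the matching number of nodes. If the hello-world binary succeeds at a node count where cesm.exe hangs, the problem is more likely in the model setup than in the MPI installation.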
 

joe_s

Joe Salamone
New Member
Hi and thank you for responding.
I ran a basic "hello world" MPI Fortran 90 program with 256 tasks and confirmed that all 256 tasks reported back.

I tried running cesm.exe on a single node, and still no success.

Next, I copied 'cesm.exe' from the "bld" directory to the "run" directory. That made the log files appear, and I was able to run on 32 cores and then on 256 cores. In both cases, however, the code runs for only a few minutes, and then I get a new error in the PBS job log file:

forrtl: No such file or directory
forrtl: severe (29): file not found, unit 10, file f2000climo_testcase/run/./timing/checkpoints/model_timing_00010102_00000_stats



Are there some additional porting steps I am missing?

Thanks again for the help.
 

jedwards

CSEG and Liaisons
Staff member
Is it possible that the bld and run directories are not mounted on all of the nodes of your system? I suspect that you got lucky on the run that worked and got a set of nodes that all had the directories mounted. The timing error suggests that there is an IO issue on your system.
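
A quick way to test this is to check the directories from inside a job, on every node. The sketch below is only an illustration: the function name `check_dirs` is hypothetical, and the two directory paths are assumptions based on the paths quoted in this thread.

```shell
#!/bin/sh
# Sketch: report whether each directory exists and is writable on the host running this script.
# Returns non-zero if any directory fails, so the failing node shows up in the job log.
check_dirs() {
    rc_all=0
    for dir in "$@"; do
        if [ -d "$dir" ] && [ -w "$dir" ]; then
            echo "$(hostname): OK $dir"
        else
            echo "$(hostname): MISSING or not writable: $dir" >&2
            rc_all=1
        fi
    done
    return "$rc_all"
}

# Assumed locations, based on the paths mentioned in this thread; adjust to your case.
check_dirs f2000climo_testcase/bld f2000climo_testcase/run || echo "directory check failed"
```

To cover all nodes, the script would need to be started once per node through the job launcher, for example with one task per node. The exact launcher options depend on the MPI library and scheduler in use.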
 

joe_s

Joe Salamone
New Member
Hello,
I had our IT team move the code to a fast mount during the PBS run to check for IO issues, but that did not resolve the problem.

However, a previous post on this forum described a user simply creating the timing/checkpoints directory before execution. After doing that, we saw the log files, but the code ran for only a few minutes and then hung. I then followed another post about reducing NTASKS below the total number of cores. After adjusting the load balance that way for our particular system, the code ran successfully for one simulated month, and we were able to extract data from the resulting h0 history files for plotting.
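
For anyone hitting the same "file not found" error, the directory workaround can be sketched as follows. This is a minimal example, assuming the run directory is the one shown in the error message earlier in this thread. `ensure_timing_dirs` is a hypothetical helper name, and `RUNDIR` should be overridden with the real run directory of your case.

```shell
#!/bin/sh
# Sketch: create the timing directories under the run directory before launching cesm.exe.
# The default RUNDIR is taken from the path in the error message in this thread.
RUNDIR="${RUNDIR:-f2000climo_testcase/run}"

ensure_timing_dirs() {
    # mkdir -p creates missing parent directories and succeeds if the directory already exists.
    mkdir -p "$1/timing/checkpoints"
}

ensure_timing_dirs "$RUNDIR" && echo "ready: $RUNDIR/timing/checkpoints"
```

Placing these lines in the PBS job script just before the mpirun line would make sure the directory exists on every submission.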

So it looks like the code is now ported to our system.

Thanks again for your help here. And, many thanks for all the tremendous work you and your team do on CESM.

Regards,
Joe S.
 

jedwards

CSEG and Liaisons
Staff member
So glad you have it working. From the description it seems that maybe the model is memory bound on your system.
 