
BWSSP245cmip6 f19_g17 hangs on day 2

Zhuyi

Zhuyi Wang
New Member
What version of the code are you using?
CESM2.1.5

Have you made any changes to files in the source tree?
No

Describe every step you took leading up to the problem:
./create_newcase --case b.e21.f19_g17.BWSSP245cmip6 --res f19_g17 --compset BWSSP245cmip6 --run-unsupported
./case.setup
./case.build --skip-provenance-check
./case.submit

If this is a port to a new machine: Please attach any files you added or changed for the machine port (e.g., config_compilers.xml, config_machines.xml, and config_batch.xml) and tell us the compiler version you are using on this machine.
Please attach any log files showing error messages or other useful information.

This is a port to a new machine, and I've attached config_compilers.xml, config_machines.xml, and config_batch.xml.
Compiler version: compiler/intel/2017.5.239

Describe your problem or question:
The model hangs on the second day of the simulation without producing any explicit error messages or updating the log files (attached). The last component that produces output before the hang is GLC. The machine port itself is likely not the issue, since I was able to successfully run the FC2010climo compset.

Also, I intend to run the model at f19_g17 resolution, but I have only been able to locate initial condition files for the BWSSP245cmip6 compset at f09_g17 resolution. Could anyone please advise where I can obtain f19_g17 initial condition files for 2015-01-01?
 

sacks

Bill Sacks
CSEG and Liaisons
Staff member
Thank you for your description and for attaching the log files. One log file I don't see is the cesm log file; there is often information there that can be useful in diagnosing problems. You should have a cesm log file from your case, so I suggest looking through it and attaching it here. (If you don't have one, that could itself indicate an issue on your system.)

Without seeing what's in your cesm log file, there are a few things that I would try initially. The fact that it seems to be hanging at the end of the first day suggests there may be an issue with I/O or the GLC component.
  • This may be an out-of-memory issue. Do you have the flexibility on your system to give the case more processors? If so, I would try that first: double the number of processors, or go even higher (a command sketch for this and the next two items follows the list).
  • Try removing the custom output settings from the user_nl_cice and user_nl_clm files (I don't think user_nl_clm should be responsible, since it looks like it only adds monthly and annual output, not daily, but try it anyway to be safe). If you have made output changes in any other user_nl files, remove those as well.
  • Try removing CISM (the ice sheet model) by creating a case using --compset SSP245_CAM60%WCTS_CLM50%BGC-CROP-CMIP6WACCMDECK_CICE%CMIP6_POP2%ECO%NDEP_MOSART_SGLC_WW3 (this compset long name is the same as the compset you started with, but uses SGLC - i.e., a stub glc model - instead of CISM).
  • If none of these work, then you could try some other, slightly simpler configurations to see what works and what fails. For example, does a B1850 compset work?
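For reference, here is a rough command sketch of the first three suggestions. Treat it as a sketch only: the NTASKS value and the case name are just placeholders, and the compset long name is the one given above.

Code:
# 1) Give the case more processors (the value here is only illustrative;
#    use roughly double your current count or more), then reconfigure and rebuild:
./xmlchange NTASKS=512
./case.setup --reset
./case.build

# 2) Remove custom output by deleting the settings you added to
#    user_nl_cice, user_nl_clm, and any other user_nl_* files.

# 3) Create a test case that swaps CISM for the stub glc model (SGLC);
#    the case name here is just a placeholder:
./create_newcase --case test_BWSSP245_stubGLC --res f19_g17 \
  --compset SSP245_CAM60%WCTS_CLM50%BGC-CROP-CMIP6WACCMDECK_CICE%CMIP6_POP2%ECO%NDEP_MOSART_SGLC_WW3 \
  --run-unsupported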

Regarding initial condition files at this resolution: I think there probably are no out-of-the-box initial condition files for this compset at f19_g17 resolution. This is related to your need to specify --run-unsupported: this is not a scientifically supported configuration at this resolution, so there may be no spun-up initial conditions for it. But I'll reach out to a couple of other people to see if anyone has more information on this.
 

Zhuyi

Zhuyi Wang
New Member
Thank you so much for the detailed suggestions, and sorry I forgot to attach the cesm log in my last message. After reading your reply, I tested three compsets: BWSSP245cmip6, BWSSP245cmip6_stubGLC, and B1850. For all three cases, I increased the processor count to 1280 cores and cleared out all settings in the user_nl_* files (i.e., removed any custom output). However, they all appear to hang at the same point (near the end of day 1).

As you suggested, now I am focusing on the simpler B1850 configuration first. I’ve attached the logs and the case env_*.xml files from my B1850 case. The last logs that updated were ROF and WAV, and then no new log output was written. After waiting ~20 minutes, I cancelled the job, so the cesm log contains: slurmstepd: error: *** STEP 45531117.4 ON h04r4n34 CANCELLED AT 2026-01-13T05:39:36 ***

So at the moment I still can’t tell where the model is getting stuck. Would you mind taking a look at the attached logs and env settings when you have a chance, and letting me know if you see anything suspicious or any additional diagnostics I should try?

What version of the code are you using?
CESM2.1.5
Have you made any changes to files in the source tree?
No
Describe every step you took leading up to the problem:
./create_newcase --case b.e21.f19_g17.B1850_v2 --res f19_g17 --compset B1850
./case.setup
./case.build --skip-provenance-check
./case.submit

Also regarding BWSSP245cmip6 initial conditions at f19_g17, have you heard any updates? If there aren’t any available initial files for f19_g17, I think I could regrid the existing BWSSP245cmip6 initial conditions from f09_g17 to f19_g17 to start the run.
 

Attachments

  • logs_n_env.zip
    303.4 KB

sacks

Bill Sacks
CSEG and Liaisons
Staff member
I'll start with this part:

Also regarding BWSSP245cmip6 initial conditions at f19_g17, have you heard any updates? If there aren’t any available initial files for f19_g17, I think I could regrid the existing BWSSP245cmip6 initial conditions from f09_g17 to f19_g17 to start the run.

No, I haven't heard of any updates about this. I talked to someone else here who thinks it's unlikely that we have initial conditions for this compset at that resolution. I'll post back if I hear from anyone else who knows something different. Yes, to just get the run going, you could regrid the f09 initial conditions to f19. The problem is that these initial conditions won't be in equilibrium (spun-up) with the climate of the model at this resolution, so you will almost certainly experience model drift in your simulation unless you perform sufficient additional spinup yourself.

As for the hang:

First, thank you very much for providing this detailed information!

From the CESM log file, I see that at least a few processors seem to be stuck in an MPI_Waitany call made while exchanging data from the runoff component to the coupler. But I'm not sure what would cause that.

The first thing I'd try is adjusting your processor layout. I doubt this alone will solve the problem, but in looking at your processor layout I saw some unexpected things, so it's worth simplifying it to something more standard - which may also run more efficiently - and there's a chance it will resolve the issue. Specifically, I noticed that your ROOTPE settings don't really make sense together with your NTASKS settings. For now, try just doing ./xmlchange ROOTPE=0 to start all components on processor 0 (then run ./case.setup --reset followed by ./case.build). (This probably isn't the optimal setting for performance, but let's start there to get things working.)

Next I'd suggest the opposite of what I suggested before: since this now seems more likely to be an MPI-related hang than a memory issue, I'd actually suggest decreasing your processor count. If possible, try running on a single node (./xmlchange NTASKS=-1), then run ./case.setup --reset followed by ./case.build. Just try running for a few days that way. It's possible you'll hit memory limits with that, so you may need to increase to 2 or 4 nodes or a little more, but I'm curious whether you get the same hang when running on fewer nodes.
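To make those two layout changes concrete, here is a minimal sketch of the command sequences, run from the case directory:

Code:
# First: start all components on processor 0, then reconfigure and rebuild
# before resubmitting.
./xmlchange ROOTPE=0
./case.setup --reset
./case.build

# Second (separately): shrink the layout. A negative NTASKS value is
# interpreted as a number of nodes, so -1 requests a single node; go to
# -2 or -4 if you hit memory limits.
./xmlchange NTASKS=-1
./case.setup --reset
./case.build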

My next suggestion may be less straightforward, but might be the most likely to resolve the issue: it looks like you're using the Intel MPI (impi) library. Does your system support other MPI libraries like openmpi or mpich? If so, I would try building and running with a different MPI library.
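One way to test this is to create a fresh case with create_newcase's --mpilib option. This is only a sketch: it assumes openmpi is already listed in your machine's MPILIBS entry in config_machines.xml with working module settings, and the case name is just a placeholder.

Code:
# Build the same B1850 test case against openmpi instead of impi
# (requires openmpi to be set up for this machine in config_machines.xml).
./create_newcase --case b.e21.f19_g17.B1850_openmpi --res f19_g17 \
  --compset B1850 --mpilib openmpi
./case.setup
./case.build --skip-provenance-check
./case.submit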

If none of these help, then you could try to pinpoint the problem better by either attaching a debugger to the hung process or (possibly painstakingly) inserting write statements. Attaching a debugger can be an easier way to find the source of a hang, but doing this is system-dependent and I don't have a lot of experience with it myself. For inserting write statements, I'd start around the line in cime_comp_mod.F90 (in cime/src/drivers/mct/main/cime_comp_mod.F90) identified in the cesm log file. Here's the block of code (lines 3077 - 3083):

Code:
          if (iamin_CPLALLROFID) then
             call component_exch(rof, flow='c2x', &
                  infodata=infodata, infodata_string='rof2cpl_run', &
                  mpicom_barrier=mpicom_CPLALLROFID, run_barriers=run_barriers, &
                  timer_barrier='CPL:R2C_BARRIER', timer_comp_exch='CPL:R2C', &
                  timer_map_exch='CPL:r2c_rofr2rofx', timer_infodata_exch='CPL:r2c_infoexch')
          endif

Before that block of code, you could add something like:

Code:
write(logunit,*) iam_GLOID, "About to call component_exch for rof -> cpl"
call shr_sys_flush(logunit)

and then after that block of code add:

Code:
write(logunit,*) iam_GLOID, "Done component_exch for rof -> cpl"
call shr_sys_flush(logunit)

(I'm not positive that iam_GLOID is the right way to get at the unique processor number, but try that.)

Then rebuild and rerun.

I'm expecting that some or all processors are getting stuck in there at some point, so you would see some processors write the "About to call" message but never write the "Done" message. However, I may be wrong... and even if that's right, going further than that might take some painstaking trial and error. So I'd consider this insertion of write statements a last resort.
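Once it hangs with those write statements in place, a quick way to check this is to compare the counts of the two messages in whichever log file they end up in (I'd expect the cesm log; adjust the file name as needed):

Code:
# If the counts differ, some ranks entered the rof -> cpl exchange but
# never returned from it; those ranks are where I'd focus next.
grep -c "About to call component_exch for rof -> cpl" cesm.log.*
grep -c "Done component_exch for rof -> cpl" cesm.log.*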

I also found some possibly useful tips in Port problem: model hang after finishing initialization for B1850, so you may want to look through that thread.
 