model freezes

Hi, all,

I tried to run a B1850 case on my cluster, but CESM2.1.3 seems to freeze and hang for hours.
I did not see any obvious errors in the log file, only some warnings like

(Task 123, block 1) MARBL WARNING (marbl_interior_tendency_mod:compute_large_detritus_prod): dz*DOP_loss_P_bal= 0.598E-013 exceeds Jint_Ptot_thres= 0.271E-013
max rss=2288.2 MB

So I killed the job. I have attached the log file and searched the forum. It seems this might be because POP doesn't get enough memory?
But I can finish the same test run setup with STOP_OPTION = ndays (in the freezing case, I set it to nmonths).

Any hints on how I should proceed? Thanks very much.
 

Attachments

  • cesm.log.txt
    255.9 KB · Views: 10

fischer

CSEG and Liaisons
Staff member
Hi,

When you switched STOP_OPTION from nmonths to ndays, did you also change STOP_N to get the same run length? Whenever a run crashes on
me, without any useful log information, it's usually a memory issue. There are a couple of separate things you can try. One is to run with more nodes, if possible.
Another is to turn on debugging with ./xmlchange DEBUG=TRUE, then rebuild the model. This might give a better idea of where the model is freezing. You can also
look at the cpl.log to see where the model is hanging. Is it hanging during initialization, or is it hanging after it has run a certain number of timesteps?

Chris
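
One way to answer the "during initialization or after some timesteps?" question is to look for the last model date the coupler reported before the hang. The following is only a rough sketch: it assumes the cpl.log contains lines with a "model date = YYYYMMDD seconds" stamp, which the thread does not confirm, and the sample lines and the function name last_model_date are made up for illustration.

```python
import re

# ASSUMPTION: cpl.log progress lines contain "model date = YYYYMMDD SSSSS".
# Adjust the pattern if your log uses a different layout.
MODEL_DATE = re.compile(r"model date\s*=\s*(\d{8})\s+(\d+)")


def last_model_date(lines):
    """Return (yyyymmdd, seconds) from the last matching line, or None."""
    last = None
    for line in lines:
        match = MODEL_DATE.search(line)
        if match:
            last = (match.group(1), int(match.group(2)))
    return last


# Hypothetical sample lines, not taken from a real log.
sample = [
    "some initialization message",
    "tStamp_write: model date =   00010115       0 wall clock = ...",
    "tStamp_write: model date =   00010116       0 wall clock = ...",
]

print(last_model_date(sample))
```

If the function returns None for a real log, either no time step was completed (a hang during initialization) or the assumed line format does not match. To use it on a file, pass the open file object, for example last_model_date(open("cpl.log")), where the path is a placeholder.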
 
Thanks for the reply, Chris,

I already turned on the debug option, and the model hangs 15 days after initialization. I also changed STOP_N to 12.
I already used 216 tasks for POP. I will try to increase it. (I didn't know POP requires that many tasks, emm...)
 

QINKONG

QINQIN KONG
Member
Hi Chris. I think I just encountered the same problem of ocn hanging at initialization. I will try to increase the number of tasks for the ocn component. But I wonder what you mean by "Whenever a run crashes on me, without any useful log information, it's usually a memory issue". What kind of memory issue? Can you explain it a little more?

Also, the official documentation for a previous CESM version (https://www.cesm.ucar.edu/models/cesm1.2/clm/models/lnd/clm/doc/UsersGuide/x13571.html) suggests switching to serial mode (set NTASK=1 for all components) to identify potential problems by ruling out multi-processing issues first. I wonder, if ocn or other components do need lots of memory, how can the model even work with NTASK=1 for all components?

Thanks!
-Qin
 

fischer

CSEG and Liaisons
Staff member
The memory issue I'm referring to is running out of memory. The document you linked to is just for the land model, and an older
version of the land model at that, so it does not apply to the ocean model.

Chris
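
As a rough way to see how much memory the run was using before it hung, the "max rss=... MB" lines quoted in the first post can be collected from the log. This is a minimal sketch that assumes every such line has the same shape as the one quoted in the thread. The helper name peak_rss_mb and the second sample line are hypothetical, and the thread does not say which task or component prints these lines.

```python
import re

# Pattern based on the line quoted in the thread: "max rss=2288.2 MB".
MAX_RSS = re.compile(r"max rss\s*=\s*([0-9]+(?:\.[0-9]+)?)\s*MB")


def peak_rss_mb(lines):
    """Return the largest 'max rss' value (in MB) found, or None if absent."""
    values = [
        float(match.group(1))
        for line in lines
        for match in MAX_RSS.finditer(line)
    ]
    return max(values) if values else None


sample = [
    "max rss=2288.2 MB",  # line quoted in the thread
    "max rss=1907.5 MB",  # hypothetical extra line for illustration
]

print(peak_rss_mb(sample))
```

A large value on its own does not prove the run ran out of memory. It is only useful when compared against the memory available per task on the cluster's nodes.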
 

QINKONG

QINQIN KONG
Member
Hi Chris, thanks for the reply. Do we have a general idea of the memory requirement of each component, for example for the B1850 compset? Which components tend to need more memory?
 