Stack overflow?

ssmith

Sergio Smith
New Member
Hello,

I'm trying to run the double gyre setup with my own bathymetry. The topo_config and related parameters are the only options I have changed from the original double_gyre experiment. I get the following error:

"FATAL from PE 0: MPP_START_UPDATE_DOMAINS: mpp_domains_stack overflow, call mpp_domains_set_stack_size( 34576) from all PEs."

I notice that I am using far more grid points (~5000x3000) than the default example (~40x40). Could I be exhausting the available memory (32 GB) this way? If so, is there a fix other than using a smaller region or a lower resolution? Roughly how many grid points should I use with MOM6?

Any help is appreciated, thanks in advance.
 
How many cores are you using? You could try an intermediate grid size to confirm that this is indeed the problem. I've heard that the usual advice is a certain number of 3-D grid points per core, rather than a maximum total number of grid points. I don't know what that number is, though.
 

ssmith

Sergio Smith
New Member
Thanks for your reply. I'm using 4 cores. I've just now been testing it on a smaller grid (480x480) which I think is about the same size as some of the other examples. I get a similar error but with a different value for mpp_domains_set_stack_size(#). I'm attaching a screenshot of the command and output. I was focused on the FATAL messages below, but it looks like there are problems allocating memory early on, or something?
 

Attachments

  • Screenshot from 2021-03-21 21-01-09.png

marshallward

Marshall Ward
New Member
Often this MPP stack will be automatically resized, but there are a few instances where it is not handled.

You can manually increase this parameter in the namelist (input.nml), something like this:
Code:
&fms_nml
    domains_stack_size=250000
/
I don't know what the number needs to be in your case; you'll have to experiment with it a bit.
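One possible starting point is the number printed in the FATAL message itself. The Python sketch below pulls that number out of the error text quoted earlier in the thread and formats an &fms_nml block from it. It assumes the number in the message is the size FMS is asking for. The helper names and the headroom factor of 2 are hypothetical choices for illustration, not part of FMS or MOM6.

```python
import re

# Error text copied from the first post in this thread.
message = (
    "FATAL from PE 0: MPP_START_UPDATE_DOMAINS: mpp_domains_stack overflow, "
    "call mpp_domains_set_stack_size( 34576) from all PEs."
)


def requested_stack_size(text):
    """Return the integer inside mpp_domains_set_stack_size(...), or None."""
    match = re.search(r"mpp_domains_set_stack_size\(\s*(\d+)\s*\)", text)
    return int(match.group(1)) if match else None


def namelist_fragment(size, headroom=2.0):
    """Format an &fms_nml block; headroom is an arbitrary safety factor (assumption)."""
    padded = int(size * headroom)
    return f"&fms_nml\n    domains_stack_size = {padded}\n/"


size = requested_stack_size(message)
print(namelist_fragment(size))
```

If the run fails again with a larger requested value, the same steps can be repeated with the new message.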

Generally this only happens when domains are very large, as in your examples. When the size gets beyond ~100x100 points, you should probably parallelize your run more aggressively.
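To put rough numbers on that, here is a small Python sketch comparing the grids mentioned in this thread when run on 4 cores. It reads the ~100x100 figure as a target for the horizontal domain held by each PE, which is an assumption. It also ignores vertical layers, halos, and land masking, so treat the output as an order-of-magnitude guide only. The function names are made up for this example.

```python
import math


def points_per_pe(nx, ny, n_pes):
    """Average number of horizontal grid points handled by each PE."""
    return nx * ny / n_pes


def pes_for_target(nx, ny, target=100 * 100):
    """Smallest PE count that keeps the average per-PE domain at or below target."""
    return math.ceil(nx * ny / target)


# Grid sizes quoted in the thread: default example, test grid, full grid.
for nx, ny in [(40, 40), (480, 480), (5000, 3000)]:
    per_pe = points_per_pe(nx, ny, n_pes=4)
    needed = pes_for_target(nx, ny)
    print(f"{nx}x{ny}: {per_pe:.0f} points per PE on 4 PEs, "
          f"about {needed} PEs for ~100x100 per PE")
```

On 4 cores, the 480x480 grid comes out at 57,600 points per PE and the 5000x3000 grid at 3,750,000, both well above the ~10,000 implied by a 100x100 domain.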

---

I can't really comment on the UCX error, but you may also want to increase the stack size of the process (e.g. `ulimit -s unlimited` in bash, but check your shell). Many compilers (e.g. Intel) will aggressively use up process stack when heavily optimized.
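If you want to check which stack limit a process actually sees, a minimal Python sketch using the standard `resource` module (Unix only) is shown below. It reports the limits inherited from the shell that launched it, so it only reflects the model's environment if the model is started from that same shell. The `fmt` helper is just for display.

```python
import resource


def fmt(limit):
    """Render a limit given in bytes as KiB, or 'unlimited' if no limit is set."""
    if limit == resource.RLIM_INFINITY:
        return "unlimited"
    return f"{limit // 1024} KiB"


# getrlimit returns the (soft, hard) limits in bytes for the current process.
soft, hard = resource.getrlimit(resource.RLIMIT_STACK)
print(f"stack soft limit: {fmt(soft)}")
print(f"stack hard limit: {fmt(hard)}")
```

After running `ulimit -s unlimited` in bash, the soft limit reported here should read "unlimited", provided the hard limit allows it.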