
Running CESM2.1.5 on Stampede 3

xiangli

Xiang Li
Member
The error refers you to a file - config.log in the mct directory. Did you examine that file? I am working on stampede3 today,
I'll let you know when a port is ready.
Hi Jim,

Yes, I checked the config.log, but did not find anything significant.

Great that the port for stampede3 will be ready! Looking forward to your message!

Thanks,
Xiang

Here is the config.log:

[config.log screenshots attached]
 

xiangli

Xiang Li
Member
The flag -zmuldefs is no longer needed or supported, remove it from config_compilers.xml
Hi Jim,

I removed this flag and tried again, but there seems to be another issue; config.log is attached. It's not a big deal if an official port for stampede3 is on the way.

Thanks,
Xiang

Here is the output of my test:

[test output screenshots attached]
 

Attachments

  • config.txt
    98.1 KB · Views: 0

jedwards

CSEG and Liaisons
Staff member
I now have a port ready for you to try. To get it, do the following:
cd cesm/cime
git fetch origin
git checkout cime5.6.50
 

xiangli

Xiang Li
Member
I now have a port ready for you to try - to get it do the following.
cd cesm/cime
git fetch origin
git checkout cime5.6.50
Hi Jim,

Great to know! My test for a B1850 f19_g17 case was successful! Preliminary results show that it's running at about 5 minutes per model month - very efficient!

I also created a "B2000" case by using the long name "2000_CAM60_CLM50%BGC-CROP_CICE_POP2%ECO%ABIO-DIC_MOSART_CISM2%NOEVOLVE_WW3_BGC%BDRD", which I adapted from the long name of B1850. This case is also running!
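For readers following along, a case built from a long compset name like the B2000 one above can be created roughly as follows (a sketch: the case name and the --res/--machine values are illustrative, and --run-unsupported may be required for a compset/grid combination that is not scientifically supported):

```shell
cd cesm/cime/scripts
./create_newcase --case b2000_test \
    --compset 2000_CAM60_CLM50%BGC-CROP_CICE_POP2%ECO%ABIO-DIC_MOSART_CISM2%NOEVOLVE_WW3_BGC%BDRD \
    --res f19_g17 --machine stampede3 --run-unsupported
```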

Thanks so much for your efforts in porting on stampede3!

Best,
Xiang
 

harukihirasawa

Haruki Hirasawa
New Member
Hello Jim,

I am also preparing to run CESM2.1.5 on Stampede3. I successfully ran a BSSP245 simulation using the port you provided, but noticed that the PEs for the ice model seem to idle while waiting for the land model. I attempted to rebalance the model and reduce the number of nodes; however, the model now throws a segmentation fault during the first ocean model time step (after completing 5 land/atmosphere time steps). This occurs for both BSSP245 and B1850.

I don't know why changing the PES layout would cause the model to crash, so I'd appreciate any advice!

Thank you,
Haruki

Attached is the cesm.log file. Note it crashes at baroclinic_mp_bar in baroclinic.F90

Relevant xmlchange commands (in the order they were run):

./xmlchange NTASKS_ATM=156
./xmlchange NTASKS_OCN=12
./xmlchange NTASKS_LND=154
./xmlchange NTASKS_ROF=154
./xmlchange NTASKS_ICE=2
./xmlchange NTASKS_GLC=156
./xmlchange ROOTPE_OCN=156
./xmlchange ROOTPE_ICE=154
./xmlchange NTASKS_WAV=28
./xmlchange NTASKS_GLC=28
./xmlchange NTASKS_CPL=156
 

Attachments

  • cesm.log.837880.240806-150455.txt
    798.2 KB · Views: 3

jedwards

CSEG and Liaisons
Staff member
Maybe you should go back to the original pelayout and try rebalancing again - it may be important that
each model component use an integer multiple of the number of tasks per node.
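The alignment check suggested here can be sketched in a few lines of shell. TASKS_PER_NODE=48 is an assumption (adjust it to the node size of your Stampede3 queue); the NTASKS values are the ones from the xmlchange log in the post above.

```shell
#!/bin/sh
# Flag any component task count that is not an integer multiple of the
# tasks per node (48 is assumed here; adjust for your machine/queue).
TASKS_PER_NODE=48

aligned() {
    # succeeds (exit 0) when $1 is a multiple of TASKS_PER_NODE
    [ $(( $1 % TASKS_PER_NODE )) -eq 0 ]
}

for ntasks in 156 154 28 12 2; do
    if aligned "$ntasks"; then
        echo "NTASKS=$ntasks: multiple of $TASKS_PER_NODE"
    else
        echo "NTASKS=$ntasks: NOT a multiple of $TASKS_PER_NODE"
    fi
done
```

Run as written, every count from the failing layout is flagged, which is consistent with the suggestion that misalignment with node boundaries may have mattered.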
 

harukihirasawa

Haruki Hirasawa
New Member
Hi Jim,

Thanks for the reply! I tested a few other configurations (all with POP given a single node), and eventually the model produced the error message:

POP Exiting...
POP_HaloUpdate4DR8: error allocating buffers
step: error updating halo for TRACER

which indicated I needed more memory for POP. I added a few PEs from a second node, similar to your original configuration, and the model completed successfully. I landed on the following layout:
component    comp_pes  root_pe  tasks x threads  instances (stride)
---------    --------  -------  ---------------  ------------------
cpl = cpl        1560        0        390 x 4         1 (1 )
atm = cam        1560        0        390 x 4         1 (1 )
lnd = clm        1344        0        336 x 4         1 (1 )
ice = cice        216      336         54 x 4         1 (1 )
ocn = pop         120      390         30 x 4         1 (1 )
rof = mosart     1456        0        364 x 4         1 (1 )
glc = cism       1560        0        390 x 4         1 (1 )
wav = ww         1560        0        390 x 4         1 (1 )
esp = sesp          4        0          1 x 4         1 (1 )
This gets 11.20 model years/wall-day for B1850 at f09_g17, versus 6.43 originally. The ocean model still spends some time waiting, though, so this could probably be faster if I could get it to run on a single node or less. Unfortunately I don't have the knowledge to configure that.
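For reference, a layout like the one in the table above can be set with xmlchange from the case directory. This is a sketch using the standard CIME NTASKS/NTHRDS/ROOTPE variables (verify the result against your case, e.g. with ./pelayout, before building):

```shell
cd $CASEDIR   # your case directory (illustrative)
./xmlchange NTASKS_CPL=390,NTASKS_ATM=390,NTASKS_GLC=390,NTASKS_WAV=390
./xmlchange NTASKS_LND=336,NTASKS_ROF=364
./xmlchange NTASKS_ICE=54,ROOTPE_ICE=336
./xmlchange NTASKS_OCN=30,ROOTPE_OCN=390
./xmlchange NTHRDS=4   # 4 OpenMP threads for every component
```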

Haruki
 