CESM2.1.4 BHIST_f09g17 hangs at ATM initialization (creating gsmap_cx)

xli585

xuezhu
New Member
When running a BHIST_f09g17 case with CESM2.1.4, the run stalls for many minutes early in atmosphere initialization. The log output stops around species setup and block redistribution, and the model never gets past creating gsmap_cx for atm.
No error messages are printed; the program simply appears to stall.
The configuration is:
NTASKS: ['CPL:448', 'ATM:448', 'LND:224', 'ICE:224', 'OCN:56', 'ROF:224', 'GLC:448', 'WAV:448', 'ESP:1']
TOTALPES: 504
NTHRDS: ['CPL:1', 'ATM:1', 'LND:1', 'ICE:1', 'OCN:1', 'ROF:1', 'GLC:1', 'WAV:1', 'ESP:1']
[Attached image: 1772116744257.png]
ATM log or screen shows last printed species during initialization:
39 so4_a3&IC kg/kg 32 I so4_a3
40 soa_a1&IC kg/kg 32 I soa_a1
41 soa_a2&IC kg/kg 32 I soa_a2

CESM output shows:
1 IMOD, NAPROC, NBLKRS, NSPEC, RSBLKS = 1 448 0 600 0
2 IMOD, NAPROC, NBLKRS, NSPEC, RSBLKS = 1 448 2 600 5

The cluster has 22 nodes with 56 cores each, and my case uses 9 nodes. The program stalls at the MCT coupling grid creation (creating gsmap_cx for atm), specifically during ATM chemical initialization. No MPI or system errors are printed, and Slurm shows the job in a normal RUNNING state. I haven't modified the run configuration or the namelists; I only set the run length to one month.

What do I need to set up to make it run successfully?
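For anyone checking the layout arithmetic, here is a minimal Python sketch of how the reported TOTALPES and the 9-node figure could fit together. The NTASKS values are copied from the post. The ROOTPE values are not given in the post, so the ones below are assumptions (OCN placed after the 448 tasks shared by the other components), and 56 is read as cores per node.

```python
import math

# Values copied from the post.
NTASKS = {"CPL": 448, "ATM": 448, "LND": 224, "ICE": 224, "OCN": 56,
          "ROF": 224, "GLC": 448, "WAV": 448, "ESP": 1}
CORES_PER_NODE = 56  # read as cores per node

# ASSUMPTION: ROOTPE is not listed in the post. OCN is assumed to start
# at task 448, and every other component is assumed to start at task 0.
ROOTPE = {name: 0 for name in NTASKS}
ROOTPE["OCN"] = 448

def total_pes(ntasks, rootpe):
    """Highest MPI task index used by any component, plus one."""
    return max(rootpe[name] + ntasks[name] for name in ntasks)

total = total_pes(NTASKS, ROOTPE)
nodes = math.ceil(total / CORES_PER_NODE)
print(f"TOTALPES={total}, nodes={nodes}")
```

Under these assumptions the result matches the reported TOTALPES of 504 and the 9 nodes in use, which suggests the layout itself is consistent and the stall has another cause.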
 

Attachments

  • log.zip
    103.6 KB · Views: 1

fischer

CSEG and Liaisons
Staff member
My best guess is you might be running out of memory. The first thing you should do is update to CESM2.1.5.

Here are some things you can try.

Try running BHIST_f19g17; this will use less memory.

Then try running your BHIST_f09g17 case with debug turned on.

You can also try FHIST_f09g17. With this case you can experiment with increasing the number of nodes used for the atmosphere until, hopefully, it works.

Our runs use the same node counts as yours, except that we use 2 nodes for WAV. The nodes on our system have 256GB of memory. Do you know how much memory each of your nodes has?

Thanks
Chris
 

xli585

xuezhu
New Member
Hello! I checked, and each node has 256GB of memory. I have tried both BHIST_f19g17 and F2000climo_f09f09, and both ran successfully. I will try your other suggestions to troubleshoot the issue. Thank you very much.

Best,
Xuezhu
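As a rough sanity check on the memory question, the sketch below estimates the average memory available per MPI task from the numbers in this thread (256GB per node, 56 taken as cores per node, 448 ATM tasks). It assumes every core on a node runs one task and says nothing about how much memory the model actually needs, so treat it as back-of-the-envelope arithmetic only.

```python
import math

NODE_MEMORY_GB = 256   # per-node memory reported in this reply
CORES_PER_NODE = 56    # from the original post, read as cores per node
ATM_TASKS = 448        # NTASKS for ATM in the original post

def memory_per_task_gb(node_memory_gb, tasks_per_node):
    """Average memory available to each MPI task on a fully used node."""
    return node_memory_gb / tasks_per_node

# ASSUMPTION: one MPI task per core, nodes fully populated.
atm_nodes = math.ceil(ATM_TASKS / CORES_PER_NODE)
per_task = memory_per_task_gb(NODE_MEMORY_GB, CORES_PER_NODE)
print(f"ATM spans {atm_nodes} nodes, about {per_task:.2f} GB per task")
```

Placing fewer tasks on each node would raise the per-task figure proportionally, which is one way to test whether memory is the limiting factor.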
 