Running out of memory on Yellowstone? (ERROR: 0031-161 EOF on socket connection with node ys0418-ib)

Hi,
I was trying to run CESM1.2 with the FC5/ne240 configuration on Yellowstone. However, the model stopped running after initialization, with these errors in the ccsm.log file:
=====
2786: QNEG3 from vertical diffusion/SO2:m=  8 lat/lchnk= 440771 Min. mixing ratio violated at    3 points.  Reset to  1.0E-36 Worst =-2.1E-12 at i,k=   8 30
ERROR: 0031-161  EOF on socket connection with node ys0418-ib
INFO: 0031-639  Exit status from pm_respond = -1
ERROR: 0031-028  pm_mgr_handle; can't send a signal message to remote nodes
ERROR: 0031-619  No such file or directory
ERROR: 0031-028  pm_mgr_handle; can't send a signal message to remote nodes
ERROR: 0031-619  No such file or directory
INFO: 0031-029  Caught signal 2 (Interrupt), sending to tasks...
ERROR: 0031-028  pm_mgr_handle; can't send a signal message to remote nodes
ERROR: 0031-619  No such file or directory
=====
(The file is at: /glade/scratch/cytsai/Testv2/run/cesm.log.160220-010029)
What does this error message mean? Did I run out of memory? I'm using -mach yellowstone and assumed the configuration works out of the box, so I didn't change the PE layout.
P.S. I also set my wall clock to 12:00, so that shouldn't be the issue here.
Any thoughts would be appreciated! Thank you.
 

jedwards

CSEG and Liaisons
Staff member
Yes, this message indicates that you exceeded memory on ys0418:
EOF on socket connection with node ys0418-ib
We don't have much experience running at ne240 resolution, and I would expect that the PE layout has not been tuned. You might consider trying a case at ne120 first and moving to ne240 only after resolving any issues you run into there.
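When a log is long, it can help to pull out which nodes reported the lost connection. The following is only a minimal sketch: the function name `nodes_with_lost_connection` and the sample text are illustrative, and the regular expression is written against the message format quoted in this thread. It finds the 0031-161 lines; reading them as an out-of-memory symptom follows the reply above and is not something the script itself can confirm.

```python
import re
from collections import Counter

# Matches POE lines of the form quoted in this thread, for example:
#   ERROR: 0031-161  EOF on socket connection with node ys0418-ib
EOF_PATTERN = re.compile(
    r"ERROR:\s+0031-161\s+EOF on socket connection with node\s+(\S+)"
)


def nodes_with_lost_connection(log_text):
    """Return a Counter mapping node name -> number of 0031-161 messages."""
    return Counter(match.group(1) for match in EOF_PATTERN.finditer(log_text))


# Sample text copied from the log excerpt in the original post.
sample = """\
ERROR: 0031-161  EOF on socket connection with node ys0418-ib
INFO: 0031-639  Exit status from pm_respond = -1
ERROR: 0031-028  pm_mgr_handle; can't send a signal message to remote nodes
"""

if __name__ == "__main__":
    for node, count in nodes_with_lost_connection(sample).items():
        print(node, count)
```

To use it on a real run, read the cesm.log file into a string and pass it to the function in place of `sample`.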
 

Hi Jedwards,
Thank you for your reply. I have tried a case with ne120_f09_g16, but I noticed that the default PE layout for this case is larger than I can afford (e.g., for ne120_ne120, the total PE count is 16384, as indicated at http://www.cesm.ucar.edu/models/cesm1.2/timing/).
So I tried to run with 32 processors, but I still got a similar error. Do you know why I ran out of memory? Or do you have other suggestions for a PE layout that uses fewer processors?
====
   1: Opened existing file
   1: /glade/p/cesmdata/cseg/inputdata/atm/cam/inic/homme/cami-mam3_0000-01-ne120np4_
   1: L30_c110928.nc           0
   1: Opened existing file
   1: /glade/p/cesmdata/cseg/inputdata/atm/cam/topo/USGS-gtopo30_ne120np4_16xdel2-PFC
   1: -consistentSGH.nc           1
   0:
ERROR: 0031-161  EOF on socket connection with node ys0927-ib
INFO: 0031-639  Exit status from pm_respond = -1
ERROR: 0031-028  pm_mgr_handle; can't send a signal message to remote nodes
ERROR: 0031-619  No such file or directory
INFO: 0031-029  Caught signal 2 (Interrupt), sending to tasks...
ERROR: 0031-028  pm_mgr_handle; can't send a signal message to remote nodes
ERROR: 0031-619  No such file or directory
====
P.S. My ccsm.log file is at: /glade/scratch/cytsai/Test_ne120_f09_g16_v1/run/cesm.log.160220-145231
My env_mach_pes file is at: /glade/scratch/cytsai/CESM/CESM_exp/GrISTopo/Test/Test_ne120_f09_g16_v1/env_mach_pes.xml
Thank you.
 

jedwards

CSEG and Liaisons
Staff member
If you want to run CAM5 at high resolution, you need to be prepared to pay for it. I think that an ne30 resolution will run on 32 tasks, but for ne120 you might try 320 tasks, and for ne240 you will probably need 1000+ tasks.
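To put those task counts in terms of nodes and memory, here is a rough sketch of the arithmetic. It assumes 16 cores and a nominal 32 GB of memory per Yellowstone node and one MPI task per core; those figures and the function name `layout_summary` are assumptions for illustration, not something stated in this thread, and the usable memory per node is somewhat less than the nominal value. The ne240 entry uses 1000 tasks only as the lower bound of "1000+".

```python
import math

CORES_PER_NODE = 16    # assumption: cores per Yellowstone node
MEM_PER_NODE_GB = 32   # assumption: nominal memory per node, in GB


def layout_summary(ntasks, tasks_per_node=CORES_PER_NODE):
    """Return node count and nominal aggregate memory for a task count."""
    nodes = math.ceil(ntasks / tasks_per_node)
    return {
        "ntasks": ntasks,
        "nodes": nodes,
        "total_mem_gb": nodes * MEM_PER_NODE_GB,
    }


# Task counts suggested in the reply above (ne240 is a lower bound).
suggested = [("ne30", 32), ("ne120", 320), ("ne240", 1000)]

if __name__ == "__main__":
    for resolution, ntasks in suggested:
        print(resolution, layout_summary(ntasks))
```

Under these assumptions, 32 tasks occupy 2 nodes with about 64 GB in total, while 320 tasks spread the same model over 20 nodes with about 640 GB, which is one plausible reason an ne120 case that fails on 32 tasks might fit on 320.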
 
